
Jordan Fisher — Skipping the Line with Autonomous Checkout

Jordan explains how Standard AI uses machine learning to track products and customers in challenging retail environments


About this episode

Jordan Fisher is the CEO and co-founder of Standard AI, an autonomous checkout company that’s pushing the boundaries of computer vision.
In this episode, Jordan discusses “the Wild West” of the MLOps stack and tells Lukas why Rust beats Python. He also explains why AutoML shouldn't be overlooked and uses a bag of chips to help explain the Manifold Hypothesis.

Timestamps

00:00 Intro
57:35 Outro

Transcript

Note: Transcriptions are provided by a third-party service, and may contain some inaccuracies. Please submit any corrections to riley@wandb.com. Thank you!

Intro

Jordan:
Throw the data at it, get an AutoML baseline and then see if you can do better. Maybe you'll be surprised, maybe you can't, right? But your job is not to build models. Your job is to have a business impact.
Lukas:
You're listening to Gradient Dissent, a show about machine learning in the real world. I'm your host, Lukas Biewald.
Today I'm talking to Jordan Fisher, who is the CEO and co-founder of Standard AI. Standard AI is an autonomous checkout company that actually has autonomous checkout working. So this is an amazing conversation with someone who has big deep learning models really working in the real world in real conditions. It's a very informative conversation.

The origins of Standard AI

Jordan, it's good to talk to you. It's been a long time.
Jordan:
Yeah.
Lukas:
I thought it might be good to start by saying a little bit about what Standard AI is, for folks who don't know.
Jordan:
Yeah. So I'll give you the quick spiel on Standard. We build computer vision powered checkout for retail. Probably a lot of people have heard of Amazon Go. "Go for everyone else" is the way we think about ourselves, and in particular, we're trying to make it easy to just pick it up and put it into existing stores.
It's a camera-only system that we install onto the ceilings of existing stores. And then we do all this magic behind the scenes to ultimately figure out what people have, so they can get on with their day, skip the line, and get their receipt automatically.
Lukas:
And I guess what's the founding story?
Jordan:
There's two pieces to this. Piece one is just, I despise lines.
Lukas:
Nice.
Jordan:
I mean, life is short. I struggle seeing just wasted human capital, right? We spend literally billions of hours waiting in line every year. It's pretty staggering when you stack it all up at once and just look at it, that amount of human capital, just literally incinerated.
And we're doing it just for the sake of commerce, right? We're just waiting in line doing nothing. And there's another person on the other side, just waiting for us to get there, to then do this transaction. It's the most mind boggling thing. So that sucks. Life is short and we shouldn't spend it waiting in line. So I think that's the obvious piece.
But the real sort of inception of Standard was more tech-driven. We had a really cool tech team that I was working with closely at the SEC and a few folks who were sort of in the surrounding industry. This was going back six-ish years now, and we just saw this revolution happening in ML, obviously, but really computer vision at the time. I felt it was having this moment where it was just so clear that everything was about to change, that you could suddenly reach human parity, almost, in some of these tasks.
And, "Wow, if that's true, if humans and machines can sort of see equally well, what does that mean for the world?" It should change basically everything, right? Every industry where you can put a camera and see stuff should get revolutionized by this coming revolution. That was our really strong conviction.
We didn't have a particular product in mind. We just sat down and said, okay, we're going to... Actually, we weren't computer vision experts. We were just ML experts. We were building ML for the SEC. And we said, "Okay, well, we're just going to retool."
We sat down and for about a year, we just read every computer vision research paper that was coming out. We had one business guy, a really good old colleague of mine, who was sitting in on the brainstorming sessions and was just helping us do the market analysis, size the TAM, and kind of figure out what was what, and we had a bunch of really dumb ideas, which I'll tell you about only over a beer sometime.
Lukas:
Oh come on, tell me one right now, I want to hear one.
Jordan:
Actually, one of them is starting to become more real in a more home setting.
I was really passionate about this idea, which was smart gyms and this idea of...because we were pretty sold already on overhead cameras being the modality that we wanted. We're like, "Where can you just put overhead cameras and enable a cool experience? Well, gyms, right?"
Just put the cameras up, and then you put your AirPod in your ear, and you should be able to just walk into a gym, and your synthetic personal trainer just starts talking to you. It knows what you've done in the past. It starts counting your reps. It pushes you to do one more rep or go for the 15 pound instead of the 12.5 pound. And it takes all the drudgery out of doing the self tracking, which no one wants to do and can also kind of push you to do more. And then help you with proper form, et cetera.
We're starting to see this now in at-home gyms where there can be a camera that will help coach you, but I don't think we've seen it yet in the full gym setting.
Lukas:
But then you decided to do this checkout-less store idea. Was it obvious that was the best idea? How did you come to it? How did you validate it?
Jordan:
It was super obvious. Because when you start running the numbers, it's just wild, how big the opportunity is. And it's only gotten bigger since then. Checkout on its own is huge. Like I said, it's literally billions of hours a year across the world. It's just massive. And we spend hundreds of billions of dollars to make that happen.
There's so much else that needs to happen in stores, right? That human capital has an insane number of things that we need from it in these stores. Stocking the store and customer support and refacing, et cetera. We hear that now that we're talking to retailers, they want all these things.
Of course they want autonomous checkout because they don't want a line either and they want to do the best thing for their shoppers, but they also just can't even staff their stores effectively right now. There are retailers that have 10,000 to 20,000-plus open positions right now.
Our product teams especially go and interview not just the retailers, but employees in the store. And they get super excited, as much as anyone else about autonomous checkout. Because they're like, "Oh, if we have this in my store, I get to go do all the things that I want to do," right? I get to go interact with customers.
I have my locals, my regulars, that I really like talking to. I can finally have a spare minute to go fix the out-of-stocks that are actually hurting the bottom line of the store, because someone walks in and wants their Snickers bar and we haven't had a chance to restock it. That's a massive hit to retail.
So really, everyone's super excited about it. All these industries — these retail tech industries — can be done better with computer vision powering them, right? It's inventory and out of stock and loss prevention and insights and analytics. And then it's also checkout, right?
We started pulling back all the layers of this onion. We're like, this is really going to just change the entire $25 trillion physical retail industry. Every aspect of it gets better once you have a smart system in the store. So we were just like, "This is insane."
That was one of our metrics, obviously. A huge TAM. Another one of our metrics was we wanted it to be a really hard tech problem. Just from a personal satisfaction place, we love working on hard tech. But then my personal preference is working in super rarefied industries where there's a huge barrier to entry from a technical challenge perspective, because it rarefies the field, right? There's only a handful of teams that are going to be competing with you.
It was kind of that sweet spot of, "This is really hard, but it's not quite as hard as autonomous vehicles, where we're going to be bashing our head against this for a decade and probably need to go raise a trillion dollars to compete with Waymo." It hit all that, hit the sweet spot of all those things.
Lukas:
Is the challenge to actually see when someone takes something off the shelf? Is that the challenge, or is there a point where you sort of show it to a camera and then check out? How does the experience work?
Jordan:
Yeah. From the experience perspective, we're really trying to make it just completely seamless. You forget that you're doing this, that you're shopping. The goal is to make it feel like it's your personal pantry.
Just walk in, grab stuff, put it in your jacket, put it in your pocket, put it in your purse. You no longer think about transacting. We're hoping that it sort of does to shopping what Uber and Lyft did to taxis. You're still transacting, but you don't think about that transaction moment anymore. You're just hitting a button. A car shows up, you get in, you get out.
It's just so seamless that now you take a Lyft more than you take a taxi, right? You're growing the pie and that's what we're really hoping to do with retail.

Getting Standard into stores

Lukas:
What do you do on day one when you decide to make a company that does this? What was the next step? Did you go talk to stores and try to get them to let you install cameras and run ML models? How does it work?
Jordan:
We did. Yeah, we did actually.
We actually had a pretty big co-founding team, but Michael — who's our chief business officer, our "business guy" — and I were in New York at the time, and we hadn't quit our jobs yet, so we didn't have enough conviction, but that came shortly thereafter.
But yeah, we just walked around Williamsburg in Brooklyn and just started talking to store owners. It was Saturday...the very first thing we learned about retail was you don't bother retailers on a Saturday because that's the most important day of the week for them to sell stuff. We were just going in talking to store managers and they're like, "This sounds cool, I guess, but you need to get out of my store right now. I got stuff to sell."
So we started coming back on Mondays and Tuesdays. I mean, you go talk to retailers anywhere — even five years ago, six years ago — and it was already clear that everyone wanted this, right? We were super lucky. Once our name just got out, once we incorporated and put out even just a little bit of videos of what we were doing, we just got insane inbound from basically all of the retailers in the world.
It was small mom-and-pops all the way up to mega Fortune 10 companies. That's how we knew that if we can build this — at the time, it was not clear that we'd succeed — but if we could build this, then yes, there is this ridiculous amazing demand at the end of the rainbow.
Lukas:
And so where are you at? Can I go to a Standard AI store and pull stuff off the shelves?
Jordan:
Yeah, yeah, for sure. I mean, it is hard tech, right? We're five years in and we haven't deployed everywhere yet, but I like to call this space "AV light". You probably have a lot of AV folks on your show, so I'm going to be incessantly pitching them. Everyone in AV should come over to autonomous checkout, because the time is now. But we get to go to market faster.
Actually, our tech is probably not as advanced as AV. I think we're pretty sophisticated, we do some really cool stuff, but we haven't invested quite as much as the Waymos of the world. But we get to go to market faster because we can make a mistake, and if we do, someone gets ketchup for free.
It's actually an okay experience for someone, and retailers are used to it as well. They have a built-in margin that they expect to lose because there's loss and theft and mistakes and breakage, et cetera. So it's just a really more friendly place to be.
We're just now kind of exiting MVP stage. We're at 10 stores now that we've launched with real retailers. They're just regular stores where we showed up, installed our cameras, and transformed them.
One here in the Bay Area, actually at San Jose State University. We just launched about two months ago, which was super awesome. Because we have more adoption from that one store than our other 10 combined, because the students, they're like "Yes, early adopters, great." We have 500 people using our system every single day just at that one store, which is super exciting.
It's still early, but at the same time we're kind of exiting MVP and really starting to ramp up with our retail clients. They're finally seeing the tech work in their own stores or their competitor stores. And they're getting really excited about how much shoppers love this and also all these other value props that we have been pitching for the last five years around inventory, et cetera. They're finally seeing it now and we're growing into that and starting to expand. So that's really exciting.
Lukas:
In those five years that you've been working on it, what's unlocked the ability to put it into stores? Has it really been kind of making the models more accurate or something else?
Jordan:
It's a whole slew of things for sure, right? There's product work for sure. Because at the end of the day there is real experience and the way that you're presenting it to people in the store matters.
There was also just some go-to-market aspect of it too, right? Where when we started, we were like, "We're just going to put this in every store in the world," which is our intent, but we were like, "Let's go sign deals with everyone."
We were going out and talking to 500,000 square foot stores, mega grocery stores. And then we had to kind of take a step back and say, "Well, look, this is cutting edge tech. We need to start a little bit smaller. We can still partner with big companies and mom-and-pops, but let's go after convenience stores to start with."
It's a smaller footprint. You don't need as many cameras, et cetera, smaller number of items. There was a little bit of kind of a reality check there that we should start a little bit smaller, which we did. So now we've kind of mastered convenience stores and are going to be expanding from there.
But yeah, for sure, the engineering, the machine learning, the operations. I think, for me, operations are always a super under-appreciated aspect of ML. You just got to go heavy on ops, and care a lot about your data, care a lot about your labels and your quality. That's been super important for us.
Lukas:
Interesting. So when you say ops, you mean labeling? That's the primary ops component?
Jordan:
Yeah. Tons of labeling for sure. I mean we definitely have big datasets, and we have a little bit of a HITL (human-in-the-loop) too, as part of our live system. Which is another kind of...I guess you see HITL on some AV systems as well, where there can be a disengagement and then a remote pilot will take over.
But for us that's actually a much easier part of the process, because you're not driving a car, so you don't need this 10-millisecond response time. We just need to get someone a receipt in the next 5 to 10 minutes. So if we kick off a background thread and have a human take a look at something, that's totally fine. And then that's another label that we can throw back into the system. So it's sort of all self-feeding.
Lukas:
When you set up a system in a new store, do you have to train it on the particular inventory in that store? Or I guess even inventory can change over time. Do you keep retraining your systems?
Jordan:
Yeah, for sure. I mean even the stuff that's not...so the items definitely change, the SKU set and the catalog change, but even the stuff that you would hope would be more generic is not quite as generic as you want.
Our people detection and people tracking systems are, in theory, fully generic. We show up at a store, we install cameras, we flip a switch, and we basically have a multi-view tracking system that can fully anonymously track 20, 30, 40 people within a space in real time, which is super, super cool.
But nonetheless, it does get better if you fine-tune on that store, right? So we'll go in, and over the course of the first month or two we'll label a little bit of data, fine-tune the model, and then redeploy to that store. And you do get a boost by doing that. I think at some point you probably start seeing diminishing returns. We're only at 10 stores, but presumably at 1,000 stores or 10,000 stores, that human model is going to be so general that there's probably no point in fine-tuning on a per store environment.
But when it comes to products, like you said, there could be a different product in every single store. That plateaus too. To give you some rough numbers, a C-store can have maybe 5,000 unique SKUs in their store and they're going to have maybe 30,000 unique SKUs across their fleet. But that fleet might be 1,000 stores. You get pretty good economies of scale once you start getting to fleet-scale, because to go from one store to the full fleet, you're only going to 6x the size of your catalog, but you're going to 1000x the size of your deployments.
It pays off in the long run, but it's super expensive when you're only in 10 stores like we are. So we work really hard to stay on top of those ops of the churn of new SKUs that are showing up — it's the Easter version of the Snickers bar — it's just constantly churning for sure.
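As a back-of-the-envelope illustration of that scaling argument (the numbers below are the rough ones from the conversation, not real figures):

```python
# Why per-SKU modeling cost amortizes at fleet scale, using the rough
# numbers from the conversation (illustrative only).
skus_per_store = 5_000      # unique SKUs in one convenience store
skus_per_fleet = 30_000     # unique SKUs across the whole fleet
stores_per_fleet = 1_000    # stores in that fleet

catalog_growth = skus_per_fleet / skus_per_store   # ~6x more SKUs to support
deployment_growth = stores_per_fleet               # ~1000x more stores served

print(f"catalog grows ~{catalog_growth:.0f}x, deployments grow ~{deployment_growth}x")
```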
Lukas:
What do you do in that time when it's training, that month or two where people are coming in? Is it all sort of human-operated at that point, and then it gradually cedes to the ML algorithms? Or do people have some other mechanism for paying?
Jordan:
What's cool is we run in the background because we're showing up at existing stores. We're not building a new store, right? So the same store is there. We're not getting rid of the existing point-of-sale system, the existing checkout system.
We install our cameras, we're doing our things behind the scenes, and then the store's just running as is. It's only when we're ready — and we've shown the retailer that we've reached a certain accuracy — that we flip the switch on. But then even then, once we flip the switch, it's not a hard crossover.
Which actually is really nice for us, because there's still people...Apple Pay has 6% adoption or something right now. You're not going to see an overnight, 100% adoption of Standard. Although at San Jose State, we do see that, we're basically at 100% adoption at that store. But-
Lukas:
-that's awesome.
Jordan:
-in most stores, you're not going to see 100% overnight. You're going to see like 5%, right? To start with, and it'll take time for everyone to switch over.
But what's cool about that is you get the point-of-sale signal. The point-of-sale system will tell you what the non-Standard shoppers are buying. Our system can still predict what we think they're buying. And then that's actually a nice corrective signal where we can say, "Well, where are we making mistakes?"
We have a team that will do deep dive analysis to sort of suss out what happened and then ultimately see if that needs to be a training label back into the system. That's a really nice flywheel because that's just running before and after we launch.
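A minimal sketch of the kind of corrective signal being described — comparing the vision system's predicted basket against the point-of-sale receipt for a non-Standard shopper and flagging mismatches for review (names and structures here are hypothetical, not Standard's actual pipeline):

```python
from collections import Counter

def basket_mismatches(predicted_skus, pos_receipt_skus):
    """Compare the vision system's predicted basket against the POS receipt.

    Returns SKUs the model over- or under-counted; any non-empty result is a
    candidate hard example to review and potentially label for retraining.
    """
    predicted = Counter(predicted_skus)
    actual = Counter(pos_receipt_skus)
    over = predicted - actual   # items we predicted were taken but weren't rung up
    under = actual - predicted  # items rung up that the model missed
    return over, under

over, under = basket_mismatches(
    ["cola-12oz", "chips-bbq"],
    ["cola-12oz", "chips-bbq", "gum-mint"],
)
# over == {}, under == {"gum-mint": 1}  ->  send this trip to the deep-dive queue
```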

Supervised learning, the advent of synthetic data, and the manifold hypothesis

Lukas:
Totally, totally. Have your views on computer vision architecture changed since 2017? I feel like computer vision is constantly making advances. Does that affect you? Have you changed at all the way you've thought about training your models?
Jordan:
Yeah. We're still doing a lot of just old-fashioned — at this point — supervised learning. When we started Standard, I had a rule. At the time I was much more involved with the ML team, and had a rule with the ML team, which was, you're only allowed to do supervised learning.
Even five years ago, there was all this fancy stuff, right? Autoencoders and whatever, blah, blah, blah. It was all...I don't want to say BS, it was good research. But it was not production quality, industrial machine learning yet. But it was super attractive, people wanted to play around with those things.
That was my rule. "It has to be just old-fashioned supervised learning. We're just going to throw a bunch of data at this thing. I'm sorry that's not glamorous. It's still going to be really hard, I promise you. You're going to have plenty of chances to solve hard problems." And we did, we solved some really cool stuff. But that was kind of the rule back then.
I kept that rule for a long time and I think it's just now getting to a point where I think there's different ways. Of course supervised is still the mainstay, but I think synthetic data is getting super interesting. I think also just in the last 6 to 8 months, this self-supervised revolution that's happening in vision — that had already happened in NLP — is super fascinating.
We're starting to play around a little bit. It's not in our production models yet, but we're starting to play around with it a little bit. It's pretty wild what some of the stuff can do.
Actually, I had COVID about a month ago, so I had a few days where no one was letting me "work." So I was just programming instead. I was like, "I'm just going to play around with some of this self-supervised stuff." I took all of our images from all of our shelves from production, with no labels whatsoever. It was hundreds and hundreds of gigabytes of just images of products.
I trained one of these massive Vision Transformer Masked Autoencoders. I just let it run while I had COVID because it's about a week to recover, so I just let it train the whole time.
The thing that was super striking about this was, first of all, it took me like four hours to do this. Shout out to Hugging Face and all these...5 years ago, even if I knew what the model architecture should have been, I would've spent a month programming this thing.
Lukas:
Totally.
Jordan:
Here it is, a couple hours punching around GitHub, tuning up a little bit of stuff, I spin up an instance on Google, and then I just let this thing run.
Lukas:
Wow. You just ran on one instance, it wasn't even distributed?
Jordan:
Yeah, I got the biggest instance I could. It was a 16-A100 instance. Which is something that only I get to rent. Hope no one from our company is listening and is like, "Oh that means I get to go rent 16 A100 GPUs."
But yeah, you didn't even need to...I think, the Vision Transformers are still not as big as the NLP Transformers, right? You don't need the...was it the PaLM model from Google, where they had two v4 TPU super pods. Like 5,000 TPUs or something, right?
Lukas:
Right.
Jordan:
For who knows how long. There's no vision models that are even close to that big right now, but maybe we should be starting to do that stuff. I don't know.
But anyway, just to wrap this, I know I'm going on super long. I trained this masked autoencoder and it's basically perfect. It's insane. You can mask out 95% of an image. In the paper they talk about doing 75% masking, but there's just such a clear signal from products — because CPG products have just such clear packaging — that you can mask out even more of the image and it'll reproduce super faithfully basically the whole package. Because it's able to learn what the packaging should look like, right?
If you think about it, we always talk about the manifold hypothesis where images sit on some sub-manifold, which is maybe or maybe not true, but it's definitely true for CPG. Because if you have a CPG product, the manifold is six degrees of freedom, right? It's rotation and translation with a little bit of lighting. But that's it, it's literally a very low-dimensional manifold. The model's just able to learn that on its own and it just completely faithfully replicates these. For me, it's just wild.
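For readers who want to try a similar experiment, here is a minimal sketch of self-supervised masked-autoencoder pre-training on a folder of unlabeled images using Hugging Face's ViT-MAE implementation. The paths, learning rate, and the higher-than-default mask ratio are placeholders chosen for illustration, not what Standard actually used:

```python
from pathlib import Path

import torch
from PIL import Image
from transformers import AutoImageProcessor, ViTMAEForPreTraining

# Hypothetical folder of unlabeled shelf crops; any directory of images works.
image_paths = list(Path("/data/shelf_crops").glob("*.jpg"))

processor = AutoImageProcessor.from_pretrained("facebook/vit-mae-base")
# Start from the public MAE weights and raise the mask ratio above the paper's
# 0.75 to mimic the "mask out even more of the image" experiment described above.
model = ViTMAEForPreTraining.from_pretrained("facebook/vit-mae-base", mask_ratio=0.9)
optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-4)

model.train()
for path in image_paths:
    image = Image.open(path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    loss = model(pixel_values=pixel_values).loss  # reconstruction loss on masked patches
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```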
Lukas:
I've not heard of the manifold hypothesis. Could you describe that? It seems like it would be more than six degrees of freedom for a packaged item.
Jordan:
Well, for other stuff it's way more. Actually, you can see it as it shows up...sorry, I've been eating snacks here. When you see how well it does on things like chips, it still does really well. But a bag of chips has more than six degrees of freedom, right? Because it's not just rotation and translation in three-space, which is six degrees of freedom. It's got all of this — sorry for the noise — all the crumpling, right?
There's actually a lot more degrees of freedom. But for something that's rigid, rigid body motion says it's just six degrees of freedom. X, Y, Z, and then yaw, roll, pitch. Is that what it is? And that's it, right? Compared to a fixed camera, there's only six degrees of freedom. If you take out the lighting aspect of it, which adds some additional degrees of freedom.
But yeah, the manifold hypothesis is that real, natural images live on these much smaller manifolds — a human has many more manifold dimensions to it — but these CPG packages are six dimensions. So you can learn it pretty quickly, apparently.
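To make the six degrees of freedom concrete: a rigid package's pose relative to a fixed camera is fully described by three translation and three rotation parameters. A toy sketch using SciPy (illustrative only):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def rigid_pose(x, y, z, yaw, pitch, roll):
    """Build a 4x4 rigid transform from the six rigid-body parameters."""
    T = np.eye(4)
    T[:3, :3] = Rotation.from_euler("ZYX", [yaw, pitch, roll], degrees=True).as_matrix()
    T[:3, 3] = [x, y, z]
    return T

# Up to lighting, every appearance of a rigid package in front of a fixed camera
# is a point on this six-dimensional manifold.
pose = rigid_pose(x=0.1, y=-0.3, z=1.2, yaw=30, pitch=10, roll=5)
```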
Lukas:
That's amazing that you're able to spend time training your own model. I'm jealous.
Jordan:
I'm jealous too, because it doesn't happen very often.

What's important in an MLOps stack

Lukas:
I'm curious. You guys have been Weights & Biases customers since the very early days, and I'm not here to advertise Weights & Biases, but I would love to know more about your stack. It sounds like you're playing with Hugging Face. Are you using that in production? What other tools are kind of important to you to make the whole system work?
Jordan:
Yeah, yeah, for sure. I'm super fascinated by this question and the new word "MLOps". I don't know when it showed up on the scene, but now everyone talks about MLOps. And we have the holy religious war around whether or not...I think it's very similar to the process that DevOps went through, where DevOps started as a methodology. It was sort of a practice and then it very quickly transformed into a role, right?
Where it was like, once you can enumerate the things that are in practice, then certain engineers don't want to do it anymore. So you want another engineer to do it for you and you give them that title. You're a DevOps engineer. That was against the whole purpose of what DevOps was about. But then same with MLOps, right? MLOps came around and it was like, this is a practice for how ML engineers should be doing their own day-to-day ML development, right?
Lukas:
Right.
Jordan:
For me, we call this end-to-end, full-cycle machine learning at Standard, which is how we tend to run things. You're responsible for thinking about your business impact, which starts with thinking about the metric that you care about.
I'm a big — I don't know what word I want to use — proponent of thinking about metrics. The easiest thing in the world is to look at a research paper and be like, "Oh, so I'm going to use this mean average precision or whatever. That's what all the researchers are doing," but it's like, "No, stop."
The first thing you need to do is spend a couple weeks just thinking about your metric, because we're in production. We have real world use cases. And I guarantee you that the researcher that came up with mean average precision had no real use cases in mind. They just came up with it because they needed the number, and it is definitely not the thing you want to optimize for, right?
You need to think really hard about what your metric is, and validate that that is the right metric. For me, full-cycle ML is, "Think about your business impact, your metric, your data. Get hands on with your data, get hands on with labeling, get hands on with model training and get hands on with deployment monitoring," and then what we call "closing the loop".
You need to have those tools that will meet you at the end and say, "Actually your journey has just begun. Let's see how things are failing in production. Let's make sure that we're taking those as hard examples to bring back to the flock."
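As a toy illustration of the metrics point above — picking a number tied to business impact rather than a research default — one might track something like receipt-level accuracy, the fraction of shopping trips where the final receipt is exactly right, instead of per-detection mAP. This is a hypothetical example, not Standard's actual metric:

```python
from collections import Counter

def receipt_accuracy(trips):
    """trips: list of (predicted_skus, actual_skus) pairs, one per shopping trip.

    A trip only counts as correct if the entire receipt matches; a shopper
    cares about their bill being right, not about per-box precision.
    """
    correct = sum(Counter(pred) == Counter(actual) for pred, actual in trips)
    return correct / len(trips)

print(receipt_accuracy([
    (["cola", "chips"], ["cola", "chips"]),  # perfect receipt
    (["cola"], ["cola", "gum"]),             # missed an item -> whole trip wrong
]))  # 0.5
```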
Lukas:
Right, totally.
Jordan:
That's super exciting, because it's a whole discipline, but it's also exciting because it's still wild west in terms of "What does the full stack need to look like?" Weights & Biases is super cool. We're on GCP, so we're big Google users. I think they're innovating a lot in terms of what their AI stack looks like.

The merits of AutoML

Lukas:
Oh, you use their AI stack? What's your favorite stuff?
Jordan:
I've never used any of it, I just know we use it. We use Vertex and...that's not true actually. I'm a big believer too in — sorry I say this a lot — AutoML where it's...I personally have definitely played around a lot with Google's AutoML. For me, it's another one of those places where, as an ML practitioner, you don't think to go to AutoML first.
You're like, "AutoML was built for an old-fashioned engineer or someone with a business problem. They don't know how to do ML. So Google built this thing to make it easier for them to dip their feet." It was like, "No, no, no, no, no. Take a step back, first of all." First of all, I'm sure whoever built AutoML put a thousand times more resources into this than then you're going to put into your custom ML model. Second of all, even if it's not better, it's a great baseline, right?
Lukas:
Totally.
Jordan:
Just do it, throw the data at it, get an AutoML baseline and then see if you can do better. Maybe you'll be surprised, maybe you can't, right? But your job is not to build models. Your job is to have a business impact. And if you can do that faster with AutoML or any other tools, just go for it. It's right there. Sorry, I'm preaching to some choir out there.
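A rough sketch of what "throw the data at it and get an AutoML baseline" can look like with the Vertex AI Python SDK. The project, bucket, and budget values are placeholders, and the exact arguments may differ by SDK version, so treat this as a starting point rather than a recipe:

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholder project

# Hypothetical CSV of labeled image URIs already uploaded to Cloud Storage.
dataset = aiplatform.ImageDataset.create(
    display_name="shelf-items",
    gcs_source="gs://my-bucket/labels.csv",
    import_schema_uri=aiplatform.schema.dataset.ioformat.image.single_label_classification,
)

job = aiplatform.AutoMLImageTrainingJob(
    display_name="shelf-items-automl-baseline",
    prediction_type="classification",
)

# Train the baseline, then compare its metrics against your hand-built model.
model = job.run(dataset=dataset, budget_milli_node_hours=8_000)
```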
Lukas:
No, no. It's funny. We had Anthony Goldbloom on the podcast — the CEO and founder of Kaggle — and he was saying that he used Google's AutoML and it got him in the top 10th percentile on a Kaggle competition. Which I thought was amazing, it's like, "Come on guys, use AutoML".
Jordan:
Yeah. I mean, that's what these tools are for, right?
I think that's cool actually, because it's...again, for me it hearkens back to the previous wild west that we had in engineering, right? Where we used to write assembly code. That's what we all did. Not me personally but that's what we did back in the day, right? And then we started developing compilers and blah, blah, blah, and starting to move up the stack.
You had the same thing happen back then, where people were like, "No, no, no. You can't use a compiler. You're never going to be able to write assembly the way that I can write assembly." And sure enough compilers got way better than people and we kept moving up the stack of abstraction.
I think the same thing's going to happen with ML. We're not going to be sitting here tuning, manually writing, "Layer Six goes into Layer Seven, and it's going to go from 128 features to 256 features." That's not our future, I think, as MLEs. It's definitely many levels above that in abstraction.

Deep learning frameworks

Lukas:
You're one of the people that has big deep learning models as a really core part of your business, and you're successfully deploying a lot of them and continuously improving them. So I'm sure people are going to be interested in more specifics around your stack. Could you share your point of view on frameworks and which ones you use? Give me the stuff you like and don't like — I think that would be the most valuable thing you could offer our audience.
Jordan:
For sure. I pride myself throughout my career — even pre-ML — on picking the right horses, the right stacks that, even picked early, end up playing out.
Lukas:
All right. So tell me about your 2017 stack then, because I know Weights & Biases is in there. You were one of our first customers.
Jordan:
So, that was great, but for sure the thing we didn't pick correctly...I picked TensorFlow at the time, and I think the whole world has revolted against TensorFlow. I think that the challenge is, you pick the wrong tech and then it gets steeped in your stack, right? It gets really hard to pull it out. We've switched over to PyTorch since then.
Lukas:
Okay. Wait, let's talk about that. Because everyone's got a different take on this. Why do you think PyTorch beat out TensorFlow? What do you think it was?
Jordan:
I mean, for ease of use is...it's the dev experience, all day. I don't necessarily think that it's technically superior. And I think Google's got a great contender now, I've been playing around with JAX in my spare time. We don't have anything in production in JAX, but we have a few irons in the fire, a few things that we're looking at and-
Lukas:
-nice. You're going to get a couple resumes from our community, I guarantee it.
Jordan:
JAX has got the same great dev experience. The ecosystem's a little bit more nascent, but that's to be expected, right? And I think it's more technically excellent. I think it's got way more head room to grow. And hopefully it's not going to be as painful of a shift to go from PyTorch to JAX. We'll see.
I think, as the deployment stories are maturing too, we're getting to this place where it doesn't really matter the way you train your model and the way you're iterating on your experimentation. The way you're going to production can be decoupled from that. So as long as you have the weights, then you can take it to production in a different way, potentially.
Lukas:
What about CI/CD, production monitoring? Do you use any of the stuff out there? Is it home-grown? How do you think about that?
Jordan:
That is a little past my...I'm not as in the weeds anymore, so I can't answer too much of that. I know we're always complaining about it internally, so I know we haven't settled on the right thing.
Lukas:
The 2017 stack though, so it's TensorFlow. What else?
Jordan:
We definitely built a company on Python to start. It was just a pragmatic choice, because ML is Python, unfortunately. I despise Python.

Python versus Rust

Lukas:
You despise Python. Wow. That's strong.
Jordan:
You got to come out of the gate strong with these.
Lukas:
I love it. Yeah. Tell me more. What do you like?
Jordan:
I mean, it's great as an...if I'm going to write a 50-line script, it's fantastic. If I'm going to write a 50,000-line script, and it's going to be sitting across 20 engineers, it's a disaster and-
Lukas:
-what do you want? What would you prefer?
Jordan:
This is the normal religious war, so I'll just harp on the same points, right? But for me, I like strong types because I think that they're...it's not even that you get faster speed — which you do, that's great — but it's a people contract. It's not a machine contract, it's a people contract and it enforces it.
So I know when I come to this piece of code — whether I wrote it 20 years ago or someone else wrote it yesterday — I know exactly what the output is, what the input needs to be. It's a contract. Whereas, with Python you still have contracts. You still spend a lot of engineering resources coming up with the right problem decoupling and "Where should our API boundaries be?" But it's never enforced.
So, is the contract that we all agreed upon actually what's happening or isn't it? And you just have to trust, or build like a ton of unit tests. But there's no guarantees. So for me, strong typing's all about building trust and being able to communicate better with other people. Because for me engineering, it's a team sport. It's a people sport, actually. It's very much collaborative and that's why I like strong types.
Lukas:
So favorite typed language is what?
Jordan:
This actually became part of our stack evolution — we picked Rust, actually.
Lukas:
Oh, you're going to get a lot of resumes. Okay. I was going to guess that.
Jordan:
We're definitely hiring plenty of Rust engineers right now. So if you like ML, and you like productionizing ML, and you like Rust, and you like streaming a lot of data, you definitely want to come to Standard.
We're still not 100% on Rust. We have some other stuff in our stack too, but we have a good healthy amount of Rust. Actually, one of our early wins — and this is why we ended up choosing Rust — came about because one of our founders was a huge Rust proselytizer.
Lukas:
Wow. In 2017?
Jordan:
Yeah. Even years before that. This was Brandon, one of our co-founders. We were working together for years before that, and he was always pitching me on Rust. He's like, "Jordan, let's use Rust." And my job was to say no. My job as engineering manager is to say no.
Lukas:
It's a funny story. I called the Streamlit founder — who was also on the podcast — after he sold his company for $800 million, and I was like, "Dude, what are you going to do?" And he's like, "Oh, I just want to write more Rust code. I feel like I can finally do it now." You should send him your job opening.
Jordan:
That's cool. Well, if he doesn't have enough money and he just wants to come write some Rust code with us.
But yeah, it's a cool language, right? It took Brandon a while to...because he kept telling me what it was good at and what it wasn't good at. And I was like, "Okay, using what you're telling me, the problem we're working on right now is not what it's for. So you just want to work on this because it's cool."
But then we finally had this problem where it was the right thing. This was our multi-view tracking algorithm, which is this...we run deep learning models per camera to extract these features. And then you have to merge them together across all the different camera feeds in order to build up a single cohesive understanding of how people are moving through the entire store.
That part is not deep learning. It's just this super gnarly graph theory, combinatoric optimization problem. It's dealing with a ton of data, right? Doing a lot of heuristics and it has to be super fast. It has to be soft real-time because it's stream processing, maybe 100 cameras each at 30 FPS.
We were building that algorithm and it was just getting wrecked. And then we were investing so much engineering resources into parallelizing the Python code. And that's when you have to take a step back from Python. When you start fighting the GIL — the global interpreter lock — and you're doing all this funky magic to get around that, and you start introducing the worst possible technical debt to paper over this fundamental limitation of Python, then it's really time to take a step back.
So we evaluated Rust. We're like, "Let's see if we can rewrite this whole..." And it was a big algorithm. ML's awesome because you write like 50 lines of ML and you get magic. This algorithm's like 10,000 lines of gross, nasty, massive heuristics. But we sat down and rewrote it in Rust pretty fast. And we got a 50x speed up.
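At its core, the cross-camera merging step being described is an association problem. A heavily simplified sketch of one way to pose it — matching detections from two cameras by projected distance with the Hungarian algorithm — is below; the production system is a much larger codebase with many heuristics, and the floor-coordinate projection is assumed here:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(detections_a, detections_b, max_dist=0.5):
    """Match person detections from two cameras, both projected to floor coordinates.

    detections_a, detections_b: (N, 2) and (M, 2) arrays of (x, y) floor positions.
    Returns index pairs (i, j) whose matched distance is below max_dist.
    """
    cost = np.linalg.norm(detections_a[:, None, :] - detections_b[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)  # minimum-cost one-to-one matching
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < max_dist]

cam_a = np.array([[1.0, 2.0], [4.0, 4.5]])
cam_b = np.array([[4.1, 4.4], [1.1, 2.1]])
print(associate(cam_a, cam_b))  # [(0, 1), (1, 0)]
```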
Lukas:
Wow.
Jordan:
We've since then gotten an additional 2-3x speedup, because...the whole cool thing about Rust is fearless parallelism. I don't know if you know that expression for Rust. It's not just strongly typed. It's hella strongly typed. It has this ability to identify race conditions and make sure that when you're doing parallel programming, you're not going to shoot yourself in the foot. So you can more confidently move into multi-threading.
We've just gotten huge benefits by moving over to Rust, from a speed perspective and from a confidence perspective too. You have this super complex algorithm, you change something, and you want to push it to production.
The thing that we would've had to rewrite it in otherwise would've been C++. It scared the hell out of me, because C and C++ are just...I've had to deal with production code in those languages in the past. Memory leaks and segfaults...and you're having ML engineers write these algorithms, and they're not necessarily experts at memory management, right?
What's cool is now we have more research-oriented people that can make tweaks to this multi-view tracking code, and we don't have memory leaks in production. We don't have segfaults in production. They confidently make changes to the algorithm, push it to production, and it works. That's super cool. So I'm definitely a Rust proponent.

Raw camera data versus video

Lukas:
Any other early technical choices that you really feel proud of?
Jordan:
That's a good question.
I'll tell you one choice we made that was totally wrong. Which was, I had this belief at the time that ended up being correct, but it was still the wrong choice. The belief was that raw camera footage is better than decompressed video.
If you have a camera feed, the best thing you can possibly do is record those pixels raw to disk, train ML models off those raw pixels, and then deploy your model. Never allow H.264, H.265 compression to sit in between, because it's obviously throwing away information. That's its job, right? It's tuned for human fidelity, so that we can't tell the difference. But I was fanatical that we had to use raw only.
All this engineering work went into just being able to store all this raw data all the time. And it just got way too slow to maintain that engineering work. We finally ended up doing an experiment where we collected a bunch of video data, and we labeled it both from video and from raw, and then we trained the models.
Sure enough, there was a pretty sizeable gap between how much accuracy you could get — I don't remember exactly what it was, but it was meaningful — but we sat down and we were just like, "It doesn't make sense. Sure, that accuracy matters, but we won't have a company if we don't move up to video."
I think the scary thing is, to this day, we still have a little bit of vestiges of working off of images instead of video. You make these decisions early on, and they're weeds that are so hard to pull out. We're all still paying for my sins.
That's the dangers of making some of those bets. But I think we've made good bets as well, in the past.
Lukas:
Where do you store all your data and how do you retrieve it when you want to train on it?
Jordan:
We started off with an on-prem stack. We bought GPUs, and we built machines, and we put them into convenience stores. That was wild, because we had to upgrade the HVAC in order to make sure the convenience store didn't melt down.
Now we run everything in the cloud. We stream everything to the cloud, which is great for iterations. You just have access to whatever you need. I mean, we have retention policies, et cetera, obviously. But I think moving forward — probably in the next year or two — I suspect we'll be taking some of it back on-prem.
Mostly just from a cost calculus perspective. Because the cloud's great — it's super flexible — but it's not necessarily the most economical. Especially when you're talking about renting GPUs, which still costs an arm and a leg in the cloud.

The future of autonomous checkout

Lukas:
Do you worry at all that the problem you're solving might get too easy as ML gets better and better, and would no longer be a deep technical problem?
Jordan:
I don't worry about it. I know that it's going to happen.
We talked about this even early days of Standard. Back then we said 10 years from now, it's going to be a "git clone" to do autonomous checkout. Or worst case, a four-hour project, right? An undergrad's going to be doing it over the weekend or something.
We knew that was going to happen. And I think we're seeing the progress too, right? Even the story I told you about the masked autoencoders, that's such a...and there's real applications to those too. You use that as a pre-training step, and you get better accuracy on item classification. Better than purely supervised, right? It's just this crazy bump in your ability, and it took us a couple hours to do it now that it's just a "git clone".
So it's definitely happening. I think we still have a few years left before it's a "git clone". I would still guess like four years maybe, four or five years.
Lukas:
Just four years, wow.
Jordan:
Four or five years. Things are moving fast, man. It's crazy.
Lukas:
Wow. That's crazy.
Jordan:
But what we told ourselves five years ago was, "Yes, that's going to happen, but the same is true for any industry." A point-of-sales system is...I like to talk about this a lot too. A barcode scanner hooked up to a point-of-sale system was literally state-of-the-art physics 50 or 60 years ago, right?
It's a laser. We didn't even know lasers were physically possible. And then we hypothesized the physics, we validated the physics, we productionized the technology, and now it's so ruggedized that it's in every single store in the world, right? And you don't even think about it as technology.
So yeah, that's going to happen. That's okay, I'm sure we'll have other cool hard problems to solve in 10 years. But what we need to do is transition this tech lead that we have into a sort of a true moat, a true flywheel.
And I think that's making ourselves indispensable to retailers, just providing them so much value that...sure someone else could come along and "git clone" autonomous checkout, but our customer support is amazing. Our product is super refined. The experience is amazing. We've got 30 other amazing features that sit on top of the stack that are invaluable to the retailer. The shopper has come to depend on this because it's Standard in their pocket, and they expect to walk into a store and just have Standard work for them.
I think you have to use this tech advantage to turn it into the normal types of advantages that regular startups are using to build a moat. That's okay. I think that happens to every hard tech company.
Lukas:
Or they don't make what they set out to make. I think that might be a common failure, but-
Jordan:
-yeah. For sure, for sure.
Lukas:
What's one non-obvious thing that you could do to enhance the experience? I'm sure you've thought about this a lot.
Jordan:
There is still this friction in the experience, which is...our visual system is fully anonymous. We don't know who you are. You're Person 17 when you walk into the store, and that's intentional. We don't do facial recognition, et cetera. But we have to tie your payment information to Person 17 somehow.
If you've been to Amazon Go, they do these gates. When you walk up to the store, to get in you have to pass through a gate literally, and you use the Amazon app to open the gate. There's a visual sync, basically, where behind the scenes, Amazon's saying, "Okay, Susan just badged in. We see Person 17 at the gate. So Person 17 must be Susan." You do that, we call it association.
We do something very similar. We don't do gates, because we believe gates are antithetical to good retail. You don't put friction at the beginning, you put friction at the end. Amazon knows that too, so I don't know why they...they're the best e-commerce player in the world, they know that you put friction at the end. You never put it at the beginning. Sorry, just a tirade.
We're strong believers that you don't put gates up. What we do is we put NFC stickers in the store. What you do is the same thing. Anytime during your trip — you don't need to do it to get into the store — you can just come shop. Anytime during your trip, you take your phone out, you bump one of these NFC stickers. And then we do the same thing, where we know when the bump happens on the backend; and we know that that's Susan; and then we know Person 17 was the one bumping because we have this fine-grained 3D reconstruction of Susan as Person 17, so we know where their hand is; and then we do the association.
But there's still that friction, right? You have to take your phone out of your pocket and think about transacting. I have this belief that we'll be able to get rid of that at some point in the next couple years, where — without being privacy invasive — you can keep your phone in your pocket and using additional signals like Bluetooth we should be able to narrow in and figure out Person 17 is Susan. Because then you can really just walk into a store, and walk out, and never have to think about transacting.
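A toy sketch of that association step — matching a timestamped NFC bump to whichever anonymously tracked person had a hand nearest the sticker at that moment. All names, data structures, and thresholds here are hypothetical, not Standard's actual implementation:

```python
import numpy as np

def associate_bump(bump_time, sticker_xyz, tracks, max_dist=0.3, max_dt=1.0):
    """tracks: {person_id: list of (timestamp, hand_xyz)} from the anonymous 3D tracker.

    Returns the person_id whose hand was closest to the sticker around the bump,
    or None if nobody was plausibly close enough.
    """
    best_id, best_dist = None, max_dist
    for person_id, positions in tracks.items():
        for t, hand_xyz in positions:
            if abs(t - bump_time) <= max_dt:
                dist = np.linalg.norm(np.asarray(hand_xyz) - np.asarray(sticker_xyz))
                if dist < best_dist:
                    best_id, best_dist = person_id, dist
    return best_id  # e.g. "person_17" -> now tied to Susan's payment info
```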

Sharing the StandardSim data set

Lukas:
That's cool. One thing I want to make sure I asked you about is your StandardSim data set. Could you maybe describe what that is and why you released a public dataset?
Jordan:
Yeah. This was super cool. It's a 3D sim, basically, of stores. It builds 3D...not 3D reconstructions, but it builds 3D models of stores totally synthetically. Where are the shelves, where are the cameras, where are the products in the shelves; it tries to simulate the way that the products are stocked in the shelves. It has a decent corpus of SKUs, et cetera.
It's just a way to build up these 3D representations of stores. Obviously what's cool about that is you can generate infinite image data of stores, synthetic stores. That's a huge leg up to build, to move quickly, and get off the ground, and start training.
We often see a lot of models where training on synthetic data doesn't give you as good results as training on real data. That's definitely still true for some of our models, but there are some models where you just can't get the data — the labels in particular — in the real world, right? Or you can, but it's just insanely expensive. Segmentation is a good example where it's just so expensive to do segmentation.
For us, actually, we were working on this model called change detection. Which is, if you look at a shelf over time, you can see the item sort of be taken and removed. That's a really interesting...we can create that dataset, the real data set, but how do you label it? Asking a human to look at a before and after image and draw a segmentation mask of where the item was staged, it's not an easy thing to do.
But with the synthetic data, you can just simulate it and get a billion images of before and after with perfect segmentation masks. So that was the original inspiration for creating that data set. And then I think we're all big proponents of open source and I think open source data is sort of the next version of that. If ML's going to revolutionize the world — which it is — we have to make that more democratic. The code is becoming super democratic, the data is not, right?
I think that's sort of an interesting gap. I'm not exactly sure how to fully close that gap, but I think that open sourcing this synthetic data set at the very least is a cool way to help.
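One reason the synthetic route works so well for change detection is that the renderer knows exactly which pixels belong to each item, so a perfect change mask comes for free. A minimal sketch, assuming the simulator can emit per-pixel instance-ID images for the before and after frames:

```python
import numpy as np

def change_mask(instance_ids_before, instance_ids_after):
    """Given per-pixel instance-ID renders of a shelf before and after an interaction,
    the ground-truth change mask is simply every pixel whose item ID changed."""
    return instance_ids_before != instance_ids_after

before = np.array([[0, 3, 3], [0, 3, 3]])  # item 3 sitting on an empty shelf (0)
after = np.array([[0, 0, 3], [0, 0, 3]])   # half of item 3 was taken
print(change_mask(before, after).astype(int))
# [[0 1 0]
#  [0 1 0]]
```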
Lukas:
Interesting. But this seems very core to your business. You weren't worried that a competitor might use this dataset to build a competitive algorithm?
Jordan:
My opinion is there's some great teams working on checkout. Obviously, I think highly of Standard, but there's a couple other great teams out there.
They're doing this their own way. It would be sort of like Cruise using a synthetic generator from Waymo or vice versa. It could happen, sure. If they do, great, best of luck to you. But I assume they've got their own stuff that they're doing. And switching costs are so high that they're so deeply invested in whatever synthetic thing they've got, or XYZ that they're doing, it's just going to be too expensive for them to switch.
I think really the value of these open source initiatives is for the broader community, so that people can get their hands on this, play around with it, and come up with some other really cool application. And show us what's possible, right? We're so tunnel visioned on trying to build this one thing that we're trying to get out. Maybe there's some other cool stuff that you can do with this.
Lukas:
Have you seen any interesting applications yet?
Jordan:
Not yet, but hopefully someone who wants to come do Rust machine learning will "git clone".
Lukas:
Don't forget about JAX.
Jordan:
JAX, yeah. "git clone", do something cool with it, and then start.
Lukas:
And Weights & Biases, don't forget.
Jordan:
Yes, yes, exactly.
Lukas:
Do you have any kind of benchmark for accuracy on this dataset? Do you think about it like that at all?
Jordan:
We do, yeah. I mean, it's going to be similar...I don't know exactly what it is, we can follow up with the folks that built that dataset, but it's something more like intersection over union, right? It's, "How close are you getting that segmentation mask, basically, to the ground truth?"
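For reference, intersection over union for binary segmentation masks is straightforward to compute. This is a generic sketch, not the benchmark's official evaluation code:

```python
import numpy as np

def mask_iou(pred, target):
    """Intersection over union between two boolean segmentation masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return intersection / union if union else 1.0  # empty-vs-empty counts as perfect
```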

Picking the right tools

Lukas:
Right, right.
All right. Well, we always end with two questions that I want to make sure I ask you. One question is, what's an underrated topic anywhere in ML that, if you had extra time, you'd love to look into or study?
Jordan:
We touched on a lot of them. I guess some of them are underrated, right?
I'm a huge believer in tooling. You got to pick the right tools and you got to keep pushing the tools forward. Huge believer in ops, whether it's labeling or having some human-in-the-loop component, you've got to invest in world-class ops. Those are the unsung heroes in the world. Everyone wants to be an MLE, but the operators are amazing folks who really make this possible.
In terms of more research-y topics in the ML world, maybe this isn't a hot take, but I'm still a big believer in symbolic reasoning. Maybe I'm just one of those old fogies that is going to die on this hill, but it's just so clear to me that the way our brains work is partially symbolic, right? Not fully. Obviously, you get some stroke of intuition, et cetera, for the way we do item classification.
It's like, who knows? It's literally a deep network of real neurons. But it's so clear that when I'm introspecting the way that my brain works for something slightly higher and more abstract, that it's doing something more symbolic and is really kind of thinking through the sort of graph structure of the problem and breaking it down, exploring different aspects of the tree.
I think there's got to be some way to merge it together, right? So if I had just made $800 million, I would be using Rust to solve how to bring symbolic logic and mega Transformer models together to rule the world and solve world hunger.

Overcoming dynamic data set challenges

Lukas:
Wow, I kind of hope there's an exit in your future.
I guess the last question is, what's the hardest part about making machine learning work? In your case, maybe I would say what's been the most surprisingly difficult part of going from these image models to a working system in production that people can actually use to purchase stuff?
Jordan:
So many things, but the world is messy. It's super messy, right? And then in this case, I literally mean messy. Because stores are chaotic places, right? There's thousands and thousands of items, and most of them aren't in the place that the retailer wants them to be.
They have these meticulous plans that they invest in called planograms, where they optimize where all the products should go. And the CPGs are investing too because they're trying to sell you more Snickers. I don't know why I keep using Snickers. They have this plan, but then show up at a C-store, show up at a grocery store, and stuff's everywhere, right?
There's people unpacking boxes, and misplaced items, and there's just random stuff on the floor. They try really hard to keep the store clean, obviously, but it's just a pretty chaotic place. Retail is chaotic, right? You've got thousands of people coming through the store every day. It's going to get messy.
That's challenging. It's a really dynamic visual dataset. And just random stuff happens, right? In the AV world, they talk about the long tail distribution of reality, but yeah, we see that.
Lukas:
All right. Give me some long-tail cases. I love these.
Jordan:
One of our stores, we had a Listeria outbreak, so they had to throw away all the fresh foods. In retail, it's called selling air. You can't sell air, so they had to put something on the shelves, but they didn't have any fresh foods. So that store manager...and store managers are typically super empowered in retail. There's these massive companies, but store managers actually get to have a lot of say, because they're the ones that are trying to sell stuff, right? And they know the local clientele, et cetera.
So that local store manager was like, "Well, I'm just going to go get fresh food. I need sandwiches. I'm going to go get sandwiches." They went and got new sandwiches same day, brought them back, stocked their shelves. And now suddenly from a computer vision perspective, you're like, "Well, we've never seen these products before. We don't know what the barcodes are. We have no data set for this." But the store manager's like, "I need to turn around and start selling this stuff right now," right?
We were able to turn that around, and start selling it pretty quickly, but that's super hard. And again, it's this really rich intersection of engineering, ML, and operations, and client support too. You have to bring all those things together. This is not just ML.
That's a lesson that we've learned over and over again. Every piece of this has some connection to the shopper, to the retailer, to us as a business, and you have to bring all the stakeholders together. We're a super cross-functional team, and we love coming together and looking at all the different sides of the problem to ultimately make something that we can put out into the real world.

Outro

Lukas:
Awesome. Well, thanks so much for your time, Jordan. That was super fun, super informative.
Jordan:
Yeah. This was awesome.
Lukas:
Thank you.
Jordan:
Awesome. Yeah. Thanks for having me on.
Lukas:
My pleasure.
If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description where you can find links to all the papers that are mentioned, supplemental material, and a transcription that we work really hard to produce. So, check it out.
