
Jehan Wickramasuriya — AI in High-Stress Scenarios

Jehan discusses applications for AI in public safety and enterprise security


About this episode

Jehan Wickramasuriya is the Vice President of AI, Platform & Data Services at Motorola Solutions, a global leader in public safety and enterprise security.
In this episode, Jehan discusses how Motorola Solutions uses AI to simplify data streams to help maximize human potential in high-stress situations. He also shares his thoughts on augmenting real data with synthetic data and the challenges posed in partnering with startups.




Timestamps

00:00 Intro
59:36 Outro






Transcript

Note: Transcriptions are provided by a third-party service, and may contain some inaccuracies. Please submit any corrections to riley@wandb.com. Thank you!

Intro

Jehan:
This is what I mean when I say “the complexity for a machine learning team is actually exponentially increasing.” You have to look at these other machine-driven ways to increase the quality of your data and augment your data sets.
Lukas:
You're listening to Gradient Dissent, a show about machine learning in the real world. I'm your host, Lukas Biewald. 
Jehan is VP of AI, Platform and Data Services at Motorola Solutions, where he's responsible for a vast number of machine learning models in lots of different applications running live in production. This is a very practical, useful conversation, and I hope you enjoy it.

How AI fits into the safety/security industry

Lukas:
I was thinking that probably the place to start here is actually what Motorola does. I feel like Motorola has this brand for people my age making phones.
Jehan:
I was just telling Kelly that—actually, even stranger—when I finished my PhD, I started working at Motorola right around the time the iPhone came out, and actually worked on mobile devices. Then, after a stint at other companies, I came back. When it comes to telling people what we do… that brand is definitely stuck in people’s heads. When they think of Motorola, they think of those things, which is not what this company does at all!
Lukas:
Right, right. So why don’t we start there? What are, like, the key things that your company does?
Jehan:
Yeah. So Motorola Solutions essentially is completely focused on the safety and security of the community and enterprises. So essentially there are a couple of different segments of the business. One focuses on enterprise security and video security, physical access control. And then the other focuses on public safety for first responders. Essentially everything from when a 911 call comes in to dispatch units, and then resolution of the incident and case closure, that’s Motorola Solutions’ focus. 
I think in the public safety space, we’re most well known for our mission-critical communications infrastructure, which is something first responders have relied on for decades now. When times get rough, when you see firefighters charging into burning buildings, the radio is what they really rely on, especially back when broadband coverage was much sparser than it is now. And it’s still a huge challenge in many parts of the world, not just the United States, but overseas in the UK as well. 
So in general, that is Motorola Solutions’ mission: essentially to provide safety and security for those two segments. Essentially it’s the same audience. It’s making the community safer, but in terms of how the product portfolio is situated, it’s basically those two segments of the business.
Lukas:
Interesting. So how does AI fit into that? I can think of lots of different ways, but like practically, what goes on? It seems like a really high-stakes place to introduce artificial intelligence.
Jehan:
It definitely is. And if I think about my journey coming to the company, I’d really worked in consumer for most of my career. With machine learning, we kind of took it for granted that it’s just a tool you use in the applications and services you’re building to essentially accelerate, automate, and support decision making. Here you do the same things except, like you said, the cost of a decision that may be incorrect—and also bridging that human understanding—is quite high. So I would say the mission is still the same for all of us who work in machine learning: we want to maximize human potential and use it as an assistive tool. 
I think the reason that this is so important here is that many of our users are in very high-stress situations. So when your cognitive bandwidth is limited, your ability to make decisions as a human is definitely hampered. Now one thing that flows complementary to that is that the amount of data is exploding. 
The amount of data that these users have to consider day in day out is exploding, whether it’s a 911 call taker or a security guard: more video, more audio, more unstructured text, more structured data, more communication. So then the question becomes, “How can I use AI to be able to simplify that?” And I think it’s not just an AI problem, it’s also a usability problem. 
And actually, it’s funny. This weekend, I was reading a book by Katie Swindler, which is “Life and Death Design.” And increasingly there’s a lot of these kind of usability considerations for designing for people in high stress situations. And I think once you get past the frozen response where your prefrontal cortex kicks in and then you’re like, “Okay, now what do I need to do?”— I think one of the things that really stuck out for me was designing for experts. 
Because in public safety, and even in video security, you have a lot of expert users—whether they’re someone who’s been watching video for years and years, they know exactly what’s happening on every single camera, they know the playbook that when something goes awry what to do—expert users typically, when you speed things up for them, they tend to do better because they automate a lot of the standard stuff. The stuff that kind of has to happen ahead of actually using their brain to actually figure out a problem, they automate a lot of it. 
Whereas if you take a novice user—and I’ll get to why this is important in a second—for novice users, they do want to think it through before they get to anything. I think for expert users, that’s becoming a luxury that many of these roles don’t have anymore. Staffing is challenging.
I don’t know how much the audience knows about, I would say, public safety. But a lot of those roles… like when you call 911, your life is essentially in the hands of someone who’s taking that call, figuring out what’s going on and bridging that help to you when you need it. Those expert users are now churning much, much more, in which case training becomes a huge problem.  Expert users tend to do better because they’ve kind of simplified the workflow. 
I think this is where AI can really help in that process. I think for novice users, AI can bridge some of that gap. They don’t have years of expertise to fall back on, where AI can help bridge some of that so that they can actually focus their attention more effectively.
Lukas:
Can we get a little more specific about a single use case and what your software is doing in that?
Jehan:
Yeah, let’s take video security, for example. So, traditionally, when you think about video security, you think of someone who’s watching video. 
As a company, one of the North Star goals that we have is that no one should watch video in the limit because it’s actually impossible. It’s so ridiculous when you actually visit one of these, whether it’s the security operations center or real-time crime center, you’ll see how ridiculous it is to have someone watching all of that video.
The second is managing disparate systems. Whether it’s enterprise or public safety, you have a lot of different vendors in the space. The space is extremely fragmented. I think about it a little bit like healthcare sometimes, where you have information— it’s just present in different systems. 
So the question really becomes, "How can you centrally manage that?" And we’ll talk about cloud and AI a little bit, maybe, as we go on. But it’s really about, “How do you optimize that response?”
So we use analytics, we use AI to be able to help the operator not only focus on what matters within an individual video stream, but also across those different video streams and different systems, be able to surface relevant information. And relevance is really, I think, the key part of what we’re focusing on now.
Lukas:
And can we get even a little more specific? Where are we? What’s a customer? What are they trying to do? I mean, I know nothing about video security, so I think you’re really going to need to walk me through it.
Jehan:
Okay, let’s set the stage. For enterprise, some of our biggest customers are, for example, schools. 
School is a very unique operating environment, I would say. Especially in the United States, with a lot of the issues that have happened here and continue. So typically you have two classes of users in video security.
You have those who monitor video, so a security operations center (SOC) where they essentially pay people to watch video and deal with alerts from the system. At that point, you may have systems that are not AI-enhanced at all—no analytics at all—where you’re just watching the video and dealing with it as a human. Increasingly, many of those video security systems have AI, so you’re watching events as well as viewing video. 
The second class of user is not watching video at all. In fact, it is very rare to have someone spend the whole day at their desk in many of these cases, especially at school. You might have a single roaming security guard who is going about their job checking on different things in the school, dealing with student related issues, tending to the staff. The only thing they have is a mobile phone in their pocket where you may be getting alerts from your underlying video security system. So essentially you have to figure out how to deal with those alerts, including the accuracy of those alerts and triaging that to the right response.
But that’s basically the two customer bases that we have. So the problem we want to be able to solve is, “How do we get the most relevant alerts to those customers and build a user experience where they can effectively deal with the situation when they’re under stress.”

Event matching and object detection

Lukas:
And so what does an alert look like?
Jehan:
So an alert may be just an event that comes from a video security system. For example, many of the cameras that we build today have AI embedded in them. That AI essentially allows you to set up different rules. The customer sets up different rules. 
For example, they may set up a line crossing rule that says, “Okay, when someone crosses this line, send me an event.” That event will basically have, “This rule was triggered. Here’s a snapshot of what happened,” so a person crossing the line, and some other metadata depending on how the rule is configured.
Essentially, an alert is very similar to an Event-Condition-Action-type workflow, where the action is performed by the human, but the event is usually taken care of by AI, typically using some combination of object detection, tracking, and classification. And then the condition is usually set up by humans. And we should talk about that, because as a company—and from an AI perspective—we don’t believe that rules are the right way to go, even though much of what we know as AI came from rule-based systems. Configuring a system using rules makes it very difficult for humans to take what they have in their head visually and then map that to something they need to look out for in the future. 
Because proactively, you set up all of this configuration in the rule—which depends on analytics metadata and AI metadata—but most of us typically don’t know what’s going to happen in the future. This has happened at setup and might never be changed for a long time because it may be complex to go in and change it. 
But me, as a human, when I see something, I know that it’s not right. That’s one thing as humans we do a very, very good job of is that when we visually see something, we can reason about the fact that there’s something gone awry there. There’s something that we want to know about in the future.
The way we’re building systems today is to be able to get closer to how humans think and allow humans to essentially visually specify the things that they care about so that we can essentially push this workflow from being largely a very reactive workflow—and I say that across public safety and enterprise—to a more proactive workflow. And this is where AI can really help.
Lukas:
And so how do you frame the problem as a vision problem? Like, are you trying to track all the people and objects and then set up a rule that’s like, “If a person goes in this area where there’s not supposed to be a person, we fire an event?” Or is it more unstructured, like if we have training data that’s like, “People in the area there’s not supposed to be people,” and then we’re just sort of looking for a custom model to flag something?
Jehan:
That’s a really good question. I would say that for the majority of use cases, most vendors in this space have transitioned to deep learning-based models, which really opened up what we can do from a vision standpoint, obviously.
But typically you have object detection as kind of the core to everything. Which is, people and vehicles are the biggest thing that you care about in most of these. Doesn’t matter, vertical-specific or otherwise, you’ve got to have that. If you’re running analytics on a camera, you’re also doing tracking, obviously. Because tracking can provide you with a lot of other pieces of metadata, not only to make your object detection more efficient, but also you can create different rules around speed, direction, things like that.
So how it’s done today is you have a set of analytics that run, whether it’s at the edge, on a server or in the cloud… and we should talk about distributed computation, because I think this is a key part of where we’re going from an AI perspective that I think is a little bit different from where we are today. You typically have those analytics generating primitives, metadata around, “I detected a person, okay, I subclassified this person based on attributes I understand,” same thing for vehicles. Now I take that metadata and I set up different rules as you said. 
So if I want to know, for example… if a blue car is what I care about, I can use that metadata to my advantage to set up a rule. That rule fires, you get an alert and then that alert goes back to what we talked about at the start where a human can take some action on it.
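To ground the workflow Jehan describes, here is a minimal sketch of the Event-Condition-Action pattern over detection metadata. The field names and the send_alert helper are illustrative only, not Motorola’s actual schema: the analytics emit object metadata (the event), a user-configured rule checks it (the condition), and a matching detection produces an alert for a human to act on.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """Analytics metadata emitted by edge or cloud models for one tracked object (illustrative fields)."""
    object_class: str      # e.g. "person", "vehicle"
    color: str             # subclassified attribute
    crossed_line: bool     # set by the tracker when the track crosses a configured line
    speed_kph: float       # derived from tracking

def blue_vehicle_rule(det: Detection) -> bool:
    """Condition: a blue vehicle crossing the configured line."""
    return det.object_class == "vehicle" and det.color == "blue" and det.crossed_line

def send_alert(det: Detection) -> None:
    """Action placeholder: in a real system this would push the alert to an operator."""
    print(f"ALERT: {det.color} {det.object_class} crossed the line at {det.speed_kph:.0f} km/h")

def on_event(det: Detection) -> None:
    # Event -> Condition -> Action; the human handles the resulting alert.
    if blue_vehicle_rule(det):
        send_alert(det)

on_event(Detection("vehicle", "blue", True, 42.0))
```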
Lukas:
And does it get as detailed as, “These specific people can go here and if they’re not sort of on this list of people, don’t let them go in this area?” How advanced does this get?
Jehan:
That’s—again—I think a very important point. Detection and even matching happen at a level in the computer vision domain where we don’t need identity, for example. What you’re talking about is now connecting that metadata with identity. So, “I do not want a particular person to enter because this person is on a watch list. It might be someone who’s dangerous,” and so on and so forth.
So that’s when you get into things like OCR and facial recognition, for example, where now you’re connecting identity with those descriptive AI analytics, where I don’t know who it is, but I know how to find the person in the visual domain, for example. That is something customers can do on their sites, and that information is managed completely by them. But in terms of getting the analytics down from an object to an actual individual, you need that second piece of information to be able to connect identity.

Running models on the right hardware

Lukas:
Interesting. Do your models typically run on the edge or the cloud or is there some kind of hybrid situation? How do you handle that?
Jehan:
That’s a great question. For our analytics, we use all three, and our vision is that AI really needs to be democratized for users regardless of the equipment that they’re using. 
Some people may invest a lot in edge hardware, which is typically quite expensive, but you can run a lot of the AI computation efficiently at the edge. We use a variety of AI SoCs (systems on chip), depending on the platform, but we also leverage distributed computation, because one of the usability factors we think is important is the ability to centrally manage the information that comes from your AI models. For users, that’s a game changer. 
And so we distribute computation. It should be transparent to the user. You might have a cheap IP camera, for example, but you still want to get the benefits of AI. So at that point, you may be doing the bulk of your computation in the cloud or on a server on premises. How you make that cost effective, there are some interesting things that we do to do that.
I think the biggest benefit—how we think about the edge when it comes to AI in vision—is the camera really tells you where to look. Once you can focus attention, then you can actually be much more opinionated and sure about how much computation you spend to analyze that attention. 
But in the limit, we typically can deal with very simple cameras that essentially only have motion-based alerts, which can be very noisy because they can be triggered constantly. And then our cloud AI essentially is able to analyze that and figure out if that’s actually a true event or not.
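The “cheap camera plus cloud verification” pattern can be sketched as a simple two-stage filter. The cloud_classifier callable below is a stand-in for whatever model runs server-side; only events the model is confident about reach the operator.

```python
def verify_motion_event(frame, cloud_classifier, threshold: float = 0.8) -> bool:
    """Return True only if the cloud-side model is confident the noisy motion
    event corresponds to a real object of interest (person or vehicle)."""
    # cloud_classifier is assumed to return class scores,
    # e.g. {"person": 0.93, "vehicle": 0.02, "background": 0.05}
    scores = cloud_classifier(frame)
    return max(scores.get("person", 0.0), scores.get("vehicle", 0.0)) >= threshold

# Usage (hypothetical): only verified events get pushed to the operator.
# if verify_motion_event(snapshot, cloud_classifier):
#     push_alert_to_operator(snapshot)
```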
Lukas:
Interesting. So the primary reason for going to edge is just that it’s faster and uses less bandwidth. Is that right? I sort of thought there might be data privacy issues that would cause a lot of customers to go local.
Jehan:
Absolutely. I think customers have lots of different reasons. So outside of the technical challenges around bandwidth and compute, absolutely. Some customers prefer to manage their data entirely on-premises, and they have that option. That’s essentially the way we build our system. They have the option of doing that. 
I think increasingly customers are seeing the benefits of centrally managing, even if their data is on-premises. Which is where federated systems become very, very important, right? So, “How do I bring the benefits of centrally managed AI while still operating on AI metadata that is generated on-premises?” We do have solutions that also do that. 
For example, I may be able to conduct a natural language search from the cloud, but that cloud search gets executed on-premises. So if I’m doing a similarity search, for example, where I’m essentially searching in the embedding space for an answer, I may not be storing any of that data in the cloud. It may be on-premises.
Flexibility is key, I think, both in terms of privacy and in terms of managing compute and bandwidth.
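As a rough illustration of the federated search idea, assuming a cosine-similarity search over locally stored embeddings: the cloud forwards only a query embedding, and the on-premises index returns clip IDs and scores, while the video and its embeddings stay on site. Class and field names here are hypothetical.

```python
import numpy as np

class OnPremIndex:
    """Runs at the customer site: stores frame embeddings locally and answers
    similarity queries; only clip IDs and scores ever leave the premises."""
    def __init__(self, embeddings: np.ndarray, clip_ids: list):
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        self.embeddings = embeddings / np.clip(norms, 1e-9, None)  # L2-normalize rows
        self.clip_ids = clip_ids

    def search(self, query_embedding: np.ndarray, top_k: int = 5):
        q = query_embedding / max(np.linalg.norm(query_embedding), 1e-9)
        scores = self.embeddings @ q                 # cosine similarity against every stored frame
        top = np.argsort(scores)[::-1][:top_k]
        return [(self.clip_ids[i], float(scores[i])) for i in top]

# The cloud service forwards only the query embedding and receives (clip_id, score)
# pairs back; the raw video never leaves the premises.
index = OnPremIndex(np.random.rand(1000, 512), [f"clip_{i}" for i in range(1000)])
print(index.search(np.random.rand(512)))
```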

Scaling model evaluation

Lukas:
How do you think about the evaluation of your models? Let’s just take the object recognition to be really concrete. I would think that every customer would have different levels of quality in their object recognition depending on what cameras they’re using, what the background looks like even. And then I would imagine that both kinds of errors are bad for you, right? 
You don’t want false positives, obviously, because of operator fatigue. And then you also don’t want false negatives, because I’d imagine that you might be violating contracts if you miss a real event that should trigger. So how do you think about that?
Jehan:
That’s an excellent question. And it comes back to... one of the reasons we work with Weights & Biases is essentially exactly that. The evaluation of these things has gotten a lot more complicated and nuanced, I would say, even over the last few years. Initially, a lot of vendors—including us—trained one model that is deployed on the camera, and you could essentially buy a camera and use it anywhere. 
What we’ve learned over the last few years is that these different form factors—different fields of view, different environmental conditions—mean that it really doesn’t matter how much data you have. There’s always going to be an element that needs to be adapted to the customer. So, in terms of evaluation, how we look at it is we kind of break the machine learning components down to their primitive pieces. 
For an object detector, we may not change that as often, because that is the piece that uses the most diverse data we have and is as generalizable as possible. The one thing you might do in that case is train different variants, different resolutions, and you might have different models that deal with thermal, for example. We have thermal cameras whose models are trained on data specific to that. So, you have models that don’t change that often but essentially provide you that core signal of where to look.
And then as you go downstream, you might start getting much more specialized, where a user might have a very specific notion of what attributes they want to subclassify. And it’s not the same across different users. This is where the cloud helps and also where we can deploy additional types of models. 
In terms of evaluation, I would say that those types of models are evaluated on a much more fine-grained case basis. So customer by customer, region by region—we kind of cluster different types of customer types to be able to understand how the models are doing. And I think part of it is also our customers have gotten a lot more used to video analytics-driven systems, where they will come to us and say, “Hey, this isn’t performing this well in this situation.” At that point, we’ll go and take a look with the customer. We would collect data to try to understand what’s going on. And then our machine learning teams would dig in to try to make adjustments essentially based on that customer. 
So being able to actually do that and scale a machine learning team has been one of the big challenges I would say in the last couple of years, where we’ve really focused on the best tools. It started with data and data operations. We worked with Figure Eight when they transitioned up—and I know that’s kind of like your background—and when I came in and started running the team over here, that was my number one concern: data, data annotation and data efficiency. I think now we’ve got a good handle on working with different annotation vendors and we realize that you kind of need a multiplicity of annotators and different coverage to be able to do these different tasks. 
Now what’s happened is evaluation has become a bigger problem, which is, “How do I connect my inference infrastructure with my machine learning evaluation tools?” How do I visualize that information so the different stakeholders, whether it’s a firmware engineer or a machine learning engineer, can see, “How did my new model do compared to my previous model.” And, “This specific customer problem that we fixed: how did it affect all the other customers?” 
And so managing my annotated image datasets, custom visualizations, adding loggers to our visualization system so it can deal with our machine learning training repos and our model zoos. This has become probably the biggest area of concentration in the last 12 to 18 months, which is why we use tools like Weights & Biases across our team. We can do the evaluation; it’s the bookkeeping, measuring across those different datasets and actually increasing our speed to do that, that is probably the biggest barrier right now in terms of getting new models out.
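As one hedged example of the bookkeeping Jehan describes, per-customer-cluster metrics for each model release could be logged to Weights & Biases so releases can be compared side by side. The project name, metric keys, and numbers below are illustrative only, not Motorola’s actual setup.

```python
import wandb

def log_eval(model_version: str, per_cluster_metrics: dict) -> None:
    """Log per-customer-cluster detector metrics so model versions can be compared side by side."""
    run = wandb.init(project="video-analytics-eval", name=f"detector-{model_version}")
    table = wandb.Table(columns=["customer_cluster", "mAP", "false_positive_rate"])
    for cluster, m in per_cluster_metrics.items():
        table.add_data(cluster, m["mAP"], m["fp_rate"])
        # Per-cluster scalar metrics make it easy to chart one cluster across releases.
        run.log({f"{cluster}/mAP": m["mAP"], f"{cluster}/fp_rate": m["fp_rate"]})
    run.log({"per_cluster_results": table})
    run.finish()

# Example: log one release; running this again for the next release lets you
# compare "schools" or "retail" performance across the two runs in the UI.
log_eval("v2.3", {"retail": {"mAP": 0.61, "fp_rate": 0.04},
                  "schools": {"mAP": 0.58, "fp_rate": 0.06}})
```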
Lukas:
So does every customer kind of get their own evaluation before it goes live?
Jehan:
Typically no, because we have so many customers. We have thousands and thousands of customers, so that becomes very difficult. I would say the thing that we’ve learned to do is… if a customer has an issue with a particular model or the analytics, it will come through support and get triaged to our machine learning team, for example. 
What we try to do is figure out if what these customers are seeing is common. What other customers may be having this issue? How do we cluster and segment that so that we can go after the problem? Because, as you know, machine learning time is precious and so we try not to solve problems that end up truly being a one-off. There may be other ways that a customer may be able to deal with that problem. 
Also, you have to separate model performance from installation and other things like that, where a customer… like, their camera might have moved. They might have not positioned it in the right way for that task. 
And actually, this is some of the stuff we’re doing now in the cloud to do things like camera health: using machine learning to determine if the camera has moved, or if there’s a spider web on it. Things as simple as that, because initially those issues used to just hit the machine learning team. They were like, “Why are you seeing bad performance?” and after a bunch of back and forth with the customer, you find out this is the problem. 
Okay, we want to automate that. We want to use machine learning to help us find issues like that. And so that’s not as exciting as maybe developing the next object detector, focusing on a new backbone that gives us much better performance. It’s things like this that really help us save time in the machine learning team so that we can do more interesting things.
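A camera-health check like “has this camera moved?” can be approximated without any training data at all, for example by matching keypoints between a stored reference frame and the current frame and flagging a large median displacement. This is a generic OpenCV sketch under those assumptions, not a description of Motorola’s implementation.

```python
import cv2
import numpy as np

def camera_has_moved(reference_path: str, current_path: str,
                     min_matches: int = 50, max_shift_px: float = 10.0) -> bool:
    """Compare the current frame against a stored reference frame; a large median
    keypoint displacement suggests the camera was bumped or repositioned."""
    ref = cv2.imread(reference_path, cv2.IMREAD_GRAYSCALE)
    cur = cv2.imread(current_path, cv2.IMREAD_GRAYSCALE)
    if ref is None or cur is None:
        raise FileNotFoundError("Could not read one of the input frames")
    orb = cv2.ORB_create(nfeatures=1000)
    kp1, des1 = orb.detectAndCompute(ref, None)
    kp2, des2 = orb.detectAndCompute(cur, None)
    if des1 is None or des2 is None:
        return True  # a featureless frame (e.g. an obstructed lens) is itself suspicious
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:min_matches]
    if len(matches) < min_matches:
        return True
    shifts = [np.linalg.norm(np.array(kp1[m.queryIdx].pt) - np.array(kp2[m.trainIdx].pt))
              for m in matches]
    return float(np.median(shifts)) > max_shift_px
```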

Monitoring and evaluation challenges

Lukas:
This is probably not a well-formulated question, but I’m just curious, how many models are you working on at any given time? How many models do you have live in customer sites?
Jehan:
So, I run the AI team at Motorola. It’s not just computer vision, it’s speech and audio, language and NLP. So across all of those, that’s probably a very large number of models. If we focus just on the video space, I would say we’re still looking at under 100 models. So like tens of models. 
And very, very specifically, we’ve tried to keep a handle on it, because I think there is definitely a need for more custom solutions. But managing those solutions—as you pointed out before—we want to make sure that we have the ability to monitor those effectively and evaluate those effectively. It’s still a relatively small number of models, but they touch many, many, many, many customers. And so monitoring and evaluation becomes a huge problem, as well as going back to annotation. I mean, we’re looking at other things like weak labeling approaches, confident learning. 
Alex Ratner over at Stanford. Snorkel AI, the company that they spun out. We used Snorkel actually a few years ago when it was an open source project. And it was difficult mostly because of all the engineering and plumbing needed to actually make that happen. And now that’s what I think Alex’s startup is doing now with Snorkel Flow. 
And, you know, I talked to him recently. I think it’s solutions like that that we really need to get into the edge cases for AI. I think you don’t have data from all these customers. And a lot of customers don’t feel comfortable sharing data, which is completely, I think, fine. We have to find other ways to solve the problem. 
Another example is a company called Cleanlab, which is, “How do you learn with noisy data?” At that point, you’ve accumulated a massive amount of data from different places. Label quality may be highly questionable. So then the question becomes, “How do I actually reason across that in a systematic way?” 
I know you’re smiling because these are the exact things that I think Weights & Biases helps a lot of its customers deal with. But this is what I mean when I say “the complexity for a machine learning team is actually exponentially increasing.” You have to look at these other machine-driven ways to increase the quality of your data and augment your data sets. So now you have to evaluate those. You have to evaluate models that actually help your data, which also needs to be evaluated. And so, I think just getting a handle on that is one of the challenges that we have.

Identifying and sorting issues

Lukas:
So this is a really practical question, but I feel like a lot of people ask me what best practice is here. 
When a customer complains, someone is like, “Hey, you know, this thing did the wrong thing in this situation.” How do you actually triage that? And what are the likely things that you end up doing to fix it?
Jehan:
Yeah, really good question. It’s especially tricky when it’s distributed computation: they’ve got a camera that’s running an AI model or a bunch of AI models, a server on premises that is also running AI, and the cloud. The customer doesn’t care, and they shouldn’t care, where any of that is running. They’ll just say, “A particular event is a false positive… the false positive rate has increased dramatically. I’m seeing this problem, go solve it.” 
And so typically that hits our support team. We are continually trying to make sure that we’re giving our support teams better tools to triage it. And I know this is a problem that you guys are very familiar with as well, where otherwise it’s a sieve: it passes straight through down to the AI team, and now you have machine learning engineers and data scientists getting involved way too early. 
Part of the problem…
Lukas:
Before you go on, I’m just curious, are there any tools that you’ve used that you’d recommend? Like, are you using any kind of ML explainability tools or is it kind of home built? And if it’s home built, what kinds of things are you showing to the support team?
Jehan:
It’s a mix. The first thing we built a lot of in house is visual tools. If you can get video from the system—you have video clips or images—how can I feed that in, dump diagnostic data immediately, and then distill that diagnostic data so that the support team can at least try to figure out where the problem is? Is it happening in object detection? Is it a classification problem? Is it because the environment has changed and your performance dropped off a cliff? 
So we built some homegrown tools specifically for cameras because the camera is probably one of the most difficult things to debug, because you have essentially AI running on a firmware build. We do have to do manual field testing of those cameras as well. You can’t just test it upstream when we generate a model. So a lot of the homegrown tools are particularly to deal with our cameras where we can dump the data and understand it. 
Explainability is a very interesting point. We’re trying to do more of that, where we’re trying to work with a few more tools that exist out there where not only can we get some of that meaningful information out, we can map it to what they understand. Because as you go up the stack—different levels of sophistication in terms of what we run—I think the really important part is feeding that information back into our evaluation, where you started. 
If we have a problem with a customer and the support team is able to identify it, maybe they pass it to our QA team. So now the QA team has more sophisticated tools. They maybe… actually, they do use Weights & Biases today. So for example, they can go and check the machine learning team’s last, whatever, X releases and all of the results are there. They can go and run an evaluation by themselves. We made it as turnkey as possible. 
So the level at which the AI team operates is different from where the QA team operates, where it’s dead simple. We put an abstracted UI on top of it so they can run the same type of evaluation over the new data that has the problem, understand where the problem is happening, and then involve the AI team, who can jump in and actually dig in. I can’t overstate how big a difference this has made, because initially all of those requests were coming straight into the AI team and we were getting overwhelmed with requests.
And a lot of it is triage. I would say 70% of the time on average is triaging and identifying the problem. Fixing the problem is typically not too bad, with the exception of problems where you have gaps in your data or something more fundamental that you need to fix.

Bridging vision and language domains

Lukas:
And what are the fundamental problems that you might have? The ones that are really tough to fix.
Jehan:
I think typically that happens when our data sets essentially don’t have coverage, where you essentially hit a particular environment or a field of view where you just don’t have the training data in the model to be able to actually adequately deal with it.
Or actually, you might have a new model. For example, some of the new models we’re working on specifically focus on identifying very small objects at distance. That is a very difficult problem because it’s difficult for a human and it’s difficult for a CNN. When you try to disambiguate something at 300 meters, it’s basically a patch. I mean, at that point, you’re just doing motion detection. So you have to think outside the box a little bit in terms of figuring out what that is. But typically… that’s one example, where many of our customers still use AI for perimeter protection. 
So object detection at range is something that is a constant query, I would say, especially after we moved to deep learning-based analytics. In some cases, customers think that the previous generation of cascade-based models worked better because they don’t actually have to do detection. It’s essentially blob identification and motion detection. So when they lost some of that capability, they’re like, “Well, why isn’t the CNN, why isn’t the object detector actually picking this up?” And we kind of have to explain it to them. 
One of the things we’re very proud of today is that we’ve been able to combine some of those techniques, where typically you’ll get a detection that ends up being very low confidence and wouldn’t pass the threshold for an alarm. Whereas now, for those low-confidence detections, we can—under certain circumstances, combining different types of metadata—say, “Let’s take a second look using a different technique” and ask, “Is this actually an alarm that a customer might care about?” And I think that’s part of a larger narrative around multimodal analytics. 
I think, for the most part, object detection is largely commoditized. If you look at what startups need to do to get a viable object detector today, whether it’s using the latest YOLO variant or whatever, most people can get going pretty quickly. I think where you end up having issues is exactly the areas that you’ve been asking me questions on, which is the edge cases: whether it’s extreme range, certain types of conditions where you might not have the training data. I think this is where customers end up having problems. 
So to go beyond that, I think… this is almost getting to the point, I won’t say exactly, where speech recognition got to. It got to “good enough” very, very quickly, where the gains from further training ASR models typically weren’t worth the kind of exponential effort. So then everything shifted to natural language: “Well, the transcripts that I’m generating are pretty good. Now how can I do language-based tasks more effectively?” And there’s a bunch of NLP work that we’re doing in that area. 
And I think NLP has become a huge influence for us in vision, as well. I mean, this past CVPR, for example, everything was language plus vision, whether they’re jointly trained models or separately reasoning, using language to reason across vision-based models. This is something that we’ve been looking at for a while. 
So I would say two big trends in the computer vision space: one was unsupervised, semi-supervised learning. You’ve seen Meta, Google and other companies like that really show what’s possible at extreme scale. And then secondly is effectively using language not only to understand human intent, but also to interpret what the user is seeing.
And, like, this is exactly the question you asked me before, where when you get an alert today, that event image pair is not terribly explainable, right? If you have a lot of training, you can look at that event and that image and say, “Okay, I kind of know what’s going on.” But being able to take that result and, in just plain language, explain what’s happening, not only helps us digest it better from a cognitive bandwidth standpoint, but it’s just way, way better to go, “Yes, I want to capture that. And I want that alert to happen again.” 
And I think this is where we’re really, really hyper-focused on using language as the glue to be able to essentially move away from logic-based rules and use the way we naturally think about problems to be able to capture future alerts. 
Which is also why, I mean, two sides of our business… you asked about alerting. The other side is forensic and search. We truly believe that everything we’re doing in search, which is heavily NLP-based and NLP plus vision-based, can help us bridge the gap to help users actually create new alerts that they can look for proactively.
Lukas:
Sorry, I think you need to give me another real world example of what this forensic search looks like. Why am I doing this and how does it work?
Jehan:
Okay. So, today forensic capabilities in a video management system—leaving aside alerts—I know something happened. Now I’ve got to figure out why it happened or where a particular person is. Now I fall back to using my search engine, essentially, in a video system.
Lukas:
Sorry, I think you need to make this even simpler. Like, why am I doing this? Someone broke into my school and I wanna… why is this hard? I would think you’d just sort of look at the video feed and see what’s going on. I’m sure that’s a stupid, naive interpretation.
Jehan:
So a couple of different reasons. A very, very simple retail use case: loss prevention. Something has gone missing off the shelf, for example, or someone stole something. I know that that happened. How do I trace it back to figure out who it was? When did it happen? For a school, for example, you know something terrible may be happening, where you’re reacting to what happened. 
The question is, “How do I know where that originated?” “How do I make it safer next time so that it doesn’t happen again?” And, “How do I gather information beyond a single camera?” 
I think this is the crux of the use case, actually. Many sites have multiple cameras. A lot of analytics today focus on single-camera events. So a single camera is going to generate an event for you. Now the person has moved on, they’re in a different camera. They’re in a different part of the site. 
This is where search really helps, particularly things like similarity-based search, because now I can use that visual cue of who it was and search across all my cameras. This is really where they dip into the investigative space, where they saw something happening on a single camera, they take what they saw—whether it’s a person or vehicle—enter it into the system, and now the system will show you occurrences of that person or object across many, many different cameras. Now I can go deeper and understand, “Where is that person now, where was that person, and where is that person potentially going, so I can get ahead of the situation?”
Lukas:
And am I asking these questions in natural language then? Is that what the interface looks like?
Jehan:
That’s where we’re focusing a lot of our R&D effort today. Today, I would say there are two ways you can interrogate a search system. 
Visually—so essentially you can give it an image or an image crop of something, an object of interest, and systems respond to that. We can search the embedding space to be able to figure out if it’s a vehicle or a person or whatever else. 
There is structured search, where you’re looking for a particular attribute. I’m forming my query in the form of, like, “a man with a green shirt,” for example. 
What we’re doing right now—and we have been working on for a while and you’ll start seeing soon—is we want to make that as easy as searching for things on the internet, where you can essentially phrase that in natural language. We can use that natural language representation, then, to do more interesting things in terms of being able to bridge what’s in the vision domain with the language domain.
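One common way to bridge the language and vision domains, not necessarily the approach Motorola uses, is a jointly trained text-image embedding model such as CLIP: index frame embeddings once, embed the natural-language query, and rank by cosine similarity. A minimal sketch, assuming the Hugging Face transformers CLIP checkpoint:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    """Embed video frames (or snapshots) into the joint text-image space."""
    images = [Image.open(p) for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def search(query: str, image_feats, paths, top_k: int = 5):
    """Rank indexed frames against a natural-language query like 'a man with a green shirt'."""
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_feat = model.get_text_features(**inputs)
    text_feat = torch.nn.functional.normalize(text_feat, dim=-1)
    scores = (image_feats @ text_feat.T).squeeze(1)        # cosine similarity per frame
    best = scores.topk(min(top_k, len(paths)))
    return [(paths[int(i)], float(scores[int(i)])) for i in best.indices]
```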
Lukas:
Wow, that’s really cool. It sounds almost like Star Trek or something.
Jehan:
But, I think on the consumer side it’s natural for us, right? It’s funny, like a lot of these verticals… Actually, I got similar comments where they’re like, “That seems like science fiction,” but if you think about consumer applications, we are very used to doing that today as humans. But in a lot of these verticals—whether it’s healthcare or public safety or enterprise security—that’s just not how they do things, because the systems are just simply not sophisticated enough to be able to understand human intent and map human intent to structured data. 
One of the big problems that we worked on initially was… a lot of our knowledge base lies in relational databases. So then the question becomes, “How do I bridge what I’m seeing visually, or what I’m expressing in natural language, to structured data?” 
I mean, there’s a ton of very interesting work now using transformer-based models to figure out, from an indexing standpoint, how to query those structured data systems based on what humans are naturally saying. And we think that’s the future. Making it easier for users to get information out of systems is really the bottleneck today, and many of the systems are too complex for users to figure out. If I have to think about which search to use, I’ve already lost valuable time. And in our business, losing valuable time is, as you said at the start of the conversation, a huge problem.

Challenges and promises of natural language technology

Lukas:
Well, it’s funny. I feel like I, obviously, when I’m talking to a friend I like using natural language. But when I’m engaging with the computer I feel like these natural language interfaces have gotten a bad reputation over the years for over-promising, and then just being frustrating when it’s not doing the thing you want. You don’t know what’s the next thing you should do. 
I guess, do you feel like the natural language understanding technology has gotten to the point where this is really feasible? I feel like I don’t actually engage, maybe ever, with an automated question and answering system that seems to work really well.
Jehan:
I think that’s actually a really, really good observation, and I would say I agree with you. First impressions matter, right? If you use one of the voice assistants and it doesn’t work for you a couple of times, most people will abandon it because they just assume the coverage isn’t there. I still think it’s a huge challenge in general, because the language space is so vast and users can express their intent in so many different ways. 
One thing we have to our advantage in what Motorola does is that our vocabulary is actually fairly narrow. If you think about safety and security, whether it’s public safety or enterprise security, you generally want to ask the same sorts of things: the five W’s, for example. You’re looking for a person or a vehicle and you’re describing the attributes. So I would say the domain space of intent is narrower but much deeper, so you need to perform really, really well on those very fine-grained parts of the intent.
So, for us natural language actually… the last couple of years of work that we’ve been doing has been very promising, because not only can we constrain our models... if you look at a task like captioning, for example. Captioning is a very difficult task to get right. You need a lot of data to be able to perform really, really well. If I think about something like captioning for us, we can really constrain the space that we’re looking for, because we’re looking for those same things. And so we can really double down on what data sets we’re using and how we train those models where they can perform really well. That’s where I think, for us, language is very promising because of the type of problem space that we’re in.

Production environment

Lukas:
That makes sense.
A practical question I have, given that you’re running all these models on live feeds of information, like— you actually really are running at scale and probably need really high uptime. What does your production environment look like? Is this another thing where you’re using third party tools or you’ve built something yourself?
Jehan:
The DevOps situation gets quite complex, especially when you’re thinking about data that’s running on premises as well as in the cloud. I think a lot of the ways we’re bridging that is, essentially, like I said at the start, we’re using central management. A lot of our cloud software runs pretty much the same way as any other vendor runs at scale, and we have redundancy and failover support for that. 
At the edge, it’s really about monitoring. So it’s making sure that we have good information about what our cameras are doing, the health of those cameras, being able to get the right metadata to understand model performance so that we know when something’s going wrong. 
We use a couple of different tools today that we’ve built because we are dealing with formats from our own cameras and data that’s highly proprietary. But I think we’re always looking for other tools where we can essentially centralize a lot of that monitoring capability, because it is very complex. You have multiple pieces of hardware and software running together. 
So, it’s not just, “My cloud service went down.” It’s, “Okay, my camera malfunctioned and now things aren’t working there, in which case, everything downstream is not going to work, either.” I would say it’s a work in progress in terms of making sure that we have good coverage as our solutions become more distributed.

Using synthetic data

Lukas:
Are things like data drift real issues for you that you look to detect?
Jehan:
Absolutely. Especially, I think, when it’s the first model of its type or it’s a new capability that we release. We spend a lot of time in house being able to test a lot of that stuff across as big a diverse and comprehensive data set as possible. But when it’s out in the field, we start seeing things like data drift happening, where it goes back to the question you asked before. 
As we learn from customers… that’s one way we can alleviate that, which is a customer might have an issue. We might recognize that being a common issue where we can address some of those. But we’re also proactively looking at our models and seeing, “How can we combat things like data drift?”
For example, things like synthetic data have become a huge tool for us in certain areas where we’re either unable to collect real data or there’s sensitivity around collecting that sort of data where we simply don’t do it. How do we augment our models with those gaps that we have? 
And we work with a number of companies on this synthetic data front and we’re doing a lot of that in-house as well, where we’re trying to fill some of those gaps to make our models as generalizable as possible. But as you know, it’s definitely a work in progress in terms of keeping a handle, especially as the number of models kind of explodes.
Lukas:
Wow. You know, you’re one of the first people that I’ve talked to—maybe the Waymo head of research was the other one—but most people, I feel like, think of synthetic data as more of a theoretical thing that they’re sort of working on using in the future. It’s interesting to talk to someone that’s actually using synthetic data today to improve the models. 
I’m curious: I mean, if you want to name any vendors that are working well for you or techniques that worked well, I’m sure that would be useful to the people listening.
Jehan:
I mean, I can mention… so, we worked with a company called AI.Reverie. It actually got acquired by Meta not too long ago. So that was a very public vendor. There are a couple of others that we’re talking to right now that I probably can’t share the names just yet. 
But I think one of the areas… you’re right. There’s a lot of, I would say, misinformation and misunderstanding about how synthetic data is useful. There is one camp that believes you can use purely synthetic data to train certain types of models. And that may be true; for certain classification tasks especially, you’d benefit a lot from purely synthetic data to cover the domain gaps that you might have. I think where it gets tricky is when you have a non-trivial amount of real data and you want to augment that with synthetic data. 
At that point… it’s really funny because, initially, we started working with vendors as dataset providers. Essentially, you’d work with them, give requirements, and they would deliver a dataset to you, and you’d do all the training and experimentation in house. And then you realize very quickly that actually you need to do it end-to-end. And now you see a lot of companies actually doing that, where some of them are actually also selling tools for other companies that say, “Okay, you can generate your data. These are the knobs that we’re going to give you. And you can retrain your models and do that kind of in an iterative way.” 
And that’s really where we’ve landed today, where you can’t really think of synthetic data as something you get from a vendor. It really needs to be part of the machine learning development process. 
And for us actually, right now, where synthetic data is the most useful is testing and evaluation. Especially if you think about analytics that go beyond single object, and you’re thinking about groups of things. Whether they’re groups of cars or people, this is a very, very difficult thing to be able to collect data for. 
Even more, I won’t go into this now, but when you think about anomaly detection, especially on high-dimensional data, it becomes extremely difficult to test these things, because the events are so rare to begin with. So you absolutely need synthetic data to do that. For the most part we use it for testing rather than training, though we’ve done some training as well for certain use cases, particularly subclassification and attribute classification, because you obviously have an essentially infinite ability to vary things like color, hue, and texture. 
But testing is huge, especially for things like groups where you’re trying to mimic certain patterns. Going back to schools: when people are panicked, especially when you think about a building that has entrances and exits, there are very specific patterns of human motion that you’re not going to be able to collect, and hopefully you never will, because those things hopefully don’t happen often. Working with synthetic data and incorporating it into our end-to-end pipeline is what we’re doing today, so we can very quickly model out those scenarios.
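For the testing use case, one lightweight pattern is to keep synthetic scenarios as separate evaluation splits and report metrics per split, so a rare scenario (say, simulated crowd egress) can be tracked release over release alongside real-world data. The split names and numbers below are made up for illustration.

```python
import numpy as np

def recall_by_split(records):
    """Compute detection recall per evaluation split.

    records: list of dicts with keys
      'split'      - e.g. "real" or "synthetic_crowd_egress"
      'n_gt'       - ground-truth objects in the clip
      'n_detected' - ground-truth objects the model found (true positives)
    """
    out = {}
    for split in {r["split"] for r in records}:
        rows = [r for r in records if r["split"] == split]
        gt = sum(r["n_gt"] for r in rows)
        tp = sum(r["n_detected"] for r in rows)
        out[split] = tp / gt if gt else float("nan")
    return out

# Example: the crowd-egress scenario exists only as a synthetic split.
print(recall_by_split([
    {"split": "real", "n_gt": 1200, "n_detected": 1098},
    {"split": "synthetic_crowd_egress", "n_gt": 400, "n_detected": 301},
]))
```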
Lukas:
Wow. I mean, it’s funny. I feel like synthetic data companies come to me for advice all the time. And I always feel like, you know, it’ll be very clear if your synthetic data is working to help a customer and then you’ll have a great business. But that part seems really hard to do. I would imagine modeling people in a panic is probably an unusual use case, but like incredibly important, and you better get it really right if you’re going to try to-
Jehan:
I think it’s the same thing, actually. I get a lot of startups coming to me and saying, “Hey, we would like to offer this to you.” Especially data startups and MLOps startups at this point, honestly. And you’ve seen that, again, going back to the latest CVPR, there was a huge push on synthetic data, including the release of and commitment to a new open data set for synthetic data, for example. I think the community, especially the academic community, simply doesn’t know what these companies are doing and where they’re focusing in terms of what outcomes they’re looking to enable. And the advice I give a lot of the synthetic data companies is the same: these are my problems!
So, for example, “I want to be able to get a lot of data about human attributes where I don’t want to collect real data, can you build photo realistic data that is good enough for me to be able to train a model,” for example. Or focus on a specific vertical. Verticals where it’s difficult to be able to collect real data. And I think that’s what we’re starting to see. Like, if I look at a few different startups now, they’re really trying to find their niche.
The other part is tooling. This is one area where I pushed very hard initially when we were looking at vendors, and they simply weren’t ready to share their tools, because they were building them in-house to generate data sets for other customers. 
And I think that is one thing where, if you have a machine learning team—like, you’re not outsourcing your machine learning development, you’re actually doing it in-house—those end-to-end tools that you can incorporate into your machine learning development lifecycle are super important. That is when I think a lot of companies will start to see the value of things like synthetic data: when they can actually develop, train, and test iteratively to be able to see how it’s helping them.

Working with startups

Lukas:
I’m curious, as a startup founder: AI.Reverie was a customer of ours too. We saw that they got bought by Meta, and congrats to them. But did that experience make you a little more nervous about working with startups?
Jehan:
That’s a really good question, actually. I think about this all the time, because a lot of the startups that you talk to, they may be here and then not here in a couple of months. And so tying yourself very deeply to one ends up being problematic.
I would say, just in general—or at least for our team and what the company does—we like companies that focus on platform and tools and build things in a very modular way, because it helps us really understand what the value there is. Take data visualization, for example. Huge problem. Before, we had data scientists building all different types of visualizations: hard to share, hard to keep a library of those things so you can replicate them. 
Same thing around data. If we tied ourselves to a company that was just generating data for us and then they went away, and we have no idea actually how to generate that data ourselves, I think that becomes problematic. I think companies that focus on tools and platform, where we understand what makes them great because they’re focusing on a very specific problem... but we also have the intuition behind what problem they’re solving so that we can start to invest in it more in-house.
So synthetic data is a great example. I don’t think it can be a completely outsourced thing. Companies are going to go after little slices of the problem. I think if you’re really going to be all in on it, you have to invest in tools and technology on your end as well.
And so, just as a general rule of thumb, that’s what we try to look at: companies that are a little more open in terms of how they’re building systems and that have a good diversity of customers, so that we’re not the only ones relying on that one capability and they’re not tuning the solution just for us because we’re their biggest customer, for example. That becomes problematic, as you know.
Lukas:
Are there any other kinds of common mistakes that a guy like me makes pitching a guy like you? Like when startups come to you, do you have any advice for them to, I guess, be a good vendor to Motorola? 
Jehan:
I think, honestly—and it’s probably just a pet peeve of mine—but very few companies actually do any homework on what we do. So they’re pitching something which, if you just spent 30 seconds looking at what we’re doing, it probably didn’t make sense to pitch it. 
And I think the second part is: the volume of pitches is so high right now, especially in machine learning ops or computer vision, or whatever, NLP, that usually people like me who have to look at it… we have a very small amount of time to be able to actually make a decision. 
And I think when I look at it, when I try to make decisions, people is the number one thing. Like what’s the quality of people? I don’t care what problem you’re solving. Like, what’s the quality of the company? Where did they come from? What problems did they solve? That for me is number one. 
Number two is, “Did they take a little bit of time to pitch me on what they think is a good use of their technology for the problems that I’m solving?”
I think those two things help me make decisions relatively quickly. And I think you can tell the founders or the companies who care when they maybe limit the number of people that they engage with. But when they do engage, they’ve done their homework and they kind of know that… they feel strongly that what they’re building could benefit the company.
I mean, Alex is one example, Alex Ratner. I knew of Snorkel. We’d actually spent a bunch of development time using Snorkel. And I think that was a very easy relationship. I mean, and he himself reached out, which made it super easy because we were able to dive straight into “what problems are you solving with your company” and get engaged with them and say, “OK, now we know what path you’re on. We know that you’re someone we probably want to keep working with at some point.” And so that made it easy.

Multi-task learning, meta-learning, and user experience

Lukas:
Awesome. Well, I’m sure that’s useful advice. Maybe we should end with our two questions that we always end with. The second to last one is: what’s a topic in machine learning that you think is underrated?
Jehan:
Oh, that’s a tough one because there are so many problems, I would say, out there. I still am a very, very strong believer in multi-task learning and meta-learning. I think you’ve seen the academic community go in that direction, but now we’re starting to see real results. 
I mean, I’ll point out one thing that’s not so recent but came out of Meta again, which is GrokNet, which essentially uses multi-task learning to do very accurate product recognition. We don’t do any of that; we’re not an e-commerce company at all. But one lesson there, at least for me, was that having a single model that does well across a variety of classification tasks and is trained and optimized jointly is very important. It used to be that you’d choose one particular loss function that you cared about. Now you use a multiplicity of different loss functions, some of which were not even intended for that particular task. 
So for example, GrokNet uses ArcFace, which was developed for face recognition, but they’re not using it for anything to do with face recognition. They’re essentially using it to measure cosine similarity between different embeddings in a very varied space. 
We do the same thing. We started out having n different models, and we want to get that down to some n minus x, which goes back to the point about managing different models. 
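As a rough illustration of that multi-loss idea, here is a small PyTorch sketch of a shared backbone trained jointly with an ordinary cross-entropy head and an ArcFace-style additive angular margin head (ArcFace in the sense of Deng et al.). The architecture, class counts, and loss weighting are illustrative assumptions, not GrokNet’s or Motorola’s actual setup.

```python
# Multi-task sketch: one shared embedding, two heads, two losses optimized jointly.
import torch
from torch import nn
import torch.nn.functional as F

class ArcMarginHead(nn.Module):
    """ArcFace-style head: cosine logits with an additive angular margin."""
    def __init__(self, emb_dim, n_classes, s=30.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, emb_dim))
        self.s, self.m = s, m

    def forward(self, emb, labels):
        # Cosine similarity between L2-normalized embeddings and class centers.
        cos = F.linear(F.normalize(emb), F.normalize(self.weight)).clamp(-1 + 1e-7, 1 - 1e-7)
        theta = torch.acos(cos)
        # Add the angular margin only to the target class, then rescale.
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = self.s * torch.where(target, torch.cos(theta + self.m), cos)
        return F.cross_entropy(logits, labels)

backbone = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 64))  # shared embedding
cls_head = nn.Linear(64, 10)        # task A: ordinary classification
arc_head = ArcMarginHead(64, 50)    # task B: metric-learning-style recognition

x = torch.randn(32, 128)
y_cls = torch.randint(0, 10, (32,))
y_arc = torch.randint(0, 50, (32,))

emb = backbone(x)
# Jointly optimized multi-task objective: weighted sum of the two losses.
loss = F.cross_entropy(cls_head(emb), y_cls) + 0.5 * arc_head(emb, y_arc)
loss.backward()
```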
So I would say multi-task learning and meta-learning. People still think it’s science fiction because, on the academic side, a lot of people look at it and go, “I’ll come back to it in three or five years when it’s ready to use.” But if you pick your spots, I think this is one area that’s grossly underrated. 
The second I would say is user experience. I didn’t talk about it, but in addition to the artificial intelligence team, I lead the user experience research and design team here at Motorola. And I think those two things are critically essential to each other. 
It used to be that we would just develop algorithms in a vacuum, then go to the designers and go, “Hey, can you help me design some software around it?” And I think we don’t do that anymore. We start with a human problem, we try to design the experience, and then we try to figure out how the model can actually fit in that workflow.
And I think any machine learning company should really, really consider that, especially when you go pitch your solutions to someone and you’re still trying to explain it to them after 30 minutes. I think then you probably need to tackle that. So I would say those are the two factors, for me at least.
Lukas:
Yeah, we hear that user experience thing over and over and over. It’s interesting how there’s a lot of movement back and forth between ML leaders and product leaders, I think, which is super cool.
Jehan:
Yeah.

Optimization and testing across multiple platforms

Lukas:
I guess my final question is: when you look at going from speccing an ML application to deploying it live in production, where do you see the biggest bottleneck, or what’s the hardest part about getting a new model into production?
Jehan:
Yeah, so I’ll answer that specifically on the Motorola Solutions side, because that’s probably what your customers are interested in. For us, especially if it involves edge hardware, the complexity is, as you know, synchronizing a software release cycle with a hardware release cycle. That is difficult because you have deadlines, you have supply chain issues, and things like that. 
The second part is… 
I would say the easy part is getting a viable model out of research, if you will, out of R&D. We are very well equipped to do that. We have great tools. Increasingly, our training infrastructure is automated: we do training in the cloud, and we have on-prem compute through our distributed training methodologies. The problems start once we have a viable model. There used to be a framework issue, whereas now, because we use interchange formats like ONNX, we can get a model out that is somewhat framework-independent. 
Second is, “How do I optimize for the platform?” If it’s NVIDIA, I might have to use something like TensorRT. If I’m using an AI SoC, I need to use that company’s tools to not only optimize the model but also do post-training quantization, which is not trivial. Then you’ve got to ask, “Did I lose anything in terms of accuracy?” So getting a model that looks good on as much data as we have, then optimizing it for a particular platform, that part is complex because we have to deal with a bunch of different platforms. 
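The “export, optimize, then check what you lost” step might look roughly like the sketch below, which uses torch.onnx and ONNX Runtime’s dynamic post-training quantization as stand-ins; a TensorRT or AI SoC vendor toolchain would replace the quantization call, but the accuracy-comparison pattern is the same. The model and evaluation data here are dummies.

```python
# Export a model to ONNX, quantize it post-training, and compare accuracy.
import numpy as np
import torch
from torch import nn
import onnxruntime as ort
from onnxruntime.quantization import quantize_dynamic, QuantType

# Dummy model standing in for a trained network.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10)).eval()
dummy = torch.randn(1, 64)
torch.onnx.export(model, dummy, "model.onnx", input_names=["x"], output_names=["y"])

# Post-training quantization of the exported graph (weights to int8).
quantize_dynamic("model.onnx", "model_int8.onnx", weight_type=QuantType.QInt8)

def accuracy(path, inputs, labels):
    # Run the ONNX graph on CPU and score top-1 predictions.
    sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    preds = [np.argmax(sess.run(None, {"x": x})[0]) for x in inputs]
    return float(np.mean(np.array(preds) == labels))

# "Did I lose anything?" -- compare fp32 and int8 graphs on the same data.
inputs = [np.random.randn(1, 64).astype(np.float32) for _ in range(200)]
labels = np.random.randint(0, 10, size=200)
print("fp32:", accuracy("model.onnx", inputs, labels))
print("int8:", accuracy("model_int8.onnx", inputs, labels))
```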
Once we’ve got there, I think the question of “is it good enough”—this is something that machine learning teams struggle with a lot. And I think if you distribute that task between your QA team—or a test team, for example—and the AI team, there are very big differences in opinion on what might be good enough. 
You might go do field testing, and you might test two particular scenes, only two fields of view, and say, “The model is doing terribly here.” The machine learning team will come back and say, “Well, our data set is way, way bigger than that. We trained on, like, a million images across many different scenes, and we think in general it performs well.” 
How are you going to reconcile that when the manual testing side is statistically insignificant? 
So I would say optimization and testing, especially if you’re trying to get these things out across multiple platforms.
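The statistics behind that disagreement are easy to see with a back-of-the-envelope sketch: the confidence interval on accuracy measured from a couple of field scenes is far wider than the one from a million-image offline evaluation. The counts below are made up for illustration and use a Wilson score interval.

```python
# Why a tiny field test can't overturn a large offline evaluation:
# compare confidence intervals on accuracy for small vs. large sample sizes.
from math import sqrt

def wilson_interval(correct, n, z=1.96):
    # 95% Wilson score interval for a binomial proportion.
    p = correct / n
    center = (p + z * z / (2 * n)) / (1 + z * z / n)
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / (1 + z * z / n)
    return center - half, center + half

print("field test, 2 scenes:", wilson_interval(correct=140, n=200))            # wide interval
print("offline evaluation:  ", wilson_interval(correct=920_000, n=1_000_000))  # tight interval
```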
Lukas:
And I guess fixing the problems with the real-world tests is hard also.
Jehan:
Indeed, it is: finding candidate sites, making sure you’re doing it the right way, and figuring out how to scale it. Which, again, is why we’re trying to use things like synthetic data a little bit more effectively. 
And one change we made was our AI data team originally only served our machine learning team. Now the AI data team also serves our platform team and our test team as well, which has started to bridge that gap a little bit in terms of test coverage.

Outro

Lukas:
Awesome. Well, thanks so much for your time. This was really fun.
Jehan:
Oh, thanks Lukas. I really appreciated it, nice conversation.
Lukas:
Yeah, thank you.
