Building AI-powered primary care with Curai's CTO, Xavier Amatriain

Xavier shares his experience deploying healthcare models, augmenting primary care with AI, the challenges of "ground truth" in medicine, and robustness in ML.
Cayla Sharp

Listen on these platforms

Apple Podcasts Spotify Google Podcasts YouTube SoundCloud

Guest Bio

Xavier Amatriain is co-founder and CTO of Curai, an ML-based primary care chat system. Previously, he was VP of Engineering at Quora, and Research/Engineering Director at Netflix, where he started and led the Algorithms team responsible for Netflix's recommendation systems.

Show Notes

Topics Covered

0:00 Sneak peek, intro
0:49 What is Curai?
5:48 The role of AI within Curai
8:44 Why Curai keeps humans in the loop
15:00 Measuring diagnostic accuracy
18:53 Patient safety
22:39 Different types of models at Curai
25:42 Using GPT-3 to generate training data
32:13 How Curai monitors and debugs models
35:19 Model explainability
39:27 Robustness in ML
45:52 Connecting metrics to impact
49:32 Outro

Links Discussed

  1. The Netflix Prize
    • A competition held by Netflix (2006-2009) to improve its recommendation system
  2. Comparative Accuracy of Diagnosis by Collective Intelligence of Multiple Physicians vs Individual Physicians (Barnett et al., 2019)
    • Cross-sectional study comparing the accuracy of diagnoses made by individual physicians to groups of physicians
  3. Medically Aware GPT-3 as a Data Generator for Medical Dialogue Summarization (Chintagunta et al., 2021)
    • Using GPT-3 to generate training data for medical summarization models
  4. François Huet
    • Head of Machine Learning at Curai
  5. "On the “Usefulness” of the Netflix Prize"
    • A blog post by Xavier reflecting on the Netflix Prize
  6. Prototypical Clustering Networks for Dermatological Disease Diagnosis (Prabhu et al., 2018)
    • Image classification with out-of-band distribution
  7. Learning from the experts: From expert systems to machine-learned diagnosis models (Ravuri et al., 2018)
    • Combining expert systems and ML models for medical diagnosis
  8. Research Publications at Curai

Transcript

Note: Transcriptions are provided by a third-party service, and may contain some inaccuracies. Please submit any corrections to angelica@wandb.com. Thank you!
Xavier:
How do you connect the offline metrics that you have in anything you're doing in any model in the lab to what's going to be the real impact that that model has on your product?
Lukas:
You're listening to Gradient Dissent, a show about machine learning in the real world and I'm your host Lukas Biewald. Xavier Amatriain is co-founder and CTO of Curai, an ML-based primary care chat system that we're going to talk about today. Before that, he was VP of Engineering at a website called Quora, which I absolutely love. And before that, he ran the recommendation system at Netflix, which is especially famous for the Netflix Recommendation Prize¹. I could not be more excited to talk to him today.

What is Curai?

Lukas:
I want to start with talking about what you're working on now. I mean, you've had a really long and interesting career in ML, but it probably makes sense to talk about Curai, right? Is that-
Xavier:
Yeah, Curai.
Lukas:
Curai. First, can you tell me what Curai does before we get into how machine learning fits in?
Xavier:
Yeah. I mean, at the basic level, it's an end-to-end virtual primary care service. It provides everything that you could need from your primary care doctor, but it provides it through an application, through chat. Our goal is to provide the best possible health care at the lowest possible price and make it very accessible and very affordable for everyone in the world, while at the same time increasing quality. The way to enable that is using technology and, more concretely, AI and machine learning. We feel like one of the things you can do through machine learning and AI is to automate and therefore make things more efficient, that's pretty obvious, but the other thing that might not be so obvious is that you can also make things higher quality, right? That's very much related to the notion of data-driven decision-making, algorithms, and science in general, which should be behind all the medical decisions. So, the combination of quality and accessibility is what drives our product. But again, our product is basically a virtual primary care service that is provided through an application and through a chat-based interaction.
Lukas:
And so, could I use it today? If I had a health issue, I could talk to a virtual-
Xavier:
Yeah. We're now available in seven states in the US. So that's, let me make sure I don't miss any, it's California, Florida, Illinois, Ohio, South Carolina and North Carolina. So those are the seven states. We plan on being available in all 50 states by the end of the summer, so we're expanding rapidly. And the only reason we're not in the other states is because there are legal implications of expanding and you need a different license for each state. But yes, if you're in one of those seven states, you can download it and start using it for free. After the free trial, the price is very affordable too. So, it's $7.99 a month and you can use it as many times as you want. No copays, you don't pay per usage, it's just like a flat fee and you get everything including prescriptions. You can go and pick up your prescription in the pharmacy, go get your lab tests if you need any blood tests or anything, and we do all of that through a network of partnerships. The healthcare team, which I'm sure we'll get into, is a combination of humans and AI.
Lukas:
So, it maybe triages the questions and the ones that are easier the AI tries to answer and then the harder ones go to a human, or how do you think about that?
Xavier:
Yeah, it's a great question. That's typically the traditional approach, right? You put the AI up front and then whatever the AI decides it can do, it does, and then you pass the rest to humans. We go well beyond that. We consider the AI to be just another member of the team and the AI never leaves the room. So, what it will do is it will call other people. We have a care team that is composed of clinical associates, medical assistants, and then licensed physicians in all the states where we operate, and then the AI. Now, the AI will sometimes, as you said, take over the interaction and just drive it, and whenever it's either finished with whatever task it was doing, or not sure, it'll call in the physician. But it then stays in the room and it provides assessment and augmentation to the physician, so it's both user-facing and doctor-facing. So, the AI is kind of the connection between the two ends. Very importantly, in order to understand this, I think it was kind of implicit in what I was describing, the doctors are part of our Curai care team. So, it's part of the team that is not only providing the care, but also helping us develop the product and helping the system and the algorithms learn from the data that we're generating. This is a so-called learning healthcare system, because at the same time that the AI is helping and augmenting the doctors and the doctors are learning from the AI, very importantly, the AI is learning from being part of this team and from the data that is being gathered as part of this end-to-end process.

The role of AI within Curai

Lukas:
How is the AI augmenting the doctors? Is it suggesting links to go to for research or autocompleting possible responses? How does it actually work from a doctor's perspective?
Xavier:
The AI is doing all of the above, so yes, it is doing all of that. I mean, as you know, people think of the AI as sort of a magical entity that exists somewhere. And the AI is a combination of different algorithms that are controlled by some protocol, right? So there's different machine learning algorithms doing different things and all of them are augmenting the doctors in different ways. But in a typical, or in a simple, scenario, what will happen is the AI will be part of the so-called history taking and it will start by asking questions to the patients, documenting that as entities in an electronic health record, it will call in the doctor, and then it will say, "Hey, I have a differential diagnosis, which is a set of possible diagnoses that I think could be happening. Now, you take it from here, but by the way, I can also suggest questions that you could ask the patient if you want to dig into any of these things." The doctor at that point can say, "Oh, wait, this could be COVID. Hold on. Can you suggest a few questions that I could ask the patient to either confirm or invalidate the hypothesis that it's COVID?" And then the algorithm will suggest questions that either confirm or rule out that particular hypothesis. As it's going along, it's extracting things from the text because this is all chat-based. It's extracting things from the text, it's highlighting important things. It's also summarizing the conversation for the next doctor that comes in to get a summary, and even going all the way to suggesting treatment if the doctor needs suggestions for a treatment once the diagnosis has been confirmed. Very importantly, the AI or the algorithms never make the final decision to either diagnose or to treat. That's always on a physician and we always say it's very important in this kind of environment to have the physician in the loop and to have the physician make the final decision, but we can augment them and make them much more efficient, but also better quality, right? Because in our offline analysis in the lab, our diagnosis algorithm, for example, has higher accuracy than the average physician. So, we're pretty confident, and they keep getting better. We're pretty confident that those diagnosis algorithms are going to be better than most physicians. And even with that, we're not saying we're just going to make the diagnosis. We're just presenting it to a physician and saying, "Hey, this could be one of these 3 things or these 10 things. How do you want to go from here?"
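As an editorial aside, here is a minimal, hypothetical sketch of the workflow Xavier describes: rank a differential from the findings reported so far, then suggest follow-up questions for one hypothesis. The conditions, findings, and weights are invented for illustration and are not Curai's models.

```python
# Toy sketch of differential ranking plus question suggestion.
# Conditions, findings, and weights are invented for illustration only.

FINDING_WEIGHTS = {
    # condition -> {finding: weight}
    "influenza":   {"fever": 2.0, "cough": 1.5, "body_aches": 1.5},
    "common_cold": {"cough": 1.0, "runny_nose": 2.0, "sore_throat": 1.0},
    "covid-19":    {"fever": 1.5, "cough": 1.5, "loss_of_smell": 3.0},
}

def differential(findings):
    """Score each condition by the findings reported so far and rank them."""
    scores = {
        cond: sum(w for f, w in weights.items() if f in findings)
        for cond, weights in FINDING_WEIGHTS.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

def suggest_questions(hypothesis, findings):
    """Suggest the unreported findings that would most change one hypothesis."""
    weights = FINDING_WEIGHTS[hypothesis]
    missing = [(f, w) for f, w in weights.items() if f not in findings]
    missing.sort(key=lambda kv: kv[1], reverse=True)
    return [f"Do you have {f.replace('_', ' ')}?" for f, _ in missing]

reported = {"fever", "cough"}
print(differential(reported))                    # ranked differential
print(suggest_questions("covid-19", reported))   # e.g. ask about loss of smell
```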

Why Curai keeps humans in the loop

Lukas:
I can totally see from a communicating with the patient standpoint, including me, that it would be comforting to say, "Hey, the doctor always makes the final decision." But this is more of an interview about real-world AI. It does seem like, [for] example in chess, before I think Alpha Chess or maybe the latest version of Stockfish, the best chess programs were these hybrid systems with the human in the loop. But then, at some point, the AI got good enough that the human in the loop only messes things up, right? Do you ever have cases where you think that the ML system works better than a human operator and maybe it shouldn't actually give the final decision to a doctor?
Xavier:
It's a great question. I think, as I said before, generally speaking, it's not that hard. Well, I mean, it's taking us a few years, but it's not "that hard" to get an algorithm that's better than the average physician. Now, that being said, it's much harder to get an algorithm that is better than the combination of the human plus the AI. So, even in the examples that you're mentioning, the combination of humans plus AI in chess, if the human is relatively good, meaning a professional player, it's hard to beat, right? So, an AI alone versus a combination of AI plus human is hard to beat. In the case of healthcare, one of the important things to understand is that it's an imperfect information game. It's not about...if you had the perfect information, the algorithm would probably always beat the human, right? And it would be very easy to just beat the human with sort of all the perfect information in the world. However, in the case of medicine and healthcare, there's a lot that goes on with empathizing with the patient, understanding, even things that are called social determinants. Where do they come from? How are they going to understand? How can you communicate the possibility of something being likely or not? And that is very hard to do if you're not a human that is trained to have this level of empathy, so to speak, right? So there's the interesting question, and I keep talking to people that have very different opinions on that, right? There is the purely extreme rational opinion that all you want to have from the outcome as a patient is a list of possible diagnoses with a probability and you'll be able to interpret them. If you're a hyperrational person, that is true. You want to know if you have a 0.2% probability of having cancer, you want to know that there's a 0.2% probability and you think you can deal with it. The reality is that most people don't know how to interpret that, right? What does that even mean, a 0.2% probability of having cancer? Do you want to communicate that, or do you want to interpret that and then follow the patient along and make sure that that probability doesn't get to a point that is more likely than not? I think that's where the human judgment is really key and that's very different from a pure probability that is output from any kind of machine learning algorithm.
Lukas:
Interesting. I guess I would think that I would actually want to have the clear probabilities, but maybe everyone thinks that and they don't really want that.
Xavier:
No. I think you're probably right. If you are in the tech bubble, so to speak, and you're rational and you play music and you're a mathematician or you like math, you think you can very rationally deal with those kinds of probabilities and work with them, but there's a lot of people that are not like that and that's where empathizing and understanding who you're talking to is really key. One of the important aspects, which is somewhat connected to what we're talking about, is that, in particular, we are not designing our service for the tech-savvy people of Silicon Valley or anywhere. We're really using technology to provide a very accessible and high-quality service for people that usually don't even have access to high-quality healthcare and are under-insured, uninsured, and so on. We need to understand the social background of how these people are going to be interacting with the technology and how they are going to need the human aspect of the technology to help them even understand what's happening and how to react to it. I think that's also very important because — I mean, we could get into it...this is more of a philosophical... — but we in tech companies usually get blamed for designing things only thinking of people like us. And then you realize — and particularly in healthcare, it's very interesting because as soon as you start talking to doctors and to anyone from the medical profession — you understand, it's like, "Gosh, yeah." The way of thinking is different. It's like, even how they think, it's not purely mathematical and you need to have a level of understanding of the different ways that people interpret and process information. Now, that being said, I'm not saying that the traditional paternalistic view of medicine is good. The one in which the doctor knew everything, and wouldn't say anything to the patient [but] say, "Trust me, I know the truth. You have to do what I'm telling you, but I'm not even going to say what your diagnosis is." No, I am totally against that. I think there needs to be a middle ground and the patients need to have access to their data and we need to be transparent with what's going on and give information as much as possible. And that's part of our model too, for sure.

Measuring diagnostic accuracy

Lukas:
Going back to a comment you made earlier that your diagnosis is better than the average doctor, or I guess that your system's better than the average doctor...My first question on that is, how would you even know that? Do you follow up and find out later what the real diagnosis was? Also, how would you train a system to be much better than the average doctor? Do you somehow have a way of finding more accurate doctors and then using that for training or how does that even work?
Xavier:
This is a great question. So, when I said that, I specifically added "in the lab". We're better than the average physician in the lab and that's because the only real ground truth that we have to evaluate this is the so-called clinical vignettes, which are basically cases that are documented and agreed upon, and they've been published. There are not many of those, unfortunately, so that's something that is lacking. But, when we are making diagnoses on those vignettes, we kind of agree that that's a ground truth that's been published and that's the one that we use as the measuring bar. There's a public dataset, which is pretty small, but we also have our own internal one that we keep using for development. And we even use synthetic data and all kinds of different data that we can get into. Now, unfortunately, the generation of ground truth in medicine is extremely hard. There are a lot of studies out there with doctors. For example, there's a — well, famous in our field — well-known publication by the Human DX Project² where they found that the average accuracy of a single doctor on vignettes similar to the ones that I'm describing, so medical cases, was roughly around 60%, between 50% and 60%. And in order to get past a reasonable 80% accuracy, you had to aggregate the opinion of six to eight doctors. So basically, the only way you have to really increase that accuracy is saying, "Okay, I'm going to ask eight different doctors and then take the opinion of the ones that agree the most and use that as my ground truth," which is, honestly, what many of us do in the lab to generate those vignettes. It's not "trust one single doctor", but ask many and then have quality processes to understand who is right and then take that as the ground truth. But in order to have a learning healthcare system and sort of have this system improve, the only thing you can do is establish those mechanisms in which the system is actually learning and improving from itself. And you do have sort of humans in the loop having to follow up and say, "Okay, we diagnosed this first, was this correct?" Very importantly, you also have the ability to have follow-ups and very constant follow-ups to understand if you got it right or if you missed something. One of the nice things about the system that we have, which is all virtual and chat-based and message-based, is that we can follow up, and we can automate follow-up, with the patients at very little cost or almost no cost. So, we can literally have the patient come back every hour and check on the patient. It's like, "Hey, did the fever go up, did it go down? Did we get it right or not?" Which is usually not the case in a normal medical situation, right? You go see the doctor and then, if you're lucky, you see them in two weeks. The sampling time between different data points is much coarser than what we have.
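To make the aggregation idea concrete, here is a toy sketch of taking several physicians' reads of the same vignette and keeping the majority answer as the ground-truth label. The diagnoses and the agreement threshold are illustrative, not a description of Curai's labeling process.

```python
# Hypothetical sketch: aggregate multiple physician reads of one vignette
# into a ground-truth label by majority vote, with a minimum agreement bar.
from collections import Counter

def aggregate_ground_truth(diagnoses, min_agreement=0.5):
    """Return the majority diagnosis, or None if agreement is too low to trust."""
    counts = Counter(diagnoses)
    label, votes = counts.most_common(1)[0]
    if votes / len(diagnoses) < min_agreement:
        return None  # send the vignette back for review instead of labeling it
    return label

# Eight independent reads of one vignette (invented example)
reads = ["pneumonia", "bronchitis", "pneumonia", "pneumonia",
         "pneumonia", "asthma", "pneumonia", "bronchitis"]
print(aggregate_ground_truth(reads))  # "pneumonia" (5/8 agreement)
```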

Patient safety

Lukas:
Yeah. It's funny, I'm thinking about my own interactions with doctors, and I was thinking, when I call a hospital or call my doctor to ask them what to do, I feel like I can almost guarantee that they're going to ask me to come in and get more tests. My little sister is also a doctor and I feel like when I call her, I can almost guarantee that she's going to tell me, "Lukas, you're fine. You're being ridiculous. Drink some water and get some rest." And so they're clearly optimizing for kind of different things, my sister and a professional that I call. How do you think about that? What do you optimize for in these interactions? I would imagine that missing a serious condition would be so bad that you would really want to err on the side of caution with your suggestions to patients. But how do you know if you're doing a good job there?
Xavier:
Yeah. I mean, definitely, patient safety is of utmost concern and one that is very critical, and our care team is very much fixated on patient safety first. We do things that even go against what would be good for the business because of patient safety, and that's understood, and it's the right thing to do. However, one of the important pieces here around patient safety and around not erring on the side of being extremely conservative is, one, the population that we are dealing with is a population that doesn't generally have good access to healthcare. So, if our response to their concerns was always, "Hey, go and get a blood test and you need to go through this super expensive procedure and good luck with it and come back to us," and that would be the kind of service we would be providing, these people would not come back, right, because they literally cannot afford it and it's not optimized in any way. So, we need to provide the best possible care while also optimizing the cost side of the equation for them and for the overall system. And the reason we can do that is because we have this high level of access and accessibility. So, we can play it safe because we can always tell them, "Hey, come back in two hours if your fever gets past this," or "If you start coughing tonight, come back." That's something that most doctors...one of the reasons they err on the safety side, sometimes excessively, is liability, but the other one is because they can't assume that they're going to be in touch with you for the next few weeks, right? So it's like, "Gosh, I need to just make sure that this doesn't happen in the next two weeks." If they had the ability to say, "Hey, you're going to be calling me every two hours if there's something happening," they could take a little less aggressive approach, but that's usually not possible. In a system like this where there's a lot of automation and a lot of accessibility through a virtual...through an application, through a phone, you can actually do that and it's much more efficient. More importantly, it's more efficient also long-term for the health of the patient, right, because you're catching things right when they happen and you're not letting it get to a point where it's like, "Oh gosh, now it's too late. Now we need to do this surgery."

Different types of models at Curai

Lukas:
Can you tell me a little bit about your tech stack behind the scenes? You're actually really deploying, it sounds like, multiple models into production and running them live. Are you continuously updating these models? How do you think about that? Are you retraining them constantly on the feedback that you're getting from the human operators?
Xavier:
Yeah. There's a combination of different models and each one has its own cycle. We do have what we call the learning loop, which is the ability to inject data back into the models and retrain them. But there's a combination of different models that have different levels of, I would say, velocity in the way that they can be retrained and they can be redeployed. In my experience, that is not any different than any other company. When I was at Netflix, we had the same. We had some models that had a lot of data and were retrained daily and there were others that, honestly, needed longer windows of data and more data to be retrained, and you didn't need to retrain those that often. So, we're in the same place. Particularly, for example, for things like diagnosis models, we don't get that much good quality granular data on diagnoses daily, right? So, it doesn't make sense. And we need to make sure that that data is high quality, we combine it with synthetic data that we generate from a simulator, and there's a lot of sort of data cooking going on behind the scenes to make sure that those diagnosis models are good. So, that's a model that is not going to be updated that frequently. Now, there are others that are around, say, entity recognition or intent classification or things like that, where we do gather more constant data and those can be updated more often. I will say, just to clarify for everyone who's listening, our modeling and even our research is at the intersection of natural language on one side and then medical decision-making on the other. And they both intersect, right? So there's an intersection of both, and we kind of go all the way from using GPT-3 and language models, to using synthetic data from expert systems to train diagnosis models. There's a very cool intersection of both things, from the purely knowledge-based, knowledge-intensive approach of traditional AI systems in medicine all the way to language models and very much deep learning approaches. We have different models that are in the intersection of those. Some of which, as you can tell, the ones that are more on the data-intensive language side, we do get more constant data for and we can retrain. The ones that are more knowledge-intensive, we have to sort of do intermediate processes for, so to speak.
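As an illustration of the different retraining "velocities" Xavier mentions, here is a hedged sketch of a per-model retraining schedule. The model names, cadences, and thresholds are invented for the example; they do not describe Curai's configuration.

```python
# Hypothetical per-model retraining schedule: data-rich NLP models refresh
# often, knowledge-intensive models wait for enough curated + synthetic data.
RETRAIN_SCHEDULE = {
    "intent_classification": {"cadence_days": 1,  "min_new_examples": 1_000},
    "entity_recognition":    {"cadence_days": 7,  "min_new_examples": 5_000},
    "diagnosis":             {"cadence_days": 90, "min_new_examples": 20_000,
                              "mix_synthetic": True},
}

def due_for_retraining(model, days_since_last, new_examples):
    """Return True when both the time window and the data budget are met."""
    cfg = RETRAIN_SCHEDULE[model]
    return (days_since_last >= cfg["cadence_days"]
            and new_examples >= cfg["min_new_examples"])

print(due_for_retraining("intent_classification", days_since_last=2, new_examples=3_000))  # True
print(due_for_retraining("diagnosis", days_since_last=30, new_examples=50_000))            # False
```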

Using GPT-3 to generate medical summaries

Lukas:
That makes sense. Do you literally use GPT-3?
Xavier:
We do. In fact, we just published a paper about it³. We won the best paper award at one of the workshops at ACL. In that particular case, we were using GPT-3 for generating training data for language summarization. So, that's an interesting approach, I think, one that I know several people are following in different domains. Instead of using GPT-3 directly at inference time, you use it as a way to enhance and generate high volumes of training data with different priming mechanisms. It's a very interesting approach, and one that we showed in our publication is actually better than just having a lot of humans generating training data. So that's an interesting case.
Lukas:
Can you tell me more about how this works? How do you exactly generate the data and what's the...is it a summarization task?
Xavier:
Yeah. It is a summarization task, and summarization of medical conversations is pretty hard because you need to generate the data, but also...Sorry, you need to have the original data, but then generate summaries, and examples of summaries which are mostly correct but some that might be incorrect, to make decisions on when you're training the model. It has to learn what is a good medical summarization and what's a bad medical summarization. So, in the case of this project, what we did is prime GPT-3 with a number of examples of both positive and negative summaries of conversations, and then have it generate thousands of different training examples that we used to train our own offline model. And interestingly, the availability of more data, but also the more nuanced variability that GPT-3 was generating itself, meant that the final model that we were training was better than anything that we could have trained with our own data and our own human labelers.
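For readers who want to see the shape of this approach, here is a minimal sketch of few-shot priming for data generation. It assumes a generic `call_gpt3` completion function (a placeholder, not a specific API) and invented example dialogues; it illustrates the idea rather than the paper's actual pipeline.

```python
# Sketch: prime a large language model with a few (dialogue, summary) examples,
# then have it emit summaries for new dialogues. The resulting pairs become
# synthetic training data for a smaller in-house summarizer.
# `call_gpt3` is a placeholder for whatever completion client you use.

SEED_EXAMPLES = [
    ("Patient: I've had a dry cough and a low fever for two days.\n"
     "Doctor: Any shortness of breath?\nPatient: No.",
     "2 days of dry cough and low-grade fever; denies shortness of breath."),
    ("Patient: My left ear hurts and I feel dizzy.\n"
     "Doctor: Any discharge from the ear?\nPatient: Yes, a little.",
     "Left ear pain with dizziness and mild discharge."),
]

def build_prompt(new_dialogue):
    """Few-shot prompt: seed examples followed by the dialogue to summarize."""
    shots = "\n\n".join(
        f"Dialogue:\n{d}\nSummary: {s}" for d, s in SEED_EXAMPLES
    )
    return f"{shots}\n\nDialogue:\n{new_dialogue}\nSummary:"

def generate_training_pairs(dialogues, call_gpt3):
    """Return (dialogue, generated_summary) pairs to train a smaller model on."""
    pairs = []
    for dialogue in dialogues:
        summary = call_gpt3(build_prompt(dialogue)).strip()
        pairs.append((dialogue, summary))
    return pairs
```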
Lukas:
It's so interesting because you would think that the generation task could be so much harder than the decision task, if something's a good summary or not. It's kind of amazing to me that that works so well.
Xavier:
Yeah. I mean, to be clear, we could have tried to use GPT-3 directly for the task at hand, if we had had access to sort of unlimited resources and fine-tuning. By the way, I know that OpenAI is going to open the API for fine-tuning soon, but we didn't have that at the time. Also very importantly, there's a tricky aspect here with the privacy of the data that we're dealing with, right? We don't want to be in a situation where we are sending GPT-3 data that is private from our patients, unless there are some guarantees of very strict compliance and privacy. So, if all those things were met, you could use GPT-3 directly and you would probably get a summarization that is as good as the one that we were generating. However, because that did not exist, it's a very interesting intermediate step to, again, prime GPT-3 with some knowledge and some examples, and then let it generate all these other training examples that you can then use to train your own model. I mean, you're not going to train a GPT-3, but you don't need to, right, because the complexity of the model and the number of parameters that GPT-3 has is because it's a language model and it's a universal language model, right? But for the model that you're training, which is very much focused on summarization in a particular domain, you can train a much smaller, much more efficient model with the right data and you're going to get the same... I mean, I'd be interested, I don't know if it's exactly the same accuracy or it's even better because, again, there's the question of how much a universal language model can be as good as a smaller model on a very specific task, right? Which is what we train.
Lukas:
That makes total sense. That's really cool. Do you worry about training on the conversations that you have? I imagine those are incredibly sensitive conversations with patients. If you use that data to train models, is it possible that some of the information could kind of bleed through into the models? Do you take precautions somehow to try to remove personal identifying information before you train a model on the data?
Xavier:
Yes, we do. All our models...sorry, all our data sets that are used for our models go through a de-identification process and we do make sure that the identifiers that our original data sets have are actually extracted. That being said, you can never guarantee 100% perfection, right, on those. De-identification of text is in itself a research task, so there's different approaches to it and there's different things that can be done. But even with that, you'll get as far as a particular percentage of accuracy. We're pretty confident that most of the data sets that we use to train the models are pretty well de-identified, which in turn means that the likelihood that something even bleeds into the model is very, very small, right? Because it would need to be the combination of "something makes it through the de-identification step" plus "something gets picked up in the model that can then be retrieved". But otherwise, sure, it would be a concern. Right?
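As a toy illustration of the kind of scrubbing involved, here is a regex-based de-identification sketch. Production de-identification typically relies on trained NER models plus rule sets; this only shows the idea of replacing obvious identifiers before training.

```python
# Toy de-identification pass, for illustration only. Real systems combine
# trained NER models with rules; this just scrubs a few obvious patterns.
import re

PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"), "[PHONE]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), "[DATE]"),
]

def deidentify(text):
    """Replace obvious identifiers with placeholder tokens."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text

print(deidentify("Reached patient at 415-555-0199 on 03/12/2021 re: lab results."))
# "Reached patient at [PHONE] on [DATE] re: lab results."
```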

How Curai monitors and debugs models

Lukas:
Do you have systems to evaluate the quality of the models before you deploy a new one into production or do you do live production monitoring on the quality of models as they run?
Xavier:
We do have systems and we do have a process in place. We have different data sets, different metrics and different sorts of processes to make sure that we detect any anomaly. It's interesting because, in fact, I was talking today to François⁴, who is leading my AI engineering team. They're building a tool now that we're going to be using that basically automatically enables you to analyze the anomalies that we detect when we change a model, by seeing actual examples of the actual cases. I was talking about the vignettes that we have, for example for diagnosis, right? So if you train a new model and all of a sudden you see a difference, like, "Hey, this metric is lower than in the previous version of the model." That's okay. But in this case you really want to understand, is it being unfair to a particular demographic? Is it worse for older people or for women or for...or can I actually go and see where it made the error? And then, interestingly, now you need the collaboration of a physician or a doctor to sit with you and say, "Hey, this new version of the model decided that this thing, instead of the flu, was a cold. Is this correct or what's going on?" And then you need to debug. In most cases — and I know this is something that people in other companies have, these kinds of debugging tools — they are usually debugging tools that a layman or a layperson can understand. When I was at Netflix, we did have a similar tool where you would see the shows and go, "Whoa, this ranking doesn't make sense." But if you're dealing with a highly knowledge-intensive domain like medicine, you actually need that collaboration with the doctors. And we do have doctors on the development team and we do have experts that are sitting hand-in-hand with the engineers and the researchers to do those kinds of iterations and debugging and QA of the medical models.
Lukas:
That's cool. So what does the interface look like? It sort of shows somehow the explanation behind why the model made the choice that it made?
Xavier:
Yes, yes. It shows the overall difference between the previous model and the current model, and then you can click and see, like, "Okay, what are the ones that it got right and what are the ones it got wrong?", compare the two models, and you can kind of see the diff with a color code, so you can actually dig in and say, "Okay, well, yeah, this one it got wrong. It's very wrong, so we should not move forward."
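Here is a hedged sketch of that kind of comparison view: run two model versions over the same labeled vignettes, surface regressions for clinician review, and break accuracy down by demographic. The field names and structure are illustrative, not Curai's tool.

```python
# Sketch: diff two model versions on a labeled vignette set, report per-group
# accuracy, and collect cases the new model regressed on for clinician review.
from collections import defaultdict

def compare_models(vignettes, old_model, new_model):
    regressions = []
    per_group = defaultdict(lambda: [0, 0, 0])  # group -> [old_hits, new_hits, n]
    for v in vignettes:
        old_pred = old_model(v["findings"])
        new_pred = new_model(v["findings"])
        group = v["demographic"]
        per_group[group][0] += old_pred == v["diagnosis"]
        per_group[group][1] += new_pred == v["diagnosis"]
        per_group[group][2] += 1
        if old_pred == v["diagnosis"] and new_pred != v["diagnosis"]:
            regressions.append((v["id"], old_pred, new_pred))  # needs clinician review
    for group, (old_hits, new_hits, n) in sorted(per_group.items()):
        print(f"{group}: {old_hits / n:.2f} -> {new_hits / n:.2f} (n={n})")
    return regressions
```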

Model explainability

Lukas:
Do you try to build models — I mean, GPT would be kind of the furthest from explainability you could go to, but do some of your models — do you try to build them in ways that they maintain explainability?
Xavier:
Yeah, it's a great question. I think explainability is important, but it's also kind of tricky in the case of medicine, in the sense that many times not even doctors have an explanation for their decisions. In fact, something that is a little nuanced, but I think might be interesting, is that many times doctors will go all the way to prescribing without having a clear diagnosis. That's called symptomatic treatment, right? So it's like, "Oh gosh, I don't know if this is flu or a cold, but no matter what, I'm going to prescribe you this particular thing because it's going to be good for both things." And they don't really have a clear diagnosis. That's not bad. I mean, it's okay. It's better to do that than to do nothing. In fact, the good thing is to be doing some symptomatic treatment and then following up and understanding, like, "What's the evolution? Did I get it right or not?" So, as long as you have a possibility to follow up. So, explanation is not always possible and it's not always available in an imperfect information situation, right? Now, that being said, if you do have it, it's good to provide it and it's something that we have definitely worked on, providing explanations. I'm actually a fan of adding explainability as a post hoc process to the model. I think it's something that has a lot of value and does not necessarily require the model itself to be explainable, but you need to go after the fact and understand, like, "Okay, this is why the model picked this, and is there an explanation that can explain in a simple way why the model picked this particular option or this particular cluster?"
Lukas:
So, how do you do that? If you have a really complicated model, too complicated to inspect, what kinds of methods do you like to use to get at some explainability of why the model did what it did?
Xavier:
Yeah. I mean, there's different approaches to adding explainability, right? I mean, the simplest one is you approximate the decision boundaries of your model, no matter how complex, no matter whether it's a deep model or not, with a simpler linear model and then use that to build the explanation, right? That is a typical approach that many of the explainability solutions take and that's one that can actually work pretty well. And it's one that we have experimented with and even implemented. I will say that's not really implemented in the product yet, but it's been implemented sort of as a prototype. And I think we even wrote about it in one of our blog posts. So that's, I think, one of the easiest, but also at the same time more effective, ways to explain things that have a complex non-linear decision boundary and cannot be explained in easy terms. I will, again, say that in many cases in medicine, those decisions do exist, right? And as much as we try to infer causality from the decisions, those are hard to come by because there's a lot of nuance in the ways that the information is being processed and the decision boundaries of the models are being constructed.
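To make the local-surrogate idea concrete, here is a minimal sketch in the spirit of LIME-style explainers: perturb one case, query the black-box model, fit a proximity-weighted linear model, and read its coefficients as the explanation. This is an illustration, not Curai's implementation; `predict_fn` is any function returning the probability of the class being explained.

```python
# Sketch: locally approximate a complex classifier around one instance with a
# proximity-weighted linear surrogate, then rank features by coefficient size.
import numpy as np
from sklearn.linear_model import Ridge

def local_explanation(predict_fn, x, feature_names, n_samples=500, scale=0.1):
    """predict_fn: batch of inputs -> probability of the class being explained."""
    rng = np.random.default_rng(0)
    # Perturb the instance and query the black-box model around it
    X_local = x + rng.normal(0.0, scale, size=(n_samples, x.shape[0]))
    y_local = predict_fn(X_local)
    # Weight perturbed points by proximity to the original case
    weights = np.exp(-np.linalg.norm(X_local - x, axis=1) ** 2 / (2 * scale ** 2))
    surrogate = Ridge(alpha=1.0).fit(X_local, y_local, sample_weight=weights)
    # Largest-magnitude coefficients act as the local explanation
    return sorted(zip(feature_names, surrogate.coef_), key=lambda kv: -abs(kv[1]))
```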
Lukas:
Interesting.

Robustness in ML

Lukas:
Well, we're getting close to running out of time and we always have two questions that we end on, and I want to make sure that I cover them with you. The second-to-last question that we always ask is, what's an underrated aspect of machine learning? And I guess I would say, across your career at Netflix and Quora and Curai, what's been a topic in ML that's been maybe particularly useful to you or important that you feel like research doesn't cover as much as it should?
Xavier:
I think an important topic that is not covered in research enough, despite the fact that I've tried to push for it myself because I was a researcher back in the day before going into industry, is what actually happens to models in the wild, right? It's a different thing to build a model with perfect data that has been cooked in the lab, where you know what it is and you can have control over the boundaries and even understand the distribution of the noise and all the different variables, than to deploy a product in the wild and then be faced with all kinds of different drifts in the data distribution, noise, and whatnot. I think that is something that is not usually researched enough, understandably so, because you can't. I mean, most research is done with data sets that are available and are distributed in a way that they're kind of artificial, right? I mean, I went into Netflix through the Netflix Prize⁵, so I know that that was a dataset that was very, very good and very exciting for making progress in recommendations and the recommender systems arena. However, it was very different from the data that I found we had at Netflix when I was there, and there were kind of all these other things happening, right?
Lukas:
Right.
Xavier:
So, yeah.
Lukas:
And I guess it's also kind of a hard problem to formalize, right? There's so many variations on it. I mean, there are a lot of people talking about it, but it's hard to...I guess it's like ML robustness or something, right?
Xavier:
Yeah. I think, yeah, robustness is one. I think another one that is very interesting, and that we have done some work on, is out-of-band classification or prediction. It's like building models that can react to classes that have not been seen during training. For example, that's a very important aspect and one that gets a little bit of attention in research, but not so much. I will say, for example, that particular problem is one that's relatively easy to replicate in a lab, right? So basically, you can build models that say, "Hey, I have 100 classes, but I'm only going to let the model see 50 during training, see what it does with the other 50." And the model needs to understand, "Hey, I haven't seen this class. Sorry, I don't know what to say." So that's an example of something that...out-of-band classification is one that kind of mimics some of the problems you see in real life. Because in real life, you will deploy a model and it will see something that is very different from the things it's been trained on. Having the model be able to raise its hand and say, "Hey, I don't know what this is because I've never seen it before," that's very interesting. It's a specific concrete case, but it's one that relates very much to having these models in real life and being able to replicate these kinds of situations in a lab. For example, another one that we've worked on is introducing artificial noise in the training and testing by using some domain knowledge, right? For example, in medicine, you know the prevalence of some symptoms and you can say, "I know that if I ask people if they have a headache, many people are going to say yes, because most people have a headache," right? It might not even be related to the current situation and the current condition, but people are going to say yes. Well, you can play around with those knobs and introduce artificial noise in your training datasets to then anticipate some of the noise you're going to be finding in the wild and in real life. So, that's another example. It is hard to recreate the exact situation you're going to find out there, but I think there are some interesting ways to mimic at least some of those situations that probably deserve more attention than they usually get.
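Here is a small sketch of the held-out-class experiment Xavier describes, using scikit-learn's digits dataset as a stand-in: train on half the classes and measure how often a simple max-probability threshold would make the model abstain on classes it never saw. It only mimics the evaluation setup; real out-of-distribution work uses stronger methods, and a naive softmax classifier is typically overconfident, which is exactly why the problem needs them.

```python
# Sketch: train on classes 0-4 only, then check how often a max-probability
# threshold would make the model abstain on the classes it never saw.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
seen = y < 5                                   # train only on classes 0-4
clf = LogisticRegression(max_iter=2000).fit(X[seen], y[seen])

probs = clf.predict_proba(X)                   # evaluate on everything
confident = probs.max(axis=1) >= 0.90          # abstain below the threshold
abstain_unseen = 1 - confident[~seen].mean()
abstain_seen = 1 - confident[seen].mean()
print(f"abstains on unseen classes: {abstain_unseen:.0%}, on seen classes: {abstain_seen:.0%}")
```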
Lukas:
It's fine if not, but do you have a favorite paper on the topic that you'd like us to point to or any research you could send people to who want to learn more?
Xavier:
Well, I have our papers that I could point you to. I mean, the two things that I mentioned we did...dermatology image classification with out-of-band distribution⁶ — that one refers to the first thing that I was talking about — and the artificial noise that we introduced in synthetic data, that's a paper we wrote on diagnosis and diagnosis training⁷. And of course, those papers cite a lot of other papers that could be interesting, so I could definitely point you to those.
Lukas:
Cool. That's perfect, having a good starting place.
Xavier:
Yeah. By the way, if people are interested, on our tech blog at Curai we have a full list of our publications⁸. We probably now have on the order of 15 or 20 publications. I like to be very open about the research we do and I think that comes from my old times as a researcher. I'm very much in favor of open publications and sharing knowledge and you will find most of that in our blog posts.
Lukas:
Cool. Awesome. We'll definitely put a link to that.

Connecting metrics to impact

Lukas:
My final question is kind of broad, but I'm really curious in your case. Basically, what's been the hardest part, or maybe the most unexpectedly challenging part, of getting ML models deployed in production? Just kind of going from conception of "This is the thing we want to do" to "Here's a working model". Where are the big bottlenecks?
Xavier:
Well, yeah, that is a broad question and I think there's a number of things that come to mind. I think at the highest level, one of the very difficult things to get right is how you connect the offline metrics that you have for anything you're doing in any model in the lab to what's going to be the real impact that that model has on your product. We would love for that to be a clean thing and say, "Hey, if I get my precision and recall right and my F1 measure increases, I know that that's going to work in production and that's going to generate this much lift and whatever." People are either going to click more, or be happier, or love the product more. But that road, from what you see in your model in the lab to the model in production, is usually not that straight. And there's a lot of issues that get in the way and a lot of questions and a lot of things that are really, really important to get right. Some of them relate to mundane things like the UX, right? How is the user experience? How are you presenting things to, in our case, the patient or the doctor, and how are they reacting? Your model might be awesome, but if they're not seeing it or it's confusing, or you're not explaining it right, that's not going to help in any way, or it might be worse. It might be confusing. So, I think the connection between research and modeling and the actual user experience and the interface and how that's actually introduced into the product is an aspect that I find fascinating myself, and it's very hard to have people that actually understand the end-to-end, because you need to have a very broad experience that goes all the way from the modeling to the metrics, to the product, to the user. Understanding the user research, that's really hard to cover end-to-end. And then you need to build all that through collaboration and through teams that have the ability to collaborate. In medicine, where you throw in the domain knowledge, that becomes even more tricky, right? It's like something that you might see in the lab and it's like, "Whoa, this is fantastic. This is getting the metric. This is actually going to be a killer feature." It might turn out that it's a killer feature in the wrong way. Sorry, I shouldn't have used that metaphor probably in this context, but it's important to understand that the results that you get in your experiments are mediated by many things before they can be evaluated in an A/B test, for example.
Lukas:
Awesome. Well, thanks so much for your time. I really appreciate this conversation and it's just super interesting.
Xavier:
Thank you. Yeah.

Outro

Lukas:
If you're enjoying Gradient Dissent, I'd really love for you to check out Fully Connected, which is an inclusive machine learning community that we're building to let everyone know about all the stuff going on in ML and all the new research coming out. If you go to wandb.ai/fc, you can see all the different stuff that we do, including Gradient Dissent, but also salons where we talk about new research and folks share insights, AMAs where you can directly connect with members of our community, and a Slack channel where you can get answers to everything from very basic questions about ML to bug reports on Weights & Biases to how to hire an ML team. We're looking forward to meeting you.