Daphne Koller, CEO of insitro, on digital biology and the next epoch of science

From teaching at Stanford to co-founding Coursera, insitro, and Engageli, Daphne Koller reflects on the importance of education, giving back, and cross-functional research.
Angelica Pan

Listen on these platforms

Apple Podcasts Spotify Google Podcasts YouTube Soundcloud

Guest Bio

Daphne Koller is the founder and CEO of insitro, a company using machine learning to rethink drug discovery and development. She is a MacArthur Fellowship recipient, a member of the National Academy of Engineering, a member of the American Academy of Arts and Sciences, and was a Professor in the Department of Computer Science at Stanford University. In 2012, Daphne co-founded Coursera, one of the world's largest online education platforms. She is also a co-founder of Engageli, a digital platform designed to optimize student success.

Show Notes

Topics Covered

0:00​ Giving back and intro
2:10​ insitro's mission statement and Eroom's Law
3:21​ The drug discovery process and how ML helps
10:05​ Protein folding
15:48​ From 2004 to now, what's changed?
22:09​ On the availability of biology and vision datasets
26:17​ Cross-functional collaboration at insitro
28:18​ On teaching and founding Coursera
31:56​ The origins of Engageli
36:38 Probabilistic graphical models
39:33​ Most underrated topic in ML
43:43​ Biggest day-to-day challenges

Transcript

Note: Transcriptions are provided by a third-party service, and may contain some inaccuracies. Please submit any corrections to angelica@wandb.com. Thank you!
Daphne:
I'd come from a family where I was privileged in that both of my parents had access to higher education, and I saw how much opportunity that created for me, that others just didn't have. And I guess I've always felt, and still feel, and really try and teach my children as well, that for those of us who have been privileged, so much is expected, and it's our responsibility to give something back.
Daphne:
So that was, at that point, my way of giving something back, is by teaching. And in fact, that was what led me to also eventually depart Stanford, because I felt like my opportunity to give something back to the world in a much greater scale was available to me by founding Coursera and opening up education to a much, much, much larger number than I would ever be able to teach at Stanford.
Daphne:
And that's actually also what led me to insitro, because I feel like there's an incredible moment in time now in bringing together two disciplines in a way that could be totally transformative to the world. And I think it's kind of incumbent upon me. There's almost a moral imperative to make that happen if I can do that. And it's not something that many other people can do.
Lukas:
You're listening to Gradient Dissent, a show about machine learning in the real world. And I'm your host Lukas Biewald. I'm excited and maybe a little nervous to interview Daphne Koller, who is a very famous successful machine learning professor and also the founder of insitro, and a founder of Coursera, and recently founder of Engageli. Three super different, super interesting startups.
Lukas:
I should also say she was my first machine learning teacher, in 221 at Stanford, and then I TA'd for her. And then I did research for her later on. So she might actually be the reason that I'm here today recording this podcast. So again, super excited to talk to her. The thing I most want to talk to you about actually is insitro, which looks super fascinating and exciting. And maybe for those who haven't heard of insitro, you could sort of give us a quick overview of the thesis of the company.
Daphne:
Sure. insitro is a drug discovery and development company. And if you've been looking at drug discovery for the last 50 years, you will see that we've made a tremendous amount of progress in bringing medicines to patients in need. But at the same time, there's this thing called Eroom's law, which is the inverse of Moore's law, in which there is an exponential decrease in the productivity of pharmaceutical R&D.
Daphne:
And when you ask yourself why that is, it's because the journey of discovering and developing a drug is really complex and long. And there are many places along that journey where we can take the wrong turn. And when we do, it can take months if not years, and millions if not tens of millions of dollars, to realize that we took the wrong trajectory.
Daphne:
So what we're trying to do is really build the company in a way that uses machine learning, which, after all, is something that is helping us make really good predictions in so many other domains, and use that as a way of putting this drug discovery and development process on a completely different foundation. So that's what we're really trying to do, is bring better medicines to patients and do it faster.
Lukas:
So what's the standard drug discovery process at a high level? And where does machine learning fit into this or improve it?
Daphne:
So I don't know if one can really talk about a standard journey, because it's been an evolving process over the last few years. If you want to draw a very coarse-grained caricature, you can say, "Well, I have a disease, and I do" (usually that's done in an academic center) "a bunch of biology to uncover the genes and the biological mechanisms, pathways, that are implicated in disease." And then someone has a hypothesis about, okay, if I make an intervention at this gene, it may cure, or at least help address (cure is a very broad word, a very ambitious word; we've cured precious few diseases) some of the aspects of the disease.
Daphne:
And once you have that target, you can start to identify... Well, first of all, you have to validate the target. And oftentimes that's done using animal models that attempt to simulate some aspects of the disease. And for many of the diseases that we have today, the animals don't get the disease naturally. And so you kind of have to create the disease in the animal and then try and address it in the animal. And it oftentimes turns out that what you're addressing really isn't the true disease but some simulation of it that is very imprecise and sometimes just downright wrong.
Daphne:
And then, once you have a target, then you typically look for chemical matter, a compound that helps modulate that target. And there are different, what are called therapeutic modalities, which are different kinds of interventions. It used to be, whatever, 30, 40 years ago, that the main form of therapeutic modality we had was small molecules. And then along came biologics, which are larger molecules. Basically proteins and antibodies, which are a type of protein, that are, in many cases, more precise mechanisms. So they're much more precise in their action, but they're also harder to administer and they are able to address a narrower set of targets.
Daphne:
And now over time, we have additional therapeutic modalities that have emerged over the last two years that help intervene in the body and other types of mechanisms. So everyone's talking about gene therapy as they should, in which case we can come in and intervene in the DNA itself. There's only a very few of those that have been approved so far, but it's very much a growing field. Now with the COVID-19 vaccine, everyone is talking about RNA therapeutics, which is intervening in between DNA and protein at the RNA level.
Daphne:
So all of these are ways that are expanding our capabilities to make intelligent interventions in the human body and hence in a disease process. Oftentimes, where it fails is really at the very beginning, which is, we do not understand biology well at all. And therefore our ability to recognize when intervening in a target is going to actually have meaningful clinical benefit to a human is very, very limited. And oftentimes, we guess, and we guess wrong. And sometimes we also fail to understand all of the other implications that an intervention in a given target might have. For example, all of the other things that this particular gene does in the body. And if we intervene in a way that may be beneficial for this, it might be detrimental for that.
Daphne:
And so that's where a lot of our ability to make valid predictions really falls short. And that's where a lot of drugs fail. And right now, the failure rate, depending on what you consider to be the denominator, like when do you start counting a program as a drug program, is between 90 and 95%. That's the failure rate, not the success rate. Which means between one in 10 and one in 20 drugs actually go on to be approved. And an even smaller number actually end up making a real difference to patients.
Daphne:
And that's what we're looking to fix, is how can we make better predictions, first and foremost, about what kinds of targets you would want to intervene in for a given disease in the context of a given patient population. And then subsequently, fine, we want to intervene at this target, what is the right chemical matter to put in that might have fewer side effects, that might have better drug-like properties? What is the right patient population to use? A lot of the failures that I think we have today are because we try and go after a much broader or miscalibrated patient population.
Daphne:
And so over time, I think there are many questions in this process where machine learning can make an intervention: the target, the drug, the patient population, the biomarker that tells us when a drug is working so that we can cut things short if it's not, and then transition the patient to another drug. All of these are areas where I think machine learning can play a role.
Lukas:
And does the machine learning try to kind of model the physical reality of the world here? Does it ignore that and just sort of look at past experiments that were tried?
Daphne:
I think people have tried both. And as we've seen in other cases where machine learning has been applied, there are some benefits to incorporating a lot of prior knowledge about the world, but then, over time, that begins to become a limitation. So I used to work in computer vision way back when people still tried to create models of how light is refracted off of surfaces, and having geometric models for computer vision, and models of illumination, and so on and so forth. And we don't do that anymore.
Daphne:
What we now do is create really, really large training sets and give the computer enough data that it can learn the patterns without having to be told a lot about the structure of the world. We haven't quite hit that tipping point in most biological problems because the data that's been available has just been insufficient. And so right now, there's a lot of problems where models that incorporate more of our understanding of biology are actually, in many cases, outperforming models that are less informed.
Daphne:
But one, to my mind, real highlight achievement from the past year that starts to go in the other direction is the incredible success of DeepMind's AlphaFold algorithm, which uses somewhat similar machine learning tools to AlphaGo, which they used in a very different domain. And AlphaFold is basically addressing the problem of protein folding. So to take an amino acid sequence that represents a protein and ask what it will look like in 3D space.
Daphne:
There's been multiple groups over the past, I don't know, 10, if not more years that have built computer tools. Some incorporating machine learning, but certainly all incorporating a relatively large amount of prior knowledge about physics, and chemistry, and forces, and electrons, and so on and so forth, and asking what the folded protein would look like. And all of them asymptoted at a certain level of performance, which was reasonable, but not usable.
Daphne:
And by the way, I forgot to say that there has been a biennial competition, held once every two years, called CASP, which is one of the best-designed real blind tests for machine learning models, one where you can't cheat. In which labs that are experimenting on a particular protein by generating its crystal structure, which is the 3D structure, would submit the sequence to the CASP competition and they would not release the solved structure until the competition was done.
Daphne:
And since no one can... It's months of experimental work to come up with that structure. People couldn't cheat on the test data. So in this CASP competition, you could see that there was a plateau of performance. And then this last year, DeepMind really broke through that plateau...
Daphne:
And achieved a performance that is actually usable for real biological problems. And the way they did that is by not incorporating into the model a lot of preconceptions about physics and chemistry and different kinds of chemical bonds, but really just giving the machine learning model enough pairs of sequences and solved structures to train on. And then they said, "Okay, now that you've learned, go and run on a new protein." And they were able to break through that ceiling that we've seen. So I think, to my mind, that's an indication that we need to be really thinking hard about how to generate enough data at scale for biological or chemical problems, so that you could get machine learning to break through that ceiling in performance. And so that's kind of what we're trying to do at insitro: build massive data production capabilities across the problems that we care about, so that we can generate data that's of high enough quality and large enough, and that is fit to purpose, so that you can train machine learning models to solve the problems that we care to solve in the drug discovery process.
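[Aside: to make the supervised framing above concrete, here is a purely illustrative Python sketch, not DeepMind's actual method. A toy model maps an amino-acid sequence to a predicted pairwise-distance map and trains on sequence/structure pairs; the architecture, sequence length, and data are hypothetical placeholders.]

# Illustrative sketch only: a toy sequence-to-structure model in PyTorch.
# It shows the supervised framing described above (sequence in, pairwise
# residue distances out), not AlphaFold's real architecture.
import torch
import torch.nn as nn

NUM_AMINO_ACIDS = 20
SEQ_LEN = 64  # hypothetical fixed length for the toy example

class ToyFolder(nn.Module):
    def __init__(self, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(NUM_AMINO_ACIDS, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.to_dist = nn.Linear(2 * d_model, 1)  # pairwise distance head

    def forward(self, seq):                    # seq: (batch, L) integer codes
        h = self.encoder(self.embed(seq))      # attention lets residues "look elsewhere"
        L = h.size(1)
        hi = h.unsqueeze(2).expand(-1, -1, L, -1)
        hj = h.unsqueeze(1).expand(-1, L, -1, -1)
        pair = torch.cat([hi, hj], dim=-1)     # features for every residue pair
        return self.to_dist(pair).squeeze(-1)  # (batch, L, L) predicted distances

# Fake training pair: random sequences and random "solved" distance maps.
model = ToyFolder()
seqs = torch.randint(0, NUM_AMINO_ACIDS, (8, SEQ_LEN))
true_dist = torch.rand(8, SEQ_LEN, SEQ_LEN) * 20.0
loss = nn.functional.mse_loss(model(seqs), true_dist)
loss.backward()  # gradients for one step of the usual supervised loop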
Lukas:
So, I guess, I want to get back to insitro in a second, but since the protein folding thing was so high-profile, I'll ask you my dumb questions, which is such a waste. But I was kind of curious: what was the insight then? It seems like just actually removing prior beliefs from a model wouldn't be enough to have a breakthrough improvement in the quality. And surely, lots of people had access to lots of examples of proteins and how they fold, right?
Daphne:
So I can't speak to that yet because they have not yet published their latest model. And so we're relying on the very limited information that's in the press release. And so I would be curious to read the paper once it's out. But I do know that they incorporated a lot of insight from the latest machine learning models, in terms of, for instance, attention models, where you can have one amino acid look elsewhere in the sequence to figure out where to fold. But I wish I could give you more insight into exactly how this works, and I'm hoping that they will publish the results soon and we will all learn from how they did this.
Lukas:
And is protein folding, is that a sub-problem of one of the problems that you mentioned? Or is that just an example of how much momentum there is in the field?
Daphne:
I think people have differing opinions on the extent to which protein folding matters in drug discovery. I think there are a lot of proteins where the structure is actually pretty well understood and we just don't know how to drug them. Protein folding certainly doesn't help you with the fundamental question of picking the right target to go after, because the folding comes after you've decided that this is a target that you need. There certainly is a set of targets where you really would like to go after them, and what's missing is an understanding of their 3D structure. How big that set is, I think, is a matter for debate.
Daphne:
So to my mind, it's less about whether protein folding is the key problem in drug discovery. It's certainly not the key problem. It may be a problem, but it's certainly not at the core of what is holding drug discovery back. But it's really an illustration of taking a problem that everyone agreed was hard, that people had struggled to solve or tried to solve using a range of other methods, and machine learning came in and, with the right type of model and the right type of data, was really able to crack that nut open. And so that, to me, is the real lesson here, rather than "we've transformed drug discovery."
Lukas:
Interesting. I guess another question that comes to mind is, I remember back in 2004, you were working on applications of machine learning in biology. And some of them actually sound quite similar to what you're talking about at insitro. And so when you started the company, almost two decades later, is it that the biology has improved, or the machine learning has improved, or the data has improved? What's the key thing that's changing that makes insitro possible now?
Daphne:
It's a combination of both, actually. The first is the availability of much, much larger amounts of data than we have had before. So in the last decade or so, there has been this tremendous amount of progress in biological tools that are good for data creation. And that includes everything from the incredible growth in the feasibility of DNA sequencing, and not just DNA, but also RNA sequencing and various other aspects of sequencing. Microscopy has grown a tremendous amount in both its throughput and its capabilities. On the chemistry side, we have these really cool things called DNA-encoded libraries, which are basically chemical libraries that can have hundreds of millions of molecules all mixed together in a test tube. But because they each have a DNA barcode attached to them, you could basically figure out what they do without... Even though they're all kind of mixed together in a pool. There are microfluidic techniques that allow you to do experiments in teeny little droplets, which achieves both spatial separation, as well as scale.
Daphne:
All of these techniques are things that didn't exist a decade ago. Oh, and let me not forget CRISPR, of course, which is the ability to now start to edit the genome in a very fine-grained way, and then ask what happens to a cell when its genome is edited in a particular way. That is something that wasn't available when I was doing this work, and not even just in 2004. I went and did a sabbatical at UCSF in, I think, 2009, and we were doing these experiments knocking out pairs of genes in yeast. And yeast is a very malleable, editable organism. And the experiments were incredibly slow and painful. And they were in yeast, which has 6,000 genes.
Daphne:
Now, if you wanted to do pairwise knockouts in human cells, it's an experiment that you could do in a couple of weeks. And it's just amazing how things have changed that way. So I think that, to me, is actually the biggest transformation, but the other one, of course, is just the transformation that we've seen in machine learning. It's hard to imagine thinking back, but in 2004, when we were doing computer vision, and you might remember this, Luke, we were looking at questions in taking an image: What is in this image? Is there a dog in this image? It's like, "I don't know. Maybe." And it was barely above random.
Daphne:
And now, in 2018, I think the lines crossed where the machine performance is actually above that of a human. And that is for tasks where humans are actually good, that is humans know how to recognize dogs in images. We're trained to it from birth, and yet the machine is outperforming a human. When you're looking at tasks where humans are actually not so good, like, for example, recognizing biological patterns in images, or even worse, in sequencing data, the machine is just so much better than a human.
Lukas:
Yeah. This is a broad question I didn't expect to ask, but I'm curious your thoughts. What were the key insights, you think, between 2004 and 2018? Was there one thing that you think was really the change?
Daphne:
I think it's a combination of three things that came together. One is, yeah, we had better machine learning models, which were often just a matter of having the willingness and courage to not just look at simple models, but be willing to bite the bullet about models that are not convex, where there isn't just a single optimum, and that really have a lot of dependence on exactly how you optimize them. So that's one thing. The second is the existence of large enough data sets that one could train such models despite the complexity of the space without overfitting radically. And I think that's a place where contributions such as ImageNet and others, which really created large enough data sets so that one could actually start training those models, were as important as the models themselves.
Daphne:
And then the last one is compute at the push of a button. It used to be that, in those olden days, I'm feeling really old right now, that when we had to do anything that required large amounts of compute, we had these local compute clusters that were painstakingly maintained by local IT people. And you ran your job, and it took six months to run. And you hoped there was no memory leak. And then at the end of the process, you never ran it again because you would never risk doing it more than once. And now, we have the cloud and you can do this on 10,000 machines and your results come back in a day. And honestly, to me, that's been as, or more, transformative than anything else.
Daphne:
Because our ability to do that, combined, by the way, with platforms such as PyTorch and TensorFlow, and optimizers like Adam, that allow us to program much more quickly, means we're now able to experiment and improve our models in an iterative loop that we were never able to do before. So even if our initial models are like, eh, the second time and third time and fifth time and 20th time that we iterate and make it better, it's going to get better and better over time. And so that combination of better software, better tooling, I'm not talking about just the better machine learning, just the tooling around the machine learning, and the better cloud computing, which enables this rapid iteration cycle, has frankly been, I think, as, or more, transformative than anything else.
Lukas:
Which kind of leads to a question I had in biology in particular, which is, are there datasets available in biology in the same way as in vision? There's an impression that there's more proprietary data, I guess.
Daphne:
So that's, again, something that's changing. And one of the datasets that has been most transformative, I think, at least for the work that I've done, is the UK Biobank, which is 500,000 people with genetics, with clinical outcomes, including longitudinal clinical outcomes, and very deep phenotyping that includes different types of imaging and blood biomarkers and urine biomarkers and a whole bunch of other covariates like environmental factors. And that data set has, on its own, I think, been truly transformative, both in the development of new methodologies and in the insights that it's given us about human biology. There have been other data sets that have been, I think, also very important. They aren't as large or as carefully curated, which, I think, has limited, to some extent, the impact relative to the UK Biobank, but still have been quite significant.
Daphne:
So there is the TCGA, which stands for The Cancer Genome Atlas, which is a reasonably large cancer dataset across different tumor types. There is the... Let's see, the GTEx dataset, which speaks to gene expression across different tissues and different individuals, so that you can look at the variation within an individual across their tissues in their gene expression, but also for the same tissue across individuals. So you can kind of have this be like a two-sided matrix. There are others that are like that, ENCODE, which speaks to DNA markings across different cell types.
Daphne:
So I think there is more and more of that available that is not entirely proprietary. There are also some on the chemistry side. They aren't, though, by and large, with a few exceptions (the UK Biobank being, I think, the best example), something that is truly high quality, truly well curated, with every experiment done exactly just so.
Daphne:
And that is a challenge for a lot of people because noise in biology is much more of an issue than it is in many other domains. That is actually why we're building insitro the way we are, which is we have a significant wet lab component whose primary purpose is to generate large amounts of data so that we can train the models in the right way.
Lukas:
Is there a notion of transfer learning in this field in the same way as in vision or are the problems just too different?
Daphne:
I think that certainly there is transfer learning. And even in images, there have been examples where people have trained ResNet models on images from the web, and then done transfer to microscopy images.
Lukas:
Which is incredible, right? Isn't that amazing.
Daphne:
I know, it's amazing, isn't it? So, I mean, I would expect it would be even better if you trained on microscopy images. But still, the fact that this actually does translate is, I think, a pretty remarkable achievement. I think there's other examples that one could generate. People have done a fair bit of work, especially recently, on pre-training of, say, graph neural network models for chemical structures on large numbers of compounds.
Daphne:
And then using that type of encoding as a pre-trained model for something for which you have less training data, like more specific properties of compounds. So I think that's actually one of the big areas that, I think, will become important over the next few years: how do we make use of some of those larger data sets that maybe have less supervision as a way of enabling us to build models that are useful on a smaller data set?
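[Aside: a minimal sketch of the transfer pattern Daphne describes: take a torchvision ResNet pre-trained on web images, swap its head, and fine-tune on microscopy images. The dataset path and class count are hypothetical placeholders.]

# Illustrative sketch: reuse a ResNet pre-trained on natural web images,
# then fine-tune only a new classification head on microscopy images.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

NUM_MICROSCOPY_CLASSES = 5  # placeholder, e.g. phenotype categories

# Start from weights learned on ImageNet-style web images.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pre-trained backbone; train only a new head for the new task.
for p in model.parameters():
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, NUM_MICROSCOPY_CLASSES)

# Hypothetical microscopy dataset laid out as one folder per class.
tfm = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
train_set = datasets.ImageFolder("data/microscopy/train", transform=tfm)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for images, labels in loader:  # one pass of fine-tuning
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()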
Lukas:
But you actually built a wet lab to collect data, which is super interesting. How does your team break down into people doing machine learning and people doing, I guess, biology, and people doing other?
Daphne:
So if you set aside the small fraction of people who are, like, G&A, the composition of the company, for most of the time, used to be about 50/50. So initially I think we had a few more wet lab people because you need to start making data before you can really have a lot of data to analyze. But even then we had some computational people who helped with making sure the experiments were designed right.
Daphne:
And then it became about 50/50. And then now we're actually starting to grow the next set of functions. Which is once you have insights that come out of the biology, you actually have to make drugs. And so we're starting to build out functions in chemistry, and drug discovery. And so the balance is shifting a little bit more towards the life sciences. But it's really quite evenly distributed among those functions.
Lukas:
Oh, that's cool. And I guess it sounds like your ambition is to not make just like one drug, but to kind of build a process to make lots of drugs.
Daphne:
That's right.
Lukas:
And I would think, with a hit rate like that... I just picture that managing a business where the sort of hit rate of something is like one in 10 or one in 20 sounds incredibly stressful. Is that the case?
Daphne:
It is incredibly stressful. Especially when each experiment costs you tens, or maybe hundreds, of millions of dollars, at least today. So how do you navigate that is certainly something we think about a lot. How do you make the process faster? How do you make it less expensive?
Daphne:
How do you fail fast so that you don't end up spending the hundreds of millions of dollars on something that is going to fail? So how do you recognize earlier that something is the wrong path? That actually is the point of what the machine learning is looking to do. And how do you ensure that you have enough capital to give yourself multiple shots on goal, in case the first couple don't work out.
Lukas:
Right. Right. Although, you've done a good job with that. It looks-
Daphne:
Oh, yeah. I can't complain.
Lukas:
Well, I guess I want to make sure I also ask you about some of your other work. And I wanted to ask you about Coursera and I guess teaching in general. I think you're not teaching anymore. Is that right?
Daphne:
No, I'm no longer a professor at Stanford. I'm an adjunct and it's great to have some connection back to the department, but I don't teach anymore.
Lukas:
It seems sad to me because, I mean, I just wanted to say you were such an amazing teacher. Like you were-
Daphne:
Aw, thank you.
Lukas:
... notoriously difficult teacher. That was kind of your reputation. And you weren't kind of the warmest teacher, but you're, like, memorable, 16 or 17 years later, as just a really, really excellent teacher. Like I feel like I just learned very quickly and efficiently from you. And then also when I TA'd for you, I got to see how much you cared about grading, which I really appreciated. It's interesting to see.
Lukas:
I was coming from a math department too, where it's like, they just did not care about teaching or grading. And it felt just really good. It's like someone's here and really kind of cares to take the time. And so I kind of wasn't surprised that you started a company around teaching, but I was kind of just curious to hear the story about it and how you thought about it, and what happened in the early days.
Daphne:
So teaching had always been a passion project of mine, kind of like on the side. Because, as someone who's on like the research side of a top academic institution, top research institution like Stanford, you're not supposed to really invest a lot of time in teaching. So I was always a little bit of an outlier in wanting to spend time on that.
Lukas:
Can I ask, what do you think that was, that made you want to do it? Because it really was quite evident that you cared more than anyone else about teaching.
Daphne:
I guess I've always thought that education was just the door to opportunity. And that if you set someone on the right path at an early age, or rather you enable them to get on the right path at a relatively early age. Because I mean, teaching is not really a thing. A teacher enables people to learn and become who they can be. And they have to make the investment and want it. You can't learn someone, they have to learn.
Daphne:
I just felt like it was an incredible enabler. I'd come from a family where I was privileged in that both of my parents had access to higher education. And I saw how much opportunity that created for me that others just didn't have. And I guess I've always felt and still feel, and really try and teach my children as well, that for those of us who have been privileged so much is expected. And it's our responsibility to give something back.
Daphne:
That was, at that point, my way of giving something back is by teaching. And in fact, that was what led me to also eventually depart Stanford. Because I felt like my opportunity to give something back to the world in a much greater scale was available to me by founding Coursera. And opening up education to a much, much, much larger number than I would ever be able to teach at Stanford.
Daphne:
And that's actually also what led me to insitro, because I feel like there's an incredible moment in time now in bringing together two disciplines in a way that could be totally transformative to the world. And I think it's kind of incumbent upon me. There's almost a moral imperative to make that happen if I can do that. And it's not something that many other people can do.
Lukas:
And I guess, I saw you started another company, Engageli, that seems like a teaching tool, right? And was that a reaction to something you wished Coursera did? Or?
Daphne:
Yeah. So yes and no, in the sense that it was driven by the observations that we had in the pandemic, when all of a sudden I had two teenage kids who were thrust into Zoom school. And these are two kids that are academic high performers, that are by and large, pretty diligent. And, at some point I was kind of looking in on them and noticing that the youngest, after a few minutes in her class, making sure that the teacher saw that she was there, would turn off the camera and the microphone and spend the rest of the class perfecting her Sims game.
Daphne:
Whereas the older one would spend the time going through the Netflix catalog. And this is like, okay, if this is what my kids are doing, despite the fact that they have all these opportunities, what happens to all those other kids who don't have that same set of privileges? And they're going to a school with much larger classes and teachers who have way less time to invest in trying to make the classes better on video.
Daphne:
So that was really part of it. But truthfully, and this comes back to, I think, the thrust of your question Luke, is that originally when I was getting interested at Stanford in teaching, it was actually not originally with the only purpose of teaching the world. But also in trying to get teaching to be better even at Stanford. Because I felt like, okay, I got to spend, whatever, three hours a week with people like you in a class.
Daphne:
And we were making use of that time with me just standing in front of the class, droning at you, and delivering a lecture that was not that different to what I delivered a year before. Is that really the best use of class time? Or can we spend the time actually engaging and interacting with each other, and really learning? Which is much more of an active effort than it is just sitting there watching a professor talk at you.
Daphne:
And so this really was, to me, coming back to what had motivated me to go and build a lot of the capabilities that ultimately went on to become what we built in Coursera. And really create a tool by which people can learn together, even if they are not physically co-located. And what we've discovered is that the move online actually makes things better, irrespective of whether you're in the same classroom or not.
Daphne:
Just because of the ability to flexibly chat with people who are in a group with you, work together as a team. And really create an environment that fosters active learning in a way that is very hard to do.
Daphne:
If you just have a bunch of people sitting in a large auditorium with not great acoustics, all facing forward in fixed seating, in a tiered classroom, looking at the instructor down below. So I think I'm hopeful that one of the few benefits of this terrible pandemic that we're suffering through is that we will not actually go back to teaching the way we did before the pandemic, but we'll have a better way of teaching.
Lukas:
Interesting. Yeah, I'm remembering now that I think one of the things you did really well, I thought, in an in-person class was actually kind of watching when you were losing the class, and then pacing. I remember you had this trick where you would ask, who does understand what I'm saying, which everyone should do. It's funny, I've taken that with me for the rest of my life in talks and stuff. I really appreciate it. But everyone should do it. Because a lot of times you would ask that and it'd be like a third of the people that raised their hand. And it was actually even helpful for me to know, as like a nervous student, that I'm not the only one who's kind of lost track of where this is going.
Daphne:
Yeah, a lot of people ask the opposite question, which is, who's not with me? It's like, "Well, most people haven't even absorbed the question and you've already moved on." Or, "Does anyone have questions?" And it's like, "Well, I don't even know if I have a question because I haven't understood what you're saying." So I think it's really important to create an atmosphere where the default is "I'm not understanding" rather than the default is "I am understanding", especially when you are teaching complex material.
Lukas:
Right, right. Another question I wanted to ask about. I remember years ago when I was your student, you were super interested in probabilistic graphical models, which were really interesting. I remember especially being interested, and they've sort of... The thing that stuck with me is sort of causality. It seems like you can find that in data, which is really cool and surprising. But I was curious, have you maintained an interest in that? Has that field evolved in interesting ways? What's happened with them? I don't hear about them as much.
Daphne:
Well, I mean, I think there's been a lot of discussions in the last few years about deep learning because of all of the big transformative things that deep learning has been able to do because we've been able to get away from feature engineering, which has been such a pain point in most of the tasks that we deal with. I think there is clearly still very much a need for understanding causality.
Daphne:
When I think about the work that we're doing in drug discovery, the fundamental question that we're asking is, if I make this intervention in a human, is it going to make a clinical difference? Is it going to benefit the human? That is an interventional question. If you confuse that question with an observational question, you very easily, immediately fall into all sorts of traps about correlation being different from causation, and a lot of the correlations going completely in the wrong direction from a causal perspective. So you find yourself intervening in symptoms or just downstream sequelae that have nothing to do with the fundamental disease processes.
Daphne:
I think even in machine learning more broadly, there's a growing recognition that that is one of the big unsolved problems in getting machine learning to go to the next level. I was at the NeurIPS conference, not this past year, but pre-pandemic, and Yoshua Bengio was giving a keynote talk. And he highlighted that as one of the main unsolved problems, both because of its intrinsic importance, but also because understanding causality and the causal processes that underlie the world enables you to learn with much sparser data, because you have a much more structured representation. So I think that what's likely to happen is that the pendulum has swung very much towards the deep learning side of the world, as it should because of the tremendous advantages, but I think it's now starting to coalesce. These two paths are starting to coalesce. We're going to see a lot of interesting work coming out on that front.
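[Aside: a tiny simulated illustration, not from the episode, of the observational-versus-interventional trap described above. A hidden disease severity drives both a biomarker and the outcome, so the biomarker looks strongly related to the outcome in observational data, while setting the biomarker directly (an intervention) has no effect. All numbers are made up.]

# Toy illustration of correlation vs. causation under confounding.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

severity = rng.normal(size=n)                    # hidden confounder
biomarker = 2.0 * severity + rng.normal(size=n)  # downstream marker of severity
outcome = 3.0 * severity + rng.normal(size=n)    # driven by severity, NOT the marker

# Observational view: the biomarker and the outcome look strongly related.
print("observational correlation:", np.corrcoef(biomarker, outcome)[0, 1])

# Interventional view: we set the biomarker ourselves, as a drug on the marker would.
biomarker_do = rng.normal(size=n)                # do(biomarker) ignores severity
outcome_do = 3.0 * severity + rng.normal(size=n) # outcome unchanged by the marker
print("correlation under intervention:", np.corrcoef(biomarker_do, outcome_do)[0, 1])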
Lukas:
Cool. Thanks. So we always end with two questions and I want to make sure you have a little bit of time for them. So the penultimate question is, what's an underrated topic in machine learning? Maybe I'll say it to you like this, if you had more time on your hands, what new thing would you investigate or look into?
Daphne:
So I'm going to use this opportunity to give you two answers. One of which is maybe more directly to your question and the other one, which we didn't get to earlier, which is the why I'm doing what I'm doing right now.
Daphne:
So I think on the pure machine learning front, what we discussed earlier is really a fundamental problem, which is, how do we leverage large amounts of weakly supervised or unsupervised data to learn a representation that enables us to then very efficiently learn from much smaller data sets? I think that's an area where, yeah, people have said, "Well, there's whatever image representation we learned in ResNet and, of course, word2vec," and there are others, but I don't think we've really sort of pushed this to the limits in terms of how do you bring these different types of data sets together? What's the right way of combining the objective functions in a way that balances things in the right way? So I think that's an area where there's going to be a lot of interesting progress, of how do you learn and refine a representation over time.
Daphne:
If I broaden this question out from machine learning specifically and ask where I think there is a really big opportunity for the world, it's in this convergence of these two disciplines, which is biology and data science, and maybe engineering. So maybe it's three disciplines. The analogy that I use here is, if you look at the history of science, there have been sort of epochs in history where one field has just sort of really taken off and made a tremendous impact on the world in a relatively short amount of time. In the late 1800s, that was chemistry with the periodic table, and then in the early 1900s it was physics, with understanding the connection between matter and energy and between space and time. And then in the 1950s, it was computing and the ability to use silicon chips as a way of really doing calculations that, up until that point, maybe not even a person could do.
Daphne:
And then in the 1990s and 2000s, there was a bifurcation. There was data as a field, which emerged from computing, but also from optimization and statistics and neuroscience, and I think it's really its own field. The other is what I call quantitative biology, which started to measure, finally, in a very robust and reproducible and quantitative way, aspects of biological systems. That's what gave us sequencing and microscopy and all of the things that I talked about before. And I think the next big field that's going to emerge is the convergence of those two fields into one, and I'm calling it digital biology. To me, it's the ability to measure biology with fidelity and at scale, use machine learning and data science to interpret the measurements that we get, and then use bioengineering techniques to go back and intervene in biology to get it to do something that it wouldn't otherwise do. That has implications in human health, but it also has implications in biomaterials, and in agricultural technology, and in environmental science, and in energy science. All of these are places where the convergence of those two fields, and this digital biology, is just going to transform that space. I think that's going to be the next big field of the next, whatever, epoch of science.
Lukas:
Wow. Well said. Let's go to the highlight reel. It's actually a good segue to our final question, which is, I would say this for insitro, you're trying to discover new drugs using machine learning. What are the practical day-to-day challenges right now of making that work?
Daphne:
Well, so I think there are a number so I'm going to highlight two. One is that biology is really hard. You are dealing with live things and they're variable, and they depend on the exact temperature in the room, and on who the tech is that's manipulating them, and a lot of things that you don't normally think about and we don't have to deal with in a lot of the more exact sciences. So how do you create datasets that are robust enough and experimental procedures that are robust enough so that the noise does not overwhelm the signal and the variability does not overwhelm the signal?
Daphne:
The second is that in order to do the kind of work that we're doing, you need to create a really unique culture of individuals who are able to sort of speak both languages, at least to a certain extent, and communicate with people with a discipline very different to their own. That's something that we don't have to do quite as much in many other applications of machine learning. So if you're doing machine learning for web recommendations, you don't need to deeply understand the catalog of items on the Amazon site in order to write the recommendation algorithm. That's not true for biology. You really need to understand enough that you can have a meaningful conversation with a biologist or a chemist.
Daphne:
So the recruiting of people who either have that joint skillset or are willing to learn enough to have a meaningful dialogue and really work as part of a truly cross-functional team with people from the other disciplines... We don't train enough people like that. I think building the company with that kind of individual and with the right culture is something that I think about all the time. I think we've done a really great job of it at insitro so far, but it's definitely an ongoing effort all of the time.
Lukas:
Awesome. Thank you so much.
Daphne:
Great. Thank you.
Lukas:
Doing these interviews is a lot of fun. And the thing that I really want from these interviews is for more people to get to listen to them, and the easy way to get more people to listen to them is to give us a review that other people can see. So if you enjoyed this and you want to help us out a little bit, I would absolutely love it if you gave us a review. Thanks.