Sean and Greg — Biology and ML for Drug Discovery

Sean and Greg talk about the challenges of combining two highly specialized fields like biology and ML, and how they think about building cross-functional teams.
Angelica Pan

About this episode

Sean McClain is the founder and CEO, and Gregory Hannum is the VP of AI Research at Absci, a biotech company that's using deep learning to expedite drug discovery and development.
Lukas, Sean, and Greg talk about why Absci started investing so heavily in ML research (it all comes back to the data), what it'll take to build the GPT-3 of DNA, and where the future of pharma is headed. Sean and Greg also share some of the challenges of building cross-functional teams and combining two highly specialized fields like biology and ML.

Connect with Sean and Greg

Listen

Apple Podcasts Spotify Google Podcasts YouTube

Timestamps

0:00 Intro
0:53 How Absci merges biology and AI
11:24 Why Absci started investing in ML
19:00 Creating the GPT-3 of DNA
25:34 Investing in data collection and in ML teams
33:14 Clinical trials and Absci's revenue structure
38:17 Combining knowledge from different domains
45:22 The potential of multitask learning
50:43 Why biological data is tricky to work with
55:00 Outro

Watch on YouTube

Transcript

Note: Transcriptions are provided by a third-party service, and may contain some inaccuracies. Please submit any corrections to angelica@wandb.com. Thank you!

Intro

Greg:
Evolution is one of the most interesting aspects of informational science because it's the ultimate bootstrap system. You've got these letters strung together on DNA that have, over billions of years, encoded themselves into the most sophisticated system on the planet, and it's everywhere around us. In theory, artificial intelligence could look at that and understand every piece of it the same way that every cell does.
Lukas:
You're listening to Gradient Dissent, a show about machine learning in the real world. I'm your host, Lukas Biewald. Today, I am talking to Greg Hannum, the VP of AI Research at Absci, and Sean McClain, the founder and CEO of Absci. I'm talking with them about drug discovery and development and manufacturing and how ML fits into that, and that's what Absci does. This is a super interesting conversation that I really enjoyed.

How Absci merges biology and AI

Lukas:
Why don't we start with you, Sean? Maybe you could explain to our audience what Absci does. This might be like explaining it to your mother or something, right? Everyone's sort of interested in these applications, but maybe doesn't really understand the deep biology or really even the industry that you're in. How do you think about that?
Sean:
Yeah, it's pretty simple. We are merging biology and AI together. One of the really exciting aspects of our technology is that we are able to screen or look at billions of different drug candidates, looking at the functionality of those drugs as well as their manufacturability. Compare that to what the industry is currently doing, which is looking at drug candidates in the tens of thousands. If you look at a protein-based drug like a monoclonal antibody — you're all familiar with COVID, Lilly's antibody that came out, that's a protein — there are more possible sequence variants in an antibody than there are atoms in the universe. What we're essentially doing is feeding in all these billions of different data points on the protein functionality and manufacturability to ultimately be able to predict the best drug candidate for a particular disease or indication. Essentially our vision is to become the Google index search of drug discovery and biomanufacturing, where we can take patient samples, find the specific biomarker or target for that particular disease, and then utilize deep learning and AI to predict the best drug candidate for that particular target or biomarker. All at the click of a button, totally changing the paradigm of healthcare and biotech, and ultimately getting the absolute best drug candidates to patients at truly unprecedented speeds. It's this really exciting forefront of, again, merging biology and AI together.
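As a sanity check on the scale Sean describes, the combinatorics are easy to run. The numbers below are illustrative assumptions, not Absci's: roughly 110 variable antibody positions, each of which can hold any of the 20 standard amino acids, already dwarfs the commonly cited estimate of atoms in the observable universe.

```python
# Back-of-the-envelope check of the "more variants than atoms" claim.
# Illustrative assumptions: ~110 variable positions, 20 amino acids each.
AMINO_ACIDS = 20
VARIABLE_POSITIONS = 110

sequence_variants = AMINO_ACIDS ** VARIABLE_POSITIONS
atoms_in_universe = 10 ** 82  # upper end of common estimates

print(f"possible variants: ~10^{len(str(sequence_variants)) - 1}")  # ~10^143
print(sequence_variants > atoms_in_universe)  # → True
```

This is why Sean says no physical screening technology could cover the space: even screening billions of candidates samples only a vanishing fraction of it.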
Lukas:
Do you ultimately take these drugs to market and sell them? How far do you go in this process? Do you just invent them and then hand them off? How does that work?
Sean:
It's really a perfect marriage of what we do and what pharma does. Pharma's really good at being able to design clinical trials, take the drugs through the clinical trials, and then ultimately market them. Where we come in is being able to assist the pharma and biopharma companies with actually designing and creating the drug itself. Then we out-license it to the large pharma to take through the clinical trials as well as commercializing it. We get milestones and royalties on that, which essentially, in the world of tech, is another version of a SaaS model, but based on the clinical trials and ultimately the approval of the drug product.
Lukas:
How far along is this? What's the drug where you've used these techniques that's closest to something that cures a disease?
Sean:
Yeah, so we have one product that we're working on right now that is in Phase III. They are planning on implementing our technology post-BLA approval. That's assuming, of course, that the drug gets approved. So we're a few years away from actually seeing that drug on the market. That would be our first drug candidate to make it to the market utilizing our technology.
Lukas:
What does it do?
Sean:
Unfortunately due to confidentiality, I can't disclose that, but I'm hoping here in the very near future that we will be able to disclose that. I will say in general, most of the programs that we work on are either on immuno-oncology or in infectious diseases. But our platform's really agnostic to the types of indications or diseases that we can go after, but we really focus on where the industry's focused, and a lot of that is on oncology.
Lukas:
Is that because cancer's such a big deal and so many people get it or some other reason?
Sean:
Yeah, I would say that that is one of the big diseases that the industry is focused on and where a lot of innovation can be. Our technology is really an enabling technology, so we take the ideas that our pharma partners have, they're the experts on the biology, and saying, "Hey, we need to design a drug that has these attributes that can do this." We can then enable them to do that and that's across really all diseases and indications.
Lukas:
Forgive me for such basic questions, but I'm really curious how this works. So a pharma company would come to you and say... Is it as simple as, "We want to cure this specific disease and we need a molecule that cures this disease?" Do I have that right? I mean, how does that happen? Then what do you deliver? Is it like, "Here's a molecule," or "Here's 20 you should try," or "Here's how we think about it?"
Sean:
Yeah, I mean, the simplest way of looking at it, it's exactly how you described it. So they come to us and say, "Hey, we have this particular target or indication and this is the biology. If we design a drug that has these attributes, we think that this drug candidate then could kill this cancer cell." They then have to perform the animal models and then ultimately take it into the clinic to prove their hypothesis on that, and we're assisting them in being able to discover the drug candidate that has the properties that are needed to solve the biology problem that they have determined is going to ultimately cure or improve that particular disease.
Lukas:
When you say drug candidate, is that literally a molecule?
Sean:
That is. In our case, that is a protein that is being used as a drug. There are protein-based drugs and then there are small molecule-based drugs. Small molecule drugs are things like Advil and Vicodin: basically a pill in a bottle. Then you have the protein-based drugs, or biologics, such as insulin and a lot of the exciting monoclonal antibodies. Again, going back to Lilly's COVID antibody or Regeneron's COVID antibody, these are all protein-based drugs. The interesting thing with protein-based drugs is that you can't chemically synthesize them. You actually have to make them in a living organism. That adds more complexity to discovering these molecules as well as manufacturing them.
Lukas:
Can you predict exactly what the protein's going to look like and then look at it and see if it does it? Is that all in simulation or are there surprises when you actually try to manufacture it?
Sean:
Yeah, so there are a lot of surprises that can occur. We are not to the point where we can predict drug functionality. That's ultimately where we're headed with all of this. A lot of times, even if you can predict the functionality of a protein, that doesn't necessarily mean that you can manufacture it. Many times we see large pharma discover these really exciting, novel, breakthrough protein therapies, but ultimately they can't take them to the clinic because they can't manufacture them. You not only have to predict the protein functionality, but you also have to be able to predict the manufacturability of it as well. We're really looking at both of those. AlphaFold can predict the protein structure based off of the amino acid sequence; where we're headed is being able to predict the protein function, or protein-protein interaction. So it's the other side of the coin. AlphaFold was a huge breakthrough for basic research. What we're doing is going to be a huge breakthrough in drug discovery and biomanufacturing. Again, that's the opposite side of the coin from what AlphaFold has done.
Lukas:
I want to make sure I heard you right. Did you say you're not predicting the functionality?
Sean:
We are predicting the protein functionality.
Lukas:
The functionality is how it interacts with another protein?
Sean:
Exactly. It's "How tight does it bind to another protein?" Then also, we take into consideration immunogenicity. Is it going to react in the body once it's administered? Then also taking a look at the CMC or manufacturing aspects. Is it soluble and stable? Can it be produced at high yields? These are other predictions that we take into account or other attributes we take into account.

Why Absci started investing in ML

Lukas:
Interesting. I want to hear more about how this actually works, but, I guess, one question I want to make sure that I asked you is that I saw that you started your company in, I think, 2011, right? It seems like ML as applied to medicine has changed so much. I'm curious if you started your company with this perspective or how different it was, and also how your perspective on machine learning has changed as machine learning has evolved and deep learning's come along.
Sean:
We did not start off as an AI company. I would say we are very similar to Tesla's evolution. Tesla started off as an electric car manufacturer. They started collecting all this data from their sensors, built an AI team around that, and now they're a fully autonomous self-driving car tech company. That's a very similar evolution that Absci is on. We started out on the biology side, engineering E. coli to be more mammalian-like to really shorten the development times and decrease manufacturing costs. We then built out this technology that allowed us to screen billions of different E. coli cells and look at different variants of proteins, looking at basically the drug functionality and then also looking at, "Can you actually manufacture this?" We started generating all this data, billions of different data points on the protein functionality and the manufacturability. We knew that if we could leverage that data with deep learning, we could get to the point where we could predict the protein functionality needed for every type of target or indication, and that's ultimately what led us to apply our pioneering Denovium deep learning technology for protein engineering. But it really started off with the data. Data is so key, and we have proprietary data that no one else has, which we are leveraging deep learning to mine, to get us to the point where we can ultimately predict protein functionality. Where we're currently at right now is being able to leverage the data we already have to predict the best billion-member libraries we should be screening for every new target and indication we work on. Eventually, as we train the model with more and more of our proprietary data, it's going to get more and more predictive.
Instead of predicting a billion-member library, it starts predicting a million, a thousand, and then ultimately predicting the absolute best drug candidate for a given target or indication, looking at what modality should it be, the affinity, low immunogenicity, all the manufacturing attributes that you want. Right now, it's a race to feed as much data as we possibly can, but it all started off with the biology technology that we had originally developed.
Lukas:
For you, Sean, as CEO of a company that's not a deep learning company, I'm curious how you first got exposed to deep learning and what made you think that it might be useful, and then how you got conviction around making these large investments in deep learning that you're doing now. What were you seeing that made you feel like it would work? It seems like you're more bullish on it than maybe a lot of your peers and I wonder where that might be coming from.
Sean:
I'm bullish because we have the data. Again, it all goes back to data. We have high-quality data on protein functionality and manufacturability. It goes back to an earlier point that I made: there are more possible sequence variants in an antibody than there are atoms in the universe. There's no screening technology that we could ever create that would allow us to mine that big of a space. That's really where the deep learning comes into play: being able to essentially sift through all of the potential evolutionary paths by which a drug could be created, figure out what that best drug candidate is, basically mine that whole search space, and ultimately come to the point where we're creating the best drugs for patients. Once we implemented the deep learning technology, we saw huge gains in terms of yields and the types of drugs that can be discovered when taking our data and pairing it up with deep learning. Ultimately, where I see us going is becoming a full tech company once we have enough data here. I'm extremely bullish on AI and what it can do within healthcare.
Lukas:
It's interesting talking to you in that we work with, I guess, a lot of pharma companies, which I see are slightly different in what they do than you, but it seems like their perspective is "interested in deep learning, but probably not at the CEO level," except the sense that they're making, I'd say, small or medium investments whereas you want to transform your entire company in this direction. Do you think that you're doing something different than your competitors around deep learning? Do you think that you can be the best at this in some way?
Sean:
I do think that we can be the best. I would say that the industry is starting to understand the benefits of what deep learning and ML can provide. Biotech probably doesn't have as great an appreciation for tech and machine learning and really what that really means, and vice versa, that the tech industry doesn't quite understand all that goes into biology. It's really exciting to be able to take two industries, two cultures, and merge them together to really create something that's going to be hugely impactful for patients and ultimately the world.
Lukas:
That's super cool. I mean, thanks for doing an interview like this. I think this is really great for cross-pollinating ideas. I love these. I have a lot of maybe slightly more technical questions. Greg, feel free to jump in if you like.

Creating the GPT-3 of DNA

Lukas:
One thing I wonder about with ML applied to this stuff is, do you feel like it was always a latent possibility to successfully be able to make these predictions that you're doing now and it was just a matter of getting enough data? Or do you feel like there's been breakthroughs in machine learning, in model architectures or something like that that have actually made this a more practical application?
Greg:
Yeah, thank you. It's a great question. I would say that it's a little bit of both. There has always been potential for ML in bio, and it has been very successful in the past in some of these same indications, but it's been limited on two fronts: the data collection side, which is not stagnant (it's moving in incredible ways, the same way that the AI community has), and the AI modeling side. Recent advances in large-scale architectures, transformers, and a lot of different techniques for getting these models to converge successfully and to be very predictive have been incredible breakthroughs as well. Essentially, now I'm less concerned about the AI holding back any sort of success than I am about making sure that we can marry these two communities, and making sure that what is always an intrinsically messy process of collecting biological data is actually connected to the inputs and outputs of that AI. As Sean will be the first to tell you, this is a great place to be able to do that, because a lot of the hard work of actually developing these assays and working through that challenging space is part of the bread and butter of Absci.
Lukas:
Could you give me maybe a concrete example of an ML breakthrough that would help with this? For example, I think of transformers as... I know them as technology mostly for natural language processing. I could sort of imagine how this might apply to what you're doing, but maybe could you walk me through some kind of architecture, some kind of new way of doing things, and how you framed the biology in this machine learning world?
Greg:
I'll give a couple of examples that have come over the last few years. The biggest is related to scaling. Biological problems are necessarily complex. Evolution is one of the most interesting aspects of informational science because it's the ultimate bootstrap system. You've got these letters strung together on DNA that have, over billions of years, encoded themselves into the most sophisticated system on the planet. It's everywhere around us. In theory, an artificial intelligence could look at that and understand every piece of it the same way that every cell does. What you need to do to connect these dots is collect enough data from different parts of the system. Namely, you need a lot of nucleotide data, so we need to do DNA sequencing. But we need that from lots of different organisms, and we need to understand how they translate into proteins, we need to understand how those proteins act and function, whether they bind together, how they fold together. There's an incredible number of pieces that need to come together to see that big picture. This is where scale becomes very important. It's a bigger problem than what traditional ML, or even the original deep learning architectures, are capable of solving, because it simply requires more parameters, more complexity, better understanding. NLP-based models and transformers in general are really good for this domain because a lot of what we operate on is in sequence space. But I wouldn't say that they're the only approach to this either. Those advancements in letting us get to larger and larger models, to create the GPT-3 of DNA, are something that really gives us, for the first time, a real handle on these challenges.
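To make Greg's NLP analogy concrete: the move that lets transformers operate on biology is treating each amino acid (or nucleotide) as a token, exactly as words are tokenized in language models. The sketch below is minimal and hypothetical, not Absci's pipeline; the vocabulary and padding scheme are illustrative assumptions.

```python
# Minimal sketch: treating a protein sequence as a "sentence" whose tokens
# are amino acids, the framing that lets NLP-style transformers operate on
# biological sequence space. (Illustrative only, not Absci's pipeline.)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD, MASK = 0, 1                      # special tokens, as in language models
vocab = {aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}

def tokenize(seq: str, max_len: int) -> list[int]:
    """Map residues to integer IDs and pad, as a language model expects."""
    ids = [vocab[aa] for aa in seq[:max_len]]
    return ids + [PAD] * (max_len - len(ids))

print(tokenize("MKT", 6))  # → [12, 10, 18, 0, 0, 0]
```

From here, the same masked-token or next-token training objectives used for text apply unchanged, which is why advances in large language models transfer so directly to sequence biology.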
Lukas:
There is this trend in NLP — which I'm much more familiar with — of models becoming more and more black boxes. Less and less informed, maybe, by linguists. I don't know if every linguist I've had on this podcast would agree with that, but I think broadly, as the data and the model complexity increase, the models become more opaque. Is there a similar trend in these applications, where maybe the chemistry and physics matter less and you just treat it as this translation from letters to "Did the drug get successfully produced or not?" Or do you still inject your knowledge of biology or chemistry or physics to make the whole system work?
Greg:
Yeah, it's been moving in that direction, but we're not there yet. Those two communities still haven't fully been united. There have been some big advancements recently in the protein-biology space, and the MSA transformer is a big example of this: something that bioinformaticians and computational biologists have been doing for years (aligning sequences to see what kind of patterns they share in nature) can be used as an input directly, with a special kind of architecture that lets models learn from it. These sorts of biologically inspired architectures are still coming. AlphaFold is another great example, where they combined a number of relatively novel techniques, and that combination was really key to the success. The black box approach is powerful and I wouldn't downplay it, but there's still plenty of room for improvement.
Sean:
But I think that's ultimately where we want this to go. You can input a target sequence and have the output be the sequence for the drug candidate, predicting all the binding just based off the sequence itself. We've already seen some really interesting discoveries come out of this: we got an increase in overall yields from a protein that wasn't necessarily classified as a chaperone, but our deep learning model predicted that it would act as one. I think these are some of the really interesting discoveries that are going to occur at a very rapid pace by bringing AI and biology together.

Investing in data collection and in ML teams

Lukas:
Sean, how do you think about investing in data collection versus your ML team? There's maybe two ways to improve your models. Going out and collecting more data, which is probably really one type of investment, versus building up ML expertise. Do you think about it that way and do you feel like there's a trade-off there? How do you look at that?
Sean:
I think investments in both are absolutely critical. You can't invest in one and neglect the other; you really have to make strong investments in both. Right now, a big focus of ours is, "What is all the data that we want to be feeding into the models?" Looking out 10 years, are we going to regret not collecting this piece of data? Then, how do we build our databases and scale the amount of data that's needed in the future? How do we collect it as quickly as we possibly can, to then hand it over to our ML team to continue to train and improve the models? We have made huge investments in both, from the wet lab side, the data capture, and the database and its scaling, to the AI team.
Lukas:
As more of a computer scientist, I'm definitely enamored at the idea of a wet lab. Could you describe what happens and what that collection process looks like?
Sean:
We just built out a campus, I think it was 88,000 square feet. Half of the campus is office space and the other half is an actual lab. The lab is super key to what we do. It ranges all the way from the drug discovery team down to our fermentation and purification team, who grow up the cells and ultimately purify them. A lot of the data that we're feeding into our deep learning models is Next Generation Sequencing (NGS) data and flow cytometry data. That's really key. Some of the breakthroughs within NGS, and the speed at which we can process NGS data, are really enabling us to do what we do. It's really fun to be able to grow a team that spans both the wet lab side and the AI and ML side. Also, I would say an AI scientist who understands the biology is absolutely critical to what we do, and there is not a lot of that talent out there, but we have done a really amazing job of building out a team that understands both aspects.
Lukas:
Maybe this is a stupid question, but what goes on in a wet lab these days? Is it like beakers full of proteins? Is it microfluidics arrays? I don't know. How does it work? How fast can you actually collect meaningful data?
Sean:
We build these...so we start off with building these large libraries. We work with what's called a plasmid. It's basically circular DNA, and that encodes the drug product. We vary that DNA to look at various different drug candidates. In a single small test tube, we basically take all of those billions of different plasmids and put them into E. coli. It's extremely small, and you look at it and think, "Wow, there are trillions of cells in there." It's pretty incredible. Then we take all of that, we screen it, and ultimately we find the drug candidate and the cell line. Then we grow it up in big fermentation reactors. Think of brewing beer. It's essentially big vats that are highly controlled, and you just grow up the bugs in there, give them the genetic code to make the drug candidate, and scale it up from there. But yeah, it's all beakers, fermentation, purification. You name it, we've got it.
Greg:
I'd add a little color to that as well, in that from a background of somebody who doesn't spend every day inside the wet lab, it feels a lot like stepping into Wonka-land. You have an amazing amount of human ingenuity sitting on every desk, whether it's a mass spectrometer or some sequencing technology or...all these devices have very specific and very incredible capabilities and a bunch of people who know what to do with them and know how to put all the pieces together to make this stuff happen.
Sean:
It's so funny. I don't think I've ever had anybody ask me, "What does a wet lab do?" I was searching for the words to describe it. I probably did a terrible job. But it's like-
Lukas:
I thought it was great, what you provided.
Sean:
You don't really quite understand the magnitude until you step in and really understand every intricate aspect that's being done.
Lukas:
I remember the first time I ever went into one of our customer's wet labs. I felt like, "Oh, this is what I thought science was like when I was a kid." I love it.
Greg:
I'm still disappointed I don't get to show up in a lab coat. I might just start doing that now.
Sean:
Yeah.
Lukas:
It's funny. I never thought about this, but we do a lot of ML experiment tracking, but I would imagine there's a lot of parallels to tracking all the experiments that you're doing in the lab. Do you have software that does that? You've probably written a lot of software to just keep track of everything that's happening in there, right?
Sean:
We've actually decided to build a lot of this out ourselves. Jonathan Eads, who's our VP of Data Science, and his team are working on building out a database where we track everything internally, based on software that they have developed. This is because there was no software solution out there that really met our needs. We actually just got a demo of it the other day, and it's really incredible what it's going to allow us to do: not only the data capture, but also being able to track where programs are at in the lab and where we have bottlenecks. I mean, it's really this brilliant software that is going to help expedite what we currently do and capture the data that's needed for long-term success.

Clinical trials and Absci's revenue structure

Lukas:
Very cool. I'm curious about how you think about where this goes. Where do you imagine ML taking you as you collect more data? Do you think the whole process moves to this? Do you think you could run clinical trials essentially in ML and know if they're going to be successful or not?
Sean:
I won't say that we'll be able to use ML to run clinical trials, but for the drugs that we do design, if indeed we are predicting the best drug candidates for various indications, it's going to increase the overall success rate. That in turn is going to lead to shorter clinical trial timelines, let us rapidly progress new drug candidates through, and ultimately lead to the point where we can do personalized medicine, because we will have shown that the success rates dramatically increase, which allows for that personalized medicine. But who knows? In the future, we could be able to use ML for clinical trial design and prediction as well. One of our core values here is believing in the impossible, so I feel bad for not saying, "Yes, ML will be able to predict clinical trials without actually having to go through them." It'll be really interesting to see what's done on that front in the future.
Lukas:
What is a typical clinical trial success rate?
Sean:
Right now, it's right around 4%.
Lukas:
4%.
Sean:
Yeah.
Lukas:
But there's different stages, right? Or how does that work?
Sean:
Yeah. There are three stages: Phase I, Phase II, and Phase III, and then ultimately approval. Going from Phase I all the way through approval, it's about a 4% success rate.
Lukas:
Wow.
Sean:
Yeah.
Lukas:
Just as another CEO, it sounds totally harrowing to me to have my revenue depend on a 4% success rate process. How do you stay sane in a market like that?
Sean:
The way we structure our revenue is one, the pharma partner pays us to actually develop the drug candidate and the cell line. We're getting paid for that. Then we get paid on milestone payments as they progress through the clinical trial. You get a milestone payment at Phase I, Phase II, Phase III, ultimately approval, and then royalties.
Even if a drug doesn't make it to the clinic, you can still get paid these milestone payments, which are 100% pure margin. Then it's a law of large numbers. It's just growing the number of programs you have as quickly as you can. You ultimately get to the point where you do get drugs approved and you get royalties coming in for 10 to 15 years off of that. But you grow the revenue base just by growing the number of programs every year.
Lukas:
Can you say order of magnitude how many of these you're doing? Is it like thousands?
Sean:
We currently have nine active programs ongoing. Our goal for this year is five programs, which we're on track for, and then increasing those year over year. But no, it's definitely not thousands. It's more on the tens instead of thousands.
Lukas:
Do the programs inform each other? Is this similar to natural language where you can have one big model and then fine tune it on the different cases?
Greg:
Yeah. That's actually a big part of why we think this is so exciting, is because it really is one physical system underlying a lot of these drugs. Creating a model that can understand this for one drug is useful. Then for the second one, it presumably will need less training data because it can transfer learn what it understands about the first one. Then you go to the third and the fourth, and before long, as Sean was saying, the number of shots you need on goal becomes reduced to the point where any novel drug then becomes a one-shot learning problem. This is exactly where we see it going.
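Greg's transfer-learning point can be sketched in toy form. Everything below is purely illustrative (hypothetical names and numbers, not Absci's system): the idea is simply that an expensive shared backbone is trained once on the data-rich first program, and each new program then only fits a cheap task-specific head on top of its features.

```python
# Toy illustration of transfer learning across drug programs:
# one shared backbone, many small per-program heads.

class SharedBackbone:
    """Stands in for a large pretrained sequence model."""

    def __init__(self):
        self.trained_on = []

    def pretrain(self, program: str, n_samples: int):
        # The expensive step: done once, on the data-rich first program.
        self.trained_on.append((program, n_samples))

    def embed(self, sequence: str) -> list[float]:
        # A real model would return learned features; this placeholder
        # just derives two numbers from the sequence.
        return [float(len(sequence)), float(sequence.count("G"))]

backbone = SharedBackbone()
backbone.pretrain("program_1", n_samples=1_000_000)

# Each later program reuses the backbone and only fits a tiny head,
# which is why it needs far less data than the first program did.
head_weights = [0.5, -0.2]
features = backbone.embed("MKTGG")
score = sum(w * f for w, f in zip(head_weights, features))
print(round(score, 2))  # → 2.1
```

As the shared representation improves with each program, the per-program data requirement shrinks, which is the path to the one-shot regime Greg describes.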

Combining knowledge from different domains

Lukas:
Is it possible for you guys to engage with the academic community at all? I feel like you're actually adjacent to two very different academic cultures, right? There's the ML culture, which I know well, but seems like it might be tricky to share data with and then the vast medical literature, which I know less well. Are these communities relevant to you at all? Do you try to do any publishing or engage in some way?
Sean:
Yeah, definitely. We love to engage in the academic community and we are looking to publish some papers here in the near future, both on the work that we're doing, but also in collaboration with some of the leading new academic professors in our area. We see this as ways to continue to validate the work that we're doing and improve the science that we have and leverage domain expertise that we don't have. The academic community for us is really essential to the work that we do. We very much foster those partnerships and collaborations.
Lukas:
Cool. Well, I know a lot of ML practitioners that I think would be interested in working in your domain. Can you say anything about what you look for in hiring an ML practitioner that might be different than, I don't know, a Google or an OpenAI?
Greg:
I can speak to some of what we've looked for on our team and what we continue to look for going forward. There are a lot of strengths that naturally come from the AI community that we like to keep: the way we think about problems, the way we understand the implementation details. As you know, AI can be tricky to execute on, between the compute, the setup, and understanding all the different systems and software that go into that. But on the totally different side, you have all the biological complexity, and it's an entirely different field to be learning; you need a whole other degree to learn about all the complexities that come from that. Lab scientists, and a close relationship with them, are an important piece there. What I'm trying to get at is that it's the capability to learn, because there are so few people who are naturally in both spaces anyway. So it's a capability to learn, the patience and the rigor to go through and understand all sides of the problem, and how to make an impact there. It's never as easy as a lot of AI problems often are, where it's like, "Here are your inputs, here are your outputs. Now, maximize some scoring function." It's a lot trickier than that. The scientists live that day to day. To some extent, it's like, "Well, welcome to our world." And that's great, because it means we can also say, "This is how AI can address these challenges. It can help clean up that noise. We can help better understand what's going on with this process, and then, yes, ultimately build systems that speed up and maybe even replace a lot of these processes."
Lukas:
Sean, I guess in that vein, as you have transitioned from not doing a lot of machine learning to really making this heavy investment in machine learning and building out these teams, have there been any kind of unexpected cultural issues or team issues that you've had to work through that might have happened because of adding all these ML nerds?
Sean:
Yeah, I think it's having everyone recognize that combining ML with biology and the lab scientists ultimately gets us to our vision quicker, and ultimately impacts patients' lives in ways that we couldn't without combining them together. The first thought is, "Oh, my gosh, Sean, you're bringing in all these AI and ML experts. Are they just going to automate my job away, and they're going to be able to predict everything, and there is going to be no need for me?" It's like, "Absolutely not." Biology is so complex. We have so many problems to solve. Once we solve one problem with AI and we have the data, we then need the biology and wet lab expertise to solve the next problem, and the next problem after that. It's never going to go away. You need both. At the end of the day, you can't stop the wet lab and the biology side, because that's what feeds the data, and both are absolutely critically important. I just love the different perspectives that both sides bring to the table to make our company the best it possibly can be.
Lukas:
It sounds like a lot of fun. Have you gotten any questions from your ML team where you're just like, "Man, we're just miles apart here," like you just don't understand what we're doing?
Sean:
No, I honestly think everyone has done a great job of understanding the other side's perspective. Sometimes the AI team may not be getting data as quickly as they would like, but then they dive in with the scientists and they're like, "Oh, I understand you ran into this problem. Can we work together to increase the throughput?" Or it's like, "Hey, I gave you all this data. I'm not seeing any improvements yet. When are we going to start seeing improvements from our AI models?" I think it creates patience, collaboration, and a respect for the part each side plays in the overall bigger picture.
Lukas:
Greg, do you agree with this? Should I ask you separately?
Greg:
No, no. I think you nailed it. You started by saying it's exciting and I couldn't agree more. It's an opportunity of a lifetime to be at the intersection of something like this. It's wonderful to see such smart people and such talented people who are respected in their own field and then coming together. There's something very humbling always to be on the other side of things and realizing, "Wow, there's always more to learn." It's very healthy, as Sean said. It does give you a greater sense of context and perspective.

The potential of multitask learning

Lukas:
We always end with two questions, and I think you both are coming from super different perspectives, but I'd love to hear both of your answers to this. One question we always end with is what's a topic in ML that you feel is underrated versus its impact? I mean this very broadly. I mean, I guess, Sean, what skills do you feel like people should be showing up with that they're not, maybe?
Sean:
When folks come to Absci, we're solving very big, complex problems. Our mantra and our number one value is there for a reason: believe in the impossible. We are always looking for people who want to push the limits on both the AI side and the biology side, and really bring those together. We are creating a new ecosystem that really hasn't existed, and an understanding of what ML can do for biology and vice versa. We just want to bring in people who want to think about things differently and change paradigms. I'm super excited about where the future lies with AI and biology together, and we're really at the forefront of that. Yeah, couldn't be more excited about where the industry's headed.
Greg:
All right. Yeah, I guess I'll give my different take here on what's underappreciated in ML. I'd say it's the capability of deep learning and artificial intelligence to do integrative work. It definitely has some appreciation, but it could be higher. We see an awful lot of research solving specific problems, often hard problems, and they compete against each other on performance scores and evaluation. But the real value, I think, in the practical world for AI is how well it ties different kinds of information together. We use this at Absci in trying to collect dozens of different kinds of assays, and we can understand, "All right, in context, for just one of them, this is a spreadsheet of data. It's not even that large. But maybe if I relate that to the embedding-space projection of a different model that was trained on a different task, it can tell me something useful about the problem that I'm working on now." This is a philosophy that we're big proponents of: integrating large multitask systems that can leverage the commonalities in the data by putting them all together. The advantage is not just that you get to use all the data you have on hand; it also creates a simplicity to everything, where instead of having to run all these different pieces, you can ask from one piece of data what the other pieces would look like. In the case of bioinformatics, for example, we have a lot of computational tools for understanding protein function. You can run dozens of these different tools and try to get them all to work together and set up your environments, or you can have one AI model that knows these answers and can give them to you in a millisecond. How well AI can simplify problems and bring different kinds of problems together is something that I think could use more appreciation.
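The shared-representation idea Greg is describing can be sketched in a toy way. The names and data below are invented (this is not Absci's system): two small "assay" heads train against one shared trunk, so the signal from each task shapes the embedding that the other task reads.

```python
import numpy as np

# One shared trunk maps inputs to an embedding; one small head per assay.
rng = np.random.default_rng(1)
D_IN, D_EMB, N = 10, 4, 64

W_trunk = rng.normal(scale=0.1, size=(D_IN, D_EMB))
heads = {"binding": rng.normal(scale=0.1, size=D_EMB),
         "expression": rng.normal(scale=0.1, size=D_EMB)}

# Toy data: both assay labels depend on an overlapping input direction,
# which is what lets one task's data help the other.
X = rng.normal(size=(N, D_IN))
y = {"binding": X[:, 0] + 0.5 * X[:, 1],
     "expression": X[:, 0] - 0.5 * X[:, 2]}

def mse(task):
    # Prediction is input -> shared embedding -> task head.
    return float(np.mean((X @ W_trunk @ heads[task] - y[task]) ** 2))

before = {t: mse(t) for t in heads}
lr = 0.05
for _ in range(300):
    for task in heads:  # round-robin over tasks
        emb = X @ W_trunk                 # shared embedding
        err = emb @ heads[task] - y[task] # per-task residuals
        grad_head = (emb.T @ err) / N
        grad_trunk = np.outer(X.T @ err, heads[task]) / N
        heads[task] = heads[task] - lr * grad_head
        W_trunk -= lr * grad_trunk        # both tasks update the trunk
after = {t: mse(t) for t in heads}
print(before, after)
```

Because both heads push gradients through the same trunk, the trunk's embedding ends up encoding the structure the tasks share, which is the mechanism behind using one task's embedding to inform another.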
Lukas:
This really works for you. I mean, I feel like a lot of people talk about this multitask learning and combining problems, but it's always felt a little theoretical to me. Do you actually find that it meaningfully helps on tasks to incorporate data from other tasks?
Greg:
Oh, absolutely. This was a big part of what we did at Denovium: taking our DNA models and protein models and tying them together, two entirely different domains of data. It allowed users to essentially take a DNA sequence and, with just one artificial intelligence model, find all of the proteins, determine what they do, and characterize them with 700,000 different labels. Very multitask. We had something like 25-odd different databases that were all tied together; it essentially had to multitask quite a bit to solve those challenges. But it both worked and really sped up the progress of what we could do, as well as allowed some really unconventional approaches. Sean was talking earlier about the chaperone discovery work, where we could use these protein models to understand what a protein would do even if it hadn't otherwise been characterized by science. Because these sorts of models are generalized over so many different kinds of tasks, they're not burdened with memorization. They can say, "Oh, yeah. Well, hey, look, this looks an awful lot like this. It should do this," and we can trust them to step outside the box.
Lukas:
Is there any paper or something that you could point people to who want to learn more about this? Have you been able to publish any of this work?
Greg:
There is some legacy work that was somewhat of a precursor to it. We can pull up the paper later.
Lukas:
That'd be awesome.
Greg:
Yeah.
Lukas:
Cool. We'll put it in the notes.

Why biological data is tricky to work with

Lukas:
Our final question, and Sean, I'm really curious to get your take on this one. Sean, you've been super positive about the promise here, but you guys are actually doing ML and trying to get real results, and so I'm sure that you're running into problems. What has been the biggest unexpected problem trying to go from this idea of something you want to do to actually making it really work in reality?
Sean:
Oh, man, there are problems every which way. I would say first, it's actually convincing the scientific community and our partners that deep learning and AI is the future, and showing them work that proves this can actually happen. That's the first hurdle. The other biggest hurdle and challenge that we've had to work through is developing the technologies that get us the data, in a clean format, then scaling that data, and then building out a world-class AI team. Greg, Ariel, Matthew, and myself are always looking for the best talent and how to bring them in. But as you know, as a fellow founder, once you think things are going well, you're thrown off the deep end, going down another path and having to solve another problem. It's continuous problem solving, but that's the fun of it. We've made so much progress and we're going to continue. That's just so much of the fun of growing a company and doing what we do.
Lukas:
Greg, anything you want to add of unexpected hurdles along the way?
Greg:
Unexpected hurdles? I mean, that's every day.
Lukas:
Well, give me one. Give me one real story from the trenches.
Greg:
Oh, let's see. What's a good one that we've discovered recently? It always gets back to the fact that biological data is messy. A lot of scientists are exceptional at what they do, but things come back that surprise you. For example, we assemble these plasmids, long circular stretches of DNA that essentially convey information about how to construct the drug and how to manufacture it at scale. A lot of the technology that we're developing is trying to say, "Okay, if you put in this sequence, it will do this. If you put in this sequence, it'll do that." In the process of building the precursors for that (I'm not going to credit deep learning here, just the infrastructure development underneath it), we discover, "Oh, hey, in some of our assays, whole sections of the DNA have just been cut out and looped together into a smaller shape. What's going on with that?" This was nobody's plan. Your AI is not going to say, "Wow, that was a really interesting phenomenon. You should go..." These are the sorts of things where that collaborative environment matters: an AI scientist, even just in the process of getting things ready for ingestion by an AI, can make sure that all the data is together and understood, and a lot of these issues get overcome. Then, of course, on top of it, you get the insights: okay, for the ones that are together, what do we see here? What is interesting?
Sean:
I think it all goes back to the hardest part that we deal with, is the biology. We can predict these billion-member plasmid libraries to build, but it could take us a week to build to it or it could take us two months depending on the complexity of it and we just don't know because it's biology. It keeps it interesting.

Outro

Lukas:
Well, awesome. Thanks so much for your time, guys. This was really fun. Really appreciate it.
Sean:
Thanks so much, Lukas.
Greg:
Thank you.
Lukas:
If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description where you can find links to all the papers that are mentioned, supplemental material, and a transcription that we worked really hard to produce. So check it out.