<div style="width: 100%; height: 170px; margin-bottom: 20px; border-radius: 10px; overflow:hidden;"><iframe style="width: 100%; height: 170px;" frameborder="no" scrolling="no" seamless src="https://player.captivate.fm/episode/007d646a-ea29-435c-a951-0207c8933286"></iframe></div>
I was using data science as a backdoor to problems. It was like, I could talk to people and figure out what they’re working on and figure out how the software and algorithms and analytical methods they were using mapped to problems that I’ve worked on previously and use those analogies to move sideways from work that I had done outside of the biomedical domain into the biomedical domain.
You're listening to Gradient Dissent, a show about machine learning in the real world and I'm your host, Lukas Biewald. I've known Jeff Hammerbacher for a long time, and he's had a truly incredible career. He started off running what was, essentially, the data science team at Facebook, and then founded Cloudera, which was a really early company in the data science space and recently went private after being public for quite a long time. But, mid-Cloudera, he actually left and became a professor at Mount Sinai and started his own lab. Now he's working on a company called Related Sciences that does drug discovery with machine learning. I actually ran out of time talking to him today because I have so many questions and the stories are so good. This is a super fun one.
Jeff, thanks so much for doing this.
Yeah, man. Good to see you.
Yeah. Good to see you. I want to get into the stuff that you're working on at the Hammer Lab, but this is, obviously, for a lot of people who've come up through data science, that we record this for. I thought it might be interesting to start just with your early career, just because I think people would want to know about it, and you had such an outsized impact on the field of data science. I was curious just to hear your story about how you came into Facebook and how you...I think you started a data science team there, right?
Yeah. So let's see, I landed at Facebook in early 2006. My initial title was Research Scientist and then, eventually, I ran a group of what we would soon call "Data Scientists". The next step after that was absorbing what we called data infrastructure at the time, which would — I suppose — now be called data engineering. So we ended up with a team called the data team. It was almost 30 people by the time I left, so it was pretty good-sized, and our mandate was effectively to collect all the data generated by the site and then do analyses on it to improve the business outcomes. It was a rapid learning experience. I was there for less than three years and we went from effectively zero data for offline analytics to petabytes per day. There was no real technology to support doing that at the time, so I was really spending a lot of time talking to people at Yahoo and eBay and Google, just trying to figure out what was going on. The commercial vendors...it wasn't really a blip on the radar yet to do data at that scale, so it was pretty intense. I learned a lot and I met a lot of great people and it eventually led to starting Cloudera.
People might not realize back then, it wasn't standard practice even to keep all your data. I remember talking to the CTO of eBay, although I think that was a little bit after that, and he was saying, "You know what? We only keep 1% of our click logs because it's just too expensive to store it all." Why do you think Facebook was so out on the forefront of doing this kind of data analysis?
We were certainly not ahead of Google, so I would never claim us-
That's a high bar.
Yeah. I wouldn't claim that we were at the forefront. I would say it was "necessity is the mother of invention", in that we just had so much data and so much user activity that we wanted to understand and our product was evolving so quickly. I think the need for offline analytics was really driven home to executives during the News Feed launch. This is something that's probably incomprehensible to most people listening, but Facebook didn't have a newsfeed when I joined and we went and launched that six to eight months after I joined. I remember getting a phone call...so Mark and Chris Hughes, two of the founders, were doing Facebook's first ever press tour. News Feed was a big deal and we had a PR and marketing function at the company, finally, and so they had lined up all of these interviews on the East Coast, and the launch was a disaster. They were out here fielding questions and freaking out because the narrative around the product launch was very negative, but our metrics were pretty solid. So we were spending a lot of time really digging in to understand what was happening to user activity to try and distinguish the narrative, to just see what the users were telling us from what the press was telling us, and then helping to decide whether we needed to roll the thing back. It was a pretty big crisis at the company, and so using data to help stabilize product decision-making...then, I think after that, it became a more critical function at the company, but it took a long time. I think growth was another big motivator. It's another part of the Facebook story that's not really well understood, is that we kind of went sideways for six months there in late 2007, early 2008. There was a lot of stress in the executive team and the engineering team, and a large chunk of people got re-orged to really focus on growth. That ended up creating probably the highest-level awareness that we needed to invest in data infrastructure and data science. 
I think those were probably the two things that I look back on and think...and also just internationalization. It's tightly coupled to growth, but at some point, you're navigating the product through your intuitive understanding of what people in your demographic cohort want to see from the product, but then you have to transition to understanding what a grandma in Turkey wants from Facebook. At that point, you really need to start flying with instruments, so I think those are some milestones that I can recall from over a decade ago.
It's funny even to think back to then, but NoSQL was not really a thing a lot of people knew about. What was your tech stack in 2008? Do you remember where you're storing all this data and how you were querying it?
Oh, totally. Yeah. When I landed in 2006, the tech stack...well, first of all, in 2005 they didn't use version control. This is one of my favorite things about Facebook, was they had a cron job that ran every night and tar'd up the source code and copied it off to storage. That was how they did version control. So, it was a different time. GitHub didn't exist, Subversion was the dominant source control product. The tech stack was the LAMP stack, it was Linux, Apache, MySQL, PHP. Facebook played a big role in adding another "M" to that stack, Memcached, which was essentially Redis-ish. The current modern thing I would...it was a key-value cache, so you didn't have to hit the database. It was basically like if we hit the database, we had failed on the application side, because the user activity was so high that it all had to come out of the cache. So in terms of the stack for analytics, when I got there Dustin Moskovitz, one of the founders, had built something called the Watch Page. The Watch Page was powered by a cron job that woke up every minute, issued a query to every MySQL production database to just gather some stats about user activity, and then pulled the results of those queries down into another offline MySQL database, which contained a rolling time series of per-minute metrics. That was great and we used it for a long time. That's what everybody internally was watching to see user signups and that's where a lot of the metrics around daily active, monthly active would get defined and pushed out. But we had no offline data store to do analytics work on. These were summaries that were computed at the time of the query and pulled back, so you couldn't do any post hoc analytics over it. So, the initial attempt at a tech stack for a data warehouse was to use Oracle. So, actually, that was me. I didn't make the purchasing decision, but I had to do a lot of the installation and maintenance of that thing. 
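As a rough illustration of the Watch Page pattern described here, the sketch below polls several database shards once per "minute" and keeps only the summed result. It is a minimal sketch under stated assumptions, not the original implementation: in-memory SQLite stands in for the MySQL production databases, and the table names, column names, and helper functions (`make_shard`, `poll_once`, `run_watch_page`) are all hypothetical.

```python
import sqlite3


def make_shard(signup_minutes):
    """Create an in-memory stand-in for one production database shard."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, signup_minute INTEGER)"
    )
    conn.executemany(
        "INSERT INTO users (signup_minute) VALUES (?)",
        [(m,) for m in signup_minutes],
    )
    return conn


def poll_once(shards, metrics, minute):
    """One scheduled tick: run a summary query on every shard, keep only the total."""
    total = 0
    for shard in shards:
        (count,) = shard.execute(
            "SELECT COUNT(*) FROM users WHERE signup_minute = ?", (minute,)
        ).fetchone()
        total += count
    metrics.execute(
        "INSERT INTO signups_per_minute VALUES (?, ?)", (minute, total)
    )
    return total


def run_watch_page(shards, minutes):
    """Build the per-minute time series in a separate 'offline' metrics database."""
    metrics = sqlite3.connect(":memory:")
    metrics.execute(
        "CREATE TABLE signups_per_minute (minute INTEGER PRIMARY KEY, signups INTEGER)"
    )
    for minute in minutes:
        poll_once(shards, metrics, minute)
    return metrics.execute(
        "SELECT minute, signups FROM signups_per_minute ORDER BY minute"
    ).fetchall()


# Two hypothetical shards with a handful of signups each.
shards = [make_shard([0, 0, 1]), make_shard([1, 1, 2])]
series = run_watch_page(shards, range(3))
print(series)
```

Only the per-minute totals reach the metrics database, which is why this design cannot answer questions after the fact: the underlying rows never leave the shards.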
I very clearly remember the Sun T2000 server that we were running on and...obviously, this is all colo'd and not in the cloud at the time and, you know, fiber channel interconnect to a network-attached storage device and running Oracle RAC (Real Application Clusters) at the time-
Was it sharded or is this like one machine is holding a whole...
Oracle RAC was a shared storage distributed compute...so a bit like the architectures that we end up with today in the cloud, where we have this bottleneck to get to your object store — that's how databases were — and these were block stores. I said network-attached storage, but it was actually a Storage Area Network which was speaking a block protocol to the server, not a file-oriented protocol. That's how databases were built at the time. It was insane to conceive of writing a database that wrote to a file system, they had to talk to the block layer. The file system was just going to slow you down. So yes, we ran Oracle RAC which was, like I said, shared storage, distributed compute, and it fell over immediately. I remember we hired a DBA, a database administrator, and he quit on his third day. He was just like, "I've never seen anything...this is crazy. What are you doing?" There's this guy, Tom Kyte, who wrote a lot of books about Oracle database internals. I was reading a lot of Tom Kyte books, learning a lot about tuning. You know those early multi-core — these Sun Niagara chips — were one of the first multi-core...Now we're all stuck with it because Moore's Law has basically ended, but it was really the beginning of the end there. So, learning a lot about how to scale up on multi-core settings and then just starting to look around frantically for something that could scale past that. We had two sources of data at the time. We had production databases, but then we had the major source of data that just ended up totally flattening us. We called it Falcon and it was built by a guy named James Wang to power the News Feed, and it was just an event log. It was the kind of thing that you would pass through Kafka today, but it was just this homegrown C++ toolkit. It was eventually replaced by Scribe, which we made open source, it was a popular tool for log tailing. That was written by a guy named Bobby Johnson. 
The Falcon logs were the vast majority of data, all the event data. Any time a user did anything on the site, we'd log it, and then we wanted to use that to reconstitute information about user activities. Falcon is what really ended up just knocking us down, and so I was frantically looking for a new tech stack beyond just an Oracle RAC instance. There were a few alternatives. At the time there were a lot of shared-nothing distributed database companies targeting the data warehousing market. Netezza had been very successful using custom silicon ASICs to accelerate queries in a shared-nothing architecture. They had gotten bought by IBM for $400+ million and that really caused a lot of new entrants to come into the market. So these were companies like Greenplum, Aster Data, Vertica, ParAccel. A lot of interesting distributed database companies, but most of them couldn't scale to what we needed. Honestly, the Yahoo experience was what I modeled a lot of our tech stack after. You'll be familiar with that from your time there. So they had a similar SQL querying over event log data infrastructure called MyNa, My NetApp, which unfortunately, I didn't spend a lot of time talking publicly about, but I managed to get to know the people that built it and learn about how it worked. It was effectively a Hadoop-like architecture, but instead of a data node in a distributed file system, they had NetApp filers where they were querying data over. So we hired a guy named Suresh Antony who built, effectively, a very rapidly implemented version of MyNa called Cheetah to bridge us between the Oracle era and whatever came next. Then we started really looking around and we found the Hadoop group at Yahoo. Eric Baldeschwieler and folks, Owen O'Malley and Doug Cutting, obviously, were doing some really interesting work to pick up this work that had been published by Google about MapReduce and the Google File System and implement it as an open source project. 
Everybody thought it was insane at Facebook. Writing stuff on the JVM was just very much frowned upon. It was a very polyglot programming language environment, but the only exclusion was Java from that zoo. It was an uphill battle to convince people that this was going to be something that might solve our problems, but eventually it became a pretty significant component of our infrastructure and we ended up writing a lot of database utilities on top of it. A project like Hive — it's a SQL query interface and a metadata manager in front of the distributed file system and MapReduce implementation — ended up becoming a really significant component of our analytics tech stack there.
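To make the idea of a SQL interface over MapReduce concrete, the sketch below shows how a query along the lines of `SELECT action, COUNT(*) FROM events GROUP BY action` can be expressed as map, shuffle, and reduce steps. It is a single-process toy under stated assumptions, not how Hive plans or executes queries; the function names and the sample `events` records are hypothetical.

```python
from collections import defaultdict


def map_phase(records, key_field):
    """Map: emit a (key, 1) pair for every event record."""
    for record in records:
        yield record[key_field], 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def count_group_by(records, key_field):
    """Rough equivalent of SELECT key_field, COUNT(*) ... GROUP BY key_field."""
    return reduce_phase(shuffle(map_phase(records, key_field)))


# A tiny hypothetical event log.
events = [
    {"user": "a", "action": "view"},
    {"user": "b", "action": "click"},
    {"user": "a", "action": "click"},
    {"user": "c", "action": "view"},
    {"user": "a", "action": "view"},
]
counts = count_group_by(events, "action")
print(counts)
```

In a real cluster the map and reduce steps run in parallel on many machines and the shuffle moves data across the network; the appeal of a SQL layer is that analysts write the query and never see these phases.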
It sounds like you have some of this infrastructure built when the growth stopped. I think a lot of people, myself included, relate to the pain of growth stopping and trying to figure out how to get it going again. Was there some piece of analysis that you felt like you did to get that restarted, or was it just a lot of little things? How did that go down?
I've actually gotten a cease and desist from Facebook before for saying this in an interview, but the honest answer is the Hotmail contact importer.
I remember that, yeah.
That was the era. That was the social graph of 2006 to 2008. It was Hotmail. Yahoo Mail to a lesser extent, like a 10th, and Gmail, even smaller than Yahoo Mail. It was really about — what do they call them, dark design tactics or something? — it was these things where it was like, "Put in your email address and we'll invite all your friends, and we'll just auto select all of the emails and obfuscate that, and if you click okay we're just going to spam your inbox and spam your mailing lists," and that was really how Facebook grew. There was a lot of stuff after that that was a lot more targeted. In our group, we had a guy named Itamar Rosenn who was my first hire and is still there.
Oh, no way. He's a classmate of mine.
Yeah. He's still there. I was just texting with him yesterday, I got to catch up and see how that's going. So, Itamar...there's a guy, Matt Cohler, who was an executive who was really one of the key strategists for early Facebook. Cohler — I'm sure at the behest of Mark and some of the board, or potentially it was his own idea — I'm not sure exactly who, but it was communicated to me through Matt Cohler. He pulled me and Naomi Gleit, who you may have also been a classmate of, if I know my Stanford connections. He pulled me and Naomi and he said, "Hey, growth is an issue. Let's start dedicating some analyses to it." We started meeting regularly and doing analyses, and Itamar joined not long after. Itamar generated this weekly growth report, which was a set of standard metrics as well as a deep dive every week that was distinct and specific to some high-level question we had at the time. That growth report, we turned it into a PDF to make it look nice and sent it out to the company. I used a lot of LaTeX back in the day for my math notes in college, so I like to-
You TeX'd that up in LaTeX and then used that as a company report? That's amazing.
You do it for a year and all of a sudden you're fluent and so then it's hard to go back because it just looks so much better when it's in a nice...there's all kinds of better ways to do it today, but that was my solution then. So yeah, so Itamar would send out the growth report with a lot of input from Naomi and Cohler, and that became a focal point for analyses to better understand growth. Then, ultimately, a growth team was built. If I recall it correctly, James Wang, the guy that wrote Falcon, ended up being the engineering manager for that growth team. He played a big role in the initial work that they did over there.
Wow. That was really fun, too. Thanks for taking me through that.
Then, I guess I have the same question on Cloudera, which is also an iconic company in data science. I remember when you were starting it and thinking about what the market size would be, but I guess what really prompted you to start it? Can you tell me a little bit about the early days of getting that off the ground?
Sure. Well, we tried to start it earlier in 2008. So this guy, Christophe Bisciglia was at Google and was teaching a MapReduce class at University of Washington and was really trying to push Google to proselytize their approach to data management and data analysis into the academic environment. He was using Hadoop in that course, so he was connected to the Hadoop community through that. Microsoft made a bid to buy Yahoo in early 2008, and that catalyzed...so Christophe and I had been chatting about what would it look like to start a company to support Hadoop, because he needed it for his work and I needed it for my work. When Microsoft said they were going to buy Yahoo, then we were like, "Oh, boy. We really need to accelerate the timing on this." So that was early 2008 and we had a third guy who was going to be a co-founder, a guy that I had gotten to know because we interviewed him to be VP of Engineering at Facebook and we actually offered him and he turned us down. Mike Abbott was his name. So Mike is now at Apple running a big swath of their software development, and I really hit it off with him during the interview process. I stayed in touch with him and I was like, "Hey, man." He had a lot of experience with database internals, he had a startup company called Composite Software that did federated query, which I guess today would be called a data mesh. Mike was always a smart guy and I really wanted him to start the company, but he actually had some personal life issues that made it not really work out. It kind of fell apart in March '08, but that got Christophe and I talking, and he started working on his own. He recruited a guy named Mike Olson, who I had followed for a while because Mike was the CEO of Sleepycat Software, the maker of Berkeley DB, which is an embedded database that was very, very successful. The killer app was Active Directory. Mike had sold his company to Oracle, had done two years, and was on the way out. 
Christophe had recruited him to...he actually incorporated the company as "Clouderra" with two R's and Mike was the CEO, but another guy, a third guy, Amr Awadallah — who you probably know from your time at Yahoo. He had run a group called Product Intelligence Engineering, it was very successful — Amr had spun out of Yahoo and was convinced by a guy named Andrew Braccia at Accel to be an entrepreneur-in-residence at Accel Partners. Amr was, actually, at the time working on a spot market for cloud resources, which was very early in 2008 to have this idea. So we were like, "Maybe this isn't the right time for that. Maybe someday it'll work." I had spun out of Facebook to do an entrepreneur-in-residence program at Accel Partners as well. I had actually cooked up with a guy named Eric Vishria who's now a partner at Benchmark and we were working on a consumer energy demand monitoring system. Eventually, Amr and I got to chatting and Christophe really catalyzed the whole thing. He and Mike were moving forward and Amr and I were like, "We should probably hop on there." So Amr, me, Christophe, and Mike ended up reconstituting it as "Cloudera" with one R and then just re-founded the company going forward. We ended up hiring Doug Cutting about a year later once we had established some credibility, but it was just the four of us when we got moving.
What did you work on in the early days? It must've been a pretty big change going from running the data science team to founding a company.
Oh yeah, for sure. On the one hand, yes, on the other hand, no, because at Facebook it was a very sink-or-swim culture. I really felt like I built that data team with no real supervision. I basically went around the block once a week with Adam D'Angelo to just have a conversation for an hour. He was very helpful about just clearing roadblocks for me and helping me think through strategic things. But ultimately, it was just something that I thought needed to be built and they just said, "Go build it." I don't think anyone up top at Facebook was like, "Let's hire 30 people to work on a data team." I think I just kept hiring people, and at some point they looked over and they were like, "That's a pretty big data team." People talk about an "intrapreneur" or whatever and I guess it did feel a bit like that. I did feel like I was just building a little company inside of Facebook, and ultimately the Cloudera product roadmap was just the Facebook data infrastructure product roadmap done as a... Most of the reason I started Cloudera — or I got involved with Cloudera — to be honest was, I just wanted to see the things that I wanted to build exist in the world. I knew that Facebook, they were entering a period where they weren't going to be quite so excited...it was more of a "buy versus build" period — which made complete sense given the scale of the business and the success of the business — so I was like, "I'd rather build some of this stuff." So we got to work. Hiring was, obviously, a lot harder for a random startup versus the hottest startup in Silicon Valley. I had to do a lot of legwork on hiring, and then just figuring out what to build. Sequencing, I knew what the end state was going to look like, but I didn't know how we were going to get there. Figuring out what to build first was pretty hard. We started with a couple of open source projects to just get data into the Cloudera environment. 
A project called Sqoop and a project called Flume that were dedicated to database and log data, in particular. Honestly, I saw Splunk at the time and I was like, "I want to get to a pricing structure that looks like that." I think the reason why data companies work in 2021 is the consumption-based pricing and Splunk had that figured out in 2005. But we never really could figure it out at Cloudera. We ended up getting stuck with a more Oracle-, Teradata-like pricing model. So yeah, so we were working on it, effectively filling out the stack to become a vertically integrated data platform — whatever they're calling them these days — but a place where you would collect data, put it, structure it, query it, analyze it, fit models to it, what Snowflake and Databricks are trying to build today. It was a very obvious product roadmap. That's what we wanted to build, we just couldn't figure out how to build it or how to get there, what the right sequencing was to get there. The other thing that we had to do is swap out components over time. We all knew that there was a shelf life to the core Hadoop projects and so we were trying to think beyond it. How do you make that transition from these legacy products to what we felt could actually serve as production enterprise workloads competitively with what other vendors were offering? Things like Impala for query engine or Kudu for table storage were always something we wanted to build, but just had to figure out when and how to get it out.
I think one thing that was interesting at the time — it seems so wrong in retrospect that it's hard to believe people thought this — but I remember actually talking to Matt Cohler about Cloudera and he was thinking, "How many companies would really use this? Maybe it's tens or maybe a hundred," or something like that at the time. I think even you expressed a little bit of doubt to me when you were starting. Did you feel worried about the market size or how did you think about that? Were you just sure that it would work or was that ...
Nah. For me, it was about manifesting a product vision, not about building a huge company.
I didn't expect it to get as big as it did, or people to care as much as they did. When I was leaving Facebook, I wanted to work on a super nerdy infrastructure software company. What could be nerdier than Hadoop? Within a year, we were in the New York Times and that part of the hype around it was always a huge turn off to me. It wasn't something that I wanted. I wanted to hire the best engineers from Sun and VMware and Oracle and Google and get them to build open source infrastructure that would allow any company to do what Google could do. That was what I wanted to do and whether or not it had commercial value at the scale that would necessitate venture returns, it wasn't that critical to me because we didn't raise that much money. Our Series A was $5 million. Our Series B was $8 or $9 million. These aren't even seed rounds anymore. So what we were building was different from what it became. I agreed with Cohler at the time, I didn't worry about who was going to use this because I just worried about completing the product. I just knew everybody was going to need it, to be honest. Everybody was going to have petabyte-scale data. I didn't know in what form they were going to be storing and analyzing it, but I wanted to solve problems to facilitate that world. But yeah, our Series B was a brutal fundraise. Our Series A was easy because Amr and I were both entrepreneurs-in-residence, and so we had two partners who loved us and believed in us and they would have given us money to start whatever we wanted. But then our Series B...we ran around Sand Hill and I actually remember I got a nice note from Dana Stalder at Matrix Partners a few years after, because he just beat us up in the pitch where he was just like, "I don't ever expect you'll get a seven-figure deal for this." He was like, "You'll probably get less than 10 six-figure deals for this. There just isn't a market. You should just pack it up now. This was like a science project." 
You're in those meetings and you just hear that over and over and over again and it's like, "Yeah, that's a valid position to take." I didn't necessarily disagree with it. So yeah, I couldn't be happier. The fact that they're still focused on open source is quite cool.
Do you feel any frustration that they're not a more iconic company? They were so early with the strategy that's worked so well and it's hard to say, I don't know, whatever their $5 billion market cap is not a wild success, but it does seem like they missed people shifting to Spark. Does that bother you at all?
I'm kind of weird in that I don't like big companies. To me, it's not a success if you have a hundred billion dollar market cap, but you've got all closed source software and you have it...so to me, I always talked about Cloudera as an engine for turning VC first and then company enterprise dollars into open source software. So for me, I look at the public goods that were created. I look at the standards, the software, those kinds of things. Honestly, I made plenty of money, I'm going to be okay. People who want another zero, at this point, it's all going to some foundation. You know what I mean? There's no material need that's going to be resolved if there were another zero on the end of Cloudera's valuation. I honestly don't know why people want more money than what we were able to make, and that was honestly a pretty big surprise anyway. I didn't start Cloudera to make money. For me, I look at things like Arrow and Parquet and Ibis and other kinds of open source infrastructure...even Hue, our user interface, has become adopted by pretty much all the cloud providers. I look more at, "How do you change the tools that people use in their..." and "How do you change their thinking?" Impala was really the first open source, vectorized, codegen, distributed query engine. It was something that everybody knew we needed to build and I was really proud of it when we built it. Whose name is on the jersey, at the end of the day, I don't really care. It was more about impacting the universe of ideas and public goods. I'm really happy with a lot of the work that we did. I will say, I think just being on the JVM is just tough for day-to-day developers. You can impact enterprise, but ultimately, no one uses enterprise stuff in their day-to-day. Snowflake is a huge company and they've built great technology, but it doesn't change how I do data analysis on a day-to-day because I don't need a super expensive data warehouse for my day-to-day data analysis. 
We built a lot of stuff off the JVM at Cloudera, subsequent to founding. It was a funny era to get stuck in JVM. I wish we had pushed more Python. We ended up buying DataPad, Wes McKinney's company, and we had Wes McKinney in the company for a while. It was after I had checked out — I was Founder Emeritus at the time, I referred to myself — I could never really convince our head of product management to really push on the Python ecosystem harder, but you can see that's where everybody's going now. I think if there's anything that I regret, it's not being able to influence people to get more into the PyData ecosystem sooner.
I also wanted to ask you about this incredible career transition that you've made that I'm just so impressed by it, to go into research. Can you talk about how you did that, how you got up to speed enough to start your lab, how you learned about almost a totally different field?
Yeah, totally. So, 2012-ish at Cloudera, we were four years in and it was bigger than I ever expected it to be. I'd replaced myself twice, first as VP Product and then VP of Data Science. I had hired people who were better than me at that job. The only thing left to do was hire a professional CEO, and we kicked off that search. To be honest, I was also having a lot of misgivings and also health issues that just made a high-intensity startup founder executive job in San Francisco very unpalatable to me. When I was thinking about what I wanted to do next, I really wanted to focus on finding a domain where I could do data science and not get bored of the entities under analysis. I had started my career on Wall Street and very quickly realized I didn't really want to think about money all day. Then, I moved to Facebook and pretty quickly realized I didn't want to think about how people navigate consumer web products all day. But I loved the software methods at both jobs. It was a weird thing. I really enjoyed my jobs, I just could not care less about what the product was at each of those jobs. Cloudera was always, to me, a waypoint where I was like, "Hey, I want to be able to do data analysis at scale. Tools don't exist to do that with open source software. This is our best hope of just getting some tools for doing data analysis at scale into the world, so I'm going to do that." But I do data analysis, I don't necessarily see myself making tools for data analysis for the rest of my life. In 2012, I started thinking about different domains where I might not get bored, and biomedicine was just a big, expansive domain where I thought there's a lot of sophisticated work happening, but the technical infrastructure was actually pretty limited. We had sold into pharma companies at Cloudera and they were some of the last to adopt modern technology stacks. 
We had partnered with some large academic institutions and I saw their infrastructure and it was very outmoded and slow-moving. So I thought, "Oh, hey, there's some things that we learned over here that could be useful over there and I probably won't get bored of what's going on." In 2008, when I left Facebook, I'd looked into the biomedical domain to do a startup and I had met a bunch of interesting companies at the time. This is like when 23andMe was getting started and — oh, there was another company that was just like 23andMe that I went and visited as well, I can't remember their name — so I got to know a group of people in the biomedical field who had started a nonprofit called Sage Bionetworks that was creating a shared infrastructure for storing and analyzing data in a pre-competitive, open source fashion. They asked me to come and advise them on data infrastructure and open source strategies as they were creating this nonprofit and, eventually, asked me to join the board. So I served on the board of that nonprofit and through that lens, I got to see and meet a lot of interesting people and it helped confirm for me that this was a field that I would enjoy working in. Ultimately, what catalyzed me moving into an actual role in that field was Eric Schadt, one of my fellow board members at Sage Bionetworks and one of the creators of Sage Bionetworks. He was recruited to run the Department of Genetics at Mount Sinai in New York City. I like New York a lot more than San Francisco. I moved to San Francisco from New York and I was very dismayed. I was like, "I thought this was supposed to be a city. Everything closes at midnight or 2:00." I don't remember. It certainly wasn't 4:00 AM like in New York City. It was so tiny and the public transportation was terrible and so I was always very underwhelmed with San Francisco as a place to live. It was so cold all the time, so I was very excited about New York City as a place to live, relative to San Francisco. 
I was excited about doing something in the biomedical domain with software and data. We were getting beers at the Nut House in Palo Alto, which I'm sure you know, and he was having me talk to people over there to just talk them through what they could build. He was like, "What would you think about just being out here with an actual position at Mount Sinai?" I thought about it and, ultimately, I was like, "Okay, that sounds like fun." We worked something out with Cloudera where I was like notionally part-time, so I was going back and forth between San Francisco and New York for a while. In the fall of 2013, it was really when I was like full-time in New York and started hiring people in the lab. So I had a year to just read a lot of textbooks, talk to a lot of people who are working around, play around with the software. I've always been autodidactic. I got terrible grades in school. It was always about reading and thinking more than it was about doing homework for me-
You got terrible grades, but you went to Harvard, right? How does that-
Yeah. So it's a little bit complicated. I had a good SAT score. I started getting terrible grades junior year of high school. I had, I guess, enough good grades to buoy my grades and my overall GPA. And I played baseball, so I ended up getting into Harvard primarily as an athlete and an SAT score and then a decent GPA. It was basically, like, once I hit 16 that I stopped caring about school. I think early Jeff was engaged enough to achieve a GPA that was not going to be fully dismissed by Harvard during the admissions process, thankfully.
Mm-hmm (affirmative). But yeah, I guess you've done an incredible job of quickly learning really hard topics, so that makes sense. So you got up to speed...I actually try to research all the people that I talk to and I was looking through your list of papers and I could barely parse the titles to them, honestly.
Yeah. You talk to people who are doing work, you read papers. Review papers are key for me. Finding a good review paper on a topic, and then figuring out who wrote it and then what their recent research is, and just finding kindred spirits, people who think like you do and being able to converse with them and interactively map a domain. I had had biology education previously. Thankfully, Harvard is a liberal arts education, so I had done courses on DNA and neuroscience and molecular biology, so I had the basics. So yeah, just reading a lot of papers and...software is a good angle. I used to reference a lot, John Tukey, who was kind of a proto-data scientist, and he has a quote where he said, "I love being a statistician because I get to play in everyone's backyard." I was using data science as a backdoor to problems. It was like, I could talk to people and figure out what they're working on and figure out how the software and algorithms and analytical methods they were using mapped to problems that I've worked on previously and use those analogies to move sideways from work that I had done outside of the biomedical domain into the biomedical domain. There's a lot of problems that you can find analogies for and choose methods for.
In particular, we were able to find a really cool problem in the domain of cancer immunotherapy. When I was moving into biomedicine in 2012...2011 was a milestone year for the approval of an immune checkpoint blockade drug¹. This was a drug, which rather than targeting anything related to cancer, was actually targeting the immune system. What it was actually targeting was...a T-cell is a cell in your immune system that's responsible for cellular immunity, for killing bad cells. Cancer cells are bad cells. T-cells were believed to be the mediator of the immune response to cancer. There was this protein...when a T-cell gets angry and starts wanting to kill stuff, it expresses an off switch because it's very important that you'd be able to turn T-cells off. T-cells are very destructive and your body needs to be able to resolve the immune response, and so the T-cell exposes this off switch. The notion behind immune checkpoint blockade is "Cancer might've figured out how to press that off switch. What if we basically covered up the off switch and we made it so that T-cells couldn't be turned off by cancer?" Perhaps that would cause the immune response to cancer to fully eradicate the tumor. It works for a shockingly high percentage of people. The most exciting thing about immune checkpoint blockade, at the time, was these Kaplan-Meier curves, these survival curves, where you could see that immune checkpoint blockade was raising the floor for long-term survival of patients. It wasn't just advancing survival by a few months or years and then, ultimately, everyone had the same 10-year outcomes. It was genuinely changing 5- to 10-year outcomes. Obviously, it took a long time to see that, but those results are holding and that durable response to cancer was wildly unusual, and then-
Is that something you worked on?
Ultimately, yes. When I came to Mount Sinai I had never heard of it, but there was a principal investigator at Mount Sinai named Nina Bhardwaj. Nina was a very successful immunologist who was pursuing a few ideas for ways of stimulating an immune response to cancer. One of the things that she was very early on was an approach called a neoantigen vaccine. This is a therapeutic vaccine. Most people think of prophylactic or protective vaccines, something you get so that you don't get a disease. Therapeutic vaccines are given to stimulate a specific immune response while you currently have the disease, with the goal of curing it. A neoantigen vaccine is a therapeutic vaccine. An antigen is a specific target of the immune response, and a neoantigen is an antigen created inside of a tumor cell due to the mutations that the tumor accrues. Cancer is a disease of the genome. The way that a cell becomes cancerous is that it accumulates mutations that equip it with behaviors that allow it to grow out of control. Often there's a positive feedback cycle, so getting additional mutations might damage your DNA repair machinery, for example, that then causes you to accumulate even more mutations. A lot of cancers have accumulated many mutations, and the more mutations you've accumulated, the more likely it is that one of those mutations has changed a protein produced by that cell in a way that causes that protein to become immunogenic, that is, to create an immune response directed against it. Neoantigens are those sub-sequences of amino acids inside of proteins that have been altered by mutations accumulated by the tumor cells, which create these novel or neoantigenic targets for cancer. The idea was, "What if we could sequence someone's tumor, sequence their normal tissue, look for mutations that are in the tumor but not in the normal tissue, and figure out which one of those mutations might generate an immune response for this particular patient?
For this particular patient, can we then synthesize a vaccine which will stimulate an immune response specifically against those neoantigens in their tumor, suited for their immune system?" Everyone's tumor is unique, but also, everyone's immune system is unique. If you ever have to do tissue or organ transplant, you get HLA typing done. Your HLA type is what effectively determines which amino acid sub-sequences of a protein your immune system cares about. So you had two inputs. You had the HLA type of a patient, and then you had the somatic mutations (that is, the mutations present in the tumor and not in the germline tissue), and those became inputs into a predictive algorithm that would predict, "These are the neoantigens most likely to generate a response." That was the data science problem that we identified embedded within this larger research. At the time, Nina's group was just leveraging a web server built by another group and they generated predictions for her, and so we looked at it and said, "Oh, hey. Maybe we can build you a better predictor of neoantigens." Ultimately, she was very trusting and allowed us to participate in the phase one clinical trial. Our group wrote the computational component of the clinical trial protocol, and ultimately administered the computational algorithms that generated the vaccines that went into actual humans. That was a pretty fun research project to be involved in. Ultimately, the software we wrote called MHCFlurry² is...so finally, we get to something that might matter to your listeners now that we're what, 48 minutes into the conversation. If you made it this far, machine learning happens here. So we ended up building a neural network that predicted neoantigens called MHCFlurry that's now one of the better approaches, and is still actively developed by several people.
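A note for readers following along in the transcript: the "in the tumor but not in the normal tissue" step described above can be pictured as a set difference. The sketch below is only an illustration under simplifying assumptions. The `Variant` record, the `somatic_variants` helper, and the coordinates are all made up for this example, and real pipelines rely on dedicated somatic variant callers that account for sequencing error and tumor purity rather than exact set subtraction.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """A hypothetical, minimal record for a single point mutation."""
    chrom: str
    pos: int
    ref: str
    alt: str


def somatic_variants(tumor_calls, normal_calls):
    """Return variants seen in the tumor sample but absent from the matched normal.

    Simplification: treats variant calls as exact and compares them as a set
    difference, which real somatic callers do not do.
    """
    return sorted(set(tumor_calls) - set(normal_calls))


# Made-up coordinates, for illustration only.
tumor_calls = [
    Variant("chr1", 1000, "G", "C"),   # also present in normal, so germline
    Variant("chr7", 1200, "A", "T"),   # tumor only, so somatic
]
normal_calls = [Variant("chr1", 1000, "G", "C")]

somatic = somatic_variants(tumor_calls, normal_calls)
print(somatic)
```

The somatic list, together with the patient's HLA type, corresponds to the two inputs to the predictor mentioned in the conversation.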
Two questions to make sure I understand what's going on here. So one, does this mean that every single person gets a slightly different medicine, based on...can you even do a clinical trial where everyone's...I always just imagine a clinical trial, everyone gets the same thing. I guess in this case, everyone gets the same process. Is that right?
Yeah, no. You've hit on a very fascinating question that has generated conversation at the FDA and that continues to this day, which is, when the therapeutic is an algorithm and not a molecule, how do you administer a clinical trial that can generate evidence that the algorithm itself can create better outcomes? Fortunately for us, they were pretty understanding and allowed the trial to go forward. I don't know how it's going to work. We were building what's called a peptide vaccine. The actual molecules that we put into patients were little sub-sequences of amino acids called peptides together with adjuvants, just general purpose immune stimulants to draw the attention of your immune system to those peptides. Peptide vaccines are very well understood as a therapeutic modality and widely considered to be safe. So I think that certainly helped, but the intervention under study in that clinical trial is an algorithm, not a specific molecule. It's different for every patient.
That's so cool. I guess the other thing...I hope I'm following all the steps here, but it felt pretty deterministic to me, like what's going on and then what intervention you would want. What's the part where you need a machine learning algorithm? I guess the way you were explaining it, I was thinking, "Oh, you look at the genome and see where the problem is and then you know the amino acid, and then you know the medicine that you need." Where's the uncertainty that requires you to use an ML algorithm versus, I guess, just some deterministic logic?
Sure. So the HLA type of a patient is a set of genome sequences for genes that code for proteins, which are highly polymorphic. That is, they're different across the population. There's at least six of these proteins that matter, and every person has a distinct repertoire of those six proteins. One input to the predictive model is the amino acid sequence of all six of those proteins, and that's pretty variable across the population. Then the other input of the model is a window of amino acids around every point mutation that occurs in your tumors that doesn't exist in your normal tissue. Cancers can accumulate hundreds, thousands, hundreds of thousands, sometimes even millions of somatic mutations in melanomas, which have the largest mutational burden. You end up with two sets of sequences as inputs to the neural network-
I'm sorry. What's the output of the neural network?
The output of the neural network is a predicted binding. I don't want to explain exactly what HLA molecules do, but effectively, your body chops up all the proteins in your cells, a subset of them, for processing and it chops them up into smaller fragments. Your HLA proteins bind selectively to a subset of those smaller fragments, which your body believes to be interesting to present for inspection to your immune system. What you're ultimately trying to predict is the binding affinity between peptide fragments generated from the proteins in your tumor cells and the HLA proteins that are specific to your immune system. So, ultimately, the thing you're predicting is this protein peptide binding affinity.
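A note for readers: the input and output described in this exchange can be sketched in a few lines of Python. This is a rough illustration, not the lab's code and not the MHCFlurry API. The helper names, the example protein string, and the scoring function are hypothetical. The peptide lengths of 8 to 11 amino acids are an assumption based on the typical length range for HLA class I binders, and `toy_predict_nm` is a deliberately meaningless stand-in for a trained model that would return a binding affinity, where a lower nanomolar value means stronger predicted binding.

```python
def mutant_peptides(protein, mut_index, mut_residue, lengths=(8, 9, 10, 11)):
    """Enumerate every window of the given lengths that covers the mutated position."""
    mutated = protein[:mut_index] + mut_residue + protein[mut_index + 1:]
    peptides = set()
    for k in lengths:
        first = max(0, mut_index - k + 1)           # earliest start still covering the mutation
        last = min(mut_index, len(mutated) - k)     # latest start that fits inside the protein
        for start in range(first, last + 1):
            peptides.add(mutated[start:start + k])
    return sorted(peptides)


def toy_predict_nm(peptide, allele):
    """Hypothetical placeholder for a trained predictor (not biologically meaningful)."""
    hydrophobic = sum(peptide.count(r) for r in "LMIV")
    return 50000.0 / (1 + hydrophobic)


def rank_candidates(peptides, alleles, predict_nm):
    """Score every (peptide, allele) pair and sort by predicted affinity, strongest first."""
    scored = [(predict_nm(p, a), p, a) for p in peptides for a in alleles]
    return sorted(scored)


# Made-up protein and mutation; the allele names are just example strings.
protein = "ACDEFGHIKLMNPQRSTVWY"
peptides = mutant_peptides(protein, mut_index=10, mut_residue="K")
ranked = rank_candidates(peptides, ["HLA-A*02:01", "HLA-B*07:02"], toy_predict_nm)
for affinity_nm, peptide, allele in ranked[:3]:
    print(f"{peptide}\t{allele}\t{affinity_nm:.0f} nM")
```

Passing the predictor in as a function keeps the enumeration and ranking separate from whichever model supplies the affinities.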
How did you get labeled data for this test?
There's a group in San Diego that generates the vast majority of the labeled data, and they've done a great job of curating it. There's something called the Immune Epitope Database³. It's a fairly difficult...we actually got to the point where I had a wet lab and I talked to the group in San Diego about generating measurements of our own to create labeled data and they were like, "It's not worth it. It's really hard. Just use our stuff." Later in the lab's life, some new techniques for generating labeled data from in vivo tissue came out, that used a different measurement paradigm. Some of the work that we did in the lab as I was leaving (and it was carried on by members of my lab in the new labs they worked in) was to leverage this alternative source of labeled data and bring it together with this early source of labeled data. There's a few different assays, all of which are pretty difficult to run, so we don't get super high throughput. The mass spectrometry data, this novel source of labeled data, often it's positives only, so you're not necessarily measuring...there's a lot of work that has to go in, and as you're very much aware, you don't just get to pick up a dataset and fit a model to it and call it a win. There's a lot of work that goes into massaging the training data to get it ready for machine learning.
Is it important for this task, or these kinds of tasks, to use modern techniques like deep neural networks, or do you think simpler techniques would also work pretty well?
One of my frustrations is we didn't write more papers about the work that we did because one of the theories that I had for this lab was to just hire a bunch of people from industry and see if we could turn them into academics. One of the hardest things to do with people from industry is to convince them that writing a paper is worthwhile, but we did. We tried a lot of cutting-edge approaches. One of the guys that worked on the problem early on was a guy named Alex Rubinsteyn, who's now, actually, a professor at University of North Carolina, Chapel Hill in a biomedical department, which is cool. He did a PhD at NYU in the whole deep learning craze, so he was pretty experienced with the models. We iterated through a lot of more complex...this is the era when LSTMs were becoming very exciting, sequence learning models. So, I think, I remember Lasagne was a library built on top of Theano. I think there was a guy Colin Raffel who was really good with it, and he came down and talked with us. I feel it was Alex Graves at DeepMind that had a lot of sequence-to-sequence learning. We went up to NeurIPS three years in a row as a lab and presented some work up there. We were definitely paying attention to what was happening in the state of the art for learning on sequences. It didn't make a huge difference. I remember trying out Siamese networks and things like this and it wasn't really moving the needle. I honestly don't know where they landed, what the current version of MHCFlurry is, from a neural architecture standpoint. But I want to say that nothing we tried that was more exotic made a huge difference. So, ultimately, I think mostly no for that problem. I should also say that the leading predictor prior to ours, for a decade, was a neural network. So this is a field where they already were using neural networks before the deep learning craze happened. It's not like we were coming into the field and we were like, "Hey everyone, neural networks."
They were like, "Yeah, of course, neural networks, we've been using... " We weren't trying to act like we were bringing fire from Olympus. It was like everybody was already using neural networks, but could you make better use of them? So embedding layers and things were relatively novel approaches. There were some ideas that we could bring to bear, but it wasn't just a slam dunk to just use the latest neural architecture.
What types of things are you working on now in your lab⁴?
Well, nothing actually. I'm on leave from my lab so-
Well, what are you working on? What are you up to?
Yeah. I went on leave from my lab in January of 2020 because I started a biotech venture creation firm with two of my friends, Adam Kolom and Jack Milwid, in mid-2019⁵. One of the things that I did with my lab...so I started my lab up in New York City and it was purely computational. But one thing that you learn quickly if you're running an academic lab is that it's difficult to collaborate in academia, and it's a lot easier if you own vertical research ideas rather than being a person who brings a skill into a horizontal research network. Those are just a lot harder to build, those horizontal research groups, and they're often built through pedigree like, "Oh, I did my PhD with this professor and so I'm going to work with you." I had zero pedigree, so I recognized pretty quickly that this theory that my lab could be this ally to many other labs was like, no one wanted an ally. So I had to build data generating capacity on my own. I ultimately ended up building a wet lab as well, and for a variety of reasons realized that academia was a better place for me to be doing basic science rather than translational science. So this neoantigen vaccine idea that we worked on when it was very early stage, ultimately there were several venture-backed companies that went public and had hundreds of employees working on it, including BioNTech, actually, which is the maker of the vaccine that I got for COVID-19, which was nice. It was a lot of...100x more resources could get put into that problem on the commercial side versus the academic side. So I decided to start angling my lab towards more basic science questions and doing mostly data generation with some computational work layered on top. We started working on things like optimizing protocols for genome editing in T-cells and growing organoids, which are small, 3-dimensional model systems to represent tumors in vitro that we could do more reliable experimentation upon.
We layered some computer vision work on top of that, which was pretty fun. We did some natural language processing work over the research literature as well, but the lab became more of a traditional biology lab than a computational group. But the other part of that idea was that, "Okay, my lab should become more basic, but I want to have some translational work." So the translational work I decided to funnel through this biotech venture creation firm that we created called Related Sciences. So yeah, for the last two years or so, I've been working mostly full-time on Related Sciences. The idea of Related Sciences is to use data to identify promising preclinical therapeutic opportunities and to create companies to then pursue those preclinical therapeutic opportunities.
Wow. Very cool. Awesome. Well, thanks so much for your time. It's such a pleasure to catch up and so cool, all the things that you've done. I love that I got a chance to hear all these stories, so...
Yeah, no, I wish...I could talk more about the fun machine learning tools and techniques we're trying out at Related Sciences some other time, but I'm always happy to talk about my history as well.
Yeah, I really appreciate it. We should do a follow up.
If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description, where you can find links to all the papers that are mentioned, supplemental material, and a transcription that we work really hard to produce, so check it out.