Jerome Pesenti — Large Language Models, PyTorch, and Meta
Jerome discusses recent advances in large language models and shares stories from his time as VP of AI at Meta, including leading the team that developed PyTorch.
About this episode
Jerome Pesenti is the former VP of AI at Meta, a tech conglomerate that includes Facebook, WhatsApp, and Instagram, and one of the most exciting places where AI research is happening today.
Jerome shares his thoughts on Transformers-based large language models, and why he's excited by the progress but skeptical of the term "AGI". Then, he discusses some of the practical applications of ML at Meta (recommender systems and moderation!) and dives into the story behind Meta's development of PyTorch. Jerome and Lukas also chat about Jerome's time at IBM Watson and in drug discovery.
Transcript
Intro
Jerome:
When people overbuzz AI, I ask them, "What did AI change in your life?" What did AI change? Really, truly. Don't tell me you set a timer on Alexa or Google. That's not life-changing. What was life-changing that came from AI?
Lukas:
You're listening to Gradient Dissent, a show about machine learning in the real world. I'm your host, Lukas Biewald.

Jerome Pesenti was VP of AI at [Facebook AI Research], which is one of the most exciting places where AI research is happening. Before that he was CEO of BenevolentAI, and before that he was VP of Machine Learning at IBM Watson.

So, he's had a long career and has seen a ton of different applications and lots of change in the state of the art in machine learning. This is a super fun conversation, and I hope you enjoy it.
Jerome's thoughts on large language models
Lukas:
The first question that's top-of-mind is just, with all the advances in large language models that we keep seeing — I know Meta had Blenderbot — I was kind of wondering if you have a point of view — or Meta had a point of view — on building a large language model differently than a DeepMind or an OpenAI, and how you think about that?
Jerome:
Oh, wow. You go right deep into the challenge there.

I would say the large Transformer models...I think at this point, it's not just a language model, right? Transformers and large models are starting to really be usable across multiple tasks. I think this is a trend that everybody is following: size, multimodality, more data, more self-supervision and less classical supervision, and trying to do multiple tasks at the same time. I think this is working really well. It's why people call them "foundation models". I'm not sure I agree with that term. So, I do think everybody's going in that direction and that's paying off handsomely.

Where I would say I'm a little bit more cautious is, I think these models have lots of problems. And solving these problems is not trivial, not easy. I would say there are two obvious classes of problems I've seen, and the people who are able to solve them will really be onto something interesting.

One is control. When you have these language models...I don't know how much you've played with Stable Diffusion or GPT-3. It's really, really surprising in the things it gives you, but sometimes it really doesn't give you what you want at all. It's not necessarily what you asked for. Sometimes it has big artifacts that show that it's not humanly generated. And it's not quite clear how you get rid of all this. There's this whole thing around prompt crafting. I think it's interesting, okay, but I don't think you can...I mean, it's kind of scary to say that there's going to be a new type of software engineering built on it, because it's so unreliable, you know. So that's the first piece: "How do you make all these models more controllable?", meaning you have a higher guarantee of what the outcome is going to be.

The second is bias. Obviously intelligence is about bias, but if you type something...I mean, the easiest way to see it is in these new image generation models. If you type "CEO", guess what you get. If you type "assistant", guess what you get. If you type "fast food worker", or if you type "banker". It's striking. I mean, it works. Like 100% of the time, you get extreme bias. And it means you can't really just use this in production. I think it would be terrible.

So, very exciting. I think everybody's seeing the trend there. It's working: scale, multimodality, multi-task, self-supervision. But, you know, these models are not very controllable and they have huge bias issues.
Lukas:
Do you feel like there are still intrinsic cognitive limitations, like a Gary Marcus might say on Twitter? Where do you sort of stand on the promise of this technique with Transformers?
Jerome:
I'm definitely...you have the spectrum of Gary Marcus on the left and you have people who are extremely enthusiastic talking about AGI on the right. I'm squarely in the middle.
Lukas:
Oh no, this is going to be a boring interview.
Jerome:
Yes, yes. I mean, I can tell you some things that are very, you know, controversial.

I think Gary really overdoes it, because the progress is undeniable. I mean, everybody who sees these systems is surprised. I've been in the space for more than 20 years and I look at the stuff and I'm blown away. If you had asked me a year ago, "Would we have made this progress?", I wouldn't have guessed it. I thought these tasks were harder.

But I think what happened is that the closer you get to human-level intelligence, the more you realize that the task is much harder. Some people are like, "Oh my god, we're going to lose our jobs as developers, as creators." No way that's going to happen. We're still a million miles away, because as soon as you make some progress, you realize that...as some people have said, the goalpost actually looks further away, because you realize intelligence is actually a much wider space. It's much more complicated. You realize that the system still makes very, very silly mistakes that humans wouldn't make, but it does things that you didn't think would be possible.

I am squarely in the middle, which is I don't think we are anywhere close to human intelligence. I also think that "AGI" is a bullshit term. It doesn't mean anything, because intelligence is, by definition, never general. And then I don't buy Gary, because you can't deny the progress. You look a bit like a fool if you deny that.

But it's such a much bigger problem than people imagine. As we said at Meta/Facebook, we're 1% done. And I really believe it, we are 1% done. We did go 1% of the way, and that's a huge accomplishment.
Lukas:
1% what?
Jerome:
1% to human intelligence. We've made progress. We've made real progress, right? But it's such...intelligence is so amazing, that you still have a long way to go.
Lukas:
But don't you feel like the stuff that we're building is starting to help build the next generation of that stuff? I kind of can't believe how well the code generation works. I've been using it in my VSCode.
Jerome:
That one is also super overstated.
Lukas:
You think so?
Jerome:
Absolutely. You are in software, right? I give you a piece of code, okay, and I tell you it's 99% accurate. What good does that do you...the problem is that generating code that's not accurate...I mean, sometimes finding a bug is way harder than writing the code from scratch, right?
Lukas:
That's fair.
Jerome:
I think the way to think of Codex and stuff like that is like an auto-complete. It's a very smart auto-complete, the same way when you write your email right now, Gmail does auto-complete. It can complete sentences, and it's quite smart, and it's quite impressive. And if you cherry-pick the results, it looks amazing and it's very surprising what it can do.

But, you know, it writes something, and then you have to say, "Well, is that actually accurate?" You don't have guarantees, and not having guarantees in code is a huge, huge problem, right? Really bug-free code is worth a million times [of just] code. It's not the size of the code that matters.

So, I'm really cautious on this one. I do think it's a useful developer tool. People will use it like they use auto-complete to write email. But it's not going to write...it's not going to put developers out of a job. No way. And especially...it's tricky when you write code, because you need to have guarantees.
Lukas:
Well, I certainly feel like it helps me write code faster. I imagine better versions of it could...it seems very far from putting someone out of a job, but it seems like it could make you work faster.
Jerome:
It may make you faster, but is it better or is it worse? You can write worse code faster, I'll give you that. That's for sure. Is it really allowing you to write...I think it will — I also believe it, right? — it will make people faster. But how much will depend on the validity of the code.

If you had a system that could guarantee you that the code is accurate, that would be a complete revolution. This is not what it is, right? Again, having guarantees and having control over the outputs is one of the big challenges of these models. Making sure that what it says is accurate, that's another thing. These language models, they hallucinate. Avoiding that is really, really, really tricky.
Lukas:
Going back to my earlier question, now we're seeing a whole bunch of different big models coming out that all seem functionally like Transformers. You know, trained on a huge corpus of...basically all text that anyone can find, as far as I can tell, and high volume.

Do you feel like the research is sort of converging on this one technique? Or do you feel like DeepMind and Meta have different strategies and points of view there?
Jerome:
Well, actually, you should have seen Yann's tweet a few days back. It's like, "Hey, it's weird. Nobody talks about reinforcement learning anymore." Which is...Yann had said — I don't know if you remember — "That means we don't really need the cherry anymore." I don't know if you remember this metaphor of the cake. The cherry is the reinforcement learning, supervised learning is the icing, and the body of the cake — the génoise — is unsupervised, or self-supervised, learning.

He really, I think, predicted that it would happen. And it is happening. From an information theory perspective, it makes sense. When you do reinforcement learning, you get very little information on whether you're right or wrong. It's kind of binary: "yes" or "no", you are going in the right direction. With supervision, you just use a label. And with self-supervision, you use the whole of the data, so maximizing the information you get out of the data is definitely the trend. I think that's where we're going. And, you know, you see self-supervision happening in every other field.

The flip side also is, Transformers are just working amazingly well, and scale is working amazingly well, and the combination of all these right now is a trend. I don't think we have a secret sauce that would be...or we "had" — as you know, I'm no longer there.
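The self-supervision Jerome describes is easy to see in miniature with a toy masked-language-model objective, where the training signal comes entirely from the data itself rather than from labels. A minimal PyTorch sketch (the vocabulary size, mask rate, and model dimensions here are arbitrary illustrative choices, not anything from Meta's systems):

```python
import torch
import torch.nn as nn

VOCAB, MASK_ID, DIM = 1000, 0, 64  # toy sizes; real models are vastly larger

# A tiny Transformer encoder: embed tokens, encode, project back to the vocabulary.
embed = nn.Embedding(VOCAB, DIM)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True),
    num_layers=2,
)
to_vocab = nn.Linear(DIM, VOCAB)
params = list(embed.parameters()) + list(encoder.parameters()) + list(to_vocab.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

tokens = torch.randint(1, VOCAB, (8, 32))   # stand-in for a batch of tokenized text

# Self-supervision: hide 15% of the tokens and train the model to recover them.
mask = torch.rand(tokens.shape) < 0.15
corrupted = tokens.masked_fill(mask, MASK_ID)

logits = to_vocab(encoder(embed(corrupted)))
loss = nn.functional.cross_entropy(
    logits[mask], tokens[mask]              # loss only on the hidden positions
)
loss.backward()
opt.step()
```

Every token position can serve as a training target, which is the information-theoretic point Jerome makes: the model extracts far more signal per example than a single label or a binary reward would provide.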
Lukas:
Right, interesting. Do you feel this concern that very few people will be able to do this training at large scale? What do academic institutions actually do in a world where the most exciting results are coming from very, very high-volume training?
Jerome:
Yeah, it is concerning. I can tell you that the costs of these systems and these models...I mean, just before I left, we put online one of the biggest superclusters out there. It's just extremely expensive. I can't tell you the cost, but it's staggeringly expensive.

So yes, it is worrisome and it does work. But I do believe that we are kind of wasteful in the way we do things today. We are not really optimizing. It was very interesting to see Stable Diffusion come out really quickly after DALL-E. I'm a huge proponent of open sourcing, of open models. I'm actually...Meta had done it with OPT-175B, but it was cool to see Stable Diffusion come out after DALL-E. Not only releasing it open source, but also shrink-wrapping it.

Now that I'm by myself, actually I've been running it on my own computer or on a Colab. It's pretty cheap and that's kind of cool. I haven't been able to train my own version yet, but at least it's a bit more manageable. But overall, I am a little worried. I'm not seeing how we can avoid this, given how well it works. But we also have efficiency gains we can make.
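For reference, running Stable Diffusion yourself the way Jerome describes takes only a few lines with Hugging Face's diffusers library. A sketch, assuming a consumer GPU or a Colab runtime (the model ID and prompt below are one common setup, not what Jerome used):

```python
import torch
from diffusers import StableDiffusionPipeline

# Download the weights once (a few GB); fp16 keeps memory within a consumer GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe("a kitesurfer at sunset, photorealistic").images[0]
image.save("kitesurfer.png")
```

Inference being this accessible is exactly the "shrink-wrapping" point: the expensive part remains the training run, not using the released model.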
AI applications and challenges at Meta
Lukas:
We always talk about sort of the practical applications here, and how they're different than research. Can you talk a little bit about that at Meta? What were the applications that really mattered to Meta, and how did that differ from the research interests?
Jerome:
Let me ask you a question because that's something I feel like—
Lukas:
Please.
Jerome:
—when people overbuzz AI, I ask them, "What did AI change in your life?"
Lukas:
In my life?
Jerome:
Yes, in your life. What did AI change? Really, truly. Don't tell me you set a timer on Alexa or Google. That's not life-changing. What was life-changing that came from AI?
Lukas:
That's interesting. I feel like my life is not that different than someone's in the 80s, but in that sense...I actually love listening to music with an agent where I can just request it by saying it. It's delightful, but I wouldn't say it's life-changing.

I mean, I assume that all the recommendation systems that I interact with probably guide me...I feel mostly happy about that. I remember when Amazon first came out with a recommendation system, it just felt so great. It was like, there's a whole world of books that I want to read that I didn't know about. That might be the most...I don't know. What do you think? You've probably thought about this more than me.
Jerome:
It's a good point. Actually, it's interesting what you say. I will challenge the first one...I don't think, for many people, "life-changing" is that you can ask something for music and it plays it.
Lukas:
Yeah, "life-changing" is way too strong. Yeah, sure.
Jerome:
But it is true. To answer your question, you guessed right. Which is, at a place like Meta, recommender systems are just hugely impactful. And in two areas. One is advertisements and the other is organic recommendation.

Just that...by the time I left, my team was a few thousand people and [it] justified the entirety of the budget by far, you know, multiple [times]. The ROI of investing in these systems at larger scale — especially, you can imagine, in advertisement — is really staggering.

If you ask me, that's actually kind of disappointing, if you think about it. The most successful application of AI so far has been advertisements. And I would say maybe the second-most successful has been recommender systems in apps like TikTok, for example. But it's kind of behind-the-scenes.
Lukas:
Well, wait, wait, wait. Actually, you're a search guy. Don't you think maybe...I should have said "search"? I feel like web search is incredible.
Jerome:
No, because web search came up without AI, right?
Lukas:
That's true.
Jerome:
The whole history of AI at Google...I would have liked to be a fly on the wall there. Actually, there was a...Sundar got interviewed by Kara Swisher just recently. He was talking about how much reluctance there was at Google to use AI in search. It's a fairly recent story, actually. And today, even some people...I mean, I do think actually AI is very useful in search, but I would put that in the category of "behind-the-scenes", where you don't really understand what it's doing. But it's also a late story. Whereas in recommender systems and ads, it came much earlier as a fundamental block. Whereas I think Google worked pretty well early on with traditional information retrieval techniques.

So, you're right. I mean, if you ask me to answer the question, recommenders are the big thing. The second big thing — especially when I was there — was moderation. Moderation at scale can only be done with AI.

You can look at the stats in the reports that are done every three months, but now we are up to the high 90s. Even though there are 30,000 people doing manual moderation — who pair with AI — the amount of data to process is so great that the majority of the first actions are done by AI, 95%-plus, for things like hate speech, or bullying, or a lot of complex problems. That doesn't mean it works perfectly, but it creates enough friction that I think it does make the system overall much better.
Lukas:
When you scale up to that massive volume, the massive volume of inference, what changes about how you approach a problem like that? Say, moderation at scale, trying to moderate everything that's coming into Facebook.
Jerome:
I don't know if you're asking in terms of the actual application or the support of that application.

Support of the application is very, very hard. I mean, the whole MLOps aspect is just...you know, and we could discuss that. It's really, really hard. I don't think in my tenure at Facebook/Meta we solved it. We solved some part of it, especially with PyTorch — I think it was a great success — but overall it's hard. All these systems that evolve quickly at scale: very, very hard.

On the other side, from a user perspective, scale is tricky because you can have the impression it works well. All our stats show, "Hey, we made a lot of progress. Since we introduced AI on hate speech, the amount of hate speech on the platform went down 3x." Unfortunately, that doesn't mean that's the experience of people, and it doesn't mean it's true for everybody, everywhere in the world. Very, very interesting problem.

The experience, for example, is very interesting. It doesn't matter if you match your policies and you remove hate speech; what matters, actually, is how people experience your product. And that's a very different story. And the experience of people depends a lot on where they are in the world. The language aspect, the cultural aspects are very, very important there.
The story behind developing PyTorch
Lukas:
It's interesting that you say...actually, I was kind of curious about both sort of the technical and non-technical challenges, but since you bring up PyTorch, I would not have thought that PyTorch was something that you think of as sort of helping with the operations. I feel like when it came out, it seemed oriented more towards research, but I guess maybe I'm wrong there.
Jerome:
Oh, yeah. That's a long story. I can tell you a little bit of the story, how it happened.
Lukas:
Tell me the story, please. Yeah.
Jerome:
Yeah. So when I joined Facebook at the time — right in 2018 — the company had decided to go on a dual path with PyTorch, Caffe2, and ONNX in the middle. I thought, "That's just such a hack. That's a non-decision."

I think the decision was made two months before I arrived. It's the one thing...usually when you join a company like this, you do not want to make decisions early. This is one decision where I told the team...actually, I didn't say, "Hey, we should do PyTorch." I told the team, "No way we're going to do this."

We needed...from experience, I knew that we needed to be on a platform that had community support. So I told the team, "Okay, you're going to have to pick one framework that we know will have traction in the community." They were honest, and they knew that that could not be Caffe2 at the time. The community support there had really dropped. PyTorch was a rising star, but not production-ready. And really, the only one that had all these aspects was TensorFlow at the time.

But the team was convinced that the model of PyTorch was better, and allowed more dynamic graphs. So they came back and said, "Hey, we think we can make it happen. We can make PyTorch a contender, both on the research front and the production front." And that's where the company bet.

For the past four years after the decision, we've been moving almost everything at Meta from Caffe2 to PyTorch. People love PyTorch, so it's not actually a hard thing to convince people. It's just amazing. It's a better tool to do exploration. But it didn't mean we had all the MLOps around it. And to this day, we are still trying to really figure it out. It's not easy, but it was the right choice. PyTorch, as you surely have seen, is just a product that people love. And you want to start from that. That gave us a lot of traction; it was the right direction. But it still lacks a lot of the infrastructure around it. And there are a lot of reasons for that that we could discuss at the end.
Lukas:
Do you have a theory of why it's so loved? Because we watched this firsthand. When we started Weights & Biases, TensorFlow had a clear lead. And we watched PyTorch overtake it just in our own logs. It was a really dramatic shift.

It's funny because from my perspective — and I've dabbled with both — they seem pretty feature-comparable to me. I mean, in the early days, obviously, PyTorch had the just-in-time generation of the graph. Do you have a theory about why PyTorch seems like it was so much better loved?
Jerome:
Yeah, I'll give you another little anecdote. I remember the reason I felt strongly about this when I joined Meta: before I joined, in my team, we had the same problem...at the time you had Theano, you had other systems. We were a small team — I was in a startup and we were a small team — and we already had a few frameworks. I said, "We can't do this. We've got to agree on one." And I think we agreed on one; I think it was TensorFlow. And six months later, they're like, "No, no, no, we've got to use PyTorch. No way we can..." And I'm like, "We made a decision!" We went to PyTorch, and I'm like, "Okay, there is something there."

I actually think that the reason is simple. The people who developed PyTorch — Soumith in particular — had a design mindset. If I were...the mantra, it was actually user-centric design. It's funny because I think the people who did it didn't necessarily know they were demonstrating they knew the terminology [?], but it definitely had the researcher in mind and what they wanted to do. And you can feel it.

The problem with TensorFlow is that it was retrofitted. So even if now, because of the influence, it's there — it has been plugged in on top — it still feels like it's cobbled together. It's hard to acquire the love, you know. You can lose it; it's hard to gain. So it's really about user-friendliness...researcher-friendliness, actually. I think also the fact that research is driving the narrative in AI today. It's not a stable field, right? That really put PyTorch at the center of that universe.
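The "just-in-time generation of the graph" Lukas mentions is PyTorch's define-by-run model: the autograd graph is rebuilt on every forward pass, so ordinary Python control flow just works. A toy illustration (the data-dependent loop is contrived purely for demonstration):

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x

# Data-dependent control flow: how many times this loop runs depends on the
# values in x, so the recorded graph differs from run to run. In a static-graph
# framework this would need special graph-level constructs (e.g. a while_loop op).
while y.norm() < 10:
    y = y * 2

loss = y.sum()
loss.backward()     # gradients flow through however many iterations executed
print(x.grad)
```

For researchers iterating on odd architectures, being able to debug with plain Python (print statements, breakpoints, exceptions at the failing line) is a large part of the "user-centric design" Jerome credits.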
Lukas:
What were the important pieces that you had to put around it to make it really work for you in a production environment?
Jerome:
The challenge with PyTorch...actually, the really complex stuff is that it's almost like an anti-pattern. Let me try to explain that. I think there's this saying that "early optimization is the root of all evil." But the challenge with something like PyTorch is that you need to do early optimization. You don't have a way around it. Why? Because you need to create a system that gives a lot of flexibility to users to do a lot of things, yet is optimized. Because scale matters; efficiency and speed matter.

So you have this constant challenge — especially with the operators internally — to have things that really follow the trend...like, if you couldn't do Transformers today in PyTorch, but it would be awesome in everything else: forget it. Nobody will use it, right? So very quickly, when you see where the trend is going, you have to go and put in very good operators, and you need to optimize them. It's constant progress; they are doing this. That's one challenge.

The other challenge is we had to give that team...I'm really a big believer in focus, and in this case, it was a constant balance. I said, "Hey, look, you have two focuses. I cannot make it simpler for you, and you cannot screw it up." One is you cannot screw up the external community. You have to create something that people will continue loving. You cannot make it bloated, right? The problem when you start creating enterprise software or production software is it becomes bloated, it becomes difficult to use. You can't do this. At the same time, you have to make it work for us internally. It has to have all the production aspects. It has to be deployable, it has to be production-ready, which most people in the research community don't see, don't understand.

We had to have these two objectives. And that's hard. The team suffered through it, but I think they actually did quite an amazing job at keeping it, because ultimately Meta is going there. It will be 100% PyTorch in the very near future. And I think the community still loves and adopts it.
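A concrete later example of exactly that operator work: once Transformers took over, PyTorch shipped a fused scaled-dot-product-attention operator (public API since PyTorch 2.0, after this conversation), so the hot path is one optimized call rather than a hand-written chain of matmuls and a softmax. A sketch with arbitrary illustrative tensor shapes:

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim): arbitrary illustrative sizes
q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)

# One fused operator (FlashAttention-style kernels underneath on GPU)
# replaces the matmul -> softmax -> matmul chain you'd otherwise write by hand.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```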
Lukas:
Was there some experience that you were talking about that made you understand the value of community support? Were you using something at a different company, where it didn't have the community support? You just mentioned that a couple times, that it's so essential to use technology that the community believes in.
Jerome:
Yeah, because I've seen companies be stuck in a dead end. Actually, you could almost argue — maybe they're going to hate me for this — but PHP and Hack at Facebook is a really tricky one. They kind of own it. Facebook is so big that I guess — Meta is so big — they can own it. But I really think this is not very good. I think you see it dying on the vine and you are adopting a technology that just doesn't progress anymore.

I've seen it for many systems. I would say all the big data systems, the containerization systems. You can see there's always one winner, and if you make the wrong choice, you're stuck at some point moving off of it.
Jerome's experience at IBM Watson
Lukas:
Right, right. I thought you were going to maybe mention IBM Watson. I'm kind of curious what that experience was like.
Jerome:
That is a very different story. I can tell you more about this.

I think what...I mean, the good thing for me is that I went there through an acquisition. I had created an AI company and IBM acquired it. It was great for everybody. I was very happy. Actually, I think when IBM created the Watson unit, that was a bold move. It was really about saying, "Hey, we believe there is a commercial potential in AI." That was 2013. At the time, actually, not many people were talking about AI. The deep learning revolution came around in 2011, '12. People were saying it was coming.

Actually, Jeopardy! — the challenge when they did it with Watson — did not use deep learning, which is kind of interesting. It's a bit of a dirty secret. It used very little machine learning. It used traditional NLP and managed to get something very good. They made this big bet on it. I think it was really — obviously — the right bet. It was early and it was good. But there were challenges, right?

The challenge is that you had to be patient. I tend to say, "You need to be impatient for profit and patient for revenue." And IBM did the opposite. They were impatient for revenue and patient for profit. They did a lot of these very large engagements, promising the moon, where you may spend $10 billion to make $1 billion. That's not a very good business.

What I was focused on when I was there was to really try to shrink-wrap AI and put it out as cloud services. At the time, we came up with this idea of putting AI in the cloud as services to do speech, to do conversation. To this day, I think that's still the majority of what Watson is doing. I think it was very ahead of the game. But the only problem is IBM didn't have much of a cloud. I felt a little bit stuck when I was there, because I thought it was the right strategy and we were getting traction, but I was building on infrastructure that's not as robust as if you were on Amazon or Microsoft.
Drug discovery, AI, and changing the game
Lukas:
And then you went into drug discovery, didn't you? It's super hot now, I feel like. Is that right?
Jerome:
Yeah, yeah. I got recruited to be the co-CEO of a company called BenevolentAI. I think it's a fascinating field. I'm a huge believer that it will happen. You can see there are a lot of promising things happening in AI. Even at Meta — in the research team, FAIR — we were doing things around understanding the function of proteins, looking at making predictions around free energy on small molecules and catalysis. Very interesting stuff you can do with AI today.

Now, that said, it hasn't really completely changed the field. I actually think that drug discovery needs a bit of what I would call a "Tesla revolution", which is you need a tech company to take it head on. But it has such a huge amount of domain knowledge that it's a very hard problem. It's similar in some ways to what Elon did with Tesla. It takes 15 years to understand what it takes to build a car. And I think drug discovery is even bigger than that. It's even more complicated.

But the decision process of these companies — when they approach technology — they're saying, "There's no good model out there, but some models are more useful than others." Okay, that's what they say out there. The reason the models are "more useful" is because they just use them to justify the decisions they had made before. That's the way drugs are made these days. A lot of decisions made, not a lot of data to support them. A lot of influence; you have a concept called a "key opinion leader". That's how decisions are made there.

I'm not a big fan of influence and authority. That's not, I think, how a business should be run. But that's how it is right now. I'm really looking forward to a big disruption, and maybe I'll get involved in this again.
Lukas:
That would be cool. When we started Weights & Biases, we didn't think that we'd have many pharma customers. And now, we work with most of them. So it does seem like at least the pharma companies believe pretty strongly that there's something there for deep learning to help with drug discovery.

Do you have a sense for what the breakthroughs have been that have made things like AlphaFold work well?
Jerome:
Well, there are different challenges.

What I find remarkable is that — and I still don't quite understand it — it does seem that deep learning, and especially even the Transformer architecture, for example, is kind of able to understand the grammar of things. Of images, of text, but also of proteins, for example. At Facebook — at Meta — we had a project where you just feed hundreds of millions of proteins to a language model, and the system from there is able to predict function pretty well. Without having seen anything, with very little supervised data.

It's something that I'm just not sure I understand, because it's not like a brain understands molecules, right? That means there's this generic computation that works well in so many areas. And it still blows my mind. I understand that it can do it for language and for images, because humans can do that. But humans can't understand...can't fold molecules or understand their functions. So, why is it working? Why can you predict...why can you do quantum calculations better with...? I don't know. It's really, really interesting. It seems to me like this is something generic, even more so than human intelligence.
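The Meta project Jerome describes sounds like FAIR's ESM line of protein language models. Assuming that's the one, here is a sketch of pulling a sequence-level representation from a pretrained ESM-2 checkpoint with the fair-esm package (the model choice and sequence are illustrative):

```python
import torch
import esm  # pip install fair-esm

# A pretrained ESM-2 protein language model, trained purely by self-supervision
# on millions of protein sequences, with no function or structure labels.
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("protein1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33])

# Per-residue embeddings; averaging (and dropping BOS/EOS tokens) gives a
# sequence-level vector that downstream models can map to structure or
# function with comparatively little labeled data.
embedding = out["representations"][33][0, 1:-1].mean(dim=0)
print(embedding.shape)
```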
Lukas:
Yeah, it does seem like an opportunity to do something that humans really can't do.
Jerome:
That's the case, yes.

But there are lots...back to your question, there are actually lots...you have the chemistry, you have the biology, you have the clinical trials, you have patient data. There are actually many, many stages. There is the target identification. For BenevolentAI, one of the big things we were doing was trying to mine the literature to come up with new graphs, find new relationships, new targets. It's very, very early in the game. Then you have companies that try to figure out, "Okay, given a target, what are the right molecules that can affect that target?" Can we do some AI-assisted chemistry there? And then there are people who try to understand the biological aspects better, like how docking actually works. And then you have the patient data and you have the imagery of the patient data. How can you understand it? Can you deduce things from there? Can you combine that with genetic information? Actually, there are really literally dozens of places where it can have an effect.

I was talking to a friend of mine who just started a company to think of how to design...I think he called it "promoters". So, not the piece that's active, but the thing that comes first [?] in an RNA-based [?], the thing that's going to say how much is going to be...how potent it's going to be. The little code that you don't pay attention to in DNA that usually tells you how much is used and how much the cells can be affected. I had no idea this thing existed, but you need a code for there, and it's a few hundred amino acids there. Using AI for that might be very good. The advice I gave him was like, "Hey, go use Transformers. I bet you they're going to...train them on DNA. They'll figure it out..." But I don't know about it.

Anyway, there are a lot of aspects of the process where it can help. I would say dozens.
Lukas:
It sounds like something that you're excited about right now and looking into?
Jerome:
Yes, it is. Yeah.

But I really...what excites me is, "How do you get..." I'm convinced that you're going to see a lot of what we call "business processes" be improved throughout the industry. I think you're going to see...it's slow, by the way. You're going to see companies adopt [AI] for parts of their processes, like insurance companies and banking and healthcare. They're going to take little blocks. They're going to work with these B2B companies and they're going to adopt it.

What I'm more excited about is, how do you change a field entirely? You have transportation. Obviously a lot of people are trying that, with self-driving cars or other kinds of self-driving. Maybe that's going to come first. You have healthcare and you have drug discovery, paired. I think you have education as well, which could be completely transformed. But I'd love to do something that doesn't just take the current companies and incrementally improve them — which I think is what's going to happen naturally — but changes the game.

I think in drug discovery, you can change the game. You can change the decision process. You can change...the attrition that you have right now that makes a drug cost $1 billion will be diminished by 10x.
The potential of education and AI
Lukas:
I totally agree with you on drug discovery and autonomous vehicles. You'd be blind not to see the opportunity there and the success that folks are having. But I don't actually know that I've seen a ton of success in education. It seems like a surprising...it seems like education actually has the least amount of technology inserted into it.
Jerome:
Yeah, I agree with you. It's a field I'm very interested in; I've been looking into it.

The way I put it...I actually just wrote a little position document on this pretty recently. The way I put it is that, in the war for attention, education is completely outgunned today. If you are a teenager, do you want to go to a boring lecture, or do you want to go on TikTok and see stuff by millions of creators that really is adapted to your interests, and understands what you like, what makes you...a system that gets you versus a system that's static? You know, the same way of educating as 500 years ago.

It doesn't mean there's no opportunity there. I think there is. But culturally, it's also a difficult field. I think of it...the way I put it is, look what's happening on TikTok. Kids go on TikTok...my daughters, they send me stuff like, "Oh, look at this guy, he teaches me math on TikTok." I'm like, "Come on." That's entertaining. I'm not sure that's the way to do it, but it shows you the potential to make it a lot more engaging. You have to engage the user. You have to make it compelling to them. I think there are techniques and there's AI to do that. I think we understand that pretty well, actually. That, I think, is an opportunity.
Lukas:
Interesting.
Jerome:
More to come.
Lukas:
Excited to learn more about this. As someone who likes to learn...I actually think YouTube has become such an incredible educational resource. Even on deep technical topics. And I think the voting is surprisingly effective too. I would have thought that it would be hard for really good educators to sort of bubble up to the surface on very advanced topics, but it seems like it's a pretty good...I don't know.
Jerome:
I agree.
Lukas:
The algorithm, I guess, on YouTube is working well for me. I've been learning more math.
Jerome:
I agree. And you know, when you look at...I think that's the thing that...I'm not sure it works for younger students, but I think for adult education, for high school education, a lot of them start bypassing the traditional way and go to YouTube. But YouTube is also not an educational platform, right? There are other ways to learn. Personally, I love learning through practice and through exercise.
Lukas:
Totally.
Jerome:
I think people have different styles. I have a hard time staying in front of a lecture. I love practice and I love something that...the frustration I have with all the education systems today is that they don't start by constantly evaluating you. What are my gaps, what do I need to practice next? What's the optimal thing that I can do next?

A lot of systems today really ask, "What is the best next thing I can show you?" That's how TikTok works. So, what is the thing that's going to make you really, really want to come back on TikTok? I don't think education works like this today. What is the thing that's going to make me more informed and want to stay and continue that course?
Lukas:
Well, I hope you work on this. I'm...
Jerome:
We'll see. You think drug discovery is complicated? Oh my god, education is also complicated. That's the problem, you know. Healthcare, education, drug discovery: all these complex fields are hard to disrupt.
Meta and AR/VR interfaces
Lukas:
Right, right, right. Some other questions I had...I was wondering, Meta has made this huge bet on augmented reality, as far as I understand. Do you think that machine learning has a role to play there, or has caused some of the interest in AR or VR? It's not a space that I understand super well, but...
Jerome:
Yeah, and it has a...let me give you a framing for it.

The challenge with this new kind of interface...let's assume — which is not a guarantee — that it's going to be a set of glasses that you put on your head. And let's say it's going to be the next platform. Because let's be honest, I think phones are an amazing invention, but they're kind of a frustrating invention. You have a little screen like this. You see yourself always on that little screen. My prediction is that in 30 years, people are going to look back and say, "My god, this is like the Stone Age of interfaces."

So, something is going to change it. The challenge with glasses is that it's not an imperative interface. I'm not typing. In some ways, a phone is a little bit less imperative than a computer or a keyboard. With a keyboard, you're clearly telling the computer what you want. When you type, you type the key. There's no ambiguity there. I think the touch screen was a little bit more of an implicit interface. It's not exactly sure what you're saying...it's actually using a little bit of machine learning underneath to figure out what you're typing. But it's not groundbreaking machine learning to figure out which exact word. And it's actually using some of these language models when you type on your keyboard.

But imagine now you have glasses, right? There's no input. So, what is it? One of the obvious ones is voice, but it's very likely that it's not going to be just voice, for sure. It's going to be gestures. It's going to be motion. One thing that Meta is working on is a little bracelet; they acquired a company that did this. I think it's very, very interesting. You can maybe type in the air or move your finger silently. There's going to be motion. There's going to be trying to understand your intent.

The problem with glasses is you don't have a keyboard. You can't enter information. You can't tell the glasses what you want, so you'll need to have a rich interface that understands you. And so AI has to play a role there. It's a very challenging role. It's creating a contextual interface that understands all the context around you and lets you really direct the system you have on your face.
Lukas:
This is probably a speech interface, I'm guessing.
Jerome:
Speech, the problem is that...speech is part of it. But our guess is — our guess was — that speech may not play as big a role as you think it will. I mean, when can you really speak to your phone, right? Like Siri. How often do you use it? I never use it. So I don't...
Lukas:
Yeah, I never use it also.
Jerome:
Yeah, I never use it either. Because it's awkward, right? I'm in the middle here, I'm going to talk to my phone like this? Actually, talking to the glasses, while it's possible — I don't know if you saw, Meta came out with the Ray-Ban. My team actually did the speech for it. It's nice, it works well — but actually, there are not many places where you want to do this. Maybe you want to do more motion. Your gestures, other things, a combination of all these things. Tap, you know. The interface will be a lot more complex, more multimodal, than we assume. It's not going to be just speech.
Why NVIDIA is such a powerhouse
Lukas:
Interesting. Okay. Another totally different question that I had — that I was wondering if you had a thought on — is, one thing that's been really striking is NVIDIA's total stranglehold on the training market. I mean, there's some stuff coming out of Google, but it doesn't seem like it has tons of traction, at least in training.

Do you have a sense for why that might be? It's lasted a lot longer than I would have thought. There are lots of startups competing and people working on chips, but somehow it just doesn't seem to move.
Jerome:
Oh, I know. I would say I know all about it.

Remember what I told you earlier, which is that these things are very expensive, right? And when you have a sole provider, it's very complicated and it's very expensive. Thankfully, now the crypto market went down, so I think it's going to be a little nicer for GPUs. But it did feel at that time like a racket, what we were paying for these GPUs.

But the flip side of that is NVIDIA is very good. And they're very good not just because of the GPUs. I think the GPU — especially when you come from more of a PyTorch exploration mode — works well. It's very multipurpose. I think it's very flexible. That worked really well for us. But the thing also is, NVIDIA got the software really, really right. They work with us amazingly well. They have very competent people creating it. That's a coup de [?], and it's hard to replace.

I'll tell you, at Meta, how I wanted...I threw some money at other people to say, "Go do it, or we'll do it for you." You've got to be able to compete, you know. But software is hard, and they are very talented and they do a great job. And that's what got them there. They just have the best software...they have great hardware and the best software stack on top of it. If you're serious, it's still the best in town. Even if you compare to the TPU, the benchmarks are comparable, yet the GPU is way more flexible. So unless you have certain workloads — I think it works well for ads at Google — where the TPU can be competitive, for the rest, actually, the GPU is still the best game in town, and they have a great software stack on top.
Lukas:
You would think more specialized systems would work in more specialized cases, wouldn't you? It's kind of amazing that the flexible system also seems to function the best for almost all these cases.
Jerome:
Yeah, but think of this thing, the challenge we had, right? Imagine you try to design a chip, and you design it when the big games in town are CNNs and LSTMs, and a lot of...in recommendation, it's a lot of sparse networks. And then you wake up three years later and everything has changed. The game has changed, and now it's Transformers, and actually dense networks are starting to be really relevant for recommendation as well. You design your chip and it takes five years to get it out. So by the time you get it out, you know it's already over. Which many people are doing and have done as well.

It's very hard. It's the problem I told you about, this early optimization. If you don't keep your options open — while still optimizing what you have — you may end up in a dead end.
Jerome's advice to people starting their careers
Lukas:
Interesting. Well, cool. We always end with two questions, but I guess before that, I'm just kind of channeling all the students that we always get in comments, wherever we post these. You've had this very enviable career in machine learning, and we have so many students that use our software and watch these interviews. Do you have any advice for students coming out? What would you work on if you were just sort of entering the field out of undergrad or grad school? How would you think about that?
Jerome:
Well, I would not...I'm not going to give you specifics, but I'll give you a little story that I got from a guy who used to study ants. He just died recently: E.O. Wilson. He invented a really interesting concept around evolution and he wrote a little book, "Letters to a Young Scientist". He says, "When I was young, I came out and..." He was in his PhD, and he decided to focus on ants. The amazing thing is, at the time, that sounded like a very crazy idea. Obviously, we know now that ants as a society are very important. And he became the world's specialist in it, world-renowned.

What I tell people — especially in science — who come out is, "Don't be afraid of going for something that you own, that's your own thing. Go for it. And be bold about it. And actually, don't assume that everybody has done everything. There's a lot of opportunity for you to own it, and go for it, and be focused on it." That's what I would advise. I think this is a very wide space. There's a lot of space for everybody. Be bold, be ambitious.
Going back to coding, the challenges of scaling
Lukas:
Fair enough, all right. The last two questions are...one is — and it seems like you're kind of doing this — but if you had extra time to work on something, what would it be?
Jerome:
It's what I do now. I told you, I do kite-surfing.
Lukas:
Yeah, totally. But if you weren't kite-surfing all day long, what would you be looking into?
Jerome:
Well, for me, I'd do two things, because...one is, I was a goddamn manager for the past, like, 10 years. I think the last time I coded was before my company got acquired, and I love coding. So I'm going back to coding, I'm going back to getting my hands dirty, really understanding...As much as my team developed PyTorch, do I really understand it? Do I understand how it works? I'd spend more time doing this, and that's a lot of fun.
Lukas:
I love it.
Jerome:
I think Karpathy, just coming out of Tesla, said the same: "My skin is cleaner, I sleep better." Dealing with technical problems rather than people problems is always a big boost. That's what I'm doing: really, really staying up to date. I feel it's really critical to understand. My next stage is, "Okay, I want to write a Transformer from scratch. What is that? What is in it?"
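For anyone tempted by the same exercise, the core of "a Transformer from scratch" is surprisingly small. A minimal single-head self-attention in PyTorch (sizes are arbitrary; a full Transformer adds multi-head splitting, an MLP block, residual connections, and layer norm):

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention, the heart of a Transformer."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        # Every position attends to every other: (batch, seq_len, seq_len) scores.
        scores = q @ k.transpose(-2, -1) / math.sqrt(x.shape[-1])
        weights = scores.softmax(dim=-1)
        return weights @ v

x = torch.randn(2, 16, 64)            # batch of 2 sequences, 16 tokens, dim 64
print(SelfAttention(64)(x).shape)     # torch.Size([2, 16, 64])
```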
Lukas:
Nice.
Jerome:
The second one I'm trying to do is really try to evaluate where the big opportunity is. For me, I feel like, "Okay, I've done the B2B startup, I don't want to do another one like this." I want to try to see, "What's going to be the big revolution here? Is it going to be drug discovery? Is it going to be transportation? Is it going to be education?" I'm going to pick one, I'm going to make a bet, I'm going to go for it. Maybe I'll fail, maybe there's a 1% chance I'll succeed. But at least it'll be worth it.
Lukas:
Nice, I love it. Final question: when you think about taking a model from research to deployed in production and useful, where do you see the major pitfalls? Where are the pitfalls that might be surprising to someone who is just a researcher?
Jerome:
Oh my god, it's so complicated. It's actually really...it's something I feel like we haven't figured out. I mean, I'll reverse the question, which is, "What makes DevOps good?", right? You want something that's reliable, that scales, that you can test.

Testing in AI is hard, actually. How do you test? You can have tests that are very close to the model, or you can have downstream tests. Imagine you change the speech recognition and you have 20 systems with 20 layers on top of that. How do you test the last system and what depends on it?

"Reliable"? Well, these systems...we claim they are deterministic, but they are not, actually. A lot of behaviors are really weird, and you cannot actually completely reproduce them, right?

And then scale. These things keep scaling. Every year at Meta, we were like 10x bigger, and it wreaks havoc on all your assumptions. It's really, really hard. It really breaks...the assumptions you want to have to create this, they're just not there. I don't think we have figured it out. I think it's still a work in progress.
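On the "we claim they are deterministic, but they are not" point: PyTorch does expose switches that chase down most sources of nondeterminism, at some speed cost. A sketch of the usual incantation (this covers seeding and kernel selection; it does not address nondeterminism from distributed training or data loading):

```python
import random
import numpy as np
import torch

def make_deterministic(seed: int = 0) -> None:
    # Seed every RNG a typical training job touches.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)          # also seeds all CUDA devices

    # Force deterministic kernels; PyTorch raises an error if an op has no
    # deterministic implementation, rather than silently varying between runs.
    torch.use_deterministic_algorithms(True)

    # cuDNN benchmarking picks the fastest conv algorithm per run, which can
    # change results between runs; pin it down.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

make_deterministic(42)
```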
Outro
Lukas:
Awesome, well, thanks so much. This was super fun. I really appreciate your time. Thanks, Jerome.
Jerome:
Thank you so much, Lukas.
Lukas:
That was great. Thank you.

If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description, where you can find links to all the papers that are mentioned, supplemental material, and a transcription that we work really hard to produce. So check it out.