CEO Sean Gourley on NLP, national defense, and establishing ground truth

Sean and Lukas discuss NLP, working with vast amounts of information, and how crucially it relates to national defense. Made by Angelica Pan using W&B

Guest Bio

Sean Gourley is the founder and CEO of Primer, a natural language processing startup in San Francisco. Previously, he was CTO of Quid, an augmented intelligence company that he cofounded back in 2009. And prior to that, he worked on self-repairing nano circuits at NASA Ames. Sean has a PhD in physics from Oxford, where his research as a Rhodes Scholar focused on graph theory, complex systems, and the mathematical patterns underlying modern war.

Show Notes

Topics Covered

0:00 Sneak peek, intro
1:42 Primer's mission and purpose
4:29 The Diamond Age – How do we train machines to observe the world and help us understand it
7:44 a self-writing Wikipedia
9:30 second-time founder
11:26 being a founder as a data scientist
15:44 commercializing algorithms
17:54 Is GPT-3 worth the hype? The mind-blowing scale of transformers
23:00 AI Safety, military/defense
29:20 disinformation, does ML play a role?
34:55 Establishing ground truth and informational provenance
39:10 COVID misinformation, Masks, division
44:07 most underrated aspect of ML
45:09 biggest bottlenecks in ML?

Transcript

Note: Transcriptions are provided by a third-party service, and may contain some inaccuracies. Please submit any corrections to angelica@wandb.com. Thank you!
Sean Gourley:
We need to train machines up that can help us establish ground truth so that when new information becomes available, we can measure it up against that and say, "Is this consistent, or is this contradictory?" Now, just because it's contradictory to the ground truth doesn't make it false, but it does mean you want to look closer at it. And this is kind of I think as we build up defenses for democracy, we need ... And I've talked about this, a Manhattan Project to establish ground truth.
Sean Gourley:
It's going to take a lot of work and a lot of effort, but it's very, very hard to see a democracy functioning if we can't establish information provenance, if we can't establish whether information is likely to be part of a manipulative attack. And if we don't have any infrastructure to kind of lean back on and say, "Well, here's what we do know about the world, and here's what we do understand with it." And so this is a big problem I think for democracies, and we need a way around it. And it's an asymmetric fight, but it's one that we have to win.
Lukas Biewald:
You're listening to Gradient Dissent, a show where we learn about making machine learning models work in the real world. I'm your host, Lukas Biewald. Sean Gourley is the founder and CEO of Primer, a natural language processing startup in San Francisco. Previously, he was CTO of Quid, an augmented intelligence company that he cofounded back in 2009. And prior to that, he worked on self-repairing nano circuits at NASA Ames.
Lukas Biewald:
Sean also has a PhD in physics from Oxford, where his research as a Rhodes Scholar focused on graph theory, complex systems, and the mathematical patterns underlying modern war. I'm super excited to talk to him today.
Lukas Biewald:
So Sean, it's great to talk to you, and I really appreciate you taking the time. The first thing I want to ask you, since you're an entrepreneur and so am I, is tell me about your company, Primer. I'm sure you want to talk about it.
Sean Gourley:
We're a company that specializes in training machine learning models to understand language, to replicate different kinds of human tasks that run on top of language, everything from identifying key bits of information, to summarizing documents, to extracting relationships between entities for a knowledge graph. We also do a lot of work on language generation as well, and particularly fact-aware language generation, so we spend a lot of time trying to teach machines not to hallucinate, which tends to be sort of one of the issues of these transformer-based models. And so it's just really interesting when you're in this kind of world of machines that dream and to try to teach them not to. But the goal for us is to take human actions on top of text and automate them at scale, so that we can kind of find insights that no individual human would be able to see by themselves, and we've had a lot of success in doing that over the last few years.
Lukas Biewald:
And is your goal to kind of make these individual tasks available to someone who wanted to use them, or is it to deliver these insights to a customer?
Sean Gourley:
I think the goal for us is ultimately to deliver these tools to the customer so that they can take actions that were done by humans and ultimately automate them. Now, you get the automation, but if you do it at scale, then all of a sudden you do get these insights that no individual human would have found. What we've found though as we've gone through that is that the internal kind of data science teams within these organizations have said, "Look, we'd love to kind of have these different components you've built," and so we've also been able to sell the different API components to users as well. But the end goal for us is to make this available to users with no technical knowledge and that's where we're focusing.
Lukas Biewald:
And do you have a particular end user or domain that you care about, or is this like a broad based platform for insights?
Sean Gourley:
Yeah. Look, so we've been focused on defense from day one. And my background, my PhD work was in the mathematical dynamics of insurgency, and so I spent a lot of time in the world of intelligence and defense. I think they have a really particularly useful use case. They spend a lot of time dealing with text-based information, perhaps more than anyone else in the world. So if you're an analyst sitting there inside of a three-letter agency, you're going to be dealing with hundreds of thousands of text-based documents coming across your feed every day. And I think there's no surprise to anyone in the industry that that's just not a scalable human task. So we're able to go into that.
Sean Gourley:
I think there's three things that make that really attractive for us. One is the volume of text. I think the second is that any edge that you can get as an intelligence or defense operator or analyst, you're going to want to take that. And then the third thing is we've seen really, really good defensibility once you're in and deployed in these organizations. There's a two-year process to get in there, and so it's a good market to kind of land in once you've deployed your technology and got it working.
Lukas Biewald:
Has the state of the art in natural language processing changed to enable a company like this, or is there some like specific insight that you felt you had? How do you think about that, this moment for your company?
Sean Gourley:
Yeah. So when I started this, it was sort of 2015. I was watching, as you probably were, a lot of my friends and a lot of our friends would have been playing with neural nets and doing image processing. And I remember Jeremy Howard showing me some of the stuff he was doing with caption generation on top of images. And I remember watching that and seeing the caption generation piece and I was like, "This is going to come to language," right? "These technologies are going to come to language." And so that was sort of end of 2014, start of '15, watching friends do that.
Sean Gourley:
For me, I made a bet and said, "Look, we've seen computer vision go from 30% error rates to 5% error rates with these new neural approaches," and language felt like the next logical place that that would happen. I think if I'm honest like the first two to three years of the company, the technology hadn't caught up to the vision, but then we saw transformer-based models emerge, and that's just been a game changer. And what that's meant for customers is it's meant that these are actually trainable, which means they can be customizable, which means that you can actually start to deploy them to a pretty diverse set of use cases.
Lukas Biewald:
So you mean like fine tuned or something on their own datasets?
Sean Gourley:
Yeah. So instead of having to kind of train with hundreds of thousands of documents and data points and training examples, you can start with a model that's got a pretty good embedding structure from reading kind of general information, and then you can retrain that obviously on a fraction of the information that would otherwise have been required. So I think that's probably the single biggest thing, and that allows users to engage with this technology. When we talk about, "What's your return on investment for the time you want to take to train this or to get a payoff?" And that's come down significantly with these models.
Lukas Biewald:
You do a wide range of kind of traditional NLP use cases. Which ones have you seen the biggest change and maybe which ones have you still kind of not seen the improvement from this new technology?
Sean Gourley:
Yeah, that's a good question. When we started, language generation, it was sort of recurrent neural nets and LSTMs, and you couldn't really generate a sentence with any kind of credible output, right? So the idea of even doing like a multi-paragraph summary of a document was just science fiction, so there's stuff that this technology has enabled that you just couldn't have done. I think the second bit here is that the idea of training a model with a few dozen examples to pick up a relationship extraction between two entities, again that was a scientific paper that you had to write. So there's stuff that this has enabled that just wasn't even within the realm.
Sean Gourley:
I think where this hasn't had as big an impact, I think it's really only limited by the training data that you're willing to throw at it. And perhaps there are tasks in NLP that this wouldn't be appropriate for, but we honestly haven't seen it. Everything that we've given the training data for these models, they've performed in a good way. I think they make errors that the older NLP models don't make, but they make fewer errors, so you're going to take that every time.
Lukas Biewald:
Your name, Primer, is evocative at least to me of summarization. Am I correct in making that connection?
Sean Gourley:
It's actually ... It comes from inspiration, Neal Stephenson's book, The Diamond Age, if you're a science fiction fan. The subtitle of that is A Young Lady's Illustrated Primer. And in that book, the protagonist has a nanotechnology which creates a nanotechnological book that is designed to educate the world. And of course without spoiling the book, it kind of falls into the hands of manipulation versus education, which I think is a wonderful kind of theme. And so obviously underneath that is this idea that if you could have a self-writing book that could educate us about the world, we'd be in a science fiction world, and we'd be able to kind of do fascinating things with that. And so for us as a guiding principle is, "How do we train machines to observe the world and teach us about what they're seeing so that we can be smarter about the world that we're living in?"
Lukas Biewald:
So I guess there's some connection, maybe not directly. I guess I was feeling impressed that ... I feel like summarization or text generation like you said has been kind of the most interesting, maybe the most impressive use of the new kind of transformer technology, and I was wondering if you sort of felt that that was coming or if you were surprised by it?
Sean Gourley:
My thing always at the start was, "We're going to build a self-writing Wikipedia." And that was going to ultimately be something that this was going to enable. We were a long way away in 2015 from that technology even kind of existing. It was a bet on this becoming available, and it turns out it's been a good bet. So I'll take the win on being right, but I don't know if I had the right information, so maybe I'm just lucky, but we'll take it.
Lukas Biewald:
And I was kind of curious. You're one of the few people like me, kind of a second-time founder doing something. And it's sort of a similar space as your first company, like I am too. I'm curious if that kind of shaped your views with your new company, what you were sort of thinking of maybe doing differently and what you wanted to keep from your last company?
Sean Gourley:
I think it's kind of like you always sort of joke. Like when you're a writer, your first novel is sort of the easiest because it's sort of a collection of all your experiences up to that point. Your second novel, it has to be something new. To kind of carry that analogy on, your first novel is kind of biography. So I think in your first company, for me anyway, it was that idea you'd always had in the back of your head that you wanted to make real. I think in your second company, and it's been true for me, I've become more grounded in the commercial realities of like what's actually going to sell, what's going to scale, how big the opportunity is, what are the megatrends that are unfolding. And we've been very conscious of wanting to catch those waves and having a large commercial market to go after. Having defensibility in the space that you're in becomes really important.
Sean Gourley:
But I think overall the biggest thing is just operationally. I think when you're creating your first company, you don't really know what it's like to scale an organization. And I think until anyone's been through that, you don't really have that idea. I think once you've done it the second time, there's a lot of familiar signposts along the way where you're like, "Oh, this is what happens at this time, and that's fine," and, "This is what happens at that time, and that's fine." Whereas I think the first time you see it, you're sort of like, "Oh my god, is this the end?" or, "Is this danger?" or, "Is this what winning looks like?" And the second time you do it you're like, "No, I've got a few more data points." And just having seen something once before is night and day versus seeing it the first time.
Lukas Biewald:
Yeah. I can relate to that. I'm curious too, I don't know if you think of yourself this way, but when you look at your background it sort of feels like a data scientist, right? You have a PhD in physics, I think, right?
Sean Gourley:
Yeah.
Lukas Biewald:
Did some really interesting kind of data stuff we could talk about on mathematics and war, I think. But do you think ... I don't meet a lot of other data scientists that run companies. Do you think that that then informs your leadership style?
Sean Gourley:
It's funny. I probably only hang out with other data scientists that run companies. I think me and Mike Driscoll and Pete Skomoroch, we tend to kind of console ourselves with data scientist founder therapy sessions. So you're probably right though, on balance there's probably not a lot of us. I think there's a few things that come through as a data scientist. One is I think you have an appreciation of the algorithms. I think the single biggest thing that I've seen is when it comes to kind of product design, you're designing products that have algorithms at their heart. It's not algorithms to optimize a product experience. The product is the algorithm, and the algorithm is the product. And I think that appreciation's really, really important when it comes to kind of this idea of building a product and what a product market fit means and all of that. And it's not a direct translation from sort of the old world where you're designing products that don't have algorithms at their heart. So I think that's one piece of it.
Sean Gourley:
I think a second bit is that the reality is as you're growing these organizations, you're never going to have all the data you need at the start. And so if you're in a big organization, I chatted with a lot of friends that had come from LinkedIn and so on, you've got data that you can optimize, you can run AB tests on, you can do all of that. When no one's using your product because you're trying to get the algorithms to work, you don't have the traditional kind of data science methodology. It's not that useful for you. So that's definitely a frustrating piece. You can't lean on that. So I think on the upside, you understand the algorithms, but on the downside you don't really have data to make decisions on. It's probably a bit of both worlds. But I've got to say it would be tough to be CEO and founder of a company if you didn't have a good grasp of these kinds of technologies, and it's a pretty steep learning curve, so I definitely wouldn't trade the background by any means.
Lukas Biewald:
It's funny. I think myself I wonder if I'm maybe less data-driven in some ways than other CEOs that don't come from a data background because I feel like sometimes people use data as almost like a wedge to reinforce their confirmation bias. And I think as a data scientist, or at least for me, I feel like I'm maybe a little more skeptical of the data because I work with it so much, which I think sometimes makes me maybe in some realms less data-driven. I wonder if you identify with that at all?
Sean Gourley:
Yeah. There's always skepticism. The question is always, "Where'd you get this data from?" And then my mind immediately goes to kind of, "What's wrong with the data?" and that side of it, I think it is right. I think in this, there's a lot more gut instinct than I think anyone kind of appreciates. I don't think you can run a deep tech emerging company from data. A data-oriented decision framework is probably not right.
Sean Gourley:
I think where I spend a lot of time is in this kind of space between the scientific publishing and commercialization. I think perhaps more than anything having a PhD and being familiar with how science evolves allows you to sort of make these bets on scientific breakthroughs that maybe seem risky to the outsider, but when you're following it and you know what the trajectory of an emerging scientific breakthrough feels like, you can kind of put your chips behind that, place a bet on it, and in 12 months, 18 months, then you can cash in on that. And I think perhaps more than anything the benefit of a PhD in something like physics is a familiarity with science and a familiarity with the scientific process and translating that into a set of strategic bets that you can make as a CEO to position your company to best have upside with what's going to unfold.
Sean Gourley:
I was just saying here, "Maybe I'm lucky," the other way to look at it perhaps more generously is I just had a really good grasp of where the field was going, and maybe I can claim some success on that. But that's the bet here is familiarity with science. And I think as we've seen here, you've got one hand on arXiv and one hand on your email, and between the two of those you're probably steering the company.
Lukas Biewald:
Interesting. So where do you try to put your algorithms? Are you trying to push the very state of the art in terms of things like architecture, or are you sort of like intentionally drawing from research and mostly using results that you find?
Sean Gourley:
So it's interesting. There's two things. So one is research, for sure, right? Like if you've got breakthroughs, and these aren't always the obvious ones, but absolutely right. Like science unfolds, and you want to take that learning and commercialize it. Now, the commercializing of science can be everything from making it cost efficient to run, through to kind of training it on the right data, through to kind of understanding how to correct for the 15% false positives that pop up, which you can't do in a kind of mathematically elegant way, and it becomes a set of rule-based corrections at the end. So all of that kind of is part of commercialization.
Sean Gourley:
But the other side of it is there's a whole bunch of stuff that just doesn't fit the scientific publishing paradigm, and a lot of language generation doesn't fit the scientific publishing paradigm because all I've got, I've got BLEU and ROUGE, and these are useless with regards to kind of any customer experience of language generation. So in order to evaluate the quality of your language output, you've literally got to put humans on top of this and kind of have them evaluate everything that you're doing, which is incredibly expensive, and it sort of hasn't been part of the scientific paradigm, so there's very little kind of publishing on language generation I think largely because the ability to get a decent F-score is really, really hard. And you can probably go through a whole bunch of language processing tasks that just don't have a decent F-score measure, or have a difficult F-score measure, and as such don't have really an active scientific space.
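To illustrate the point about overlap metrics, here is a minimal sketch of a ROUGE-1-style score (clipped unigram overlap, combined as an F1). It is a simplified stand-in and not the official BLEU or ROUGE implementation; the function name and the example sentences are hypothetical. It shows how a candidate with the wrong fact can score high on word overlap while a correct paraphrase scores zero.

```python
from collections import Counter


def unigram_overlap_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1: clipped unigram overlap between two strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences, chosen only to illustrate the failure mode.
reference = "the company reported a profit in the third quarter"
wrong_fact = "the company reported a loss in the third quarter"
paraphrase = "earnings were positive for q3"

score_wrong = unigram_overlap_f1(wrong_fact, reference)
score_paraphrase = unigram_overlap_f1(paraphrase, reference)
print(f"wrong fact: {score_wrong:.3f}, paraphrase: {score_paraphrase:.3f}")
```

The factually wrong sentence shares eight of nine tokens with the reference, so an overlap metric rewards it, which is one reason human evaluation ends up being needed for generated text.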
Sean Gourley:
So, it's been interesting kind of tracking that through. And I think the other bit here is science is still some of the best inspiration, right? And in terms of like it can just sort of spark an idea and you're like, "Wow, that's a super cool attempt," and that side of science is pretty valuable too.
Lukas Biewald:
We're sitting here in August 2020 talking about text generation, so I have to ask you what you make of GPT-3, right? That recently came out, and people seemed very impressed. How impressed/surprised were you by its performance?
Sean Gourley:
I think the GPT-2 was the bigger jump, right? I think when GPT-2 came along, it was like, "Wow, these transformers scale, and they scale really well," right?
Lukas Biewald:
You know what's funny? That was exactly my reaction. I didn't want to bias the question, but I totally, totally agree.
Sean Gourley:
Because prior to that, language generation was via LSTM, and that was pretty bad. Like you couldn't ... You could make a sentence, but you couldn't string two sentences together. So that was the first thing, was GPT-2 was like, "Wow." Now, where GPT-3 came, and I think it's useful, was ... I was like, "Oh, it keeps scaling. It doesn't seem to have a finite kind of scaling effect at this sort of level of parameter space." So that's useful, right? But for me the big jump was GPT-2.
Sean Gourley:
Now, what we found on that, and you can take a different set of transformers, you can take XLNet or BART or BERT or whatever you want, but what you found is although the party trick is language generation, I think the true value of that is the trainability of these models. Is that you can train them to do tasks that are sort of the traditional NLP tasks, that you can train them with a lot less data. And it's super impressive to kind of see language generation, but in terms of the value for our customers it's basically saying, "With 20 training examples, you can build this thing with 95 plus percent precision and 90 plus percent recall, we'll automate your human task every time." And so I think that the true commercial value of this is the re-trainability. The party trick is the language generation. Although if you put on your hat of ... and maybe we'll get to that later, of disinformation and manipulation, there's definitely a whole industry that's going to spawn up around language generation, but we'll get to that later maybe.
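For readers less familiar with the two numbers quoted here, the following is a minimal sketch of how precision and recall are computed from binary labels. The helper name and the label counts are hypothetical, picked only so the result lands near the "95 plus / 90 plus" range mentioned; they are not Primer's data.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels given as 0/1 sequences."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)      # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)  # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical evaluation set: 19 correct hits, 1 false alarm, 2 misses,
# and 78 items correctly left alone.
y_true = [1] * 19 + [0] * 1 + [1] * 2 + [0] * 78
y_pred = [1] * 19 + [1] * 1 + [0] * 2 + [0] * 78

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Precision answers "of what the model flagged, how much was right," and recall answers "of what was really there, how much did the model find," which is why both thresholds matter when automating a human task.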
Lukas Biewald:
Well maybe we should move in that direction. But I'm kind of curious how you ... Do you have a thought on why GPT-3 captured people's imagination so thoroughly?
Sean Gourley:
It's funny. It was one of those ones. We saw the paper get published and went through it, and the thing that captured me was the few-shot learning, right? Which was super interesting. And I think it got underplayed in the paper, right? The few-shot learning was probably the most I think impressive piece of that work. And then I woke up like a month after the paper was published, and then all of a sudden an entire like VC Twitter was like going bananas for GPT-3, and I also had that moment. I was like, "What's going on here," and I just sort of scratched my head.
Sean Gourley:
I think OpenAI has done one thing incredibly well, and we don't probably appreciate that. The marketing that they do is par excellence for the world of AI, right? It really is impressive. And how they rolled out that release I think of GPT-3 versus kind of the GPT-2, "It's too dangerous. Don't touch it." GPT-3 was like, "Come and play with it if you're special." And it was a perfect influencer campaign that was run beautifully. It's up there with the influencer campaigns of Fyre Festival and-
Lukas Biewald:
I thought you were being nice, but now I feel like maybe you're not. I don't know. I can't tell if you admire it, or if you're-
Sean Gourley:
Yeah. It was a wonderful influencer campaign. They just needed everything to back it up with. I actually think there's a lot more there, but in terms of the campaign that they did, it was wonderful. And I think that that captured sort of the minds of VC Twitter.
Sean Gourley:
I think the bit that people miss on this is it matters what training data you've given to these machines, and it matters a lot more than you think. And that's the bit that everyone sort of misses. It's like, "Out of the box, we can use this with the few examples that it learns." And people talk about steerability or they talk about priming the system, what you're trying to do is correct for sort of the somewhat random nature of the training data. And it's a really bad way to steer a model where you don't know what it's been trained on, and you're trying to give it kind of hints in order to keep it away from being racist. And you don't know what it's read. It kind of feels like just the blind kind of like exploration.
Sean Gourley:
So I think the learning out of all this is training data matters. And the other bit I think here is that Twitter is a wonderful medium for displaying outputs of models that have 30% precision because you don't see the other 70% where it missed. And I think that's the other piece here is that if you look at 10 cherry-picked examples of these outputs, you're going to see some great results. But as we know in the commercial world for most applications, human precision is plus 90%. And if you don't hit plus 90% on your task, it's very difficult to commercialize it. And so I think the race as we look at NLP tasks is always the race to a 95% precision, and that's sort of human comparable.
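The cherry-picking argument can be made concrete with a small binomial calculation. This is a rough sketch that assumes each generated output is independently "good" with probability 0.30 (the 30% figure mentioned above); the sample count of 50 and the function name are hypothetical.

```python
from math import comb


def prob_at_least(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): at least k good outputs out of n."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_samples = 50   # hypothetical number of generations someone tries
hit_rate = 0.30  # per-output chance of a good result, assumed independent

expected_good = n_samples * hit_rate
p_ten_good = prob_at_least(n_samples, 10, hit_rate)

print(f"expected good outputs: {expected_good:.0f}")
print(f"P(at least 10 good outputs): {p_ten_good:.3f}")
```

Under these assumptions, someone who tries 50 prompts will very likely have 10 impressive outputs to post, even though a customer running the same model on every input would see it miss most of the time.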
Lukas Biewald:
And so you've touched on AI and safety a couple times in the last few minutes, and you also kind of operate in a world that I think is considered a gray area to a lot of AI researchers, defense or military applications. I'm curious what you think generally about especially natural language models and safety, and what should be done, and how worried you think people should be about misuse of these models, and what role you think you should play as sort of like a leading company in this space?
Sean Gourley:
I think first and foremost if we want to be a global superpower as America, you have to have defense, and you have to have intelligence. You may not want to have them, but then you don't get to be the global superpower. So that's the first thing to kind of just accept is that defense and intelligence are part and parcel of being a global superpower. It's also part and parcel of defending liberal western democracy. And there are plenty of other organizations and governments in the world that don't want that to exist, so we need that.
Sean Gourley:
As you come back from that, the second thing you say, "Well, we want it, but we want it to be good." And so you say, "Well, if I want it to be good, we need to bring artificial intelligence and the latest technologies that we're developing to bear on that problem space." It's sort of a strange philosophical ground to say, "Well, we need to have defense, but it shouldn't be good," right? It's just a strange position.
Sean Gourley:
Now, as you go through that you say, "Well, there are also ethical concerns and moral concerns." There are very, very few organizations in the world that think more deeply about the ethical and moral implications of war than defense and intelligence. They live and breathe this stuff. And we can sort of armchair quarterback from the Valley, but the reality is this is something that has been thought very, very deeply about and has a lot of care, and the kind of rules of engagement are very, very well defined and very, very well thought through and have been shaped and constructed over many, many years.
Sean Gourley:
Now, a lot of them haven't imagined what AI does in that, but there's also been a huge amount of work, going back, for me, over the last decade with defense and with intelligence, talking about these exact scenarios and what it means to have artificial intelligence engaging in this kind of process. So for me here, bringing this technology to bear in defense and intelligence is something that I think is the right thing to do, and it's a very, very important mission for myself and for our company. As we do that, we also realize we've got a responsibility. That it matters if we're generating models that are classifying things that are unfolding in the world and saying, "Look, we identified an event." And if you misclassify that, that intelligence is now percolating up a chain which is going to have consequences, right? So there are very real consequences when you talk about the precision of the models that you're working with. There are very real consequences when you talk about what the data's been trained on, what the susceptibility of the models that you've got are to outside adversarial attacks. So all of this becomes something that you need to kind of work with and deal with.
Sean Gourley:
I think the ethical components of this are woven into the decisions that we make. It's something that's also moving I think pretty quickly. If there's one thing you learn in science and technology, it's that science and technology moves a whole lot faster than the philosophical and ideological kind of foundations on which you can kind of make decisions on top of. And so you are by nature going to be in gray zones, and this is something you've got to be kind of open to and say, "Look, we're going to navigate where perhaps no one's ever thought about this before, and there isn't a strong kind of rule that you can fall back to and say, 'Hey, this is the answer. This is what you're supposed to do in this situation,' because the situation's never existed before."
Sean Gourley:
So it's something that we spend a lot of time with both ourselves and our advisors, spending time each and every week going through this stuff, making decisions, and trying to kind of navigate the best path that we can through this, but I think it would be a lie to say that this is really easy and there's these clear black and white kind of distinctions because we're dealing with stuff that simply didn't exist in the world before, but we're also dealing on the geopolitical scale with stuff that simply didn't exist in the world before.
Lukas Biewald:
And do you think at this moment in time, August 2020, do you think that for governments natural language processing, like machine learning is an important part of their defense capability?
Sean Gourley:
Yeah. I think there's three places where it comes through. The first is on the intelligence side. There's too much information coming in. And simply put, if you don't have machines playing some role in helping you navigate that information, you're going to have information that no one ever sees. And if you don't see information, you can't bring it to bear on decisions that you'll make. So, step one, the volume of information requires a natural language tool kit to actually help navigate.
Sean Gourley:
The second thing here is that the complexity of the world that we're in means that drawing inferences between something that's happening in Russia and something that's happening in East Africa is very, very difficult for an individual that has to specialize in, "I'm an East African specialist, or I'm a Russian specialist." Machines don't have that limitation, right? They can look further. They can look wider. They can draw inference across a larger set of data points because they're not fundamentally constrained by the bandwidth of information they can consume. So I think as we move to a more complex world, it's essential to have machines that can make connections across domains that humans aren't necessarily looking at.
Sean Gourley:
The third thing is, and this has sort of I think become increasingly important, is that more and more information is being generated by machines, and that's being used to manipulate. And if you've got humans that are trying to filter through the output of propaganda from China that's being machine generated, you've brought a knife to a gunfight. You're going to lose that.
Sean Gourley:
And so as we look at things like the operations out of Pacific command, there's a huge volume of information now that China's got its head around disinformation and manipulation. You can't navigate this as a set of humans. It's just not possible. And if you try and do that, you're going to lose. So I think the disinformation landscape has necessitated a set of machines that need to come into this.
Lukas Biewald:
Can you be more concrete about the disinformation? Should I be imagining sort of Facebook bots?
Sean Gourley:
So it's actually evolved a lot. So our standard kind of thing was Facebook bots back in 2016. What you've got now is a manipulation ecosystem, so it's everything from state broadcasting. In the sort of Russian example, you've got Russia Today and that sort of state broadcasting. You've got state-supported broadcasters, so things like Sputnik in Russia. Then you've got kind of fringe publications, which are supported ... These can be kind of fringe versions of Huffington Post, but it would be a fringe version of that where anyone can kind of submit. Then you've got social media, and then you've got sort of cyber-enabled hacking where you may falsely release a set of emails that have been doctored.
Sean Gourley:
So all of these components make up sort of the ecosystem of information manipulation, and they actually layer together. So you can hack a set of emails, falsify emails, spread them out, have them found on social media, have it amplified by a third party fringe voice on a user-submitted site like Huffington Post, but not Huffington Post probably. You can have it kind of rebroadcast through Sputnik and then end up on RT, and then be connected back into Fox News. So that cycle allows layering of information to come where you don't know the original source of it. You may not be aware of how it came to be, and you may be hit with the information from three different angles. That makes it feel like it's a lot more kind of authentic.
Sean Gourley:
And you can do this with fake information. You can also do it with information that's actually real but perhaps isn't as important as it should be. So maybe there's a shooting that happens which becomes front-and-center news, when the reality is is it was just a local shooting and if it hadn't have been amplified, it never would have been on the radar. So you're not just in this world of is this real, or is it fake? It's actually whose agenda is being pushed and what organism is actually pushing this agenda? This is kind of where I think we're sitting now is actually a very, very complex disinformation ecosystem designed to manipulate.
Lukas Biewald:
How does machine learning enable that though? Because all those examples you said, I could picture that being done with just human beings, motivated human beings doing a lot of typing I guess. Does ML really change this?
Sean Gourley:
Yeah. So I think state of the art at the moment is humans at the Internet Research Agency sitting down and ... From what we know, they have a set of objectives they have to hit. They have sort of a scoreboard of topics they need to cover every day, and then they get rewarded based on the performance. So it's all very manual.
Sean Gourley:
I think what we're looking at is it generally takes on the order of 18 months, 24 months for sort of an emerging technology to become sort of weaponized, right? So we're not seeing yet the weaponization of language generation. We have just started to see the weaponization of image generation for fake images and fake profiles. We haven't really seen the weaponization yet, although we should expect it soon, of video generation. So language is ... Language generation's a lot newer. I think we're probably two years away from seeing that, but there's obviously a very, very clear path that if you can generate all sorts of anti-vaccination articles that are targeted to different demographics, and you can do that by the scale of millions, you're going to get some really, really persuasive arguments that are able to be captured and propagated.
Sean Gourley:
So whilst it hasn't unfolded yet because the technology is new, I think it's very, very clear that this is a weapon that, if you were going to take this on, that it's absolutely something you'd want to kind of have at your disposal. So I think that's one piece of it. I think the second bit, it gets back to more of the traditional data science, which is A/B testing on the scale of millions. And whilst you can't really do that when humans are typing the stuff out, once machines are producing it, you absolutely can. So I think that gets you into a world where this is going to be a lot more coherent.
Sean Gourley:
The other bit that I've flagged, going back to science, is one of the most fascinating areas of scientific research at the moment, for me anyway, has been opinion formation and crowd dynamics, right? This has got roots a little bit in epidemiology. It's got roots a little bit in stock market trading. It's got roots obviously in the world of idea formation and diffusion of ideas. But this is an area where we're actually seeing that crowds can actually be very manipulable. That research is happening. It's going on. Once you couple these other technologies into that, I think we're going to start to see that you can move and manipulate large groups of people through the information they're exposed to. And at that point, you've got a fundamental issue with democracy, right? And this is why it's such a big issue, right?
Sean Gourley:
We are based as a society on the free and open debate and sharing of ideas to come to consensus and a democratic process to elect governance for us. Once we lose faith in that, democracy dies, and there's a very, very clear vector of attack with manipulation of information by machines, and so we need defenses against that. And it's coming, and the defense and intelligence sector has realized this, and we're working very closely with them to help with that defense.
Lukas Biewald:
Can you sketch out what a defense to that might look like? Because it doesn't seem obvious there's a way to kind of prevent people from creating very persuasive content. In fact, you might argue that's happening right now.
Sean Gourley:
Yeah. So I think that's right. Look, so one of the things to recognize is this is an asymmetry, right? So with any asymmetrical conflict, one side has the advantage over the other. I sort of draw the example with image generation. If you generate a face of a person, you've got two options, right? If you want to know if that's real, you can go and check every single person in the world, and see if it's there. And if you get through everyone, and you don't find them, then it's fake. So obviously it's easier to generate an image than it is to determine if it's fake or not.
Sean Gourley:
Now, of course as you go through that, there are signs and telltale signs, right? A little too blurry, the ears are asymmetrical, the teeth don't quite line up, and so then people kind of figure that out, and then they generate a new image, and then the old techniques for identifying it don't work anymore. And now you're in a cycle of effectively what we've seen in cyber security, which is things like zero-day attacks, right? Where you get a new model that hasn't been shown before and the statistical signatures of that model aren't known to the defense systems. So it's a game of detection and deception, right? Can I deceive the algorithms that are designed to detect whether this is real or not, or can I actually detect it and kind of like stop it?
Sean Gourley:
So that's one side of it. Now that's in images, but if you go into language, obviously there are signals in here. And one of the ones that we spot and look at is a Zipf distribution. So if you look at language, there's a Zipf distribution, which is a relative frequency of words that we use. And each author has a kind of a statistical signature of language, and machines have a statistical signature of language, and so you can spot them. But if you generate a new model, then the old methods of detecting it aren't necessarily there. So you've got the whole kind of like detection and deception, has this been generated by a machine or not?
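To make the word-frequency signal mentioned here concrete, the following is a minimal sketch of measuring how closely a text follows a Zipf-like rank-frequency curve. It is only an illustration, not the method Primer uses: the function names `rank_frequency` and `zipf_slope` are hypothetical, the tokenizer is a crude regular expression, and a real detector would compare far richer statistics than a single slope.

```python
import math
import re
from collections import Counter


def rank_frequency(text):
    """Return word counts ordered from most to least frequent."""
    words = re.findall(r"[a-z']+", text.lower())
    return [count for _, count in Counter(words).most_common()]


def zipf_slope(counts):
    """Least-squares slope of log(frequency) against log(rank).

    An ideal Zipf curve (frequency proportional to 1/rank) gives a
    slope of -1. Texts from different authors or generators may
    deviate from it in different ways.
    """
    if len(counts) < 2:
        raise ValueError("need at least two distinct words to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var
```

Comparing the fitted slope of a suspect text against the slope seen in known human writing is one simple statistical signature among many. As the conversation notes, a newly trained generator may not show the deviations an older detector was tuned to find.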
Sean Gourley:
But on the other side of it, you've also got things like claims that are being made. So if a claim is being made that 5G causes coronavirus, you can actually trace that claim backwards. Where did it first originate? How did it propagate? And so it's not so much is the language real or fake, but has it been propagated by grassroots or has it been propagated through the network via actors that are intentional about that?
Sean Gourley:
Now to do that, you've got to classify a relationship between 5G and coronavirus. And as you look at that, there's all sorts of ways to say that it's caused by, it's a result of, and so now it's a kind of a relationship classifier. And so you can do that. We've deployed that technology looking at relationships, for claim extraction, propagating that backwards. But we also look for things that counter that claim, right? So, "5G is not caused by ... coronavirus is not caused by 5G," or, "Coronavirus was likely caused by an infection of a bat into a wet market." So these would be claims that are at odds with each other. But ultimately the dynamic is, how do you get a ground truth, right?
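As a rough illustration of the relationship-classification idea described here, the sketch below labels a sentence by whether it asserts or denies a causal link between two entities. Everything in it is an assumption made for illustration: the cue list, the negation pattern, the label names, and the function `classify_claim` are hypothetical, and a deployed system would presumably use a trained model rather than keyword matching.

```python
import re

# Hypothetical list of phrasings that signal a causal relationship.
CAUSAL_CUES = ["causes", "caused by", "is a result of", "results from", "leads to"]

# Hypothetical, deliberately small set of negation markers.
NEGATION = re.compile(
    r"\b(not|no|never|doesn't|does not|isn't|is not|wasn't|was not)\b"
)


def classify_claim(sentence, a="5g", b="coronavirus"):
    """Label how a sentence relates entity `a` to entity `b`."""
    s = sentence.lower()
    if a not in s or b not in s:
        return "unrelated"        # at least one entity is missing
    if not any(cue in s for cue in CAUSAL_CUES):
        return "mentions_both"    # both entities, no causal wording
    if NEGATION.search(s):
        return "denies_link"      # causal wording plus a negation
    return "asserts_link"         # causal wording, no negation
```

Once sentences carry labels like these, dated occurrences of the asserting claim could be sorted to trace where it first appeared, and denying claims could be collected as counter-evidence. Keyword rules such as these miss paraphrases and sarcasm, which is one reason to treat this only as a sketch.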
Sean Gourley:
How do you get a ground truth? And I think if we're looking at kind of the long-term kind of game on this is we need to train machines up that can help us establish ground truth so that when new information becomes available, we can measure it up against that and say, "Is this consistent, or is this contradictory?" Now, just because it's contradictory to ground truth doesn't make it false, but it does mean you want to look closer at it.
Sean Gourley:
I think as we build up defenses for democracy, we need, and I've talked about this, a Manhattan Project to establish ground truth. It's going to take a lot of work and a lot of effort, but it's very, very hard to see a democracy functioning if we can't establish information provenance, if we can't establish whether information is likely to be part of a manipulative attack. And if we don't have any infrastructure to kind of lean back on and say, "Well, here's what we do know about the world, and here's what we do understand with it." And so this is a big problem I think for democracies, and we need a way around it. And so this is going to come down to ... It's an asymmetric fight, but it's one that we have to win.
Lukas Biewald:
Do you think that it would be wise to use the same kind of manipulation techniques to spread true information?
Sean Gourley:
Yeah, this is interesting, right? So on the one side, you've got detection. I think on the other side you've got, "Well, what's your reaction?" What's the action that you take on top of this? I think at this point here, and you can go into just kind of the health crisis kind of dynamic of COVID, and that sort of maybe makes it a little more real. And so if you've got stuff here around the diffusion of HCQ being an effective treatment or, "Masks don't work," this is really dangerous, right? This is incredibly dangerous, the propagation ... And we've seen bot activity around masks don't work. There does seem to be coordinated attacks around pushing divisiveness and masks.
Lukas Biewald:
Sorry, why would that be true? Who would stand to gain from pushing the idea that masks don't work?
Sean Gourley:
So if you want to create political division, which has been the stated goal of the IRA, the Internet Research Agency, you find any hot-button issue that will divide a country, push it. It plays into tribalism. You have an us and a them, and you lose the cohesivity. Why do you want to do that? Well, if you don't have a unified set of political consensus on anything, it's very, very hard to go to war. It's very, very hard to rally the US to say, "Don't invade Crimea," if you can't even agree on masks, right?
Sean Gourley:
So one way to kind of neutralize the strongest military in the world is to ensure that the political factions will never come to agreement about how it will be used, and Russia's been incredibly smart on that. And so one of their kind of goals as they look through is to divide the nation so that you can't agree on anything. And so one of the things has been masks. Now, the added kind of benefit of the masks is that it kind of ruins the health of society by having division on that. And it also ruins trust in the political system, which is again to Russia's advantage. So this has absolutely been something that if you're sitting there, this has been one of the things that have popped up on your daily kind of topic board of things you have to act on. And we can see that kind of manifest from the way in which information is propagating and the way in which bot type activity is engaging.
Sean Gourley:
And so if you look at that and you say, "Well, all right, there's nothing we can do about that," well, that's the wrong thing to do because not only are you creating political divisiveness, lives are being lost, right? And so it's a hard position to hold that we shouldn't do something. I think the question then comes is like we do want to propagate information out that is true and that does kind of conform to the scientific consensus. But the interesting thing on that is masks were not a scientific consensus. And if you went early on, it was against WHO regulation. And so if you posted ... And I had conversations with Jeremy Howard about this. If you posted on Reddit, they said, "You can't put that here. You can't post that masks are an effective solution." And the reason you can't post it is because this is pseudoscience because science hadn't come to a conclusion.
Sean Gourley:
So it's really, really tough, right? As you go through this as to say, "Well, what is ground truth?" particularly if science hasn't figured it out, and then, "How do we police content that may or may not conform to this?" And so immediately as you go through this, you start to realize that it's a very, very difficult problem. However, it's also one that you feel like you've got to act on. So I think we're going to have to be in a place where we do use this technology to inoculate ourselves against disinformation. And one of the things here is, to take the virus analogy, if you haven't been exposed to a political stance on masks, you'll probably take whatever you're first exposed to. And if you're exposed ... the first information is that, "Masks don't work. It's a conspiracy." If that becomes your first exposure, it's much, much harder to change your opinion than if you're first exposed, "Masks are a good idea. If you help me. I help you. It's a good idea."
Sean Gourley:
So, one of the things you look at is to identify the manipulation campaigns early and inoculate susceptible populations against the messages by exposing them to good, well-grounded ground truth.
Lukas Biewald:
With those similar techniques.
Sean Gourley:
Similar techniques, I think you're going to have to use similar techniques, right? And this is kind of ... to go back to the book from Stephenson, the line between education and manipulation is a very, very fine and often blurry line. It's that dynamic, right? Is like, "Well, if I am educating, I am manipulating." But the difference is I'm doing it for the benefit of you. I'm doing it for the benefit of the society, not I'm doing it for my own benefit. I think that's kind of the dynamic here is undoubtedly we're training machines to understand the world in ways that we can't, to do things that we can't. What we teach them, how we teach them is very important because they're going to then be tools that either benefit us or work to our detriment, but that kind of dynamic is ... it's they're undoubtedly going to see things that we can't see, and they're going to understand things that we just can't understand. And we need that because we can't navigate this world without them. So, they're here, but we need to take responsibility with what's in front of us.
Lukas Biewald:
Well, I have lots more questions, but I'm running out of time, and we always end with two questions. If you look at the subtopics in machine learning, is there one that you think doesn't get as much attention as it deserves, that you think is way more important than people give it credit for?
Sean Gourley:
Yeah. I think it's information retrieval. So the world of IR is sort of machine learning, kind of. I mean it's sort of been BM25 algorithms and so on and sort of that. But I think information retrieval has been something that we've totally forgotten about, but it's so fundamental to all database technology in the world, and yet we haven't really kind of given it the attention that it deserves. So, aside from some researchers that I'm sure are not getting their papers submitted to NIPS because information retrieval is not top of the list. But more information retrieval for sure.
Lukas Biewald:
I feel like that was really the first major application of machine learning, at least that I was aware of.
Sean Gourley:
Yeah. And we just haven't touched it. The volume of information retrieval literature with these new kind of technologies is pretty low, and yet underneath it, it's a search and recall problem.
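As one illustration of the classic information retrieval techniques discussed here, the following is a minimal sketch of Okapi BM25 scoring, a standard ranking function for search. The function name `bm25_scores`, the whitespace tokenizer, and the defaults `k1=1.5` and `b=0.75` are assumptions chosen for illustration, and the sketch assumes a non-empty list of non-empty documents.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` against `query` with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n  # average document length
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)  # term frequency within this document
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rarer terms get a larger inverse document frequency weight.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates, normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Sorting documents by these scores gives a ranked result list. This covers the recall side of the "search and recall problem"; a newer model could then re-rank or read the top results.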
Lukas Biewald:
Interesting. I love it. You're the first person that said that. I think it's a great answer. Okay. Then the final question is, when you look at the projects you've had of taking machine learning from conception to deployed, in production and useful, where were the surprising bottlenecks in that entire process?
Sean Gourley:
I think the surprising ones have been just the amount of training data and the importance of training data. I think coming in, we knew that data had to be clean. We knew that there were cost functions. We knew that there'd be deploy issues. We knew that there'd be security issues for deploying on-prem on sensitive data. All of that was known. I think coming into this, the importance of not just the ... And we also knew that there'd be a volume of training data. What I didn't think at the top of this and surprised me was the specificity of the training data drives the performance of the models in ways that are just not obvious when you start out on this. And these things are kind of excellent prediction machines, but they're also excellent cheaters. And they'll find ways to cheat and find the right answer, but it's because you gave them the wrong data. And I think that sensitivity to the data is something that's really surprised me.
Sean Gourley:
Now, on the flip side of that, if you start investigating methods of exposing these models to the right data, you also get wonderful performance in ways that go above and beyond sort of the general applications, so I think it's a blessing and a curse. But I don't think going into this if I'd been told that that would be the thing that kind of drove the most performance, that I would have agreed to that. So that's probably the biggest surprise.
Lukas Biewald:
Well, thanks so much. This was really fun and fascinating.
Sean Gourley:
My pleasure. I've enjoyed it a lot. Thanks, Lukas.
Lukas Biewald:
Thanks for listening to another episode of Gradient Dissent. Doing these interviews is a lot of fun, and it's especially fun for me when I can actually hear from the people that are listening to the episodes. So if you wouldn't mind leaving a comment and telling me what you think or starting a conversation, that would make me inspired to do more of these episodes. And also if you wouldn't mind liking and subscribing, I'd appreciate that a lot.