Emily M. Bender — Language Models and Linguistics

Cayla Sharp
Emily and Lukas dive into the problems with bigger and bigger language models, the difference between form and meaning, the limits of benchmarks, and the #BenderRule.
Emily M. Bender is a Professor of Linguistics and Faculty Director of the Master's Program in Computational Linguistics at the University of Washington. Her research areas include multilingual grammar engineering, variation (within and across languages), the relationship between linguistics and computational linguistics, and societal issues in NLP.

Show Notes

Topics Covered

0:00 Sneak peek, intro
1:03 Stochastic Parrots
9:57 The societal impact of big language models
16:49 How language models can be harmful
26:00 The important difference between linguistic form and meaning
34:40 The octopus thought experiment
42:11 Language acquisition and the future of language models
49:47 Why benchmarks are limited
54:38 Ways of complementing benchmarks
1:01:20 The #BenderRule
1:03:50 Language diversity and linguistics
1:12:49 Outro

The papers discussed

  1. "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜" (Bender, Gebru et al. 2021)
  2. "Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data" (Bender & Koller, 2020)
  3. "AI and the Everything in the Whole Wide World Benchmark" (Raji et al. 2020)
  4. "The #BenderRule: On Naming the Languages We Study and Why It Matters" (Bender 2019)

Transcript

Note: Transcriptions are provided by a third-party service, and may contain some inaccuracies. Please submit any corrections to angelica@wandb.com. Thank you!
Emily:
It's really important to distinguish between the word as a sequence of characters as opposed to words in the sense of a pairing of form and meaning. Because what the language model is seeing is only the sequence of characters. It's a bit easier to imagine what that's like if you think about a language you don't speak.
Lukas:
You're listening to Gradient Dissent, a show about machine learning in the real world. And I'm your host, Lukas Biewald. Today, I'm talking to Emily Bender, who is a Professor of Linguistics at the University of Washington, who has a really wide range of interests in linguistics and NLP, from societal issues to multilingual variation to essentially philosophy of linguistics. I'm especially excited to talk to her because she was actually my teacher for Linguistics 1 at Stanford University, where I was an undergrad. It was one of my favorite classes. I still remember it. I still remember a whole bunch of interesting facts that I learned. And it led to this lifelong interest in linguistics that I've really enjoyed. So, could not be more excited to have a conversation with her.

Stochastic Parrots

Lukas:
I thought it might make sense to start with the paper that you coauthored, "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜"¹, which was notable even to me on Twitter for a lot of controversy at Google. I was hoping you could maybe start by describing that, but then get into the meat of what the paper actually says.
Emily:
Yeah. So, it's not in the IPA and it's hard to pronounce, but the title actually includes an emoji. The last character of the title is a parrot emoji. We were doing that just kind of for fun, because we liked the stochastic parrots metaphor, and for a while before all this happened, we thought the notable thing about this paper would be that it was the one with an emoji in the title. Little did we know. The paper came about because of work that Dr. Timnit Gebru and Dr. Margaret Mitchell and their team were doing at Google, really trying to connect with the engineering teams to build in good practices to make the technology work better for more people and do less harm in the world. So, that was sort of the role that they had there. They noticed, especially Dr. Gebru, that there was this big push towards bigger and bigger language models. The paper has this table that just...the number of parameters and the size of the training data just explodes over the past couple of years.

Dr. Gebru actually direct messaged me on Twitter saying, "Hey, do you know of any papers that talk about the possible downsides to this, any risks? Or have you written anything?" And I wrote back and I said, "No, I don't know of any such papers and I haven't written one. But off the top of my head, here's five or six things that we can be worried about." About a day later, I said, "You know what, that feels like a paper outline. So, here's a paper outline, you want to write this?" That was early September, and the conference we decided to target was FAccT, the Fairness, Accountability, and Transparency Conference, which took place finally in March 2021. Submission deadline was October, I think, 8th of 2020. So, in a month, we put together this paper. That was possible because it actually wasn't just the two of us writing it, or even just the four named authors in the end, but in fact we had seven authors. So, Dr. Gebru brought in Dr. Mitchell — it's really important to me to emphasize that they have doctorates, but I also know them well enough that I'm going to start full naming them now, or first naming them actually — so, Timnit brought in Meg and three other members of their team, and I brought in my PhD student, Angelina McMillan-Major. Between the seven of us, we sort of had enough different areas of expertise and literatures that we'd read that we could pull together this survey paper.

And so, it came together. It was amazing, and also an interesting writing experience because we never had a Zoom meeting or anything where all of us spoke together. It was all done through remote collaboration in Overleaf. So, not a super common way for research to get done, but it worked in this case. The Google authors put it through what they call pub approve over there, and it got approved. We submitted it to the conference and then put it away, because none of us had actually anticipated working on that in the month of September. It was like extra work for everybody, so we all turned back to the other stuff we needed to be doing. In late November, out of nowhere from my perspective — and I should say that in telling the story, I'm not at Google, I've not been funded by Google, and so I only have sort of secondhand understanding of what went on at Google plus what was out in the press eventually — the Google coauthors were told to either retract their paper or take their names off of it, and they weren't told why.
They weren't offered a chance to discuss what might need to be changed about the paper, it was just "retract it or take your names off of it". We had this strange moment of, "Okay, what do we do with this paper?", because it seems kind of odd to put something out with just two authors that actually represents the work of seven people, what do we want to do here? My PhD student Angie and I, we turned to the Google coauthors and we said, "We will follow your lead here. What do you want to have happen?" And they said, "No, we want this out in the world. So, you two publish it." That was the initial answer. And then Timnit, sort of on reflection, said, "Actually, this is not okay. This is not an okay way to treat a researcher who was hired to do this research." This was literally her job and the job of everyone on that team. And so, she pushed back.

The result of all that, you can go find in all the media coverage, is that she got fired. Google claimed she resigned. Her team says she got resignated, which is a great neologism. That went down fast enough that she was able to then put her name on the paper. Meanwhile, Meg started working on documenting what had happened to Timnit. The end result of that was that she was fired a few months later, but after the final version of the paper was done. So, that's why the fourth author is Shmargaret Shmitchell. That's a really sad story for everybody involved. I mean, it's terrible mistreatment of Timnit and Meg and the other members of the team, those who are on our paper and those who weren't. It's become, I think, a really difficult environment to work in. It's sad for Google because they lost really wonderful expertise and a lot of goodwill in the research community. And sort of sad for...it sheds light on the sad state of affairs about the way corporate interests are influencing what's happening in research in our field right now. On the other hand, my coauthors and I still maintain...we all really enjoyed the experience of working on this paper together and of weathering the aftermath together.

One weird result is that this paper has gotten way more attention than it ordinarily would have. I think it's a good paper, it's a solid paper. And boy, did we put a lot of polish on it between the submission version and the camera-ready because we knew it was going to be read by a lot of people. When I put up the camera-ready as a preprint...I didn't put it on arXiv, because those tend to get cited instead of the final published versions. So, I just put it on my website and tweeted out a link with a bitly link to shorten it, so that I could see how many times it was downloaded. It has been downloaded through that link alone over 10,000 times, and I know that there's other ways to get to it. That is way out of scale to anything that I've ever written otherwise. So, that's been interesting as a researcher, but it's also I think fortunate because it has come to the attention of the public, and I think that this technology is widespread. It's being used. It's being used in lots of different ways. It's really valuable that the public at large has a chance to understand what's going on. And so, through Google's gross misstep, I and my coauthors have been given the chance to help educate the public, which is something that I do feel fortunate about.
Lukas:
I'd love to kind of get into what the paper talks about. But do you have any sense, or has Google made any comments, about what their objection was? Because I sort of had this feeling that it must be a really incendiary paper, and then in the prep for this interview I actually read it and it felt pretty uncontroversial, I guess, was my feeling reading it. So, I just wonder...I mean, maybe it's hard to know, but have they said anything about what they didn't like about it?
Emily:
Yes, I mean, in public comments, there's been things like, "It doesn't cite relevant work that is trying to mitigate some of these issues." But at no point were we ever told which work we should have been citing. And we do, in fact, cite some work that is trying to mitigate these issues. So, I don't know quite what that was about. But you're absolutely right. We figured that we'd be ruffling some feathers with this paper because we were basically saying, "Hey, this thing that everyone's having so much fun chasing, maybe let's go a little bit slower and think about what kinds of downsides there are and how to do this safely." There's going to be some who don't want to hear that, but we honestly thought it was going to be OpenAI who was upset, because GPT-3 is kind of the best known example of this, and it was our running example, too. So, we thought we'd ruffle some feathers, did not realize we were going to be ruffling feathers inside Google. And, it's basically a survey paper. We didn't run any experiments. We didn't do any analysis. What we did was we pulled together a bunch of different relevant perspectives on large language models and brought them all together in one place. It is surprising that the paper seems to have been part of the cause of Google basically blowing up this amazing asset that it had in terms of its ethical AI team.

The societal impact of big language models

Lukas:
Interesting. And I guess, one reading of your paper is, "Hey we should consider the downsides of large language models." I think maybe another person might read it...this might be an unfair reading, but I could imagine someone having hurt feelings if they were working on large language models, and they read your paper saying it's like an unethical thing to do, to build large language models. Would that be an overstatement of your claims? I don't have the paper in front of me, but I think maybe that could hurt feelings, I'm not sure.
Emily:
I also do a lot of work in the space...societal impact of NLP in general, and that sometimes goes under the title of ethics and NLP. I do see a lot of people reacting to that topic with hurt feelings, and I think it's connected with the way in which people identify with their work. If you say, "Hey, let's think about this technology we're building and how it behaves in the world and what we can do to make it beneficial," and you use the term ethics to describe that, sometimes people want to read that as, "You're calling me unethical," and I think that that direction of the conversation is rarely actually valuable. I do think that, in general, people in this space want to be doing good things in the world. Certainly, there are people who are working on technology with the goal of making a lot of money doing it. There's this caricature of the tycoon or whoever, who's just happy to crush all the little people to make as much money as possible, that's out there, probably. I think much more frequently, people are working within systems that give them certain commitments around maximizing value for shareholders and stuff like that, that make it harder to put the brakes on some things that are making money right now for shareholders and take a bigger-picture view. But it is much more valuable to talk about it in terms of what are those systems, what are the incentives, what can we as individuals do within those systems, rather than think about people as ethical or unethical. I'm not sure that really speaks to your question, but hopefully it's somewhat helpful.
Lukas:
No, I mean, I think you're saying that your point is a little more nuanced than what maybe someone would take away, and I can...I run a company and I love technology and I love building...I do recognize that lots of people get hurt, and I think it's great that people are pointing out issues and also kind of pumping the brakes and flagging this stuff. But I could kind of see how someone might feel a little offended by it. I wasn't sure if I was kind of jumping to something or- I guess my question, well, the question that I kept thinking about with the whole paper in general as I was reading it is, even sort of setting aside making money, let's just talk about research and just the excitement of building models that work. I just feel that so deeply, like GPT-3, for all its fuss, it's kind of amazing what it does. I wouldn't have expected it to work so well. Would you feel...would you argue that those kinds of directions of research should stop? Or what would you want an organization like OpenAI to do differently? It's a good example of a case that kind of actually really showed that bigger models do kind of... it's not obvious that bigger models would perform tasks better at many extra orders of magnitude. Would you prefer that that research doesn't happen or happen differently somehow?
Emily:
So, I think it's worth saying that OpenAI has actually put a lot of effort into thinking about "What are the possible downsides?" and "What could happen when this technology is released in the world?", and that's important to note, and I'm glad that they're doing that. I think that what I would like to see more of is...first of all, that kind of work. What are the possible failure modes and how do they impact people? And then also, when this is working as intended, how can that impact people? OpenAI has been doing some of that and I think that's great, and they should do more. But also, you can look to other fields of engineering, where before you take something and you put it into the world in a place where people are going to rely on it, there's all kinds of testing that has to be done in sort of understanding of "What are the tolerances?" and "What works and what doesn't?" and "What's the range of temperatures that this thing could be applicable in?" and "What are the things you have to check for and certify?" and things like that. We don't have very much of that yet going on in NLP. I can speak less to other areas of AI, but I honestly think there's similar issues elsewhere in AI. And so, there's work — actually that was done at Google by Meg Mitchell and Timnit Gebru and others on a framework called Model Cards, which was sort of steps in that direction of like, "You've built a model, what does somebody who's going to use this model need to know about it?" — and that's the kind of thing that I would like to see more of. And that is in contrast to just rampant AI hype, where people build something, it's cool, it's fun, it works well, and somehow, that's not enough. People have to say...it's not enough that GPT-3 can produce coherent text, people have to say it's understanding language, which it absolutely isn't, as I'm sure we'll talk about later.
Lukas:
You have two good segues, but yeah, yeah.
Emily:
Yeah. Although it is all connected, right?
Lukas:
Yeah.
Emily:
So, for some reason, the culture around AI is all about trying to reach for these big claims rather than trying to build really well-scoped, reliable — sufficiently documented that they can be used safely and reliably — systems. That's the direction that I would like to see more of, is one thing. And then another thing, and we get into this in the paper, is that if the main pathway to success these days is just bigger and bigger and bigger, then you cut out lots of language communities, even within the languages that generally are well supported, because they just can't amass that much data. And you also cut out smaller research groups, smaller companies that are not sitting on the kind of collections of data that Google is, or Facebook is, or Amazon is. Microsoft also does a bunch of big data work, though they don't seem to have amassed data quite the same way as the other big ones. That is unfortunate because it, I think, stifles creativity to a certain extent. If the whole community is rushing towards this one goal that only some can really effectively do, then we lose out on the other things that people might be trying instead.

How language models can be harmful

Lukas:
Maybe a less obvious concern that you talked about in the paper is how the models can encode bias in ways that are hard to notice. When you talk about the harms that might happen from natural language models, do you have examples of things that are actually happening now? Or is this more of a future-looking thing that we're worried about as NLP becomes more pervasive, like worrying about future harms?
Emily:
So, I mean, absolutely happening now, and therefore easy to predict that it will keep happening in the future if we don't change. Here, the work of Safiya Noble, with her book "Algorithms of Oppression", is a really important documentation of this. She looked into what are the ways in which identities — which properly belong to the groups of people who have those identities — are represented and reflected back to people in search. In particular, her running example is the phrase "black girls" and also "black women". These things have changed over time — and she's very careful to document when she talks about particular examples what the date was — but early on, as she started this project, the phrase "black girls" as a search keyword basically turned up pornography. And that you might say is, "Well, that's just in the data." Well, what data? Where did that data come from? If you get into the heart of her book, it's basically around that that's "in the data" because of the way in which the economy of the internet allows people to purchase and make money off of identity terms. Once these things were flagged, Google sort of piecemeal [made] changes, so you don't get pornography as the results for the search term "black girls" anymore. But it's also possible to sort of poke at things and tell that it's very much individual after-the-fact changes, as opposed to anyone going through and systematically thinking about how to redesign the way that search engines and the advertising-driven ranking of search latches on to these incentives and then amplifies them.

One ongoing discussion in the AI community — you see it pop up on Twitter with great regularity — is: is the problem only that the data is biased, or do the models also contribute? And the answer is absolutely models also contribute. And then, there's another layer to it of, "Well, that's just what's in the data." One of the other really embarrassing examples for Google was, there was a point at which Google Image search turned up pictures of gorillas when you were searching for black people, and I forget exactly the particular configuration of that, but embarrassing and awful and racist. One reaction at the time was, "Well, that's just in the underlying data." And so, "Not our fault. We're just showing what the world is saying.", except that it's not true, because the way the algorithms that do the ranking of search results and also the bidding for the ad words is...that is emphasizing particular incentives. So, there is a certain thing in the underlying data. There's also the question of how did you collect that data? Where did it come from? What does it actually represent? It is not the world as it is. It is some particular collection of data. And then what is the optimization metric? What are all these modeling decisions that you've made? And how does that interact with the various biases in the data? And what's the incentive structure? Safiya Noble's work is a great place to look.

Latanya Sweeney documented — this is a 2013 paper — how if you put in, at that point, an African-American-sounding name, one of the ads that would pop up suggested that that person had a criminal history. And if you put in a white-sounding name, you tended to get just more information about so-and-so.
And that does real harm in the world — it wasn't 100%, but it was significantly different between the two groups of names — it does real harm in the world because if you imagine someone is applying for a job or just making friends and someone does a Google search on them, and here comes alongside this message suggesting they might be a criminal, that does harm. And then if I can give one more example—
Lukas:
Please, yeah, these are great.
Emily:
Elia Robyn Speer did a really interesting worked example around sentiment analysis and word embeddings. Sentiment analysis is the task of taking some natural language text, and her example is English, and using it to calculate or predict the sentiment. Is this a text expressing positive feeling towards something, negative feeling towards something, or not expressing feelings? The particular data set she was working with, I think, was Yelp restaurant reviews. So there, it's "Take the text, predict the stars."
Lukas:
Yeah, I've used that data set. Yeah, for sure.
Emily:
And then as an external component, she's using word embeddings, which are representations of words in a vector space based on what other words they co-occur with. So, some of the training data is in-domain, the Yelp reviews, but then there's this component that's trained on general web garbage. What she found using the sort of generic word embeddings was that the system systematically underpredicted the star ratings for Mexican restaurants. All right, so she digs into it and looks into why. It turns out that because that general web garbage included the discourse about immigration into the US from and through Mexico, which has lots of really negative toxic opinions of Mexican people, the word embeddings picked up the word "Mexican" as akin to other negative sentiment words. And so, if in your review of the restaurant you called it a "Mexican restaurant", according to the system, you have said something negative about it, so you can't possibly be giving it a five-star review.
Lukas:
Well, that's a really interesting example. My next question was going to be how do models play into this? I guess that's a good example of how not just the underlying data can have bias, but the model can literally have its own bias.
Emily:
Yeah, so the word embedding picked up on co-occurrences between the word "Mexican" and lots of other things that also co-occurred with negative sentiment, and then that was used as a component in this other model. So, yeah, there wasn't in the underlying Yelp reviews any particular reason that the Mexican restaurants were rated lower, right?
Lukas:
Right.
Emily:
I don't know for sure if they were rated on average exactly the same, but it doesn't matter, because the error was the system underpredicting for any given restaurant. On average, it was missing in the low direction. So, yeah, that's a kind of bias that was picked up from an external data set. We tend in NLP to use word embeddings as really handy detailed representations of word "meaning", so word similarity, including semantic similarity. And if we don't pay attention to what meaning was picked up, what co-occurrence was picked up, then we can end up with stuff we really don't want in our systems.
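
As a minimal sketch of the kind of association being described here (not Speer's actual experiment; the pretrained GloVe vectors, the gensim loader, and the word lists below are arbitrary illustrative choices), one can measure how close a cuisine term sits to a handful of negative sentiment words in an off-the-shelf embedding space:

    # Illustrative probe of co-occurrence-based associations in pretrained embeddings.
    # The embedding set and word lists are arbitrary choices, not Speer's setup.
    import gensim.downloader as api

    vectors = api.load("glove-wiki-gigaword-100")  # pretrained GloVe word vectors

    negative_words = ["illegal", "criminal", "dangerous", "bad"]
    cuisine_words = ["mexican", "italian", "french", "chinese"]

    for cuisine in cuisine_words:
        # Average cosine similarity between the cuisine term and the negative words
        score = sum(vectors.similarity(cuisine, w) for w in negative_words) / len(negative_words)
        print(f"{cuisine}: {score:.3f}")

If an identity or cuisine term lands systematically closer to negative sentiment words than its peers do, a downstream classifier that leans on those vectors can inherit the association, which is the mechanism behind the underpredicted star ratings.
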
Lukas:
What would you recommend doing about that? Because they are really useful, word embeddings. And I'm sure in this case, it seems pretty simple. It's hurting your performance. There's not even a model performance tradeoff here. So, what could you possibly do?
Emily:
There is a lot of work on so-called debiasing of word embeddings. If you look at Speer's work, she continues on to do some of that. And I think that part of it is, work with more curated datasets. The discourse around immigration from and through Mexico, even if you stick with only things like reputable news sources, you're still going to find that garbage. That alone is not going to solve it, but it can be better. It's not possible to come up with a fully bias-free dataset nor fully bias-free word embeddings, but you can do better. One step is to sort of say, "Okay, how much better can we do with curated data? What about debiasing techniques for the biases that we're aware of?" Part of the problems with debiasing techniques is that you have to know what you're looking for. And then on top of that, to think through failure modes. So, in a particular use case, when you're building some technology, who are the stakeholders? Who's going to be impacted by it? If someone's restaurant rating is underpredicted for some reason, what does that mean in an actual use context? And what should we be testing for to see if we have sufficiently debiased for our use case, for the stakeholders who are most likely to experience adverse impacts?
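
One family of mitigation techniques alluded to here works by projecting a bias direction out of the embedding space. The sketch below is in the spirit of hard debiasing (Bolukbasi et al., 2016), not necessarily what Speer did; the single seed pair and the choice of embedding set are assumptions for illustration only:

    # Projection-based debiasing sketch: remove the component of a word vector
    # that lies along a chosen "bias" axis. Real methods build the axis from many
    # seed pairs and then evaluate what was (and was not) actually removed.
    import numpy as np
    import gensim.downloader as api

    vectors = api.load("glove-wiki-gigaword-100")

    def remove_direction(vec, direction):
        # Subtract the projection of vec onto the (normalized) direction.
        unit = direction / np.linalg.norm(direction)
        return vec - np.dot(vec, unit) * unit

    sentiment_axis = vectors["good"] - vectors["bad"]  # crude axis from one seed pair
    debiased_mexican = remove_direction(vectors["mexican"], sentiment_axis)

As the discussion above notes, this only helps with biases you thought to name: the axis has to be built from seed words you chose, and anything you did not look for stays in the vectors.
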
Lukas:
I guess, it does seem like it would be incredibly...I mean, it seems like it'd actually be impossible to find an unbiased dataset of human...
Emily:
Right. It doesn't exist.

The important difference between linguistic form and meaning

Lukas:
I guess these are good segues into other papers that I want to talk about. So, maybe, just in the interest of time, we should move on to the second paper² that we want to talk about to make sure we get to it, which is around... Let me see if I can summarize this. So, this is basically sort of saying that language modeling only on what you call "form" — which I think is just sort of like the words coming through, this is kind of like the GPT-3 types of models that just look at these strings of words — can't have understanding, like true understanding. I just thought one thing that was interesting is that you said you wrote the paper to sort of end some kind of debate on Twitter that I was definitely not aware of. Actually, I think I'm kind of coming into something with maybe more context than I knew. So, maybe you can sort of summarize what the different possible positions are here and what you want to put to rest.
Emily:
So, I kept finding myself getting into arguments on Twitter with people who were claiming that language models were understanding things. And I was like, "No, they're not. They can't possibly be." It's important to pin down what we mean by language models. So, a language model is something like GPT-3 or BERT or otherwise, where its training data is a whole bunch of text, and the training task is predicting words in the text. So, sometimes it's done sequentially, sometimes it's done with a masked language model objective, where certain words are dropped out and the training objective is, "Okay, well put those words back in and then do your model updating to...", gradient descent, et cetera, et cetera. For me, as a linguist, I look at that and go, "Hey, useful technology, interesting. Incredibly helpful in things like speech recognition and machine translation where an important subtask is, 'Okay, what's the likely string?'" So, in a speech recognition setup, the acoustic model says, "Here's a range of text strings that the sound might have corresponded to," and then the language model comes in and says, "Okay, yeah, but 'It's important to wreck a nice beach.' is a ridiculous thing to say, and 'It's important to recognize speech.' is a reasonable thing to say, so we're going to rank that one higher." That's the kind of form-based task that they were initially meant for and good at. And then what's happened with the neural language modeling revolution in the past few years is that when you extract the word embeddings from a language model, you have a really finely fitted representation of word distribution, which is very useful, and with some of them, the word embeddings you get are contextual. So, the information about the word and what it's likely to co-occur with isn't about that word across all the texts, but about that word in its current context. Super useful, but not the same thing as understanding language. I kept getting into arguments with people who were not linguists who wanted to say, "Yeah, it is." So, Alexander Koller and I wrote this paper to just sort of say, "Okay, look, here's the argument why not," with the hopes that that would put an end to it, and it didn't. People still want to come argue with me about this.

The thing that is really hard to see — and sort of the value of linguistics in this place — is that when we use language, we use it...and I'm sorry, I'm going to pull out a philosopher on you here, but Heidegger has this notion of thrownness. So, you're in a state of thrownness when you are not aware of the tool you are using. If you think about typing on a keyboard, when it's going well the keyboard disappears. And then, you have a key that sticks and then all of a sudden, the keyboard is very "there" for you again. Well, language is the same way. When we are speaking a language that we are fluent in, it is not very visible to us until something makes us focus on it. And of course, linguistics is all about focusing on the language. So, linguists are used to doing that.

When we talk about giving words to a language model, it's really important to distinguish between the word as a sequence of characters as opposed to the word in the sense of a pairing of form and meaning. Because what the language model is seeing is only the sequence of characters. It's a bit easier to imagine what that's like if you think about a language you don't speak. So, what's a language you don't speak?
Lukas:
Mandarin.
Emily:
Mandarin. Okay. You don't speak Mandarin. I assume you also therefore don't read Mandarin.
Lukas:
Definitely don't.
Emily:
Maybe recognize a couple of the characters?
Lukas:
I mean, I read Japanese. So, there's some overlap.
Emily:
Okay, so let's go further away. Do you read Cherokee?
Lukas:
No, definitely not.
Emily:
Okay. So, Cherokee has got this wonderful syllabary, it's a writing system where the characters represent syllables. If someone showed you a whole bunch of Cherokee text, that experience of looking at it would be a better model for what the computer is doing than you looking at English text, because you can't help but get the meaning part when you're looking at it. Because English is a language you speak and read. Mandarin is kind of in between there because you would pick up a few of the hanzi that you recognize from Japanese kanji, and it wouldn't be quite the same.
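
The point about form is easy to see in a masked language model's own output. As a minimal sketch (assuming the Hugging Face transformers library and the bert-base-uncased checkpoint, neither of which is discussed in the episode), the model fills in the blank with high-probability strings learned from distributional patterns alone:

    # A masked language model predicting a missing word purely from form:
    # nothing in the training objective required any grounding in meaning.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="bert-base-uncased")

    for prediction in fill_mask("It's important to [MASK] speech."):
        # Each candidate is just a likely string continuation in context.
        print(prediction["token_str"], round(prediction["score"], 3))

Plausible completions like these are exactly the kind of string ranking described above as useful for speech recognition, without any claim that the model knows what speech is.
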
Lukas:
I guess...I don't know, I don't want to argue with you. But I do want to, I guess, advocate for... I don't know, I mean, I have not thought deeply about this topic. What I have seen in my life is these language models working better and better than I could have imagined from the strategy that they employ and sort of seeming like they're getting more and more subtle detail. Of course, when I was a kid, I learned about the Turing test, which seems like a pretty good test of understanding on its face. I think the test is like, if you have a conversation with something and you can't tell if it's an automated system or a human, then we can say that it has intelligence, and it sort of seems to me like these language models are on the verge of passing the Turing test. What would it take for you to feel like some automated technique actually has understanding of what it's consuming?
Emily:
Yeah. So, I think the first thing I want to say about the Turing test is the reason it doesn't work...and I hate to disagree with a giant like Turing, because Turing's work was really important and foundational—
Lukas:
But it was 100 years ago, it's possible to miss something.
Emily:
70?
Lukas:
70, fair, fair. All right, 80? 70? Okay, 70. Sorry.
Emily:
As it turns out, people are too willing to make sense of language and too willing to sort of build the context behind something that would make something make sense. And so, we are not well positioned to actually be the testers in a Turing test. That's why that doesn't work. Language models can come up with coherent-seeming text: probable sequences, given a little bit of noise and where you start, of what would likely come next based on all that training data. Then it sort of comes out as something that we can make sense of, and then we are sort of easily fooled into thinking that it actually meant to communicate that. So, you're asking the question of "What would show that a machine has understanding?" I think part of it is, well, let's talk about actually interfacing with the world in some way. We certainly do have cases where machines in restricted domains, for restricted ranges of things that they can do, do understand. So, when you ask your local corporate spy bot to do something for you and it does the thing, it has understood.
Lukas:
Wait, sorry, what's a local corporate spy bot? Sorry, could we make this a little more concrete?
Emily:
I'm making a snarky remark about the privacy implications of things like Siri and Alexa and Google Home.
Lukas:
Oh, I see, I see, gotcha.
Emily:
And Samsung Bixby is in the same space. Microsoft had Cortana. Right?
Lukas:
Right, right. Gotcha.
Emily:
So, when you ask those things to set a timer, or turn on the lights, or dial a phone number or whatever, and it works, then, yes, to a certain extent, it has understood. And it has understood because its training setup was looking at not just language but something external to the language that needed to map to that. And so, that's a kind of understanding. The question is — for somebody who was interested in doing that across some more general range of things — the question is, how do you set up tasks that require some kind of action in the world, so that it can't be done just by bulldozing it with a language model and say "Well, this is a likely thing to come next.", right?

The octopus thought experiment

Lukas:
You got to describe your octopus thought experiment because that was very evocative. And I have some questions.
Emily:
Okay. So, the octopus thought experiment is about not just being able to understand but learning to understand. That's the difference between it and both the Turing test and Searle's thought experiment, where both of those basically say, "Imagine someone has set up the whole system." Then we could test for intelligence or we can...from a philosophical point of view, say it's still not understanding. So, the system exists and we are thinking about it or testing it. The octopus is this thing of saying, "Okay, if we had something that we assume, we posit that it is hyperintelligent..." and then that's part of why we picked the octopus. In fact, it was initially a dolphin, but we decided that octopuses are inherently more entertaining. Also, it was better because a dolphin's environment is a bit closer to a human's environment. So, we wanted the octopus to be something that is posited to be super intelligent. And they are, I think, understood to be intelligent creatures...as smart as it needs to be. That's not the issue. We are assuming intelligence, but then we are only giving it access to the form of language.

In our scenario, you have these two English-speaking humans who end up stranded on two nearby islands. They're otherwise uninhabited, but they've had previous inhabitants who set up an undersea telegraph cable. These two humans can communicate with each other. We left it offstage how they discovered the telegraph or that the other one's on the other end, whatever, just assume it exists. It's the thought experiment, you can do things like that. You know, assume a spherical cow, except we don't need spherical cows. So, telegraph cable and the humans are named A and B. They're basically using English as encoded in Morse code to talk to each other. This hyperintelligent deep sea octopus that we called O comes along and taps into that cable. The octopus can feel the pulses going through for Morse code. The question is, what could the octopus actually potentially learn here? Because this is a hyperintelligent octopus — it's got as much time as it wants, as much memory as it wants — it is able to very closely model the patterns of what's likely to come next. In our story, the octopus decides for some reason that it's lonely and it's going to cut the cable and pretend to be B while talking to A. On reflection, it's like, "Poor B, just cut off from the world." So, maybe the octopus is also talking to B pretending to be A, but we don't talk about that part.

The question is, under what circumstances could the octopus continue to fool A that it's actually B? We say this is, in a sense, a weak version of the Turing test. The way the Turing test was set up, A is given the task of deciding "Am I talking to a human or not?" And here, there's subterfuge. The octopus, its mere existence is unknown to A. If there's just sort of like chitchat pleasantries, those things you can just kind of follow a pattern and it's relatively inconsequential as long as what's coming out is internally coherent. And even if it's a little bit incoherent, well maybe B is just being silly. It doesn't matter so much. Okay, well, O could get away with that. But once you get more towards things where A actually really cares about communicating ideas to B and getting ideas back from B, it's going to get harder and harder for the octopus to maintain this semblance of good communication. We go through this example where A builds a coconut catapult and the octopus is able to send back sort of like, "Very cool invention. Great job."
or something, even though A was asking for like, "Well, what happened when you built it?" But the octopus has no experience of things like coconuts or rope or stuff like that. So, it can't reason about those things in the world, or even know that A is actually talking about them. All it can do is come back with, "Well, what's the likely form of a response in this context?" To the extent that O gets away with that, it's because A is willing to make sense of those utterances. O has no meaning in this scenario. And then finally, we have a bear show up and start attacking A, and A says to O — or to B, actually — "Help. I'm being attacked by a bear. All I have are these two sticks, what should I do?" At that point, O is utterly useless, and so we say this is the point at which O would definitely fail the Turing test if A survived being eaten by the bear. But then we tried with GPT-2 like, what would it say? The answers were hilarious. The words are in the right topic area enough that it comes back with something funny and I encourage people to go look at the appendix to our paper where we put these, but it's never going to be helpful. And it's not actually expressing communicative intent.
Lukas:
Well, I have to say, walking into that paper without knowing the context, I really enjoyed it. For me, I especially enjoyed the sort of concreteness of the thought experiment, which was evocative but also makes you think, "Huh, what do I think about that?" What I kept thinking was...for me, I feel like I've learned about a lot of things that I haven't experienced, I was especially thinking about learning math, where there's all these abstract topics. I feel like in a way I learned about math, in some sense, through form almost. It's all in my head. I'm like learning things by visualizing them. It seems possible to learn to reason about things that you haven't seen or experienced just from a stream of words. I even remember actually grading blind students' papers. It was really interesting, how they walked through stuff in a math class, and it seemed like they were visualizing things even though they were blind from birth. So, I'm just wondering ... I guess, I'm not totally convinced that the octopus couldn't somehow figure out what a catapult does if it listened to all that language.
Emily:
So, if the octopus had actually had a chance to learn English, then yes. But it didn't because it never got that initial grounding. And we absolutely learn things through language that are outside of what we've directly experienced. Conversely, if you as a sighted person wanted to understand what it was like to live as a blind person, you could listen to or read what a blind person has to say about that and learn about it. So, that's definitely something that we can do. But we can do it because we have acquired linguistic systems. When we use language to communicate, we absolutely tell each other ideas and things that are outside of even our own experiences, right? We invent things, and then transmit that to other people. But we do that based on this shared system that tells us, "Okay, here's the range of possible forms. These are the well-formed words and sentences. These are the sounds that we use in this language. These are the ways the words are built up, the sentences are built up. And these are the standing meanings that they map to." And then we use those standing meanings to make guesses about communicative intent. The problem for the octopus isn't that it's not smart. We said it's hyperintelligent. It isn't that, if it knew the language, it couldn't understand those things. It's that its exposure to the language is not set up so that it can actually learn it as a linguistic system; all it can learn is distributional patterns.

Language acquisition and the future of language models

Lukas:
I guess what prevents the octopus from learning language over time like a human probably would?
Emily:
Okay, so, it doesn't get to do...and in the paper, we go into human language acquisition. For first language acquisition, it's all about joint attention. When babies learn language, it starts from social connections to their caregivers and understanding that the caregivers are communicating something to them, and then mapping the words onto those communicative intents. The child language literature talks about the importance of joint attention, that kids learn words when their caregivers follow into their attention, and attend to the same things, and then provide those words. That experience, that mapping, the octopus doesn't get that. It's just getting the words going by.
Lukas:
Do you think there's some algorithm possibly that could exist, that could take a stream of words and understand them in that sense?
Emily:
Natural language understanding is a tremendously difficult problem because it relies not just on the linguistic system, but also on world knowledge and common sense, reasoning, all kinds of things. So, you can certainly — and I'm sounding more certain than I actually am here — but there's a big difference between saying, "I'm going to build an algorithm that has understanding of linguistic structure, has understanding of linguistic meaning, has understanding of how those meanings map to a model of the world, and then use that to understand," versus "I'm going to build a system that only gets linguistic form and assume that it will get to understanding in some way." So, yes. You could go much, much further with algorithms that have more in their input, in their training input, than just form. That's going to be things like visual grounding. It's going to be things like the ability to possibly query people for answers. It might be knowledge bases. It might be other sensors in some sort of embodied... I'm not saying that natural language understanding is impossible and not something to work on. I'm saying that language modeling is not natural language understanding.
Lukas:
But just so I'm clear, just consuming language without kind of all this extra stuff, you're arguing that no algorithm could from just that really understand language?
Emily:
By language, I mean, form. Imagine that you are dropped into the Thai equivalent of the Library of Congress, and you have around you any book you could possibly want in Thai, but only in Thai. For some reason, this library doesn't have Thai-Chinese, Thai-French, Thai-English dictionaries. It's just Thai. Could you learn Thai?
Lukas:
I think so. I guess what's hard is that I have a language already. But I feel like I-
Emily:
So, what would you do? What would be your first step to learning Thai if you have just oodles and oodles of Thai books and that's it around you?
Lukas:
What would I start to do? I mean...I'm not sure. Do you think I couldn't learn Thai?
Emily:
So, I'm curious about what you ... So, you as a person, could you learn Thai? Sure. You could go take a Thai language class.
Lukas:
No, no, I mean from in this situation, just sort of dropped in. I mean people do learn... How did people learn hieroglyphics or something when there's no one around that still knows it? Do they need to find like a Rosetta Stone? Or can they-
Emily:
The Rosetta Stone is what unlocked the hieroglyphics. If you don't have something like that, then what you have to do is resort to hypotheses about distributions and say, "What do we know about the world in which these texts were written? What do we know about how languages work?" Can we say, "Okay, well, given frequency analyses and the length of the words, it seems like a language that's got separate function words instead of lots of morphology. So, that thing might be an article, that thing might be a copula," and you could do some analysis like that. It's not what language models are doing. To get from those sorts of structural things into something about meaning, you have to make guesses about what's being described. You have to basically bring in some world knowledge and say, "How well does this fit?" When I asked you the question of what would you do, I was thinking, well, possible answers are, "I would go find an illustrated encyclopedia that has pictures in it." There's some visual grounding. Or I would go find a book from whose cover I could tell it was actually the Thai translation of Curious George.
Lukas:
These are great suggestions.
Emily:
Yes. But all of that is bringing in external things. And then once you have a foothold, you can build on it. That's an interesting way to go. But if you just have form, it's not going to give you that information.
Lukas:
Wow, interesting. Thank you, this was really interesting. I guess, my last question on this topic is, do you sort of predict that these language models will run into problems that we'll really experience and then we'll have to kind of change the approach? Or do you think that as our bar for applications of natural language goes up, they'll just sort of adapt and find ways to incorporate external information, kind of like finding the Curious George translation?
Emily:
I think that language models are going to remain useful. I mean, language models have been an important component of language technology since Shannon's work in the 1950s. This is longstanding. But I think that we are likely — it's so hard to predict the future — but my guess is that...or maybe what I would like to see is that we get to a more stringent sense of what works, what an appropriate range of failure modes is, and what kind of fail-safes we need. People are going to find that putting language models at the center of something where your application really requires you to have a commitment to the accountability for the words that are uttered is going to be a very fragile way to go. My guess is that when we get to that point, we're going to de-center the language models and have them be something that is selecting one possible output again or providing these word embeddings, but they are not a step towards general-purpose language understanding the way they are hyped to be. That's one set of problems. If you have to have accountability for the words that are uttered, you do not want a stochastic parrot. You want something that will speak for you in a reliable way, not just make up what sounds good.

The other thing is, if we take seriously these issues around encoding and amplifying bias in training data, I think we're going to find that we want to work with algorithms that can make more of smaller datasets, so that we can be better about curating and documenting and updating those datasets so that they stay current with what's going on, rather than this path right now that relies on very large language models. So, those are my guesses.

There's also the environmental angle. Well, actually, the energy-use angle is both environmental but also about technology, to a certain extent. I think there are more and more people — and there's Schwartz et al., Strubell et al., Henderson et al. — a bunch of work now saying, "Hey, let's make sure we're also measuring the environmental impact as we do things, or the carbon footprint, so that we can direct effort to doing things in a more and more efficient way." There's that angle, but there's also many situations where you don't have the whole cloud available. If you want to do computing on a mobile device, you're not going to be able to have an absolutely enormous language model in there. There's pressure to find leaner solutions. I think that's a win-win, environmentally and then in terms of more flexibility with technology.

Why benchmarks are limited

Lukas:
Totally, totally. And it's a good segue because you pointed out a bunch of this stuff in your papers about benchmarks, which I'd love to talk about a little bit, and maybe you could kind of summarize...maybe start [with] what are benchmarks, probably most people know, but then what are the possible pitfalls with them?
Emily:
Yeah. I should say this is a paper called "AI and the Everything in the Whole Wide World Benchmark"³ that we presented at a workshop called Machine Learning Retrospectives at NeurIPS last year. It's joint work with Deb Raji, Alex Hanna, Emily Denton, and Amandalynne Paullada. Another collaboration where...in this case, we actually do have meetings where we talk to each other, but of those people, the only one I've met in person so far is Amandalynne, who's a PhD student in my department. Pandemic life, right? We got together because we were talking about the ways in which benchmarks are being misused in the AI hype machine and in AI research that is striving for generality and overclaiming what the benchmark shows.

So, a benchmark is basically a standardized data set, typically with some gold standard labels. Although you could also have benchmarks for things where the labels are inherent, like language modeling. What word actually came next is the gold standard label. The idea is that you might have a standardized set of training data, or possibly not, and then you've got the standardized test data. People can test different systems against this. You have this chance of saying "Which system is more effective in this training regime?" or "Given this training data against that test data?" So, that's a benchmark.

You asked me before if I could summarize the problems with benchmarks, and it's not so much benchmarks I have a problem with, but the way that they're used. I think this is an example of "the map is not the territory". People will tend to say, "Oh, here's this benchmark about computer vision." ImageNet is that. Or, "Here's a benchmark about natural language understanding of English," and that's GLUE and SuperGLUE. People will say...I've actually seen this in like a PR thing that came out of Microsoft saying that computers understand English better than people now, because this one setup scored higher than some humans on the GLUE benchmark. That's just a wild overclaim, and it's a misuse of what the benchmark is for. So, what's the problem with the overclaims? Well, it kind of messes up the science. We're not doing science if we're not actually matching our conclusions to our experiments. We live in a world of AI hype, which means that people are more likely to buy into and set up solutions that don't function as advertised because they live in a world where people are being told that Microsoft has built a system that understands English better than humans do. Of course, you could also build an AI system that does whatever other implausible thing like, "Guesses someone's political affiliation by the way they smile" or something, which makes no sense. But we live in a world where there's all these claims, overclaims about AI, and that makes these other ones also sound more plausible than they should. So, those are the problems that I see.

But, benchmarking is important. In the history of computational linguistics, there was a while where when you wrote a paper for the ACL, the Association for Computational Linguistics, you would say, "Here's my system. Here's how I built it. Here's some sample inputs and outputs," done. Then the statistical machine learning wave came through and brought with it the methodology of shared task evaluation challenges, which is sort of a historical version of benchmarking, where NIST and other organizations would say, "Okay, we want to work on speech recognition, and we want to actually get a sense of how these different systems compare to each other.
So, we're going to run a shared task evaluation challenge where everyone gets the same training data, and we're going to have some held-out test data that no one gets to see. At a certain point, all the competitors submit their systems and we see what happens." That's an improvement in the science compared to what was going on before. But that is not the whole story. If you want to understand how the system is working, if you want to understand how to build the next system, you can't just test it on some standard thing. You also have to look at, "Well, what kinds of errors does it make?", and "How do the different systems compare not just in their overall number, but in their failure modes and which inputs work for them and which ones don't?" and on and on like that, as opposed to, "Okay, I got the highest score. I'm done."
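
The pattern described here, one shared held-out test set plus a head-to-head score, is easy to write down, and so is the extra step being argued for: breaking errors out by category instead of stopping at the aggregate number. The sketch below is an editorial illustration; system_a, system_b, and the annotated test_set are hypothetical placeholders, not anything from the paper:

    # Benchmark-style evaluation: an aggregate score plus an error breakdown by
    # annotated category, so two systems can be compared on more than one number.
    from collections import Counter

    def evaluate(system, test_set):
        correct = 0
        error_types = Counter()
        for example in test_set:
            prediction = system(example["input"])
            if prediction == example["gold"]:
                correct += 1
            else:
                # Group errors by a category annotated on the example (e.g. "negation")
                error_types[example.get("category", "other")] += 1
        return correct / len(test_set), error_types

    # accuracy_a, errors_a = evaluate(system_a, test_set)
    # accuracy_b, errors_b = evaluate(system_b, test_set)

Comparing errors_a and errors_b is where the failure modes show up; the two accuracy numbers alone are the leaderboard view.
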
Lukas:
Right, right. Well said. I don't have much to add there.

Ways of complementing benchmarks

Lukas:
Can you say a little more about like...I feel like this is a great paper in that you make these really concrete, sensible recommendations. You sort of suggest a few alternatives to benchmarks. Could you maybe run through those for anyone listening?
Emily:
Yes, absolutely. So, it's more of complements than alternatives to benchmarks. So, in addition to benchmarks, this can be used sort of as a sanity check or, "Okay, did my system actually do better than a super naive baseline?", or "I want to compare some systems head-to-head, let's use this benchmark." You might also use test suites, which are put together to sort of map out particular kinds of cases that you want to handle well, as opposed to just grabbing whatever happened to occur in your sample test data. You might do auditing, which is very much akin to test suites in saying...so this is like Joy Buolamwini and Timnit Gebru and Deb Raji's work on auditing face recognition data sets, where they sort of systematically created the set looking at two genders and a range of skin colors and sort of say, "Okay, is this accuracy actually even across this set of people or no?" And they found out no. So, that's a-
Lukas:
How's that different than a benchmark? That kind of sounds like a benchmark, doesn't it?
Emily:
So, it's not the way benchmarks are typically created. You could imagine someone creating a benchmark that is sort of systematically mapping out a space, but that's not the practice. The practice is, "We are going to go grab some data from somewhere, and then hold out 10% of it to be the test and the other 90% is training, or 80% training, 10% dev," right? The way benchmarks are typically put together is, "Let's just grab a sample of data and see how well this thing works," as opposed to "Let's create a testing regime through test suites or through this auditing process that can allow us to find the contours of its failure modes." Not "How well does it work on average?" but "How well does it work for this case, and that case, and that case?" There's also adversarial testing, which is...a few different things fall under adversarial testing. Sometimes people will create test sets by going and collecting all the examples that previous systems did poorly on to make a particularly hard test set, which is interesting in the sense that it can filter out the sort of freebies that are too easy, but also doesn't necessarily guide anything towards better performance for a particular use case. Because it's just sort of like, "Well, we're selecting what was hard for the previous model," not "What's particularly important to get right or what's particularly likely to be frequent in our use case," and so on. So, that's one kind of adversarial testing. Another one is what we did in the Build It, Break It shared task. This was Allyson Ettinger, Sudha Rao, Hal Daumé, and I; in 2017, we put together a shared task where we had system builders and then breaker teams. The breaker teams' goal was to find minimal pairs, two examples that were minimally different from each other, but for which the systems would work for one but not the other. That would be a way of sort of mapping out what causes system failure. So, you can look at that. You can look at error analysis. Take the test set from the benchmark, or the dev set from the benchmark, and then go in and look and say, "Okay, what are the kinds of problems that are showing up?" A lot of systems that rely on language models tend to do really poorly with negation, which is one of these things that's very important to the meaning, but tends to be a short word or subword, and so it is easy to miss. You can imagine speech recognition or machine translation: if you miss one word out of 20, it matters a lot what that word is. If you replace "a" with "the", in many cases, that's not going to cause a lot of problems. But if you just skip a "not" somewhere?
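As an illustration of the minimal-pair idea and the negation problem described above, here is a small sketch; the toy sentiment system and the sentence pairs are hypothetical, and a real breaker team would target an actual trained model rather than a keyword lookup.

```python
# Minimal sketch of a test-suite / minimal-pair check: each pair differs only
# in the presence of "not", and a system only passes the pair if it gets both
# members right. The toy system below ignores negation, so every pair fails.

minimal_pairs = [
    (("the plot was interesting", "positive"),
     ("the plot was not interesting", "negative")),
    (("I would recommend it", "positive"),
     ("I would not recommend it", "negative")),
]

def bag_of_words_sentiment(text):
    """Hypothetical system that looks only at positive cue words."""
    positive_cues = {"interesting", "recommend", "great"}
    return "positive" if any(word in positive_cues for word in text.split()) else "negative"

for plain, negated in minimal_pairs:
    results = [bag_of_words_sentiment(sentence) == expected
               for sentence, expected in (plain, negated)]
    status = "pass" if all(results) else "FAIL"
    print(status, "|", plain[0], "/", negated[0])
```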
Lukas:
Yeah, that makes sense. Yeah.
Emily:
All of this is basically about looking at what it is we're trying to build, what it is we're testing on, how it fits into the motivating use cases, and then what works and what doesn't, and for what doesn't work, what are the implications? What happens in the real world if that failure happens? And also, what are the likely causes? What is tripping us up? All of that is what we would like to see, instead of the leaderboard-ism, which is everyone just trying to climb to the top of the pile, which doesn't feel like it's really... I mean, people talking about the speed of progress in AI love to talk about how quickly those leaderboards change and how quickly the state-of-the-art, SOTA, gets higher and higher on these various benchmarks. I always think, "Yeah, but so?" What does that actually mean in terms of understanding the world better from a scientific point of view, or building technology that works better not just in the average case, but also in the worst case?
Lukas:
Yeah, it's interesting. Well, a couple of things came up for me reading that paper. When I started my career, I think I was just sort of on the tail end of ACL papers where it seemed like they would just cherry-pick some examples where it worked or it didn't, and it just seemed ridiculous. I remember there were early benchmarks where people would have lower accuracy than just guessing the most common case or something, which you could argue was still better, and people did, but that just seemed a little ridiculous to me. I remember this anecdote from your class about...I think it was Noam Chomsky saying that, "Oh, moms don't teach kids language," but actually they do, and it's just like no one bothered to check. So, it's kind of maddening, and I think I appreciated benchmarks from that perspective. But then your recommendations are not only reasonable, I think in companies, a lot of it is standard best practice. I don't think you would just release a new model without trying it and getting a flavor for where it works and where it doesn't. You wouldn't just be like, "Oh, we took 10% of the data and held it out, let's ship it." It does seem like that's actually one case where you see it more in companies than in the academic literature, probably because it's easier to look at one number and be like, "Hey, we beat it." But clearly, that's flawed. So, anyway, I thought that was a great paper with really good suggestions that I think everyone should definitely follow.

The #BenderRule

Lukas:
I also want to make sure we get to the last paper we wanted to talk about, which is cool, because I just want to make sure people know. What is the Bender Rule?⁴ And why is it important?
Emily:
So, Bender Rule or the #BenderRule-
Lukas:
Is it #BenderRule?
Emily:
Yeah, well, it's both.
Lukas:
Say what it is first, and then I have some questions about best practice.
Emily:
Yeah. It is itself a best practice, which says that you should always state the name of the language you're working on, even if it's just English. This is a soapbox that I've been carrying around and periodically climbing up on since about 2009, when I saw a lot of that pre-neural statistical NLP work saying, basically, "Look Ma, no linguistics," and claiming that systems were language-independent because there was no linguistic knowledge hard-coded. And these supposedly language-independent systems were mostly tested on English. You also see a lot of work where people will publish a paper on machine reading or a paper on sentiment analysis, and in fact, no, it's a paper on machine reading of English and sentiment analysis of English text. The flip side is if someone's working on Cherokee, or Thai, or Chinese, or Italian, then that work...it's harder to get it accepted to the research conferences because it is deemed language-specific, where work on English is somehow general. That's a big problem for the science, and it's a big problem for getting to technology that actually works across languages. So, I've been sort of going around pestering people to actually test cross-linguistically and to name the language they're working on. In 2019, like three or four people (and this is in that piece on The Gradient; I have their names listed) sort of referred to this practice as the Bender Rule. I didn't name it that, but once it was named, I ran with it. Part of it is that it's kind of a face-threatening question to ask. If someone's written something about machine reading and I walk up and I say, "What language?", it's a stupid question to ask because it's obviously English. So, it's face-threatening to me, as the asker. And it's also a little bit rude to them, to ask this question that implies they should have said. I don't mind people blaming that on me. Part of the reason I ran with the hashtag is, if someone wants to go ask this question and they feel like it's sort of a silly question to ask, they can pin it on me, and I'm happy to lend my name to that.
Lukas:
I see. Nice.

Language diversity and linguistics

Lukas:
I guess this is a hard question, but this is what kind of comes to mind for me: it's like, "Wow, English is so specific and probably has all these kinds of idiosyncrasies." How do you think NLP might be different if it had started in, like, Thai or Cherokee or something, or English just happened to be...I mean, English must be unusual in all these ways, right? Are there characteristics of English that are unusual, and the world could have gone a different way?
Emily:
Yeah, absolutely. Actually, in that paper, I list out a bunch of them. One thing is English is a spoken language, not a signed language. If we had started NLP with American Sign Language or another signed language, it would have been very different, right?
Lukas:
Clearly.
Emily:
Yeah. So, that's one big choice point. Another thing is that English has a very well-established and standardized writing system. Many of the world's languages don't have a writing system at all, and many of them that do don't have the degree of standardization that English does. Also, many languages will have a lot more code switching going on, on average, than English does.
Lukas:
What is code switching?
Emily:
Code switching is when you use multiple languages in the same conversation, sometimes even the same sentence. That happens a lot in communities where there's a lot of bilingualism or multilingualism. So, if you and I...well, you also speak Nihongo [Japanese], right? When you studied kanji, what was your favorite way to benkyou [study] them? I am not a fluent code switcher, so that was really awkward and stupid-sounding, but to illustrate the point.
Lukas:
I remember actually when...yeah, I know, I have experienced that for sure.
Emily:
Certainly, English is involved in a lot of code switching. But there's also lots and lots of monolingual English data, and when you go into social media data for Indian languages, for example, enormous amounts of that are code switched with English. And so, there's a whole range of interesting technical challenges that come up there. We also live in a world where the first digital setups most conveniently accommodated lower ASCII, and English all fits in lower ASCII. English has relatively fixed word order. We have a relatively low...relatively simple morphology. Any given word that shows up is only going to show up in a few different forms. Compare that to Turkish, where you can get, I think, millions of inflected forms of the same root, and so that changes the way you handle data sparsity and what data sparsity looks like. Our orthography is a mess. Someone was just asking on Twitter, "How come we do grapheme-to-phoneme prediction but not phoneme-to-grapheme prediction?" So, grapheme to phoneme is, "Given a letter, what's the likely sound?", and that's an important component of text-to-speech systems when you hit an out-of-vocabulary word. Phoneme to grapheme would be, "Given a sound, what's the likely letter?", and that's not a typical task. I wonder to what extent that's true because of English's opaque and chaotic writing system.
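For a sense of just how chaotic that writing system is, here is a tiny illustration of my own (not from the conversation): one letter sequence, "ough", mapping to several different sounds, with pronunciations shown as rough IPA.

```python
# Small illustration of English's opaque orthography: the same letter sequence
# "ough" corresponds to several different sounds, so grapheme-to-phoneme for
# English cannot be a simple table lookup, and phoneme-to-grapheme is even
# more ambiguous. Pronunciations are rough IPA.

ough_words = {
    "though":  "/ðoʊ/",
    "through": "/θruː/",
    "tough":   "/tʌf/",
    "cough":   "/kɒf/",
    "bough":   "/baʊ/",
}

for word, pronunciation in ough_words.items():
    print(f"{word:8s} -> {pronunciation}")
```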
Lukas:
Right. Sounds like an impossible task.
Emily:
Yeah, exactly. But if you were to look at...Japanese, setting aside the kanji, if you just try to transcribe Japanese in kana, that's way more straightforward. Spanish also has a very transparent and consistent grapheme-to-phoneme mapping in both directions. So, it comes down to things like that, the properties of the writing system for English. English likes to use white space between words and sentence-final punctuation. These are things that we sort of take as given, that it's easy to tokenize into sentences and words, and that just aren't going to be true in other languages. So, I don't know. I couldn't tell you what NLP would look like. I can just sort of tell you where the points of divergence might be.
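As a tiny illustration of that tokenization point, here is a sketch with made-up sentences; the Japanese one is written, as usual, with no spaces between words, so splitting on whitespace gets you nowhere.

```python
# Whitespace tokenization roughly works for English but not for a language
# like Japanese, which does not put spaces between words.

english = "The cat sat on the mat."
japanese = "猫がマットの上に座った。"  # roughly, "The cat sat on the mat."

print(english.split())   # ['The', 'cat', 'sat', 'on', 'the', 'mat.']
print(japanese.split())  # ['猫がマットの上に座った。']  (the whole sentence as one token)
```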
Lukas:
No, those are fun. I mean, definitely. I mean, I don't know. Those differences are so interesting.
Emily:
Well, you voluntarily took a linguistics class, so I'm not surprised.
Lukas:
Well, I just feel like linguistics is so cool. I mean, as an outsider, if you don't know it, it's really eye-opening, because you swim in it, to sort of see all these patterns that I never would have noticed. And I feel like especially...like phonetics is probably the deepest, where you're just like, "Oh, my god, those two sounds are different?" I would just never, never have noticed that. It's so easy to do the thought experiment and realize you're wrong. I love that stuff. I feel like most of my early work was in parsing Japanese in different ways. I do remember...I guess it didn't seem like that was an impediment to publishing, but it was surprising that there was so little work on it for how necessary a task it would be to deal with. In my first job, it was mostly processing Japanese language stuff, and it was striking how little research there was done on the topic. I felt like there was sort of more institutional knowledge inside of companies than literature on it.
Emily:
What happened in the research community is, well, that kind of parsing problem was "solved" because people had made a certain amount of progress on it for English, and that was mistaken for the problem in general being solved. So, what's new here? Well, this is for Japanese. That's new. But it's actually hard to get people to see that. My goal with what got called the Bender Rule is to say, "Okay, let's keep English in its place," and say, "When I've done this for English, I need to say that it's for English," to hold room for the other work on other languages, which is also really important and novel and valuable. We'll see. Periodically, different folks in the field go through and count how many papers in an ACL conference actually work on different languages and actually say what language they work on, and it's not changing as fast as I'd like. But there are some really good developments. The Universal Dependencies project has produced treebanks for many, many languages, and that has spurred a whole bunch of very cross-linguistic work, which is exciting.
Lukas:
What do you think about... I mean, some of the most evocative work feels like building language models across all the languages or translation models that can kind of use pairs of languages in interesting ways, where you have more data to help with ones with less data. Do you think that's a fruitful direction? Or does that...do you think that sort of encodes our biases somehow in the way it works?
Emily:
I mean, it's certainly interesting, and to the extent that we're relying on these massive data-hungry things, for languages that just don't have that much data, seeing what we can do based on transfer from the bigger languages is an interesting and valuable way to go. I think the interesting questions to ask would be, "To what extent does this impose the conceptualization of the world encoded in English onto the results in other languages?", and "What follows from that? What are the risks?" How does that compare to, "Well, if we just do monolingual, we can only get this far, so we'll take those risks and figure out how to mitigate them"? That kind of work I think is important. It's also really, really important to know that you are working with genuine data in the low-resource languages. There was this thing where it came out that, I think it was Scots, the entire Scots Wikipedia was written by one person who doesn't speak Scots. Wikipedia is this really important data source in NLP, so any NLP system that claims to be doing something for Scots just isn't. A fantastic model in that regard is a research collective called Masakhane, which is a continent-spanning research initiative in Africa towards doing participatory research to create language resources for African languages. They've done really interesting work on how to build up the community so that people can come contribute as translators, not machine translation specialists, but people actually translating language. There's a really cool paper that came out in, I think, Findings of EMNLP last year describing the Masakhane project. That kind of work, where if you're going to work with low-resource languages, you make sure to connect with the community, the people who would be using the technology. Then you can find out, "Okay, well, what are the concerns? To what extent do you want to bring in what we can do from using the larger-resource languages?" versus "Would you rather stay monolingual and see where we can go?", and hear from the community and involve the community in the research. I think Masakhane is a great model of that.
Lukas:
Cool. Well, that seems like a good place to end. We're way over time and you've been really generous. Thank you so much. I really enjoyed talking to you.
Emily:
Yeah. Likewise, thank you. I can go on and on. So, I appreciate the chance to do so.

Outro

Lukas:
If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description where you can find links to all the papers that are mentioned, supplemental material, and a transcription that we work really hard to produce. Check it out.