Enterprise-scale machine translation with Spence Green, CEO of Lilt

Spence shares his experience creating a product around human-in-the-loop machine translation, and explains how machine translation has evolved over the years.
Cayla Sharp

Listen on these platforms

Apple Podcasts Spotify Google Podcasts YouTube SoundCloud

About

Spence Green is co-founder and CEO of Lilt, an AI-powered language translation platform. Lilt combines human translators and machine translation in order to produce high-quality translations more efficiently.


Show Notes

Timestamps

0:00 Sneak peek, intro
0:45 The story behind Lilt
3:08 Statistical MT vs neural MT
6:30 Domain adaptation and personalized models
8:00 The emergence of neural MT
10:15 How Lilt was developed
13:09 What success looks like for Lilt
18:20 Models that self-correct for gender bias
19:39 How Lilt runs its models in production
26:33 How far can MT go?
29:55 Why Lilt cares about human-computer interaction
35:04 Bilingual grammatical error correction
37:18 Human parity in MT
39:41 The unexpected challenges of prototype to production

Links Discussed

  1. Models and Inference for Prefix-Constrained Machine Translation (Wuebker et al., 2016)
    • Phrase-based and neural translation approaches to completing partial translations
  2. Sequence to Sequence Learning with Neural Networks (Sutskever et al., 2014)
    • The first neural MT paper
  3. Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau et al., 2014)
    • One of the earliest neural MT papers
  4. GNOME Project
  5. Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation (Johnson et al., 2016)
    • Training multi-source, multi-target NMT models
  6. Achieving Human Parity on Automatic Chinese to English News Translation (Awadalla et al., 2018)
    • "Human parity has been achieved"
  7. Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation (Freitag et al., 2021)
    • "Human parity has not been achieved"
  8. Models and Inference for Prefix-Constrained Machine Translation (Wuebker et al., 2016)
    • The problems that Lilt faced in going from prototype to production

Transcript

Note: Transcriptions are provided by a third-party service, and may contain some inaccuracies. Please submit any corrections to angelica@wandb.com. Thank you!

Sneak peek, intro

Spence:
Translation is in this space of so-called AI-complete problems. Solving it would be equivalent to the advent of strong AI, if you will, because for any particular translation problem, world knowledge is required to solve the problem.
Lukas:
You're listening to Gradient Dissent, a show about machine learning in the real world, and I'm your host, Lukas Biewald. Spence Green is a machine translation researcher and also the CEO of a startup called Lilt, which is a leading language translation services company. He has been using TensorFlow since the very beginning and has been putting deep learning models into production for longer than almost any of us. I'm super excited to talk to him today.

The story behind Lilt

Lukas:
I think the best place to start here is, you're the CEO of Lilt and you built Lilt, maybe you can just give us a description of what Lilt is and what it does.
Spence:
Well, I think it's important to say where the company came from and the problem that it solves, and then I can explain what it does. I think what it does follows from that.
Lukas:
Perfect, that's great.
Spence:
Where it started, at least for me personally...in my mid 20s, I decided I wanted to learn a language. And so I moved to the Middle East for about two and a half years. And while I was there, two important things happened. The first was...so I was learning Arabic and I had a friend, the building watchman in my building, and I was talking to him one night and I asked, "What did you do in Egypt?", where he was from. And he said, "I was an accountant." I said, "Oh really, why aren't you an accountant here?" And he said, "Because I don't speak English." I was like, "Okay, well, we're in an Arabic-speaking country and you can't get a job as an accountant," and it's because people make a certain amount of money if they speak English. If they don't, they make less, and I had never really encountered that before. Six months or so after that conversation, Google Translate came out, and I got really excited about that. I left my job and went to grad school, started working on MT. And then a couple of years later, I was at Google working on Translate, where I met John, my now co-founder, and Franz Och, who started the group at Google and did all the really early pioneering work in statistical MT. We were originally talking about books a lot and why books don't get translated, and we found that Google's localization team, which did all of their language-related work for the products, didn't use Google Translate. This was amazing to me. Why would this be? And the reason is because, in any business setting or non-consumer setting, you need a quality guarantee. An MT system, like any machine learning system, can give you a prediction, but it can't really give you a grounded certificate of correctness about whether it's right, and that's what businesses want, or book publishers, or whatever. So we started building these human-in-the-loop systems, where you need the human for the certificate of correctness, but the crux of the problem is to make that intervention as efficient as you can.

Statistical MT vs neural MT

Lukas:
I mean, I guess my biggest question that I was thinking about, that I've always wanted to ask you is, how different is the problem of translating something properly versus setting up a human-in-the-loop system with a human translator to translate well? Is it almost the same problem or is it quite different?
Spence:
By translating it properly, what do you mean?
Lukas:
I guess, I mean, so Google Translate is just trying to give you the best possible translation.
Spence:
Got you.
Lukas:
I assume that what you're doing is like helping a translator be successful translating something, presumably by guessing likely translations.
Spence:
Yes. Right. It's a good question. So the question is the mode of interaction with the machine. The way that machine translation systems have been used hasn't really changed since the early '50s, when this line of research started. It's funny, machine translation is this really old machine learning task. The digital computers of the Second World War had been developed for bomb making and for cryptography, and the initial idea was, "Russian is just English encrypted in Cyrillic, and so we can just decrypt Russian." The initial systems that were built in the '50s weren't very good. The naive idea was, "Let's just take the machine output and pay somebody to fix it." Going beyond that linear editing workflow in some way, to a richer mode of interaction, is what our work in grad school was about. What we came up with was effectively a predictive typing interface. There were two problems that we really wanted to solve. One was, when you're doing translation, the system makes the same mistake over and over again; documents tend to be pretty repetitive. It's an annoying user experience and it's inefficient when the system just makes the wrong prediction over and over again. The solution to that is to have a system that does online learning, which was part of the work. The other was, "How can you interact with a text string beyond just using your cursor and fixing parts of it?" And that is doing predictive typing. So if you put those two together, online learning plus predictive typing, it's a fundamentally different system architecture than the type of system you build for something like Google Translate.
Lukas:
Although it seems fairly close. I mean, for the predictive typing, I would think you have a language model and a translation model. That's how MT systems used to work, at least in my memory, right? Is it sort of the same?
Spence:
That's the way that the statistical systems used to work, and really it came down to doing inference really rapidly. Well, yes, it came down to doing inference really rapidly and doing inference with a prefix. Instead of just decoding a sentence with a null prefix, you send part of what the translator did. With the old systems...we actually had a paper on this a couple years ago¹...how to do inference with a prefix was an algorithmic problem that you had to solve. The new neural systems just do greedy beam search, so it's actually pretty straightforward to do that these days.
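For concreteness, here is a minimal sketch of prefix-constrained decoding in Python (greedy rather than full beam search, for brevity). The `model.next_token_scores` interface is an assumption for illustration, not Lilt's actual API:

```python
# Hypothetical sketch: force the decoder through the translator's prefix,
# then let the model complete the sentence. `model.next_token_scores` is
# an assumed interface that returns one score per vocabulary id.

def decode_with_prefix(model, src_ids, prefix_ids, eos_id, max_len=128):
    output = list(prefix_ids)  # tokens the translator has already typed
    while len(output) < max_len:
        scores = model.next_token_scores(src_ids, output)
        next_id = max(range(len(scores)), key=scores.__getitem__)  # greedy pick
        output.append(next_id)
        if next_id == eos_id:
            break
    return output
```

Because any autoregressive decoder conditions on the tokens generated so far, seeding it with the translator's prefix is enough; the statistical systems, by contrast, needed special-purpose inference algorithms to do this.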
Lukas:
And is that what you're using?
Spence:
Yeah. I mean, like everything in NLP these days, it's a Transformer architecture, and a pretty vanilla one too.

Domain adaptation and personalized models

Spence:
What our team really focuses on is domain adaptation, rapid and efficient domain adaptation. So we do personalized models, either at the user level or at the workload flow level for all of our customers.
Lukas:
All right. And workflow means like a set of documents, so you're learning a specialized model as the translation happens?
Spence:
I think the way to think about it is more from your early days, which is, anywhere that you have an annotation standard, you would have a personalized model. So if you think about it in a business, like a marketing workflow has a writing standard that may be different than a legal workflow. And so you would have different models for each one of those workflows.
Lukas:
I see. So you're actually training, then, thousands and more models.
Spence:
Yes, that's correct. That's right. So there are bunches of different models being trained continuously in production all the time, right now. The way that you can think about what the translator does, and I think what's really interesting about this task, is that in most machine learning settings, data annotation for supervised learning is an operating cost. You have to pay people to go off and do it. It's an artificial task. In translation, you can think of the translators as just doing data labeling. They're reading English sentences and typing French sentences, and as soon as they finish, you just train on it.
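A minimal sketch of that online-learning loop, assuming a hypothetical `train_step` incremental fine-tuning interface (illustrative, not Lilt's actual API):

```python
# Every confirmed segment is a free labeled example: train on it immediately.
# `model.train_step(src, tgt)` is an assumed incremental fine-tuning call on
# the personalized model for this workflow.

def on_segment_confirmed(model, source_sentence, confirmed_translation):
    model.train_step(source_sentence, confirmed_translation)

# e.g., as soon as the translator confirms a segment:
# on_segment_confirmed(model, "Sign in to continue.", "Connectez-vous pour continuer.")
```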
Lukas:
Right. Right. And do the models get noticeably better over time?
Spence:
Yes.
Lukas:
That's super cool.

The emergence of neural MT

Lukas:
I'm curious about the technical details of just making this work, but before getting into that, I'm curious, you started in 2014, is that right?
Spence:
Early 2015 we started the company-
Lukas:
2015.
Spence:
Yeah.
Lukas:
So you've seen such an arc in terms of... I mean, I feel like machine translation has had such big changes, at least from my perspective. Has that been hard to adapt to? Has that helped you? Have you had to learn new skills to take advantage of it?
Spence:
Yes. We started the company in late 2014, and the system that we had, which we built at Stanford over the course of about 10 years, was competitive with Google Translate. In December of 2014, the first neural MT paper was published². I mean, people worked on neural MT in the '90s, but it didn't work, and so they got it to work again. There were two papers published, one in December 2014, the other one in January 2015³. And it was pretty promising, but nowhere near production ready. And then I think the thing that was really quite shocking was how quickly Google got that into a production-scale system, which happened in the late summer of 2016. At that point, our system was as competitive as anyone's. And then suddenly, there was this huge leap in translation quality. We were graduating, all three of us, John and I and a third guy, right at this crossover point. So we didn't really have any empirical experience with these neural machine translation systems. So we had to build a neural MT system from scratch over the course of about six months. We went from the Stanford system, which was about 120,000 lines of code developed over a decade, to a system that I think was about 5,700 or 6,000 lines of code and-
Lukas:
That's amazing.
Spence:
I mean, it's really quite shocking. A bunch of that is pushing a lot of the functionality down into the framework, whereas everything in the Stanford system was custom-built.

How Lilt was developed

Lukas:
I guess, 2016, what framework are you using? Is it Caffe or is it even before that?
Spence:
No, we wrote it in TensorFlow from the beginning. So-
Lukas:
Wow, wow. Cool.
Spence:
It was, I guess, an okay technology bet. I think there's some push to move to PyTorch, but we've got a pretty significant investment in TensorFlow at this point.
Lukas:
Yeah, I would think so. Were you sure that it was going to work? I mean, this seems like a really painful experience for a startup to do mid-flight.
Spence:
It was terrible. Yeah. I mean, you kinda just had to do it. The results were so compelling. Of all the tasks within NLP that deep learning has revolutionized, I think MT is probably the most significant example. The recent language modeling work, of course, is really impressive, but MT just went from being kind of funny to being meaningfully good.
Lukas:
How did you find enough parallel corpora to make this work?
Spence:
Well, there's quite a bit of public domain data. So for example, the UN has to publish all of its proceedings in its member languages. There are news organizations, like the AP, that publish in different languages. There are open source projects, the GNOME Project⁴, for example, that publish all their strings in a bunch of different languages. So you can train on all that, and then you've got web crawl too, which is where most of the training data comes from.
Lukas:
I see, I see. It's funny, I remember working on MT briefly at Stanford and feeling like it was really unfair that Google had so much more access to data...
Spence:
It does help to have a search engine.
Lukas:
I mean, I guess if you're mostly doing web crawl, then that makes sense... I remember, just all kinds of weird artifacts from... I think we were training on the EU data that was in all those languages, and it was just such a bias towards political meanings of nouns. It just seemed ludicrous sometimes.
Spence:
I think in an enterprise setting, that's the real value of domain adaptation. The second thing that I think is interesting is that the legacy approach to translation within the enterprise is to just build a database of all your past translations. If you've translated something before, you just look it up in the corpus and retrieve it; otherwise, you send it off to a vendor. Big companies that have been doing translation for decades have this big corpus that they've built up. We train on that too, and that customer-specific training is where you get the real improvement versus just a big general domain system.
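For contrast with the learned approach, here is roughly what that legacy translation-memory lookup looks like, as a minimal sketch (class and method names are illustrative):

```python
# Sketch of a classic "translation memory": exact-match lookup in a corpus
# of past translations, falling back to MT or a human vendor on a miss.

class TranslationMemory:
    def __init__(self):
        self.memory = {}  # source segment -> previously approved translation

    def add(self, source, translation):
        self.memory[source] = translation

    def translate(self, source, fallback):
        # Retrieve if this exact segment was translated before;
        # otherwise defer to the fallback (an MT system or a vendor).
        return self.memory.get(source) or fallback(source)
```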

What success looks like for Lilt

Lukas:
I guess at the end of the day, how much... I mean, do you measure your results in how fast you can get a translation done? Is that your core metric? And I guess if so, how does that change with the quality of the translation? Do you get diminishing returns, or as it gets close to perfect, can someone just like cruise through a translation?
Spence:
Well, I think that there are... maybe I should say a few sentences about how a customer would work with us.
Lukas:
Sure, sure.
Spence:
An example of one of our customers is Intel. If you go to intel.com, in the top right corner, there's a dropdown and you can change the site into 16 different languages, and that's all our work. If you start looking at it that way, you'll see translation all around you. You'll see it on websites, you'll see it in mobile apps, you'll see it when you get on the airplane and get 10 language options for the in-flight entertainment system. That's where this can be used. Right now, it's a problem that you can solve with people. You can hire people to solve it. The problem is that the amount of information being produced far exceeds the number of people being produced in the world right now. And so you can't solve it just by throwing bodies at it. That's why you need some automation. So with an example like that Intel website, from their side, what they see is us delivering words. The only real metrics that matter are how quickly that gets done and the quality level it gets done at. They don't really care whether it's machines, or lemmings, or whatever doing the translation work. On our side, the whole name of the game is using automation to reduce the production cost per word. When you produce a word to give to an enterprise, there's a translation cost, a QA cost, a workflow routing cost, a software hosting cost, a bunch of different cost buckets, and it's just minimizing that.
Lukas:
Am I wrong, that the majority of the cost would be the human that's doing the translation?
Spence:
That's exactly right. So the metrics that we care about internally have to do with making that part more efficient. That translates into business value, and it reduces the cost of what we provide to customers and makes it faster, but those metrics are not the same metrics that our customers care about.
Lukas:
Is there a case like what you worry about with a self-driving car, where it gets so good that the driver stops watching and the car crashes? Does your translation ever get so good that you worry that an annotator might just start accepting every prediction and quality might suffer?
Spence:
Yes, this is a good question. I think it's more of a risk, and this bears out empirically, in the linear post-editing workflow that I mentioned, where I just give you some machine output from some random machine and ask you to correct it. It's a passive task, and cognitively it's not very engaging. People tend to just gloss through it and make mistakes. Whereas predictive typing is an active, engaged task. And if they're basically cheating there, then it comes down to performance management on our part: "Whoa, this person did 2,000 words in 10 seconds. That doesn't seem right." So you can monitor that.
Lukas:
How do your customers think about the quality? Is it like an intuitive feel for it or are they like spot checking it? Or how does that work?
Spence:
I think it's, again, in the same realm as an annotation standard in your world. We work with the customer to define what we call a text specification: "What are the text requirements within each language?" That usually follows from marketing guidelines. They have their brand, style, and copy editing guidelines. And then, how does that manifest in Chinese and Japanese and German and French? We have a QA process where raters go in and rate the sentences according to that framework. And then that's what we deliver back to them.
Lukas:
So you don't just deliver the result, you deliver an estimate of the quality based on raters.
Spence:
Yes, yes.
Lukas:
I see. That's cool. They must appreciate that. Or is that industry standard to do that?
Spence:
No. There are some vendors that will implement like a scorecard and they'll give you the scorecard back with the deliverable, but we just try to keep it... we just count the number of sentences where there's some annotation error, and then we fix those, but it gives you some sense for what the overall error rate is.
Lukas:
Got it.

Models that self-correct for gender bias

Lukas:
I think people have pointed out that in translation, there can be ethical issues. I think people noticed that Google was...in languages where the pronouns aren't gender specific, making it "he" for traditionally male occupations. Is that something that you think about or incorporate into your models at all?
Spence:
Well, I mean...part of my work in grad school was on Arabic. When you work with Arabic corpora, the pronouns are almost all male, because the data is coming from newswire, and most of the people who are active politically in the Arab world are male. So that's the representation in the data, and systems will tend to predict masculine pronouns for lots of different things. But in the human-in-the-loop model, you have people who are there correcting that, and they can use the suggestion or not. Through that annotation, you'll get a different statistical trend that the system will start to learn.
Lukas:
I see.
Spence:
So it's self-correcting.
Lukas:
Cool.

How Lilt runs its models in production

Lukas:
I guess I really am interested to know about the technical details of your system, as much as you can share. I mean, you were a super early user of TensorFlow, and you have all these models running in production. Can you, at a high level, tell me how the system works and how it's evolved? Do you use TensorFlow Serving to serve these up? How do you even run all these models in production at once?
Spence:
Yeah. There are several interesting cloud problems to solve, but I think the big ones are these. If you're implementing predictive typing, you have a budget of about 200 milliseconds before the suggestions feel sluggish. That means the speed of light starts to become a problem. You have to have a multi-region setup, because our community of people who are working are all over the world. You usually hire translators within their linguistic community, people who are fluent in that native language, so we have people all over the world. So the first thing is it has to be a multi-region system. The second is, it's doing online learning, so you have to coordinate model updates across regions. And the third thing that I think is interesting is making inference fast. Commonly, in a big, large-scale system like Google Translate, you'll batch a bunch of requests, put them on custom hardware, run them, and return them. But if you're switching in personalized models to the decoder basically on every request, then you have to run on the CPU, and you have to have a multi-level cache to pull these models out of cold storage and load them onto the machine. So a lot of the engineering has been to make it fast worldwide, and to make the learning synchronized worldwide.
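A minimal sketch of such a multi-level model cache: an in-memory LRU layer in front of cold storage, so per-request model swaps usually stay within the latency budget. The `cold_storage.load()` interface is an assumption for illustration, not Lilt's infrastructure:

```python
from collections import OrderedDict

# Sketch of a multi-level cache for personalized models: hot models live
# in RAM, everything else is fetched from cold storage on demand.

class ModelCache:
    def __init__(self, capacity, cold_storage):
        self.capacity = capacity          # how many models fit in memory
        self.cold_storage = cold_storage  # assumed .load(model_id) -> model
        self.cache = OrderedDict()

    def get(self, model_id):
        if model_id in self.cache:
            self.cache.move_to_end(model_id)      # mark as recently used
            return self.cache[model_id]
        model = self.cold_storage.load(model_id)  # slow path: cold start
        self.cache[model_id] = model
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)        # evict least recently used
        return model
```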
Lukas:
You mentioned that there's like some notion of switching to PyTorch. What would push that at all?
Spence:
This is where my expertise, my empirical limitations, run into a wall. The two things that I've heard from our team are that you can prototype faster in PyTorch than in TensorFlow, and that there have been some backwards compatibility issues going from TensorFlow 1 to TensorFlow 2. There tend to be more breaking changes. We've got our system running in some TensorFlow 2 compatibility mode with some frozen graphs from before. That's been a little bit of a problem.
Lukas:
I think one just notable thing from our perspective has been this rapid ascendance of Hugging Face. Has that been relevant to you at all? Do you use it anywhere?
Spence:
We don't. It's funny. When the Transformer paper came out (Ashish Vaswani was a contemporary of mine in grad school, and Jakob Uszkoreit has been a great friend of our company), we called Jakob the next day, and we were like, "Let's talk about this." So we talked it through, and we started working on it. It was a really tricky model to get working correctly, and it took some time. I think that paper came out on a Tuesday, if memory serves, and I think Joern started working on the implementation on Wednesday morning.
Lukas:
Wow.
Spence:
Something like that. It was December or January before we had a working model. And I think their Tensor2Tensor release helped a lot; there's some black magic in there that helped. So this was mid 2017. But it's tricky to get working right in production. So I think having a library that people can use more broadly, people who may not have the same internal resources to get these systems working, is really, really, really valuable.
Lukas:
Totally, totally. Do your latency and throughput requirements mean that your models are different at all from what a Google Translate might use?
Spence:
Yes. If you're running on custom hardware, you can of course afford to run higher dimensional and more expressive models. We have to do quite a bit of work with knowledge distillation to compress the models so that inference is fast on the CPU. It's also been really helpful that Intel is one of our investors; their technical teams have helped us with some optimizations to make it run faster on a CPU, and that's been really valuable.
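One common recipe here is sequence-level knowledge distillation: a large teacher re-translates the training corpus and a small, CPU-friendly student is trained on the teacher's outputs. The sketch below assumes simple `translate` and `train_step` interfaces; Lilt's exact method may differ:

```python
# Sequence-level knowledge distillation, sketched: the student imitates the
# teacher's translations rather than (or in addition to) the original refs.

def distill(teacher, student, source_corpus):
    for src in source_corpus:
        pseudo_target = teacher.translate(src)   # teacher supplies the target
        student.train_step(src, pseudo_target)   # assumed fine-tuning call
    return student
```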
Lukas:
That's cool. Do you use different models at all for different language pairs?
Spence:
The short answer is yes. There's a general domain model for every language pair that the domain adaptation starts from, and it basically just forks off of that. And then the model fork starts learning. We change the general domain models much less frequently. Actually, just yesterday we released new models for English to Japanese and Japanese to English, and one of the researchers has been working on much deeper encoders. I think the one that came out yesterday has a 12-layer encoder, whereas historically we've been running a 4-layer encoder or something like that. So over the next little bit, we'll be moving more of our general domain models to the current state-of-the-art model architectures.
Lukas:
And your general domain models though, those are different for each language pair right, or is there sort of one?
Spence:
Yes. That's an important point. I think one of the most exciting papers in the last couple of years was on training multi-source, multi-target models⁵. Google had a paper last year or the year before where they just piled all the corpora together and trained this huge neural network. This is really hard to think about coming from the statistical MT days, because it would be just crazy to do in a statistical MT system. We use some groups of languages. We'll group similar languages, especially if they're low-resource languages and we don't have much data, and then you'll have one system for five different languages or so.
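The core trick from that line of work⁵ is small enough to sketch: prepend a token naming the target language to each source sentence, then pool all the corpora and train one model. The example data here is purely illustrative:

```python
# Johnson et al. (2016)-style multilingual tagging: the target language is
# signaled with a special token prepended to the source sentence.

def tag_example(src_sentence, tgt_lang):
    """E.g. tag_example("Hello", "fr") -> "<2fr> Hello"."""
    return f"<2{tgt_lang}> {src_sentence}"

# Pool corpora for a group of (possibly low-resource) related languages:
training_data = [
    (tag_example("Hello", "fr"), "Bonjour"),
    (tag_example("Hello", "es"), "Hola"),
]
```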
Lukas:
There's something about that that's so appealing. I mean, I'm way out of date, so I never saw that working when I was in grad school, but I love the idea of it.
Spence:
Yeah. It's a really attractive idea.
Lukas:
It sounds like it's actually working.
Spence:
It does work, yeah.

How far can MT go?

Lukas:
So I guess, I don't know how much you feel comfortable expounding on this topic, but I'm really, really curious. Do you have a feeling for how far MT can go? Do you think that human-level MT is realistic? It's funny when you talk about companies wanting quality guarantees. Having used a lot of Google Translate in my life, I would think quality guarantees seem like they would be useful, but it also just seems like the quality of Google Translate isn't good enough that I would want to put it on my website, generally. Do you expect that to change?
Spence:
Yeah, I guess I can offer some assorted comments on thinking about that.
Lukas:
Please. Thank you.
Spence:
In no particular order, because I think there are both technical and social issues to do with that. And I think there's some philosophical issues. So let's start with the philosophical issue. Translation is in this space of so-called AI-complete problems, so solving it would be equivalent to the advent of strong AI, if you will, because for any particular translation problem, world knowledge is required to solve the problem. There are inputs that are not in the string that are required to produce a translation in another language.
Lukas:
Sorry to cut you off, but based on what I've seen lately from Google Translate, it feels like less AI-complete than I would have thought.
Spence:
Yes. So that's the next comment I'll make, which is that the philosophical statement doesn't mean that within business settings you should not be using it. And I'll give you an example. One space we've been looking at recently is crypto. Four months ago, nobody knew what a non-fungible token was. How do you translate that into Swahili and Korean? Well, an MT system is not going to give you the answer to that question, because language is productive. People are making new words all the time. Machines are not making up new words all the time; people are. Philosophically, you've got to have training data for the system to be able to produce a result. People do not need training data to do that. But then I think, increasingly, there are a lot of business settings where it's good enough to solve the problem. For years, you've been able to go to Airbnb, look at a flat, and click "translate with Google", and it'll give you a translation. It may not be perfect, but it's certainly enough to convince you that you want to rent this flat. I think there will be more and more cases where fully automatic machine translation solves the business problem at hand. I think that's absolutely true. And then I think there's a third part of it, which is social and organizational, which is, "How soon, VP of Marketing, are you willing to let raw machine translation go on your landing page with no oversight?" One way to think about that is, how soon are you, Lukas, ready for a machine to respond to all of your email?
Lukas:
All of my own email?
Spence:
Yeah.
Lukas:
Well, I have to say-
Spence:
Some of it, probably, sure, but other parts of it would be a little bit dangerous.

Why Lilt cares about human-computer interaction

Lukas:
I mean, this might be an off-the-wall question, but I have noticed ... I think I have a slightly more polite writing style because of Google's predictive text algorithm. I wonder if you're slightly shaping the translations with your predictions, even if the translator is getting involved in making it match.
Spence:
Oh, yes, this is called priming. It's a well-studied phenomenon in psychological research. One of the things that we showed in grad school is that when you show somebody a suggestion, they tend to generate a final answer that's closer to the suggestion than if they start from scratch.
Lukas:
I mean, I guess maybe it's better that I write slightly more politely. I don't know, maybe there's some good you can do with it.
Spence:
Well, it's pulling your writing down to the mean behavior, a mean level of performance. So I'm not sure if that's great.
Lukas:
Pulling down or pulling up, I don't know.
Spence:
Yeah, or maybe it's pulling you up to a mean level of performance, right?
Lukas:
Do you think that the translators learn to use your system as well? Do you see productivity going up for an individual that's doing this?
Spence:
Yeah. We have an HCI team, and this is one of the main things that they're working on right now. I remember right when we started the company, one of my co-advisors, Jeff Heer, who started Trifacta...this was really early on, and I was showing him some of the stuff we were building, saying we want to optimize this and we want to do that, and he said, "Let me stop you right there. In the early days of a company, you're just trying to make things less horrible than they are. You're going to be in that phase for a long time before you get to the optimization phase." So for a lot of the last number of years, it was catching up on neural MT, making the system faster and multi-region, making the system more responsive in the browser; there was just a lot of un-breaking work going on. Now we've got some pretty convincing results that the thing we really ought to focus on is how people use the system, that the greatest predictive variable of performance is just the individual's identity. When we look at how people use it, there's really high variance in the degree to which they utilize the suggestions, how they use the different input devices and the keyboard, how they navigate and work through a document. So the team's spending quite a bit of time on user training right now, actually.
Lukas:
So user training not like modifying the interface, but you're training people to-
Spence:
User training, yeah.
Lukas:
Interesting. Have you ever considered doing multiple suggestions? Is that possibly better or?
Spence:
Yeah. One of the reasons this predictive approach to MT didn't work really well before is that the interfaces built prior to our work used a dropdown box. It turns out, when you put stuff on the screen, people read it, which slows them down. So what you want to do is show them the one best prediction, the very best prediction you can show them.
Lukas:
I see. Interesting. I bet that's especially true when you're confident in your predictions.
Spence:
Yeah.
Lukas:
Cool. Are there any other surprises in terms of your interfacing with humans? I feel like...my last company was a labeling company...it just had all these interesting ways that the interaction between humans and machines surprised me. Has the way that you engage with the humans changed at all over the years that you've been running this, besides training?
Spence:
Maybe one of the biggest things that we learned is that, historically within translation...in this translation world, I mentioned this MT work goes back to the '50s. And professional translation as a profession...I don't know, it predates agriculture or something. It's a really old profession, right?
Lukas:
Sure.
Spence:
So these people have been engaged with AI systems for 50 years, and for most of that period of time, the systems were really bad. There's a lot of bias against these systems, and people, especially those who used them for a while when they weren't really good, were reluctant to try them. I think more broadly now people are using them because MT is a lot better, but we found that resistance to change was really significant, and the way to get around that was to align incentives better with the business model. What do people actually want more than they want to not embrace machine learning? Well, they want to get paid, they want to be recognized for their work, they want to be appreciated, they want to have a good work environment and work with good people. We found that when you do those things right, people are really open to, "Let me try this automation, I'm okay with the fact that you're changing the interface every week," and all that stuff.
Lukas:
That makes sense.

Bilingual grammatical error correction

Lukas:
Is there a feedback loop with the ratings? I would think that might be an important thing too, if you're then rating the quality of the translation.
Spence:
Yeah. We just submitted a paper to EMNLP, hopefully it'll get in. We've been working on bilingual grammatical error correction. What the reviewers do, you can think of as another review step. So we took an English input, we generated some French, maybe there are some bugs in the French, and we give that to another person, who then finds and fixes those bugs, or maybe makes some stylistic changes, or who knows what they do. That just becomes another prediction problem with two inputs, the English and the-
Lukas:
Corrected input?
Spence:
-unverified French, or whatever you want to call it. Then we're going to predict the verified French. You can use a sequence prediction architecture for that, sequence modeling. The team's been working on that for about the past year and a half, and they've got it working now. We announced it last fall, and we'll have it in production sometime in the second half of the year, I think.
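One straightforward way to feed a two-input problem like this to a standard seq2seq model is to concatenate the inputs with a separator token; the encoding below is an illustrative assumption, not necessarily what the paper does:

```python
# Sketch: bilingual error correction as sequence prediction with two inputs.
# The English source and the unverified French are joined with a marker,
# and the model is trained to emit the reviewer-verified French.

def make_gec_example(english_source, unverified_french, verified_french):
    model_input = f"{english_source} <SEP> {unverified_french}"
    return model_input, verified_french  # (input, target) for seq2seq training

x, y = make_gec_example(
    "The meeting is on Tuesday.",
    "La réunion est mardis.",    # unverified output with an error
    "La réunion a lieu mardi.",  # reviewer-verified French
)
```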
Lukas:
Wow, that's so cool. In production, what would that mean? Once you finish editing a document, it goes through and makes suggestions?
Spence:
Yeah. It's a fancy grammar checker. Only, it's a grammar checker that's data-oriented, instead of based on rules, and it can learn things. It can learn simple phenomena like spelling mistakes, but it can also learn stylistic edits.
Lukas:
Well it sounds like it's also incorporating the source language too, right?
Spence:
Yeah, so that's how it's different from Grammarly or the grammar checker you have in Google Docs or whatever: instead of having only one language to look at, the string that you're generating is constrained by this other source language input. So you can't just generate anything. You've got this very strict constraint, which is the source language.
Lukas:
Do you plan to do a separate one for every single document stream or work stream that you have?
Spence:
Yes.
Lukas:
Wow.
Spence:
You can use the same infrastructure for that, that you use for the translation.
Lukas:
That's so cool.

Human parity in MT

Lukas:
Well, cool. We always end with two questions that I want to give you a little time to chew on. One that's kind of open-ended, but where I'd be interested in your thoughts on MT specifically, is: what's an underrated aspect of machine learning or machine translation that you think people should pay more attention to, or that you'd be thinking about if you weren't working on Lilt?
Spence:
Maybe it's around the question that you posed earlier, which is the human parity question with translation. There was a paper, I don't know, two years ago, where Microsoft said "Human parity has been achieved"⁶, and then two weeks ago, Google published a paper on arXiv saying "Human parity has not been achieved"⁷. I think that in our application, there's a lot to translation quality, which is about the particular message that you're trying to deliver to an audience, and a lot of that has to do with how the audience feels. Certainly in my time in grad school, I was really focused on just generating the output that matches the reference, so the BLEU score goes up and I can write a paper. I think there's a lot of interesting work in thinking about the broader pragmatic context of the language that's generated, and whether it's appropriate for the context that you're in and for the domain. That's really hard to evaluate, but it's really worth thinking about, whether it's in natural language generation, or machine translation, or whatever else. So maybe that's what I would spend some time on: thinking about that a little bit harder.
Lukas:
Yes, the BLEU score is funny, because it seems like such a sad metric for translation. It makes sense that it works, but it just seems so ludicrously simple. I mean, at some point, I feel like it must lose meaning as the best possible metric, right?
Spence:
Well, people studied it a lot, and I think the conclusion was that it's the least bad thing that we've come up with. Over two decades of study, it continued to be the least bad; nobody could come up with anything that was as convenient and correlated better with human judgment. So maybe it's a testament to a simple idea that people are still using it 20 years later.
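For reference, computing corpus-level BLEU is a one-liner today with the sacrebleu package (`pip install sacrebleu`):

```python
import sacrebleu

# One hypothesis stream, one reference stream (lists are parallel).
hypotheses = ["the cat sat on the mat"]
references = [["the cat sat on the mat"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # 100.0 for an exact match
```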
Lukas:
I guess simple metrics are better than complicated metrics. There might be a lesson for business in there.
Spence:
There might be a lesson there too, yeah.

The unexpected challenges of prototype to production

Lukas:
But I guess the final question we always ask is, what's the biggest challenge of machine learning in the real world? But I'd like to tailor it to you a bit: what's been the hardest part of getting these language models to work in production? You touched on it a bit, but I'd love to hear, especially, any part that might have been surprising to you as an academic, before starting the company. Where have the challenges been?
Spence:
If I think back to when we started the company, with the research prototype that we had, you had to specialize it to one document. If you were going to translate a document, you had to compile part of it and then load it into a production system, and you could send it the document and it would translate it. If you sent it anything else, it basically wouldn't work. I remember when we raised money for the company, I told the investors, "Yeah, we're going to take this prototype and have a production product in six weeks or something." What actually happened is that it took us nine months, and the problems we had to solve turned into an ACL paper⁸. You should not do this. This is very bad. I think I really underestimated how far it is from a research prototype, even one that's actually a pretty effective system, to an MVP for something like what we do, which is taking any document from any company and generating a reasonable output. And doing that with the learning turned on, and the inference and all that stuff, getting to a large-scale production system. That's probably not surprising to anybody who's worked on these production-scale MT systems, but the amount of detailed large-scale engineering work that has to go into it was surprising to us, I think, even having worked on Google Translate.
Lukas:
Well, can you give me an example? What was something you ran into? Because it does seem like that shouldn't take nine months. What came up?
Spence:
Well, in those days, in that original system, you had to be able to load the entire bitext into memory. The systems stored words as atomic strings, and you had to have all the strings in memory to be able to generate a translation. We did a lot of work on what's called a compact translation model, where you can load the entire bitext into a running production node and the lookups happen fast enough that you can generate an output. I think in the neural setting, what's been really challenging is that you can't do batching. You can't just put it on a GPU or a TPU, because of the latency constraint that you have. That's meant a lot of work on CPU inference, and on the way the production infrastructure swaps personalized models onto and off of the production nodes. It seems conceptually really simple, but when you actually get down into it, you're like, "Wow, we've been at this for two months and we're still not quite there yet. What's happening?" And that's sort of been our experience, I think.
Lukas:
Interesting. And I guess at the time, there was probably a lot less stuff to help you.
Spence:
Yes, there was no Kubernetes, there was none of that type of infrastructure.
Lukas:
Awesome. Well, thanks so much. This was really fun. And thanks for sharing so much about how your company operates.
Spence:
Yeah, it's always good to chat with you.
Lukas:
If you're enjoying Gradient Dissent, I'd really love for you to check out Fully Connected, which is an inclusive machine learning community that we're building to let everyone know about all the stuff going on in ML and all the new research coming out. If you go to wandb.ai/fc, you can see all the different stuff that we do, including Gradient Dissent, but also salons where we talk about new research and folks share insights, AMAs where you can directly connect with members of our community, and a Slack channel where you can get answers to everything from very basic questions about ML to bug reports on Weights & Biases to how to hire an ML team. We're looking forward to meeting you.