How Pandora Deploys Machine Learning Models into Production with Amelia and Filip

Amelia and Filip give insights into the recommender systems powering Pandora, from developing models to balancing effectiveness and efficiency in production.
Angelica Pan

Listen on these platforms

Apple Podcasts Spotify Google YouTube Soundcloud

Guest Bios

Amelia Nybakke is a Software Engineer at Pandora. Her team is responsible for the production system that serves models to listeners.
Filip Korzeniowski is a Senior Scientist at Pandora working on recommender systems. Before that, he was a PhD student working on deep neural networks for acoustic and language modeling applied to musical audio recordings.

Show Notes

Topics Covered

0:00 Sneak peek, intro
0:42 What types of ML models are at Pandora?
3:39 What makes two songs similar or not similar?
7:33 Improving models and A/B testing
8:52 Chaining, retraining, versioning, and tracking models
13:29 Useful development tools
15:10 Debugging models
18:28 Communicating progress
20:33 Tuning and improving models
23:08 How Pandora puts models into production
29:45 Bias in ML models
36:01 Repetition vs novelty in recommended songs
38:01 The bottlenecks of deployment

Transcript

Note: Transcriptions may contain some inaccuracies. Please submit any corrections to angelica@wandb.com. Thank you!
Filip:
It's a balance that you have to find between having enough novel content and knowing which users like more novel content and which users prefer to hear the same old songs
Lukas:
You're listening to Gradient Dissent, a show where we learn about making machine learning models work in the real world. I'm your host Lukas Biewald. Filip is a scientist at Pandora working on recommender systems. Before that, he was a PhD student working on deep neural networks for acoustic and language modeling applied to musical audio recordings. Amelia Nybakke is a software engineer at Pandora, where she runs a team responsible for the production system that serves models to listeners. I'm super excited to talk to both of them.

What types of ML models are at Pandora?

Lukas:
Maybe I could start by asking at Pandora, what do the machine learning models actually do and how would a Pandora user experience the results of the models?
Filip:
Well, we're a big company. So at SiriusXM Pandora, we have a lot of different models spread across all the product features. Almost every digital product has some science background or some science module powering it. We have internal models for content understanding, we have advertising models, and we have, of course, our main feature at Pandora: recommendations. We have a lot of work done in recommendations and figuring out which songs to play next. And of course, having so many models interact with each other makes it quite a complicated scenario. So we need really powerful and strong engineering to make everything work smoothly and to prevent outages and stuff like that. My work mainly focuses on the musical side. I work for, basically, the algorithmic programming team that focuses on creating the radio experience. And what I do is compute similarities between artists or tracks. So given this track, which tracks are similar, and so on. And Pandora is the best place to do that because we have this awesome data that nobody else really has. We employ a lot of music analysts who really listen to the songs and annotate them manually. This is the dream of every PhD student in this field: "Oh, wow, I actually have experts looking at the data and annotating it." And also our users provide us feedback. Over time we've collected over a hundred billion thumbs-up or thumbs-down ratings telling us whether users like a song or not. So we have very detailed and strong features on one side and very nice explicit feedback from users on the other side. It's the perfect scenario to create very, very powerful models.
Lukas:
Do you have a model that's sort of explicitly trying to understand the similarities between songs?
Filip:
Yeah, exactly. That's one of the projects I worked on last year. We had a model like that from the beginning, a model that tries to understand which songs are similar. But of course, now with the deep learning revolution, we try to replace the old models with more sophisticated neural networks that take the features we get from the music analysts and map them to what people think about similarity, because it's not obvious which features actually make a song sound similar, right? Is it the tempo? Is it the mood? And by using machine learning, we're able to stop thinking about that and just make the computer do this for us.
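A minimal sketch of the idea Filip describes, not Pandora's actual architecture: a small network (hypothetical sizes, PyTorch used here for illustration) that maps analyst-annotated song features into an embedding, with cosine similarity between embeddings standing in for how similar two songs sound.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SongEncoder(nn.Module):
    """Maps hand-annotated song features to an embedding."""
    def __init__(self, n_features: int, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128),
            nn.ReLU(),
            nn.Linear(128, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so the dot product below is a cosine similarity
        return F.normalize(self.net(x), dim=-1)

def similarity(encoder: SongEncoder, song_a: torch.Tensor, song_b: torch.Tensor) -> torch.Tensor:
    return (encoder(song_a) * encoder(song_b)).sum(dim=-1)

# Hypothetical usage: 300 analyst-annotated features per song
encoder = SongEncoder(n_features=300)
a, b = torch.randn(1, 300), torch.randn(1, 300)
print(similarity(encoder, a, b))  # a value in [-1, 1]
```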

What makes two songs similar or not similar?

Lukas:
And I guess as a fan of music myself, I really wonder what the similarity really means, right? Long ago, when you were famous for having all these annotators, you probably had to really think deeply about what similarity really means. Do you have a definition in your head of what makes two songs similar or not similar?
Filip:
I don't know. It's very hard to pinpoint that, right? When Pandora started, I don't know, maybe Amelia knows better when it started, it's 2000 and something?
Amelia:
Something like that, yeah.
Lukas:
That's when I first started to hear about it yeah.
Filip:
Yeah. So back then, what people would do is just compute distances manually. They would take the features and weight them, and manually create an algorithm that tells us what's similar and what's not. And they would look at how the lists looked, and this is bootstrapping from nothing, right? But gradually, of course, we collected more and more feedback, and we can replace that just by using models. We don't have to think it through by hand anymore. So we have an idea of how this works, which I obviously share, but it turns out that if you run a model on that, it figures it out a little bit better than humans could do before.
Lukas:
I guess I would imagine, though, that one definition of similarity would just be: these two songs, people who like one song also like the other song. Is that how you think of similarity, or is it somehow deeper than that, that these two songs are fundamentally similar songs? I would think a person who likes one recent top 40 song might like another recent top 40 song, but they might be totally different genres. So does your model try to say these two songs are similar or not similar? Or what does it do in that case?
Filip:
We're more focused on this radio experience, right? So you select an artist, for example, or a track to start a station from, and this is maybe a more direct or specific way to define similarity. Basically, similarity is what kind of songs I can play on that radio. So if a person likes top charts music, they won't want to listen to some hip hop song on their dance radio, right? So in this case, we don't have to model the taste of the user, although we do that, of course, for other things. But in terms of music similarity, we really think, "Okay, which songs can we play on this radio station?"
Lukas:
So I would think that your model would need to look at both the sort of musical elements of the song and kind of other things, right? Because our culture kind of affects our sense of similarity. What other things does your model look at besides just the audio of the song?
Filip:
Well, it's just as Amelia said. We have different models for different aspects of this whole musical experience, right? So we have models that are just based on the musical features. We have models that are just based on the audio. And this is especially important because when you do recommendations, no matter if it's Netflix or Pandora or whatever, you have this long tail of unknown items that few people have listened to. So we can't really understand from user interactions who would like them. This is the way we deal with that: we go from the content, from the audio or from the musical features. But for songs where we've already gathered a lot of feedback, it's easy for us to just do the classic thing: oh, somebody liked this song and this song, and they're similar to you, so maybe you'll also like that song. Depending on whether a song is very popular or not, different recommenders work better or worse.

Improving models and A/B testing

Lukas:
Got it. And so what happens when the model improves? How would I, as a user of your product experience a better model? Would I notice when you put out a new version that does better recommendations?
Filip:
Well, we know this, right? Because we have a very powerful A/B testing framework that Amelia works a lot with, right? And when we create a new model and we add it to our ensemble of recommenders or we improve one of the models, we just deploy it very quickly in an A/B test. And after a few hours, we already get results. So we see oh, people thumbed up more, or people spent more time listening. Amelia worked a lot with these A/B tests, right? So you know a lot about that.
Amelia:
Yeah. I would expect that you personally would not notice other than, "Oh, Pandora has been really getting it right today." Yeah, but we see things like listeners are thumbing more in one or another direction. They're spending more time on Pandora, they're creating more sessions. We have a bunch of different things we can look at to see how we're affecting listener behavior.
Lukas:
Okay, so maybe let's get a little more technical for our audience. I do have a zillion questions as a Pandora user, but the point of this is supposed to be around how you actually make these models.

Chaining, retraining, versioning, and tracking models

Lukas:
Do you actually chain these models together? It sounds like you take the output of a lot of these different models and then use all those outputs to make decisions in your application?
Amelia:
Yeah, absolutely.
Lukas:
Everyone kind of talks about this scenario where one model changes and it has unintended consequences. How do you deal with that, all the models connected together?
Amelia:
That's a good question. One of the ways that we use models is during our song recommendation pipeline. The ensemble recommender system proposes a set of candidate songs and passes them to a microservice that handles the real-time evaluation of a machine learning model. And that machine learning model is a larger, overarching model that figures out how the other models are informing the decision. Did that...
Lukas:
Let me see if I can repeat it back to you and tell me if I got this right. So it sounds like you have an ensemble model or kind of several models that take into account different things, like maybe the actual audio quality of the song, and you mentioned sort of non audio features of the song and it proposes several songs that you might play next. And then you have another model that runs as a microservice that looks at those options and maybe it takes into account more things and decides the actual specific song that gets played?
Amelia:
Yeah, that's exactly right. And some of the features coming into the final model are from the previous models, the models from the ensemble recommender.
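A minimal sketch of the two-stage flow described above, with illustrative names rather than Pandora's actual services: an ensemble of recommenders proposes candidate tracks, and a single downstream ranking model picks what to play, using the upstream scores as features.

```python
from typing import Callable, Dict, List

# One recommender: given a station, return {track_id: score} for its candidates.
Recommender = Callable[[str], Dict[str, float]]

def recommend_next(station_id: str,
                   ensemble: List[Recommender],
                   ranker: Callable[[List[float]], float]) -> str:
    # Stage 1: pool candidates from every recommender in the ensemble.
    proposals = [rec(station_id) for rec in ensemble]
    candidates = set().union(*(p.keys() for p in proposals))

    # Stage 2: the overarching model scores each candidate, using the per-recommender
    # scores (0.0 if a recommender didn't propose that track) as its features.
    final_scores = {
        track: ranker([p.get(track, 0.0) for p in proposals])
        for track in candidates
    }
    return max(final_scores, key=final_scores.get)
```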
Lukas:
Do you have to retrain the microservice every time you deploy a new model upstream of it?
Amelia:
I can tell you that the model that the microservice uses is retrained every day with fresh data, and we have validation that runs to make sure that our results aren't totally wacky before we actually upload it to the microservice. And we have daily reports that show us things like feature importances and average values, so we can keep an eye on how the model is changing day to day.
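A minimal sketch of that daily retrain-and-validate loop. The helpers train_model, evaluate, and publish_to_store are hypothetical; the point is only that the fresh model is published for serving only after passing a sanity check against the previous day's metrics.

```python
def daily_retrain(fresh_data, holdout, previous_metrics):
    model = train_model(fresh_data)        # hypothetical: retrain on yesterday's thumbs, spins, etc.
    metrics = evaluate(model, holdout)     # hypothetical: e.g. AUC plus feature importances

    # Validation gate: refuse to publish a model whose results look "totally wacky".
    if metrics["auc"] < previous_metrics["auc"] - 0.02:
        raise RuntimeError("Fresh model regressed; keeping yesterday's model in production")

    publish_to_store(model)                # hypothetical: write to the store the microservice reads
    return metrics
```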
Filip:
The nice thing about that is, for example, when I deployed my recommendation system last year, it's addictive, because you look at the numbers every day and go, "Oh yeah, I recommended that many songs and people liked it that much." It's really nice how easily that works. You just add a new model, you wait a bit, and the microservice pulls it in and selects from it. It's really cool.
Lukas:
And I guess what do you use to keep track of all the versions? This is something that a lot of our users are asking us all the time. How do you version your models and version the data that the models are trained on? How do you think about that?
Filip:
Yeah. Everybody asks that because it's a very hard problem, right?
Lukas:
So if you could walk me through as much detail as you can, how you do that. Because I'm sure a lot of people are wondering.
Filip:
For code it's pretty easy, right? Everybody uses Git, and we use Git for basically everything. We have our own instance of a product that we use, and all the code that trains the models and all the code that runs in production is on that server. Tracking model versions is way more difficult, especially during development, right? Because you run a lot of experiments and you try to compare them. Until recently, everybody wrote their own libraries that store the config somewhere and compute hashes, so you can track back if you want to find something, but that was really a pain. I would have an experiments directory with 200 sub-directories for different experiments, and I would have a Google Sheet somewhere that stores the names of the important experiments, so I know which models to use when I want to deploy them. Now, since we use Weights & Biases, this got way easier, because we just log our experiments, we can filter them easily, and we can compare them very easily. We store, for example, the weights there and just pull them when we need to. And when we decide, "Okay, this is the model we want to go with," we just download the model and that's it. So keeping track of models during development has gotten much easier through that.
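A minimal sketch of that workflow using the public Weights & Biases API; the project and artifact names, and the train_one_epoch helper, are made up.

```python
import wandb

run = wandb.init(project="track-similarity", config={"lr": 1e-3, "dim": 64})
for epoch in range(10):
    train_loss = train_one_epoch()               # hypothetical training helper
    wandb.log({"epoch": epoch, "train_loss": train_loss})

# Store the trained weights as a versioned artifact instead of a numbered
# experiments/ sub-directory plus a Google Sheet of "important" runs.
artifact = wandb.Artifact("similarity-model", type="model")
artifact.add_file("model.pt")
run.log_artifact(artifact)
run.finish()

# Later, when deploying, pull back exactly the version you decided to ship.
deploy_run = wandb.init(project="track-similarity", job_type="deploy")
model_dir = deploy_run.use_artifact("similarity-model:latest").download()
```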
Lukas:
I'm so glad to hear that. I appreciate the Weights & Biases shout out, but I'm not trying to make this only about Weights & Biases.

Useful development tools

Lukas:
I guess another question we get all the time is what are the other tools that you use day-to-day to make your life easier as a machine learning practitioner?
Filip:
So I think this is more about the development part, where we create the model, we train it, and we look at the data and try to figure out what to do. Almost everybody I work with uses IntelliJ for development, because it's just this one IDE that rules them all. It has all the languages. We mostly work with Python for the experiments, and then once we're done, we either use Python with PySpark or Scala with Spark to deploy the code in production. And with IntelliJ, all of this gets so much easier because it speaks all these languages. It has very nice plugins to connect to Google Cloud, which is a service we're using for almost everything now; we switched a few months ago and it also made our lives much easier. There are plugins where you can connect to a Dataproc cluster and inspect all the databases, schemas, and tables, and you get column completion when you write your SQL statements, which is just so incredible. The first time I saw that, I was like, "Wow, this changes everything." So yeah, mostly IntelliJ. Another very nice thing is the remote debugging feature, where you don't have to log in with SSH to your training server and try to debug with a command line; you have a visual debugger where you can inspect the variables and still run the code remotely. So it's a pretty strong tool for me and makes my life much easier.
Lukas:
Can you talk a little bit about how you debug models? This is another question everyone has. Could you walk me through your process a little bit when that goes wrong and actually the performance goes down, what do you do?
Filip:
Well, then it's just going through the code and tracing back what you changed and what might have caused the problem. But I think it's more important to never come to this point, and I try to do that by being a slow starter. So I don't try to write the most complex model right from the start, but start slowly, first get some of the data, and then try to make sure the data makes sense. I try to select a small model and try to overfit it on a small dataset. Is that possible? I also, for example, randomize the features to make sure there's no problem with the train/test split, and check that the model actually produces garbage when I put garbage in, because that should happen, right? There was a very nice blog post by Andrej Karpathy on that topic called "A Recipe for Training Neural Networks". He's pretty good at what he's doing, so I'm just trying to follow this recipe as well as I can. Making sure you understand the data, making sure that you don't evaluate something that you don't care about. You should make sure that the numbers you get actually reflect what you want to see at the end. And that's basically it: just being very defensive with your development and checking things again and again. It's difficult to debug models because there is no right way. If you have a bug in your training code, it will still mostly work; it's not like it will crash and burn. It will work, but worse.
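A minimal sketch of two of those sanity checks in PyTorch, in the spirit of Karpathy's recipe: first confirm a small model can overfit a tiny slice of data, then confirm performance collapses when the inputs are randomized.

```python
import torch

def can_overfit_tiny_batch(model, loss_fn, x_small, y_small, steps=500):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(x_small), y_small)
        loss.backward()
        opt.step()
    # If a small model can't drive the loss near zero on a handful of examples,
    # there is likely a bug in the data pipeline or the loss.
    return loss.item() < 1e-2

def garbage_in_garbage_out(model, loss_fn, x, y):
    # Shuffling the features should destroy any real signal; if the model still
    # scores well, information is leaking somewhere (e.g. train/test overlap).
    x_shuffled = x[torch.randperm(x.shape[0])]
    return loss_fn(model(x_shuffled), y).item()
```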
Lukas:
Right? Do you have any bugs that come to mind as particularly difficult or ones that you struggled with for a long time?
Filip:
I was training an embedding network that uses the triplet loss, where you have to select positives and negatives, right? And the data was stored as a sparse matrix. So you had a matrix of which items are connected to which; that's the ground truth. So it's very easy to understand which positives to sample for a given item, because there is a one in the matrix.
Lukas:
Wait, so I'm not an expert in the space. What does a one mean here?
Filip:
That they're connected. Say, for example, you have tracks and which ones are similar. But the problem was with masking, because when you do a train/test split, you have to mask that matrix so you don't use any data from the test set in your training set. The problem then is that you don't know whether a zero, meaning no entry, means that it's masked because it's in a different split, or whether there is actually no connection. What ended up happening is that I didn't sample all the negatives that were possible. And this of course makes your training harder, because you're not using all the data. Finding that out was pretty tough, because it still works.
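A minimal sketch of that pitfall and one way around it, using SciPy sparse matrices (illustrative, not Filip's actual code): in the ground-truth matrix a one means "connected," so after the train/test mask a zero is ambiguous, and keeping the test mask around lets you sample only true negatives.

```python
import numpy as np
import scipy.sparse as sp

def sample_negative(anchor: int,
                    train_pos: sp.csr_matrix,   # 1s = positives kept for training
                    test_mask: sp.csr_matrix,   # 1s = positives held out for the test split
                    n_items: int,
                    rng: np.random.Generator) -> int:
    train_row = set(train_pos[anchor].indices)   # known positives in the training split
    masked_row = set(test_mask[anchor].indices)  # positives hidden by the train/test mask
    while True:
        candidate = int(rng.integers(n_items))
        # A valid negative is a zero in *both* matrices: not a training positive
        # and not a held-out positive that merely looks like a zero.
        if candidate not in train_row and candidate not in masked_row:
            return candidate
```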

Communicating progress

Lukas:
All right. So maybe this is a question for both of you actually. This is a question that comes up a lot that people always want me to ask is how do you communicate the progress you're making with the non-technical people outside of your team, but in the company?
Amelia:
At least for the system that I'm working on, we have weekly meetings with our PM, who communicates up the ladder. Occasionally, at the end of the year or maybe the end of the quarter, we'll present to the broader product organization what changes we've been making and how they've been affecting our core metrics.
Lukas:
I think sometimes people tell me that they have this experience where other teams are kind of working on engineering projects with sort of, "Add these features that are very visible." But the stuff that both of you are working on can feel more experimental and there can be long periods of time where the experiments aren't working and that can be frustrating. Is that consistent with your experience or not?
Filip:
Well, luckily, our direct managers, at least Mike, and I think every manager in the science department, used to be scientists in a previous life. So they know how science works: you can make a lot of progress in a few weeks, and you can be stuck for a month just iterating on experiments where nothing really works. The good thing is, because of all this infrastructure we have, the microservice Amelia was talking about, we can actually trace back every thumb or every song that somebody liked to all the individual models. So in the end, after, say, every quarter, we can actually put a number on how many more thumbs we get because of this contribution and how much more time people actually spend listening to Pandora. And since Pandora is an ad-based service, that translates very well to money.
Lukas:
Sure. That makes sense.

Tuning and improving models

Lukas:
Okay, another question for both of you is how do you think about tuning and improving your models? Do you do that kind of hyperparameter search that a lot of people talk about or is it more intuitive or is it more structured?
Filip:
Yeah, I think when it comes to hyperparameter search, it's a hybrid, right? Because we deal with similar problems all the time, oftentimes we already have a good guess how the model should look to get reasonable results. And most of the time, this is just where I start. I would just try five different configurations to see how big or how small I can go, and then just settle on a model that works well enough, and that's it. Then I just keep iterating on different things like, "Okay, which other kinds of features can I use? Can I pull other data and integrate it somehow?" And once all of this is done, once I'm quite confident, "Okay, this is the model structure that will probably work," these days with Weights & Biases we just create a hyperparameter sweep. You don't have to change anything in your code. You start it on a Friday, it runs over the weekend or longer, depending on the size of the model, and then you're done. It just saves a lot of headache if you can run this automatically and without much thinking.
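A minimal sketch of such a sweep using the public Weights & Biases sweep API; the parameter names, ranges, and the train_and_validate helper are made up.

```python
import wandb

sweep_config = {
    "method": "bayes",
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"min": 1e-4, "max": 1e-2},
        "embedding_dim": {"values": [32, 64, 128]},
        "dropout": {"values": [0.0, 0.2, 0.5]},
    },
}

def train():
    run = wandb.init()
    cfg = wandb.config
    val_loss = train_and_validate(cfg)   # hypothetical training function
    wandb.log({"val_loss": val_loss})

sweep_id = wandb.sweep(sweep_config, project="track-similarity")
wandb.agent(sweep_id, function=train, count=50)   # kick it off on a Friday, let it run
```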
Lukas:
And so you spend most of your time, it sounds like thinking about more data you could get or different features you could try than the hyperparameters?
Filip:
Most of the time, yeah. Honestly, this is a difference between working in academia, when I was doing my PhD, and actually working in industry. In academia it's like, okay, this is the dataset, that's the standard in this field, you take it and you try to improve. You try to create a new method or whatever. But in industry, at Pandora, it's like, "Okay, we want to solve that problem. Here's all the data we have to solve it." So you have to think about, "Okay, what kind of data makes sense to use? Which biases would that induce when I use that data?" So thinking about which data to use and how to clean it up, because that's a big problem. If you work with real data, you of course have outliers, you have some problems here and there. I've spent much more time thinking about data since I started working in industry than before. Definitely.

How Pandora puts models into production

Lukas:
All right. Well maybe let's talk a little more about production. I mean, you started to talk about the microservices, but I'd love to hear more about how you actually serve the models in production.
Amelia:
Yeah. We're generating our production models in GCP, and then we upload them to Redis, which is a key-value store. That's where the microservice can read them. And then, to avoid having to go to Redis every time we need a model, we stash them in a Guava cache on the heap, because, at least for the models that I work with, we're using them every time there's a request for more songs on a listener's station. So that's so often-
Lukas:
Are these deep learning models or simpler?
Filip:
I think the model you're talking about, in the microservice, is not, just because it has to serve a lot of requests in real time. So you just can't afford to run a complicated deep learning model at that point. The recommenders in the ensemble, there are a few deep models there, of course, but for the final selection of the track, I think the models provide enough features and enough candidates to just have a simpler model like that.
Lukas:
All right. Cool. I see. So there's sort of bigger models that run in batch mode where they don't have to be real time. Is that right? And then the final model you talked about has to do real time so it's lighter weight?
Amelia:
Yeah. And we've definitely had the experience where we've tried increasing the size of the model and had to pull that back because it wasn't performant enough. Performance is definitely something that we're always keeping in mind. We don't want the user to wait around.
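A minimal Python illustration of the serving path described above (the real service runs on the JVM with a Guava cache): models are written to Redis by the training job, and the microservice keeps a small in-process cache in front of Redis so it doesn't pay a network round trip on every song request. Connection details and the serialization format are assumptions.

```python
import pickle
import redis
from cachetools import TTLCache

r = redis.Redis(host="localhost", port=6379)   # stand-in connection details
local_cache = TTLCache(maxsize=16, ttl=300)    # a few models, refreshed every 5 minutes

def get_model(model_key: str):
    if model_key in local_cache:               # in-process hit: no network call
        return local_cache[model_key]
    blob = r.get(model_key)                    # fall back to the key-value store
    model = pickle.loads(blob)
    local_cache[model_key] = model
    return model
```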
Lukas:
So, what are the hard parts about getting that model into production? What are the day to day challenges?
Amelia:
Yeah. Efficiency is the biggest thing that I worry about, that latency again. We're always trying out new changes, things like Filip mentioned, adding new features. Sometimes we'll try partitioning listeners in a different way. So sending different listeners to different models or changing the size of the model. And sometimes those changes will look really promising offline, but then we try it in production and we'll see that it's too costly computationally. Yeah. We're getting hundreds of thousands of requests every minute, we've got to be super fast.
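A minimal sketch of deterministic listener partitioning of the kind Amelia mentions; the experiment name, bucket count, and percentages are illustrative. Hashing the listener ID gives a stable assignment, so the same listener always hits the same variant for the duration of a test.

```python
import hashlib

def assign_variant(listener_id: str, experiment: str, treatment_pct: float = 1.0) -> str:
    digest = hashlib.sha256(f"{experiment}:{listener_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 100.0        # a stable number in [0, 100)
    return "treatment" if bucket < treatment_pct else "control"

# e.g. send 1% of listeners to a new, heavier ranking model
variant = assign_variant("listener-42", "bigger-ranker-test", treatment_pct=1.0)
```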
Lukas:
I'm curious, have you always been deploying a new model every day or is that a new process for you?
Amelia:
For the last couple of years we've been doing that every day. Occasionally we'll skip days if we don't pass validation that day and then somebody will go and look and make sure that it's reasonable or see if we need to make changes. But yeah, we're staying pretty up to date.
Lukas:
And I guess, what causes that? Do the songs change every day or the songs people change everyday?
Amelia:
Songs can change every day. Yeah. I think mostly we just want to make sure, we have the latest data, the latest thumbs, the latest completion rates, the latest way that listeners are reacting to songs.
Lukas:
I see. So I guess you sort of don't have to worry as much about kind of data drift because you're retraining every day on the latest data?
Amelia:
Yeah. I suppose that's less of an issue for the particular model that I'm working with.
Lukas:
Do you have a production monitor in place for that model? Do you look for signals that bad things might be happening?
Amelia:
We certainly have dashboards that monitor things like number of requests and latencies and CPU, thread counts, things like that. But mainly the way that we monitor things is those A/B tests, where we're pretty confident that our control model is pretty darn good. Any changes that we're making, we're comparing against the control model.
Lukas:
I see. Order of magnitude, how many A/B tests can you run in parallel? I'm jealous of how many users you have, it must be amazing to get that data.
Amelia:
Yeah. Our particular group is fronting tens, probably hundreds if you look at our broad product area. And then I think thousands, if you're looking at the whole company.
Lukas:
Wow. Is it tricky to swap models in and out in production or is that simple for you?
Amelia:
It's simple mechanically. We can just overwrite the value in the cache, but in practice we're a lot more careful, we always run an A/B test. We never swap anything in, without making sure that it's moving metrics in the right way and not degrading the experience for the users. But yeah, mechanically, it's really simple.
Lukas:
Can you talk about the steps that you go through from taking a model from experimental to it's the model that's blessed as the one that runs by default in production?
Filip:
I can speak to the recommendation models that we use for stations, and that's actually also not that complicated, because in the end, what you do is you experiment with the model and then at some point you think, "Okay, this is the model I want to use." Then what I would do is just translate that model into an Airflow DAG, where it can run weekly or daily, or however often I think is necessary. In the easiest case, I would just produce a table on GCP with recommendations. Then I'll just ping an engineer and say, "Hey, there is a new model around, look at this table," and they will pull it into this ensemble where all the candidates are being pulled together. And for a certain number of users, the microservice will then pick songs by that model. In the beginning, it's just a very small percentage, of course. We don't throw this new model at all the users, because we don't know how it behaves, right? So we try it on, say, 1% of the users and observe the numbers. Do they like the recommendations? Do they thumb up songs recommended by this recommender? And also, does it make sense to add this recommender to the ensemble at all? Because maybe it recommends awesome songs, but it doesn't add anything to the mix. And that's basically it, and then it's in production.
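A minimal sketch of the Airflow piece of that flow; the DAG id, schedule, and the compute_recommendations body are made up. The idea is simply a recurring job that rescores candidates with the blessed model and publishes a recommendations table for the downstream ensemble.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def compute_recommendations():
    # hypothetical: load the blessed model, score candidate tracks,
    # and write a recommendations table on GCP for the ensemble to pull
    ...

with DAG(
    dag_id="station_recommendations",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="compute_and_publish",
        python_callable=compute_recommendations,
    )
```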

Bias in ML models

Lukas:
Okay. Well, we always end with these questions. I'd love for both of you to weigh in if you feel comfortable. The first one is what's an underrated aspect of machine learning that you think people should pay more attention to than they do?
Filip:
Well, the thing that I always think of is maybe not that directly, technically related to machine learning, but in general: it's ethics and diversity and equality. That's a topic that comes up sometimes now in machine learning, and it's getting more prominent, but I still don't think it's enough, because we are just creating all these models that do very seemingly smart stuff, but few people actually look at, "Okay, what are the consequences of these things?" And even some figureheads from academia and industry say, "Oh, the models, they're not biased. We don't really have to care that much about that because the models are just neutral." And it's kind of right, right? The model has no bias, but it learns the bias from training data. And the training data we use is stuff that's happening right now or that happened in the last 10, 20 years. So what the model learns is to reproduce that bias. And I don't see a way to really tackle that from a data perspective, just because, let's take the new GPT-3 model, the language model developed by OpenAI, which was trained on 410 billion tokens. How do you change the training data in a way that it doesn't reproduce gender bias or racial bias? It's impossible, I think. I think we have to think very carefully about how we use these models and how we can integrate some way of human decision-making in the whole process, and not just blindly trust whatever the model says.
Lukas:
Can I ask, does that come up at Pandora? I feel like you have, in some ways, a really wonderful, kind of fun application of machine learning, and maybe one of the few places where there might be fewer ethical concerns. I can imagine some, but do you think about it day-to-day?
Filip:
Well, that's the very reason why I got into music and machine learning: it's just very hard to do bad things in music. But actually, we do have some discussions about that. Let me give you two examples. One is what we call popularity bias. It's known that basically all recommendation models suffer from popularity bias, meaning that they recommend popular items more often. And most of them actually recommend popular items more often than their popularity would suggest. So they reinforce the whole thing.
Lukas:
Right. Because it's a safe choice, maybe?
Filip:
Exactly. It's actually quite hard to beat a recommender that just plays the most popular songs, looking just at the numbers. So we have some functionality included, maybe not in the individual models, but at the end we try to diversify artists. We try to boost artists that are not very popular, because it basically helps everybody. It helps the user to find new artists. It helps the artists to get more exposure that they wouldn't get otherwise. So I think it's a good thing to do, basically. Plus, some of the recommenders, as I said, are just looking purely at the musical information, right? What does the song sound like? Or what characteristics did the analysts annotate? This is just a way to try songs that don't have much feedback data and are hard to recommend otherwise. And another thing that we recently started discussing, and we intend to explore further, is that we found that for some genre stations, we have a very imbalanced distribution between male and female artists. And of course, nobody at Pandora decided to, I don't know, make the country radio only play male artists, but this is what just happened, because we look at what people listen to, right? And we always take care that every listener gets what they want to listen to. So if somebody just likes hearing male voices, they will just get male artists, and if somebody just likes female voices, they will get female artists. But we're discussing how we can create a better balance, pushing new female artists more in the scenarios where we have a strong imbalance.
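A minimal sketch of one common way to counter popularity bias when re-ranking candidates (illustrative, not Pandora's actual method): discount each track's score with a penalty that grows with its popularity, so long-tail artists get a boost without ignoring relevance.

```python
import math
from typing import Dict, List

def rerank(scores: Dict[str, float], popularity: Dict[str, int], alpha: float = 0.1) -> List[str]:
    def adjusted(track: str) -> float:
        # log-popularity penalty: very popular tracks need a higher raw score to win
        return scores[track] - alpha * math.log1p(popularity.get(track, 0))
    return sorted(scores, key=adjusted, reverse=True)
```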
Lukas:
Well, that's really cool. Amelia do you have any thoughts on that topic?
Amelia:
I really appreciate that Filip is thinking about it. I have noticed that the music that I tend to like is male artists, but I, as a woman, would like to support female artists. And I would like to be able to find female artists that I enjoy. And I would like to see that promotion of female artists happen in Pandora. We do things in the product that will try to offset some of that imbalance. For instance, during Women's History Month last year, we created personalized playlists for our premium users that were only female artists. My playlist was very good.
Lukas:
Nice. You did share it in the show notes.
Amelia:
Yeah. And we do things too like, Black History Month, we had some personalized, I think, Pandora stories that we shared out. Yeah. So we're definitely trying to make a small bit of difference in that bias.

Repetition vs novelty in recommended songs

Lukas:
This is a pretty broad question, but do you have any thoughts on... I feel like sometimes these recommendation systems, and machine learning in general, kind of get a knock for optimizing for our reptile brains. I could see that with Pandora: maybe wanting in the short term to hear the same songs over and over, but in a more, I don't know, higher-brain sense wanting to be exposed to new music. Do either of you think about that day-to-day? Do you feel like it's possible to over-optimize for a thumbs up or listen times?
Filip:
Definitely, yeah. It's something that we always have in mind. Of course, the direct metrics are time spent listening and so on, but we definitely hear users saying, "Okay, there's just too much repetition." This is something that's very hard to measure in a very direct way. And as Amelia said, if you just lightly reduce repetition, because it's an easy thing to do, it tends to annoy some people. It's a balance that you have to find between having enough novel content and knowing which users like more novel content and which users prefer to hear the same old songs all the time. So it's definitely something that we have to keep in mind, and we do.
Amelia:
Yeah. One of the things that we're doing in the product to relate it to that specific question is the modes. So if you're on an artist station and you're getting tired of your normal station experience and you're really wanting to get some new stuff in there, you can go into discovery mode and you'll get some really fresh songs. But then when you get tired of hearing new stuff, because that's sort of exhausting, constantly having new content thrown at you, you can go back to your old experience.
Lukas:
Awesome.

The bottlenecks of deployment

Lukas:
Well, the final question, and we're running out of time, but I want to make sure I ask it: what's the biggest challenge of actually getting machine learning models deployed in the real world, from the conception of it to it actually being in people's hands, giving them better views? Where are the surprising bottlenecks?
Filip:
Well, I think we talked about that a little bit already. For me coming from academia it was that first of all, you approach the problem from a different point of view because before you just have the data set and you're trying to improve the model, even if it's just by one percentage point accuracy. And now it's more like you have a problem and the first step is to find the data that solves that problem. So you have this huge data store and "Okay, how can I find the data that solves the problem?" Then you develop a model which is pretty similar. And then at some point you have to ask yourself, when is the model good enough? Because you can always keep on tuning. This is science, right? So you can just keep on improving forever and here the difference is of course, that one or two percentage points improvement in academia gets you a new paper. In industry it might not even matter because the impact on the end user is so small because you have a hundred other recommenders in the ensemble. And then for me, the hardest part was to just let it go at some point and just say, "Okay, this is it. That's enough."
Amelia:
For me, and I totally already mentioned this, but the biggest challenge is always making sure your machine learning model is performant enough to make predictions in real time. I think during the research phase of development, you can focus on the accuracy of predictions without worrying a ton about the latency of the predictions. But in production, the prediction latency has to be low enough that a user isn't waiting around for results. So there's definitely a balance there between the effectiveness of a model and the efficiency of a model.
Lukas:
Right. Spoken like someone who really has models in production. It's so great to talk to both of you. I really appreciate it. That was super fun. I feel so proud that we could help you guys. At Weights & Biases, we make this podcast, Gradient Dissent, to learn about making machine learning work in the real world. But we also have a part to play here. We are building tools to help all the people that are on this podcast make their work better and make machine learning models actually run in production. If you're interested in joining us on this mission, we are hiring in Engineering, Sales, Growth, Product, and Customer Support, and you should go to wandb.me/hiring and check out our job postings. We'd love to talk about working with you.