How Sean Taylor of Lyft Rideshare Labs thinks about business decision problems
Sean joins us to chat about ML models and tools at Lyft Rideshare Labs, Python vs R, time series forecasting with Prophet, and election forecasting.
Sean Taylor is a Data Scientist at (and former Head of) Lyft Rideshare Labs, and specializes in methods for solving causal inference and business decision problems. Previously, he was a Research Scientist on Facebook's Core Data Science team. His interests include experiments, causal inference, statistics, machine learning, and economics.
0:00 Sneak peek, intro
0:50 Pricing algorithms at Lyft
7:46 Loss functions and ETAs at Lyft
12:59 Models and tools at Lyft
20:46 Python vs R
25:30 Forecasting time series data with Prophet
33:06 Election forecasting and prediction markets
40:55 Comparing and evaluating models
43:22 Bottlenecks in going from research to production
Note: Transcriptions are provided by a third-party service, and may contain some inaccuracies. Please submit any corrections to email@example.com. Thank you!
We focus so much effort on training models, getting features, on all our crazy architectures. The space of models that we can consider is increasing rapidly, but we still are bottlenecked on "Is this model better than the one that we already had?"
You're listening to Gradient Dissent, a show about machine learning in the real world, and I'm your host, Lukas Biewald.
Today, I'm talking with Sean Taylor, who's the Head of Rideshare Labs at Lyft¹. Previously, he was a research scientist on Facebook's Core Data Science team, and before that, he got his PhD in Information Systems at NYU's Stern School of Business. He also has a BS in Economics from the University of Pennsylvania, and he tells me that he prefers R to Python, so I'm excited to get into that with him today.
I guess where I wanted to start is the stuff you're working on now on ride sharing at Lyft. I mean, my first question is just, for people who haven't thought deeply about this, how does data science and ML factor into a ride sharing app that probably everyone has used? What are the pieces that matter, and what role does data science and ML play?
1: At time of recording
Yeah, that's a great question. I think it's a pretty abstract concept because you just tell an app where you want to go and a driver shows up, and there's a lot of things that happen under the hood to enable that.
I think of Lyft as a stack of algorithms that all add up to a driver arriving when and where you want. So, that driver showing up there is just a sequence of well-made decisions, and you can trace those decisions back as far as you want, all the way to when we acquired that driver and signed them up to drive for Lyft, and when we acquired the rider and got them to install the app and decide to use it. All those decisions added up to that match that we got in the marketplace.
On the actual matching at the time of the ride request, I would think about it as, well, there's the map. We have to have a high-quality map. On top of the map, we come up with an ETA estimate: how long will it take a driver to get to a rider? That helps us perform a more efficient matching. Then there's a dispatch algorithm which actually performs the matching. There's a wide set of available drivers for some ride requests, so we have to decide which one is the best driver to send. Then also, we have to decide on a price.
Pricing is a core algorithm for Lyft. On top of planned pricing, there's adaptive pricing. We have to respond to marketplace conditions to try to make sure the market stays liquid, so that's an algorithm that we have to run. Then I guess on top of that, we'll give drivers incentives, we give riders coupons, and there's algorithms to decide how we disburse those.
So, it's just a wide variety of little mini algorithms, all the way down to, say, predicting where you're headed, so that when you open up the app, maybe we can be intelligent about what shows up on the screen. It's a lot. I think a good experience is the conjunction of all those good decisions being made, so if any one of them goes wrong, it can be a very bad experience.
I think of the Lyft problem as more of quality control, in a way. The product itself is pretty exchangeable. We have competitors. It's pretty... you have other ways to get where you need to go. So really, it's all about making sure that those decisions are made really reliably. Every one of those decisions is powered by some estimate of some state of the world, right? So, the ETA estimate is probably the most tangible. How long is it going to take a driver to get to a specific spot on the map right now?
But we have to estimate all kinds of other quantities of interest, like "How will riders respond to higher or lower prices? How will they respond to higher or lower wait times?" They're all combinations of machine learning and causal inference problems, in a way, because ultimately, at the end of the day, we're going to change something.
We don't want to just train on some... it's not like a supervised learning problem. We actually want to say, what would happen if we did this differently? What would happen if we sent this other driver instead? And so, the problems are quite a bit more complex than just a standard predictive modeling set up.
I mean, how do you think about that, right? Changing a price is such an interesting thing. It doesn't fit... definitely, I agree, it doesn't fit neatly into a normal ML prediction. Do you have training data that you can run on, or how do you even model that?
Yeah, that's a super interesting question. One way to think about it that I like, as a way to explain it to machine learning people, is that there are features that are not under your control, and then there are features that are under your control, and you want to think about modeling them differently.
It's important that the features under your control are subject to some randomization in order to be able to estimate a causal quantity of interest. If you want... if you really want to know what's going to happen when you raise prices, then you have to raise prices sometimes.
Part of the problem with training models like that is you have to let the causal part of the model speak a little bit more than the features. There are going to be other things that predict conversion rate on a ride much better than price. Price is a powerful predictor, but if you don't randomize it, then there'll be other things correlated with price that could explain the changing conversion rate, like, say, ride distance. So, controlling for a rich set of things and having randomization of the variable is really important, but also there's a whole bunch of modeling architectures that we employ that help let the causal part of the model speak a little bit more.
There's some really exciting work going on in, say... people call these heterogeneous treatment effects models. There's even neural network architectures for doing these kinds of things these days. But at the end of the day, you have to have been running some experiment in the background in order to make those models be able to tell you what's going to happen when you change the state of the world in some way.
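The point about randomizing the features you control can be made concrete with a few lines of numpy. Everything below is made up for illustration (a linear conversion model, hypothetical price and distance ranges; nothing resembling Lyft's actual models): when price is randomized, a naive regression recovers the true price effect, and when price is instead set by a rule that tracks ride distance, the same regression is badly biased.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical data-generating process (illustrative numbers only):
# longer rides convert more, higher prices convert less.
distance = rng.uniform(1, 10, n)
noise = rng.normal(0, 0.5, n)

def conversion(price, distance, noise):
    return 2.0 - 0.3 * price + 0.2 * distance + noise

# Regime 1: price randomized independently of everything else.
price_rand = rng.uniform(5, 15, n)
y_rand = conversion(price_rand, distance, noise)

# Regime 2: price set by a rule that tracks ride distance (confounded).
price_conf = 5 + 1.0 * distance + rng.normal(0, 0.2, n)
y_conf = conversion(price_conf, distance, noise)

def slope(x, y):
    # OLS slope of y on x (with an intercept).
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

b_rand = slope(price_rand, y_rand)  # recovers roughly the true -0.3
b_conf = slope(price_conf, y_conf)  # biased: distance leaks in through price
```

Under randomization the estimate lands near the true coefficient; under the confounded rule, distance's positive effect leaks into the price coefficient and pulls it toward zero, which is exactly the "other things correlated with price" failure mode described above.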
I mean, I would think price, specifically, is obviously a sensitive topic for users, but also probably even more so for the driver. Do you think about other considerations there? Do you put constraints on yourselves around setting prices, outside of just modeling the most efficient market or something like that?
I think that one of the core problems for Lyft, and it's very pervasive, is "What's your objective function for the business?" You have to... at some point, you have all these algorithms that are all working together. What common goal are they working toward? At the end of the day, there's some kind of welfare being created by the system, and it's going to be allocated... some of the welfare is being allocated to the rider, some to the driver, and some to Lyft, which we'll take as profit. So we have to figure out how we're going to split those things, and there's trade-offs in splitting them different ways.
If we just greedily took all the objective for ourself, we'd charge really high prices, pay the drivers almost nothing, and no one would use our platform. There's these short-term, long-term trade-offs. So, finding the right balance there is really important. One of the ways that we do that is we have a lot of guardrails in the system. We'll say, we would really prefer if certain things never exceeded some tolerances, and that's a way of us heuristically applying some guidelines that help the algorithm stay in a safe place.
For driver earnings, for instance, we really like to increase driver earnings as much as we can. One way to do that is to just have people pay more. A better way to do it for everybody is to improve the efficiency of the system. So, if we can get drivers to have a passenger in the car more often, then they'd just make more money and the total surplus is greater for everybody. So, that should really be our goal.
When we think about pricing, it's the zero-sum game version of the thing. We would like to make the sum of the game larger for everybody, so we split a bigger pie. A lot of the algorithmic improvements that we think about are more on the efficiency side than they are on "Can we take more money from this person and give it to that person?" Because you run out of options there very quickly, and you end up... somebody's unhappy.
Right. That makes sense.
I guess, probably, a loss function that everyone can relate to is the ETA estimation, right? We've all been in a rush and had a car come late. You had a really nice post about this, and thinking about what the right loss function is, but I wonder if you could say how you think about what it means to have an accurate ETA function?
Yeah. I think that that's a fascinating statistical topic. I mean, that post was about how there's a wide space of loss functions that all have some desirable properties of producing an unbiased estimate of ETA. You might even think about applying a biased estimator. Maybe I don't care about getting it accurate; I care about giving the user an upper bound or something like that, so you could think about some quantile loss. But ultimately, ETA predictions are inputs into some downstream algorithms.
We've decomposed the optimization problem into pieces. The ETA estimates are a thing where we have to have a contract with the dispatch system, which is that our ETA estimates have some statistical properties. So, unbiasedness is a really key piece there, because we're going to run an optimization on top of those predicted values, and if we say, "Hey, we're going to add a little bit of buffer on top so that the rider doesn't have a bad experience thinking that we underestimated", that would be bad for the downstream optimization.
So, the algorithmic consumption of the estimates and the human consumption of the estimates are a little bit at odds on what would be desirable. So, I think we tend to prefer to get the statistical unbiasedness right, and then figure out how to make the user experience better in a separate layer as much as possible. Historically, we've played with displaying ranges of ETAs. A better answer to this question is not "estimate the thing differently", but just be honest about the distribution of errors that you're likely to make in practice.
Well, tell me this. What loss function do you use? I mean, unbiased could mean different things depending on the context, right?
Personally, I haven't worked on our ETA estimation problem. We have a really strong team of researchers there doing some really interesting stuff, but yeah, I haven't worked on it, so I don't know what we landed on. I know that we're at the point now where it's pretty hard to eke out gains in that algorithm. I think it's a thing where most of the effort is on just accuracy.
One of the super interesting things about ETA is that not all accuracy is equal, so being correct about ETA in certain situations is more pivotal for your downstream optimization than in others. You might think of that as label weights in some way. So, there are cases where getting the ETA right could really make the difference between getting the routing decision right or wrong, and cases where you're basically going to do the same thing either way.
Could you give me an example of that? It's hard for me to picture what... I mean, of course, that's the situation for any algorithm, but what's the case where ETA is super crucial?
Yeah. So, say that there are two drivers that we could potentially route to a rider. In cases where the estimates end up being ordered the same, then the estimates aren't pivotal, right? There's a wide class of estimates that would rank them the same, and so always dispatch the same driver. But in markets where we have a lot of options and there's lots of drivers available, you start to make mistakes. It's like a ranking problem, and you can invert the ranking because the estimate was off in some cases. So, in thicker markets, we have opportunities to do better. We have opportunities also to do worse, because we're getting wrong the ordering of which driver is most efficient to send.
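A toy version of the "pivotal" idea, with made-up numbers: if dispatch just sends the driver with the lowest predicted ETA, errors that preserve the ranking never change the decision, and only an error that flips the ranking causes a mis-dispatch.

```python
# Toy dispatch rule: send whichever driver has the lowest predicted ETA.
def dispatch(predicted_etas):
    return min(range(len(predicted_etas)), key=lambda i: predicted_etas[i])

true_etas = [4.0, 6.0]      # driver 0 is truly closer (minutes)

# A sizable error that preserves the ordering: same (correct) decision.
assert dispatch([4.0 + 1.5, 6.0]) == dispatch(true_etas) == 0

# A smaller error that flips the ordering: the wrong driver gets sent.
biased = [4.0 + 2.5, 6.0]   # driver 0 now looks slower than driver 1
assert dispatch(biased) == 1
```

This is why accuracy near the decision boundary (thick markets, many comparable drivers) matters more than accuracy in cases where any reasonable estimate would dispatch the same driver.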
I see. Interesting.
There's also a weird bias problem in the data that we have for ETA. We only observe the drivers that drive certain routes. So, they only drive to places that they've been routed to. So, the estimated ETA for segments of the road that we don't observe drivers on is missing data, and that missingness is not at random. They might not be driving a certain place because we're not routing them somewhere, because we think the ETA estimate is really long, but it could actually be short. So, there's a sense in which you'd prefer to have collected your data under a little bit of extra randomization or noise to get a better estimator. It's an interesting training set bias problem that I think is a little underrated. We haven't quite figured out what to do about that.
That does seem super tricky. I guess it's probably hard to run random experiments to collect more data. I think that might make people frustrated.
Yes, that's right. It's very analogous. I mean, I used to work at Facebook... One of the things that you'd worry about is you're ranking a story really low in News Feed, and no one ever sees it, so they don't engage with it. So, your training algorithm doesn't know that there are some features in there that could say, "Hey, this is really good. We should be displaying this at the top."
So, you can end up in these feedback loops where some friends of yours might... you might not ever see their posts again, because they just aren't getting any eyeballs on their posts anymore. I don't know if that actually played out at Facebook, but it's a super similar problem: you have to acknowledge that your training data isn't some random sample of what you're looking for.
Right, right, right. I guess when I look at the ride share challenges that you mentioned, they seem like situations where you have pretty structured data coming in, and maybe lots and lots of data, and you have to deploy into a high volume production. It seems like a case where neural nets might struggle a little bit. Have you found that mostly neural nets work better than maybe older, I guess older is the wrong word, maybe less complicated algorithms?
I would say... so, we do have a bias for simpler solutions. I think that's for good reasons of needing to keep things reliable, and historically, people at Lyft have gotten a lot of successful results with tree-based models. So, things like LightGBM and XGBoost are pretty popular techniques for supervised learning problems. I think that's for good reasons.
I think trees do well with geospatial data. Latitude and longitude and time are things that trees can find good segmentations of. So, the features are naturally encoded very well. The representation is learned by trees very effectively, and so neural networks might provide a boost over that in the long run if you have a lot of data, but you have this thing that learns really quickly and doesn't overfit too much. So, it's an easy drop-in thing to use.
I think that we're moving toward using neural networks, gradually, in some cases. We are trying to sort out some of these deployment challenges and making sure that they run reliably. I think all the model quality control stuff is something you have to relearn a little bit as you move to a new modeling paradigm.
I guess you mentioned online, at one point, that your team uses entirely PyTorch. Is that right, and could you talk about the trade-offs there?
So, part of it is historical. I worked at Facebook and I did a hack-a-month at FAIR. That was right when they were deploying PyTorch for the first time. I learned about it before TensorFlow, so it wasn't like I thought PyTorch was better than TensorFlow. Fast forward to last year, my team was working on... we're building a forecasting tool that has a plan built into the forecast, so we can change some policy variables and have the forecast reflect the change. So, we might say, "Hey, we increased our coupon volume and that's going to increase demand." We'd like the forecast to reflect that: a forecast with some causal effects baked in.
If you can produce a forecast like that, one of the natural things that you'd like to do with it is actually run an optimization on top of it. So, you'd say, "I will produce this forecast", and then actually optimize the plan to make the forecast look as good as I would like it to look. If you're doing that, a really desirable property is that the model you fit is a differentiable object, so that the same methods you use for optimizing the fit of the model, you can use for optimizing the policy variables that you're plugging into the model.
So, we really wanted to be able to produce a Python function that we had fit from data, but that was differentiable. So, having the model be done in something that supported autograd was really important. I'm a big Stan fan and I like Bayesian modeling, but a lot of the Bayesian modeling tools don't naturally just produce this object that is differentiable. So, we're like, okay, well, we should work in some space where we have these autograd tools available.
It's been a bit of a trade-off. I think we're doing things that look a lot like Bayesian models, but on top of PyTorch. We're having to invent a lot of ways to do that ourselves that would have been a lot easier if we did something in PyMC or Stan. It's been a little bit of a challenge, but the upside has been a lot of modeling flexibility, and also the ability to borrow from what all the neural network people are doing for improving the speed and reliability of fitting. So, there's a little bit of... it's fun to do things that look like neural networks, but are not. We're not using them to fit. There aren't any layers or pooling, or anything interesting going on. They're very similar.
They're just the kind of models that you would fit in R, but we really needed this engineering requirement: that we would produce a model with this nice property of being able to run optimizations and get gradients. Getting the gradients is a really beautiful thing at a place like Lyft, because we care about marginal effects of everything. So, if you want to know what the lifetime value of getting an additional rider is, which is a very common thing in business... What's your marginal benefit of getting one more person on the platform?
With a differentiable model, it's very easy to do queries like that. We can just say, "What's the gradient of the total lifetime value of Lyft?", which is something we can estimate with the model. We can do the forecast, add up all the future revenue, discount it, and then actually just look at the gradient with that variable, with respect to every rider activation, and say what that is. So, PyTorch was a really natural fit for doing those kinds of queries. So, yeah, it's a little bit of, we got really low level to solve a problem and I think sometimes we regret being that low level.
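The kind of gradient query Sean describes can be sketched without PyTorch at all. Below, a toy forward-mode "dual number" stands in for an autograd system, and a made-up demand-and-margin function stands in for a plan-aware forecast (all the numbers and the `forecast_revenue` function are invented for illustration). The point is just that once the forecast is a differentiable function of a policy variable, a marginal effect is a single gradient evaluation.

```python
class Dual:
    """Minimal forward-mode autodiff value: tracks f and df/dx together."""
    def __init__(self, val, grad=0.0):
        self.val, self.grad = val, grad

    def _lift(self, other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        o = self._lift(other)
        return Dual(self.val + o.val, self.grad + o.grad)
    __radd__ = __add__

    def __mul__(self, other):
        o = self._lift(other)
        # product rule: (fg)' = f'g + fg'
        return Dual(self.val * o.val,
                    self.grad * o.val + self.val * o.grad)
    __rmul__ = __mul__

def forecast_revenue(coupon):
    # Made-up "plan-aware" forecast: coupons lift demand but eat margin.
    demand = 100.0 + 40.0 * coupon        # riders per week
    margin = 5.0 + (-1.0) * coupon        # dollars per ride
    return demand * margin                # weekly revenue

# Query: marginal effect of the coupon on forecast revenue at coupon = 1.
c = Dual(1.0, grad=1.0)                   # seed d(coupon)/d(coupon) = 1
out = forecast_revenue(c)
# Analytically: d/dc[(100+40c)(5-c)] = 100 - 80c, which is 20 at c = 1.
# out.val == 560.0, out.grad == 20.0
```

PyTorch's reverse-mode autograd does the same bookkeeping at scale; the "query" is just evaluating the forecast and reading off the gradient with respect to the policy variable.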
That's so interesting. So, it wasn't PyTorch versus TensorFlow. It's PyTorch versus a Bayesian framework. It also sounds like you're using PyTorch essentially for data science, because you want the auto grad... or you want the gradients to be able to pull them out. I guess, where have been the pain points? Where has that felt frustrating compared to what you've done in the past?
I think part of it is that we bet on... the optimizers that are used for neural networks are not particularly great for some of our models. A lot of the models that we fit are pretty small and fit into memory. We should be using some second-order methods. We've struggled a little bit with confirming that we're at a global optimum for the model. These are models where we should be able to confirm that. So, if we had done it in a more traditional modeling package, then we might've ended up with a more stable optimization procedure.
I think the modeling flexibility that you get from PyTorch is partly... a cost that you pay is that everything's pretty low-level unless you have these higher-level abstractions. So, we had to build a lot of those abstractions ourselves. So, things like building spline basis expansions and building ways to...
We actually have 40 or 50 models that compose together, and we had to build a way to compose a bunch of models so that they become one big graph of models. We had to build a lot of that stuff ourselves. We have a couple of people on that team that just got really interested in that part of the problem. I hope that one day we can open source the modeling architecture.
The other super interesting pain point, which caused us to develop something that I think was pretty interesting, was that everything in our system is a tensor. Tensors are a really natural representation of marketplace data because it has a regular structure to it. So, you can say geography and time are two dimensions of the tensor, and you might add other dimensions, and that neatly encapsulates a lot of the kind of data that we capture. We ended up creating a labeled tensor implementation that we find really useful to...
It's like a tidy data frame in R, but it's a tensor, and so we can use them as just variables in the system, and compose them and multiply them, and do operations on top of them. I later found out that there are a bunch of these labeled tensor packages out there that do similar things. I think that was something that we didn't realize we needed to build, but keeping track of all the dimensions of all the tensors that we were passing around became a first-class problem very quickly.
It all sounds like you want to use data frames, right?
Yeah. They're data frames, except that they're dense, right? So, you can guarantee that you always have... for any-
Oh, I see.
... pair of coordinates, you always have a value. So, it's like a special class of data frames where you know some properties are true about them.
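A minimal sketch of what a labeled tensor buys you (this toy class is invented for illustration, not Lyft's implementation; packages like xarray do this for real): dimension names travel with the data, the tensor is dense over its coordinates, and misaligned operations fail loudly instead of silently.

```python
from dataclasses import dataclass

@dataclass
class LabeledTensor:
    """Toy dense tensor with named dimensions."""
    dims: tuple   # e.g. ("region", "hour")
    coords: dict  # dim name -> list of coordinate labels
    data: dict    # coordinate tuple -> value (dense: every combination present)

    @classmethod
    def full(cls, coords, fill=0.0):
        # Build the full cartesian product of coordinates, all set to `fill`,
        # which is exactly the "dense data frame" guarantee.
        dims = tuple(coords)
        keys = [()]
        for d in dims:
            keys = [k + (c,) for k in keys for c in coords[d]]
        return cls(dims, coords, {k: fill for k in keys})

    def __mul__(self, other):
        # Refuse silently-misaligned math: dimension names must match.
        if self.dims != other.dims:
            raise ValueError(f"dim mismatch: {self.dims} vs {other.dims}")
        return LabeledTensor(self.dims, self.coords,
                             {k: v * other.data[k] for k, v in self.data.items()})

demand = LabeledTensor.full({"region": ["sf", "nyc"], "hour": [8, 9]}, fill=2.0)
price = LabeledTensor.full({"region": ["sf", "nyc"], "hour": [8, 9]}, fill=3.0)
revenue = demand * price   # elementwise, aligned by named dimensions
```

The dense guarantee is the point of the exchange above: unlike a general data frame, you know every (region, hour) pair has a value, so composing and multiplying these objects is always well-defined.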
I guess this is a more open-ended question that I hadn't planned to ask, but I mean, since you've done a lot of Python and R, I'm curious how you compare the two, if you have one that feels more natural, that you like to live in, or ...?
Yeah, I think this'll probably be pretty controversial, but I do everything in R, until I can't anymore, because I...
That is controversial. Interesting.
I think that the Tidyverse people have figured out a lot of the interactive data analysis stuff. It's just much more first-class in R. One of the things that's an interesting consequence of R's syntax is that the lack of whitespace sensitivity, and the ability to just use unquoted variable names, means that you have a lot less typing to do similar things.
I'll poke fun at Wes because I've had this conversation with him. I think the pandas API could use a little love, and if we could reinvent pandas from scratch and do Python data frames again, we'd probably do it a little differently and something with a little bit less surface area for developers to...
Hadley is a designer. Hadley Wickham is the creator of dplyr and a lot of the tidyverse packages. I think he thinks really deeply about these micro-interactions that people have with the code. What are you actually do... what are you trying to accomplish? What's the minimum way to get there? Then also, is it going to stick in your brain? Are you going to remember how to do it next time?
So, I've just found that that fit my brain a little better, but all the production code that we write at Lyft is in Python, so I find myself porting some of my analysis in R over to Python quite commonly.
Can you give me an example of where data frames frustrate you? Or where pandas data frames are frustrating?
Sure. So, one thing that is a little annoying is having to... some of the operations will emit a data frame, some of the operations will emit a series, depending on what kind of aggregation you're doing. So, this is a functional programming no-no, right?
dplyr is designed in the opposite way, where there's a very standard interface. Most of the functions take a data frame as the first input and always return a data frame, and that allows you to do this chaining thing. If you look up method chaining in pandas, you'll find a couple of good articles on how to do it.
It's a real stretch to do chaining in pandas. You can apply a series of operations and read through them, but it just doesn't look as readable, and it requires a lot of clunkiness. The .pipe operator in pandas is something that I use a lot when I'm using pandas, because I think it does what I like about dplyr. It just requires you to write a lot of your own code to fill in some of the missing pieces. I think reshaping data frames from long to wide is just dramatically easier in R because that interface is a little bit simpler than the stack and unstack operations.
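The .pipe style Sean mentions looks roughly like this in pandas (the ride data and the `add_tip` helper are invented for illustration): each step takes a DataFrame and returns a DataFrame, dplyr-style, ending with a long-to-wide pivot.

```python
import pandas as pd

rides = pd.DataFrame({
    "city": ["sf", "sf", "nyc", "nyc"],
    "mode": ["standard", "shared", "standard", "shared"],
    "fare": [10.0, 6.0, 12.0, 7.0],
})

def add_tip(df, pct):
    # A user-defined step slotted into the chain via .pipe.
    return df.assign(total=df["fare"] * (1 + pct))

# Method chaining: each step consumes and emits a DataFrame,
# much like a dplyr pipeline built with %>%.
summary = (
    rides
    .pipe(add_tip, pct=0.2)
    .groupby(["city", "mode"], as_index=False)["total"].sum()
    .pivot(index="city", columns="mode", values="total")  # long -> wide
)
```

`.pipe` is what lets custom functions like `add_tip` participate in the chain without breaking the left-to-right reading order.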
In Ruby, they call it the principle of least surprise. The API should always return something that is unsurprising to you. Sometimes, I think some of the stuff in Python is the most surprising. You're like, "How did I get here with this object? I have no idea."
This is a long rant and a long complaint, but I think we can get there. There's plenty of great Python developers that are working on this, but I think that we made some design decisions early on that made it a little bit challenging to create these expressive interfaces.
Yeah. It's so funny. So, my experience was, I wrote code in mostly R for years, and I always found R a little baffling. When I switched to Python, I was so happy, and it made so much more sense to me. It's really interesting that you feel exactly the opposite. I wonder what's different about our brains or what we were trying to do.
I think maybe functional languages are more natural for you, and I feel like all my smartest friends, that's the case. Maybe that's just going on. I mean, the thing for R that I always missed was I just felt like the plotting was so much more natural than Python. I feel like I still have to look up Python's plotting stuff. It's interesting that you don't even mention that as an issue.
Yeah. I hate Matplotlib a lot. I would complain about that to anybody. I think Altair really solved that problem for me. I think-
Jake VanderPlas wrote a really nice package. It's very ggplot-like in concept. In syntax, it's a little different, but I think it's a close map, so it's pretty easy.
But I had the opposite experience. I was a Python developer from 2004 through grad school. I spent a long time in Python. I started learning R in grad school. It was my later language, but I like it more, so yeah. Maybe it is just... some people have a certain kind of brain that fits one thing or the other.
Well, cool. I also wanted to ask you about the Prophet project that you worked on at Facebook. Could you say a little about what that did and why you made it?
Sure. Prophet is a time series forecasting package. It was built because we had some applications internally at Facebook that we didn't have good tools for. At the time, I was on the core data science team looking for interesting high-impact problems to work on.
We had a couple people come to us just with forecasting problems. I looked around... I was like, "Forecasting can't be that hard", and I started to Google around and look for what tools were available, and I really felt like the tooling landscape was a little primitive. In particular, there's one interesting aspect of business time series that's just difficult to model traditionally, which is this multi-period seasonality.
So, a yearly cycle in data is super common, a weekly cycle is super common. You just end up needing to think carefully about modeling these kinds of... they're just features that can be extracted from time, but they're not easy to do in an autoregression or exponential smoothing framework. So, I worked with Ben Letham, who I have to give a great call-out to, because I think he invented all the important stuff in Prophet. That project was going really poorly until Ben got involved and helped me solve a couple of really key problems there. Then what we figured out was that we just had this class of time series problems that are really common in practice.
It's actually a really constrained modeling space. It's almost like an architecture for time series models. We just said, "Hey, there's a small set of models that capture a lot of data that we see in practice", and that prior over the models is a really useful thing to know, because it means... Time series data are always data constrained.
You might have a year... you might have 300 observations, 400 observations. It's not something where you can learn a lot from the data alone. You have to bring a lot of priors to a time series problem. By coming up with reasonable priors for what that should be... and if you look at the Prophet code, it's got hard-coded parameters that are our priors over what we think is likely to happen.
It's not an elegant model in the sense that it's not super general. It's actually very specific, but it happens to work well in practice. Sometimes I just call it a bag of heuristics that we cobbled together, and I think real time series modelers probably get a little frustrated with us for having empirical success with something that's not as principled as the work they've been doing, but people get a lot of value out of it. Part of it is just that they don't really want to learn about time series modeling that much. They'd prefer to just get it done and move on to another problem. So, Prophet provides a very easy way to get there.
I have a feeling of a lot of people listening to this might find this useful. Could you say what's the case where Prophet's going to do well and where it might not do well?
Yeah. So, Prophet is built on a lot of local smoothness assumptions. So, if your time series jumps around a lot or is very random, or it has a non-human periodicity to it, then it's unlikely to work. It's really designed for these human-generated time series... human behavior-generated time series. So, web data where you're counting how many visits come to a website is bread and butter for Prophet, because it's highly seasonal. It has all these very predictable patterns to it, but those patterns need to be encoded in a way that allows the model to extrapolate them.
When I see time series that come from more physical processes... really high-frequency stuff, stuff that jumps around, stuff with a lot of really abrupt changes in it, which violate this local smoothness idea... then you can see right away. My prior can be expressed just by looking at a time series. When someone shows me a time series and asks, "Would Prophet work on this?", I know right away if it will or not. A lot of it's just knowing what human-generated data looks like from having seen it a bunch of times.
So, you're essentially encoding human, earthly things, like week and month and year. So, it's designed more for demand forecasting versus the position of Jupiter's moons. Is that fair?
Yeah. I think that's right. I think when we first released Prophet, Andrew Gelman mentioned it on his blog (it was very flattering to get mentioned by him), and he was like, "I'll show you a time series that Prophet won't do well for," and it was some physical process. I forget what it was. I think it was lemur population or something like that. It was one of these physical processes, like population ecology, where it has a chaotic period to it, because it has a feedback loop built into it.
So, the period is not regular, and it's like, well, if the periods are not regular, then there's no way a model that's trying to learn a regular period structure is ever going to fit that. So, I think we ended up having to admit that, "Yeah, sorry, Andrew, you can't forecast lemur population using Prophet", but I think that we're fine with that. It's an 80/20 thing. We'd like to capture the kinds of problems that we see in practice.
So, can you say a little bit about what you're doing under the hood with Prophet?
Yeah. There's probably two or three tricks that I think add up to the whole thing. Probably the most important trick is just that we have these trend changepoints. The actual Prophet forecasting model can be really simple. If you strip out the seasonality, it's just a piecewise linear regression. Making a linear regression extrapolate well is challenging because you don't always know how much of the historical time series to use to fit the slope at the last point. So if you're trying to go into the future, you need to know where the slope at that last point is coming from.
What we do is we introduce this idea that the slope can change at various points in the past, and that we prefer those changes to be sparse. So, we're just using an L1 penalty in order to do that. That's a really standard trick in machine learning, and what that does is it comes up with a pretty, I would say, parsimonious representation of the trend of the time series, which is a sequence of lines that fit together, and the last line segment is the slope into the future. So, that actually works quite well. It's very similar to exponential smoothing procedures, which are getting the local slope that you're trying to use to extrapolate from the more recent data, rather than from the far past. It's just a sparse version of that, so that's one big trick.
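The changepoint trick Sean describes can be sketched in a few lines. This is not Prophet's actual code, just a minimal numpy illustration of the parameterization: the trend is an intercept plus a base slope plus one hinge feature per candidate changepoint, and the coefficient on each hinge is the slope *change* at that point. Prophet additionally puts a Laplace (L1-like) prior on those coefficients to keep the changes sparse; plain least squares is used here for brevity.

```python
import numpy as np

def trend_design(t, changepoints):
    """Design matrix for a piecewise-linear trend: intercept, base slope,
    and a hinge max(0, t - s) per candidate changepoint s."""
    cols = [np.ones_like(t), t]
    cols += [np.maximum(0.0, t - s) for s in changepoints]
    return np.column_stack(cols)

# Synthetic series: slope 1.0 until t=50, then slope -0.5 afterward.
t = np.arange(100, dtype=float)
y = np.where(t < 50, t, 50 - 0.5 * (t - 50))

changepoints = [25.0, 50.0, 75.0]
X = trend_design(t, changepoints)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # L1 prior omitted for brevity

# Extrapolation: the last fitted segment's slope carries into the future.
future = np.arange(100, 110, dtype=float)
forecast = trend_design(future, changepoints) @ beta
```

The fit should recover a slope change of roughly -1.5 at t=50 and near-zero changes at the other candidate points, so the forecast continues the final -0.5 slope.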
But then how does that model periodic effects into the future? Or is that not part of its thing that it's trying to do?
Oh yeah. So, the seasonality is just applied additively. At its core, Prophet is just a generalized additive model. So, very similar to... a lot of GAM packages will fit all kinds of stuff that looks like Prophet. It's just that they're not really designed to extrapolate well. They interpolate well, because that's what the loss function for GAMs is capturing. For Prophet, we just had to make these modifications in order to get the extrapolation performance.
And really, if you think about it, it's all about controlling the complexity of the model that you're fitting close to the boundary of the data, because it's extrapolation, so you really don't want it to overfit at the last part where you're trying to go past it. In typical machine learning, we do way more interpolation than extrapolation, so we commonly don't think about controlling complexity at any particular point. We just want the best model, but in forecasting, you really prefer simple models when you're going off of the data that you've seen already.
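The additive structure can be sketched the same way: seasonality enters as a truncated Fourier series added to the trend, which extrapolates cleanly because it is strictly periodic. Again, this is a minimal numpy illustration, not Prophet's actual implementation:

```python
import numpy as np

def fourier_features(t, period, order):
    """Truncated Fourier basis for a seasonal component with a fixed period."""
    return np.column_stack(
        [f(2 * np.pi * (k + 1) * t / period)
         for k in range(order) for f in (np.sin, np.cos)]
    )

t = np.arange(140, dtype=float)
# Synthetic "human" series: a linear trend plus a weekly cycle.
y = 0.3 * t + 2.0 * np.sin(2 * np.pi * t / 7)

# Fit trend and seasonality jointly (additively) by least squares.
X = np.column_stack([np.ones_like(t), t, fourier_features(t, 7.0, 3)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Forecast: evaluate the same design at future times; the seasonal
# component simply repeats, and the trend extrapolates linearly.
future = np.arange(140, 154, dtype=float)
X_future = np.column_stack([np.ones_like(future), future,
                            fourier_features(future, 7.0, 3)])
forecast = X_future @ beta
```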
Totally. I guess that's a good segue into one more topic I wanted to ask you about, which is election forecasting. You've talked about, or thought about, election forecasting using prediction markets, which is something that I think probably me and a lot of people listening to this have thought about.
I guess I'm just curious. I mean, the question that's top of mind right now, and this is probably going to be out of date as soon as we release this, is we have FiveThirtyEight and all the election models showing a really high percent chance for Biden compared to the prediction markets and the betting markets. Do you have any thoughts on how those two things have diverged and why?
Yeah. That's a really interesting question. I think the prediction market people... Dave Rothschild at MSR was a really big believer in prediction markets last cycle, and has since switched over to polling. I'd love to... I think he'd be a better person to tell you why prediction markets are failing to do this, but one part of it that I find interesting is that prediction markets... I think one reasonable use case for them is to do emotional hedging.
You could say, "Oh man, it would be the worst thing in the world if Trump won, so I'm going to go bet every cent that I have on him winning in a prediction market, so that if he wins, I'm at least going to win a lot of money." Not every prediction market participant is trying to maximize earnings. They can be hedging, and it's a tool for hedging. So, you might think, okay, part of the difference in price could... it could be suppressed because of the...
But shouldn't some kind of... I mean, I'm out of my depth here a little bit, but isn't there some kind of efficient market hypothesis that someone would exploit the emotional hedging to make themselves a lot of money?
Yeah, that's true. If the constitution of the market were... if you had an infinite population of traders, then yeah, I think you'd get there, but without perfect... without a lot of liquidity... all the market stuff depends on having a lot of people, and if a certain fraction of them were profit-motivated, then I think you're good.
Part of it is also transaction costs. PredictIt, for instance, has a 20% fee for removing... for taking your money out, so it makes the incentives not quite the same as trading in a financial market. Yeah, I don't know.
I think it's an interesting empirical puzzle, because also, if you go to PredictIt and you look at the state-level predictions, I think they align quite well with FiveThirtyEight, but the aggregate one doesn't. To me, the hedging explanation is my favorite way to explain it, but I don't have a better explanation than that.
It sounds like, to me, then, you're siding with the poll aggregation versus prediction markets.
Well, I am a big believer in polls. I think it's a really well understood technology that we've been deploying for a long time and there's a lot of great science behind it. I think you see Elliot Morris at The Economist working with Andrew Gelman and doing best-of-breed Bayesian modeling of the polls.
At the end of the day, I think of this as there's some latent variable, which is intention to vote for one candidate or the other, that we're just getting noisy observations from. When you have a latent variable that you don't observe, you want to pool as much information that you have about that as you can, and you want to try to de-bias it as much as you can. We've gotten quite good at that.
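One minimal version of the pooling-and-de-biasing idea Sean describes is a precision-weighted average over polls, with a per-pollster offset for house effects. The poll numbers, sample sizes, and house-effect offsets below are entirely made up for illustration:

```python
import numpy as np

# Noisy observations of a latent vote share: each poll reports a share,
# a sample size, and (hypothetically) a known house-effect bias.
polls = np.array([0.52, 0.49, 0.51, 0.53])   # reported candidate shares
n     = np.array([800, 1200, 600, 1000])     # sample sizes
house = np.array([0.01, -0.005, 0.0, 0.015]) # assumed pollster biases

# De-bias, then pool by inverse sampling variance (binomial approximation).
debiased = polls - house
var = debiased * (1 - debiased) / n
w = 1.0 / var
estimate = np.sum(w * debiased) / np.sum(w)  # pooled latent-share estimate
se = np.sqrt(1.0 / np.sum(w))                # standard error of the pool
```

Real poll aggregators (FiveThirtyEight, The Economist's Bayesian model) estimate house effects, time trends, and correlations jointly rather than assuming them, but the weighted-pooling core is the same.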
I think that the real epistemological problem here is whether polls mean what we hope them to mean. I think it might just be that people answer polls differently now, or think about them differently. This was the Shy Trump Voter hypothesis from 2016: maybe people legitimately aren't telling you how they're really going to vote. In a world where that breaks down, I think polls become a lot less credible as a source of information.
So, I think we always have to take on faith that people are answering these things in accordance with their beliefs, at least most of the time. I hope that that will sustain itself because I can't even really imagine a world four years from now, or eight years from now where we actually don't have any credible estimates of these things. We've gotten used to feeling some level of certainty about where the election stands.
Well, I guess, what role then... if you believe the polls, I guess what role would prediction markets play, or could they play, in an election forecasting?
Certainly, the polls are informing the participants of the prediction markets, right? I can't imagine that they're coming up with their beliefs... People in the prediction markets have some subjective belief about what's going to happen, and that's informed by some information about the world. Whether that's just them talking to their friends or reading the news or whatever, or actually just analyzing data, I think at the limit, if you really want to do well in a prediction market, you would want to bring as much information as you could to bear on the problem.
But also, I guess, this comes up a lot where it's like maybe the people who analyze the data the most are not as willing to participate in the prediction markets. People are always calling on Nate Silver to make large bets based on what he's estimated, and he seems a little bit reticent about that. So, yeah, there is this interesting question of maybe the polls aren't driving the prediction markets as much as you think.
To be honest with you, I don't really know what's motivating a lot of the people participating in the prediction markets, and whether they're really acting in a profit-motivated way, or they're just gaming. You can think of fantasy football players as doing a similar thing. They're moving some things around on the internet and hoping that they win a little bit of money as a result of it, but they might not be thinking too deeply about it.
I'd love to see some research on just actually talking to those people about what their process is and what they're doing. If you go to a website like Metaculus, which I'm a big fan of... it's not a prediction market, but a prediction aggregator... you see a really nice community of people that actually talk about how they end up with the forecast that they came up with.
I think that you get a lot of insight from that, like what are they actually doing in practice to figure out the future state of the world? It does look a little bit like this foxes-versus-hedgehogs thing. They just cobble together little bits of information and make more directional changes.
Yeah. I mean, I guess you can imagine... I mean, Nate Silver is so spectacularly good at articulating what he's doing, but you can imagine somebody who's really good at forecasting, but maybe not as compelling of a writer or as clear of a thinker, doing really well in a prediction market, but not having a famous website. So, it does seem like that could provide them room to shine.
I was a big believer in it and I think I'm just starting to have doubts now. I mean, I built a prediction market a few years ago because I thought that there were a lot of Nate Silver types out there doing this kind of stuff. I guess I just didn't end up...it's really hard to get people to participate in prediction markets.
I think this is an underrated aspect of it. I built one. I tried to get people to use it. It's cognitively costly to create predictions, and especially ones where you're going to have some skin in the game, you're going to incur even more cost. So, it's not free to get participation in a prediction market.
They're doing computation in the background that's expensive to produce their predictions. I think this is an underrated part of the problem, is that in financial markets, we just assume that the incentives to participate far outweigh the cost to the participants, but in prediction markets, I think that the problems that they're solving are cognitively expensive, and the payoffs are a little bit smaller. So, we might be in a world where we get under-participation, so you don't end up with these great stories about markets being amazing aggregators of all available information.
Well, we always end with two questions and I want to give you some space to answer these questions. So, our second last question is, what's an underrated aspect of machine learning or data science that you think people should pay more attention to?
Yeah. That one... I always have strong opinions about... and to me, it's very obviously model comparison and evaluation.
We focus so much effort on training models, getting features, on all our crazy architectures. The space of models that we can consider is increasing rapidly, but we still are bottlenecked on "Is this model better than the one that we already had?"
I think that that's a nuanced problem. It's usually a lot of criteria that go into that, and coming up with good model evaluation procedures is hard. It's not just AUC. It's not precision recall curves. That's a part of the problem, but there's just so much more to model comparison, like cost of the model, upkeep, decay, stability, interpretability. I mean, there's just this wide array of things that we'd like about models that we're not really encoding.
I just feel like it's always the thing that people, when I'm talking to them, have thought the least about, but it's the part that I'm most interested in. So, that's my very clear answer to that one.
Interesting. Is there any work that you could point people to if they want to learn more about that?
I mean, I think that the posterior predictive checks stuff in the Bayesian community is headed in the right direction. It's sort of a general approach to inspecting the predictions that a model makes.
You actually see Elliot Morris and Andrew Gelman doing this with their election probability model. They're looking at predictions and trying to see, "Does this make sense to me, and where can we make improvements?" So, I think that's a really fruitful place to look. I guess the other literature that I point people to is off-policy evaluation.
Usually, if you have a model, you're going to go and make decisions with it, at some point. Those decisions will add up to some value in some way. The most faithful representation of how good the model is, is if you actually plugged it into your production system and ran an online test, how well would it do? So, off policy evaluation is just an offline way to try to estimate what would happen online if you ran your model in production. It's a hard approximation to make, but if you can do it, then you can be much more sure that your model is the right one for the task that you're going to deploy it for.
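A standard off-policy estimator is inverse propensity scoring: reweight each logged reward by the ratio of the candidate policy's probability of the logged action to the logging policy's probability. The sketch below uses purely synthetic data (not anything from Lyft's systems) to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)

# Logged data from a production policy mu: actions it chose, the rewards
# observed, and the probability mu assigned to each chosen action.
n, n_actions = 10_000, 3
actions = rng.integers(0, n_actions, size=n)
mu_probs = np.full(n, 1.0 / n_actions)  # mu chose uniformly at random
# Synthetic rewards: action 2 is best on average (mean 1.0 vs 0.2).
rewards = rng.normal(loc=np.where(actions == 2, 1.0, 0.2), scale=0.1)

# Candidate policy pi to evaluate offline: deterministically play action 2.
pi_probs = (actions == 2).astype(float)  # pi's prob of the *logged* action

# Inverse propensity scoring: reweight logged rewards by pi/mu. This
# estimates the online value of pi without ever deploying it.
ips_estimate = np.mean(pi_probs / mu_probs * rewards)
```

Here the ground-truth value of "always action 2" is about 1.0, and the IPS estimate should land close to it. In practice the importance weights can have high variance, which is one reason Sean calls it a hard approximation to make.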
So my final question is, what's the biggest practical challenge of making machine learning models useful in the real world? I would say for you, at Lyft, what do you see as the biggest bottleneck to taking a model from research or conception, to use in production?
Good question. I think there's still a lot of really base needs that need to be met. Getting training data into a shape that the model can be trained on is still something that... We spend a lot of time just making datasets for models to consume, and I think that's still a little bit slow.
There's some technology that's helping there, like feature store type ideas. I think that that's a challenge. I think that this, just, model life cycle stuff is still a big thing. I think two people collaborating on a model is a pretty challenging thing these days. I think you see... if one person gets to work alone, they can move much more quickly than they do in a group, but getting a group's worth of effort on a model is a really useful thing. So, I think that decomposing the problem into something that multiple people can work on is a big opportunity.
Finally, I think there's the monitoring: making sure that things are behaving the way that you'd like in production, that trust when it's running in production. And for us at Lyft, it's like if we screw this up, then the marketplace falls apart and drivers don't make money, and riders don't get rides.
It's a really big downside risk to losing reliability. So, getting to the point where we trust the decisions and that we can... So, we end up spending a lot of time just making sure that we're confident that the models are going to do something reasonable in the real world and a lot of layers of testing in between. I think that in the future, I would hope that we can get to a point where that friction starts to go down and we can be a little bit more iterative.
Awesome. Well, great sentiment to end on. I really appreciate your time.
Yeah Lukas. Thanks for all the great questions. This was super fun.
Doing these interviews is a lot of fun, and the thing that I really want from these interviews is for more people to get to listen to them, and the easy way to get more people to listen is to give us a review that other people can see. So, if you enjoyed this and you want to help us out a little bit, I would absolutely love it if you gave us a review. Thanks.