Roger & DJ — The Rise of Big Data and CA's COVID-19 Response
Roger and DJ share some of the history behind data science as we know it today, and reflect on their experiences working on California's COVID-19 response.
Roger Magoulas is Senior Director of Data Strategy at Astronomer, where he works on data infrastructure, analytics, and community development. Previously, he was VP of Research at O'Reilly and co-chair of O'Reilly's Strata Data and AI Conference.
DJ Patil is a board member and former CTO of Devoted Health, a healthcare company for seniors. He was also Chief Data Scientist under the Obama administration and the Head of Data Science at LinkedIn.
Roger and DJ recently volunteered for the California COVID-19 response, and worked with data to understand case counts, bed capacities and the impact of intervention.
0:00 Sneak peek, intro
1:03 The popularization of "big data" and "data scientist"
7:12 The rise of data science teams
15:28 Big data, Hadoop, and Spark
23:10 The importance of using the right tools
29:20 BLUF: Bottom Line Up Front
34:44 California's COVID response
41:21 The human aspects of responding to COVID
48:33 Reflecting on the impact of COVID interventions
57:06 Advice on doing meaningful data science work
Note: Transcriptions are provided by a third-party service, and may contain some inaccuracies. Please submit any corrections to email@example.com. Thank you!
What I think people don't realize is we are all in it to get better every day. We're sharing our skills. Whether it's sharing open source, techniques, ideas, technology, that's where it's coming together. In fact, if anything, I think, this terminology, this movement is a community-based organization, just as like Roger said, open source. No individual made this happen. The community owns this collectively.
You're listening to Gradient Dissent, a show about machine learning in the real world. I'm your host, Lukas Biewald.
DJ and Roger are both good friends of mine, and have been working in data and ML for the last 20 to 30 years. Roger was, for a long time, the co-chair of the Strata Conference and VP of Research at O'Reilly. DJ was the Head of Data Science at LinkedIn, and the Chief Data Scientist under the Obama administration. They both recently worked on the California COVID response using data, and I could not be more excited to talk to them.
All right, so I have a whole bunch of really good thoughtful questions, and then one question is probably slightly annoying for DJ, so I just thought I'd get it out of the way and start there. Which is, I was telling my wife, Noga, about talking to you this morning. I was saying, "DJ, you're the person that came up with the term 'data scientist'", and then my wife was like, "No, no, that's not true." Then we were discussing it, and then I was looking it up, and I couldn't figure it out. I was just wondering if you could let me know what the real...
I feel like at least you made the term popular, right?
Yeah. I think the first part to call out here is Roger actually gets credit for making "big data" popular versus when people were talking about data. Roger gets credit for "big data". I think the part about... Most people don't realize that. Roger never talks about it, but he's the guy. I remember going to an early talk of Roger's, where he's like "big data". I'm like, "Who's talking about big data? Isn't all data big?" He laid out an argument for it, and I was like, "Oh, yeah." Then you saw it catch fire afterwards.
Roger, you should talk about big data, but I'm happy to talk about where I think the origin story of data science comes from.
Let's start with that, and then I want to hear about the origin story of big data, and then what I should trademark and what domain names I should buy.
Totally. Part of the thing of data people, especially in that early era of LinkedIn and Facebook and others, was that there was a community starting to form of people getting together, and what do you call themselves. People had many different versions of names that were going on. Even going back to the '60s, there's been arguments of people where they found documentation and people titling things "data science". It's been floating around, and I wouldn't be surprised if we find a lot more examples of what people had been calling data science.
What was also going on at the same time, as people were trying to figure this out, people were playing around with the terms like analytic scientists, and Jonathan Goldman was the guy who came up with that. Pete Skomoroch was talking about the idea of a data artist. That actually got raised at a board meeting at LinkedIn. It was like, "Are we painting a palette? Are we creating a palette with that?"
What I recall is, and what we put in our book, was when we were getting ready for the IPO for LinkedIn, Facebook was... Jeff Hammerbacher and I both got together, and we were like, "Hey, HR is breathing down both of our necks. What do we call people?" We had too many different job titles, and so it was like, "Well, what's the listing?"
What actually went through those is you start to think about the terms. "Analysts" felt a little too Wall Street. "Research scientists" was a title that Yahoo had really popularized for where the data scientists sat with Cameron Marlow and other people, but they were always pushed out to the side of the product process or the product engineering process, and so that was a little too researchy. If you go with some of the things more "Statistician" or "Economics" or any of those, you're creating a war right off the bat, but also, the term hadn't really quite caught on except for places like Google with Hal Varian's team.
What we did is we went through that list. Jeff actually was the one who was like, "Well, we're starting to think about this term 'data scientist'." I took it back to... It was like, "Well, that seems plenty reasonable." I took it back to the team, and Monica Rogati actually had the idea of saying, "Well, we're LinkedIn. We have all the job postings. Let's post all the jobs with different titles, and see what everyone applies to."
So, we did that. Monica actually constructed the test for it, and guess what, every one we hired was in the term "Data scientist", and so that's why it sticks. I think a lot of people have gotten caught up in this origin story, but I think there's two parts that are important.
One, it exemplifies that this was a team effort. It's very easy for people to say, "Oh, DJ and Jeff did this." It's a community wide thing, right? This is a broad, diverse community that was all coming together to make this happen.
The second is, why did it take off? Not only did we data science our way into this title, but the reason I think it takes off is because no one knows what the hell it means. I say that with great seriousness because... Roger knows this. As you watch these fields evolve, and you've seen this, Lukas, tremendous amounts through all your work over years, is people like to put people in boxes.
They like to put skill sets in the boxes, and there's like, "Oh, if you're doing data, you're not supposed to do product. If you're doing product, you're not supposed to do engineering." We're like, "Why can't we do it all if we've got the skill set?" The data scientist person, people are like, "They're smart, and they have superpowers. We don't understand them, but they really add value." If you pull on that string of why do they add value, the reason funnily is because they're allowed in the room, and they have context.
Once you have context, you can take your skills, and apply it to the problem faster than other people can. It's the ambiguity that has come out. I think that has led to the rise of the title being actually taken over. If you'd asked Jeff or me back then, I'm very confident I would have said "No way this is going to be a thing that sticks. This is going to be really something that we really label our teams."
I think part of the reason it also took off... Frankly, LinkedIn and Facebook were very successful in their IPOs. People said, "What's behind that?" People said, "Ah, Roger's term 'big data', and the whole thing that makes big data come alive are these data scientists." I think that in my view is how we should think about where this is coming from and where... It also gives us indication of where we need to go.
Well, I guess there must have been something new going on with the social network companies of, I guess the mid to late aughts that there was a new need or something. I mean, I run, I think, a pretty standard business model at Weights & Biases, and it's really hard to imagine operating without a data science team. There must have been some kind of function before that. What changed in the requirements that it was needed to make a new role that didn't exist before?
Well, let me lay out an argument from the late '90s, and then Roger should dovetail because he's seen the whole evolution of this. What I think had happened... This really started just around 9/11 time period, as people were like, "Wait a second, there's signal in the noise, but no one's actually able to capitalize on it. How do we find the signal? How do we do something with it?"
You did see a lot of the early e-commerce companies like eBay and others actually had the equivalent of "data science team". They were just analytics functions in those roles, and people were called Business Analysts or other different titles. Google had a lot of these people, and had a lot of impact.
I think the seminal difference that we saw, which was really building on Yahoo's research team and those kind of groups, is that the data team could actually build products, not just come up with insights. At LinkedIn, the data science team had one part, which is, "Hey, how are we doing?" Metrics, dashboards, all of that thing. Had another component, which is "You're responsible for revenue. You're responsible for engagement. Your responsibility is to build things, make stuff happen."
You're a design team, a product team, an engineering team. All that comes together, and then there was another part, which is you gotta open up new turf and help things in new ways. That looks like security. Because if you're going to fight bad guys who got super sophisticated data tools, the only way to survive that is by bringing increased data science and functionality actually to bear to that. Roger, let me hand it back to you.
Sure. I think what changed, and it was right around the time frame you're talking about is that suddenly, there were little companies with big data.
They weren't going to go out and buy Oracle or... I was at Sybase in '99. They weren't going to go to Sybase. They needed to come up with their own thing, and the primary thing I think those big social media companies had to do was write quickly, and that meant in a distributed fashion. Then you've got Jeff Dean doing MapReduce¹ at Google to start that going, and then you've got the Yahoo people taking that idea and making it out.
What's interesting is that it dovetails with open source becoming mainstream, because now you've got people who are willing to use open source because their company is banking on it.
I remember talking to Abdur Chowdhury, who was the Chief Data Scientist at Twitter. This might have been 2004 or 2005. He's like, "I wouldn't use Oracle, because I can't go into the code and fix it, if there's a problem", and then that became a really important thing... I also think that that was the era when the best engineers in the world were really centering on the Bay Area.
That's where, in some ways, when I wrote the essay about big data and stuff, I was doing those talks while using big data. I was trying to capture this notion of having to store a lot of data, do it in a distributed way, analyzing big masses of data instead of a little or medium-sized pieces of data, and this became more core of what companies were doing. I tried to get that all in one thing.
That's the way DJ and I met is, I was writing a journal article with Ben Lorica on big data², and I knew Jonathan Goldman. He said, "You should come and talk to us", and so we did. We really liked the way DJ's team had people arrayed all across these functions that we used to think were in separate pieces.
He mentioned the product piece. He had visualization people, and they were all kind of together. We're like, "Not only is it big data, it's also a big group of people" with these multiple functions that ended up being worth integrating and worth coordinating with. I think it was a big thing, because I've been doing data warehousing in the late '90s, and that was a siloed thing as imaginable in most companies. It was not part of the mainstream.
I think what happened is, all of a sudden, you had LinkedIn, Facebook, Google, where that was what the company did, capture a lot of data and try to make sense of it to, in some ways, improve what they're doing, and in some ways to monetize what they were doing. It's a lot of incentive. And it was just driven in a whole different direction because of the open source piece of it too.
I'll just add one thing about big data. There's one personal part, and then one other part.
The personal part is I've worked at home for a long time, and I used to often bike to do a daily shop. It's where I go get some things. Once every 10 days or so, I had to do a big shop. I think this is just a verbal tic that I use, that there's little and big things.
The other thing is I got access to SimplyHired's data, and it was huge for me. It was two terabytes, and two billion rows, and I needed help. I got introduced to Scott Yara at Greenplum. We started doing that. I know the first talk I gave where I officially use "big data" was around that, distributed data management and doing that. I guess the last thing I should say is I was at O'Reilly, which is famously meme generating³ as a company. That helped.
Right. I had a platform that people actually listen to.
The part there that, I think, hopefully people are also taking away is, this has been a very big tent phenomenon. Abdur and Scott Yara, I mean, Lukas, you, all of us work together.
People don't realize when we were first comparing these ideas of how do we use Mechanical Turk, we actually... People probably don't realize, we ran a test head to head with each other of like, "How could my team do it versus you?" We learned a lot more from each other. We ended up going with you and using you, but what I think people don't realize is we're all in it to get better every day.
We're sharing our skills. Whether it's sharing open source, techniques, ideas, technology, that's where it's coming together. In fact, if anything, I think, this terminology, this movement is a community-based organization, just as like Roger said, open source. No individual made this happen. The community owns this collectively.
I'll bring up an interesting adjunct to that. I don't know when it was. It's probably around 2010, 2011, but at the time, there was MapReduce at Google, and then there was Hadoop starting to make a lot of waves out in the world.
The people I know at Google were very much in support of Hadoop. I think people were evolving. They're thinking about open source. The reason Google was so in support of Hadoop is that if you've learned MapReduce on Hadoop, they could hire you, that it was a way of training people. I think now, open source is a different dynamic on why people do it.
But back then, that ended up being an important dynamic. I think, when you get to ML, where everything is open source now, that's the logical thing. ML tools are cool. They do a lot of stuff. They're great, but what you really need is people. The more people you get involved, the more likely these things are going to get traction and become part of the mainstream. That is why PyTorch and TensorFlow and those kind of things are, I think, in the public domain with an open source way or shareable, because what's really more important is what you do with them than the tools themselves.
It's funny. Not to turn this into a whole reminiscing session, but DJ, I remember right after meeting with you back in... I think you recently left eBay. I remember I got a meeting with the eBay CTO, which was a huge deal for me at the time, because we were selling data products. I have this vivid memory of him telling me that he couldn't possibly store all of the user data. He basically erased 99.9% of it, and just saved the little bits of the rest of it, because that's all you needed to do anything important.
I remember thinking like, "Wow, that seems so painful to erase that data. You might want that data," but it's funny, because now, I feel like no company would dream of erasing data. It makes me wonder how much of all this is just driven by the ability to actually store all of this data.
Actually, what people don't always realize, this is one of the reasons I actually moved on from eBay, the straw that broke the camel's back. eBay obviously recovered from this, but there was a big argument from a number of us that said, "Hey, every time we want to do something interesting, we have to go to the lords of the data warehouse, and ask permission."
To get something done took months, and it should... It was pretty easy, obvious stuff. One of the things that... I remember this meeting very clearly is a number of us had this technical session. We basically said like, "Look, the bet for us has to be Hadoop. There's no other way. We cannot sit on traditional infrastructure and do the problems that we need to compete. It's business critical."
Those ideas got pushed out, and effectively, all those people that were on that mission of doing it all left to other things. Chris Riccomini, one of the key people behind Kafka, was one of them as well. What it showed is...I think this is something that companies need to grok with respect to machine learning, is that there are paradigm jumping moments.
If you don't jump, you will have to jump later, but you're going to be so far behind the curve. eBay obviously adopted one of the biggest Hadoop clusters with Cloudera. Seven years later, five, seven years later? But they could have been so much more competitive and done so much more. It strikes me that there's a similar moment that is happening around machine learning that if you don't get on the bandwagon now, you're late, if not already late.
Interesting. I mean, of course, I would agree with that.
I actually had a question. I had a question written up for you, Roger, that's maybe poorly formed, but I feel like you had front row seats. I'm not even sure if it's completely true, but it feels like there was this massive shift from Hadoop to Spark maybe five or six years ago, and it seemed like it was slow. Then all of a sudden... I was wondering. It seems like you just really saw that. I was wondering what you think about that and if there's some fundamental problem with Hadoop that they could have fixed, or if there's something coming beyond Spark. I was really curious to get your take on that.
I actually have strong opinions on this, but they'd be easy to try to puncture holes in.
I think Hadoop was a write engine, and that people needed a read engine. The fundamental early problem was the one you guys just talked about with eBay. How do you get all this stuff to disks? Well, distributed was the way to do it, but it was slow to get stuff out and things like no schema, that ends up, isn't really very good if you're trying to do any analytics on it.
I remember I got a lot of arguments with people, where they were telling me MapReduce is the way everyone's going to work. I'm like, "There's just no way that that's going to be the case. There's just too much embedded SQL. SQL is very productive." In a place like the Bay Area with its high concentration of great engineers, a lot of people are getting MapReduce, but a lot of people weren't getting MapReduce.
Back when I was first learning, I have a lucky thing, because I was working with Greenplum. Joe Hellerstein used to come to my home, and we'd go through MapReduce problems as they were trying to put in a MapReduce part of it, but my sense was that it was going to be SQL that was going to win, and that the analytics, instead of just storing stuff which is like step one, it's the analytic support that really mattered. Spark was just better at that than Hadoop was.
I think it was Impala was the first time that really SQL was available. Spark came right away with SQL. The other thing that happened, and this is just kind of an anomaly of... not anomaly, but just one of those harmonic convergences, Python was starting to become just a de facto language right around when Spark had a Python binding. That meant a lot more people were just able to get into and do the work that made sense. Also, just part of it was Spark being in memory. It was just fast. As long as you were able to make your RDDs, then eventually, the more table-like things in memory, then you could run really fast queries.
I think when it comes down to all this, you'd mentioned the kind of disconnect, DJ, between your getting data at eBay, and having to wait for it. I was running the equivalent organization at Sybase. I had the data engineers and data scientists right together. Of course, they weren't called data scientists. I ran the group. I did both things. The reason we did that is so that no one could complain that it took months to get anything, because I wanted to keep everything tight together.
I think, go forward to when Spark started coming out, you were able to actually do data engineering and data science work all in the same platform. You didn't need someone to pull all this stuff for you. You could do it yourself, because it was SQL, and you could just pull it in, and go through the whole thing up to even early ML stuff at the time.
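Editor's sketch: the productivity argument above, that a SQL query is easier for most people than hand-written MapReduce, can be illustrated with a toy aggregation. This is not code from the conversation. The event data and function names are made up for illustration, and SQLite from the Python standard library stands in for a SQL engine such as Spark SQL or Impala.

```python
import sqlite3
from collections import defaultdict

# Hypothetical click events: (user_id, action). Invented for this sketch.
events = [("alice", "view"), ("bob", "view"), ("alice", "buy"), ("alice", "view")]


def map_phase(records):
    """Map step: emit a (key, 1) pair for every record."""
    for _user_id, action in records:
        yield (action, 1)


def reduce_phase(pairs):
    """Reduce step: sum the values that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# MapReduce style: the analyst writes both phases by hand.
mapreduce_counts = reduce_phase(map_phase(events))

# SQL style: the same question is a single declarative query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, action TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", events)
sql_counts = dict(
    conn.execute("SELECT action, COUNT(*) FROM events GROUP BY action").fetchall()
)
conn.close()
```

Both approaches produce the same counts per action. The difference is in who can write them: the SQL version needs no knowledge of how the work is split up and recombined.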
The other thing is Spark is cheap. One of the things that people... The eBay team had some amazing technologists. They're all that Sun generation of deep, deep infrastructure thinkers, and so they used to have a TIBCO bus, and they had basically stream processors sitting on top of it. Except it's very expensive, just because of just the structural constraints of the time. With Spark, one of the things that was beautiful about it that we saw is like, "Wait, we can have a stream processor finally? We can actually do computation without having to wait and doing all sitting behind the ETL pain?"
That gave us a massive leg up on a key set of problems, mostly that were time-bound, like fraud and security issues. That was natural to gravitate to versus the Hadoop frameworks, the MapReduce frameworks.
The other part is I think Roger's pointing out, which I still think is there, is a lot of people want to work. We saw this for Kafka also. It's like, "We were going to put the logic layer on there," but it just takes so much time of development, even with the open source community to graduate these things, and Spark didn't have to worry about the underlying bus.
The part there that I think that we're seeing is data has moved into a space of just the background view. You've got specialized tools, right? For depending on the team, you're going to need different things, because most people who work in MapReduce, that is a leap way too far for most individuals and teams, especially when you're bringing in fresh talent from other disciplines or other areas.
I just want to bring up one...This is a bit of a corollary to this stuff, but when things first started — I know Lukas, we haven't even brought up Math Club yet, Math Club in San Francisco — diversity, cognitive, physical background, all this, is something that leads to really a lot better outcomes. I think that that's at tension with things like MapReduce, which are exclusionary and are really geared towards people who are really technically adept, that the companies that are really going to do well are the ones who can bring the tools out.
I'm not talking about just democratizing data, because I've got a really clear issue around too much democratizing data, but getting people who can go into the data and figure things out and having a lot of different perspectives on that is really going to make a big difference. I think that that means having tools that more people can use to get there matters. I think when you look back at the aughts to maybe the early tens, that they were still pretty hard to do, and that now we're at a place where a lot of people can spin something up, and start to make sense of it from all sorts of different backgrounds.
That's a great point. I wanted to go back, Roger, to an earlier point that you made before I forget, which is you made a little bit of a... I don't know. It seemed like you're a little bit dismissive of NoSQL databases, which is ironic, because I learned about NoSQL databases in your Math Club, and I've continued to use them for the last 20 plus years since then.
I was actually curious, do you think that it's generally a bad idea to... I mean, of course, everyone uses them now for some functions.
No, I don't think they're a bad idea at all. I think they were not a replacement. They're good for what they were. I think that the main argument about schema-less was...that was a really terrible argument. If you want to make sense of data, you probably want to have it organized in a way that people know.
When you think about analytics, it's a combination of things, the combination of data, the tools you're using, and the person who's using the data. The more that the person can know about the data, the less cognitive load on them to get into it, the better they're going to do. Having to deal with different schemas is not a way to promote that. In fact, what you end up promoting is someone who's got this photographic memory, rather than just a broad memory.
I mean, it's like the way JSON is clearly the way that most people move data around. I'd much prefer getting CSV data, because it's organized in a way, and you're not going to have the overhead of tags and stuff that are telling you what everything is. You can move right into... What usually I want to do with the data is trying to make some sense out of it.
I'm only dismissive of it as a pure replacement. It's like a lot of things. When it's the right tool, like you've got a lot of text, a NoSQL database is great, but for plenty of things, I want a key. I want a primary key to... I mean, I have something to dedupe against.
I just ran a big deduping project for the state of California around Homebase's timecard data, and they gave me stuff, and I had to dedupe it. It took 19 steps to dedupe it to try to make a primary key that I could use, and pick the right one when I had multiples and stuff.
I think this stuff matters. I think that... I'd love to hear someone argue the other point, but that you end up with things kind of messy, which was maybe okay, but you end up having to build taxonomies and the kinds of things that help you make sense of data. They end up looking a lot more like tables than a schema-less thing.
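Editor's sketch: the deduping described above, building a usable primary key and picking the right record when there are multiples, might look roughly like this in miniature. The actual 19 steps are not described in the conversation. The fields, the normalization rule, and the "latest update wins" rule here are all assumptions made for illustration.

```python
from typing import NamedTuple


class TimecardRow(NamedTuple):
    # Hypothetical fields; the real timecard schema is not given in the source.
    employee: str
    work_date: str   # ISO date, e.g. "2020-04-01"
    hours: float
    updated_at: str  # ISO timestamp, so string comparison orders by time


def surrogate_key(row):
    """Build a key from normalized fields: lowercase, collapse whitespace."""
    normalized_name = " ".join(row.employee.lower().split())
    return (normalized_name, row.work_date)


def dedupe(rows):
    """Keep one row per surrogate key, preferring the most recent update."""
    best = {}
    for row in rows:
        key = surrogate_key(row)
        if key not in best or row.updated_at > best[key].updated_at:
            best[key] = row
    return best


rows = [
    TimecardRow("Ana  Diaz", "2020-04-01", 8.0, "2020-04-02T09:00:00"),
    TimecardRow("ana diaz", "2020-04-01", 7.5, "2020-04-03T10:00:00"),
    TimecardRow("Ben Lee", "2020-04-01", 6.0, "2020-04-02T09:00:00"),
]
deduped = dedupe(rows)
```

Each extra normalization rule (nicknames, swapped date formats, partial records) would be another step of the same kind, which is one way a project like this grows to many steps.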
DJ, do you have thoughts here?
I'm with Roger. I think one of the things that we've seen with a tool that is being used for many other things, you end up building a lot of scaffolding or process around it that then suddenly is like, "Hey, there's data dictionaries for this, and there's manuals and wikis to help you get through the schema-less world," and you're just like, "Did we just put a schema structure that's just meta around this?"
Roger and I both had the opportunity and good fortune of working in California on the COVID response. There's a lot of really dumb, boring, unsexy problems that are the real rate limiter of progress.
People are very apt to saying, "Oh, there's another data source. We'll grab it. We'll put it in," and then you ask "How many people are actually ever looking at it?" It's zero.
You go around, and then you say, you look at the requests that people have, and you're like, "Everyone's requesting this data. How come no one's looking at it?" You go, "Oh, this is actually a comms accessibility problem. We're trying to solve this with all this machinery and everything else."
Literally, in the California COVID response, do you know the thing that changed the game? Myself and three other state people, we wrote a data dictionary in Excel for like, "Here's all the data that we have."
We just sent it around to all the different departments. We're like, "This is what we have. Here's where it is. If you see something new, or you want something, here's the new process. We're going to go super old school, and you can print this out. You can share it. Here's all the data that you want."
People can flip through it and be like, "Oh, I need COVID case counts by this. Oh, great. It's already in there. Sweet, ready to rock and roll." Those things move the needle more than just having this brand name data warehouse or super other cool stuff or dashboards up the wazoo, because they don't get looked at or utilized.
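Editor's sketch: a data dictionary of the kind described above can be very small. The version below is an illustration only. The entries, file locations, and owners are invented, and the CSV export is just one spreadsheet-friendly format, not the format the state team used.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class Entry:
    dataset: str
    description: str
    location: str
    owner: str


# Hypothetical entries; names, paths, and owners are made up for this sketch.
DICTIONARY = [
    Entry("covid_case_counts", "Daily confirmed COVID-19 case counts by county",
          "shared-drive/cases.csv", "Public health (hypothetical)"),
    Entry("hospital_bed_capacity", "Available and occupied beds by facility",
          "shared-drive/beds.csv", "Hospital liaison (hypothetical)"),
]


def find(term, entries):
    """Return entries whose name or description mentions the search term."""
    term = term.lower()
    return [
        e for e in entries
        if term in e.dataset.lower() or term in e.description.lower()
    ]


def to_csv(entries):
    """Render the dictionary as CSV text that a spreadsheet can open or print."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(Entry)])
    writer.writeheader()
    for entry in entries:
        writer.writerow(asdict(entry))
    return buffer.getvalue()


matches = find("case", DICTIONARY)
sheet = to_csv(DICTIONARY)
```

The point is less the code than the contract: one row per dataset, saying what it is, where it lives, and who to ask.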
I think one of the things that, I think, I find myself saying a lot is "What problem are we trying to solve, and does this actually solve the problem?" I suspect this is true for all of us. We've been in plenty of times where people are like, "The problem you're trying to solve is not the problem you really have."
Go for it.
I have this thing where I often tell people, "What does paradise look like?" That's the question I ask. Then they give me... paradise isn't clouds and harp playing, but the business plan they're trying to solve. Then they go, "Okay, how do I step through? How do I get there?"
Then that process leads into what kind of data you might use. As you were saying that, DJ, about the data dictionary that you did, I mean, I think that's really important. I think there's some... If we want to get into this, there are some fundamentals that people forgot about, but I think are worth reiterating to put things in a more productive manner.
But at the bottom of this list I prepared for this is "Put human perspective first." Maybe I should have made that the top thing, because I think what it ends up is we start thinking about the math and all that and biases and everything that's part of this, but it ends up... It's really a human process, and what you're really trying to do is get humans to give them the cognitive capacity to make better decisions, or at least to make decisions that are informed in a way that they can then learn from what they've done, and move forward from there.
Well, I want to hear this list of best practices, but that reminds me of one of my favorite memories of you, Roger, which I don't think you... I don't know if you remember this. I don't know if it made such an impact on you, but I was late to meet you at our first office for Weights & Biases when it was six people. I remember you were like telling our Head of Product, Carey, you're telling her, "Basically, nobody wants data visualizations. They want insights."
It's funny, because our tool is mainly of... data visualization tools is one way of looking at it. She was nodding in agreement. I was like... She was thinking about taking all the graphs out of our products. I was like, "Oh my god, I'm five minutes late. This is already happening."
I remember that. That actually is on my note. I think we were just hanging around, and trying to make a point.
One of the points is when you've got KPIs, when you've got someone who's in the data every day, and they know what they're looking for, you need a dashboard. You need this kind of visualization. But when you want to communicate, and I liked when DJ used the word "comms", you need narration, you need annotation. A dashboard won't do it.
I can just give one example from the state. They had mobility patterns for every county in California. There's 58 counties. There's 58 little charts arrayed in a lattice. Alpine County in California has 1,200 people. There's high schools bigger than that, and that showed as big as Los Angeles, which is the second biggest county in the country.
That was not telling a story. That was just going to confuse people. The point I was trying to make when I was at your office was more around that, that you need to include the things around narration and annotation. Again, bringing the human part in so that you can make sense of it and to show what's going on.
If you've ever seen me give a recent presentation, I use the lipstick mode, and I put big red circles around the things I want you to pay attention to. Then the slide appears with "Tada", then that comes on, and then I say it so that I'm trying to peg it a little bit into your memory.
I think this gets actually to a pet peeve.
Roger and I were talking about this some time ago, which is my biggest pet peeve... Roger, I'm curious your reaction to this. Somebody's like, "As you can see."
I'm like, "I have no idea what you're talking about. There's 58 lines. What are you talking about?" Then they're like, "As the graph shows," and you're like, "I don't know what that means even." People love these things, and you're just like, "Where the hell is...Tell me..."
We have a saying in National Security, BLUF, Bottom Line Up Front.
Tell me what the bottom line up front is, and then I can get there.
But if you're taking me on this journey of literature, I don't have time for that crap. Help me understand. Can you go to the president and be like, "Well, let's go on a data journey together, and let's talk about how we got here."
No, bottom line up front, and then figure out if they're interested, how do you get them to the richness that helps get another level of understanding?
There's a thing that I always tell the analysts who work for me that you can at best communicate four things plus or minus three. You got one to seven things that you can try to communicate, and you should say them upfront. The BLUF-ing, you can go through it, and then say them again at the end, and save the detail for later.
I think what's hard for a lot of analysts is that for them, the story of how they got to where they got is pretty interesting to them. It's really the insight that you really need to...
"I joined this table, and then I did this", and then everyone's like, "Oh my God."
They'll say something like, "And I forgot to do a left join."
We don't want the GitHub repo.
That's what appendices are made for. You throw that stuff in there.
I wanted to ask you about... I mean, you've both mentioned your work with the COVID response. I was thinking about COVID, and I feel like it's maybe the first time in my life that I feel like I've really consumed data visualizations from my government.
It does seem like starting to kind of get this communication in graphs and charts that are reasonably good and seemed to be well thought out, but I was wondering what problems you were trying to solve, or what were the big problems that data could solve with COVID and our government?
Well, I mean, maybe I'll start, and then Roger, you want to layer in because you picked up a lot of the baton from me in our first wave, and then took it far further.
The way this happened is our intention wasn't to go up to the Capital, and just be like, "Look, we're here." In fact, what it was, was we just happened to be on a phone call with a friend who is helping out at California, was actually a state employee. We're talking and they said, "Here's what we're thinking about data." And I remember saying, "Well, that's not what I would do if I were you." They're like, "Well, what would you do?" It's like, "Famous last words."
In a couple hours, I wrote a memo. I said, "Here's the way I would frame it. Here's what I think is doable. Here's what's not, and here's how I would organize things." Next thing I know, 24 hours later, we were driving up to Sacramento at 5:00 am to meet the team and start jumping in, and then we were up there for about 100 days.
The first part of it was, remember at this time period, there was no data. People think there was lots of data. We had data that we weren't sure we could trust out of Wuhan. We had data off two cruise ships, and a little bit of information when we were able to call our friends who could connect us to other friends who were physicians in northern Italy.
That's all the data we had. There was all this talk [about] epidemiological models, all these things. There were no models that were like, "This is the gold standard." There's no weather model for this. We were able, luckily enough, to have... The story actually, interesting enough, is being super detailed by Michael Lewis in his book that's released today⁴. Then we had this amazing woman, Charity Dean who is working on things. We had another guy, Adam Readhead, who is another public health official, and Amy Tong, who's running Information Technology for the state of California, amazing human.
These are real...these are people we should be grateful for. What they were putting together was like, "Well, what is the model?" We found that one of the models... Everyone was looking at the models for all of the states. The model for Delaware is the same as the model for California. That doesn't make sense, and that doesn't help us think about how to think about LA, San Francisco versus Alpine county or Tahoe area.
We needed a more sophisticated one. Luckily, there was a research scientist named Justin Lessler out of Johns Hopkins, who had a pretty sophisticated model. This model-
I'm sorry, this is a model of what? What are you modeling?
This is modeling... It's basically a set of differential equations, and it says basically, a person's... You start with the population. You sprinkle some base conditions of those that are infected. They, at some percentage, get other people to be infected, symptomatic, asymptomatic ratios. Some portion of them will die. Some portion of them will survive, and then that's it, super simple.
Now, you need other things in there like, "Well, what about people who commute between the Bay Area and LA? What about different age demographics? What about closing schools? What will that do?"
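[Editor's sketch] The kind of compartment model DJ describes here can be illustrated in a few lines. This is only a minimal sketch of a generic susceptible / infected / recovered / dead model stepped forward in time. It is not the Johns Hopkins model, it omits the commuting, age, and school-closure extensions he mentions, and the function name and every parameter value are hypothetical choices made for illustration.

```python
def simulate_epidemic(population, initial_infected, beta, gamma, fatality,
                      days, dt=0.1):
    """Forward-Euler integration of a simple S-I-R-D compartment model.

    beta     -- assumed transmission rate per day
    gamma    -- assumed rate of leaving the infected state per day
    fatality -- assumed share of people leaving the infected state who die
    Returns one (susceptible, infected, recovered, dead) tuple per day,
    starting with day 0.
    """
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    d = 0.0
    steps_per_day = round(1 / dt)
    history = [(s, i, r, d)]
    for _ in range(days):
        for _ in range(steps_per_day):
            # Infected people pass the infection to a share of the susceptible.
            new_infections = beta * s * i / population * dt
            # Some portion of the infected leave that state at each step.
            leaving = gamma * i * dt
            s -= new_infections
            i += new_infections - leaving
            r += leaving * (1 - fatality)  # some portion survive
            d += leaving * fatality        # some portion die
        history.append((s, i, r, d))
    return history


if __name__ == "__main__":
    # Illustrative values only, not figures from the California response.
    run = simulate_epidemic(1_000_000, 100, beta=0.3, gamma=0.1,
                            fatality=0.01, days=200)
    peak_day = max(range(len(run)), key=lambda day: run[day][1])
    print("peak infections on day", peak_day)
```

Each compartment only exchanges people with its neighbors, so the four numbers always add up to the starting population, which is a handy sanity check on any model of this shape.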
They had started to build more and more sophistication in the model, and so you could run an ensemble, many, many scenarios. The only problem is this was a research thing, so it's running under somebody's desk. Luckily, we were able to call on Sam Shah, who really deserves a lot of the credit for scaling People You May Know at LinkedIn.
Jonathan Goldman and Mike Greenfield really came up with the ideas. Sam Shah scaled it and made it really a machine learning platform. Josh Wills, who was at Cloudera and then at Slack figuring out how to make data engineering work. The two of them with Justin Lessler's team and with a massive help from Werner Vogels and the Amazon team took that model and ported it over in a matter of days.
Now, we're able to run hundreds of simulations. Those simulations are what led to those first graphs that people saw of the exponential curve that were shown on press conferences by Governor Gavin Newsom. That also led everyone to see like, "Holy cow, if we don't get this under control, here's where our bed capacity is right now. Here's our bed capacity if we put them in parking lots and do everything, and here's the curve."
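[Editor's sketch] Running "hundreds of simulations" against a bed-capacity line can be illustrated roughly as follows. This is a hedged toy example, not the state's pipeline: the function names, the transmission-rate range, the share of infections needing a bed, and the bed counts are all assumptions invented for the illustration.

```python
import random


def peak_infected(population, initial_infected, beta, gamma, days):
    """Peak number of simultaneously infected people in a daily-step SIR model."""
    s = float(population - initial_infected)
    i = float(initial_infected)
    peak = i
    for _ in range(days):
        new_infections = beta * s * i / population
        recoveries = gamma * i
        s -= new_infections
        i += new_infections - recoveries
        peak = max(peak, i)
    return peak


def run_ensemble(n_runs, beds, hospitalized_share, seed=0):
    """Run many scenarios with uncertain transmission and compare to capacity.

    Returns the peak bed demand of each scenario and the fraction of
    scenarios in which demand exceeds the available beds.
    """
    rng = random.Random(seed)  # seeded so the ensemble is reproducible
    peaks = []
    for _ in range(n_runs):
        beta = rng.uniform(0.15, 0.35)  # assumed range, illustration only
        peak = peak_infected(1_000_000, 100, beta, gamma=0.1, days=365)
        peaks.append(peak * hospitalized_share)
    over_capacity = sum(p > beds for p in peaks)
    return peaks, over_capacity / n_runs


if __name__ == "__main__":
    # Hypothetical capacities: current beds versus a surge scenario.
    for label, beds in [("current beds", 2_000), ("surge beds", 6_000)]:
        _, share = run_ensemble(200, beds, hospitalized_share=0.05)
        print(f"{label}: {share:.0%} of scenarios exceed capacity")
```

Comparing the same ensemble against two capacity lines mirrors the framing in the quote: where capacity is now, where it could be with every surge measure, and how much of the curve still sits above both.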
A lot of people at that time were like, "This is garbage." They didn't see what's happening literally in India right now. They weren't seeing what was happening in New York.
That model, that effort of a combination of data scientists, data experts, and technologists combined together with the policymakers, that's what led to the state order on saying we need to stay at home, because there's one goal. One goal is to preserve the healthcare system for tomorrow, because the physicians get sick or die, you don't get that back.
That model, those efforts with Governor Newsom and him, is what allowed other states to realize that they need to take action as well. That's what led us to the follow-on orders, and actually being able to make sure that what we saw in New York didn't happen in San Francisco or in LA, even though LA was still hammered.
From those efforts — and there was so much more data than we sort of realized — that was just one part, because then it was like, "Hey, how do we help policymakers with richer, deeper understanding of ideas?" We had to bring data in and draw insights.
Luckily for me, one of those people who answered the call on the first ring was Roger, and Roger said, "I'd be happy to volunteer." Because we're all volunteers, no one was paid. This was all volunteer all the time, and we're all just trying to do our best.
Roger, you should take over because you led the next portion.
One of the interesting things that happened is what came out of that model was the need to look at mobility data, and so we started getting a lot of updates about how people are moving around. We noticed some things about that that ended up leading, and I think this is what's so interesting about it, is the data led to thinking a lot about ethnography. It ended up being behavior that mattered, and then turning that behavior into something we could do.
I'll just give one quick example. In the spring in Los Angeles, there is a bioluminescent event⁵, so these algae glow, and people want to go to the beach and look at them. People brought this up, and we're like, "Well, what are we going to do about it?" Somebody's like, "We gotta keep people off the beaches. We gotta keep people off the beaches."
It's like at that time... This is April, I think like late April, early May. I think enough people on the team knew it was an aerosol, and that spreading apart was okay. I mean, but this is my remembering. It might have not been as clear as it seems now that we know for sure it's an aerosol, but that's how I remember it.
One of the things we did is like, "No, we're not going to keep people off the beaches. Let's keep people safe on the beaches."
Here's what we had. We knew that people were moving around a little bit. I think at that time, there was a little bit of upward movement, and so we told people in Los Angeles that what you need to do is maybe have some people to keep people spread apart and stuff. Of course, there weren't going to be as many people anyway that's doing that.
We started doing stuff like preparing for Thanksgiving in August. What do you tell people? The harvest was a big... The real boost in California's rates came because of the harvest, which was a perfect equation for how you're going to get infections in a community. I don't know if you remember it, I think Imperial County at one time had the highest rates in the world, and it was totally because of the harvest.
What we ended up doing is using data to try to communicate to the people in the state, and to think about behavior things that then we would maybe build new...and they really weren't models, they were really just studies about what we could do or what was happening that we could intervene better.
We started going from where everything was statewide, at first, to talking about rural versus urban, because there's very different things going on depending on the density and characteristics, and trying to also learn things like...
The Bay Area did better than the rest of the state. I think that the trivial reasons why really ended up not being the reasons why. The trivial reason was people could remote work easier, and an educated population, and it ends up... This isn't something we found, but a lot of it had to do with the Bay Area's experience with AIDS, and having to deal with another pandemic. That was community, community access.
As we started seeing mobility data showing more movement, we brought in someone from New Haven who had done some really interesting work around, "How do you deal with that community part?" What I like to think as what happened is the basis was laid with data, and then we were using that to go to this next level of mixing that data with some qualitative behavior.
We had some ethnographers on the team start doing a lot of surveys, and we would use those surveys to... In the end, DJ, I don't know how much of you were involved towards the end of this, but really, the surveys ended up being the driving force behind comms that were going afterwards, which is, again, another qualitative thing, but we made sure that the surveys we were doing were being a better instrument for pulling stuff.
This is one of those lucky breaks. I happen to run all the surveys at O'Reilly, so I had some survey experience, and we were able to bring that in and improve that.
Kara DeFrias really deserves credit for having the survey idea. What she did is she basically convinced the state to basically put just as an open-ended set of questions on one of the highly trafficked webpages, and it just sampled. It was just a way of getting feedback, but the problem is it's very hard to get a feedback on a state the size of California, just given the disparity.
One of the things that was prepared every night at that point was basically a briefing book for the governor and the key staff. It had charts and graphs and all sorts of really important key insights. But also, it had snippets of key things that we heard from the real population, real stories.
These weren't data points anymore. There were people. They had names. They had ages. They had stories, and you read those things, and you could feel the fear. You could feel the pain, and so no longer could you just be like, "Well, it's an uptick. We'll see what happens."
No, that uptick destroyed a family. Like, "Oh, it's just the harvest." No, we're about to destroy a community. What are we going to do about it? It changes the whole narrative and approach you take from just being a data science thing, and thinking about this in abstract and playing with graphs to, "If we do not act right now with immediate sense of urgency, somebody will die."
It's not an "if", someone will die. Our actions directly help the shift in balance of who that is, and how do we make sure that they get the best shot at surviving? Our job fundamentally was to use data to give everyone a shot at living. If a hospital doesn't have oxygen, figure out how to get them the oxygen, so those people have a chance.
If people don't realize that the people around them are highly infected, let's give them a shot to actually be able to take safety measures in their own hands, so they can survive and increase the probability that they are okay, or some other family member that they may expose will be okay.
I just want to say one thing that was really striking to me about it. Obviously, this kind of thing is in politics, it's in the realm of politics. This group of volunteers went out of their way to always treat every group, every person, as worthwhile. There was really no politics in the traditional polarizing way we think about it going on.
It was always about how to keep people safe and how do we... Mostly, it was about how to tell them the information they need, that they can try to be safer and do the right things. It was really quite, I don't know, enervating to see that going on with it.
I guess this relates back to my earlier point about the annotation narration is that we ended up moving from learning from the data, and then moving to this alternate approach that I think ended up being effective downstream.
Do you have a sense of how effective this was when you... I mean, it sounds like there's a whole bunch of different kinds of interventions you were doing.
Is it possible to even know if you hadn't done these things, what would have happened? Do you feel anything about the overall effectiveness of this response?
There's a blog post that Sam Shah wrote that talks about this, and there's been a whole lot of estimates. I think we'll continue to see estimates and a lot of people doing deep analysis for decades to come.
I think I've received a fair amount of criticism, and it's okay to receive criticism about what people would describe as a very strong policy response, and that we were too aggressive in shutting down the economy, and taking the action we did. I actually sleep well at night, knowing that we took the strong action that we did, because if we didn't...
I mean, I was in contact very regularly with friends who are on the front lines in New York City. I was on the calls with people who were in the ER who were showing me how they were... Just like in a kindergarten, they had a wall with paper brown bags that you put your mask in, because you just need to come back and reuse them.
People forget how many physicians and nurses and janitorial staff, the people who we don't often think about in the health care system, that died in service. When you lose that capacity, you don't have the capacity to get back up. You don't have a...
There's just no one else to take care. What you're seeing happen in Brazil, what you're seeing happen in India, that could easily have been us. People think, "Oh, we're good." Remember, there was no Remdesivir. People didn't even know about ventilators and how... Do we flip a person over? Do we sit someone up?
We had no information. We had a Slack channel that was created, literally, for physicians just to share information from one group to another about what they were learning. That's how little information we had.
Now, what's behind this? A decade of under-investing in public health, more than a decade. Did we have to end up this way? Absolutely not. This is an abject failure of literally 20+ years of not investing.
President Obama called for a massive revamp of this after Ebola. We saw this with MERS. We saw this with SARS. We've seen this many times over, and people often think like, "Oh, we're through the pandemic." This is not pandemic flu. This is not pandemic tuberculosis. This is not the next Coronavirus, which will show up. We will expect another Coronavirus.
I don't say that to be a doomsayer. I say it as these are the systems that we need to get into place now to be ready for what's next so that we don't have to just say, "One size fits all, shut everything down."
We can be smart, because we are going to have to create this as a knob, if you will, of dialing things open, dialing things back, depending on what we're seeing in which community. A lot of this is going to be really tough because it's socioeconomic also. As Roger pointed out, you get very different dynamics from one region to another.
As an outside observer, it hasn't felt like the levels of COVID the different regions saw was exactly correlated with the thoughtfulness of the COVID response. Do you think that's fair, or is it just that the data is noisy? Am I missing something there?
Say more. I don't think I fully understand.
Well, I guess it seems to me... I did not prep by looking deeply at the data, but I've had this sense that some states that really aggressively put in controls, maybe, or the states that put aggressive controls in quickly, sometimes, they ended up having more COVID cases than states that seemed to ignore it.
Some states, it sounds like California had a really thoughtful response. It seems like some even governors are like, "Hey, this isn't even a problem," so I can't even imagine how that state can have any reasonable response, when the leadership doesn't even believe that there's an issue when there obviously is.
I guess, it seems really hard to know how much the interventions really mattered.
Right. I think there's a bunch of stuff. Then Roger, I'd love to... Let me just be real quick, and then you should go ahead.
The first is, we're still scratching the surface in our understanding of COVID. I think a lot is still going to be learned. We now know as we were in the summer, we were really worried about protests that were happening. We were really worried about the Sturgis Rally⁶. We thought, "Oh my gosh, these are going to be super spreading events."
It turned out we dodged a bullet. I think people are like, "Oh, you're wrong." I could look at it as we dodged the bullet. Because if that was highly contagious, more like measles, we'd be in real trouble.
The other part I think, which is there, is one of the things that happened is because of the actions here, a lot of people did start to take COVID very seriously on a personal level. They go, "Oh, this isn't just some... California is taking this action. Maybe we should take it seriously."
But the other example of this is the version that you're seeing in India, which is they stopped taking it seriously. They started whole big political rallies. They gave away their vaccines, and they're not shutting down still, and then people are partying and other aspects. That has led to the spike that is...it's decimating, because there's no path out now. What you see in the COVID numbers today is a reflection of four weeks ago.
Roger, I'd love your...
Just using the machine learning language, there's a lot of features that go into what goes on with this. The protests ended up not being super spreader, Sturgis was.
Look at what happened in the Dakotas, right? Those are places without a lot of cross mixing of people because they're relatively isolated and remote, and they have the worst case loads in the whole world. It's not clear that it was totally Sturgis, but there's a lot of thought that it was Sturgis, and that their political response was not very strong.
Now, those are states with less than a million people each, so the impact isn't as great, but there was a different response, and they had a much worse result. Now, what happened in Florida, I think, can be looked at differently. I'll just bring up... I know this isn't supposed to be a political discussion, per se, but I have my parents who live in Florida.
People have self preservation, and I think that there was enough knowledge out there that, at some point, with or without these, whether the government was intervening or not, enough people were trying to play it safe, and were doing the right things.
Yes, there were people who were in that kind of obnoxious way about individual liberty and stuff like that. I always want to take one of those people and say, "You need to talk to my friends from Taiwan," because anyone from Taiwan who went through, I guess, with SARS, they, as a collective group, knew how to act and do the right things.
It's giving up some things that feel like a freedom, or whatever, seem worthwhile in the long run to tamp things down. Of course, Taiwan had a much better response than others. It's hard to take out the American context and how people behave. There's plenty of examples that would support one position or the other, but personally, I'm more comfortable with the intervention.
I mean, I realize that it was really bad for the economy. I think things like maybe the way schools were opened and so forth could maybe have been handled differently. We've got a lot to learn.
Well, I really appreciate all the work that both of you did.
That's actually a segue to a question I really want to make sure I ask you, which is for someone just starting their career in data science, maybe most of the people in that situation that I've talked to these days, they really care about doing something meaningful, maybe getting involved in public sector stuff.
What advice would you give someone maybe just graduating now who wants to do interesting work and have exciting careers like both of you have had? Where do you tell them to start, I guess?
That's a good question. What I've been telling people is to remember this human side of things, and don't get too lost in the numbers. This is more like... This isn't quite career advice, I think, what you're asking, but also, you've gotten a bunch of tools that are pretty cool, but that doesn't mean they're applicable in every case.
Always work your way up from, "Is there a simple thing that will work? Well, how far does it get you," and then work from there. Then when you need this sophisticated tool, and when it's worth it, to jump in with that.
TensorFlow is not the answer to every classification problem. There's other tools that can work really well. But also, I mean, just find things. I think I just saw a tweet today, DJ, from Rick Klau that the state is hiring.
If you are trying to do some good things...like I said, I was so impressed with the people of the state and their attitudes about trying to do the right things and being good for all Californians. That's a great place to start.
The place maybe I would go with, because I agree with Roger on all of this, no surprise, and hopefully what people have taken away from listening is this is a team sport.
The amount I've learned and grown from you, Lukas, from Roger, from the people that you've introduced me to, the people we've all hung out with, the thing that you want to do at the early part of your career is be around amazing, awesome people. Be around awesome. If you're around awesome people, you'll become awesome, too. You may feel like you're an imposter to start, and you gotta figure out how to shake off that imposter syndrome.
But if you're around amazing people, they're going to carry you. It could be on a public sector. It could be in the private sector. It could be in a hybrid sector, but I fundamentally believe that if you're around those great people, that's what carries forward.
I've been so fortunate early in my career being around amazing people in academia, then being around amazing people in public service the first time after 9/11, then coming out here to Silicon Valley meeting all of you, and being exposed to that group, and then going back into government back and forth several times.
Each time, we're able to pull in amazing. The thing that people don't realize is...people ask me all the time like, "Why do people pick up the phone when you call, and then why won't they pick up for someone else?" It's just because I'm trying to do it for the team. It's a "we" approach.
I'm not trying to just further it for one perspective. I think we've all had that philosophy, that this is a collective movement. I will go on the record saying this, which is I get way too much credit. The credit belongs to the community. It belongs to the teams, all these people.
I've just had the good fortune of being in certain roles that get to shape certain things, but those people have also shaped me. They're the ones that have helped make me into what I am, and help make that happen. If you're early in your career, and you can find a place where you're learning at three to seven times the rate of somebody that's just in a regular job, you're going to do fine.
Seek out those places. Don't optimize for a salary. I'm not saying it's not important, but optimize for learning. Your first derivative, your second derivative, should be highly positive on your learning experience quotients.
I want to focus on a particular part of that, because I completely agree. This talk that I give is my general talk about data topics. It starts off with humility, and humility is a key to learning.
I also will tell anyone that says, "I need to hire a data scientist." I say, "You don't need to add one. You need to hire two." No matter what you're doing, you need to be paired with other people. In terms of finding an opportunity, I think you gotta make sure you're not siloed.
I want to give a particular example of what I think happens. When you get into the data, it's almost like a scientist looking at the universe. You say, "The universe is my data," and without outside perspective, you don't learn. The data almost in a way stops your expansion, because that's all that you can see, but there's a lot beyond that.
I know Ben Lorica, who is one of the people I was lucky enough to work with, who taught me so much, he's a real math PhD. I didn't have that background. We did not release anything without the other looking at it. We were on the phone almost every morning talking about what we were doing. When you're looking for these career opportunities, make sure that you're not going to be siloed, that you're going to have as much opportunity to work with other people, almost like in a pair programming way.
Look for that. Look for companies where you're going to be able to talk to other people in the organization so that you're getting all these things that DJ was talking about: the opportunities to learn from amazing people, and just picking up little things like DJ's story about the data dictionary working so well — the next job you go to, and there's no data dictionary, you're going to make sure there's one there — picking up on those kinds of things, because they can be so effective.
I think just making sure that you're like an octopus, and your career move is your tentacles are all over the place.
It's funny you both say that. I mean, I totally agree with it.
I think it's one of the reasons that... This is totally shameless self promotion, but I really think it's true. We've really tried to build a friendly, smart, but really inclusive community at Weights & Biases with stuff like this, where people can meet smart people that they might not otherwise have access to based on luck and geography, and so I just really encourage people to engage with our community. If you're watching this, you're part of it. We love answering people's questions, and hearing from people, and hearing about what they're working on.
Anyway, I just totally appreciate you guys coming on and talking and answering my open-ended questions, and also appreciate all the work that you've done throughout your career. It's been inspiring to watch, and clearly directly connected to a lot of good in the world.
Thanks. It's been fun to catch up.
At Weights & Biases, we make this podcast, Gradient Dissent, to learn about making machine learning work in the real world, but we also have a part to play here.
We are building tools to help all the people that are on this podcast make their work better and make machine learning models actually run in production. If you're interested in joining us on this mission, we are hiring in Engineering, Sales, Growth, Product, and Customer Support, and you should go to wandb.me/hiring, and check out our job postings.
We'd love to talk about working with you.