Ion Stoica — Spark, Ray, and Enterprise Open Source
Ion shares the stories behind developing the distributed computing frameworks Spark and Ray, and commercializing them into Databricks and Anyscale.

About this episode
Ion Stoica is co-creator of the distributed computing frameworks Spark and Ray, and co-founder and Executive Chairman of Databricks and Anyscale. He is also a Professor of computer science at UC Berkeley and Principal Investigator of RISELab, a five-year research lab that develops technology for low-latency, intelligent decisions.
Ion and Lukas chat about the challenges of making a simple (but good!) distributed framework, the similarities and differences between developing Spark and Ray, and how Spark and Ray led to the formation of Databricks and Anyscale. Ion also reflects on the early startup days, from deciding to commercialize to picking co-founders, and shares advice on building a successful company.
Timestamps
0:00 Intro
0:56 Ray, Anyscale, and making a distributed framework
11:39 How Spark informed the development of Ray
18:53 The story behind Spark and Databricks
33:00 Why TensorFlow and PyTorch haven't monetized
35:35 Picking co-founders and other startup advice
46:04 The early signs of sky computing
49:24 Breaking problems down and prioritizing
53:17 Outro
Transcript
Note: Transcriptions are provided by a third-party service, and may contain some inaccuracies. Please submit any corrections to angelica@wandb.com. Thank you!
Intro
Ion:
When we looked and thought about it, we couldn't see a path for the company to be successful — a credible one — without the open source being successful. And then once we reached that conclusion, we just...there was no other discussion, we just focused on that.
Lukas:
You're listening to Gradient Dissent, a show about machine learning in the real world, and I'm your host, Lukas Biewald. Today, I'm talking to Ion Stoica, who is maybe best known as the original CEO of Databricks, the company behind Spark. But recently, he's also started another incredibly successful company called Anyscale, which makes the open-source project Ray. On top of all that, he's a professor at Berkeley, where he runs the fascinating and super successful RISELab which is responsible for many of the most exciting startups of the past decade. This is a super fun conversation, and I really hope you enjoy it.
Ray, Anyscale, and making a distributed framework
Lukas:
I think a lot of people listening to this will know about Ray and know about Anyscale, but for someone who's working in machine learning and doesn't know about Anyscale, what does it do?
Ion:
Fundamentally, if you look at the trends, the demands of the new kinds of applications, like machine learning applications or data applications, are growing much faster than the capabilities of a single node or a single processor. This is even if you consider specialized hardware, like GPUs, TPUs, and so forth. Therefore, it looks like there is no way to support these workloads other than distributing them. Now, writing distributed applications is hard. And if more and more applications are going to become distributed, there is an increasing gap between the desire of people to scale up their workloads by distributing them and the expertise the typical programmer has.

So Ray — let me start with Ray before Anyscale — the goal is to make writing distributed applications much easier. It's doing that both by presenting a very flexible and minimalist API, and in addition to that, we have this very strong ecosystem of distributed libraries. Many of the people in the audience probably know them, like RLlib for reinforcement learning, Tune for hyperparameter tuning, and more recently Serve. But we also have a lot of third-party libraries like XGBoost, Horovod, and so forth. Because at the end of the day, if you look at the most popular languages like Java or Python, they are not the most successful because they're the best languages. That's debatable. They're very successful because they have a strong ecosystem of libraries. Developers love libraries, because if you have libraries for your particular application or workload, you make a few API calls and you are done, instead of writing a thousand lines of code.

Now, this is Ray; it's open source. Anyscale is a cloud offering, a hosted offering, of Ray. We are committed to building the best platform to develop, deploy, and manage Ray applications. This means higher availability, better security, auto-scaling functionality, tools, and monitoring when you deploy applications in production. On the developer side, we try to provide the developer the illusion, the experience, of an infinite laptop. Because still, most developers — we've done this survey, and others have done surveys — most machine learning developers still love their laptop, and they're still doing a lot of things on their laptop. We want to preserve the experience of working on a laptop, using the same kind of tools like editors and things like that, but now we want to extend that to the cloud. So you edit, you do everything on your laptop, but then when you run it, you can run it in the cloud. We package the application, all the [mechanization], to the cloud. And we run it on the cloud, we auto-scale, so it's pretty much transparent. This is what Anyscale provides. But both Anyscale and Ray are really targeting making scaling applications, in particular machine learning applications, as easy as possible.
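To make that concrete, here is a minimal sketch of the kind of API Ion describes: Ray turns plain Python functions into distributed tasks and classes into actors. (The example is illustrative, not from the episode; the function and class names are made up.)

```python
import ray

ray.init()  # start Ray locally; on a cluster, this connects to it

# A task: a stateless function Ray can schedule on any worker.
@ray.remote
def square(x):
    return x * x

# An actor: a class whose instance lives in its own worker process,
# so state (e.g., a model pinned on a GPU) persists across calls.
@ray.remote
class Counter:
    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count

# .remote() returns object references immediately; ray.get() fetches results.
futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]

counter = Counter.remote()
print(ray.get(counter.increment.remote()))  # 1
```

Calling `.remote()` never blocks; it returns a reference that can be passed to other tasks and actors, which is what libraries like Tune and RLlib build on.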
Lukas:
That's, I guess, very conceptually simple, but clearly it's been a problem for a very long time, and you've put a lot of work into Ray and Anyscale. What makes it actually challenging to make a simple distributed framework?
Ion:
That's a great question. One lesson we learned is that what people and developers really prioritize is, in some sense, performance and flexibility. Even over reliability. I'll give you some examples.

When I started Ray, we had only tasks. They are side-effect free: a task gets some inputs from some storage, computes on those inputs, and then the result is also stored in this kind of storage. And then another task can consume that result. Now, that's a very simple model, and you can build a very robust system on that. This is from the lessons we learned in the past with Spark. Because if, for instance, you lose some data, you can keep the lineage: the chain of tasks which created that data in the first place. And then you can re-execute the tasks if you know the order. And if the tasks are side-effect free and deterministic, you get the same output. We were pretty happy about that.

But then people started to want more performance. Here is where things started to fall apart. For GPUs, you don't want to just run a task, get the data in, and store the data back out, because even transferring the data from the RAM, from the memory of the computer, into the GPU memory is expensive. And then, if your task is also doing something like TensorFlow, starting it and initializing all the variables takes a few seconds at least. A bit more, actually. This kind of overhead was starting to be prohibitive. People asked, "Okay, I want my state to actually remain on the GPU," but then you don't have these kinds of pure tasks. And now it's much harder to provide this very nice model of fault tolerance.

And then there is another thing: reinforcement learning. People using reinforcement learning wanted to use it for simulations or for rollouts, games. Some games are not open-source, and the games which are not open-source keep the state inside. They don't provide you the state. You cannot extract the internal state; you can only see the screen. You take an action — moving left, right — and then you look at the screen and you read the screen. So because of this, we had to add actors.

With actors, it was much harder to provide this fault tolerance. We still tried it, initially, in our first paper. We tried to be very smart. We said, "Okay, it's Python," so we made the assumption that inside each of these actors you have a single thread, running sequentially. Basically, you can order the methods which are executed on the actor. You can then serialize them: you have an order, you record the order, and then you can re-execute and reconstruct the state. But guess what? People started to use multi-threading. Even if it doesn't work great in Python, they still use it. You cannot stop them.

Then we were thinking, "Okay, we are going to simplify it. Let's simplify our life," because we still wanted to make a system which we understand well and which tries to provide some fault tolerance. We had this restriction that if you create an actor, only the party which created the actor can invoke a method on that actor. You have only one source, so at least it's easier to serialize the actions. But then people started to want, "Oh, I want to do something like [a] parameter server. And [for the] parameter server, I want not only me to access this parameter server — which can be implemented as a bunch of actors — but others too, so you need to be able to pass the actor handles around."
But now you have, again, this concurrency from different methods submitted by different actors or tasks. So all of these things add, in some sense, complexity. And then, coming back to fault tolerance, because it's still important, especially in a distributed system: Leslie Lamport — the guy who did Paxos, and a Turing Award winner — his definition of a distributed system, from a long time ago, is a system in which, when a machine or a service you never heard about fails, the system stops working. So we had to give up our ideal of transparent fault tolerance. And we said, "Okay, we can restore the actors, but then the application has to do some work in restoring the state, if it cares about it."

In a distributed system, these are the hard things: performance and fault tolerance. And then, in general, concurrency is the other thing, because things happen in parallel, and on different machines now. And again, when you expose that and you want to make it flexible, things are much harder. In something like Spark, you abstract it away with the [?]. You don't give the user control to write really parallel applications, so you have more control. But again, the more you...
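The parameter-server pattern Ion mentions looks roughly like this in Ray (a sketch with made-up names and numbers): the state lives in one actor, and its handle is passed to many tasks, so several callers can invoke methods on the same actor concurrently.

```python
import ray

ray.init()

# A toy parameter server as an actor; its state lives in one process.
@ray.remote
class ParameterServer:
    def __init__(self):
        self.weights = 0.0

    def apply_gradient(self, grad):
        self.weights -= 0.1 * grad  # illustrative update rule
        return self.weights

    def get_weights(self):
        return self.weights

# Tasks receive the actor handle `ps`, so many concurrent callers can
# invoke methods on the same actor, the situation Ion describes.
@ray.remote
def worker(ps, grad):
    return ray.get(ps.apply_gradient.remote(grad))

ps = ParameterServer.remote()
ray.get([worker.remote(ps, g) for g in (1.0, -0.5, 2.0)])
print(ray.get(ps.get_weights.remote()))  # about -0.25
```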
How Spark informed the development of Ray
Lukas:
This does seem very similar to Spark in some ways, and I assume it was informed by your experience with Spark. Can you maybe first describe Spark in contrast, and then talk about how that informed Ray?
Ion:
Totally. Spark was developed for data-parallel applications. With Spark, you as a programmer see sequential control flow. You write it like a normal program. The difference is that each of these instructions in Spark's API, under the hood, is going to work on a dataset. And that dataset — the first abstraction was Resilient Distributed Datasets, now it's DataFrames — is partitioned, with different partitions on different machines. So you have a dataset, and it's partitioned across different machines. And now you are going to execute a command on this dataset, and that command is going to execute in parallel on each partition under the hood. But when you write a program, you just operate on the dataset and apply some function.

The computation model, which is called bulk synchronous parallel, basically operates in stages. In each stage, you have a bunch of basically identical computations operating on different partitions of the same data. Between stages, you exchange data: you can shuffle and so forth, and you create another dataset for the next stage to operate on. The basic stages are map and reduce. It's very synchronous. One stage operates on a dataset, you do a shuffle to create another dataset, you have another stage, and another stage, and another stage. The programmer doesn't have control over parallelism, because you write one instruction, and the instruction, programmatically, is at the dataset level. It's only under the hood that you take that instruction, or function, and execute it on different partitions. This is great for data, and obviously Spark has a great API, a fantastic API, for data.

Now, Ray is much lower level. Ray exposes parallelism. Spark abstracts away parallelism; Ray exposes it. So you can actually say, "This task is going to operate on this data, and this task on that data. These are probably going to happen in parallel, and here are the dependencies between the outputs of these tasks." And you can have another task operating on the outputs of these different tasks. That gives you flexibility, but it's harder to program. On the other hand, in Spark — and in other such systems — you have a master, and this master is the only one which launches tasks: it launches all the tasks in a stage. In the case of Ray, a task can start other tasks or can start actors, and they can communicate between themselves. In the case of Spark, and the other BSP systems, the tasks in the same stage cannot communicate with each other. They just work on their partition, and the way changes are propagated is that you shuffle to create another dataset for the next stage.

But for humans, it's hard to write parallel programs. We are used to thinking sequentially. Even context switching is hard for humans, and context switching by definition is not necessarily doing things in parallel. It's multitasking — do a little bit of this, a little bit of that — and even that is hard. We are not used to thinking in parallel. This is difficult. This is hard. So that's another reason for the libraries: the libraries on top of Ray also abstract away parallelism. If you use RLlib, or if you use Tune, you don't know what is running where, and you don't need to worry about that. But that's the thing: it's a much more flexible, lower-level API.
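For contrast, here is the dataset-level style in Spark, a classic word count in PySpark (a sketch for illustration). You never schedule individual tasks; you apply operations to the whole distributed dataset, and Spark plans the stages and shuffles.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
sc = spark.sparkContext

# Every operation is written against the whole dataset; Spark runs it
# in parallel over the partitions and shuffles between stages.
lines = sc.parallelize(["ray spark", "spark spark"], numSlices=2)
counts = (
    lines.flatMap(lambda line: line.split())  # stage 1: map over partitions
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)     # shuffle boundary, stage 2
)
print(counts.collect())  # [('spark', 3), ('ray', 1)], in some order
```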
You know, I joke that if Ray delivers on its promise — which I hope it does — and you were developing Spark today, you'd develop Spark on top of Ray, and that's why you have others-
Lukas:
That was going to be my next question. That's great.
Ion:
Yes, yes. So, that's exactly the way it is. Fundamentally, another way to look at what Ray is: it's an RPC framework — Remote Procedure Calls — plus an actor framework, plus an object store which allows you to efficiently pass data between different functions and actors by reference. That's what it is. Instead of always copying the data, you just pass references. That's it. That's where the flexibility comes from.
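A small sketch of that object-store idea (illustrative names; standard Ray calls): `ray.put` stores a value once, and tasks receive references rather than copies.

```python
import ray
import numpy as np

ray.init()

@ray.remote
def column_means(array):
    return array.mean(axis=0)

# ray.put stores the array once in the shared object store; each task
# gets a reference and, on the same node, reads it without copying.
big_ref = ray.put(np.random.rand(10_000, 100))
refs = [column_means.remote(big_ref) for _ in range(4)]
print(ray.get(refs)[0][:3])
```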
Lukas:
When you were working on Databricks or Spark, were there use cases that you were seeing that made you want to develop Ray? Or was it something that you always wanted to create?
Ion:
No, no, no. One thing happened, and I'm a believer in this: you should develop a new system only if existing systems do not provide the functionality you need. And before you develop this new system, you'd better try to implement what you want on the existing systems.
Lukas:
Sure.
Ion:
We started developing Ray in 2015 I believe, in the fall. I taught a class, a graduate class. I was still the CEO of Databricks at that time. Robert and Philipp took that class. It was a systems class; they were machine learning students. Their project was about data-parallel training. Obviously, I told them, "Okay, use Spark for that. It's good." So they did use Spark. Actually, they modified it a little bit; they called the modification SparkNet. But then there were a few challenges. Spark was too rigid. With reinforcement learning, the computation model you need is much more complex, so to speak. You need nested parallelism and things like that. Spark, again, was too rigid. It was fantastic for data processing, but now you needed a lot more flexibility for something like reinforcement learning. It wasn't a good fit. And the other thing: Spark is on the JVM, in Java and Scala, which didn't, at least at that time, have very good support for GPUs. That's why we then started to develop Ray. Robert and Philipp developed something for themselves, to start.
Lukas:
That's great.
The story behind Spark and Databricks
Lukas:
I mean, I also would love to hear the story of Spark. I remember a time when Hadoop had the same value prop and everyone was really excited about it. It seemed like Spark replaced it in a massive way that I think you rarely see with technologies. I'd love to hear what the use case was that drove the development of Spark, and why you think the switch happened so quickly.
Ion:
You know, that's a great question. The story there also started from a class project. This was in 2009, in the spring. I was teaching this class — again, a graduate class — and it was on cloud computing services and applications, something like that. One of the projects there was about cluster orchestration. The problem was that you wanted the same cluster to be able to run multiple frameworks, to share the same cluster across different frameworks. One use case was actually upgrading. Hadoop at that time was not very backward compatible. If you had a new version, it was a big deal to upgrade. Most of the Hadoop deployments were on-prem, and on-prem it's hard to come up with another cluster to test the new version before you move to it. Therefore, if you had the ability to run two Hadoop versions side-by-side on the same cluster, this would be much better, and a great value proposition at that time.

Initially the system was called Nexus, but then someone from academia told us that this is a bad name because they already used that name, so it was a name conflict. We changed it to Mesos. Maybe some of you remember Apache Mesos; that was a precursor of Kubernetes. On this project, there were four people: Matei Zaharia, Andy Konwinski, Ali Ghodsi, and Ben Hindman. With Mesos, one of the value propositions is that you have all these frameworks, and it's going to make it easier to build a new data framework on top, because Mesos takes care of some isolation between the frameworks and does some heavy lifting: detecting failures, things like that, doing some scheduling. One of the reasons Spark was developed was as a showcase for Mesos, because now it's easier, writing a few hundred lines of code, to develop a new framework like Spark and run it on Mesos. So this was happening in mid-2009.

So what were the use cases? The primary use case was machine learning. There's a great story there. That was RADlab, and then AMPlab, and then RISElab. Each lab runs almost five years, where people from different disciplines sit together in the same open space, meaning machine learning, databases, systems people, all together. Around that time there was also this Netflix challenge, the prize, you remember? The $1 million prize for developing the best recommendation system. We had a postdoc, Lester, come to us, like, "Okay, it's a lot of data. What do we keep, what should we do, what can we use? You are the systems guys. Tell us what we should use." Well, you should use Hadoop. We are working with Hadoop. Okay, Lester went and used Hadoop; we showed him how to use it. But then he came back and said, "Well, this is super slow. It handles big data, it doesn't run out of memory, but it's so slow." And obviously it was slow, because most machine learning algorithms are iterative in nature, right? You start, you ingest the data, you refine a model until you get a model whose accuracy you are satisfied with. It converges. Each of these iterations translated into a MapReduce job, and each MapReduce job was reading and writing the data from the disk. And at that time, disk meant slow disk drives. So it took forever. That was one of the use cases.

The other use case was query processing. Also at that time, everyone — at least some large companies — was adopting Hadoop to process large amounts of data. After all, it's MapReduce; Google was doing MapReduce, so it must be good.
But now you also have these other people, like database people, and they're looking at the queries, the data, and so forth. Now you have all this huge data with someone else, and they're asking for access to the data. We said, "Okay, you get access to the data; the only thing you need to do is write this Java code, this MapReduce code, and you can process the data." But that was not what these people were doing. They were doing SQL, writing SQL statements. And then people started developing Hive — I think from Facebook — and Pig Latin from Yahoo, layers on top of Hadoop which provide a query language similar to SQL. So you get that, you have this system, now you can do queries on it. The problem when you do a query on that: these people are coming from databases. They write a query, they get the answer. Here you write the query and, well...come back in 2 hours to get some answer. So it was slow.

So these were the use cases Spark targeted. And the way it targeted them was by keeping as much of the dataset as possible in memory. The trick Spark had at that time was not only to keep the data in memory, but to answer: how do you ensure resilience? Fault tolerance. Because that was a big deal. If you remember, even building big computers, clusters, from commodity servers actually comes from Berkeley. It was a project called NOW, Network of Workstations, in the nineties. Before that, if you wanted a lot of power and so forth, you bought a supercomputer. But now you have these commodity servers, and guess what? They fail. So this way of thinking was very ingrained: you need to provide fault tolerance. That's why Hadoop puts the data on the disk. If it's on the disk, hopefully it's durable, because it creates three copies of the data. So you take care of that.

But in the case of Spark, now you keep the data in memory. So how do we do fault tolerance? You do fault tolerance like I discussed earlier: because you have only tasks, and the tasks don't have side effects, you keep the lineage of tasks, you record that. If something fails, you re-execute the tasks which created the data you lost because of the failure. That was Spark. So now, because the data is in memory, machine learning applications are going to run much faster, because between iterations the data is still in memory. And by the way, it was also a more flexible computation model, because Hadoop has only the two MapReduce stages, while here you can chain a lot more stages. And obviously, if the data is in memory, queries are going to return much faster, even if you have to scan the entire data which is in memory. These are the use cases which powered Spark.

And now you are asking, "Okay, how did it displace Hadoop?" You see, Hadoop — in some sense — had a lot of hype. For good reasons, but it was still in a bubble. It was quite amazing, because everyone — at least in the tech world — knew about Hadoop and big data. But in 2012, 2013, that period, there were not a lot of companies actually using Hadoop. The Hadoop Summit was like 300, 500 people. Maybe 700 people. It was like a bubble. And then Spark came into that bubble and said, "We are going to provide a better computation engine, and we are going to work with the rest." Because Hadoop has two parts: a computation engine, which is MapReduce, and HDFS, which is the file system.
Initially, it was a fight...not a fight, but Spark was viewed for a long time as something that can only operate on small data which fits in memory. But when we started, it wasn't anything difficult to operate on data on disk, and Spark was actually doing that from day one. The focus was on in-memory, because that was what it was doing particularly well. Then it was a very smooth replacement, because it was now another engine in the same ecosystem, and then Cloudera bet on it at the end of 2013. And then it snowballed from there.
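The in-memory idea Ion describes is essentially one call in Spark. A toy sketch (the iterative computation here is made up for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterative").getOrCreate()
sc = spark.sparkContext

# cache() pins the dataset in cluster memory; in Hadoop MapReduce, each
# iteration below would re-read its input from disk as a separate job.
data = sc.parallelize(range(1_000_000)).map(float).cache()

estimate = 0.0
for _ in range(10):          # iterative algorithms reuse the same data
    mean = data.mean()       # every pass after the first hits memory
    estimate = 0.5 * (estimate + mean)

print(estimate)
```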
Lukas:
Was it obvious that there was an opportunity to start a company around Spark?
Ion:
Initially, we built Spark and it was an academic project. People started to use it, and the obvious question was, "Well, as a company, am I going to bet on it? I like Spark, but can I bet on it? What happens when Matei or whoever graduates? What happens to the project?" We really wanted to have an impact, because we saw this as a much better way to do data processing, and we saw that data was a big problem. You eventually need to have a company behind the open source, to make the open source a viable solution, at least for large organizations. There were two ways to go about it. I'm not going to give names, but we went to a Hadoop company — we were friends with Cloudera, Hortonworks, and so forth, even MapR; we knew people, they were actually sponsors of our lab at Berkeley, we were meeting all the time — and we actually asked, don't you want to take over Spark? But they didn't, because there were other plans about what would come after Hadoop and MapReduce as a computation engine. And then it just happened that the timing aligned. I was about to take a leave, Matei was graduating, and all the other people...Andy and Patrick were already thinking about creating a company. So it all came together and we said, "Okay, let's start a company."

We had a lot of discussion when we started the company, and one of the big questions was whether the company's success was predicated on Spark's success, the open source's success. Remember, when we started, things were not very clear. We started in 2013; we started to talk about the company in the fall of 2012. When we looked around, you had Linux, which is a pretty special phenomenon. But at that time, there was no unicorn based on open source. There was MySQL, but that was later sold to Oracle.
Lukas:
Cloudera wasn't big yet?
Ion:
Cloudera was not big enough. Hortonworks was small; it wasn't big enough. It was only one or two years later that they started to have these big rounds at valuations of four-point-something billion. Also, Cloudera was...people think they're called "Cloudera" because they initially wanted to do it in the cloud, but they saw that there wasn't enough business in the cloud, and probably that was true then. So they pivoted to on-prem.

We started the company. Long story, but we decided to go — at that time — with this new business model: we only provided a hosted version of Spark on the cloud, initially only on AWS. We decided that the success of the open source was necessary for the success of the company. We said, "Okay, if the open source is going to be successful, then if we build the best product for the open source, hopefully we are going to get these customers." Even if, initially, there might be other open source companies providing Spark, or the clouds themselves. Because Cloudera then provided Spark to their users, then MapR, then Hortonworks, and obviously AWS and Azure from Microsoft, with [?] inside. We committed, we bet on the success of the open source back then, and we put a lot of effort into that.
Lukas:
It seems like now, building a business on an open source model is an incredibly popular strategy for infrastructure companies.
Ion:
Yeah.
Lukas:
Do you ever...
Ion:
Databricks was one of the first to do that. Before then, it was on-prem; that was the business model. The on-prem business model was a little bit heavier, much heavier. And remember that some companies founded at the same time failed, even though the open source would be huge...well, not failed, but they're not as successful as people believed they would be. It wasn't clear at all. I mean, that was, at that time, a pretty big bet. We got very hard pushback and a lot of pressure to go on-prem, at least initially. But now, building a hosted offering for an open source project is quite common.
Why TensorFlow and PyTorch haven't monetized
Lukas:
Why do you think the popular deep learning frameworks, like TensorFlow and PyTorch, don't have something like that hosted in the cloud? Even though enterprises generally use them, that business model doesn't exist there.
Ion:
It's a great question. I can just...obviously, this is a hypothesis. For PyTorch, you have Grid AI right now-
Lukas:
That's true.
Ion:
-providing some of the hosting. I think it's that these are open sources coming from large companies, and these companies themselves are not interested in monetizing them directly. The way Google probably thinks about monetizing TensorFlow, for instance, is that TensorFlow and everything around it would work best on GCP, in particular using TPUs, and that's how they're going to monetize. The best place to train models which use TensorFlow is going to be GCP. The same with Kubernetes. It's hard for a company which doesn't have the creators of an open source project to create a business around it, to...it's harder. If you don't have the creators of that particular open source as part of the company, then it's just harder. You cannot orchestrate, you cannot develop the open source and the offering in sync. I'm not aware of a huge success so far of a company built behind Kubernetes. But how could you do that? Most of the Kubernetes developers are still with Google. So I think it has something to do with that.

And the other thing is that hosted offerings are more valuable when the solution is distributed, because then the value is in managing a cluster. As long as you run on only a single machine, the value is a little bit less. Now, of course, TensorFlow can run on a bunch of machines and so forth; there's TensorFlow distributed. But I do think these are the two things. One, most uses of PyTorch and TensorFlow are still on a single machine. And two, most of the developers of these open source libraries are still with these large companies, like Google and Facebook. I may be wrong, but that's what I think the differences are, at least some of them.
Lukas:
Interesting.
Picking co-founders and other startup advice
Lukas:
I guess another question I wanted to ask you — as someone who started these two very, very successful companies — do you think that the humans you picked as co-founders had anything to do with that? Was there something that you saw in them, or some commonalities between the co-founders that you picked, that you think made them effective?
Ion:
Oh, absolutely. Absolutely. And you know that, Lukas. The people are so important. I'm telling everyone that the thing I'm most proud of at Databricks — and I'm saying Databricks because it's an older company, so there's been more time to observe — is the original team. At some point, to be successful, you need everything, including being lucky, right? But I think that the people were quite complementary. They all have — despite the fact that they all have a lot of accomplishments — relatively low egos. We were very open, and we were a team. Like Matei, I've known him since 2006, 2007, when he joined Berkeley. Ali came to Berkeley in 2009. Andy was also there at that time, then Patrick. So we knew each other for a long time. We were together. We were very open in discussing any issues. We were not always agreeing; we had shouting matches and so forth. I remember that later, people told us that in this small office in Berkeley — and we didn't realize — when we were having these very passionate exchanges, people were hearing almost everything, because there was not good sound isolation. It was at some level scary, because you have these people who are supposed to lead the company, and they don't agree on even probably basic things. But we were very comfortable debating. I think it's the same with Robert and Philipp at Anyscale. It's again low ego and so forth. I think the one thing you want from everyone, including the CEO, is for everyone to put the success of the company above their own goals. And that's absolutely true. What is the saying? "There is no winner on a losing team," right?
Lukas:
Right.
Ion:
I think this is what I would say. When you know people for a long time, you have that trust, and trust is absolutely fundamental, because there are high and low points in the life of every company. I imagine a small company is like a plane which is flying very close to the ground. There is not a lot of room you have there. I'm not saying that everyone is absolutely humble or whatever, but they absolutely need to believe that the most important thing is the success of the company.
Lukas:
When you set up Ray as a business, you had been running Databricks for a while and it was starting to see real success. I imagine you were quite a different person. Did you think about starting that company differently than starting Databricks?
Ion:
What strikes me is how much great feedback you get from people, and how much of this feedback you ignore. If I think back, it's about fundamentals. Everyone knows, at least in theory, what you need to build a great company. Of course, you need to have a great team. You need to have a vision, a strategy. You need to really focus on product-market fit. Get early customers, make them super successful, iterate from there. Everyone knows how to do it. But what strikes me is how hard it is to do it. I don't think people do the wrong things because they don't know what the right thing is. They do the wrong thing because doing the right thing is very hard. Imagine that you go to San Francisco, or pick your favorite city, and you ask passers-by, "What does it take to be successful?" What will people say? You need to work hard, focus, have a little bit of luck. Things like that. Be driven, persistent, whatever they're going to tell you. You are going to get a lot of similar answers. All of them actually know what it takes, but how many people do it? And the reason is, it's just hard to do it. It's damn hard.

When I'm looking back, there are some things we stuck with at Databricks. Like, we picked the cloud. Why did we pick the cloud? Because of focus; we wanted to focus on one thing. We realized early on that developing for the cloud and for on-prem are pretty different engineering processes. You'd need to come up with two teams, and we were not even sure we could build a great engineering team doing one thing, let alone two. So, things like that. We thought, "Okay, we are fine to do the cloud, because we believe the cloud market is going to be big enough for us." If you tell me that the on-prem market is whatever tens of billions — I don't remember the numbers at that time — what can I do about it? I have 40 people, or 80 people. To capture any sliver of that market would take years. So why focus on that now? These are the things. What I'm trying to say is that we didn't do anything other than, in some sense, the basics. And it's the same with Anyscale. You try to focus on where you want to innovate, and for the rest, you just try to use the state-of-the-art solution.

So, how was it different the second time? It just, in some sense, makes you more confident that these basic things work. It also makes you more sure that there are very rarely shortcuts. Just hard work. And it makes you appreciate even more — I didn't say it so far, but it's probably the most important thing — how important execution is. Like John Doerr was saying, "Ideas are easy. Execution is everything." And you get some people who make such a huge difference. Like Ron Gabrisko, who eventually became our CRO. He joined us when the company was at a few million in ARR, and took us to many hundreds of millions now. Or like with hiring: everyone tells you back-channel references are so important, but it's hard because it requires effort. Everything requires effort. Unfortunately, I cannot tell you there is any silver bullet. Just stick with the basics, and remember there are no shortcuts. You also need to remember that every company is different. It has to be different. If you think they're the same, something probably is wrong, because things change. Like, for instance, Anyscale versus Databricks: when we started Databricks, AWS was the biggest cloud.
Right now you have multiple clouds; you cannot ignore them: GCP, Azure. When we started Databricks, data scientists were the main people we focused on, then data engineers and so forth. Here it's more developers and machine learning developers. And different users obviously want different things. Again, it's nothing earth-shattering; it's something obvious. But I think it's these little things. And then again, it's execution, it's speed.
Lukas:
Was it hard? It's funny, I'll just say, I get asked that question a lot, about a second-time company: what do you do differently? And I answer almost identically to you: you know what you're supposed to do, but in the details you do little things better. But I'm curious, because one thing that's different about your experience than mine is that the second time, you're founding a company and you're not the CEO. Was it hard to work with Robert in some ways? I mean, he seems very impressive and smart, but I think it might be his first corporate job in his life.
Ion:
Yeah.
Lukas:
You must feel like, "I know how to do this and you're not doing it," or did that not happen?
Ion:
No. I think the reason I was CEO at Databricks early on was that no one else was sure they were going to do it long term. Actually, Ali wanted to go into academia. Obviously, Matei had a job at MIT; he was on leave. And so forth. With Robert and Philipp, they didn't look at anything else. They didn't interview anywhere else. This is what they wanted to do. And this arrangement, right now, I really like. Again, we've worked together since 2015, four years before we started the company, so we know each other very well. In terms of responsibilities, Robert, myself, and to some degree Philipp, we divided them pretty well. As you know, as CEO there are so many things to do. Having someone you can rely on and split some of the responsibilities with helps solve...
Lukas:
That makes sense.
The early signs of sky computing
Lukas:
Well, we're going slightly over time, and we always like to end with two questions, which I'll do with you. But maybe I'll make the second-to-last one more specific. Is there another project like Spark or like Ray that you're dreaming of, that you would do if you had more time?
Ion:
I think the one thing I'm looking at now — and this could be the next lab at Berkeley — is what we tentatively call sky computing. It's multi-cloud, but think about it as an internet for the clouds. Fundamentally, the belief here is this: what the internet did was stitch together a bunch of disparate networks and provide the abstraction of a single network. When you send your packet, you don't know through which networks your packet will travel. I think that what we are going to see more and more is the emergence of this layer — we call it the intercloud layer — which will abstract away the clouds. You see the early signs. It will also lead to specialized clouds. Think, for instance, about a machine learning pipeline: you have data processing, you have training, you have serving. Each of these components you can actually run on a different cloud, for good reasons. For instance, maybe you process confidential data and you want to remove the PII from the data. You can decide to do that on Azure, because it has Azure confidential computing. You can decide to do training on TPUs, and you can decide to do serving using Amazon Inferentia, the new chips. I think you're also going to see the rise of more specialized clouds, especially for machine learning. There is an announcement from NVIDIA with Equinix, which is really GPU-optimized data centers, tightly built. So I think that's something very exciting. If you look at the trends, there are these kinds of evolutions in which the clouds, by necessity or driven by open source, provide more and more similar services. This provides very good ground for the next layer to emerge and abstract them away.
Lukas:
Wow. What an intriguing answer. It seems a little crazy, but I'm kind of convinced as you talk about it.
Ion:
Well, I mean, you asked for it. I think there are many projects, but this is one...I think it will happen. And by the way, with every company, with everything, you probably need to take a bet, right? You need to make a bet. If you don't make a bet, you are doing what everyone else is doing. You guys made a bet. What you are doing was absolutely not obvious when you started.
Lukas:
Yeah. I agree.
Ion:
This can be a great company.
Lukas:
Totally.
Ion:
You have to. And if you are wrong, at least you tried.
Breaking problems down and prioritizing
Lukas:
All right. Well, usually we end with a question for ML researchers. We ask, "What is the hardest part about getting a machine learning model into production?" But I think for you — you're a company builder, also an academic — as a founder, what is the hardest part about building a really successful, big business that people wouldn't see from the outside?
Ion:
I think that probably the hardest thing, one of the hardest things, is that obviously with each company there are ups and downs. When things are down, you may need to make corrections. It can be down because a product doesn't deliver, maybe because you are on the wrong path with the product, maybe because of the wrong people...or, I mean, not the best fit. When things go well, it's easy. It's great. But when they don't, it's about always going back to the fundamentals. Trying not to be emotional, and trying to always look at the facts: whether it's trends in the industry — you look at the data — whether it's data coming from the customers, whether it's facts with respect to someone who maybe is not the best fit. Whether things are hard or good, we are humans, we are emotional. I always found that it's hard to push the emotions into the back seat and try to think only about the facts when you make decisions. The harder things are, the more emotional you are, because you take it personally and things like that. So I think this is what I found to be the hardest thing. And I'm also an emotional person; to some degree, I also get really excited. But that's what I found. In general, when you try to make decisions based on emotions — at least in my case; I think for some people it works, it's gut feeling and so forth — for me, it did not work.
Lukas:
Do you have any tricks for managing your emotions and thinking clearly under stress? I'm asking for a friend by the way.
Ion:
Yeah. I try to simplify the problem. There are many things coming at you when you are under stress, and I try to say, "Okay, what is the most important thing?" and try to forget about everything else, to simplify the problem. Then it's easier to make a decision based on what the important thing is. That's what I discovered, especially when it's very hard to make a decision because there are multiple dimensions associated with the decision. For instance, as I mentioned earlier, we had a lot of discussion when we started Databricks: okay, how important is it for the open source to be successful? Because now we have a company; we need to build a product and have some revenue at some point. Obviously, there are four possibilities: open source successful and company not successful, open source successful and company successful, and so on. You have this 2x2. When we looked and thought about it, we couldn't see a path for the company to be successful — a credible one — without the open source being successful. And once we reached that conclusion, there was no other discussion; we just focused on that. We just try to find ways to simplify, and hope that as long as you focus on the main thing, everything else will follow. That's my...I'm oversimplifying, and sometimes maybe that's not good. But try to think about: what is the most important thing I need to solve? What is the most important dimension?
Outro
Lukas:
Well, that's good advice and a good spot to end, I think. Thank you very much. That was a fun interview.
Ion:
Thank you. Thank you. Bye-bye.
Lukas:
Appreciate it.
Lukas:
If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description, where you can find links to all the papers that are mentioned, supplemental material, and a transcription that we worked really hard to produce. So check it out.