Stephan Fabel — Efficient Supercomputing with NVIDIA's Base Command Platform

Stephan talks about Base Command Platform, NVIDIA's software platform for its DGX SuperPOD infrastructure. Made by Angelica Pan using Weights & Biases

About this episode

Stephan Fabel is Senior Director of Infrastructure Systems & Software at NVIDIA, where he works on Base Command, a software platform to coordinate access to NVIDIA's DGX SuperPOD infrastructure.
Lukas and Stephan talk about why having a supercomputer is one thing but using it effectively is another, why a deeper understanding of hardware on the practitioner level is becoming more advantageous, and which areas of the ML tech stack NVIDIA is looking to expand into.

Listen

Apple Podcasts Spotify Google Podcasts

Timestamps

0:00 Intro
1:09 NVIDIA Base Command and DGX SuperPODs
10:33 The challenges of multi-node processing at scale
18:35 Why it's hard to use a supercomputer effectively
25:14 The advantages of de-abstracting hardware
29:09 Understanding Base Command's product-market fit
36:59 Data center infrastructure as a value center
42:13 Base Command's role in tech stacks
47:16 Why crowdsourcing is underrated
49:24 The challenges of scaling beyond a POC
51:39 Outro

Watch on YouTube

Transcript

Note: Transcriptions are provided by a third-party service, and may contain some inaccuracies. Please submit any corrections to angelica@wandb.com. Thank you!
Stephan:
Scheduling on a supercomputer typically is by Post-it. It's, "Joe, it's your cluster this week but I need it next week." It doesn't work that way at scale anymore. You want to interact with something that is actually understanding the use of the cluster, optimizing its use so that the overall output across all of the users is guaranteed at any given point in time.
Lukas:
You're listening to Gradient Dissent, a show about machine learning in the real world. And I'm your host Lukas Biewald. This is a conversation I had with Stephan Fabel, who is a Senior Director of Product Management at NVIDIA, where he works on the Base Command platform software that runs on top of NVIDIA's DGX machines, which are basically the most powerful computers that you can buy to train your machine learning models on top of. It's fun to talk about the challenges that customers face when they have access to basically unlimited compute power. This is a super fun conversation, and I hope you enjoy it.

NVIDIA Base Command and DGX SuperPODs (1:09)

Lukas:
My first question for those who haven't heard of NVIDIA Base Command, since you are the senior product manager on it, can you tell me what Base Command aspires to do?
Stephan:
In a way, think of NVIDIA Base Command as your one-stop shop for all of your AI development. It's a SaaS offering from NVIDIA where you log on directly, or you log on via an integration partner, and you leverage the capabilities of Base Command to schedule jobs across a variety of infrastructures. You do that in a secure manner. You gain access to your data and retain access to your data and data sovereignty, across the infrastructure that you're scheduling the jobs on. Then it's really just a matter of optimizing that job run on NVIDIA infrastructure. That's really what Base Command aims to do.
Lukas:
These jobs, they're model training jobs exclusively, or is it broader than that?
Stephan:
Model training jobs are generally the ones that we focus on, but we also do model validation, for example. You could have single-shot inference runs as well.
Lukas:
Are there other pain points of model development that Base Command aspires to solve or tries to solve?
Stephan:
Yeah. I think that with a lot of the issues that you have with AI infrastructure, that's really where it starts. The question is, "Where do you train your models?" and "How do you go about it?" Most people start in the cloud to train their models. That's reasonable because just about any development effort would start in the cloud today. At some point you reach a certain amount of scale where you say, "Well, it may not deliver the performance I need, or it may not deliver the scale I need, at the economics I'm comfortable with," et cetera. For those high-end runs, you typically look at infrastructure alternatives. Then the question becomes, "Okay, I already am used to this whole SaaS interaction model with my AI development. How do I maintain that developer motion going forward?", where I don't have to teach them something new just because the infrastructure is different. What we have at NVIDIA is this DGX SuperPOD. The idea is to say, "Well, how about we try this and develop Base Command as a way to access a SuperPOD, just as a cloud API would behave?"

DGX SuperPOD (3:48)

Lukas:
A DGX SuperPOD, is that something that I could put in my own infrastructure or is that something that I could access in the cloud or both? How does that work?
Stephan:
Typically, our customers for SuperPODs...maybe we should take a step back and understand what it is. The easiest way to think about — or the most straightforward way to think about — a DGX SuperPOD is to think of it as a supercomputer in a box. It's a packaged-up infrastructure solution from NVIDIA that you can purchase, and it'll be deployed on premises for you, in your own data center or in a colo facility. Actually we found that a colo facility is the most likely place for you to put that because it is a pretty intensive investment. Number one, not just in terms of the number of DGXs that are involved, for example, but also, of course, in terms of the power draw and cooling and just the requirements that you need to bring to even run this beast, essentially. That's really what then dictates where this thing usually is. What we did is we put it in a colo facility and made it available right now in a directed availability fashion. We have a couple of golden tickets for some customers who want to be on this thing, and then they get to select the size of the slice they want and access that through Base Command.
Lukas:
I see. When you use Base Command, you're using DGX, but it's in NVIDIA's cloud and you get kind of a slice of it. Is that right?
Stephan:
Yeah, that's right. I know we call it NVIDIA GPU Cloud, but really think of the whole Base Command proposition today as a SaaS portal that you access, that is currently coupled to something more like a rental program. It's less of a bursty, elastic cloud model; think of it more like, "Okay, I have three DGX A100s today, and then maybe in the next couple of months, I know I need three more. I'll call NVIDIA and say, 'Hey, I need three more for the next month.'" That's kind of how that works.
Lukas:
Maybe let's start with the DGX box. What would a standard box look like? What's its power draw? How big is it? How much does it cost? Can you answer these questions? Just in order of magnitude.
Stephan:
You're looking at about $300,000 for a single DGX A100. It'll have 8 GPUs and 640 gigabytes of GPU memory that come along with that. Those are the A100 GPUs, the latest and greatest that we have. You're going to look at about 13 kilowatts per rack in a standard deployment.
Lukas:
13 kilowatts?
Stephan:
Yeah.
Lukas:
Constant or just training?
Stephan:
No, no. When you fire these things up, these puppies, they heat up quite a lot. They're pretty powerful and the DGX SuperPOD consists of at minimum 20 of those. If you think about that, that's what we call one scale unit. And we have customers that build 140 of those.
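
For a rough sense of scale, here is a minimal sketch that just works out the figures quoted above; the per-rack density and exact pricing will vary by configuration, so treat these as order-of-magnitude numbers only.

```python
# Back-of-the-envelope figures for a DGX SuperPOD scale unit,
# using only the numbers quoted in this conversation.
DGX_PRICE_USD = 300_000   # approximate list price per DGX A100
GPUS_PER_DGX = 8
GPU_MEMORY_GB = 640       # total GPU memory per DGX A100
SCALE_UNIT_NODES = 20     # minimum SuperPOD scale unit

print(f"GPU memory per GPU:  {GPU_MEMORY_GB / GPUS_PER_DGX:.0f} GB")   # 80 GB
print(f"GPUs per scale unit: {SCALE_UNIT_NODES * GPUS_PER_DGX}")       # 160
print(f"Rough hardware cost: ${SCALE_UNIT_NODES * DGX_PRICE_USD:,}")   # $6,000,000
```
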
Lukas:
Wow. What kinds of things do they do with that?
Stephan:
Well, just all the largest multi-node jobs that you could possibly imagine, starting from climate change analysis. Large, huge data sets that need to be worked on there. NLP is a big draw for some of these customers. Natural language processing and the analytics that comes with those models is pretty intensive, data intensive and transfer intensive. We keep talking about the DGXs and of course we're very proud of them and all of that, but we also acquired a company called Mellanox a year ago. So of course the networking plays a huge role in the infrastructure layout of such a SuperPOD. If you have multi-rail InfiniBand connections between all of those boxes and the storage, which typically uses a parallel file system in a SuperPOD, then what you'll get is essentially extreme performance even for multi-node jobs. Any job that has to go above and beyond multiple GPUs, a DGX SuperPOD architecture will get you there. Essentially at, I would say, probably one of the best performance characteristics that you could possibly have. The SuperPOD scored number 5 on the TOP500. It's nothing to sneeze at.

Lukas:
How does the experience of training on that compare to something that listeners would be more familiar with, like a 2080 or 3080, which feels pretty fast already? How much faster is this, and do you need to use a special version of TensorFlow or PyTorch or something like this to even take advantage of the parallelism?
Stephan:
I'd have to check exactly how to quantify an A30 against an A100, but think of it as this. Any other GPU that you might want to use for training in a traditional server, think of it as a subset of the capabilities of an A100. If you use, for example, our MIG capability, you can really slice that GPU down to a T4-type performance profile and say, "Well, I'm testing stuff out on a really small performance profile without having to occupy the entire GPU." Once you have the same approach from a software perspective...if you do your sweeps, then you do essentially the same thing. Or you could do those on MIG instances, and thereby you don't need as many DGXs when you do it. I guess I should say that that's the beauty of CUDA. If you write this once, it'll run on an A30, it'll run on an A100, it'll run on a T4. In fact, we provide a whole lot of base images that are free for people to use and to start with, and then sort of lift the tide for everybody. These are pre-optimized container images that people can build on.
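
The "write once, run on any CUDA GPU" point is easy to see from a framework's perspective. Below is a minimal PyTorch sketch (PyTorch is our choice for illustration, not something prescribed here) that runs unchanged on a T4, an A30, a full A100, or a MIG slice, because the code only ever talks to whatever CUDA device is visible.

```python
import torch

# The same training step runs on whatever CUDA device is visible,
# whether that's a T4, an A30, a full A100, or a MIG slice of one.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if device.type == "cuda":
    print("Training on:", torch.cuda.get_device_name(device))

model = torch.nn.Linear(1024, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

x = torch.randn(64, 1024, device=device)          # stand-in batch
y = torch.randint(0, 10, (64,), device=device)    # stand-in labels

loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
```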

The challenges of multi-node processing at scale (10:33)

Lukas:
I would think there'd be a lot of networking issues and parallelization issues that would come up, maybe uniquely, at this scale. Is that something that NVIDIA tries to help with? Does CUDA actually help with that? I think of CUDA as compiling something to run on a single GPU.
Stephan:
Absolutely. If you think of CUDA as a very horizontal platform piece in the software stack of your AI training stack, then components like NCCL, for example, provide you with pretty optimized communication paths for multi-GPU jobs, but they'll also span multiple nodes. This starts from selecting the right NIC to exit a signal, because that means you're going to the right port on the top-of-rack switch. That means you minimize the latency that your signal takes from point A to point B in such a data center. When you look at CUDA, and especially at components like NCCL and Magnum IO as a whole — which is our portfolio of communication libraries and storage acceleration libraries — it starts from the integration of the hardware and the understanding of the actual chip itself, and then it builds outward from there. The big shift at NVIDIA that we're looking at accelerating with the use of Base Command is this understanding that NVIDIA is now thinking about the entire data center. It's not just about, "I got the newest GPU, and now my game runs faster." Certainly that's a focus area for us as well. But if you take the entire stack and work inside out, essentially, then the value proposition just multiplies the further out you go. With Base Command, this is sort of the last step in this whole journey to turn it into a hybrid proposition. I know it's very high-level right now and abstract, but it's a super interesting problem to solve. If you think about how data center infrastructure evolved over the last, let's say 10 years or so, then it was about introducing more homogeneity into the actual layout of the data center. Certain type of server, certain type of CPU, certain type of top-of-rack switch, and then a certain layout. You have all these non-blocking fabric reference architectures that are out there, et cetera, et cetera. Ultimately, now that everything is homogeneous, you can make it addressable using an API because everything is at least intended to behave in this very standard and predictable way. We worked our way up there. This has never been the case for something like a supercomputer. A supercomputer was a 2-year research project with a lot of finagling and "Parameters here, and then set this thing to a magic value and that thing to a magic value, and then run it on 5 minutes after midnight, but not on Tuesdays," and then you get the performance. This whole contribution that we're really making here is that we're raising that bar to a predictable performance profile that is repeatable. Not just inside an NVIDIA data center, where we know 5 minutes after midnight and so on, but also in your data center or in an actual random data center, provided you can afford the cooling and power of course. But then once we got that out of the way, we're pretty good. That's a real shift forward towards enabling enterprises, real, bona fide, true blue-chip companies, to actually adopt AI at a larger scale.
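
To make the NCCL point concrete: in a framework like PyTorch, multi-GPU and multi-node training typically rides on NCCL through the distributed backend, and the same script scales from one node to many when launched with a tool such as torchrun. This is a minimal sketch under that assumption, not a description of how Base Command itself wires things up.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # NCCL handles the GPU-to-GPU (and node-to-node) communication paths
    # described above; the training script itself stays the same at any scale.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(1024, 10).cuda(), device_ids=[local_rank])
    x = torch.randn(32, 1024).cuda()
    y = torch.randint(0, 10, (32,)).cuda()

    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()   # gradients are all-reduced over NCCL here

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nnodes=2 --nproc_per_node=8 train.py`, the identical script spans 16 GPUs across two nodes.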

Lukas:
It's interesting. One thing I was thinking of as you were talking is, most of the customers that we work with...we don't always know, but I think what we typically see with our customers that are training a lot of machine learning models, is they use a lot of NVIDIA hardware, but it's less powerful hardware than the DGX. It might be P100 or basically whatever's available to them through Amazon or Azure or Google Cloud. I think they do that for convenience, I think people come out of school knowing how to train on those types of infrastructure. Then their compute costs do get high enough. I mean, we do see compute costs certainly well into the seven, eight figures. Do you think that they're making a mistake by doing it that way? Should they be buying custom DGX hardware and putting that into colo, would they actually save money or make their teams more productive if they did it that way?
Stephan:
Oh God, no. Just to be really clear, Base Command is not a cloud. We're not intending to go out there and say, "Go here instead of Amazon," or something like that; that's not what we are saying. First of all, you can get A100 instances in all the major public clouds as well. You could have access to those instances in just the same way that you're used to consuming the P100s or V100s or anything like that. Whether it's Pascal or Volta or Ampere architecture, all of it is available in the public cloud. Like I said in the beginning, it's just a perfectly acceptable way to start. In fact, it's the recommended path, to start in the cloud, because it requires the least upfront investment. I mean, zero. And you get to see how far you can push something, an idea. Once you arrive at a certain point, I think then it's a question of economics, and then just everything will start falling into place. What we found is that enterprises typically arrive at a base load of GPUs. In other words, at any given moment in time, for whatever reason, there is a certain number of GPUs working. Once you identify that, "Hey, every day I keep at least 500 GPUs busy," then typically the economics are better if you purchase. Typically, a CapEx approach works out better. It's not always the case, but typically that might be the case. To meet that need in the market is where we come in. What Base Command right now offers is this...it's not all the way to "purchase it"; you don't have to have that big CapEx investment up front, but it is something in between. You do get to rent something; it's not entirely cloud, but you're moving from the Uber model to the National Car Rental-type model. Once you're done renting, then you maybe want to buy a car. But the point is that there's room here on that spectrum. Currently we're right smack in the middle of that one. That's typically what we say to customers. Just actually yesterday, somebody said, "Well, how do you support bursting? And how elastic are you?" I said, "That's not the point here." You want to be in the cloud when you want to be elastic and bursty, but typically that base load is done better in different ways.
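
The "base load" argument is ultimately a break-even calculation. Here is a minimal sketch of that reasoning with entirely hypothetical prices; real cloud rates, discounts, networking, power, facilities, and staffing would all move the answer.

```python
# Hypothetical rent-vs-buy sketch at a steady base load.
# Every price here is a placeholder, not a quote from NVIDIA or any cloud.
CLOUD_COST_PER_GPU_HOUR = 3.00   # assumed on-demand rate
OWNED_SYSTEM_COST = 300_000      # one 8-GPU system, per the figure above
GPUS_PER_SYSTEM = 8
AMORTIZATION_YEARS = 3
HOURS_PER_YEAR = 24 * 365

base_load_gpus = 500             # "every day I keep at least 500 GPUs busy"

cloud_annual = base_load_gpus * CLOUD_COST_PER_GPU_HOUR * HOURS_PER_YEAR
systems_needed = -(-base_load_gpus // GPUS_PER_SYSTEM)   # ceiling division
owned_annual = systems_needed * OWNED_SYSTEM_COST / AMORTIZATION_YEARS

print(f"Cloud, fully utilized:        ${cloud_annual:,.0f}/yr")   # ~$13.1M
print(f"Owned, hardware only (CapEx): ${owned_annual:,.0f}/yr")   # ~$6.3M
```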

Why it's hard to use a supercomputer effectively (18:35)

Lukas:
What breaks if I don't use Base Command? If I just purchased one of these machines and I'm just shell-ing into the machine and kicking off my jobs the way I'm typically used to, or running something in a notebook, what starts to break, where you know that you need something more sophisticated?
Stephan:
On the face of it, nothing really breaks. It just takes a lot of expertise to put these things together. If you buy a single box, then there's probably very little value add in adding that to a SaaS platform, per se. But as soon as you start thinking about a cluster of machines — and like I said, more and more of our enterprise customers are actually thinking about deploying many of those, not just a single machine — then as soon as that comes into play, then you're faced with all the traditional skill challenges in your enterprise that you'd be used to from just rolling out private cloud infrastructure. It's the same exact journey. It's the same exact challenge. You need to have somebody who understands these machines and somebody who understands networking, somebody who understands storage, Kubernetes, and so on and so forth. As soon as you build up the skill profile that you need to actually run this infrastructure at scale and at capacity, then you're good to go, right? You can build your own solution, but typically what you'd be lacking are things that then help you make the most of it. All the kinds of efficiency gains that you'd have by just having visibility into the use of the GPU. All the telemetry and the aggregates by job and by user and by team. This entire concept of chargeback, et cetera, is a whole other hurdle that you then have to climb. What we're looking at is people who want to build a cluster, typically they want to do that because they want to share that cluster. It's a pretty big beast. If you build a big cluster, might as well, because you want to be more efficient and you want to make the most of it, and so now you need to have a broker who brokers access to these supercomputers. As ridiculous as it sounds, scheduling on a supercomputer typically is by Post-it. It's, "Joe, it's your cluster this week but I need it next week." It doesn't work that way at scale anymore. You want to interact with something that is actually understanding the use of the cluster, optimizing its use so that the overall output across all of the users is guaranteed at any given point in time.
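
The per-GPU visibility Stephan mentions is exposed by NVIDIA's management library (NVML); a platform like Base Command layers job, user, and team aggregation and chargeback on top. Here is a minimal sketch of just the raw sampling side, using the pynvml bindings; the aggregation layer is only hinted at and is our illustration, not Base Command's implementation.

```python
import time
from collections import defaultdict

import pynvml  # NVIDIA Management Library bindings (nvidia-ml-py)

def sample_utilization(samples=5, interval_s=1.0):
    """Collect per-GPU utilization and memory samples; a real platform
    would tag these with job/user/team metadata before aggregating."""
    pynvml.nvmlInit()
    stats = defaultdict(list)
    try:
        for _ in range(samples):
            for i in range(pynvml.nvmlDeviceGetCount()):
                handle = pynvml.nvmlDeviceGetHandleByIndex(i)
                util = pynvml.nvmlDeviceGetUtilizationRates(handle)
                mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
                stats[i].append((util.gpu, mem.used // 2**20))  # %, MiB
            time.sleep(interval_s)
    finally:
        pynvml.nvmlShutdown()
    return stats

if __name__ == "__main__":
    for gpu, readings in sample_utilization().items():
        avg_util = sum(u for u, _ in readings) / len(readings)
        print(f"GPU {gpu}: average utilization {avg_util:.0f}%")
```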

Lukas:
I have the sense that many years ago, decades ago, when I was a kid or maybe even before that, supercomputers felt like this really important resource that we used for lots of applications. Then maybe in the nineties or the aughts, they became less popular and people started moving their compute jobs to sort of distributed commodity hardware. And maybe they're kind of making a comeback again. Do you think that's an accurate impression? Do you have a sense of what the forces are that make supercomputers more or less interesting, compared to just making a huge stack of chips that you could buy in the store?
Stephan:
Yeah. It is interesting because if you think about it, we've actually oscillated back and forth between this concept a little bit for years. I mean, you're exactly right. The first wave of standardization was, "Let's just use 19-inch rack units and start from there and then see, maybe that's a little bit better." Then sort of the same thing happened when we decided to use containers as an artifact to deliver software from point A to point B. Standardization and form factor really is what drove us there. Certainly there's value in that. The interesting moment happens when all of that together becomes...when the complexity of running all of that together and lining it all up becomes the problem, right? In the beginning you had one IBM S/390, and you'd know that's the one thing you have to line up. Now you have 200 OEM servers across X racks, and that's a lot of ducks to line up. The complexity and management of independent systems that you're sort of adding together, that sounds good on paper, but at some point you're crossing that complexity line where it's just more complex to even manage the hardware. This is not just from an effort perspective, this is also from a CPU load perspective. If more than 50% of your cores go towards just staying in sync with everybody else, how much are you really getting out of each individual component that makes up this cluster? Now of course you're saying, "Well, how do I disrupt it?" Well, you disrupt it by making assumptions about what this infrastructure actually looks like, rather than saying, "Well, you're a drop in the ocean, you first have to figure out where you're even at." If you eliminate that complexity, then fundamentally you can go straight into focusing more on a data plane-type focus rather than figuring out what the control plane looks like and how busy that one is. It's got a little bit of that. I think the DGX represents an optimization that shows...rather than purchasing 8 separate servers that have potentially similar GPUs in them, here's a way that not only has those 8 GPUs in it, but is also interconnected in a way that just makes optimal assumptions about what's going on between any 2 of those GPUs and what could possibly run on them. That combined with a software stack that's optimized for this layout just brings the value home. That's really where we're coming from.

The advantages of de-abstracting hardware (25:14)

Lukas:
It's interesting. When I started doing machine learning, the hardware was pretty abstracted away. We would compete for computing resources, so I got a little bit handy with Unix and NICE-ing processes and just coordinating with other people in grad school. But I really had no sense of the underlying hardware. I don't even think I took any classes on networking or chip architecture, and now I really regret it. I feel like I'm actually learning more and more about it and the hardware is becoming less and less abstracted away every year. I think NVIDIA has a real role to play there. Do you think that over time, we'll go back to a more abstracted away hardware model and we'll figure out the right APIs to this? Or do you think that we're going to make more and more specialized hardware for the different things that people are likely going to want to do, and a core skill of an ML practitioner is going to need to be "understanding how the underlying hardware works"?
Stephan:
Yeah. I think what you said there is...I'm reminded of 10 years ago, we used to say, "Well, if you're a web frontend developer and you don't know TCP/IP, you're not really a web frontend developer," but most web frontend developers will never think about TCP/IP. I think this is very true here, too. You have an MLOps practitioner and today you get to think about your models and tensors, hyperparameter searches, and all of that kind of stuff, and yes, that's important. Well, not important, it's crucial. Without that you couldn't do your work. But, increasingly you also have to know where you're actually running, in order to get the performance that you need. Today it's a real competitive advantage for the companies out there to increase the training speed. Obviously what we're solving is just getting started. I mean, we take all that pain away, you just log onto Base Command, off you go. But increasingly it's a true competitive advantage. Not to be in the cloud, but to be training faster than anybody else. In 2012, 2013, if you weren't working on a cloud initiative as a CIO, that was a problem. Now, increasingly, if you're not focusing on how to accelerate AI training, you're putting your company at a disadvantage. That means that the necessity for each individual practitioner who interacts with the hardware to actually understand what they run on and how to optimize for this is going to increase. Having said that though, part of our job at NVIDIA, I think, is to make optimal choices on behalf of the practitioner out of the gate. Rather than requiring people to really understand, let's say, the clock rates of each individual bus or something like that, we'll abstract it away. People will argue that CUDA is already still pretty low level, but we're actually abstracting a whole lot to even get to that point. I would say while that's true, we're trying to shield the practitioner as much as possible. We have a leg up because we can work with both the knowledge of what the GPU looks like, and most importantly what the next GPU will look like, but also how to expose that optimally at the application layer and interact with the MLOps providers in a meaningful way that just is optimal throughout.

Understanding Base Command's product-market fit (29:09)

Lukas:
Have there been any kind of cultural changes needed to build a SaaS, customer-facing product like Base Command at a company that comes up through making really great semiconductors and very... I would call CUDA low-level from my vantage point. Obviously it's an amazing piece of software, but it's a very low-level software. Has NVIDIA needed to make adjustments in the product development process to make Base Command work for customers?
Stephan:
Yeah, it's interesting. Base Command is actually not a new product. We've been using this thing internally for over five years. It was a natural situation for us because...five years ago we launched the first DGX. Of course, if you launch something like the DGX, and you say that's the best thing you could possibly purchase for the purposes of training, and you have 2,600 AI researchers in house, then you can imagine the obvious next question is, "Okay, how do we use this thing to accelerate our own AI research?" This need for creating large-scale AI infrastructure on the basis of the DGXs was born right out of this situation. With that came all these issues and as we solved them, we just kept adding to this portal or to this...it's more than just a portal. I mean, it's the entire stack, it's the infrastructure provisioning, and then the exposure, the middleware, the scheduler, the entire thing. It became more and more obvious to us what should be done. These 2,600 researchers that I just mentioned, bless their hearts, they really had to go through a lot of iteration with us and be very patient with us until we got it to the point where they'd, let's say, not complain as much. The point is that we really tried to get it right. We acted in a very transparent manner with a pretty large community of AI researchers and developers, and they told us what they needed and what they wanted and what their pain points were. Going to market now with Base Command as an externally facing product was simply turning that to the outside.
Lukas:
Have there been any surprises in taking it to market? I know that sometimes when companies have an internal tool, like I think the TensorFlow team has talked about this, that it's made for a really, really advanced large team and then you want to take it to someone who's newer, or a smaller team, they kind of have new needs that are a little bit surprising to people that have been doing this for a long time. Have you encountered anything like that as you bring it to market?
Stephan:
Yeah. It's funny you asked. We encounter this in just many different aspects. One example is that most customers...like I said, we make this available. The internal example that we use is, "Oh, you get to drive the Lamborghini for a while," the idea is this is a short-term rental. I mean, how long are you renting a Lamborghini? Maybe a day or two or a weekend. Here, we're saying short-term rental, they're probably going to rent this for three months or something like that. It turns out, most customers want to rent this for two years, three years. What surprised us was that there's a real need not only for a long-term rental, but especially for the immediacy of access to this. I think we had underestimated a little bit how desperate the market was to get started right away. We knew that people would want to get started, but we always figured, "Well, the cloud is there to get started right away, you just sign up and swipe your credit card and off you go." The need for large-scale training and just the immediacy of that need, that personally was a surprise to me. I hadn't expected that. I thought that would be much more of a slower ramp than it was. I thought I was going to be in different sales conversations than I actually found myself in. That was a surprise. Other surprises are just understanding just how much people still have to go. Typically, we encounter folks who say, "My way to scale and accelerate my training is just to pick a larger GPU." There's a big, big portion of the market that certainly has been operating that way. But really helping them see that sometimes it's not the scale-up model but the scale-out model that might be appropriate as the next step, it wasn't exactly surprising, but it was interesting to see just how widespread that scale-up thinking was rather than the scale-out thinking.
Lukas:
Can you say more about scale-up versus scale-out? What do you mean? What's the difference there?
Stephan:
If you think about cloud infrastructure, then a scale-up approach would be, "You started with a medium instance and you go to an X-large," or something like that. You just choose more powerful resources to run the same exact workload, but you don't really think about adding a second server, for example, and now spread the load across multiple instances. Here, it would be something similar. If you always think about saying, "I choose to run this on a Volta-based system and now I have a Volta-based GPU. Now my way to make this faster is to go to an Ampere-based architecture GPU," that would be scaling up. Certainly, that's something that you want to do, but at some point, your pace and your need for accelerated training actually exceeds the cadence at which we can provide you the next fastest GPU. If you need to scale faster than that, and if that curve exceeds the other, then you're essentially in a situation where you have to say, "Well, how about I take a second A100?" Then I have a multi-GPU scenario, and let's just deal with that, and so on and so forth. The natural conclusion of that is, "How about multi-node jobs where they're smack full of the latest and greatest GPUs, and then how many nodes can I spread my job across?" If you do, I don't know, 5 billion parameters, then yeah, you're going to have to do that. Then you're going to be pretty busy trying to organize a job across multiple sets of nodes.
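
One way to see the distinction is as two different levers on the same throughput: scaling up raises per-device speed, while scaling out multiplies the device count, minus whatever communication overhead multi-node adds. A toy model with made-up numbers (these are not benchmarks of any particular GPU):

```python
# Toy throughput model contrasting scale-up and scale-out.
# The per-GPU rates and efficiency factor are illustrative only.
def job_throughput(samples_per_sec_per_gpu, gpus_per_node, nodes,
                   scaling_efficiency=0.9):
    # scaling_efficiency stands in for communication overhead across nodes.
    eff = scaling_efficiency if nodes > 1 else 1.0
    return samples_per_sec_per_gpu * gpus_per_node * nodes * eff

baseline   = job_throughput(500, 1, 1)     # one older GPU
scaled_up  = job_throughput(1500, 1, 1)    # one newer, faster GPU (scale up)
scaled_out = job_throughput(500, 8, 4)     # 8 GPUs per node, 4 nodes (scale out)

print(baseline, scaled_up, scaled_out)     # 500.0 1500.0 14400.0
```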

Data center infrastructure as a value center (36:59)

Lukas:
Do you have any sense of how your customers view the trade-off of buying more GPUs, buying more hardware to make their models perform better? Are they really doing a clear ROI calculation? One of the things that we see at Weights & Biases is that it seems like our customers' use of GPU just expands to fit whatever capacity they actually have, which I'm sure is wonderful for NVIDIA, but you wonder if the day will come where people start to scrutinize that cost more carefully. Some people have pointed out that there's possibly even environmental impact from just monstrous training runs, or even a kind of sad effect where no one can replicate the latest academic research if it only can be done at multi-million-dollar-scale compute. How do you think about that?
Stephan:
In the end, I think it's a pretty simple concept. If the competitive advantage for companies today is derived from being able to train faster and larger and better models, you're not speaking to the CFO anymore. You're speaking to the product teams. At that point, it just becomes a completely different conversation. The only interesting piece here is that traditionally, of course, data center infrastructure is a cost center, whereas now we're talking about turning it into a value center. If you turn it into a value center, then you really don't have this problem. Of course we have extensive ROI conversations with our customers. We have TCO calculators and all that good stuff, it's definitely there. It's really about helping customers choose, "Should we do more cloud for where we're at?" and from a GPU standpoint, we're happy with either outcome. We're maintaining neutrality in that aspect, in that we're saying, "Well, if more cloud usage turns out to be better for you, then you should absolutely go and do that." Then if we figure out that the economics shifted in such a way that a mix of cloud and on-prem, or cloud and hosted resources makes sense, then we'll propose that. It's really about finding the best solution there and definitely our customers are asking these questions and making pretty hard calculations on that. But, I mean, it's pretty obvious. If you think about it...a couple years ago, we talked to an autonomous driving lab team and they said, "Well, Company A put 300,000 miles autonomously on the road last year, and we put 70,000 miles on the road last year autonomously. We got to change that. How do I at least match the 300,000 miles a year that I can put autonomously on the road?" So that's a direct function of, "How well does your model work?" and so on and so forth. It's a pretty clear tie-in right now.
Lukas:
What about inference? A lot of the customers that we talk to, inference is really the dominant compute costs that they have, so the training is actually much smaller than the spend on inference. Do you offer solutions for inference too? Could I use Base Command at inference time, or is it entirely training? And do people ever use these DGX machines for inference, or would that just be a crazy waste of an incredibly expensive resource?
Stephan:
Yes and no, it depends on how you use it. First of all, you can use Base Command for model validation purposes. You can have single-shot runs. But some customers want to set up a server that is dedicated to inference and then just take MIG slices and say, "Well, I'll do my model validation at scale, basically. I'll do my scoring there." If you share that infrastructure across a large number of data scientists, you put your DGX to good use. There's no issue with that. We do have a sister SaaS offering to Base Command called Fleet Command. That is meant to take the output of Base Command in the form of a container, of course, and then deploy that at scale and orchestrate it at scale, and really manage the inference workloads at the edge for our customers. It's end-to-end coverage there from a SaaS perspective.

Base Command's role in tech stacks (42:13)

Lukas:
In your view, based on the problems that you're seeing in the market, what functionality are customers asking for in their software layer for machine learning training that you're interested in providing?
Stephan:
That's a really good question because it goes to the heart of the question, "What space is Base Command seeking to occupy in a theoretical stack where the infrastructure's at the bottom and something like Weights & Biases at the top?" I would see Base Command's role as an arbiter and a broker. Almost like a bridge between a pure developer-focused, almost IDE-like, perspective and an enterprise-ready architecture. Let me give you a simple example. If you do dataset versioning — and then let's say that's what you want to do with your MLOps platform — then there's many ways to version data. You can try and be smart about this, but at the end of the day, it's a question of what infrastructure is available to you. If I have an optimized storage filer underneath, my dataset versioning strategy looks entirely different than if I just have kind of a scale-out, open source storage backend. If I work with S3 buckets, then my versioning looks different than if I do that with NFS shares. The value that Base Command provides is that it abstracts it away. If you do dataset versioning with Base Command, then it'll do snapshots. If you do it on a NetApp filer, it'll do other things than if you do it with different storage. But those are exactly the questions that an enterprise architect will be interested in. How do you deal with that? Just because you figure you need 50 versions of your dataset that's 3 TB large, does that mean I need to plan for almost infinite storage? No, it doesn't. We can help you translate that and make that consumable in the enterprise. I think that's a big piece that Base Command can provide as this arbiter between the infrastructure and the API, if you will. The second thing is, increasingly, I've seen people being very concerned about data security and governance around this. If you have sufficiently large infrastructure to deal with, then almost always you have multiple geos to deal with. They have different laws about the data that's being allowed at any given point in time. Just the ability to say, "This dataset can never leave France," or "That dataset has to only be visible to these three people and nobody else," is of extreme value to enterprises. All those things come into play, and I think that's where Base Command can help.
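
The arbitration described here, one dataset-versioning call whose behavior depends on what the storage backend does best, is essentially a strategy pattern. A hypothetical sketch follows; the class and method names are ours for illustration, not Base Command's API.

```python
from abc import ABC, abstractmethod

class DatasetVersioner(ABC):
    """Hypothetical interface: one 'version this dataset' call,
    with backend-specific behavior underneath."""
    @abstractmethod
    def snapshot(self, dataset: str, version: str) -> str: ...

class FilerVersioner(DatasetVersioner):
    def snapshot(self, dataset, version):
        # A filer with native snapshots can version 3 TB without copying 3 TB.
        return f"snapshot://{dataset}@{version}"

class ObjectStoreVersioner(DatasetVersioner):
    def snapshot(self, dataset, version):
        # An S3-style backend might lean on object versioning or an
        # immutable-object manifest instead of filesystem snapshots.
        return f"s3://{dataset}/manifests/{version}.json"

def version_dataset(backend: DatasetVersioner, dataset: str, version: str) -> str:
    # Callers never see which strategy ran; that's the abstraction.
    return backend.snapshot(dataset, version)
```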

Lukas:
Are there other parts of Base Command that you've put a lot of work into, where people might not realize the amount of effort that it took? Parts that might be invisible to a customer, or even to me just imagining what Base Command does?
Stephan:
Yeah. I think that we invested a lot in our scheduler. If you look at the layout of DGXs in a SuperPOD arrangement and the nature of the jobs that go into this, I think people underestimate just how optimized the scheduler is across, not just multiple nodes, but also within the node. For you to be able to say, "I'm running a job with a one-GPU configuration," and then it's a slider, and then I say, "Well, I'm turning this into an eight-GPU job now," and that's literally a selection. What goes on in the background is just a lot more intricate than people typically realize. But it goes on automatically, and you do have to be ready for it. You have to program for it, and people know that. But as soon as you do that at your layer, the optimization underneath is just incredible.
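
To give a flavor of what that slider implies for the scheduler: it has to find enough free GPUs, ideally packed onto as few nodes as possible and on nodes that already hold the job's data. Below is a deliberately simplified, hypothetical placement sketch; the real scheduler weighs far more, including interconnect topology and caching strategy.

```python
# Hypothetical single-node placement sketch: prefer nodes that already
# cache the dataset, then the tightest fit in terms of leftover GPUs.
def place_job(nodes, gpus_needed, dataset):
    candidates = [n for n in nodes if n["free_gpus"] >= gpus_needed]
    if not candidates:
        return None  # a real scheduler would queue, preempt, or span nodes
    candidates.sort(key=lambda n: (dataset not in n["cached_datasets"],
                                   n["free_gpus"] - gpus_needed))
    return candidates[0]["name"]

nodes = [
    {"name": "dgx-01", "free_gpus": 8, "cached_datasets": {"imagenet"}},
    {"name": "dgx-02", "free_gpus": 8, "cached_datasets": set()},
    {"name": "dgx-03", "free_gpus": 2, "cached_datasets": {"imagenet"}},
]
print(place_job(nodes, 8, "imagenet"))  # -> dgx-01
```
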
Lukas:
What's tricky? Is it that you need to find eight GPUs that are close to each other and not being used, and all that? Is that the basic challenge?
Stephan:
Yeah, exactly. Data locality, caching strategies, all that kind of stuff is going straight into that selection.

Why crowdsourcing is underrated (47:16)

Lukas:
Cool. All right. Well, we always end with two questions, both on ML. Let's see how you answer them. One thing we always ask is, what's an underrated aspect of machine learning that you think people should pay more attention to, or you would love to spend time on if you had more free time?
Stephan:
I think what's underrated is this aspect of crowdsourcing. I don't think anybody is looking at machine learning and the potential that many small devices contributing to the creation of a model would bring. I think that we're at the cusp of that, but we're not really doing that right now. I think to the degree that it already happens, it's very hidden from us. We all know Google will run some algorithms across data that was collected through the phones. We understand that on a conceptual level, but just the ability to bring that together in a more natural sense, where we might want to find recommendations not on the basis of a single parameter, but on the basis of more meaningful parameters. I find five-star reviews very meaningless, for example. I think that is a very simplified view of the world. I find, consequently, also one-star reviews very meaningless. But if you could actually have a more natural understanding based on machine learning, that would be an interesting topic to explore, because it would have to be based on just all kinds of inputs that would have to be taken into account. I would like to see that and I think that would be an interesting field of research, an interesting field of development. I think people still assume that it's only a prerogative of the big companies to be able to do that, but I think there's an open source project in there somewhere.
Lukas:
Cool. I hope somebody starts that, and then they should send it to us when they do.

The challenges of scaling beyond a POC (49:24)

Lukas:
Our final question is, when you look at your customers and their effort to take business problems and turn them into machine learning problems, then deploy them, and solve those problems, where do you see the biggest bottleneck? Where are they struggling the most right now?
Stephan:
The biggest issue they have — at least as far as I can tell — is that they have just a getting-started issue in the sense of, "How do I scale this beyond my initial POC?" I think that the prefab solutions that are out there are pretty good at walking you through a getting-started tutorial, and they'll probably get you really far if you're a serious practitioner and you devote some time to it, but I think that at some point, you'll hit problems that may not even have anything to do with ML. They may just have something to do with the infrastructure that's available to you and things like that. I think that anybody who is trying to use this for a commercial and business-strategic purpose is going to run into an issue sooner or later of, "How do I go from Point A to Point B here?" People call it something like AI DevOps, or something like that, that has floated around. I think, as an industry, we should be aiming to make sure that that job never comes and sees the light of day.
Lukas:
Too late, I think.
Stephan:
Yeah. I feel like we lost on that one already. But I really think we should do better. It shouldn't require super special skills to create this whole DevOps approach around AI training. We should really know better by now how that whole approach works and then build products that drive that.

Outro (51:39)

Lukas:
Awesome. Well, thanks so much for your time. I really appreciate it. That was fun.
Stephan:
Thank you.
Lukas:
If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description where you can find links to all the papers that are mentioned, supplemental material, and a transcription that we worked really hard to produce. So check it out.
Note: Transcriptions are provided by a third-party service, and may contain some inaccuracies. Please submit any corrections to angelica@wandb.com. Thank you!

Intro

Stephan:
Scheduling on a supercomputer typically is by Post-it. It's, "Joe, it's your cluster this week but I need it next week." It doesn't work that way at scale anymore. You want to interact with something that is actually understanding the use of the cluster, optimizing its use so that the overall output across all of the users is guaranteed at any given point in time.
Lukas:
You're listening to Gradient Dissent, a show about machine learning in the real world. And I'm your host Lukas Biewald. This is a conversation I had with Stephan Fabel, who is a Senior Director of Product Management at NVIDIA, where he works on the Base Command Platform software that runs on top of NVIDIA's DGX machines, which are basically the most powerful computers that you can buy to train your machine learning models on top of. It's fun to talk about the challenges that customers face when they have access to basically unlimited compute power. This is a super fun conversation, and I hope you enjoy it.

NVIDIA Base Command and DGX SuperPOD

Lukas:
My first question for those who haven't heard of NVIDIA Base Command, since you are the senior product manager on it, can you tell me what Base Command aspires to do?
Stephan:
In a way, think of NVIDIA Base Command as your one-stop shop for all of your AI development. It's a SaaS offering from NVIDIA where you log on directly, or you log on via an integration partner, and you leverage the capabilities of Base Command to schedule jobs across a variety of infrastructures. You do that in a secure manner. You gain access to your data and retain access to your data and data sovereignty, across the infrastructure that you're scheduling the jobs on. Then it's really just a matter of optimizing that job run on NVIDIA infrastructure. That's really what Base Command aims to do.
Lukas:
These jobs, they're model training jobs exclusively, or is it broader than that?
Stephan:
Model training jobs are generally the ones that we focus on, but we also do model validation, for example. You could have single-shot inference runs as well.
Lukas:
Are there other pain points of model of development that Base Command aspires to solve or tries to solve?
Stephan:
Yeah. I think that a lot of the issues that you have with AI infrastructure, it's really that's where it starts. The question is, "Where do you train your models?" and "How do you go about it?" Most people start in the cloud to train their models. That's reasonable because just any development effort would start in the cloud today. At some point you reach a certain amount of scale where you say, "Well, it may not deliver the performance I need, or it may not deliver the scale I need, at the economics I'm comfortable with," et cetera. For those high-end runs, you typically look at infrastructure alternatives. Then the question becomes, "Okay, I already am used to this whole SaaS interaction model with my AI development. How do I maintain that developer motion going forward?", where I don't have to teach them something new just because the infrastructure is different. What we have at NVIDIA is this DGX SuperPOD. The idea is to say, "Well, how about we try this and develop Base Command as a way to access a SuperPOD, just as a cloud API would behave?"
Lukas:
A DGX SuperPOD, is that something that I could put in my own infrastructure or is that something that I could access in the cloud or both? How does that work?
Stephan:
Typically, our customers for SuperPODs...maybe we should take a step back and understand what it is. The easiest way to think about — or the most straightforward way to think about — a DGX SuperPOD is to think of it as a super computer in a box. It's a packaged-up infrastructure solution from NVIDIA that you can purchase, and it'll be deployed on premise for you and your own data center or in a colo facility. Actually we found that a colo facility is the most likely place for you to put that because it is a pretty intensive investment. Number one, not just in terms of just the number of DGXs that are involved, for example, but also of course, in the terms of the power draw and cooling and just the requirements that you need to bring to even run this beast, essentially. That's really what then dictates where this thing usually is. What we did is we put it in a colo facility and made it available right now in directed availability fashion. We have a couple of golden tickets for some customers who want to be on this thing, and then they get to select the size of the slice they want and access that through Base Command.
Lukas:
I see. When you use Base Command, you're using DGX, but it's in NVIDIA's cloud and you get kind of a slice of it. Is that right?
Stephan:
Yeah, that's right. I know we call it NVIDIA GPU Cloud, but really think of the whole Base Command proposition today as a SaaS portal that you access, that is currently coupled to more like a rental program. It's less cloud bursty elastic; think of it more like, "Okay, I have three DGX A100s today, and then maybe in the next couple of months, I know I need three more. I'll call NVIDIA and say, 'Hey, I need three more for the next month.'" That's kind of how that works.
Lukas:
Maybe let's start with the DGX box. What would a standard box look like? What's its power draw? How big is it? How much does it cost? Can you answer these questions? Just in order of magnitude.
Stephan:
You're looking at about $300,000 for a single DGX A100. It'll have 8 GPUs and 640 gigabytes of memory that come along with that. Those are the A100 GPUs, the latest and greatest that we have. You're going to look at about 13 kilowatts per rack of standard deployment.
Lukas:
13 kilowatts?
Stephan:
Yeah.
Lukas:
Constant or just training?
Stephan:
No, no. When you fire these things up, these puppies, they heat up quite a lot. They're pretty powerful and the DGX SuperPOD consists of at minimum 20 of those. If you think about that, that's what we call one scale unit. And we have customers that build 140 of those.
Lukas:
Wow. What kinds of things do they do with that?
Stephan:
Well, just all the largest multi-node jobs that you could possibly imagine, starting from climate change analysis. Large, huge data sets that need to be worked on there. NLP is a big draw for some of these customers. Natural language processing and the analytics that comes with those models is pretty intensive, data intensive and transfer intensive. We keep talking about the DGXs and of course we're very proud of them and all of that, but we also acquired a company called Mellanox a year ago. So of course the networking plays a huge role in the infrastructure layout of such a SuperPOD. If you have multi-rail InfiniBand connections between all of those boxes and the storage, which typically uses a parallel file system in a SuperPOD, then what you'll get is essentially a extreme performance even for multi-node jobs. Any job that even has to go above and beyond multiple GPUs, a DGX SuperPOD architecture will get you there. Essentially at the, I would say, probably one of the best speed performance characteristics that you could possibly have. The SuperPOD scored number 5 on the top 500. It's nothing to sneeze at.
Lukas:
How does the experience of training on that compare to something that listeners would be more familiar with, like a 2080 or 3080, which feels pretty fast already. How much faster is this and do you need to use a special version of TensorFlow or PyTorch or something like this to even take advantage of the parallelism?
Stephan:
I'd have to check exactly how to quantify an A30 to an A100, but think of it as this. Any other GPU that you might want to use for training in a traditional server, think of it as a subset of the capabilities of an A100. If you use, for example, our MIG capability, you can really slice that GPU down to a T4-type performance profile and say, "Well, I'm testing stuff out on a really small performance profile without having to occupy the entire GPU." Once you have the same approach from a software perspective...if you do your sweeps, then you do essentially the same thing. Or you could do those on MIG instances and then thereby you don't need that many DGXs when you do it. I guess I should say that that's the beauty of CUDA. If you write this once it'll run on an A30, it'll run on an A100, it'll run on a T4. In fact, we provide a whole lot of base images that are free for people to use and to start with, and then sort of lift the tide for everybody. These are pre-optimized container images that people can build on.

The challenges of multi-node processing at scale

Lukas:
I would think there'd be a lot of networking issues and parallelization issues that would come up, maybe uniquely, at this scale. Is that something that NVIDIA tries to help with? Does CUDA actually help with that? I think of CUDA as compiling something to run on a single GPU.
Stephan:
Absolutely. If you think of CUDA as a very horizontal platform piece in the software stack of your AI training stack, then components like NCCL, for example, provide you with pretty optimized communication paths for multi-GPU jobs, but they'll also span multi-nodes. This starts from selecting the right NIC to exit a signal, because that means you're going to the right port and the top of the rack switch. That means you minimize the latency that your signal takes from point A to point B in such a dataset center. When you look at CUDA, and especially at components like NCCL and Magnum IO as a whole — which is our portfolio of communication libraries and storage acceleration libraries — it starts from the integration of the hardware and the understanding of the actual chip itself, and then it builds outward from there. The big shift at NVIDIA that we're looking at accelerating with use of Base Command is this understanding that NVIDIA is now thinking about the entire data center. It's not just about, "I got the newest GPU, and now my game runs faster." Certainly that's a focus area of us as well. But if you take the entire stack and work inside out, essentially, then the value proposition just multiplies the further out you go. With Base Command, this is sort of the last step in this whole journey to turn it into a hybrid proposition. I know it's very high-level right now and abstract, but it's a super interesting problem to solve. If you think about how data center infrastructure evolved over the last, let's say 10 years or so, then it was about introducing more homogeneity into the actual layout of the data center. Certain type of server, certain type of CPU, certain type of top-of-rack switch, and then a certain layout. You have all these non-blocking fabric reference architectures that are out there and et cetera, et cetera. Ultimately now that everything is homogeneous, you can now make it addressable using an API because everything is at least intended to behave in this very standard and predictable way. We worked our way up there. This has never been the case for something like a supercomputer. A supercomputer was a 2-year research project with a lot of finagling and "Parameters here, and then set this thing to a magic value and that thing to a magic value, and then run it on 5 minutes after midnight, but not on Tuesdays," and then you get the performance. This whole contribution that we're really making here is that we're raising that bar to a predictable performance profile that is repeatable. Not just inside an NVIDIA data center, where we know 5 minutes after midnight and so on, but also in your data center or in an actual random data center, provided you can afford the cooling and power of course. But then once we got that out of the way, we're pretty good. That's a real shift forward towards enabling enterprises, real bonafide true blue chip companies, to actually adopt AI at a larger scale.
Lukas:
It's interesting. One thing I was thinking of as you were talking is, most of the customers that we work with...we don't always know, but I think what we typically see with our customers that are training a lot of machine learning models, is they use a lot of NVIDIA hardware, but it's less powerful hardware than the DGX. It might be P100 or basically whatever's available to them through Amazon or Azure or Google Cloud. I think they do that for convenience, I think people come out of school knowing how to train on those types of infrastructure. Then their compute costs do get high enough. I mean, we do see compute costs certainly well into the seven, eight figures. Do you think that they're making a mistake by doing it that way? Should they be buying custom DGX hardware and putting that into colo, would they actually save money or make their teams more productive if they did it that way?
Stephan:
Oh God, no. Just to be really clear, Base Command is not a cloud. We're not intending to go out there and say, "Go here instead of Amazon," or something like that, that's not what we are saying. First of all, you can get A100 instances in all the major public clouds as well. You could have access to those instances in just the same way that you're used to consuming the P100s or V100s or anything like that. Whether it's Pascal or Volta or Ampere architecture, all of it is available in the public cloud. Like I said in the beginning, it's just a perfectly acceptable way to start. In fact, it's the recommended path, to start in the cloud, because it requires the least upfront investment. I mean, zero. And you get to see how far you can push something, an idea. Once you arrive at a certain point, I think then it's a question of economics, and then just everything will start falling into place. What we found is that enterprises typically arrive at a base load of GPUs. In other words, at any given moment in time, for whatever reason, there is a certain number of GPUs working. Once you identify that, "Hey, every day I keep at least 500 GPUs busy," then typically the economics are better if you purchase. Typically, a CapEx approach works out better. It's not always the case, but typically that might be the case. To meet that need in the market is where we come in. What Base Command right now offers is this...it's not the all the way "Purchase it", you don't have to have that big CapEx investment up front, but it is something in between. You do get to rent something, it's not entirely cloud, but you're moving from the Uber model to the National Car Rental-type model. Once you're done renting, then you maybe want to buy a car. But the point is that there's room here on that spectrum. Currently we're right smack in the middle of that one. That's typically what we say to customers. Just actually yesterday, somebody said, "Well, how do you support bursting? And how elastic are you?" I said, "That's not the point here." You want to be in cloud when you want to be elastic and bursty, but typically that base load is done better in different ways.

Why it's hard to use a supercomputer effectively

Lukas:
What breaks if I don't use Base Command? If I just purchased one of these machines and I'm shell-ing into it and kicking off my jobs the way I'm typically used to, or running something in a notebook, what starts to break, such that you know you need something more sophisticated?
Stephan:
On the face of it, nothing really breaks. It just takes a lot of expertise to put these things together. If you buy a single box, then there's probably very little value added in bringing that to a SaaS platform, per se. But as soon as you start thinking about a cluster of machines — and like I said, more and more of our enterprise customers are actually thinking about deploying many of those, not just a single machine — then you're faced with all the traditional skill challenges in your enterprise that you'd be used to from just rolling out private cloud infrastructure. It's the same exact journey. It's the same exact challenge. You need to have somebody who understands these machines, somebody who understands networking, somebody who understands storage, Kubernetes, and so on and so forth. Once you build up the skill profile that you need to actually run this infrastructure at scale and at capacity, then you're good to go, right? You can build your own solution, but typically what you'd be lacking are the things that help you make the most of it. All the efficiency gains that you'd get by just having visibility into the use of the GPUs. All the telemetry and the aggregates by job and by user and by team. This entire concept of chargeback, et cetera, is a whole other hurdle that you then have to climb. The people who want to build a cluster typically want to do that because they want to share that cluster. It's a pretty big beast. If you build a big cluster, you might as well share it, because you want to be more efficient and make the most of it, and so now you need a broker who brokers access to the system. As ridiculous as it sounds, scheduling on a supercomputer typically is by Post-it. It's, "Joe, it's your cluster this week but I need it next week." It doesn't work that way at scale anymore. You want to interact with something that actually understands the use of the cluster and optimizes its use, so that the overall output across all of the users is guaranteed at any given point in time.
Lukas:
I have the sense that many years ago, decades ago, when I was a kid or maybe even before that, supercomputers felt like this really important resource that we used for lots of applications. Then maybe in the nineties or the aughts, they became less popular and people started moving their compute jobs to distributed commodity hardware. And maybe they're kind of making a comeback again. Do you think that's an accurate impression? Do you have a sense of what the forces are that make supercomputers more or less interesting, compared to just making a huge stack of chips that you could buy in the store?
Stephan:
Yeah. It is interesting, because if you think about it, we've actually oscillated back and forth between these concepts a little bit for years. I mean, you're exactly right. The first wave of standardization was, "Let's just use 19-inch rack units and start from there and then see, maybe that's a little bit better." Then sort of the same thing happened when we decided to use containers as an artifact to deliver software from point A to point B. Standardization of form factor really is what drove us there. Certainly there's value in that. The interesting moment happens when the complexity of running all of that together, and lining it all up, becomes the problem, right? In the beginning you had one IBM S390, and you knew that's the one thing you have to line up. Now you have 200 OEM servers across X racks, and that's a lot of ducks to line up. The complexity of managing independent systems that you're adding together sounds good on paper, but at some point you're crossing that complexity line where it's just more complex to even manage the hardware. This is not just from an effort perspective, it's also from a CPU load perspective. If more than 50% of your cores go towards just staying in sync with everybody else, how much are you really getting out of each individual component that makes up this cluster? Now of course you're asking, "Well, how do I disrupt that?" Well, you disrupt it by making assumptions about what this infrastructure actually looks like, rather than saying, "Well, you're a drop in the ocean, you first have to figure out where you even are." If you eliminate that complexity, then fundamentally you can focus on the data plane, rather than figuring out what the control plane looks like and how busy it is. It's got a little bit of that. I think the DGX represents an optimization that shows: rather than purchasing 8 separate servers that have potentially similar GPUs in them, here's a single system that not only has those 8 GPUs in it, but is also interconnected in a way that makes optimal assumptions about what's going on between any two of those GPUs and what could possibly run on them. That, combined with a software stack that's optimized for this layout, just brings the value home. That's really where we're coming from.

The advantages of de-abstracting hardware

Lukas:
It's interesting. When I started doing machine learning, the hardware was pretty abstracted away. We would compete for computing resources, so I got a little bit handy with Unix and NICE-ing processes and just coordinating with other people in grad school. But I really had no sense of the underlying hardware. I don't even think I took any classes on networking or chip architecture, and now I really regret it. I feel like I'm actually learning more and more about it and the hardware is becoming less and less abstracted away every year. I think NVIDIA has a real role to play there. Do you think that over time, we'll go back to a more abstracted away hardware model and we'll figure out the right APIs to this? Or do you think that we're going to make more and more specialized hardware for the different things that people are likely going to want to do, and a core skill of an ML practitioner is going to need to be "understanding how the underlying hardware works"?
Stephan:
Yeah. I think what you said there...I'm reminded of 10 years ago, when we used to say, "Well, if you're a web frontend developer and you don't know TCP/IP, you're not really a web frontend developer," but most web frontend developers will never think about TCP/IP. I think this is very true here, too. You're an MLOps practitioner, and today you get to think about your models and tensors, hyperparameter searches, and all that kind of stuff, and yes, that's important. Well, not just important, it's crucial. Without that you couldn't do your work. But increasingly you also have to know where you're actually running, in order to get the performance that you need. Today it's a real competitive advantage for the companies out there to increase their training speed. Obviously what we're solving is just getting started. I mean, we take all that pain away, you just log onto Base Command and off you go. But increasingly it's a true competitive advantage. Not to be in the cloud, but to be training faster than anybody else. In 2012, 2013, if you weren't working on a cloud initiative as a CIO, that was a problem. Now, increasingly, if you're not focusing on how to accelerate AI training, you're putting your company at a disadvantage. That means that the necessity for each individual practitioner who interacts with the hardware to actually understand what they run on, and how to optimize for it, is going to increase. Having said that, though, part of our job at NVIDIA, I think, is to make optimal choices on behalf of the practitioner out of the gate. Rather than requiring people to really understand, let's say, the clock rates of each individual bus or something like that, we abstract it away. People will argue that CUDA is still pretty low level, but we're actually abstracting away a whole lot to even get to that point. So while that's true, we're trying to shield the practitioner as much as possible. We have a leg up because we can work with both the knowledge of what the GPU looks like, and most importantly what the next GPU will look like, and the knowledge of how to expose that optimally at the application layer and interact with the MLOps providers in a meaningful way that is optimal throughout.

Understanding Base Command's product-market fit

Lukas:
Have there been any kind of cultural changes needed to build a SaaS, customer-facing product like Base Command at a company that comes up through making really great semiconductors and very... I would call CUDA low-level from my vantage point. Obviously it's an amazing piece of software, but it's a very low-level software. Has NVIDIA needed to make adjustments in the product development process to make Base Command work for customers?
Stephan:
Yeah, it's interesting. Base Command is actually not a new product. We've been using this thing internally for over five years. It was a natural situation for us because...five years ago we launched the first DGX. Of course, if you launch something like the DGX, and you say that's the best thing you could possibly purchase for the purposes of training, and you have 2,600 AI researchers in house, then you can imagine the obvious next question is, "Okay, how do we use this thing to accelerate our own AI research?" This need for creating large-scale AI infrastructure on the basis of the DGXs was born right out of that situation. With that came all these issues, and as we solved them, we just kept adding to this portal, or to this...it's more than just a portal. I mean, it's the entire stack: the infrastructure provisioning, and then the exposure, the middleware, the scheduler, the entire thing. It became more and more obvious to us what should be done. These 2,600 researchers that I just mentioned, bless their hearts, really had to go through a lot of iteration with us and be very patient with us until we got it to the point where they'd, let's say, not complain as much. The point is that we really tried to get it right. We acted in a very transparent manner with a pretty large community of AI researchers and developers, and they told us what they needed, what they wanted, and what their pain points were. Going to market now with Base Command as an externally facing product was simply turning that to the outside.
Lukas:
Have there been any surprises in taking it to market? I know that sometimes when companies have an internal tool, like I think the TensorFlow team has talked about this, that it's made for a really, really advanced large team and then you want to take it to someone who's newer, or a smaller team, they kind of have new needs that are a little bit surprising to people that have been doing this for a long time. Have you encountered anything like that as you bring it to market?
Stephan:
Yeah. It's funny you asked. We encounter this in many different aspects. One example is that most customers...like I said, we make this available. The internal example that we use is, "Oh, you get to drive the Lamborghini for a while." The idea is that this is a short-term rental. I mean, how long are you renting a Lamborghini? Maybe a day or two, or a weekend. Here, we're saying short-term rental, and they're probably going to rent this for three months or something like that. It turns out most customers want to rent this for two years, three years. What surprised us was that there's a real need, not only for a long-term rental, but especially for immediacy of access to this. I think we had underestimated a little bit how desperate the market was to get started right away. We knew that people would want to get started, but we always figured, "Well, the cloud is there to get started right away, you just sign up and swipe your credit card and off you go." The need for large-scale training and just the immediacy of that need, that personally was a surprise to me. I hadn't expected that. I thought that would be much more of a slow ramp than it was. I thought I was going to be in different sales conversations than I actually found myself in. That was a surprise. Other surprises were just understanding how far people still have to go. Typically, we encounter folks who say, "My way to scale and accelerate my training is just to pick a larger GPU." There's a big, big portion of the market that has certainly been operating that way. Really helping them see that sometimes it's not the scale-up model but the scale-out model that might be appropriate as the next step, that wasn't exactly surprising, but it was interesting to see just how widespread the scale-up thinking was compared to the scale-out thinking.
Lukas:
Can you say more about scale-up versus scale-out? What do you mean? What's the difference there?
Stephan:
If you think about cloud infrastructure, then a scale-up approach would be, "You started with a medium instance and you go to an X-large," or something like that. You just choose more powerful resources to run the same exact workload, but you don't really think about adding a second server, for example, and spreading the load across multiple instances. Here, it's something similar. If you always think, "I chose to run this on a Volta-based system, so I have a Volta-based GPU. Now my way to make this faster is to go to an Ampere-architecture GPU," that would be scaling up. Certainly, that's something you want to do, but at some point, your pace and your need for accelerated training actually exceeds the cadence at which we can provide you the next fastest GPU. If you need to scale faster than that, and if that curve exceeds the other, then you're essentially in a situation where you have to say, "Well, how about I take a second A100?" Then I have a multi-GPU scenario, and let's just deal with that, and so on and so forth. The natural conclusion of that is, "How about multi-node jobs, where the nodes are smack full of the latest and greatest GPUs, and then how many nodes can I spread my job across?" If you're doing, I don't know, 5 billion parameters, then yeah, you're going to have to do that. Then you're going to be pretty busy trying to organize a job across multiple sets of nodes.
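To show the difference in concrete terms, here is a hedged sketch of what scale-out looks like on the launch side, reusing the DDP-style script from the earlier example. The host names and GPU counts are made up; the key point is that the per-GPU code stays the same while the number of workers grows.

```python
# Scale-up vs. scale-out, illustrated with launch commands (hypothetical hosts):
#
# Scale-up: the same single-GPU script, just run on a faster GPU.
#   python train.py
#
# Scale-out: the same DDP script, spread across 2 nodes x 8 GPUs = 16 workers.
#   Node 0: torchrun --nnodes=2 --node_rank=0 --nproc_per_node=8 \
#             --rdzv_backend=c10d --rdzv_endpoint=node0:29500 train_ddp.py
#   Node 1: torchrun --nnodes=2 --node_rank=1 --nproc_per_node=8 \
#             --rdzv_backend=c10d --rdzv_endpoint=node0:29500 train_ddp.py
#
# The per-GPU code does not change; the effective (global) batch size does.
import os

world_size = int(os.environ.get("WORLD_SIZE", 1))
per_gpu_batch = 64
global_batch = per_gpu_batch * world_size   # 64 on one GPU, 1024 across 16
print(f"{world_size} workers -> global batch size {global_batch}")
```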

Data center infrastructure as a value center

Lukas:
Do you have any sense of how your customers view the trade-off of buying more GPUs, buying more hardware, to make their models perform better? Are they really doing a clear ROI calculation? One of the things that we see at Weights & Biases is that it seems like our customers' use of GPUs just expands to fit whatever capacity they actually have, which I'm sure is wonderful for NVIDIA, but you wonder if the day will come when people start to scrutinize that cost more carefully. Some people have pointed out that there's possibly even an environmental impact from just monstrous training runs, or even a kind of sad effect where no one can replicate the latest academic research if it can only be done with multi-million-dollar-scale compute. How do you think about that?
Stephan:
In the end, I think it's a pretty simple concept. If the competitive advantage for companies today is derived from being able to train faster and larger and better models, you're not speaking to the CFO anymore. You're speaking to the product teams. At that point, it just becomes a completely different conversation. The only interesting piece here is that traditionally, of course, data center infrastructure is a cost center, whereas now we're talking about turning it into a value center. If you turn it into a value center, then you really don't have this problem. Of course we have extensive ROI conversations with our customers. We have TCO calculators and all that good stuff; it's definitely there. It's really about helping customers choose, "Should we do more cloud for where we're at?", and from a GPU standpoint, we're happy with either outcome. We're maintaining neutrality in the sense that we're saying, "Well, if more cloud usage turns out to be better for you, then you should absolutely go and do that." If we figure out that the economics have shifted in such a way that a mix of cloud and on-prem, or cloud and hosted resources, makes sense, then we'll propose that. It's really about finding the best solution there, and our customers are definitely asking these questions and making pretty hard calculations on that. But, I mean, it's pretty obvious. A couple years ago, we talked to an autonomous driving lab team and they said, "Well, Company A put 300,000 miles autonomously on the road last year, and we put 70,000 miles on the road autonomously. We've got to change that. How do I at least match that 300,000 miles a year autonomously on the road?" That's a direct function of how well your model works, and so on and so forth. It's a pretty clear tie-in right now.
Lukas:
What about inference? For a lot of the customers that we talk to, inference is really the dominant compute cost that they have, so the training spend is actually much smaller than the spend on inference. Do you offer solutions for inference too? Could I use Base Command at inference time, or is it entirely training? And do people ever use these DGX machines for inference, or would that just be a crazy waste of an incredibly expensive resource?
Stephan:
Yes and no, it depends on how you use it. First of all, you can use Base Command for model validation purposes. You can have single-shot runs. But some customers want to set up a server that is dedicated to inference and then just take MIG slices and say, "Well, I'll do my model validation at scale, basically. I'll do my scoring there." If you share that infrastructure across a large number of data scientists, you put your DGX to good use. There's no issue with that. We do have a sister SaaS offering to Base Command called Fleet Command. That is meant to take the output of Base Command, in the form of a container, of course, and then deploy it at scale, orchestrate it at scale, and really manage the inference workloads at the edge for our customers. It's end-to-end coverage there from a SaaS perspective.
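As a rough sketch of the MIG-slice idea, a scoring or validation process can be pinned to one slice of an A100 so that several such processes share a single GPU. The MIG UUID below is a placeholder and the model is a stand-in; real slice UUIDs can be listed with `nvidia-smi -L` on a MIG-enabled system.

```python
# Minimal sketch: pinning an inference process to a single MIG slice.
# The MIG UUID is a placeholder; list the real ones with `nvidia-smi -L`.
import os

# Must be set before CUDA is initialized, so do it before using torch.
os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

import torch

model = torch.nn.Linear(1024, 10).eval().cuda()          # stand-in scoring model
with torch.no_grad():
    scores = model(torch.randn(32, 1024, device="cuda"))  # stand-in batch
print(scores.shape)  # this process only ever sees its one MIG slice
```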

Base Command's role in tech stacks

Lukas:
In your view, based on the problems that you're seeing in the market, what functionality are customers asking for in their software layer for machine learning training that you're interested in providing?
Stephan:
That's a really good question, because it goes to the heart of the question, "What space is Base Command seeking to occupy in a theoretical stack where the infrastructure's at the bottom and something like Weights & Biases is at the top?" I would see Base Command's role as an arbiter and a broker. Almost like a bridge between a pure developer-focused, almost IDE-like perspective, and an enterprise-ready architecture. Let me give you a simple example. If you do dataset versioning — and let's say that's what you want to do with your MLOps platform — then there are many ways to version data. You can try and be smart about this, but at the end of the day, it's a question of what infrastructure is available to you. If I have an optimized storage filer underneath, my dataset versioning strategy looks entirely different than if I just have kind of a scale-out, open source storage backend. If I work with S3 buckets, then my versioning looks different than if I do that with NFS shares. The value that Base Command provides is that it abstracts that away. If you do dataset versioning with Base Command, it'll do snapshots. If you do it on a NetApp filer, it'll do other things than if you do it with different storage. But those are exactly the questions that an enterprise architect will be interested in. How do you deal with that? Just because you figure you need 50 versions of a dataset that's 3 TB large, does that mean you need to plan for almost infinite storage? No, it doesn't. We can help you translate that and make it consumable in the enterprise. I think that's a big piece that Base Command can provide, as this arbiter between the infrastructure and the API, if you will. The second thing is that, increasingly, I've seen people being very concerned about data security and governance around this. If you have sufficiently large infrastructure to deal with, then almost always you have multiple geos to deal with, and different geos have different laws about what data is allowed where at any given point in time. Just the ability to say, "This dataset can never leave France," or "That dataset has to only be visible to these three people and nobody else," is of extreme value to enterprises. All those things come into play, and I think that's where Base Command can help.
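A hypothetical sketch of the kind of abstraction Stephan is describing might look like the following. This is illustrative only, not Base Command's actual API, and every class and method name is invented; the point is that the caller asks for a dataset version the same way regardless of whether the backend can take cheap copy-on-write snapshots or has to pin a manifest of object keys.

```python
# Hypothetical illustration only -- not Base Command's actual API.
# The "make a version" call stays the same while the mechanism underneath
# differs by storage backend.
from abc import ABC, abstractmethod

class DatasetVersioner(ABC):
    @abstractmethod
    def snapshot(self, dataset: str, tag: str) -> str:
        """Create an immutable, named version of a dataset; return its identifier."""

class FilerVersioner(DatasetVersioner):
    def snapshot(self, dataset: str, tag: str) -> str:
        # On a filer with copy-on-write snapshots, versioning is nearly free:
        # no bytes are copied until something changes.
        return f"snapshot://{dataset}@{tag}"

class ObjectStoreVersioner(DatasetVersioner):
    def snapshot(self, dataset: str, tag: str) -> str:
        # On an object store, "versioning" might instead mean pinning a manifest
        # of object keys and checksums under the tag.
        return f"manifest://{dataset}/{tag}.json"

def version_dataset(backend: DatasetVersioner, dataset: str, tag: str) -> str:
    # The caller never sees the backend details -- that is the arbiter role.
    return backend.snapshot(dataset, tag)

print(version_dataset(FilerVersioner(), "imagenet-subset", "v42"))
print(version_dataset(ObjectStoreVersioner(), "imagenet-subset", "v42"))
```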
Lukas:
Are there other parts of Base Command that you've put a lot of work into that people might not realize the amount of effort that it took, that might be invisible just to a customer, even just me even imagining what Base Command does?
Stephan:
Yeah. I think that we invested a lot in our scheduler. If you look at the layout of DGXs in a SuperPOD arrangement and the nature of the jobs that go into this, I think people underestimate just how optimized the scheduler is, not just across multiple nodes, but also within the node. You're able to say, "I'm running a job with a one-GPU configuration," and then it's a slider, and you say, "Well, I'm turning this into an eight-GPU job now," and that's literally just a selection. What goes on in the background is just a lot more intricate than people typically realize. It happens automatically, but you do have to be ready for it; you have to program for it, and people know that. But as soon as you do that at your layer, the optimization underneath is just incredible.
Lukas:
What's the tricky part? Is it that you need to find eight GPUs that are close to each other and not being used, and all that? Is that the basic challenge?
Stephan:
Yeah, exactly. Data locality, caching strategies, all that kind of stuff is going straight into that selection.
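As a very rough illustration of the placement problem (and nothing like Base Command's actual scheduler), a toy version of topology-aware GPU selection might look like this; the node names and the greedy policy are invented for the example.

```python
# Conceptual sketch of the placement problem a scheduler solves, vastly simplified:
# prefer GPUs that are free *and* close together (same node) before spilling over.
from collections import defaultdict

def place_job(gpus_needed, free_gpus):
    """free_gpus: list of (node_id, gpu_id) tuples that are currently idle."""
    by_node = defaultdict(list)
    for node, gpu in free_gpus:
        by_node[node].append(gpu)

    # Best case: the whole job fits on one node and talks over the fast intra-node fabric.
    for node, gpus in sorted(by_node.items(), key=lambda kv: len(kv[1])):
        if len(gpus) >= gpus_needed:
            return [(node, g) for g in gpus[:gpus_needed]]

    # Otherwise span the fewest nodes possible (inter-node traffic is the slow path).
    placement = []
    for node, gpus in sorted(by_node.items(), key=lambda kv: -len(kv[1])):
        for g in gpus:
            placement.append((node, g))
            if len(placement) == gpus_needed:
                return placement
    return None  # not enough free GPUs; a real scheduler would queue the job

# Example: ask for 8 GPUs when node "a" has 8 free and node "b" has 3 free.
free = [("a", i) for i in range(8)] + [("b", i) for i in range(3)]
print(place_job(8, free))  # all 8 land on node "a"
```

A real scheduler also weighs data locality, caching, fairness across teams, and job preemption, which is where most of the hidden work Stephan mentions goes.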

Why crowdsourcing is underrated

Lukas:
Cool. All right. Well, we always end with two questions, both on ML. Let's see how you answer them. One thing we always ask is, what's an underrated aspect of machine learning that you think people should pay more attention to, or you would love to spend time on if you had more free time?
Stephan:
I think what's underrated is this aspect of crowdsourcing. I don't think anybody is really looking at machine learning and the potential of many small devices contributing to the creation of a model. I think that we're at the cusp of that, but we're not really doing it right now. To the degree that it already happens, it's very hidden from us. We all know Google will run some algorithms across data that was collected through people's phones. We understand that on a conceptual level, but we don't yet have the ability to bring that together in a more natural sense, where we find recommendations not on the basis of a single parameter but on the basis of more meaningful parameters. I find five-star reviews very meaningless, for example. I think that is a very simplified view of the world. Consequently, I also find one-star reviews very meaningless. But if you could actually have a more natural understanding based on machine learning, that would be an interesting topic to explore, because it would have to be based on all kinds of inputs that would have to be taken into account. I would like to see that, and I think that would be an interesting field of research and an interesting field of development. I think people still assume that it's only the prerogative of the big companies to be able to do that, but I think there's an open source project in there somewhere.
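One established shape of this idea is federated learning, where devices train locally and only share model updates rather than raw data. Purely as an illustration of the concept Stephan is gesturing at (not a technique he names), a toy federated-averaging loop on synthetic data might look like this:

```python
# Toy sketch of the crowdsourcing idea: many small contributors each train
# locally, and only model updates (not raw data) are combined -- federated
# averaging in its simplest form. Purely illustrative, synthetic data only.
import numpy as np

def local_update(global_weights, local_data, lr=0.1):
    X, y = local_data
    preds = X @ global_weights
    grad = X.T @ (preds - y) / len(y)   # gradient of mean squared error
    return global_weights - lr * grad

def federated_round(global_weights, devices):
    updates = [local_update(global_weights, data) for data in devices]
    return np.mean(updates, axis=0)     # average the locally trained weights

# Fake "devices", each holding a small private dataset drawn from the same model.
rng = np.random.default_rng(0)
true_w = rng.normal(size=5)
devices = []
for _ in range(20):
    X = rng.normal(size=(50, 5))
    devices.append((X, X @ true_w + 0.01 * rng.normal(size=50)))

w = np.zeros(5)
for _ in range(50):
    w = federated_round(w, devices)
print(np.round(w - true_w, 3))          # close to zero: the "crowd" learned the model
```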
Lukas:
Cool. I hope somebody starts that, and then they should send it to us when they do.

The challenges of scaling beyond a POC

Lukas:
Our final question is, when you look at your customers and their effort to take business problems and turn them into machine learning problems, then deploy them, and solve those problems, where do you see the biggest bottleneck? Where are they struggling the most right now?
Stephan:
The biggest issue they have — at least as far as I can tell — is a getting-started issue in the sense of, "How do I scale this beyond my initial POC?" The prefab solutions that are out there are pretty good at walking you through a getting-started tutorial, and they'll probably get you really far if you're a serious practitioner and you devote some time to it, but at some point you'll hit problems that may not even have anything to do with ML. They may just have something to do with the infrastructure that's available to you, and things like that. I think that anybody who is trying to use this for a commercial, strategic business purpose is going to run into an issue sooner or later of, "How do I go from point A to point B here?" People call it something like AI DevOps, or some similar term that's floated around. I think, as an industry, we should be aiming to make sure that that job never comes and sees the light of day.
Lukas:
Too late, I think.
Stephan:
Yeah. I feel like we lost on that one already. But I really think we should do better. You shouldn't need super special skills to create this whole DevOps approach around AI training. We should really know better by now how that whole approach works, and then build products that drive that.

Outro

Lukas:
Awesome. Well, thanks so much for your time. I really appreciate it. That was fun.
Stephan:
Thank you.
Lukas:
If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description where you can find links to all the papers that are mentioned, supplemental material, and a transcription that we worked really hard to produce. So check it out.