
Tristan Handy — The Work Behind the Data Work

Tristan explains the rise of the modern data stack, how dbt makes data transformation easier, and why SQL is still so popular.
Created on June 2 | Last edited on June 9


About this episode

Tristan Handy is CEO and founder of dbt Labs. dbt (data build tool) simplifies the data transformation workflow and helps organizations make better decisions.
Lukas and Tristan dive into the history of the modern data stack and the subsequent challenges that dbt was created to address; communities of identity and product-led growth; and thoughts on why SQL has survived and thrived for so long. Tristan also shares his hopes for the future of BI tools and the data stack.

Timestamps

0:00 Intro
0:40 How dbt makes data transformation easier
4:52 dbt and avoiding bad data habits
14:23 Agreeing on organizational ground truths
19:04 Staying current while running a company
22:15 The origin story of dbt
26:08 Why dbt is conceptually simple but hard to execute
34:47 The dbt community and the bottom-up mindset
41:50 The future of data and operations
47:41 dbt and machine learning
49:17 Why SQL is so ubiquitous
55:20 Bridging the gap between the ML and data worlds
1:00:22 Outro


Transcript

Note: Transcriptions are provided by a third-party service, and may contain some inaccuracies. Please submit any corrections to angelica@wandb.com. Thank you!

Intro

Tristan:
The thing that dbt does is try to get to a ground truth that everybody inside of an organization can agree on, so we can at least have productive disagreement.
Lukas:
You're listening to Gradient Dissent, a show about machine learning in the real world. And I'm your host, Lukas Biewald.
Today, I'm talking with Tristan Handy, who is the founder and CEO of dbt Labs. dbt, for those of you who don't know, has gone from an open source project to one of the most critical components of the modern data stack in under four or five years. It's been incredible to watch from the outside and I was excited to talk to him about it.

How dbt makes data transformation easier

Lukas:
You probably are the first person that isn't kind of actively working in the ML field, but data is so critical and tangential that I thought you'd bring a really interesting perspective; you might need to make it a little more basic, you know, for our audience.
So I thought I would start out by asking you to describe what dbt is. Because in your world, it's a really famous, well-known product, but I think for a lot of ML people, they might not even know what it is.
Tristan:
Yeah. Gosh, I sometimes get challenged to answer this question from an "Imagine my aunt is on the other end of the conversation" perspective, and it's really challenging.
Lukas:
I know the feeling.
Tristan:
Right? "I do data stuff".
So, if you're in kind of more traditional BI and analytics, your world has changed very significantly over the past 10, 12 years. Really driven by the rise of the modern cloud data warehouse.
Now everybody has access to high-performance, scalable, SQL-based compute. You can just throw data in there and, by and large, queries are just fast. And there's this whole ecosystem of stuff that has risen to get data in, to organize the data that's in the warehouse, to report on it, et cetera.
We had to kind of...the whole industry had to kind of rebuild all of its tooling around the cloud data warehouse, because the way that stuff worked before was all constrained around speed and size of data.
The problem that dbt addresses is that now there's this massive profusion of different types of data that show up in the warehouse. You get Fivetran or other products that... their whole job is like, "You push a button and now a whole new data source shows up".
But it shows up in exactly the format that it lived in, in the source. So you connect Facebook Ads and you get 150 tables that map one-to-one to like a Facebook Ads endpoint.
And so then you — as a data person — you need to figure out like, "What the heck is even in there? How do I organize this in a way that is useful to do some reporting for my end data consumers?"
dbt creates a new workflow around how to do that work.
It's very code-first. It takes DevOps principles as kind of its founding ideas. And it is open-source, open-core, first commit happened about six years ago. Over the past six years, there's been a pretty large community that's grown up around it.
Lukas:
And I guess what does it actually do to address that problem?
Tristan:
So, how was this problem solved before?
Before, you couldn't rely on the warehouse to do all this work because the warehouse was constrained. So you had these intermediate environments.
A lot of times, you know, you had commercial products that sprung up to do traditional data transformation that happened before you loaded it into the eventual warehouse.
The big insight behind dbt was that the warehouse now is performant and scalable enough to just do it all itself. And what that means is that if you want to get access to the compute that lives in the warehouse, you had to, at least traditionally, express your jobs in SQL.
And so dbt is essentially a framework for programming...to build data pipelines in SQL. You write in the SQL that is native to your database, and then it has a Jinja layer (a templating layer) over top of that SQL, so that instead of just having this collection of random SQL scripts on your hard drive, you have a framework that you can plug into.
You have references, so you can build DAGs out of SQL files seamlessly. You have environment variables, you have CI/CD, all of these things that you would expect from a programming framework.
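A minimal sketch of how references in templated SQL can yield a DAG, as Tristan describes. It is not dbt's actual implementation: the model names, the regular expression, and the `find_refs`, `build_order`, and `render` helpers are all hypothetical, and Python's standard-library `graphlib` stands in for whatever graph tooling dbt itself uses.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical in-memory "project": model name -> templated SQL.
# In a real project these would be .sql files on disk.
MODELS = {
    "stg_ads": "select * from raw.facebook_ads",
    "stg_orders": "select * from raw.orders",
    "ad_revenue": (
        "select a.campaign, sum(o.amount) as revenue "
        "from {{ ref('stg_ads') }} a "
        "join {{ ref('stg_orders') }} o on o.campaign = a.campaign "
        "group by a.campaign"
    ),
}

# Matches {{ ref('model_name') }} with single or double quotes.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"](\w+)['\"]\s*\)\s*\}\}")


def find_refs(sql: str) -> set[str]:
    """Return the names of every model referenced in a templated query."""
    return set(REF_PATTERN.findall(sql))


def build_order(models: dict[str, str]) -> list[str]:
    """Topologically sort models so each one runs after its dependencies."""
    graph = {name: find_refs(sql) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())


def render(sql: str, schema: str) -> str:
    """Replace each ref() with a schema-qualified relation name."""
    return REF_PATTERN.sub(lambda m: f"{schema}.{m.group(1)}", sql)


order = build_order(MODELS)
print(order)
print(render(MODELS["ad_revenue"], schema="analytics"))
```

The point of the sketch is that the dependency graph falls out of the references themselves: nobody has to maintain a separate list of which query runs first.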

dbt and avoiding bad data habits

Lukas:
It's funny. I feel like from my perspective, running a tech startup where I'm trying to get official records of data on all these different topics, it seems incredibly obvious to me that something like this is needed. But I wonder if my younger self-
Tristan:
-do you want to do the big reveal of how you first got exposed to dbt?
Lukas:
Oh, should we go down that path? Sure, let's do it.
You were one of the very best consultants we've ever hired. You came in and did our analytics, and it was funny because I actually edited your SQL queries that you wrote, quite a bit. And I should say I learned a lot of SQL from you.
I felt like you were the first time...working with you was kind of the first time I saw...I mean, I kind of learned SQL as a side thing in school. And then I used it a lot, you know, as...I think when you're CEO, SQL is the language that you end up writing the most stuff in.
And so I think I kind of went down a bad path. SQL kind of lets you start to write it like you might write blocks in Excel, where you just start stuffing more and more chaos into your queries.
One thing that's actually notable about working with you is you really pulled out each piece into its own named section, which I didn't even realize some of those things you could do in SQL.
Tristan:
I mean, generally either you can't at all or SQL doesn't make it easy.
I think you experienced part of the magic that data people experience when they use dbt for the first time. They're like, "I've been...for however long I've been using SQL, it's looked like garbage and you've given me some more structure in the language and I can now engineer it in ways that actually make sense."
Which is...for many of us, we're used to thinking in OO, or functional, or whatever, like programming paradigms, and then SQL becomes very frustrating because you can't actually organize your code in these similar ways.
Lukas:
Yeah. I think from my perspective, if I could write a love letter to dbt as someone who doesn't actually use it, but sort of sees the results of it on my organization...you might not realize how much complexity enters your data pre-processing.
We have a lot of people that come in and use our product as students, and there's sort of different ways to get at that. But we often want the students sort of outside our analysis of leads for sales, from that perspective.
And there's a lot of different ways you could kind of cut who's a student, but it's really helpful to have one official way that's really good, and just kind of nail that down, and then let everybody operate off of it.
It feels like one of the big benefits of dbt for us at Weights & Biases is that we're able to kind of standardize all these intermediate steps and have an organized way of...yeah, just standardizing on these things, which I think has made us operate much better as a company.
Am I off on...
Tristan:
No, you're totally right. The way that we talk about this is curating knowledge inside of an organization.
It used to be that, like, in our wetware, we used English to pass knowledge on to each other. And then somebody would write a SQL query for themselves based on their kind of imperfect understanding of who was a student.
And now there's a way to actually take that knowledge and encode it. Then you can just forget about it as an organization until you say, "Hey, how do we do that?" And then you just look back at the code and you can even look at the git blame and you can say like, "Well, here's how we arrived there."
Lukas:
Right.
And if you don't do that, you end up with all these different — slightly different — versions of "what's a student", and it doesn't match, and it's totally bug-ridden.
I feel like dbt has made a big difference.
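A minimal sketch of the "one official definition" idea, using Python's built-in `sqlite3` rather than dbt or a cloud warehouse. The `users` table, its columns, and the rule for who counts as a student are invented purely for illustration; they are not how Weights & Biases defines a student.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT, plan TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        (1, "ana@university.edu", "free"),
        (2, "bo@company.com", "team"),
        (3, "cy@college.edu", "academic"),
        (4, "di@startup.io", "academic"),
    ],
)

# The one agreed-upon definition (hypothetical rule, for illustration only).
conn.execute(
    """
    CREATE VIEW students AS
    SELECT id, email
    FROM users
    WHERE email LIKE '%.edu' OR plan = 'academic'
    """
)

# Two different reports reuse the definition instead of re-deriving it.
student_count = conn.execute("SELECT COUNT(*) FROM students").fetchone()[0]
sales_leads = [
    row[0]
    for row in conn.execute(
        "SELECT email FROM users "
        "WHERE id NOT IN (SELECT id FROM students) ORDER BY id"
    )
]
print(student_count, sales_leads)
```

If the definition of a student changes, only the `students` view changes, and every report that reads from it picks up the new rule.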
Tristan:
Here's the funny thing about that project from my end, from my experience. I was working with you folks and you know about machine learning. And it's so cool, and trendy, and you can do magic stuff with it.
And here I am. I'm close to the business, I come at data from a...I understand the business, so let me get into the "asking questions about it" perspective.
To me, what I do feels not that complicated. I mean, at least not that technically complicated. And so it feels like people who know about something as - you know, this is my internal monologue - something as complicated as machine learning, how can they not have this kind of basic stuff figured out?
But I think that...for whatever reason, it's not a thing that ML folks have widely dem...I have my theories on why this might be true, but I'm curious if you have any thoughts on why that might be true.
Lukas:
Well, I think one thing that shouldn't be underestimated is that most people in ML have a lot of academic training. Like, just a lot of ML comes from academia. I think more than almost any other field.
Knowledge gets passed down in academia quite a bit. It's starting to change, but I think it's still, you know, people are going to original papers and kind of learning through professors, and that.
And I do think that academia teaches you incredibly bad habits, right? I think everyone kind of coming out of there has to unlearn a lot of things, including myself. Because if you think about academia, you're trying to get to an interesting result and then you never have to make iterative progress from there.
Whereas in work, most of what you do is iterate. And so, you really want things to be kind of stable and contained and clear, whereas in academia, most of what you do gets almost immediately thrown away.
You're sort of racing and kind of not...you write a lot of throwaway code. You don't think a lot about structuring your code and then you especially don't think about making your data pipelines stable and consistent.
Because I think a lot of ML-
Tristan:
-which is how Jupyter notebooks end up becoming data pipelines.
Lukas:
Exactly, exactly. I totally understand how it happens. And as a CEO of a growing startup, it drives me nuts, right?
But I actually kind of come from that lineage too. So I think I've had to unlearn a lot of these instincts as well.
I also think it's actually like a real skill - that I'm still working on - to make good data pipelines for a company.
Every query is more complicated than you think it is at first blush. And I think a lot of these choices, it's harder to do extremely agile iterative development. A lot of these choices that you make have long-lasting repercussions and need to be considered, and it's more important to get it right the first time for some of these things.
Tristan:
We started - when I say we, I don't mean the company, but the dbt community - started using this term "analytics engineer" for the people that use dbt and...or like do their work in the way that dbt teaches you to do your work.
And I think it really gets to this dichotomy where...there are data analysts who use the tools of data analysis to come to some net new results. And in that world, it's actually completely fine if your code looks like garbage, if you can't...it's just like, "Poke around until you find something interesting and then wave your hand and be like, 'Hey, does anyone else find this interesting?'"
Whereas analytics engineering is this thoughtful effort to slowly construct reality for the business.
My favorite example of this is actually...this was a consulting project. I was working for a full-stack grocery delivery company. And I had to help them calculate the Cost of Goods Sold for, like, an individual batch of green onions.
It turned out to be an incredibly challenging problem. And in many ways, deeply unsexy.
But it was so fun to me to like...now, every single time a picker picked a thing of green onions out, we knew exactly how much cost to allocate to that.
Lukas:
Yeah, totally.
And I think it's funny how CFOs come from a totally different lens from ML, where they really want things to be precise and accurate and consistent and traceable.
I mean these Cost of Goods Sold calculations always end up being...to get it that precise — which I understand why Finance wants that — is often in deep tension with the sort of exploratory data analysis that's also important.

Agreeing on organizational ground truths

Lukas:
Well, I had a question for you that I really wanted to ask. Which is, both of us run companies that are kind of hard to explain to our aunt or uncle, right? Kind of behind the scenes in helping a lot of things happen.
But I think one thing that both of us share is we really are passionate about the impact on the world. And we're kind of in this maybe, you know, more for the impact than the financial gain. I don't want to put words in your mouth, but that's my sense of you.
I'm curious how you think about the impact of the work that you do or how you articulate it to prospective employees or the world.
Tristan:
Yes, I agree with you and I'm like "Okay, let's go there." But actually no one asked me this question.
From a commercial perspective, our mission is to help data analysts and help them curate and disseminate knowledge inside of organizations.
But if I broaden the lens and think societally — and there's a lot of tech where we like to talk about making a dent in the universe. I think that's overplayed sometimes, I try to be a little more humble than thinking that we are going to somehow impact the trajectory of the universe — but when I frame it like that to myself, I am deeply concerned with our epistemic reality as a world today.
We don't need to go too deeply into politics, but there's been a lot of interesting conversation happening at the national or international...this is not just associated with the United States, but where people disagree on basic realities of what is true.
And because of that, we actually have a hard time having conversations or having productive debates. Maybe some of that's in good faith, maybe some of it's not in good faith. But whatever...the thing that dbt does is try to get to a ground truth that everybody inside of an organization can agree on, so we can at least have productive disagreement.
I don't know that there's some way to magically organize all structured information in the entire...okay, maybe that's beyond what we will ever get to as a company.
But it does motivate me to think that the world that we are working on, figuring out the epistemic reality inside of organizations is actually a big problem for the entire world right now.
Lukas:
Interesting, great answer.
Tristan:
Is that what you were expecting?
Lukas:
No, not at all. It's a really interesting answer. I'm just contemplating it. I think it's a great way of looking at dbt.
I always don't want to be the caricature of a startup CEO saying, "We're changing the world with better MLOps," but at the same time we are changing the world with better MLOps, and I do feel proud of it myself.
I don't want to come off like a blowhard, but I also do feel really proud of the work that we do. I think it makes a small dent, you know, a small dent in the universe. And I don't want to be falsely humble either, when it feels good to help out all these customers working on really, really exciting things.
But I think you have such a specific, interesting answer. That's such a great way of looking at what dbt does.
Tristan:
Talking about the customers building cool stuff, there's this funny conversation going on inside of our community these days, where a lot of folks who used to be practitioners have gone over to the founder side. They've gone over to the dark side.
So when it used to be all of these practitioner-to-practitioner conversations, now it's a bunch of tool vendors hyping their own stuff.
I'm a little bit jealous of...I would love to actually go back to the other side of the fence. Maybe at some point we'll get the opportunity to rejoin the people who are actually using the shovels as opposed to making the shovels.

Staying current while running a company

Lukas:
I guess here's another question that I think about a lot. How do you stay current without working on this stuff?
For both of us, I imagine it's important to keep doing a little bit of the task. It's very hard for me to learn about machine learning in theory without practicing it.
I'm always really trying to carve out time to train new models and try out new things that are coming out. But, you know, the urgent needs of running a fast-growing company encroach aggressively in that time. How do you think about that?
Tristan:
Yes. This is something that concerns me a lot.
I think that I might be in a slightly easier position than you.
You can summarize a lot of the characteristics of our world based on the evolution of the data platforms that all this stuff runs on. You can summarize that in like Price Per Performance and these kinds of characteristics.
Fundamentally, SQL basically does what it has done for fricking 40 years or whatever. And then the tooling on top of it. There's areas in our ecosystem that have a lot of movement: data observability, data quality, cataloging. These kinds of things are very fast-moving right now. And then maybe there's another, like a next wave of data analysis products that are coming out.
I end up staying on top of stuff by curating a newsletter. I have - for six and a half years now - published a newsletter called...that now is called the Analytics Engineering Roundup. It goes out every week. I write half of the episodes or the issues.
It is this really great accountability tool to make sure that I actually have something new to say every two weeks, because otherwise it's incredibly easy if no one's...when 15,000 people are going to read the thing that you just put out there, you feel a lot of pressure to say something correct and novel and interesting.
But otherwise it's very easy to not invest that time.
Lukas:
Totally. It's funny, I actually use those external forcing functions too.
They're so effective and I always get really nervous before I have to put out something like that. Or I sometimes set up talks with topics that-
Tristan:
-you don't know the answer to yet.
Lukas:
-I don't fully know about yet, I need to force myself to figure it out.
Sorry for those of you that have watched those talks and thought I didn't look like I knew what I was talking about.
Tristan:
Well, sometimes it turns out that they go great, right?
Lukas:
Totally.
Tristan:
And then every once in a while you're like, "Ah, that wasn't perfect."
Lukas:
I feel like sometimes if I give the same talk too much, I find myself getting bored in the middle of the talk. And then I feel so sorry for the audience, because I figure if I'm getting bored, the audience must be bored out of their minds.
Tristan:
I have a tremendous amount of respect for professors, for teachers, who keep the energy level up delivering the same stuff over and over again.
Lukas:
Totally.

The origin story of dbt

Lukas:
Okay. Well, tell me about starting dbt. I'm sure everyone asks you that, but it's such an interesting question.
I'm curious what you were thinking when you started it. Was it just a rocket ship from the beginning? Or was there kind of a moment where something changed, and this started to really build traction?
Tristan:
The origin story of dbt is that I was burnt out from venture-funded startups. I'd worked at three of them.
I think that, as a community, venture-backed startups are getting a little bit better about work-life balance. But inconsistently so. Certainly back in 2015, that was not the case at all. I'd been working for 11, 12 hours a day for like 7 years.
It was like, "Okay. I'm done with that, and I really want to go back to data. I had started my career in data, and then I'd gone to different...I wanted to get back to actually having a pure data job."
And so I was like, "How do I do this? And how do I do it from Philadelphia?" Because I'm married and my wife has a cool job and she's like, "We're not moving."
I decided to start a one-person consulting shop and I was just going to help companies implement what became known as the modern data stack. So, a data warehouse, a data ingestion tool, a BI tool. And I was going to help them do their internal analytics.
The thing that was clearly missing to me was data transformation, which had been a part of how the stuff had been done in the past, but there wasn't a modern data stack solution for it.
I got my friend and coworker Drew Banin to help me build the early versions of dbt. Not that many hours were put into the initial versions of dbt; dbt is not that complicated.
Drew joined and we started using it on consulting projects. It was really our consulting clients who got exposure to dbt. They said, "Hey, I want to start using that tool." And so they would train their internal people.
The big locus of where the community came from was back in 2016, Casper got turned on to dbt and they were kind of a big deal in the New York tech scene at the time. They told their friends and so Kickstarter and et cetera. It was a New York tech thing.
If you look at the graph — we do anonymous event tracking inside the open source product — if you look at the graph of the number of different organizations using dbt over time, that graph has grown at 10% every single month for five and a half years now.
It does feel like-
Lukas:
-can I interject? I'm so jealous, I'm so jealous. That's amazing.
All right, go on.
Tristan:
At the beginning, we didn't even focus on it because we didn't have a way to make money off of that. It was just like, "Whatever, that's cool that the community is growing."
And then we got to a point where we grew from 300 to 1000 companies using it over the course of a year. That's when the Fortune 500 companies started calling us and were like, "Hey, we'd like to buy stuff from you." And we're like, "We don't have anything to sell you."
That was when we kind of changed directions and became more of a software company. But there was no single point where it all came together. It was just this...people underestimate the power of exponentials over long periods of time.
Lukas:
Totally.

Why dbt is conceptually simple but hard to execute

Lukas:
I guess another funny thing about dbt is that it seems so conceptually simple, doesn't it?
It's funny...I feel like these are mean questions. I was asking the Spark founder "What makes Spark complicated?" and "What makes Ray complicated?"
All these things, at their core, seem simple. What makes dbt hard to build?
Tristan:
The simplicity is...I don't want to take credit for that, but I think that one of our main driving product goals is to be simple.
Lukas:
Who else would take credit for that? Can't you take credit for that?
Tristan:
Mitchell Hashimoto should take credit for that, because it's a straight-up copy of Terraform.
The user interface paradigm is...my other co-founder, Connor, was an infrastructure engineer at our last company together. I was telling him about this need that I had. And he said, "Have you ever seen Terraform?"
This was back in 2016, so Terraform is still kind of new and cool. It was like, "Let me show you this thing." He showed me the HCL behind it, and then he did a tf-apply and I was just like, "Holy shit. That's really freaking cool."
Once you've seen Terraform and you've used it, you're just like, "Well, obviously that's how I'm going to do that moving forwards."
That was the product goal of dbt at the outset. It was Terraform for analytics.
On some level, what dbt does is it takes SELECT statements that are inside of .sql files on your machine and it wraps them in CREATE VIEW or CREATE TABLE AS SELECT statements. And then it does some DAG processing with NetworkX and Python.
On some level, that is actually quite simple. The hard parts come in when — there's a lot, and I'm not the person who built it, so you're going to hear it passed through a less technical person's mouth — Jinja is really meant to be used as a web templating language. It's meant to process one HTML page at a time, like request/response. And in that context, it works quite performantly and all is well.
In dbt, because all of your pipelines together make a DAG, what dbt has to do is it has to read all of them at startup time in order to understand the shape of your entire DAG, so it can know what work it needs to do.
If you have 50 of these, that's not a problem. But we have users who have thousands of these. And it turns out that it's quite challenging to read thousands of files from disk and operate on them in a way that feels interactive to a user on the command line.
I think the team last year was four people. We spent four person-years of engineering time last year almost exclusively on performance.
So that's an answer. There's many answers to...once you go deeper and deeper down this hole, and I'm sure you've experienced this too. Sometimes the decisions that you make early on in the process of building something, you come back to later and you're like, "Wow. Gosh, I didn't realize what a bad idea that was going to be."
Yeah. It's a constant iteration cycle.
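A minimal sketch of the wrapping step Tristan describes, where a bare SELECT becomes a CREATE VIEW or CREATE TABLE AS SELECT statement. It runs against an in-memory SQLite database so it is self-contained; the `wrap_select` helper, the `materialized` argument, and the model names are assumptions for illustration, not dbt's real internals.

```python
import sqlite3


def wrap_select(name: str, select_sql: str, materialized: str = "view") -> str:
    """Wrap a bare SELECT in the DDL that persists it as a view or a table."""
    if materialized not in ("view", "table"):
        raise ValueError(f"unsupported materialization: {materialized}")
    keyword = "VIEW" if materialized == "view" else "TABLE"
    # Drop surrounding whitespace and any trailing semicolon before wrapping.
    return f"CREATE {keyword} {name} AS {select_sql.strip().rstrip(';')}"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 10.0, "paid"), (2, 25.0, "refunded"), (3, 5.0, "paid")],
)

# Hypothetical models, already listed in dependency order.
models = [
    ("paid_orders", "SELECT id, amount FROM raw_orders WHERE status = 'paid'", "view"),
    ("revenue", "SELECT SUM(amount) AS total FROM paid_orders", "table"),
]

for name, select_sql, materialized in models:
    ddl = wrap_select(name, select_sql, materialized)
    print(ddl)
    conn.execute(ddl)

total = conn.execute("SELECT total FROM revenue").fetchone()[0]
print(total)
```

Here the dependency order is written out by hand for two models. Working that order out automatically across thousands of files, fast enough to feel interactive, is the hard part Tristan points to.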
Lukas:
How about documentation and API names and things like that?
How do you feel about how well you've done on that? That's always something that I reflect on with Weights & Biases.
Tristan:
Oh, we're not great at that today.
Our...your APIs, your whole product is a commercial product, right? You don't have open source surface area?
Lukas:
We do, actually. We have a client that's open source, and then the APIs are...anyone can call the APIs and pull stuff out. But yeah, the client is open source. It could go anywhere.
Tristan:
So, we have this funny thing where we have two different types of users.
We have users who tend to be less technical. There are people like me, who their primary language is SQL and maybe some scripting and stuff like that. And then we have contributors, and that group is much smaller. They tend to be data engineers and not data analysts.
We have historically prioritized the needs of users over the needs of contributors. And that has meant that we have — whether it's in the open source context or in our cloud product — we've historically under-invested in clean APIs.
The open source product really exposes itself as the CLI. If you try to get in there via Python and call stuff directly you can, but we don't make any guarantees about the stability of those APIs. So we need to improve there.
As we mature as a commercial business, we're increasingly taking the needs of data engineers seriously too, because dbt is increasingly this mature piece of data infrastructure inside of the companies that use it.
Documentation and API design are very front-and-center in our world today.
Lukas:
Is it a command and control style management to keep the names consistent and things like that? How do you source community ideas and yet keep predictable names and things like that?
Tristan:
I don't know that we've dealt with the name thing as much, but I will say that we're not especially good at getting groundbreaking new contributions from the community.
We have a real design ethos, the product is designed in a certain way. And it can be challenging for folks who aren't a part of all of these conversations about this to do big new things.
I will say that we have done a better job over time of carving off spaces of the product that are much safer to get external contribution on. So we now support a dozen or so database adapters. And increasingly it is the vendors for those database adapters that maintain their own adapter. That's a very well-defined surface area.
I've never run an Apache project, but I have a lot of empathy for people who are trying to run open source projects without a benevolent dictator for life. It's legitimately very hard to work through these kinds of things purely in GitHub issues or things like that.
Lukas:
Totally. And you wonder if the outcome of that kind of consensus building might not be as good as if somebody is just appointed, like "You make the call and drive forward."
I'm not saying this is necessarily better, but it's something that we think about at-
Tristan:
-it certainly takes more work.
Lukas:
Yeah, for sure.

The dbt community and the bottom-up mindset

Lukas:
I want to make sure I ask you about your community because you're so well-known for the quality of your community.
Can you talk about what you do in community building and why they're even...I feel like a priori, you might not even expect there to be such a vibrant community around a tool like dbt. How did that happen?
Tristan:
I think it is very interesting.
I want to have some epistemic humility in terms of...I don't know. I have my own guesses as to why this happened, but community is an emergent phenomenon. I think you could ask different people and different people would have different stories.
Here's my belief.
I think that there have been multiple decades of data people being undervalued. The tools that are built for them underestimate their capabilities and lock them in. So you're less willing to give back to a company that feels like maybe it doesn't have your interest at heart.
For the first time, I think we said to data people that, "We believe you're very capable. We think that there's this new way that you can work, and here's the little seed of a tool that will help you do that."
And I think that people...I think all communities are really communities of identity. They have to feel seen and recognized. That's what creates loyalty.
I think that that's why data people — especially early on, but still today — feel a deep affinity for the dbt community. Because it's the place that they feel like they're really seen and they're not underestimated.
Lukas:
Interesting. That seems very plausible to me.
I don't think ML engineers are maybe as historically disrespected in organizations — maybe they're kind of put on a pedestal — but I think Weights & Biases was one of the first companies with the point of view of, "Hey, we're going to really serve this specific group."
Where I think most of the earlier MLOps tools came with a more top-down mindset of, "We're going to sell into CIOs and sell high in an organization."
And I think whoever you sell to really ends up controlling your product direction, is what I've seen.
Tristan:
Totally.
Okay. We do top-down sales at this point too, but it will always be a complement to a bottoms-up, community-led motion.
It feels very surprising to me (and maybe it's just because I don't understand the full ecosystem as well as I'd like to) that not all companies today in data are started with a bottoms-up motion.
It's so much more fun to build a business like this, right?
I want to build a good product for CIOs. I want them to value what we do. But I want to spend my time talking to people that do the work.
It's just more fun.
Lukas:
I feel exactly the same way.
Tristan:
Why do you think there's still so many companies that build tools that intend to be top-down?
Lukas:
Well, I think that building a company that sells lower in an organization first is a slower road.
Tristan:
Yeah.
Lukas:
People have less budget. And so in a smaller market where you need to do bigger deals, it might be necessary to sell higher in an organization.
My first company, CrowdFlower, intuitively started off with a bottom-up sale, but towards the end really ended up serving folks higher in the organization, just because I think the ML market at the time was smaller. So you couldn't do it.
I think me and you are much more of the temperament to sell to the people that are actually doing the work as I think of it, but...
Tristan:
Maybe it is a market maturity thing.
I think that the places in our space that are generally a little bit more top-down are things like governance and cataloging, things where you need a lot of standardization.
Maybe there's a compliance buyer, things like that.
Lukas:
Do you think of Databricks and Snowflake as a top-down or bottom-up sale? Is it obvious what they are?
You can kind of get started off the website, but I sort of view them as doing more of a top-down sale from my perspective. But you would know better than me.
Tristan:
It's an interesting question.
When I think about this, I think about...when a salesperson engages at a company, do they have to educate that buyer on what their thing does in the first place? And for Snowflake and Databricks, I think by and large, their buyers already know who they are.
The job of the salesperson in that context becomes partnering to make sure that...there's like a million hurdles that will prevent you from effectively using Databricks, or any data platform. So the salesperson almost has to just project manage their way through both the consensus-building process and the actual implementation process.
But I think that sometimes when you go to buy a data governance tool, it's like, "Well, I don't know. What governance tools exist? Well, let's research them."
I would much rather come in...when we talk to data leaders, they're like, "Yeah, we know dbt, we heard you on the a16z podcast." They probably already have some people who have tooled around with it internally.
It's such a more fun conversation to have.
Lukas:
Totally, totally.
Well, it's hard to do that. You've made a product that many, many, many people use. Growing 10% every month, this puts you in a rare category of growth.

The future of data and operations

Lukas:
Do you have thoughts around where the data world is going? What parts of the stack are likely to change in the future?
Tristan:
Gosh, that is a very big question.
I just spoke to...I wrote a blog post at the end of 2020 that made five predictions. By and large, I think that those stand up pretty well, but I think there's a new set of things that probably needs to be written.
I just talked to a company that is building a layer that allows you to turn your data warehouse into a transactional data store. That is very interesting, because if you think about all these SaaS products that have been built over the past 15 years, each of them has their own separate data store.
You have all this data engineering to do to make sure that the right data is in Salesforce. And then that the Salesforce data comes back over into Zendesk. It gets a little silly.
You could imagine that, "Well, we've centralized all of our organizational data with these data pipelines, which were initially built for analytics, and the data warehouses themselves are primarily built for analytics too, but what if we could have another data store that sat on top of it that had more transactional capabilities?" And would allow you to have lots of queries per second and good insert and update times.
Not just that capability, but the idea that the data warehouse will stop being just for analytical use cases and be for operational use cases, I think is a very interesting thread to pull on.
I have no insider knowledge here whatsoever, but my guess is that Snowflake and Databricks would love to invest in technology to...if you look from the outside, Snowflake has changed its messaging over the years from being a data warehouse to a data platform. Now it's a data cloud.
The game in compute is you want to handle more and more and more and more workloads. I think there's a lot of reasons that we as data professionals, should like that. Because it means that we wouldn't just be doing things in service of analytics. We could actually be a part of the product development organization side of companies too.
Lukas:
Wouldn't latency need to come down to do that? You're talking about being literally something that the product actually queries in production.
Tristan:
Totally.
So, imagine...I think there's different ways to do this and I've heard different proposals, but imagine that there's a caching layer on top of the warehouse. It's using replication to get a very consistent state of the world; maybe there's a small lag behind the data warehouse information.
You could imagine latency that actually was acceptable for a production application use case.
Lukas:
I see. Interesting, interesting.
Tristan:
There's VCs that are all over...Martin Casado, who's on our board, is very bullish on this trend. Tom Tunguz was writing about this two years ago.
I've always wondered like, "Okay, but, but the data warehouses don't actually...they can't service that type of query pattern today." But maybe if you just like wave a magic wand and you're like, "Somebody's going to fix that," then you could see some interesting things happen.
Lukas:
Interesting.
It's funny. One space that I think is kind of unsexy to VCs, but still seems surprisingly broken to me is BI tools.
I guess that's part of the stack, but it's just funny. I think so much money has gone into it. Every company uses it. There's like clearly a market there. But I feel like I haven't seen a lot of new things happening and it's still quite a frustrating experience as a CEO.
Tristan:
I do some very, very small-scale angel investing and that is the area I'm most interested in.
I agree that many of the BI or analytics layer products that most companies use today were started roughly 10-ish years ago. Which in the world that we are operating in is kind of a long time.
Lukas:
Totally.
Tristan:
That doesn't necessarily mean that there's anything wrong with them, but I do want to see new takes arise.
I think that that's starting to happen. I think that sometimes it is because in the same way that Redshift kicking off the wave of the cloud data warehouse changed the priors for "What has to be true for me to make an application that looks like this?", dbt changes those priors again.
If you're building a BI tool, you can just assume that somebody is going to have a dbt project. You can actually plug into the graph and you can know a bunch of information about somebody's data before they've done literally anything in your product.
Lukas:
Interesting.

dbt and machine learning

Lukas:
Well, we've talked a lot about data, but this is...I mean, ML is so closely related to data. I'm curious, is ML relevant to your company at all?
Do you have any people working on ML internally? Do you think about ML when you think about what dbt should do?
Tristan:
There are things that we care about — from an ML perspective — that we have not yet gotten to.
They are frequently in the realm of developer experience. We have an IDE — a browser-based IDE — that we sell to companies, and there's a lot that you can do in that context to reduce the time to get from Point A to Point B.
We have access to a lot of exhaust that comes out of the millions of dbt jobs that we process. And it would be great to use some of that to predict good and not-good patterns for the way that you've built your DAG, written your code.
None of these are things that we've...we operate solely in the land today of building developer tooling using very traditional approaches. But this stuff is not so far around the corner and I'm excited about it.
Lukas:
Cool.

Why SQL is so ubiquitous

Lukas:
One more question before we get to the last two. How do you feel about SQL? It's been such...I feel like of all the computer languages, it's survived the best.
I feel like everyone knows SQL, everyone uses SQL. Something must be really good about it, I think.
Do you think it became a standard early and has just sort of stuck around as a standard despite its flaws, or do you feel like there's like some brilliance in it that makes it work? And do you wish that it would be replaced by something more modern?
Tristan:
Standards are really interesting.
I don't know that there's a technical answer as to why TCP/IP and HTTP are like the founding protocols of the internet. I think that they worked well enough and people consolidated around them and then you have an ecosystem and there's-
Lukas:
But wait, but wait. Languages don't usually work like that, right?
I feel like the languages that I learned in school, even now they're not mostly...I learned Perl, that was the thing to use. And you don't see that much anymore.
Tristan:
That's a great point, but I think that what happens with these protocols, with TCP/IP and HTTP is that they get baked into products.
They get baked into the Apache web server, they get baked into...et cetera. And that has network effects because when all the other vendors support this thing then, "Well, we got to support it too." And then everybody just kind of agrees, "Okay, this is good enough."
With a language, with Python, you run it yourself. You don't need it to be executed anywhere else. Every individual engineer or engineering team can kind of choose Python or Go or TypeScript or whatever. And they get to make that decision without any network effects being involved at all.
But SQL is more like HTTP than it is like Go because you, as the person choosing to write it, are not controlling the execution environment. You buy a database and there's only a certain number of databases and they all use SQL. Well, maybe not all of them. But, by and large, most of them, historically.
So not only are there these network effects around, "Because the vendors support it, then I have to learn it," but then there's the return network effects where like, "Well, because everyone knows SQL, I am also going to build a product built around that."
Snowflake could have said...Snowflake was a brand new database. In 2012, they could have said, "We're going to invent our own language." But that doesn't make any sense because Tableau already works with SQL and everything already works with SQL.
Lukas:
I guess it's funny. Java or the JVM has some of that. And then you see stuff like Scala getting written on top of that, or compiling down to that, and yet everything that compiles down to SQL is just enraging.
I feel like every time I've used a higher level on top of SQL, like all the different versions, I feel like I've tried them and something about it-
Tristan:
-like Active Record or an ORM or something?
Lukas:
Like every ORM is just...at first it feels good. And then you just like tear out your hair-
Tristan:
-you get into the edge cases and it's terrible.
Lukas:
Yeah. Why hasn't someone built a higher-level construct on top?
Tristan:
I totally agree with that. We didn't talk about this pre-taping, so I'm so excited to be talking about this.
That is generally how standards progress. There's this base thing, and then people are like, "Okay, that's good enough for what it does." And then they're like, "Well, let's build a higher level of abstraction" and it will solve some of the...this is like JavaScript and et cetera.
We talk about this internally as "Who's going to build the React for SQL?" And I'm very interested in that question. I believe that will happen over the next five years. I think that there's too much money floating around and incentive to want...the way that dbt works, it's very similar to Ruby on Rails back in the day, with .erb files. There's templating.
But we didn't build React. And I think that either we will, or somebody will. If somebody builds it and it's not us, then I'm very happy to just have it be another choice of language that you can put into your dbt DAGs.
I agree. I think that we've — using templates — we've made a lot of progress in what you can idiomatically express in SQL, but it's still not as pleasant of an experience as just writing other languages.
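Note: The templating Tristan describes can be illustrated with a small sketch. dbt models are SQL files that use Jinja expressions such as `{{ ref('...') }}`; the code below does not use Jinja or dbt itself. It is a toy renderer built on Python's standard library, and the model name `stg_orders`, the relation name, and the helper functions `render` and `dependencies` are all hypothetical.

```python
import re

# Hypothetical mapping from model names to fully qualified warehouse relations.
RELATIONS = {
    "stg_orders": "analytics.staging.stg_orders",
}

# A model in the style of a dbt file: plain SQL plus a templated reference.
MODEL_SQL = """
select customer_id, count(*) as order_count
from {{ ref('stg_orders') }}
group by customer_id
""".strip()

# Matches {{ ref('name') }} and captures the model name.
REF_PATTERN = re.compile(r"\{\{\s*ref\('([^']+)'\)\s*\}\}")


def render(sql: str, relations: dict[str, str]) -> str:
    """Replace each {{ ref('name') }} with the relation it maps to."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in relations:
            raise KeyError(f"unknown model: {name}")
        return relations[name]
    return REF_PATTERN.sub(substitute, sql)


def dependencies(sql: str) -> list[str]:
    """List the models this SQL refers to; these become edges in the DAG."""
    return REF_PATTERN.findall(sql)


rendered = render(MODEL_SQL, RELATIONS)
print(rendered)
print(dependencies(MODEL_SQL))
```

The same pattern that resolves a reference also reveals which models depend on which, which is roughly why a templated reference is enough to build a dependency graph.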
Lukas:
Are you working on this?
Tristan:
No, not today.
The one person who is working on this in public is Lloyd Tabb, the founder of Looker. This has been Lloyd's passion project for a little while. It's called Malloy. You can find it on their public GitHub.
It's very interesting. It's not exactly how I would build it, but also I recently got a demo from him and there's some real magical capabilities there that I had never even thought to want out of my SQL-like language.
I don't know. I would like this as much as you or anybody else.
Lukas:
Very cool.

Bridging the gap between the ML and data worlds

Lukas:
Well, we always end with two questions and I want to modify them for you, I guess. We usually ask what's an underrated topic in ML, but maybe I'll ask you what's an underrated topic in data.
I mean, we've covered some of them, but what-
Tristan:
-can I answer in ML?
Lukas:
Oh, sure, please. Yes. Absolutely.
Tristan:
I think that ML has a persona problem and that there's been some reckoning with this. There's some "make ML more accessible" tooling. In general, I don't feel like that has been spot on. It's clear that the tools for the big kids are really where everybody's focused today.
There are some...there's a company called Continual, there's a couple other companies in the space of trying to bring ML to the types of workflows that people in my world use. And I would desperately love that.
I'm very familiar with what is going on inside of an ML model, but it is also clearly not exposed to me in a way that is idiomatic for me to participate in this workflow.
So I'm excited about that gap being bridged.
Lukas:
Interesting.
Like making it simpler to just make an ML model from a set of data?
Tristan:
Yeah. What Continual is doing is they're actually plugging into the metadata inside of dbt. And you can actually add some additional metadata properties that declare certain fields inside of a dbt model as being your features and the success criteria.
And then Continual kind of plugs in with its own AutoML process and trains a model and dumps it back into your data warehouse for you.
Lukas:
Wow. Do you actually use that? Or what-
Tristan:
-I don't, they're super early.
I would like to get my hands on it and use it myself. They have customers though.
Lukas:
Cool, awesome. Continual. I'll check them out.
The final question is usually "What's hard about getting ML working in production?" and people usually answer that question...we should actually do a graph of this, but I think that the most common answer is usually the data pipeline feeding into the ML model.
Within that, when you see companies trying to set up a working data pipeline, what's the long pole? What's the place where people usually get stuck?
Tristan:
Debugging data pipelines is very hard. It's not very hard for people who live in this world all the time, every day, but it's still effort and time-intensive for us.
I think that the whole world of observability, reliability, all of this stuff...my answer to ML in production is kind of...I don't totally understand why...so, dbt runs on Spark, dbt runs on Databricks.
Both Spark and Databricks have SQL runtimes, so we can plug directly into them. And yet, that is not where most of our users are today.
There's fundamentally not that much difference between doing feature engineering and doing what we would call data transformation. You're doing the same damn stuff.
I think that the answer to why these two groups of humans do not consolidate or collaborate more effectively is, again, the same reason that it goes in reverse. Most ML people, I think, don't think in SQL.
I'm excited because more and more of these data platforms are exposing Python remotely.
dbt does not do any local execution at all. We ship SQL to a data warehouse, which executes it. And the funny thing is that that type of interactive work doesn't exist in the Python ecosystem as much. Mostly it's like, you're on a machine. You're running it there.
Databricks has a notebook API that we can plug into to actually run PySpark code. Snowflake has a new thing called Snowpark where you can execute Python.
I think that we are going to be working from our end to close this language gap that exists in practitioners today.
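Note: Tristan's point that feature engineering and data transformation are the same work, and that dbt ships SQL to a database rather than executing locally, can be sketched in a few lines. This is an illustration only: an in-memory SQLite database stands in for a data warehouse, and the `orders` table, its rows, and the `customer_features` table are made up for the example.

```python
import sqlite3

# An in-memory SQLite database stands in for the data warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table orders (customer_id integer, amount real);
insert into orders values (1, 20.0), (1, 30.0), (2, 5.0);
""")

# The "transformation" is just SQL. The client sends the statement and the
# database does the work; no rows are processed in Python.
FEATURES_SQL = """
create table customer_features as
select
    customer_id,
    count(*) as order_count,
    avg(amount) as avg_order_value
from orders
group by customer_id
"""
conn.execute(FEATURES_SQL)

# The resulting table could serve as a reporting model or as model features.
features = conn.execute(
    "select customer_id, order_count, avg_order_value "
    "from customer_features order by customer_id"
).fetchall()
print(features)
```

An analyst might call `customer_features` a transformed model and an ML engineer might call it a feature table; the SQL is the same either way.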

Outro

Lukas:
Cool. Awesome.
Well, thanks for your time. This was super fun and I learned a lot, so I have a feeling our audience will also learn a lot. Thanks.
Tristan:
Thank you. It's been a lot of fun.
Lukas:
If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description where you can find links to all the papers that are mentioned, supplemental material and a transcription that we work really hard to produce. So check it out.