Building intuitive data visualization tools with Vega-Lite's Dominik Moritz

Dominik shares the story and principles behind Vega and Vega-Lite, and explains how visualization and machine learning help each other.
Angelica Pan

Listen on these platforms

Apple Podcasts Spotify Google Podcasts YouTube Soundcloud

Guest Bio

Dominik is a co-author of Vega-Lite, a high-level visualization grammar for building interactive plots. He's also a professor at the Human-Computer Interaction Institute at Carnegie Mellon University and an ML researcher at Apple.

Connect with Dominik

Show Notes

Topics covered

0:00 Sneak peek, intro
1:15 What is Vega-Lite?
5:39 The grammar of graphics
9:00 Using visualizations creatively
11:36 Vega vs Vega-Lite
16:03 ggplot2 and machine learning
18:39 Voyager and the challenges of scale
24:54 Model explainability and visualizations
31:24 Underrated topics: constraints and visualization theory
34:38 The challenge of metrics in deployment
36:54 In between aggregate statistics and individual examples

Links Discussed

Transcript

Note: Transcriptions are provided by a third-party service, and may contain some inaccuracies. Please submit any corrections to angelica@wandb.com. Thank you!
Dominik:
When we designed Vega-Lite, we built it not just as a language that can be authored by people but, actually, as a language where we can automatically generate visualizations. And I think that's also what distinguishes it from other languages, such as D3 or ggplot2 in R. Because we're in JSON, it is very easy to programmatically generate visualizations.
Lukas:
You're listening to Gradient Dissent. Today, we have Lavanya with us, who has been watching all the interviews in the background, but we wanted to get her in here asking questions. And we're talking to Dominik, who's one of the authors of Vega-Lite. We got excited to talk to him because we've been using Vega in our product, which we recently released, and it solves this huge problem for us: we want to let our users have complete control over the graphs in a language that makes sense. And then we discovered Vega, and it was a perfect solution to the problem that we had. And then we talked to Dominik, and he had so many interesting ideas about the way machine learning should be visualized, and we didn't even realize he came from a visualization background. So, we have a ton of questions to ask him today.
Lavanya:
Super excited. I can't wait.
Lukas:
I think the main thing or... You've done a bunch of impressive stuff, but the thing that is most exciting for us is that you were one of the authors of Vega-Lite. And so, I kind of thought maybe the best place to start, for some people who don't even know what Vega is, is to just sort of describe what Vega is and what the goals are, and then how Vega-Lite works within that context.
Dominik:
Yeah. So, the way Vega came to be is that my advisor, Jeff Heer... So, Jeff, together with his graduate students Arvind and Ham, created a declarative way to describe interactions, building on ideas from functional reactive programming, which is a concept that's been around for quite a while. They adopted this concept for visualizations to describe not just the visual encodings, but also the interactions, fully declaratively. And so, that then became... I think that was Vega version 2 at that point. Vega, at that point, was still fairly low-level, in that you had to describe all the details of the visual encoding as well as the axes and legends and potential other configurations. So, around the same time, my colleague Ham, who also worked on the first version of Vega... on this reactive version of Vega, was working on a visualization recommendation browser, at that point it was called Voyager, and I helped him with it. We needed a visualization language to do recommendation in. And so, Ham and Jeff talked about the need for a high-level visualization language to do recommendation in, where you don't have to specify all the details, but really only what's essential, which is this mapping from data to visual properties. And so, I think they talked at the VIS Conference in Paris, and on the flight back, Jeff hacked the first version of it, and I think that code is still what we're building on today.
Lukas:
That's awesome. Sorry. Before you go too far down this path, I'm going to ask all the dumb questions that I feel embarrassed to ask. I mean, I feel like I've heard declarative language for visualization many times, and I always, kind of, nod but what does declarative really mean? What would be the non-declarative way to describe a visualization?
Dominik:
Yeah, the biggest distinction between declarative and, on the other side, imperative, is that in a declarative language, you describe what you want, not how you want an algorithm to execute steps to get to where you want to go. Good examples of that are HTML and CSS, where you describe what the layout of the page should be, but you don't tell the layout engine to move something by a couple of pixels and then move again by a couple of pixels. Another good example of a declarative language is SQL, the database query language that people use to query databases for analytics or for, let's say, a banking system. In these declarative queries, you describe what you want the result to be. So you say, "I want from this table the tuples, or the rows, that have these properties." And you don't describe how you're going to get that. That's as opposed to an imperative algorithm, where you would have to write the search yourself; you would need to know how the data is stored, in what format, whether it's maybe even distributed on multiple machines or not. In a declarative language, you only describe what you want. And then that could run on a small embedded database, or it could run on a cluster of a thousand machines, and you shouldn't have to worry. And so, for visualization, that means you shouldn't have to worry about how the visualization is drawn, how you draw a pixel here, a rectangle here, a line there. You just want to say, "Make a chart that encodes these variables."
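As a concrete sketch of that distinction, here is the same query written both ways in Python; the data and field names are made up, and pandas stands in for a SQL engine on the declarative side:

```python
import pandas as pd

# Hypothetical data: a few houses with prices.
houses = [
    {"id": 1, "price": 450_000},
    {"id": 2, "price": 750_000},
    {"id": 3, "price": 620_000},
]

# Imperative: spell out *how* to find the matching rows, step by step.
expensive = []
for house in houses:
    if house["price"] > 500_000:
        expensive.append(house)

# Declarative: describe *what* you want; the engine decides how to compute it
# (this is what SQL does: SELECT * FROM houses WHERE price > 500000).
expensive_df = pd.DataFrame(houses).query("price > 500000")
```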
Lukas:
So, I guess, how declarative is it? And I have used Vega a fair amount, but I think people that are listening or watching may not have, right? So, I suppose the most declarative thing might be, sort of, "give me an insight about these variables" or just "compare these variables", right? But that might be unsatisfying. At what level are we describing the semantics of what we're doing versus saying, "Hey, give me these three pixels here"? Do you say exactly the type of plot that you want or is that inferred? How does all that... How do you think about all of that?
Dominik:
Yeah. We built on this concept called the grammar of graphics. And that is a really cool concept that a lot of languages, even D3, have built on. And the core idea is that a visualization is not just a particular type, so it's not just a horizontal bar chart, or a bubble chart, or a radar plot. Instead, a visualization is described as a combination of basic building blocks, kind of like in language, where we have words that we combine using rules, which is grammar. And so, the words in the grammar of graphics are two things. One is marks and the other one is visual encodings. A mark is, for instance, a bar or a line or a point, and an encoding is a mapping from data properties to visual properties of that mark. So, for instance, a bar chart is a bar mark that maps some category to X and some continuous variable to Y. And that's how you describe a bar chart. And now, I think what's cool about this is, if you want to change from a horizontal bar chart to a vertical bar chart, or a column chart, you don't have to change the type. You just swap the channels in the encoding.
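To make that concrete, here is a rough sketch of the same idea in Altair, the Python API for Vega-Lite that comes up later in the conversation; the data and field names are invented:

```python
import altair as alt
import pandas as pd

df = pd.DataFrame({"category": ["a", "b", "c"], "value": [4, 7, 2]})

# A bar chart is a bar mark plus two encodings:
# category -> x (nominal), value -> y (quantitative).
vertical = alt.Chart(df).mark_bar().encode(x="category:N", y="value:Q")

# Swapping the channels, not the chart "type", gives a horizontal bar chart.
horizontal = alt.Chart(df).mark_bar().encode(x="value:Q", y="category:N")

# Either chart compiles down to a Vega-Lite JSON specification.
print(vertical.to_json())
```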
Lavanya:
I have a question. So, we see so many really messed up charts that people make because people get too excited, especially when they work with a really powerful visualization tool. And I feel like you've spent so much of your life designing really good grammars for visualizations and designing a lot of really cool plots. So, what's your recommendation for people... for best practices for designing these visualizations?
Dominik:
I think it is actually making mistakes. It is trying it out and seeing how difficult or how easy it is to read data in a particular chart. But before you actually go out and publish a chart and show it to the world, maybe think about, "What can I remove from this chart?" I think a visualization is really showing what you want to show when it's showing the essential of the data. Very important in any visualization design is following two basic principles, and these are often called effectiveness and expressiveness. This goes back to some work from Jock D. Mackinlay, who actually developed an automated system to follow these rules. So, these two rules are kind of oddly named, but essentially what they boil down to is, first, expressiveness. It means that a visualization should show all the facts in the data, but not more than that. So, what that also means is that a visualization shouldn't imply something about the data that doesn't exist in the data. And then effectiveness means to make a visualization that is as easily perceivable as possible, and one rule that you can apply there is to use the most effective channels first. And the most effective channels are X and Y... they're like length and position. They're the best. And then afterward, it's color and size and some other things. So, that's why bar charts and line charts are so popular and so effective, because they are using those very effective channels first. But sometimes you have to go beyond effectiveness. Yeah.
Lukas:
I always wonder... Is there any room for fun or novelty in a good visualization?
Dominik:
Yeah, that's a good question. I like to, actually, think back to a paper from Tukey and Wilk from the sixties, one of the famous papers about exploratory data analysis and statistics. And they talked about the relationship of statistics to visualization. The paper is full of amazing quotes, and it's kind of amazing to read this today because almost everything is still true today. But one of the things they say there is that it's not necessarily important to invent new visualizations, but to think about how we can take the visualizations that we have, or the essentials of those visualizations, and combine them in new ways to fit new opportunities. And so, I think there is a lot of creativity in taking visualizations, even the simple ones, pie charts, line charts, scatter plots, and combining them in meaningful ways. Also, pre-transforming the data in meaningful ways. And so, there can be a lot of creativity in there. Yeah.
Lukas:
Do you have a favorite visualization that you think is maybe underused or that you'd like to see more of?
Dominik:
I think slope charts are kind of amazing.
Lukas:
What's a slope chart?
Dominik:
What's a slope chart? So, naming charts, by the way, is an interesting concept. If you think about the grammar, the concept of naming charts is kind of odd. I'm going to reveal a secret, it's something I want to write: a system that automatically names a chart, or the other way around, you give it a name and it tells you what the specification is. Okay. But going back to the slope chart: imagine you have a categorical variable with two values, let's say two years, and you have data for those years. Now, what you could do is plot that as a scatter plot. So, on X you have the years and on Y you have some numerical measure, and you draw the different categories that exist in both years as colored points. It's actually hard to see trends between those years that way. But if instead you just draw a line between them, trends, or changes, just jump out at you. And that, I think, is great. So, wherever you have categorical data in this bipartite setup, just drawing a line instead of drawing points is great.
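A minimal sketch of that idea in Altair, with invented data: the encodings are the same as for a colored scatter plot, but a line mark connects the two years for each category, so changes read as slopes:

```python
import altair as alt
import pandas as pd

df = pd.DataFrame({
    "year":     ["2019", "2020", "2019", "2020"],
    "category": ["A", "A", "B", "B"],
    "value":    [10, 14, 12, 9],
})

slope = alt.Chart(df).mark_line(point=True).encode(
    x="year:O",         # the two years on an ordinal axis
    y="value:Q",        # the numerical measure
    color="category:N", # one line per category
)
```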
Lukas:
It's called a slope chart?
Dominik:
That's one name, one in the Vega-Lite gallery.
Lukas:
Oh yeah. We'll have to link to that. So, I guess, how do you think about the line between Vega-Lite and Vega? Is it always super clear what belongs where? Because they're both, in a sense, a declarative language for charts, right? One's sort of higher level and one's lower level. So, where do you draw the line?
Dominik:
So, maybe before we go there, one important thing to keep in mind is that Vega and Vega-Lite added something to the grammar of graphics. Vega-Lite in particular added, for instance, support for interactions. That's something my colleagues Ham, Arvind, and I worked on together, where we added some other kinds of words or language constructs that you can use to make charts interactive, and we also added composition. And so, these are high-level concepts, which then actually compile from Vega-Lite to the low-level Vega, in this case into layouts and signals, which are these functional reactive concepts that Vega has. And so, I think that helps me also, a little bit, understand the difference of where what goes.
Lukas:
And what is, sorry, composition? Before we move past that-
Dominik:
Composition is being able to layer charts or concatenate charts. We also have a concept called repeat, which is a convenient concatenation, and then faceting. Faceting, another word for it is trellis, is a way to break down a chart by a categorical variable. So, for instance, if you have data for different countries, you can then draw one histogram for each country, or one scatter plot for each country. Faceted charts are also great. Faceting is often a very powerful way to show your data if you have an additional categorical variable.
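As a rough illustration, here is what faceting might look like in Altair (hypothetical data and field names): one small histogram per country, broken out via the column encoding channel:

```python
import altair as alt
import pandas as pd

df = pd.DataFrame({
    "country": ["US", "US", "DE", "DE", "JP", "JP"],
    "value":   [3.2, 4.1, 2.8, 3.5, 5.0, 4.4],
})

faceted = alt.Chart(df).mark_bar().encode(
    x=alt.X("value:Q", bin=True),  # binned histogram of the measure
    y="count()",
    column="country:N",            # one panel per country
)
```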
Lukas:
So, is this where you make, sorry, a whole array or a matrix of charts? That's what I'm picturing for a faceted chart, a grid of charts. I see. Okay. Cool.
Dominik:
Yeah. That's faceting. Okay. So, you asked what's composition, and then we talked about, oh, Vega, Vega-Lite. I think the biggest difference really between Vega and Vega-Lite is the abstraction level. Vega-Lite compiles to Vega. So, anything that's possible in Vega-Lite is also possible in Vega because of that. But it requires about one or two orders of magnitude more code in most cases. So, that's one big difference. And how do we achieve that? Well, one, we have higher-level mark types in Vega-Lite. So, for instance, Vega only has a rectangle mark (and some others), but Vega-Lite actually has bars as a concept. And if you have that, you can have some defaults associated with that high-level mark type, which you would otherwise have to manually specify in Vega. In Vega-Lite, you don't have to specify them because they get instantiated and picked automatically. And the other is sensible defaults, or smart defaults. Essentially, you don't have to specify an axis; we'll make one for you if you use the X encoding. If you use color, we'll make a legend for you. If you use size, we'll make a legend for you. If you use faceting, we'll make a header for you, which is just kind of an axis. In Vega, you have to specify all the details of those marks or those elements. You can still override the defaults in Vega-Lite, but by default, we'll do something. And that's really what Vega-Lite is: it's a high-level language and a compiler that compiles from a high-level specification to a low-level Vega specification. Right now, we don't have a way to easily extend the high-level concepts we have in Vega, I'm sorry, Vega-Lite. We do have a little bit of an extension mechanism where you can add mark macros. So, for instance, box plots in Vega-Lite are just a macro, which actually compiles to a rectangle, a line, and the little ticks at the end. And there's a bunch of other things that are just macros. And so, one could actually build a language on top of Vega-Lite. And people have done that. Altair, for instance, is a Python wrapper, a Python syntax, a Python API for generating Vega-Lite JSON specifications. And there are other ones in Elm and R, and then somebody made one in Rust and there's one in JavaScript. Oh, and Julia, there's one in Julia as well. That's a really good one.
Lukas:
I guess the R comment made me wonder if you have any comments on ggplot2. I feel like that's often like a beloved plotting library. Was that an inspiration for Vega at all? Or did you have reactions to it?
Dominik:
So, ggplot2 came out a long time before Vega and Vega-Lite, and it also builds on the grammar of graphics. At the time, it really was the prime example of an implementation of the grammar of graphics in any programming language. It uses slightly different terminology from Vega and Vega-Lite. ggplot has definitely been a great inspiration. And we, what do I mean when I say we? So, Ham, Arvind, Jeff, and I have talked to Hadley Wickham before. Yeah, big fans of it. We actually considered using it for Voyager, but because Voyager was easier to build as a web application, interfacing from a web application to R would have been a lot more overhead than building on a web-based visualization language.
Lukas:
Totally. Maybe switching gears a little bit. One thing I thought was interesting about your background and interests is that it's also machine learning. And I thought that was pretty interesting and cool. I wonder if machine learning has informed your thoughts about... Well, first, if it's informed your thoughts about visualization at all, and then I'd love to hear if you have suggestions for the kinds of visualizations that you think are helpful in the machine learning process.
Dominik:
Yeah. I think visualization and machine learning are really good fits for each other. And so, I can think of two things that we can talk about: where visualization is useful for machine learning, and where machine learning is useful for visualization. Maybe let's start with why visualization for machine learning. I think one of the most important things in machine learning, and you can disagree with me there if you want to, is data, if it's not the most important thing.
Lukas:
I think few people would disagree.
Dominik:
Okay. So, because data is so... Okay, we can agree that data is essential to machine learning. If you have bad data, your model is not going to do anything good. You can still create a bad model with good data, but good data is essential for a good model. And so, understanding the data that becomes part of your model, or gets used to train the model, is really essential. And I think visualization is a really powerful tool there to understand what's in the data. Especially in conjunction with more formal statistics, but formal statistics are only good when you know what you're really looking for. When you're still trying to look around, what's in the data, what might be problems with the data, that's when visualization really shines.
Lukas:
And you actually built a library to help with the exploration of data, right?
Dominik:
Yeah. So, Voyager, and then Voyager 2, and some other follow-up work from there, was, or is, a visualization recommendation browser. The idea there is that rather than having to manually create all the visualizations, and still go through this process of deciding which encodings do I want to use and which mark type do I want to use, it lets you browse recommendations and still be able to steer the recommendations. So, the recommendations shouldn't go too far from where you are. They should still be close to what you've looked at before, but they should take away some of the tedium of having to manually specify all the charts. And recommendation is great for two things. One is, yeah, because it makes visualization less tedious, and also, it can encourage best practices. For instance, good statistical practice, good data analysis practice, is to look at the univariate summaries when you start looking at a dataset. So, what are the distributions of each of my fields, each of my dimensions? And to do that before looking into correlations between dimensions. And this is often difficult: you start looking at one field and you're like, "Oh, there's something interesting here. Now, I wonder how this correlates with these other fields." And then you're off on a tangent. And so, by forcing you, or by offering you a gallery of all the dimensions and all the univariate summaries first, it makes it a lot easier to follow that best practice of looking at all the univariate summaries first.
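A simplified sketch of that "univariate summaries first" idea, generating one histogram per field of a hypothetical DataFrame with Altair; this only hints at what Voyager does, without the recommendation logic:

```python
import altair as alt
import pandas as pd

df = pd.DataFrame({
    "age":    [23, 35, 31, 52, 46, 29],
    "income": [48_000, 72_000, 61_000, 90_000, 83_000, 55_000],
    "city":   ["A", "B", "A", "C", "B", "A"],
})

# One univariate summary per field: binned histograms for numeric fields,
# count bar charts for categorical ones.
charts = []
for field in df.columns:
    if pd.api.types.is_numeric_dtype(df[field]):
        enc_x = alt.X(field, type="quantitative", bin=True)
    else:
        enc_x = alt.X(field, type="nominal")
    charts.append(alt.Chart(df).mark_bar().encode(x=enc_x, y="count()"))

# Browse all the distributions before moving on to correlations.
gallery = alt.vconcat(*charts)
```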
Lavanya:
Can you do this at scale? Let's say you scale it to millions of rows. How do you even begin to find patterns in your dataset if it is that big? And how does the software scale, too?
Dominik:
Yeah. So, the software is a research prototype built as a browser application where all the data has to fit into the browser. So, it currently does not scale. But the interesting thing is that the number of rows shouldn't really matter too much, as long as we can visualize it. We could probably have a whole episode about that.
Lukas:
Wait, the number of rows shouldn't matter, in what sense? It seems like it would make it more complicated to visualize. I mean, it doesn't necessarily make the visualization itself harder, but it seems like actually scanning through all of them might start to get impractical.
Dominik:
Yeah. I guess most... There are two issues. One is a computational issue of just transforming that data and then rendering it. And then the other is, "Can I represent the data in a way that is not overwhelming to the viewer?" But assuming that we can do that for a couple of thousands of data points, or tens of thousands, or hundreds of thousands of data points, if you have many dimensions, the recommendation aspect gets a lot more difficult because now you have to think about, "Okay, how do I represent all these dimensions? How do I let users browse them? How do I show correlations between dimensions?" Correlations between three dimensions get impractical very quickly. Yeah. So, that's visualization for machine learning, and then going the other way around, machine learning for visualization is something that I've become pretty interested in. When we designed Vega-Lite, we built it not just as a language that can be authored by people, but actually as a language where we can automatically generate visualizations. And I think that's also what distinguishes it from other languages, such as D3 or ggplot2 in R. Because we're in JSON, it is very easy to programmatically generate visualizations, and then we built a recommendation system on top of it. So, when we have a visualization language that is declarative and in a format that is easily generatable, we can think about ways to automatically generate visualizations from programs or models. One of those models is a model called Draco, which my colleagues and I have been working on together, where we encoded design best practices as a formal model, and then we can automatically apply those best practices to recommend visualizations. And that can go beyond what I've talked about with Voyager, where we would recommend this gallery of visualizations, because you can consider a lot more aspects of both the data and the visualization, the tasks that the user wants to do, the context that they're in, or the device that they're looking at.
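Because a Vega-Lite spec is just JSON, a program (or a recommender in the spirit of Voyager or Draco) can emit specifications directly. A tiny sketch of that idea, with a made-up data URL and field names:

```python
import json

def bar_spec(data_url, x_field, y_field):
    # Build a Vega-Lite specification as a plain Python dict. A recommender
    # could enumerate many (mark, encoding) combinations like this one and
    # rank them by design rules.
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
        "data": {"url": data_url},
        "mark": "bar",
        "encoding": {
            "x": {"field": x_field, "type": "nominal"},
            "y": {"field": y_field, "type": "quantitative"},
        },
    }

print(json.dumps(bar_spec("cars.json", "Origin", "Horsepower"), indent=2))
```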
Lukas:
It's funny. I keep wanting to ask... actually, I don't know how to fit this into the flow, but I think one of the issues with visualizing data in machine learning, especially with a lot of the deep learning folks that we work with, is that the data often... It's not like the sort of three independent variables and a dependent variable in a stats class. It's more like the data is an image, or the data is an audio file. And so, I feel like even just visualizing the distributions gets unwieldy. It's also a little unclear what you would do with that. So, do you have thoughts about visualizing things where there's a higher-order structure, like an image or a video or an audio file or something like that?
Dominik:
That gets tricky because a visualization is two-dimensional, or two-point-something dimensional. Maybe we can use color and size, and every encoding channel, essentially, can represent another dimension, but after four or five or so, it becomes overwhelming. So, if you have a dataset with thousands of dimensions, I think the way to do it now is to use dimensionality reduction methods, so t-SNE, UMAP, PCA, to reduce the number of dimensions to the essential ones in some way. Or create some kind of domain-specific visualization. In a way, an image is a domain-specific visualization that maps your long vector of numbers to a matrix of color encodings.
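A minimal sketch of that workflow, assuming a synthetic high-dimensional matrix: reduce to two dimensions with PCA (t-SNE or UMAP would slot in the same way) and hand the result to an ordinary scatter plot:

```python
import numpy as np
import pandas as pd
import altair as alt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))         # hypothetical 50-dimensional data
labels = rng.integers(0, 3, size=500)  # hypothetical class labels

# Reduce 50 dimensions down to 2 so the data fits a 2D visual encoding.
coords = PCA(n_components=2).fit_transform(X)

df = pd.DataFrame({"pc1": coords[:, 0], "pc2": coords[:, 1], "label": labels})
scatter = alt.Chart(df).mark_point().encode(x="pc1:Q", y="pc2:Q", color="label:N")
```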
Lavanya:
So, what do you think about... All of my Twitter feed is talking about model explainability and how that's still a very unsolved problem. So, what do you think are techniques that everyone should know, and how do you think the field is progressing? Do you think we can have interpretable models in five years, anytime soon? Are neural networks never going to be explainable?
Dominik:
I don't know, but that's a good question that I think many people are trying to answer. There's been a trade-off where people often made simpler models because they are more explainable, and the more complex the model gets, the harder it gets to explain. So, sometimes there are methods, similar to dimensionality reduction, I guess, to reduce your complex model to a simpler model, which you can then explain. But none of those methods are fully satisfying. One of the techniques I've seen is using more inherently explainable models that are still complex. So, for instance, a good example of that is GAMs, generalized additive models, which are linear models of functions applied to every dimension.
Lukas:
Well, why is that more explainable?
Dominik:
Why is it more explainable? Because you can apply some techniques where you can understand, for instance, the function that gets applied to each dimension individually. Or you can also look at how those dimensions, with the functions applied to them, get combined in a linear function, which is a lot easier to understand than some nonlinear combination of many dimensions.
Lukas:
But wouldn't you want to have the different dimensions interact with each other, or allow for that? I guess, maybe taking a step back, can you kind of make this a little more concrete for someone who hasn't seen this before? What kind of functions would you be imagining, and how would they be applied?
Dominik:
For instance, say you want to predict a quantitative variable, some number, let's take the standard example, the housing price, the price of a house. You want to do that based on the available dimensions. Let's say the size in square feet, the number of bathrooms, the number of bedrooms, or the number of floors. And so, now what you can do is take a linear combination of the dimensions to get the price. So, you could say, multiply the square feet by, I don't know, 10, the number of floors by 20, some other dimension by 5, and then get a number out that is the housing price. That would be a simple linear model where you, essentially, apply a weight to every individual dimension. Now, what a generalized additive model does is apply a nonlinear function to each dimension individually. It can be a log function or anything else, it can be as complicated as we want, but because it's a function of one dimension, you can actually visualize it very easily, just by looking at the value on the x-axis and the value after applying the function on the y-axis. And so, if you then want to know the price of a particular house, or the predicted price of a house, in each of these per-dimension charts you just look up, for my value, what's the corresponding value that goes into the sum, and then sum them up.
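A toy sketch of that difference, with invented weights and shape functions for the housing example: a linear model applies one weight per dimension, a GAM applies one arbitrary one-dimensional function per dimension, and in both cases the prediction is a sum of per-dimension contributions that can be read (or plotted) on their own:

```python
import numpy as np

# Linear model: one made-up weight per dimension.
def linear_price(sqft, bedrooms, floors):
    return 10 * sqft + 5_000 * bedrooms + 20_000 * floors

# GAM: one made-up (possibly nonlinear) shape function per dimension.
def f_sqft(x):     return 120 * np.sqrt(x)      # diminishing returns on size
def f_bedrooms(x): return 6_000 * x             # roughly linear
def f_floors(x):   return 15_000 * np.log1p(x)  # nonlinear, but still 1D

def gam_price(sqft, bedrooms, floors):
    contributions = {
        "sqft": f_sqft(sqft),
        "bedrooms": f_bedrooms(bedrooms),
        "floors": f_floors(floors),
    }
    # The prediction is a plain sum, so each term explains its own share.
    return sum(contributions.values()), contributions

total, parts = gam_price(sqft=2_000, bedrooms=3, floors=2)
print(total, parts)
```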
Lukas:
I see. So, you could see exactly how much each thing contributed to your final score or your final prediction.
Dominik:
Mm-hmm (affirmative). Yeah. And a very good example, if you want to actually play with that and try it out, is this system called Gamut, which is a research project at Microsoft Research, where they built a system for doing exactly this task of understanding a model that is one of those GAM models. It lets you, for instance, compare the predictions for two houses, understand how much each dimension contributes to the predicted price, and also makes it very easy to look at the general model, the whole model, in just one view. And yes, you don't have the ability to have multiple dimensions affect your output together, but still, these models work fairly well and are a lot more interpretable than a model that incorporates many dimensions in every single prediction.
Lukas:
Do you have thoughts on visualizations to help with understanding what's going on in much more complicated models? Like, say, a convolutional network or a fancier type of network?
Dominik:
Yeah. I think visualizations can actually help at different points. And I think visualizations are only as powerful, or only as useful, as the task that you designed them for. So, I think in general saying, "Oh, can you visualize this thing?" is impossible without a task. It has to be, "Can you visualize X for Y?" So, for instance, one could visualize a model for the purpose of understanding the architecture. When you, for instance, have a complex network with many layers and many different complex functions at each layer, you might want to visualize it to see what functions are being applied, what parameters are being used, and how big each layer is. And there's a couple of visualizations for that. I think one of the most popular ones is probably the one in TensorBoard, which my colleague Ham actually started when he was interning at Google.
Lavanya:
Did you mean the parallel coordinates plot maybe, or which visualization in TensorBoard?
Dominik:
In TensorBoard, it's the visualization of the graph, the data flow graph. There are, kind of, two views in TensorBoard. There's the one where you look at your model outputs or your metrics, and there's the one where you look at the model architecture, and I'm talking about the model architecture one. So, that can help you to, for instance, debug what's happening, but it doesn't help you at all to explain a particular prediction. For that, you might use a different visualization, like feature visualization, that lets you inspect different layers and what the attribution is in different layers.
Lukas:
Cool. We always end with two questions. I want to make sure we have time for them. And I think we, maybe, should modify them slightly to focus on visualization. So, normally we ask, "What's a subfield of machine learning that people should pay more attention to?", which I'm curious about your thoughts on, but maybe I'd also ask about a subfield of visualization that you think doesn't get as much attention as it deserves.
Dominik:
I think for machine learning, I'm very excited that there's a lot more attention to understanding what's happening in these models. I'm also a huge fan of more classical AI methods, which I guess is not machine learning anymore. But yeah, I'm very excited about constraint solvers and using those classical methods.
Lukas:
Whoa, maybe we have not had that answer before, constraints? I thought you were going to say SVMs or something, not constraint solvers.
Dominik:
No. Classical, like AI not even learn-
Lukas:
I thought they used the ML to do constraint satisfaction these days. I don't know-
Dominik:
They can use ML now for learning indexes in databases. I think these classical methods are exciting because they allow you to describe a kind of model, a concept, a theory, in a very formal way, and then automatically apply it. It's very declarative problem solving: you describe the problems and the solver solves them. And these solvers are amazingly fast today, so I'm pretty excited about them. In visualization, because it's a science, we're trying to explain what makes a visualization good. And there's been a lot of work on high-level design of good visualizations. I talked about these principles of effectiveness and expressiveness earlier. And there are now systems to automatically apply them, and there are design best practices, and there are books, and people are teaching those in classes, and so on. And then on a very low level, a perceptual level, there's some understanding of how we perceive colors and shapes and the gestalt of shapes, and how we see patterns. But we don't have a good understanding of how those low-level insights on perception actually translate to those higher-level design practices. And I think the two sides are slowly inching towards each other, but they're still far from each other right now. And what I'm excited about is... It's kind of like asking how general relativity and quantum mechanics combine. We need a unified theory there of how the two things relate. We know how things behave at the high level, which is kind of like relativity, and we know how the small, quark-like things behave, but when you combine them, it doesn't work. And so there's, kind of, this crisis that physics has had for a while, in visualization as well.
Lukas:
Well, what a great answer. That's so evocative, I want to talk about that for another hour. Normally, we end with asking people, really on behalf of our audience, what the biggest challenges are that you see in taking ML projects from, sort of, conception to deployment. Do you have thoughts there?
Dominik:
I think one of the trickiest things in deploying machine learning is metrics. Coming up with good, meaningful metrics that you're optimizing... To me, machine learning is optimizing a function, but what is that function? And how do I make sure that it's actually a meaningful function and, also, that it's going to be meaningful in the future? Because we know from many examples that if you're over-optimizing a metric, that metric becomes meaningless. So, how do you ensure that a metric is meaningful right now and will be meaningful in the future, and that it's actually tracking what you care about? It's a difficult question. And I don't know whether there's going to be one answer. I don't think so.
Lavanya:
Train a model on a bunch of different optimization functions and figure out which one it is, or something. But I kind of want to specifically ask about what the biggest challenges are around machine learning interpretation. And also, when you're training models, using visualizations to debug those models. Do you have any thoughts around that, maybe?
Dominik:
As I said earlier, I think data is essential for machine learning, and so understanding data is crucial. And I don't know whether the methods and tools we have for general data analysis, how much they might have to be adjusted for machine learning. For instance, Tableau or Voyager, all these tools that are designed for exploratory analysis of tabular data, where do they fall short when it comes to machine learning? Because, as you were pointing out earlier, machine learning often has this high-dimensional data, images and sound and so on. Can we design other representations? I don't even want to say visualizations, but just representations that help us see patterns in that data, meaningful patterns, meaningful for the task of training a model or understanding the model. That, I think, is going to be an interesting question for visualization tool designers who'd like to work in the machine learning space going forward.
Lukas:
You know, it's funny. I feel like one way that everybody working in machine learning misallocates their time a little bit, including me, is that you almost always spend too much time looking at aggregate statistics versus individual examples. Every time you look at an individual example, you're just like, "Ah, I can't believe I missed this stupid thing that was breaking my model or making it worse in some way." And so, I wonder if the gap is... We have really good tools, I feel, for aggregate statistics, but it's hard to quickly drill into stuff, especially when your datasets can get very large.
Dominik:
I believe, actually, that we have... I totally agree with that. We have very good tools for looking at aggregate statistics. I think we also have reasonable tools for looking at individual examples. Go look at an image, that's okay, or a row in a table. But I think where it gets really tricky is understanding the in-between. So, understanding the subgroups that exist in the data. And that is hard because there are exponentially many possible subgroups in a dataset, and if you have a million rows, that's a lot of subgroups, and only very few of them are actually meaningful. So, understanding which subgroups are behaving oddly, or are negatively affecting your model, and looking at those, that is a challenge that I see over and over again. I think this problem of not aggregate and not individual, but somewhere in between, and where in between do I want to look, that to me is where the difficulty lies.
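One rough sketch of looking "in between", assuming a hypothetical evaluation table: instead of a single aggregate error rate or individual rows, compute error rates per subgroup and surface the worst-behaving ones:

```python
import pandas as pd

# Hypothetical evaluation table: one row per example, with model predictions
# and a few categorical fields that define candidate subgroups.
df = pd.DataFrame({
    "label":      [1, 0, 1, 1, 0, 1, 0, 0],
    "prediction": [1, 0, 0, 1, 1, 1, 0, 1],
    "country":    ["US", "US", "DE", "DE", "DE", "JP", "JP", "JP"],
    "device":     ["phone", "desktop", "phone", "phone",
                   "desktop", "desktop", "phone", "desktop"],
})
df["error"] = (df["prediction"] != df["label"]).astype(int)

# Aggregate view: a single number that hides which slices are failing.
print("overall error rate:", df["error"].mean())

# In-between view: error rate and support for every (country, device) subgroup,
# sorted so oddly-behaving groups float to the top.
by_group = (
    df.groupby(["country", "device"])["error"]
      .agg(error_rate="mean", n="count")
      .sort_values("error_rate", ascending=False)
)
print(by_group)
```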
Lukas:
All right. I think that's a nice note to end on. Thank you so much. That was really fun.
Dominik:
Okay. Yes. Thanks for all the questions and everything.
Lukas:
Thanks for listening to another episode of Gradient Dissent. Doing these interviews is a lot of fun, and it's especially fun for me when I can actually hear from the people that are listening to these episodes. So, if you wouldn't mind leaving a comment and telling me what you think, or starting a conversation, that would inspire me to do more of these episodes. And also, if you wouldn't mind liking and subscribing, I'd appreciate that a lot.