
Aaron Colak — ML and NLP in Experience Management

Aaron explains how Qualtrics uses machine learning for the enrichment of experience management, discusses the strength and speed of the current NLP ecosystem, and shares tips and tricks for organizing effective ML projects and teams
Created on August 24 | Last edited on August 26


About this episode

Aaron Colak is the Leader of Core Machine Learning at Qualtrics, an experience management company that takes large language models and applies them to real-world, B2B use cases.
In this episode, Aaron describes mixing classical linguistic analysis with deep learning models and how Qualtrics organized their machine learning teams and models to leverage the best of these techniques. He also explains how advances in NLP have opened up new opportunities in low-resource languages.


Timestamps

00:00 Intro
49:27 Outro


Transcript

Note: Transcriptions are provided by a third-party service, and may contain some inaccuracies. Please submit any corrections to riley@wandb.com. Thank you!

Intro

Aaron:
When you do sentiment analysis on this huge set of industries — companies we are trying to help to listen to their customers and employees — out of the box models don’t work. So how do you customize them? You can obviously go through customizing models to specific use cases, brands, or industries, but a much more powerful way is combining the power of these language models and letting the customers override the specific lexicons or rules and whatnot.
Lukas:
You're listening to Gradient Dissent, a show about machine learning in the real world, and I'm your host Lukas Biewald.
Aaron Colak leads a team of machine learning engineers, scientists, and linguists at Qualtrics. Qualtrics is a super big company that you might not have heard of that takes large language models and applies them to real world B2B use cases. This is a really interesting interview and I hope you enjoy it.

Evolving from surveys to experience management

Lukas:
Aaron, thanks so much for doing this, I really appreciate it. I kind of thought a good place to start would be Qualtrics, and it happens to be a company that I know well, because I've worked with Qualtrics for a long time and a real star employee of mine, John Le, ended up over there.
And so I know Qualtrics well, but I'm thinking a lot of our listeners will not know even what Qualtrics does. So maybe you can just start by saying what does Qualtrics the company do, and then tell us how machine learning fits into Qualtrics.
Aaron:
Sure. I think that would be a good starting point. Like many B2B companies, sometimes it’s a little bit not-obvious when you just use the technical terms to explain what the company does. But when we get to the bottom of it, it actually is pretty cool, so let me start.
We as individuals — human beings — use products, consume services every day. So every single usage of a product or a brand — being a customer of a brand, even working for a company or an organization, being an employee — from the individual's perspective is an experience. Every single transaction is, for us, an experience. Going to a restaurant, taking a flight, taking an interview, working somewhere for some time, is an experience. So what Qualtrics does is give our customers tools to design these experiences, track those experiences, analyze those experiences, and act on those experiences. So there are four pillars to experience management. Overall, we are trying to help our customers manage, find, detect, and fill those experience gaps.
Our company roots started in a somewhat interesting domain, which is surveys. Especially in social sciences and business schools, doing surveys is the primary tool to do research. And our founders were trying to help their father build a survey tool, and then it just exploded! It became phenomenally successful. And then at the next iteration, at the next level, those users of surveys — students and researchers — came back into the enterprise setting with different problems: trying to understand what customers think.
And that became our... we shifted a little bit towards market research and understanding customers, employees, and whatnot. And eventually, over the last few years, we've been working on this new category, which is experience management. So we would like to think of ourselves as founders and leaders in that space.
Lukas:
Could you tell us a little bit about the scale of Qualtrics?
Aaron:
Sure. That’s a great question. I like to think about scale in a couple of different ways. One of them is obviously the number of customers we have, and we have tens of thousands of customers. But I would like to think about other ways, in terms of, actually, a number of touch points going through our systems — the number of experiences we are analyzing — and also the diversity of the type of experiences and channels from which you can actually track and improve experiences.
So in that respect, our systems analyze millions of experiences every day, and we have different channels and modalities: social media channels, or other input channels such as surveys — text surveys, web surveys, mobile — and call centers. There are various modalities we are actually collecting and analyzing this data from. That is kind of, for me, one other aspect of scaling.

Detecting sentiment with ML

Lukas:
And so what are the really important ML applications to Qualtrics? Is it somehow processing those surveys in different ways? What's really at the core?
Aaron:
Absolutely. So as I mentioned, surveys are one of the most important channels, but it’s not the only one. Even in surveys, there’s definitely a big part of the data that’s being structured. And in my opinion, experience data is most easily — or most naturally — expressed in unstructured data.
So one of the most obvious places where ML comes into the equation is analyzing the unstructured part of the surveys, such as open text questions: analyzing sentiment, emotion, effort, and finding what folks are talking about — employees or customers. What is the individual topic-specific sentiment or emotion?
This is the sort of stuff where obviously machine learning is utilized mostly, but obviously there are other aspects. For example, the minute you go into a call center, then comes conversation, which is a totally different beast.
Lukas:
Can you make this more concrete for me though? Tell me one common survey that people might not realize happens.
Aaron:
Right. I think most folks would probably know about CSAT and NPS. NPS stands for “net promoter score,” and CSAT stands for “customer satisfaction score.” These are well-established industry standards, where businesses basically ask questions to customers, and those surveys can be structured, or the input channel can be unstructured, depending on how you score and what you express.
And you might be actually just filling in your experience. “How was your experience, Lukas?” You might be just filling it with, “Oh, the price was great, but the service was not good,” right? And you might imagine big enterprises— when they try to listen to their customers, there might be literally, possibly, practically infinitely many topics customers might be thinking about. So how do you detect and act on this?
This is where machine learning comes in, specifically NLP. Detecting what your customers are talking about — is it the price, is it the service quality, is it the taste of the food — and then, "What is the topic-level sentiment on this? What’s the emotion on this?" Things like that. I hope that made it a bit more concrete.
Lukas:
Yeah. I just want to make it even a little more concrete for people that don't know. We actually measure NPS religiously at Weights and Biases. People sometimes complain that we're asking too much, but we really love to ask NPS and that's a measure of "would you recommend this product to a friend" on, I think it's like a scale of one to 10, right?
And then you take the nines and tens and subtract the one through six or zero through six, those are the low ones or the detractors, and the high ones are the promoters. And it’s sort of a sense of “are people liking your product?” And then, I think, is CSAT the one where it’s like, “How would you feel if the service went away? Would you be disappointed or not?” Is that right?
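For readers who want the arithmetic spelled out, here is a minimal sketch of the NPS calculation Lukas describes above. The function name and the sample scores are illustrative only, not Qualtrics or W&B code.

```python
# Minimal sketch of the NPS calculation described above (illustrative only).
def net_promoter_score(scores):
    """Compute NPS from 0-10 survey responses."""
    promoters = sum(1 for s in scores if s >= 9)    # 9s and 10s
    detractors = sum(1 for s in scores if s <= 6)   # 0 through 6
    # NPS = % promoters - % detractors, reported on a -100..+100 scale.
    return 100 * (promoters - detractors) / len(scores)

print(net_promoter_score([10, 9, 8, 7, 6, 3, 10]))  # 3 promoters, 2 detractors -> ~14.3
```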
Aaron:
Could be. I think, depending on the context, a good way to think about CSAT and NPS is a little bit like the following: CSAT tends to be focused on transactional experiences, whereas NPS is more about relation: your relationship to that service provider or company or brand, taking everything into account. “How is your overall experience?”
Lukas:
And sorry, and CSAT is like one moment in time or one thing that you did?
Aaron:
Yeah. Transactional experience. Yes.
Lukas:
And so what's a typical question that a Qualtrics customer would have about all the NPS data they're collecting? It sounds like they maybe want to know what are the themes of things that people are unhappy about.
Aaron:
Correct, exactly. I think this is the most canonical use case. Every company cares about customers. Right? And this is really one of our biggest motivations because… as you are familiar from the big tech companies: some of them are phenomenally successful with their customer obsession.
So how do you enable the rest of the world, who doesn’t have an army of data scientists and engineers, to listen to their customers? And employees too, not just customers. Our tool can be used in these different settings to listen to different personas from their experience perspective.
Lukas:
And now, when you analyze this freeform NPS survey data, just to use a specific example, do you come at it with a specific set of categories that you're interested in, or do you kind of draw themes out with clustering or something like that?
Aaron:
That is a great question. Yes. So there are two types of experiences — speaking of experience — with our products if you want to do analytics on open-ended questions. This is where you usually find emerging new stuff.
You can go with what we call “industry-specific libraries,” where our domain experts and industry specialists collect and create these libraries of topics. But as we know, the world is changing fast, especially in certain industries. So how do you stay on top of things? How do you find emerging new stuff? How do you make sure things are not under your radar?
This is where ML comes into play. We actually have deployed machine learning — things like topic detection, key phrase detection — to surface that. We take the temporality dimension into the equation, of course. And then that’s where they dig in: either curated, ready-to-consume libraries, or finding topics on the fly.
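As a rough illustration of the “finding topics on the fly” idea, a common pattern is to embed open-text responses and cluster the embeddings to surface candidate themes. The model name, sample responses, and cluster count below are assumptions for this sketch, not Qualtrics' actual pipeline.

```python
# Sketch: surface emerging themes by clustering embedded open-text responses.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

responses = [
    "The checkout page kept timing out",
    "Checkout crashed twice before I could pay",
    "Support never answered my email",
    "Love the new mobile app design",
]

# Multilingual sentence encoder (illustrative choice).
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embeddings = model.encode(responses)

# Group responses into candidate topics for a human (or a topic library) to review.
labels = KMeans(n_clusters=2, random_state=0).fit_predict(embeddings)
for label, text in zip(labels, responses):
    print(label, text)
```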

Working with large language models and rule-based systems

Lukas:
And now, I feel like natural language processing in the past few years has moved faster than maybe any other field in ML. And suddenly you have this explosion of large language models, which are quite evocative in terms of text generation. But I'm always wondering how much this affects businesses like Qualtrics.
Do you use large language models a lot, and if so, where do you use them and not use them and how are you thinking about that?
Aaron:
Right. So I would like to first mention a disclaimer: obviously I’m a big fan of ML and deep learning. Having been at the University of Toronto during my grad school when all these things were happening, you can’t escape that gravity obviously.
Lukas:
For sure.
Aaron:
But also having been in ML for so long, our approach is pretty much going after what our customers’ needs dictate. As you know — as you’ve covered in this blog many times — large language models, contextual models, cross-lingual models: they’re game changing. Right? If you use them the right way… if you identify the right situation for them, they can be really powerful. And we do. We do use large language and cross-lingual models a lot.
But you might be surprised that we also use rule-based systems a lot. I think rule-based systems, heuristics, they enable you to not only go fast and be scrappy, but they also enable quite a bit of customization. Because when you use large language models, when you do sentiment analysis in this huge set of industries — companies we are trying to help to listen to their customers and employees — out of the box models don’t work. So how do you customize them?
You can obviously go through customizing models to specific use cases, brands, or industries, but a much more powerful way is combining the power of these language models and letting the customers override the specific lexicons or rules and whatnot. So yeah, we have the full spectrum, starting from classical linguistic analysis, lexicons, all the way to the bleeding edge deep language models. We use the full spectrum, and I think the future includes a hybrid — for us, at least — for the foreseeable future.
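To make the hybrid idea concrete, here is a toy sketch in which a generic sentiment model provides the default prediction and a customer-supplied lexicon rule overrides it for domain-specific terms. Only the Hugging Face pipeline call is standard; the override logic and lexicon contents are hypothetical.

```python
# Toy sketch: rule/lexicon overrides layered on top of a generic sentiment model.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")  # generic, out-of-the-box model

# Hypothetical customer override: for this brand, "sick" is positive slang.
customer_lexicon = {"sick": "POSITIVE"}

def hybrid_sentiment(text: str) -> str:
    for term, label in customer_lexicon.items():
        if term in text.lower():
            return label                        # customer rule wins
    return sentiment(text)[0]["label"]          # otherwise defer to the model

print(hybrid_sentiment("That new board is sick"))     # POSITIVE via the rule
print(hybrid_sentiment("Delivery took three weeks"))  # model decides
```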
Lukas:
And when you use these large language models, where are you getting them? Are you using Hugging Face or are you using some of the APIs out there, like OpenAI or Amazon or others, like how do you think about that? Do you feel like it's important for you to train your own?
Aaron:
Yes, because for a lot of the problems we are looking at, just taking a model, training it, and tuning it to the domain… there are obviously problems that can be solved with just simple tuning or domain adaptation. There is a large spectrum of problems where it’s not sufficient for us. There’s also the whole aspect of when you operate at this scale, we are dealing with millions of short and long text conversations: we also need to care about scale.
So model compression is a big area for us as well that we’re focusing on. We do use, pretty commonly, the XLM-RoBERTa type of models. We experiment with all the latest and greatest stuff that’s coming our way, and we pick the right model for the right setup. And if need be, we’re also customizing them in terms of the downstream application and in terms of combining the language models with other modalities and whatnot.
And how much customization we do in the model — how much tuning, where we freeze — it all depends on the exact, specific use case.

Zero-shot learning, NLP, and low-resource languages

Lukas:
Do you feel like the advances in NLP have changed your approach to machine learning over the last few years?
Aaron:
Absolutely. I think it’s a powerful, powerful tool. It’s not a silver bullet for everything, but — especially at an organization like ours, who tackles multilingual data in low-resource languages — it has been a big, powerful tool.
Lukas:
Interesting. So how do you use it for low-resource languages?
Aaron:
I have to be careful here, because when I say “low-resource language,” I might not be using the exact academic sense of “low-resource,” because that seems to be a bit of a moving target these days.
Lukas:
Sure.
Aaron:
What I mean is, from our perspective, every business has a target depending on where they operate — and what kind of products they’re developing — and has business priorities in terms of languages they want to handle. And in terms of the amount of raw, labeled, or customer feedback data we use to train these models, we obviously have disproportionately more English data than, say, even some common European languages.
And some of the success we got from these models — just purely based on zero-shot learning — was sometimes more than enough for getting a POC out there and then iterating on it. Because more often than not — having been in this situation multiple times — getting training data, labeling, can always be an issue. Sometimes it’s the chicken-and-egg problem. If I have one product out there, then I will have customers doing some edits for me, or giving feedback and data. But how do I get that if I don’t even have a model working?
So zero-shot is, in a way, game-changing in that respect. Because it enables us to — as long as you set the expectations right — get something out there, make it a win-win situation for you and your customers, while data starts pouring in and you iterate from there: from the feedback, from the implicit or explicit labels that come to your system. In that respect, it’s been game changing.
In the past, if you wanted support for any NLP system in X languages or K-many languages, you’d probably need K-many language experts, K-many sub-teams working on those. But right now, again, it’s not a silver bullet for every use case. But for a lot of the use cases, simple investments and small data sets can go a long way.
There’s also, actually, changing the paradigm in a different way too, Lukas, in my opinion. In the past, when you were doing an NLP project, every single project — whether it’s the same project in a different language or a different functionality in the same language — would require, almost from a data perspective, getting to ground zero. You start from ground zero. You cannot share a data set, pretty much.
But these pre-trained language models combined with cross-linguality enable us to basically pursue a lot of new ideas, new projects, for a fixed amount of budget, just because the amount of data you need to tune to a new feature, a new language is just significantly smaller. And that — for us, and for many others I know in our industry, in technology and the NLP space — has been changing the way we look at data.
Lukas:
Interesting. So how does this actually work? So you'd say you have some French language survey results. When you say "zero-shot," do you mean that you take some kind of embedding and put it into some comparable spaces to English, or how do you actually approach the rarer languages practically?
Aaron:
Let’s take an artificial problem, like text classification. I want to classify it into A, B, C — its sentiment and whatnot. We actually shared this in our blog post, but basically you train for English. You may have more data on it, or labeled data on it. But for French, if you have limited data, the least you can do is use that data for testing “how is my zero-shot performance?” Right?
And sometimes we have little data sets that are in these languages and we say like, “Okay, from a test perspective, it looks good enough to get...” or even “pretty satisfactory to coming close to the English performance” or whatever's the performance metric we want to hit. And if they’re not, you kind of get an idea of how your model is doing.
That might be enough for you to get a V0 out there and start collecting data from a feedback perspective, because our systems — not all, but some of them — allow our customers to give feedback on our predictions. Compare that to how we would do these kinds of things not even five years ago, when you had to go and start from scratch in French.
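A minimal sketch of that workflow, assuming a cross-lingual encoder such as XLM-RoBERTa fine-tuned only on English labels and then evaluated zero-shot on a small French test set. The tiny in-memory datasets are placeholders, not real survey data.

```python
# Sketch: train on English labels only, then measure zero-shot performance on French.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

english_train = Dataset.from_dict({
    "text": ["great service", "terrible food", "it was okay"],
    "label": [2, 0, 1],
})
french_test = Dataset.from_dict({
    "text": ["le service était excellent", "la nourriture était horrible"],
    "label": [2, 0],
})

model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1),
    train_dataset=english_train.map(tokenize, batched=True),
    eval_dataset=french_test.map(tokenize, batched=True),
)
trainer.train()            # supervised only on English
print(trainer.evaluate())  # zero-shot loss on French (add compute_metrics for accuracy)
```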
Lukas:
Yeah. Yeah, it's really impressive. It just works reasonably well right out of the gate, typically.
Aaron:
Usually. But again, there are problems sometimes. Not all of them. Nevertheless, very, very useful.

Letting customers control data

Lukas:
Do you end up training separate models for each customer then? Are you fine-tuning new models on every single customer's data? And if you have thousands of customers, does that create a huge logistical problem for you?
Aaron:
Yeah, that’s a great question. First of all, a couple of years ago, even if that were the right thing, it would not be feasible. Practically, very challenging. But these days, fortunately, hyperscalers provide a lot of functionality with various services in terms of multi-model endpoints and asking for their predictions and whatnot, even if you have a model endpoint for every single language or task you have.
But we do a combination. We try to use multitasking as often as we can. It’s just a powerful tool. Obviously, more often than not, you basically get a linear — proportionally linear — return on your combination of the tasks. So instead of having N models, you have basically one model doing N tasks if possible or applicable. Obviously, from a model life cycle management perspective it generates its own challenges as well. And that needs to be taken into account in the long-term design, because then you’re coupling models. If model requirements change and you need to update one task, do you really need to update the other tasks?
Lukas:
And how do you think about what a task is here? So I'm imagining if you have-
Aaron:
Let's give a concrete example.
Lukas:
Yeah.
Aaron:
Let’s say you’re trying to predict sentiment on a given text. And you might imagine sentiment and other related, more nuanced dimensions of human experience — emotion and whatnot — or other things you want to predict about this text, like intent. So, same input, and you basically can predict with a single model the emotion, the intent, and the sentiment at the same time. So these are individual tasks. Right? It doesn’t have to be a single prediction: it can be a classification task combined with a sequence-to-sequence task. Doesn’t matter.
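A condensed sketch of that “one model, N tasks” setup: a single shared text encoder with separate classification heads for sentiment, emotion, and intent. The encoder choice, label counts, and head design are illustrative assumptions, not Qualtrics' architecture.

```python
# Sketch: one shared encoder, multiple task heads (multitask learning).
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MultiTaskModel(nn.Module):
    def __init__(self, encoder_name="xlm-roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.sentiment_head = nn.Linear(hidden, 3)   # e.g. negative / neutral / positive
        self.emotion_head = nn.Linear(hidden, 6)     # e.g. six basic emotions
        self.intent_head = nn.Linear(hidden, 10)     # e.g. ten intent classes

    def forward(self, **inputs):
        pooled = self.encoder(**inputs).last_hidden_state[:, 0]  # first-token representation
        return {
            "sentiment": self.sentiment_head(pooled),
            "emotion": self.emotion_head(pooled),
            "intent": self.intent_head(pooled),
        }

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = MultiTaskModel()
batch = tokenizer(["The agent was rude but fast"], return_tensors="pt")
outputs = model(**batch)  # one forward pass, three task predictions
```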
Lukas:
But if you're doing this on behalf of two customers, do you consider that to be a single task, or does each customer's look like a different task?
Aaron:
Yeah, yeah. Yeah. Sorry. That reminds me— I didn’t quite answer your first question. We don’t do… As I mentioned, for customer-specific needs, we tend to think in terms of giving a customer the full power to customize or override the behavior of the model. That comes through using various enrichments we do to the text on top of whatever target task.
You can also do all the linguistic enrichments, and you can combine these linguistic enrichments with rules and other heuristics to actually override the model behavior. There are some initiatives going on that I’m not at liberty to discuss here, but we are thinking of enabling customization on the ML level, as well. This is not implemented yet.
Lukas:
I see. But I guess, does allowing the individual customers to kind of customize what the output's doing, I guess that means there's a single underlying model that's feeding the customers and then they sort of override it?
Aaron:
Right, right, right. That’s a good point. I think I should have made that distinction upfront. So we have two types of models: one of the models is what you call a “universal model” — these are models that work for all customers in the same way irrespective of who’s sending the data. But we also have customer-specific models.
For example, we have this tool called Predict iQ, where you can use experience data — what we call X-data — or operational data — O-data — or a combination of those to build predictive models, starting with churn prediction. Actually, that’s one of John’s products! So the product, Predict iQ, by definition is customer-specific because you as the customer bring your own data, define what your variables are, or let our system kind of do AutoML and automatically build, train, optimize, and deploy a model for you. And as new data comes in, you can actually… in a certain fashion, you can predict…
So, customer-specific models and universal models. But, as I understood your question — initially, by mistake — it was, “What about these universal models? How do you customize them?” Our approach to that is letting the customer override behaviors, so it’s not completely ML-based. But we can envision a future where we completely let customers give feedback and continue to train these models. This is more just thinking right now — there are no concrete plans or commitments on that one.

Deep learning and tabular data

Lukas:
And now I guess processing language data is pretty different from predicting churn. When you think about predicting churn from survey results or something like that, does deep learning have any role to play or do you go to more traditional models for that kind of tabular data?
Aaron:
That’s a great question, Lukas. The interesting thing is that people have been trying to extend some of the ideas that came from transformers to tabular data, as well. I think there are some variations developed specifically for tabular data. But I’m not convinced that the concept of pre-training — which is where most of the power for these language models comes from — quite applies, at least not in our setting.
Everybody’s customer data, everybody’s transaction data is different. The semantics of it are different. That being said, there is a future we are investing towards where we’ll be able to, hopefully, give our customers options to schematize their data: to map them to a shared schema. And when that happens, obviously, things change a little bit. Then you can actually envision a future where learnings can translate — global patterns can translate — from one data set to another.
Lukas:
And what would this mean to map to a schema? Does this mean kind of standardizing the customer names and standardizing the definition of churn or something else?
Aaron:
Right. Not quite that. More like, for example, think about the following: say you have a question about NPS, like, “How likely are you to recommend company or product X to your friends?” You can imagine this can be expressed in many, many different lexical and semantic forms, and different languages.
So capturing that question — identifying, “Hey, this is the same question. This is an age question. This is an income question.” Right? Basically, you’re structuring the data in that way and identifying the fields and numerical values and ranges and whatnot. Then data becomes mappable, data becomes transferable, learnings can become transferable.
Going back to the Predict iQ problem, yes: we use deep learning there as well, mostly canonical techniques. But not surprisingly with tabular data, tree-based models are pretty successful. Even if they’re not necessarily always the best in terms of performance metrics, it’s just much easier to work with them because they have this natural way of dealing with missing data, combining categorical and continuous features, numerical features, and whatnot. So there’s lots of ways.
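As a small illustration of why tree-based models are convenient here, the sketch below uses scikit-learn's histogram-based gradient boosting, which handles missing values natively with little preprocessing. The column names and rows are made up for illustration, not Predict iQ code.

```python
# Sketch: tabular churn prediction with a gradient-boosted tree model.
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "nps_score":        [9, 3, 7, None, 10, 2],   # missing value handled natively
    "tickets_last_90d": [0, 4, 1, 2, 0, 6],
    "tenure_months":    [24, 3, 12, 8, 36, 2],
    "churned":          [0, 1, 0, 1, 0, 1],
})

X, y = df.drop(columns="churned"), df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

clf = HistGradientBoostingClassifier().fit(X_train, y_train)
print(clf.predict_proba(X_test)[:, 1])  # predicted churn probability per account
```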
Lukas:
Got it, got it.
Aaron:
One way of still using deep learning with tabular data is that, obviously, even in tabular data, certain questions are still open-ended.
Lukas:
Totally.
Aaron:
Yep.

Hyperscalers and performance monitoring

Lukas:
So I guess, what does your infrastructure look like? Have you standardized everyone on a single machine learning platform? Have you standardized the frameworks that people use or is it open-ended?
Aaron:
I think in many ways we are far ahead compared to my past experiences, or compared to colleagues that I know, when we discuss the ML ecosystem and the state of affairs. But we have one advantage in a way. Our ML platform development efforts are relatively new, so we leverage a lot of the functionality that these days comes from hyperscalers. A couple of years ago, building an ML platform was a very big deal. Being able to support different hardware, different workflows, different personas was — even for a small ML team — a big, big deal.
These days, we are using hyperscalers, obviously: moving a lot of the heavy lifting to hyperscale functionality. And most of the work we do is basically harmonizing our data and our workflows and expressing them operationally in terms of our platform, which is based on tools like SageMaker and whatnot.
But yeah, our current ML training, serving, and scientist workbenches are all standardized, yet this is a fast-moving field. There are a lot of new systems. There are small and big players. You mix and match and try to leverage the best of both worlds.
Lukas:
Where do you feel like there are gaps? If somebody was listening and was thinking about making a company to do some ML tooling, where would you guide them? Or if our product team wanted to roll out something new, what would you appreciate the most?
Aaron:
First of all, having done ML for almost 20 years now, one of the things I most appreciate is that it’s kind of a dream come true seeing such a big ecosystem: things like, for example, experiment tracking.
For those of us who went through grad school tracking things by hand, I always make this analogy: I used to work in the computational biology field, and a lot of my collaborators and peers have these really nicely organized experiment notebooks. And I’m like, I will never be successful in this field because I’m never organized in files and whatnot. But as computer scientists, we can still write scripts to organize, no matter how messy things are.
But when I work today, I see tools like Weights and Biases, and other tools for model performance monitoring.
Lukas:
Do you have a favorite performance monitoring tool? Is there one actually that you use?
Aaron:
Not yet. We’ve actually narrowed it down to a couple of things, but we are still actively working on it. But my point here is that one of the things I strongly recommend to my team is leveraging productivity-boosting tools such as experimentation tracking, reproducibility.
For me, the biggest gap is still in CI/CD for a couple of reasons. I don’t think it’s as well understood as other parts of the ML lifecycle. And there are different personas involved: data scientists, ML scientists, ML engineers, application engineers. That is a complex problem. I think the nature of the problem is complex. Solving that problem seems really, really big. Some actors — including yourself — are doing some really interesting things out there, so I’m eagerly observing this field.
I think some of the core infrastructure problems — in terms of ability to support different hardware combinations, scale, and all that stuff — have been solved to a large degree these days. To me, the next level is really winning these scientist and MLE personas by building something they can connect to. Because, owing to this being a new industry, I still think the adoption of these tools is not quite there.
So, yeah, CI/CD: I guess I would put that at the top there. Also, depending on your application area, monitoring. And the third I would probably put — depending, again, on your industry application focus area — is monitoring as well, but with less focus on the operational perspective and more on the fairness and bias perspective. These are obviously good things to pay attention to, and there are also — these days — societal and legal reasons to pay more attention to these kinds of systems and regulations.
Lukas:
Is there any tooling that you're using or have built to help with fairness or any kind of explainability at Qualtrics?
Aaron:
Right. We are definitely looking at that, because we know our systems are being used in context-sensitive applications. I don’t want to disclose any specific names, but one thing that’s happening in this space is that testing AI systems — developing, testing frameworks, behavioral measurement frameworks for them — has taken off lately.
So there are tools coming from academia — papers and tools — as well as from industry. I haven’t seen industry adopting it as much. I might be wrong there, to be frank. There’s still, I think, some way to go there, but this is becoming… We are definitely looking at it. We are looking at our models, how they’re behaving under certain… whether it be gender bias, or other social identity biases. But bias can creep up in many ways, so this is going to be a continuous effort in our agenda.

Combining deep learning with linguistics

Lukas:
How do you think about building your team? I guess, how is your team structured now, and what skill sets do you look for?
Aaron:
Well, let me start with what my team does. We deal with basically all things ML, from building the ML platform, to working on building dataset ontologies and libraries for NLP applications and beyond. And then I have two applied science teams.
One of them is really focusing on NLP analytics applications. As I mentioned, we discussed a lot about surveys, but surveys are basically solicited feedback. At Qualtrics — you would be surprised — in terms of the volume of the data, much more text data is actually coming from other channels: social media and customer support applications. So for analytics, obviously we have a large team, and then we have made certain investments in this area to really grow our footprint and expertise in this one.
The other team we have is focusing more on infusing ML into all our product lines. And that includes more canonical applications — from time series modeling, anomaly detection, recommendation systems, path optimization, yield optimization — to fraud detection and things like that. And this depends on the business. For analytics, obviously, we’re looking for subject matter expertise first. Right?
Though, as much as we love and use deep learning — and we hire deep learning experts — we are also looking to make sure we are linguistically grounded. So we have a lot of linguistics experts who are actually building very deep linguistic analysis packages to make sure we marry the systems in the right way to solve our customers’ needs.
On the more canonical problems, we try to have a diverse team from a skill set perspective: deep learning, statistics, engineering. This field requires really going fast, solving problems, and not necessarily always coming up with a new approach or bleeding edge algorithm.
Lukas:
Interesting. And I guess, do you think there's anything that specifically makes somebody successful at Qualtrics? Or on your team outside of the normal things that a company would look for?
Aaron:
Sure. So Qualtrics ML is… Over the years, as our vision evolved, data and ML have become more and more central to our business. Because listening to these different channels and different data modalities, understanding and predicting, and the ability to give actionable data to our customers — to me, that boils down to deep data skills. And we have a lot of ways to leverage this different data, marrying experience data with operational data, and we are uniquely positioned to do that.
Maybe I should even answer the question, “Why consider Qualtrics if you’re working, for example, in the ML field?” I think it has a lot to do with the uniqueness of the problems and the data sets. When we look at the spectrum of problems, yes, we do have a lot of problems you can immediately relate to, but there are a lot of problems that are very unique, that don’t exist in other fields, or data sets that don’t exist in other places.
Obviously the volume is there, the volume of the data we’re tackling. But — and I’m speaking particularly from experience, for my team and myself — developing ML applications in a B2C setting is very different from a B2B setting. You’re dealing with very different customer personas. Supporting the ML cycle — when you think about the model life cycle and the ability to refresh models — the implications of that are much more permanent at enterprise scale. Switching out a model just because you have a new, better result is not as fast, or you don’t have as many degrees of freedom as you would have in a B2C setting. I might be overgeneralizing here, but that’s my own personal experience.
What else… I guess, from being B2B to working on very unique data sets and problems, it’s not always easy to go look up a paper and implement the technique. You need to really be creative and synthesize new solutions, come up with new ways to look at the data.

A sense of accomplishment

Lukas:
I guess, looking at your career, when you came from school into industry, you went to Amazon, right? What was the biggest surprise? I mean, that's always kind of a shock for people I think, going from research to practical applications. What was the biggest surprise for you?
Aaron:
The biggest surprise for me… well, actually, 2022… it was exactly ten years ago when I was doing a graduate internship. That was my first industry experience, and I was very academically oriented. It was the usual thing: writing papers, going to conferences, and trying to look out for the next step, which is post-doc.
For personal reasons, instead of spending a research summer, I took an industry internship. And instead of ending up in a research lab — because of visa problems — I ended up in a more industry application-type lab, and I tremendously enjoyed it. Because, up until that point, I always thought I enjoyed really tackling tough technical, scientific, open problems. But this is when I had the realization that I just like solving problems.
And, being in the same space in ML — which is still an applied research field — your every day, pretty much, is filled with some uncertainty. You still have that everyday unknown and excitement about what’s going to happen. “Will this experiment work?” You’re always continuously thinking, creating, looking at the data — everything changes. It never gets monotonous. For me, it was never like that.
And then, I was making this joke to my team members, but to some degree it’s true: here you get the fastest return for your work. I have written my fair share of papers, but here I see things going to production. It just gives a different sense of accomplishment, solving problems. And even today, when I look at what we are doing at Qualtrics — helping our customers solve their customers’ problems — I think it’s an amazing feeling. And that just keeps me going and focused on staying with problems, even though sometimes the data or technical problems might be very challenging.
And you are — I know it sounds a bit cheesy — but you are changing the world. I just had this terrible experience with one of my home projects, and I feel like I’ve sent 30 emails with nobody even bothering to respond. I’d like to think that in my world, one day, somebody at the other end of that thing will have a tool and they’ll think, “Hey, Aaron’s experience is broken, let’s surface it, let’s do something about it.” This is the future we are building at Qualtrics.

Causality and observational data in healthcare

Lukas:
That's great. I've definitely come to believe that listening to customer experience survey results is one of the real keys to building a successful company. So I actually totally identify with that. It's a good segue actually to our last two questions.
And the second to last one is basically, what is something that you think is understudied in machine learning? If you had more time or if you were back in academia, that you would spend some time looking into because you think it would be valuable?
Aaron:
I wouldn’t say… perhaps maybe not understudied, but one thing I’m waiting on to make a big splash is causality.
Before Qualtrics, I worked in the healthcare space for a couple of years. And, surprisingly, healthcare is super rich with a lot of interesting, very meaningful ML problems, but for regulatory and other reasons, it’s also moving a bit slowly. It’s been a fertile field for a lot of causality research, because in this space we often have these recommendation systems where we can apply some treatment and observe the effect, and I can see causality making a big boost in terms of how we should really think about ML, about stochasticity and predictive systems.
But it's one field — just because of the sheer complexity of obtaining treatment data — where we need to work with observational data in most settings. And I know there is recent interest in making causality work with observational data, and that would be, I think, game changing for a lot of applications. But maybe there’s not enough investment being done in that field, or it’s just fundamentally a hard problem that we need to be patient about. I don’t know, but that’s one field I’m keenly observing from the sidelines, waiting for.

Challenges of interdisciplinary collaboration

Lukas:
Interesting. And I guess final question, when you think about going from an idea of a new application to deployed, working in production, what's the biggest bottleneck?
Aaron:
Ah, oh, there's a classical question.
Lukas:
So is this really a classical question, or is this just a question I ask all the time?
Aaron:
Classical, I'm sorry, maybe classical is not the right term. Sorry, deep question.
Lukas:
Deep question. Important question. I agree.
Aaron:
We know ML is… everybody’s excited about it. ML has proven its value. But is ML delivering at the scale it’s being invested in? Probably not. There are all sorts of market research reports out there showing how much ML is failing, why it is failing. I think this boils down to that question. Most of the time, it’s going from that proof of concept to production. In my experience, depending on the setting you are in, there can be a couple of reasons contributing to that.
One of them is structural, probably. And this is one of the most common cases I’ve observed in my experience — from startups, to enterprises, to hedge funds, to other places. If you’re working with ML on a product feature, unless you’re really doing pure platform work, that requires a really close connection between the ML folks and the product folks.
Time after time, ML folks go build models, not cognizant of the underlying production constraints and whatnot, sometimes solving not the problem that the product requires. And that’s not specific to ML: that’s a system design problem. You go design the wrong thing, or you design a system without respect to the constraints that system needs to work within.
What particularly becomes problematic in ML is that if you don’t really have that structural support process in place, scientists — especially those working on, maybe not a current application but a deeper, technical problem space — they usually don’t know what having a model in production looks like from productionalization, from latency, from input, from output, from monitoring, from a system-design perspective.
The way we solve it in Qualtrics is that we empower ML engineers. ML engineers — they know ML, and they’re engineers at heart, by training — and we include them from the get-go. They’re in from conception all the way to the product launch. So they play a very critical role, mediating between how this model’s going to be used and what’s being designed. To me, that’s the essential role machine learning engineers should play. That’s obviously a very biased opinion, because machine learning engineers, data scientists, and applied scientists… I don’t think these are universal definitions. Every company has its own way of defining these roles.
But I’ve seen that when you don’t have a person who understands both domains well and gets involved in the processes in place… and I’m not even counting all these infrastructure issues. I’ve seen places where they’re trying to do NLP in a traditional microservice architecture, and things like that. You don’t have the right architecture. Even if you have the right infrastructure, I think it boils down to having the right people with the right skill set and having a process, really. A clean process. So you don’t have basically everybody doing everything. That’s where things start to break down.
That’s how we do it in Qualtrics. We have dedicated roles specializing in different aspects of this process, but always working together end-to-end. This is what we call “the trifecta model” — a machine learning engineer trifecta model — the machine learning engineers, the product engineers, and the ML scientists working together.

Outro

Lukas:
I see. Cool. Awesome. Well, thanks so much. I really appreciate your time.
Aaron:
Of course.
Lukas:
And it's fun to talk to someone who's deploying so many models in production, especially at a B2B company. You don't hear as many stories of this, so thank you very much.
Aaron:
Yep. Thank you, Lukas.

