Peter Wang on Anaconda, Python and Scientific Computing

Peter Wang talks about his journey co-founding Anaconda and serving as its CEO, his perspective on Python, and its use for scientific computing.
Angelica Pan

Listen on these platforms

Apple Podcasts Spotify Google Podcasts YouTube Soundcloud

Guest Bio

Peter Wang has been developing commercial scientific computing and visualization software for over 15 years. He has extensive experience in software design and development across a broad range of areas, including 3D graphics, geophysics, large data simulation and visualization, financial risk modeling, and medical imaging.
Peter’s interests in the fundamentals of vector computing and interactive visualization led him to co-found Anaconda (formerly Continuum Analytics). Peter leads the open source and community innovation group.
As a creator of the PyData community and conferences, he devotes time and energy to growing the Python data science community and advocating and teaching Python at conferences around the world. Peter holds a BA in Physics from Cornell University.

Show Notes

Topics Covered

0:00​ (intro) Technology is not value-neutral; Don't punt on ethics
1:30​ What is Conda?
2:57​ Peter's Story and Anaconda's beginning
6:45​ Do you ever regret choosing Python?
9:39​ On other programming languages
17:13​ Scientific Data Management in the Coming Decade
21:48​ Who are your customers?
26:24​ The ML hierarchy of needs
30:02​ The cybernetic era and Conway's Law
34:31 R vs Python
42:19​ Most underrated: Ethics - Don't Punt
46:50 Biggest bottlenecks: open source, Python

Transcript

Note: Transcriptions are provided by a third-party service, and may contain some inaccuracies. Please submit any corrections to angelica@wandb.com. Thank you!
Peter:
We have to be faced with this concept that technology is not value neutral. If you think about what machine learning really is, it is the application of massive amounts of compute, rent-a-supercomputer-in-the-cloud kind of massive amounts of compute, to massive amounts of data that's even deeper and creepier than ever before because there's sensors everywhere, to achieve business ends and to optimize business outcomes. We know just how good businesses are at capturing and self-regulating about externalities to their business outcomes. Just as a human looking at this, I would say, "Wow. I've got a chance to actually speak to this practitioner crowd about: if you're doing your job well, you'll be forced to drive a lot of conversations about ethics and the practice of what you're doing within your business as it goes through this data transformation." You should be ready for that. Steel yourself for that. Don't punt. Don't punt on it. We can't afford to punt.
Lukas:
You're listening to Gradient Dissent, a show where we learn about making machine learning models work in the real world. I'm your host, Lukas Biewald. Peter Wang is the co-founder, CEO, and creator of Anaconda. He's been developing commercial scientific computing and visualization software for over 15 years. He created the PyData community and conferences, and devotes his time and energy to growing the Python data science community and teaching Python at conferences around the world. Couldn't be more excited to talk to him. Maybe, for starters, I guess, I know of you because of your product Conda, which I've used for many years. I have a feeling most of the people listening to this will know what Conda is, but maybe could you describe it in your words just to make sure we're all on the same page here?
Peter:
Yeah. Conda is a package manager that we built as part of the overall Anaconda Python distribution. It started as a way just to get people package updates of the binary builds that we do. It's expanded to then manage virtual environments, so we could have different versions of libraries and different, in fact, versions of Python in user land on any of the platforms we support. Then, we also created a space for people to upload their own packages. That's Conda. That's the anaconda.org service. Around that, a community has grown up called Conda-forge where they make recipes, maintain recipes, upload packages. But lots of other people like the Nvidia folks or like PyTorch, they will upload their own official builds into the anaconda.org system. We run all of that infrastructure. We pay the bills for the CDN and for all the storage and everything. Then, we do have a community around the Conda package manager itself, so people making tools and extensions for it. That's in a nutshell what Conda is. You think of it as like an RPM or something like that, but primarily for data science and numerical-oriented computing.
Lukas:
What's your original background? Were you always making software running a successful software company?
Peter:
No. I've always been programming pretty much since I was like, I think, eight years old. I've been programming something. But I ended up going to college for physics. I graduated with a degree in physics. I decided to join kind of the dot-com boom by going and joining a startup. I've been in software ever since then. But I spent a number of years working in consulting using the scientific Python and the Python stack in the 2000s. That's really where I started seeing the possibilities for using Python for a broader set of data analysis use cases than just niche scientific and engineering computing kinds of use cases.
Lukas:
Cool. Can you explain to me what was going on that you started this project and what the original conception was when it began?
Peter:
Yeah. Sure. Well, the original conception, so the company was called Continuum Analytics. I started that with my co-founder, Travis Oliphant, who is the creator of NumPy and one of the co-founders of SciPy. We put the company together to promote the use of Python and to advance the state of the art for Python for a broader set of data analysis needs. That was the original vision. At that time, this was 2012 when we formed the company, Wes McKinney had just really started pushing pandas as a data frame library. The Jupyter Notebook was relatively new; at the time it was still called the IPython Notebook. The world was sort of awash in the Hadoop big data craze. What we could see was that once people threw all their data into Hadoop, they wanted to do bigger analyses. They wanted to do broader, more kind of cross-data-set, cross-schema sorts of analyses. They would need tools like Python. SQL wasn't going to do it for them. We were putting this stuff together. We were trying to find alternative MapReduce frameworks that were nicer to Python than Hadoop and the rest of the Apache Java JVM big data stack, if you will. The JVM world does not play with the Python C++ native world very well. In any case, as we were looking at doing all this stuff, it became clear to me that if people couldn't install SciPy and Matplotlib and IPython, they were not going to be able to install any newfangled compiler tools we built or any newfangled MapReduce framework. It was just going to be completely off the table. We started by saying, "Well, you know what? We should probably produce a special collection of packages, a distribution of Python, that helps people get started, that includes all of the basic things they need, that works on Mac, Windows, Linux." That was the basic idea. We built Anaconda. I came up with the name because it's Python for big data. It's a big snake, kind of.
Lukas:
Nice.
Peter:
Although, of course, I don't like snakes that much. Python is of course named after Monty Python, but whatever, we'll ignore that. That's where the name Anaconda came from for that product. Then, that just took off quite well. We eventually renamed the company from Continuum to Anaconda because we'd be at conferences and they'd say, "Where are you from?" or "What company are you with?" We'd say, "We're with Continuum," and they'd say, "Okay. Yeah. That's nice." We'd say, "Well, we make this thing called Anaconda." They'd say, "Oh, we use Anaconda. We love Anaconda." After that happens like the thousandth time, you figure out the world's telling you something, so anyway. But anyway, that's the journey. Since then, we've continued to push new open source tools and things like that in the Python data stack.
Lukas:
It's incredible, the impact, that I think you've had and certainly NumPy and SciPy in terms of just making Python a popular product. Do you ever regret choosing Python for this? Has that been a good choice for you?
Peter:
Oh, no, no. That was completely intentional. A thing that people should understand, I think, especially as more software engineers move into ML and become ML engineers: for them, language is just a choice. It's like, "Well, I'm a C++ coder. Now, I learned some Go. Now, I'm doing Python. Whatever. Python's got some warts, and it's got some good things." But the thing to recognize is that when Travis and I started this, the reason why we wanted to push Python was because of the democratization and the access, the accessibility of it. When you're a software developer, you learn new languages all the time because that's part of your gig. If you're not a software developer, if you're a subject matter expert or a domain expert in some other field, let's say you're a geneticist or a policy maker or an astrophysicist, learning a new programming language is hard. You're not really a coder anyway. You had to learn some Fortran or C++ or Matlab in grad school, but otherwise, you're not doing this on a weekend just because you love it. If you learn a language, it's going to stick with you for a while. If we as people who make languages or make software tools can find a language that people like to use, that's powerful for them, and that multiple different kinds of people can use, that's incredibly powerful. One of the things about Python is that its creator, Guido, was working on a project called Computer Programming for Everybody. Some of the ideas that went into Python came from the precursor language called ABC. That "readability counts" and executable-pseudocode thing, the same things that make Python hard to optimize and make it a source of consternation for statically typed language aficionados, those things also make it incredibly accessible to lots of people.
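As a small illustration of the "executable pseudocode" quality Peter describes, here is a hypothetical Python function (the names and data are made up for this example) that reads roughly the way you would describe the algorithm aloud:

```python
# A toy example of Python's "executable pseudocode" readability:
# the code reads close to a plain-English description of the algorithm.
def average_above(readings, threshold):
    """Mean of the readings that exceed the threshold, or None if none do."""
    selected = [r for r in readings if r > threshold]
    if not selected:
        return None
    return sum(selected) / len(selected)

print(average_above([1.0, 5.0, 9.0], 2.0))  # (5.0 + 9.0) / 2 = 7.0
```

Even someone who learned a little Fortran in grad school can usually follow this on first read, which is the accessibility point being made.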
When we make these kinds of advanced tools available and accessible to lots of people, what we do is grow the universe of possible innovations. For me, it's very intentional that we chose Python. There are a thousand new languages you could create that are better than Python along all these different dimensions. But at the end of the day, Python is kind of the language everyone uses. It is valuable that everyone uses that same language. I have a very, very strong opinion that we should continue promoting its use and growing its use, even as I fundamentally believe there must be a better language out there that's the successor to it. I have some ideas about that as well.
Lukas:
Oh, interesting. I'd love to hear about that, because we were talking with one of the fast.ai founders, Jeremy Howard. He's written so much Python code. He was really emphatic when I was talking to him on this same podcast about how Python can't possibly be the future of scientific computing. I was surprised. I would say my perspective is definitely a non-expert's, but I really enjoy programming in Python. Maybe it's hard for me to really see how things could be better, or maybe I don't have to worry about performance as much as other people. But what would your take be? Is there any kind of language with less adoption that you think is really intriguing and could replace Python, or are there tweaks to Python that you'd like to see? How do you think about that?
Peter:
Yes and no. There are languages out there that do interesting things, things that Python can't quite do or that Python may never be able to do. One of the fastest database systems out there is a thing called KDB, and the language in it, K. You're not going to find any... I mean, it comes from the APL roots, which are the precursors to the Fortran stuff, and then Matlab and NumPy and all these things. In any kind of ALGOL- and Modula-derived imperative programming language, you're not going to match the kinds of raw numerical performance that K and KDB can achieve. The creator of K and KDB has a new thing that he's building called Shakti, which is even more interesting. There's that kind of lineage of things. Take the most out-there, amazing bits of Lisp plus, like, Fortran, and you get something like that. Python is not there, but Python has a lot of the good parts of the ideas there. It expresses them in an infix imperative language. Then, there's things like Julia that-
Lukas:
Sorry. Sorry. Let me try to understand what you said about K and the other ones. What's the advantage? That they have the potential to be faster?
Peter:
It's more than just faster. It's a fast and correct and performant representation of your idea, but you have to sort of warp your brain a little bit to thinking in that way. Ken Iverson, the creator of APL, which is the root of all of this stuff, had this idea that notation is a tool of thought. If you really want to help people think better and faster and more correctly all at the same time, you need better notations. If you ever go and look at a bit of K, it looks different, let's put it that way, than what you are mostly used to in a Python or even a C++ or C or Java world. It's completely different. It comes with a different brain space.
Lukas:
Interesting. Is that just because it's sort of following different conventions, or is there something to this perspective? Because I feel like every so often, not in many years, but in grad school, I used to occasionally run across Forth. It would just be like, "Okay. I'm stopping here. I'm not going to go any deeper. This just feels impenetrable to me." But is that my fault, or is that like... Yeah. Is there something there that's like better about it, I guess, in the notation?
Peter:
Well, better is a big word. I'll back up and say the difference between something like K or Forth or J, kind of the J, K, Forth, APL family, versus ALGOL or like Pascal, C, kind of this lineage of fairly imperative, procedural languages. At the end of the day, we are programming. When we write a program, we're making a balance of three things. There's the expression itself, what it is we're trying to express. There's the data, the representation of the data. Then, there's some compute system that's able to compute on that data. I call this kind of the iron triangle of programming: you've got expressions and expressivity or expressiveness. You have data, schemas, data correctness, things like that. Then, you've got the compute, which is runtime, again, correctness, runtime characteristics. Every programming system sits somewhere in the middle of this ternary chart. Usually, you trade off. What happens is you usually collapse one axis onto the other, and you have a linear trade-off. Most of the post-Niklaus Wirth era looks at it as: okay, you've got data, you've got a virtual machine, and you're going to basically load data in and do things to it with functions that proceed like this. That model is sort of what everyone has in their heads as a programming system. When you look at something like Forth or like K, you actually come from a different perspective. Forth, I'll throw that in there because even when you do have an explicit data representation in mind, when you write programs in Forth, or if you ever had an HP calculator with reverse Polish notation, probably the closest most people will ever get to Forth, you're explicitly manipulating stacks. You're explicitly manipulating these things. You're writing tiny programs that can do a lot. It's amazing. That's with an explicit stack and these kinds of explicit things. When you go to something like Lisp or like K, you're writing these conceptual things, these expressions.
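The explicit stack manipulation Peter describes can be sketched in a few lines of Python. This is a toy reverse-Polish evaluator, not real Forth, but it makes the stack discipline visible:

```python
# Toy reverse-Polish (Forth/HP-calculator style) evaluator.
# Operands are pushed onto an explicit stack; operators pop their
# arguments off it and push the result back.
def rpn(tokens):
    ops = {"+": lambda a, b: a + b,
           "-": lambda a, b: a - b,
           "*": lambda a, b: a * b,
           "/": lambda a, b: a / b}
    stack = []
    for tok in tokens:
        if tok in ops:
            b = stack.pop()          # top of stack is the right operand
            a = stack.pop()
            stack.append(ops[tok](a, b))
        else:
            stack.append(float(tok))  # operand: push it
    return stack.pop()

print(rpn("3 4 + 2 *".split()))  # (3 + 4) * 2 = 14.0
```

Tiny programs doing a lot is the point: the whole "machine" is a list and a loop.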
Well, in the case of Lisp it's a conceptual algorithm. In the case of K, it's also an algorithm, but it's an algorithm on parallelizable data structures, on arrays and on vectors. Then, part of the first-class things you can do is change the structure of those data structures. You can do fold operators. You can apply in these ways. You can broadcast and collapse and project. All of those are first-class little things you can do inline as you're trying to express something else. You end up with a line of K that's this long that would take a page of Java to do. By the way, the genius of the K system is that the underlying machine that interprets that, the compiler and then the interpreter system, is incredibly mathematically elegant, because there's actually a fundamental algebra that sits at the heart of this stuff. Basically, K will load into... I think the claim is that it loads into L1 I-cache. Your program just streams to the CPU like a mofo. You're never even hitting L2. That's kind of an amazing thing. I think when you turn around and look at something like Python, which is not that optimized at all, it's like a C-based virtual machine, but when we do NumPy things, you're expressing some of those same ideas.
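That last point, expressing array-language ideas in NumPy, can be shown with a small sketch: the same row normalization written first as explicit loops and then as one broadcast-and-reduce expression:

```python
import numpy as np

# Normalize each row of a matrix to sum to 1.
x = np.arange(12, dtype=float).reshape(3, 4)

# Imperative version: the "load data in and do things to it" style.
out_loop = np.empty_like(x)
for i in range(x.shape[0]):
    s = 0.0
    for j in range(x.shape[1]):
        s += x[i, j]
    for j in range(x.shape[1]):
        out_loop[i, j] = x[i, j] / s

# Array-language version: a reduction (sum over axis 1) and a
# broadcast (divide a (3, 4) array by a (3, 1) array) in one line.
out_vec = x / x.sum(axis=1, keepdims=True)

assert np.allclose(out_loop, out_vec)
```

The one-line version is the NumPy analogue of the fold/broadcast/project operations Peter describes as first-class in K.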
Lukas:
Yeah. I was going to say this reminds me of my experience with NumPy where I keep making it tighter and tighter and shorter and shorter and more and more elegant. But then, when I need to debug it, I feel like I often end up just unpacking the whole thing again. I don't know if that's like me being stupid.
Peter:
Well, it depends on what you're debugging though, because you can make it compact. Then, when you debug it, it's like, are you debugging an actual bug in the runtime of NumPy itself? Are you debugging a performance mismatch with your expectation relative to how the data structure is laid out in memory? Are you debugging an impedance mismatch between your understanding of what NumPy is going to do in each of these steps versus what it's... There's a lot of things to debug, so to speak. But that's one of the downsides of making really tight NumPy snippets, because I did some of that back in the day. I was like, "Oh, this is so great." Then, something blows up, and it's like, "Oh, crap."
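The "unpacking" Lukas describes often amounts to naming each intermediate and checking its shape. A sketch of the pattern, with made-up data:

```python
import numpy as np

a = np.random.default_rng(0).standard_normal((5, 3))
b = np.random.default_rng(1).standard_normal((3,))

# Compact version: elegant, but opaque when something blows up.
compact = ((a - b) ** 2).sum(axis=1)

# Unpacked version: the same computation, with each intermediate
# named and its shape asserted, which is usually what debugging a
# tight snippet really means.
diff = a - b                  # broadcasting: (5, 3) - (3,) -> (5, 3)
assert diff.shape == (5, 3)
sq = diff ** 2
dist = sq.sum(axis=1)         # reduce over columns -> (5,)
assert dist.shape == (5,)

assert np.allclose(compact, dist)
```

The shape asserts catch the most common class of surprise (an unintended broadcast) before it propagates.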
Lukas:
But wait. I'm like taking off in all these tangents. I'm actually really fascinated.
Peter:
This is a conversation.
Lukas:
Totally. You were saying, so you're comparing to K which actually Jeremy Howard did talk about and really, really praised.
Peter:
Great.
Lukas:
But then, what are the other kinds of languages that have interesting pieces that could be useful for scientific computing?
Peter:
Yes. Jim Gray, the late, great Jim Gray, wrote an amazing paper back in 2005 called Scientific Data Management in the Coming Decade. It was prescient. It was ahead of its time, I think. Jim knew it, but he was writing this great paper anyway. There are so many different things he talks about in this paper. It's worth everyone reading it. He talked about how we would need to have computational sort of notebooks, how we would need to have metadata indices over large data that would have to live in data centers that we couldn't move anymore. We'd have to do computing there. We'd have to move ideas to code... oh, sorry, move code to data, move ideas to data, all these different things. But one of the things he explores is: why don't scientists use databases? Databases are the realm of business apps and Oracle nerds. Why don't geneticists and astrophysicists use databases? The closest they get is using HDF5, which is really just like a file system. Great. It's a tarball. It describes how it lays out in memory, so you can compute on it. That's great. You can do out-of-core execution on it. But why don't scientists use databases more? He looked at this a little bit, but one of the things I think that would really move scientific computing forward is to treat the data side of the problem as being more than just fast arrays. Actually, as we have more and more sensor systems with more and more computational machinery generating additional data sets, which then become transformed into additional data sets, that entire data provenance pipeline... even as businesses have to reinvent the enterprise data warehouse to do machine learning on all their business data, I think scientific computing has to honestly sit down and face this giant problem it's been trying to ignore for a very long time, which is: how do we actually make sense of our data, not just some /home/some grad student's name/temp/project five/whatever. We've got to actually do this for reals.
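The out-of-core idea mentioned for HDF5 can be sketched with NumPy's file-backed arrays. This is a simplified stand-in for HDF5, not HDF5 itself, with made-up sizes and paths:

```python
import os
import tempfile
import numpy as np

# Out-of-core sketch: the array lives in a file on disk, and we
# stream over it in chunks instead of loading it all into RAM.
path = os.path.join(tempfile.mkdtemp(), "data.bin")
n = 1_000_000
mm = np.memmap(path, dtype="float64", mode="w+", shape=(n,))
mm[:] = np.arange(n)  # pretend this is sensor data too big for memory
mm.flush()

# Compute the mean by streaming chunks from the read-only mapping.
ro = np.memmap(path, dtype="float64", mode="r", shape=(n,))
total, count = 0.0, 0
for start in range(0, n, 100_000):
    chunk = ro[start:start + 100_000]
    total += chunk.sum()
    count += chunk.size

print(total / count)  # mean of 0..n-1, i.e. 499999.5
```

HDF5 (via h5py or PyTables) adds named datasets, chunking, and attributes on top of essentially this layout-on-disk idea, which is what makes Peter's "it's a tarball with a memory layout" description apt.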
I think one of the ways to move scientific computing forward, on the completely opposite side from going to K land and fast APL land, is treating the metadata problem and the data catalog problem and, in fact, the schema semantics problem as first-class problems for scientific computing. If you look at what F# did with Type Providers, building a nice extensible catalog of schemas that was actually part of your coding as you were using data sets, they did that 10 years ago. That stuff is amazing. That is something that we should make available. That's something that would be a game changer. I don't know if you saw this thing where some, like, internet council of geneticists declared they would actually change gene names. Did you hear about this?
Lukas:
No.
Peter:
There were gene names like MARCH1 and SEPT1, things like that, that they changed because [inaudible 00:19:57] use Excel so much. When those show up in Excel data, Excel translates them into dates. It screws them up. Because of Excel's auto-formatting, they're literally changing the names of these genes. This is how depraved science has gotten. Not that those are necessarily great names to start with, but the fact that we will wrap ourselves around a fairly broken tool for this purpose... I don't know. For me, handling the data and schema problem for science, like, full stop, that's a huge part of the problem that needs to be done.
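The same failure mode exists in any tool that guesses types eagerly. In pandas, for example, one defensive pattern (a sketch with made-up data) is to read every column as a string and convert explicitly, so symbols like MARCH1 and SEPT1 are never reinterpreted:

```python
import io
import pandas as pd

# Gene symbols like MARCH1 and SEPT1 are exactly the kind of strings
# that eager type-guessing can mangle into dates. Reading everything
# as str and converting columns explicitly avoids that class of bug.
csv = io.StringIO("gene,count\nMARCH1,12\nSEPT1,7\n")
df = pd.read_csv(csv, dtype=str)
df["count"] = df["count"].astype(int)  # opt in, column by column

print(df["gene"].tolist())  # ['MARCH1', 'SEPT1']
```

The schema is stated in code rather than inferred, which is a small-scale version of the first-class-schema argument Peter is making.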
Lukas:
Yeah. That's so interesting. Is that something you're working on?
Peter:
No. I have a company to run. Like, we have to make money! No. When we get to a certain point where we have the resources to invest in additional projects, then this is one of the ones I would absolutely try to tackle. We do have a project that's in this vein. It's called Intake. It's not the sexiest-sounding thing in the world, but Intake is a virtual data catalog that lets you set up a data server. If you set up an Intake server over near your data and you fire up the client in your terminal, in your notebook, or whatever, you can connect to it, and you can hit it with, like, remote Pandas and Dask calls and things like that. You can also create transformed, almost like materialized, views of those things on the server. It's been used in a few projects. Some people are starting to pick it up, but it's something I would recommend people check out. It's called Intake.
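For flavor, an Intake catalog is typically a small YAML file. This is a rough sketch with hypothetical source names and paths; check the Intake documentation for the current catalog schema:

```yaml
# catalog.yml -- hypothetical Intake catalog sketch
sources:
  station_temps:
    description: Daily temperature readings, one CSV per station
    driver: csv
    args:
      urlpath: "s3://example-bucket/temps/*.csv"
```

A client would then do something along the lines of `intake.open_catalog("catalog.yml").station_temps.read()` to get a DataFrame back, with the catalog, not the analysis script, owning the knowledge of where the data lives and how to load it.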
Lukas:
Cool. All right. We'll put a link to it. Can you give me some examples of who your customers are? This is like such business speak. What's the value?
Peter:
Yeah. We have a couple of different things that we sell. For a while now, we've been selling an enterprise machine learning platform called Anaconda Enterprise. It's based on Kubernetes. IT can stand it up; data scientists log into it. They have a managed, governed notebook environment, well, any number of different UIs, but generally, people prefer notebook environments. Then, they have one-click deploy for dashboards, for notebooks, and things like that. They can run machine learning models and have REST endpoints they deploy. It's a, yeah, a big data science platform thing. There's another thing we sell that is just the package server. A lot of the value that businesses get from us is that they have an actual vendor-backed place to get binaries to run in their governed environments, which actually does matter to them. In that situation, what they want to do is they have, like, a server. They buy a server from us that has the packages. Then, it's proxied locally for them. We don't get to see all the packages that they're downloading or what they're doing with their data analysis. They also have faster access to all of these different packages. For their IT people, this is a really important thing: IT then has a chance to govern which clusters, which machines, and which environments can use which versions of which libraries. That is really important because in an enterprise environment, you have data scientists who want the latest, the greatest, and bleeding-edge everything. Then, you've got production machines, which you do not want getting the latest and greatest everything. You want to know exactly which version, how many CVEs, which ones are patched. That's all that runs in production. This is a package server that gives businesses the ability to do that. Those are primarily our two commercial products. We'll be coming up with some more things later in the year.
There's also an individual commercial edition that individual practitioners can buy, things like that.
Lukas:
You've been doing this a while, like at least a decade.
Peter:
No. Not a decade. An octal decade. We started in 2012.
Lukas:
Nice. Even that is quite a long time, I think, for this space. I'm curious when you started, what kinds of customers or what industries are using you the most and how has that changed over the last eight years?
Peter:
Yeah. When we started, it was very heavily in finance, so hedge funds, investment banks, things like that. There was a heavy use of Python there at the time. We were doing a lot of consulting and training, open source consulting, standard sorts of things like that. Nowadays, you see a lot of these venture-backed open source companies that have a product. It's like, "Here's our open source foobar. Here's the enterprise foobar++." Then, Amazon builds a clone of it off their open source. They go public anyway, make tons of money. This is a play that many companies have done, especially around some of the big data infrastructure projects. It's a pretty popular move. We are an open source company that supports an ecosystem of innovation. There's a lot of things out there that we deliver and ship via Anaconda that we ourselves don't write. That innovation space has changed. It's gotten sucked into so many different things. Now, we've seen everybody, insurance, oil and gas, logistics, DoD and three-letter agencies, just everybody, using Python to do data analysis and machine learning. It's just literally everywhere, like sports betting sites, Netflix, and the Ubers of the world. Everybody is doing this stuff. Now, not all of them are paying us yet as paying customers, but that diversification, well, I would say diversification, but that growth in adoption, was what we were hoping to unleash when we started the company. It's been really great to see all that happening. We couldn't have predicted deep learning. We couldn't have predicted that machine learning was the thing to take off. We were really thinking that it would be more rapid dashboards, around notebooks, around building: here's the data analysis. I'm a subject matter expert because I can write a little bit of Python code.
I now can produce a much more meaningful, rich, interactive dashboard and control panel for my business processes or for my, whatever, heavy industrial machinery. We saw that happening pretty well in the 2000s around a rich client toolset as sort of a Matlab displacer. But now, with machine learning on the rise, it's completely flipped Python usage into a different mode. As you would know at Weights and Biases, that's the dominant conversation in Python now, but these other use cases are still there. There's still a lot of people using Python for all these engineering simulation things. Anyway, it's just been great to see all this growth and diversification of use cases.
Lukas:
Is machine learning even the top use case that you see? It feels like the buzziest right now, but I always wonder what's the reality of the usage volumes versus what you see on the ground.
Peter:
It's the aspiration that people get paid for. I think there's a strong disconnect with older businesses. I would say Python has crossed the chasm. You talk about the chasm of technology and crossing the chasm. Python has crossed the chasm. On the other side of the chasm, the way that this kind of innovative technology has landed is that you have a lot of buyers who are not as sophisticated about what it is they really want to buy, or what it is they're buying, or how ready they are as a business to adopt what they've bought. You can buy the fanciest Ferrari, but if you have a dirt track road, it's not going to go as fast as if you have an actual smooth paved road. A lot of businesses have this problem where they can buy the hottest, sweetest ML gear, team, tooling, blah, blah, blah, but then their internal data is just a mishmash. You spend 80% of your time digging that ML team out of the data swamp. That message, I think people are starting to get it now as they come over the chasm into the trough of, what, not despair, something-
Lukas:
Disillusionment.
Peter:
Disillusionment. That one. Right. But the truth is this: there's an ML hierarchy of needs, just like Maslow's. If you don't have your data stuff together, if you don't understand the domain problem you're trying to solve, you have no business even doing data science on it. If you haven't done data science, there are no models to go and optimize with machine learning. But if you get all that in place, then machine learning can absolutely deliver on the promise. I think people try to buy the promise, but most of the people they pay are out there slogging through trying to basically denormalize data, dedupe data, and just do a lot of that kind of stuff.
Lukas:
Most of the verticals that you mentioned, I think, are not the first things that come to mind here in Silicon Valley for ML applications, but you actually see, like, insurance doing ML and thinking of it as ML, just as a specific example.
Peter:
Oh, absolutely. The hardcore finance folks are probably the only people, I would say, that lead Silicon Valley in terms of ML. The hedge funds were there first because they operate in a pure data environment. The thing about that data environment is everyone else is operating in the same pure data environment. By the way, it's all zero-sum. If you screw up by a millisecond, you lose millions of dollars. Incredibly hard odds, or hard boundary conditions, to be optimizing in. I think in Silicon Valley, it's a lot of consumer behavior. It's a lot of, like, this kind of thing. Certainly anything in ad tech and the attention economy, the ML there is fairly low-stakes. Of course, hundreds of billions of dollars of corporate valuation hang in the balance, but if you screw a little bit of something up, it's like, "Well, they'll be back tomorrow doomscrolling. We'll give them some better content tomorrow." But when you're in insurance and these other things, those models, the kinds of diligence that a credit card company has about honest models and model integrity, the kinds of actuarial work that goes into building models at an insurance company, that's real. There's real, hard uncertainty. If you screw up, that's a $100 million screw-up. There's real stuff happening there. There are no lightweights on this stuff. They're doing real things.
Lukas:
Cool. I guess when I've talked to insurance companies, it's felt like there are almost these two separate teams that feel a little bit at odds with each other. There are the old-school math guys, the actuaries, who are like, "What is this? We've been doing ML forever. This is just a re-branding of the stuff we've always been doing." Then there are a couple guys off to the side maybe doing some crazy deep learning projects, and you wonder how connected they are to the business. Do you feel that same dynamic or -
Peter:
Oh yeah. Absolutely. Any organization over, like, 50 people is a complex beast. Even 50 people can be pretty complex. Within these larger firms, there is definitely an internal struggle as they make this data transformation into what I've been calling the cybernetic era. For many of them, the theory of action is still open-loop. It's like, "Whoa, we sell this particular insurance policy, and we'll see what comes back five years from now." We'll look at five-year retrospective performance, and then we'll know if the model was correct. Those old-guard folks are... Yeah, a bunch of actuaries writing a bunch of SAS code, that's some old-school stuff. Then there are new people in that space who have access to the data, who have the statistical background, and who know they can do way better. There is a conversation happening. Credit card companies are a great example because there's regulatory pressure. There are old-school models in SAS. There are newer people trying to build better credit models. There are really cutting-edge people doing real-time risk and real-time fraud, all these kinds of things, sometimes using deep learning on all sorts of GPU-based clusters. You see a whole pile of different things within a credit card company that you might not see elsewhere. In Silicon Valley, it's been more of a monoculture because there's less tech overburden they had to dig out from. It's like, "Well, we need a bunch of machines in the cloud. You got it," because there are no regulators checking into this stuff. Yeah.
Lukas:
What do you make of, I guess, the Matlabs and the SASes of the world? Is that ever a sensible choice for someone's tech stack, or is that just a completely legacy software choice?
Peter:
Well, let me see here. I think the best way to answer that is that any time we make a technology choice, we should be very respectful of Conway's Law, which says that the technology systems we build, the software systems we build, are a reflection of the communication patterns within the teams that built them.
Lukas:
Third time that's come up this week in interviews, by the way.
Peter:
Really? Yeah. But it hits the ML stuff in a different way, which is that if two different teams speak different languages, then you have two teams. If the same team speaks two different languages, you also have two teams. We actually see this with people trying to get Python into ML production, where sometimes those production processes are optimized for managing a pile of Java with a bunch of Maven, or you have C++ because you only deploy TensorFlow's C++ runtime. When you have a language barrier, you create two technology fiefdoms, which then lead to a bimodal or trimodal software product. An ML system is a software product, however you want to look at it; fundamentally, it has a pile of software in it. So when we talk about this question of whether Matlab or SAS is ever an appropriate choice, the answer is obviously yes: if the whole team knows Matlab or SAS and they're building this thing, then you'd probably use Matlab or SAS even for brand new projects starting tomorrow. However, taking a step back, the question then is: if I'm the manager of this team, how much longer do I want a team that only knows Matlab or SAS, when clearly all the papers at ICML or wherever are being published in Python? You've got to make that call if you're the manager. I would say the answer is yes, but if you're doing that, you should be aware that there's all this innovation happening in other languages. You can bring those languages into a hybrid environment. You say, "Fine, I'll hybridize. I've got my legacy Matlab that's never going away, because that's how we model airflow through this turbine system, and I'm not going to redo all that work." But then I have to build discipline about how to hybridize, how to bring these people forward. They know some Python; bring the Python technology back to couple with the Matlab, and see yourself as having to become an expert in doing that. I think the answer is yes.
That would be my answer. Yes, you can absolutely make a justification for starting new projects in those tools, but generally only if you're doing it with teams that already know those languages. I wouldn't recommend it for a Python team.
Lukas:
Okay. What about R? Where does that sit? Is that ever a reasonable choice for a team with a greenfield project, or-
Peter:
Yeah. Of course. There's lots of people who do that.
Lukas:
What would be going on that you would choose to use R versus Python?
Peter:
Well, for me, because I'm a Python expert, I would choose Python. The only reason I would have my team use R is if there's a lot of existing stuff that's in R, or they're all R experts, in which case I'm not going to try to convert them to Python. I'm going to try to make the best go of R with them. But if there are really new capabilities that are only available via a Python bridge to some CPU or GPU stuff, then I would have to hire some people who are polyglot, who could build that bridge. Again, it comes down to the teams.
Lukas:
Although I feel like... I don't think you really have the perspective that all languages are created equal. Of course, in the real world we have to choose our language based on what libraries are available or what's going to be maintainable. But I'm curious what you make of R. When I was in grad school, I used all R, and I absolutely loved it. Then I had this experience of seeing Pandas and NumPy and just being like, "Well, this is way better. I just want to switch and use this." But some people take the opposite position from you on that. They would say, "I went to R, and now I can think like a statistician again and actually express what I mean. The Tidyverse and dplyr and these things are so nice, and ggplot's gorgeous," and all these things. And yeah, that's true.
Peter:
A lot of the R advocates have good points. There is, I would say... monoculture is the wrong term. There's a smaller set of obvious choices in R. If you've used those, and the team around you uses those, you can get to very nice results without a whole lot of people tearing their hair out because they have conflicting versions of 15 different plotting libraries, like we have on the Python end.
Lukas:
True.
Peter:
Anyway. I don't know. Of course I don't think all languages are created equal, but you did ask me whether there's ever a reason to use these different languages, and I said, "Yes, there's always a reason to use these other languages." I will point out that Anaconda does support R, with R packages managed within the Conda environment. One of the things we're doing is actually looking at the precise versions and building the dependency graph. So if you don't want to just take a whole CRAN snapshot, if you want to say, "What if I want this version of this thing, but that version of that thing? I just want to upgrade that one package. Can I do it?", we're applying the Conda approach to package management to the R ecosystem as well. That's happening in conjunction with the conda-forge folks, and it's still building out. We now have coverage of several thousand libraries in the R universe too.
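[Editor's note: in practice, the Conda-managed R workflow Peter describes looks roughly like the following. The package versions shown are illustrative, not from the conversation; conda-forge publishes CRAN packages under `r-<packagename>` names.]

```shell
# Create an environment with R and some common R packages from conda-forge
conda create -n r-env -c conda-forge r-base r-ggplot2 r-dplyr

# Upgrade a single R package rather than swapping the whole snapshot;
# conda's solver walks the dependency graph to verify the combination is consistent
conda install -n r-env -c conda-forge r-ggplot2

# Inspect the exact pinned versions in the environment
conda list -n r-env
```

This is the "precise versions and dependency graph" point: each `r-*` package carries metadata that the solver checks, so mixing versions either resolves cleanly or fails loudly instead of silently breaking.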
Lukas:
Well, awesome. It's interesting. Okay. Here's the question I really want to ask you. I'm just going to ask
Peter:
Go for it.
Lukas:
... this question. This might be dumb, but I guess the one thing that I really felt when I switched from R to Python is like, "Man, the graphing libraries are worse." I hope I don't offend anyone with that comment. I feel like NumPy has improved so much, and SciPy seems to have so many libraries that it covers anything I would want as someone who's not a super deep scientist. It used to feel like R had way more packages to cover my needs, and it feels like Python's kind of catching up there, but it still feels to me like the graphing libraries are frustrating. Is that because I'm misusing them, or is there a library out there I should know about?
Peter:
Yeah. No. The graphing libraries in Python are awesome. It's clearly the user's fault. I don't know what you're talking about. No, it's a common complaint. Let me try to take a more objective look at all of these, even though I'm the originator, let's say (not the only author), of two or three of the graphing libraries in Python, of which there are now several dozen. There are a couple of different things going on here. What R does very well, precisely because it was designed by a user community for that community, and also because of its sort of Lispy heritage, is some really neat tricks: preserving the transformation pipeline, encoding expressions, things like that, which give you some really awesome superpowers when you're building things like facets. It just does the right thing, obviously. Also, Hadley did great work with ggplot2; that's not to say there wasn't hard work involved there. But if you want to do plots outside of the things ggplot is great for, then it's a more impoverished landscape, let's say. If you want to do real-time spectrograms in R, I don't know, man. Or if you want to do really large-scale interactive web graphics with all this crazy map data, I don't know. The Python world has always been more multipolar. There are a lot more Mongols across a bigger plain; there are many different flavors of different things all over the place. Matplotlib was written by a guy in grad school trying to plot EEG data. He later moved on to a hedge fund, but he was trying to copy what he knew, which was Matlab. It does great for that, actually. If you're an engineering Matlab user, Matplotlib works great; it just fits your brain. But most ML people were not Matlab people.
Then, likewise, if you use tools like Seaborn, they get you some of the way there, but they don't have support from the language level to encapsulate some of the statistical transformations that would inform something even better, so Seaborn has to include some of those transformations within it: facet names, things like that. Then you look around at some of the interactive plotting systems, whether it's Altair, whether it's Bokeh, Plotly, or any of these other things, and they're each solving for different parts of the problem. Covering as much of the Python use-case space is just a bigger job than any one project has been able to do. I think there's a more compact set of use cases in R, and therefore it was possible for a single project to achieve a higher level of coverage. Does that make sense?
Lukas:
Totally. That's really well said. Very non-judgmental.
Peter:
We're all about the big tent. Python, it's all about the big tent.
Lukas:
Big tent. Totally. And you do packages. That's my favorite.
Peter:
Not my favorite.
Lukas:
Okay. We always end with two questions that I want to make sure I get in, because I'm curious to hear your thoughts. One question we always ask, and maybe I should ask it in a more expansive way, is: is there a topic in ML that doesn't get as much attention as you think it should, that people should focus on more than they do? I might expand that for you into all of scientific computing. What's one thing that you think people don't pay as much attention to as its usefulness would suggest?
Peter:
Well, there are lots of topics. My general answer stems from this place where I feel very strongly that ML practitioners, more so than just software coder nerds, are going to run into the ethical implications of their work. Even more uncomfortably, they're going to be the ones forcing that conversation in businesses that for a long time maybe have not had to think about it, because ML is about engineering the crown jewels of the business model. You're like, "Hey, we just figured out that if we buy these two data sets, do this kind of model, and reject these kinds of people from our user base, we get this kind of lift. Should we do it?" And it's like, "Heck, I'm just a VP of God knows what. I didn't ask to be presented with this incredibly difficult trolley problem. Don't look at me. I slept through that crap in college." I think with ML, more than any other thing right now, we have to be faced with this concept that technology is not value-neutral. If you think about what machine learning really is, it is the application of massive amounts of compute, rent-a-supercomputer-in-the-cloud kinds of massive amounts of compute, to massive amounts of data that's even deeper and creepier than ever before, because there are sensors everywhere, to achieve business ends and to optimize business outcomes. And we know just how good businesses are at capturing and self-regulating the externalities of their business outcomes.
Just as a human looking at this, I would say, "Wow, I've got a chance to actually speak to this practitioner crowd about this: if you're doing your job well, you'll be forced to drive a lot of conversations about ethics and the practice of what you're doing within your business as it goes through this data transformation." You should be ready for that. Steel yourself for that. Don't punt. Don't punt on it. We can't afford the punt.
Lukas:
Besides steeling yourself for that, which is probably a good verb for it, do you have any suggestions on how someone might educate themselves? I think a lot of people listening to this are in that situation and might be wondering where they could find more resources. Do you have any suggestions?
Peter:
Yes. There are books that have been written now, especially in this era of the Facebook, attention-economy, information-dystopia stuff: books by Shoshana Zuboff, Cathy O'Neil's Weapons of Math Destruction, and even, I think, Christian Rudder's Dataclysm, some of these other things. You can arm yourself with knowledge about the anti-patterns of what happens when blindly applied ML goes wrong. That at least gives you a bit of a war chest, or a quiver of things you can reach for, to say, "What we're doing here is exactly what happened when", to pull one out of the hat, "AOL anonymized their user data", or was it AOL or AT&T, back in the early 2000s. They did this anonymized data release, this thing happened, and somebody got outed. There are all sorts of wonderful examples you can pull from, because we've actually been making a lot of mistakes. So one thing is: take your time. Take the time to read about that stuff. Number two is: go attend talks about this "soft topic" of ethics and fairness in ML. I know some of it may seem a bit ceremonial and preachy. It's like, "Hey, I came here for the hardcore convnets. I didn't come here to listen to somebody drone on about ethics." But at every conference you go to, spend some time getting educated about the state-of-the-art thinking, because right now people are working on things like privacy-preserving encryption and differential privacy. Those things are coming. They're going to be part of the state-of-the-art best practice soon, and you should be educated about them. Don't just do it because you have to; know why, because I guarantee you, when you go and scope those into your project, some VP is going to come and say, "Well, can't you just get rid of that and do it faster?"
You have to be able to argue the principles of why you need to do it this way. That would be the one thing I'd say. I don't know if it already gets too much press, but it probably doesn't get enough: if ML practitioners don't want to be just serfs in this, if they want to have agency in that conversation, hold their own ground on what we should and shouldn't do, and not have a pile of regret down the road, then now is the time to start getting educated and to start asserting themselves in those internal corporate political discussions.
Lukas:
Awesome. Well said. Let's take a deep breath. Final question.
Peter:
All right. Final question.
Lukas:
When you look at companies trying to get stuff into production, what are the surprising bottlenecks that they run into? When somebody's trying to take an ML project from an idea to deploy and it's working and doing something useful, where do you see people get stuck?
Peter:
Oh, well, every part of the process can be troublesome. I don't know if there's a surprise there at all. One thing that is surprising to me is how many corporate IT shops are still pretty backwards relative to open source. This was surprising to me in 2014, and it's still surprising now. How many places will say, "Well, we don't really do open source," or, "Here's the open source that we do; it's just these few things." Then they trot out all the tired FUD arguments about how could we trust this thing, how can we trust that. The other thing is that there is still a very strong Python allergy and a lot of lack of awareness of what Python actually is and can do. Some companies are like, "Well, this is a Java shop," or, "This is a .NET shop. We really only know how to deploy these ways. We don't deploy Python. You have to recode that, because it's just a language. You can recode it in this other thing. Why wouldn't you be able to?" These IT shops don't understand that when you use Python, you're linking into seriously optimized low-level code that a lot of seriously smart people have been working on. There's no equivalent over in the Java space, and all the data marshalling back and forth is going to cost you a tremendous amount of performance there. These IT shops have not yet understood that. Sadly, a lot of ML engineers are relatively new. They don't know how to articulate that argument; they don't know how to sit there and talk about JVM internals and all those other bits, because that's not their gig. I think that's been the depressing part. It's surprising that this is still an issue, because we do have companies that deploy Python in front-line production for these ML things, and they're fine. Even with those as proof points, there's still this kind of industry inertia. Yeah.
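[Editor's note: Peter's point that Python code "links into seriously optimized low-level code" can be sketched concretely. The benchmark below is an illustrative example, not from the conversation; it compares a dot product written as an interpreted Python loop against NumPy's `np.dot`, which dispatches to compiled BLAS routines. Naively recoding the loop in another language misses that the heavy lifting was never in Python to begin with.]

```python
import time
import numpy as np

def dot_pure_python(xs, ys):
    """Naive dot product: one interpreted bytecode iteration per element."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += x * y
    return total

n = 1_000_000
a = np.random.rand(n)
b = np.random.rand(n)

t0 = time.perf_counter()
slow = dot_pure_python(a.tolist(), b.tolist())
t1 = time.perf_counter()

# np.dot hands the arrays to an optimized, compiled linear-algebra routine;
# no per-element work happens in the interpreter at all
fast = np.dot(a, b)
t2 = time.perf_counter()

print(f"pure Python loop: {t1 - t0:.4f}s")
print(f"np.dot:           {t2 - t1:.4f}s")
```

On typical hardware the NumPy call is orders of magnitude faster, while both produce the same result up to floating-point summation order. That performance gap, not the surface syntax, is what an IT shop gives up by insisting on a line-for-line rewrite.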
Lukas:
What would you even use for a non-open source machine learning framework? This shows you how much of the Silicon Valley Kool-Aid I've drunk, maybe.
Peter:
No. To be honest, I think what ends up happening is they'll buy some vendor thing, which still just embeds the same open source machine learning libraries. No, I kid you not, that's literally what they will do sometimes. If you get into corporate IT enough, it gets pretty depressing. The incentives are all messed up there, unfortunately, which is one of the reasons why Silicon Valley does run circles around some of these other companies.
Lukas:
We should have ended on your ethics answer. This is just surprising. I guess both are worrying in different ways.
Peter:
Hey, it's working out for us. It's working out for us, that's for sure.
Lukas:
Nice. That's a good way of putting it. Thanks so much. That was really fun.
Peter:
Yeah. No. Thank you.