As part of my work at Weights & Biases, I often want to experiment with new ideas in different frameworks. These frameworks depend on other frameworks, which can lead to conflicts and environment errors. Sadly, I usually get stuck in a conda install X or a pip install Y loop until everything breaks. Then I have to restart from scratch by creating a "fresh" environment, repeating this process multiple times until I finally succeed or give up on the problem I am trying to solve (usually the former).
As an example, I was recently learning about Arcface and wanted to try something out in the Shopee Kaggle competition, so I decided to use the Rapids cuDF package for similarity detection between various images. But I wasn't able to find an easy way to install the cuDF package. This is pretty normal and has happened to me in the past with other packages too.
Below, I have copied the official installation command (at the time of this writing) from the Rapids website -
conda create -n rapids-0.19 -c rapidsai -c nvidia -c conda-forge \
    blazingsql=0.19 cudf=0.19 python=3.7 cudatoolkit=11.0
This command creates a new conda environment called "rapids-0.19" and installs the cuDF package and all its dependencies in this environment. This is great because I can now use the "cuDF" package, but all my other packages that are not part of this environment have to be re-installed. Also, the packages that do end up in "rapids-0.19" may be at different versions than the ones I had before.
Let me show you what I mean:
# create and activate new conda environment
(base): conda create -n myenv
(base): conda activate myenv

# install `timm` package which installs dependencies like
# pytorch-1.7.0, cudatoolkit-11.1.0 etc
(myenv): conda install timm -c fastchan
(myenv): python -c 'import torch; print(torch.__version__)'
>> 1.7.0
(myenv): python -c 'import timm; print(timm.__version__)'
>> 0.4.5

# now I want to work with cuDF - but this creates a new environment
(myenv): conda create -n rapids-0.19 -c rapidsai -c nvidia -c conda-forge \
    blazingsql=0.19 cudf=0.19 python=3.7 cudatoolkit=11.0
(myenv): conda activate rapids-0.19

# let's check timm and pytorch's version
(rapids-0.19): python -c 'import torch; print(torch.__version__)'
>> ModuleNotFoundError: No module named 'torch'
(rapids-0.19): python -c 'import timm; print(timm.__version__)'
>> ModuleNotFoundError: No module named 'timm'
As shown in the example above, I had to create a completely new environment for the "cuDF" package to work, and now all my other packages that are not part of this environment have to be re-installed. This is neither fun nor time efficient. I have run into similar problems when trying to update packages as well.
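Before reinstalling everything, it can help to see which packages the active environment is actually missing. The following is a minimal sketch that uses only the Python standard library; the helper name check_modules and the list of module names are illustrative assumptions, not part of any tool mentioned in this post.

```python
import importlib.util


def check_modules(names):
    """Map each top-level module name to True if it can be imported here."""
    status = {}
    for name in names:
        # find_spec returns None when a top-level module is not installed
        # in the currently active environment.
        status[name] = importlib.util.find_spec(name) is not None
    return status


if __name__ == "__main__":
    # Illustrative list: the packages from the session above.
    for name, available in check_modules(["torch", "timm", "cudf"]).items():
        print(name, "available" if available else "MISSING")
```

Running this inside each environment gives a quick inventory without triggering a ModuleNotFoundError for every missing import.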
Another recommended way to install the Rapids cuDF package, as mentioned on the Rapids website, is to use Docker. Unfortunately, using Docker adds another level of complexity and suffers from the same problems. As a user, I still have to spend a lot of time getting the environment ready with the required package versions, and I usually find this hard and time-consuming. We look at this in more detail in the Background on Docker section below.
👉: As a Machine Learning Engineer, all I really want is for my environment to work and for my packages to stay "up-to-date". That shouldn't be so hard? Right?
The Rapids cuDF case is just one recent example that I remember.
Let's try and understand why having "up-to-date working environments" is actually a harder problem to solve than it seems.

Table of Contents

This blog post has been structured in the following way:
  1. Background on Conda
    1. How is conda different from pip?
    2. What is a conda channel?
    3. What is a linux distribution?
    4. What is the Anaconda "defaults" channel?
    5. Why do we need to define multiple channels to install some packages?
  2. Background on Docker
    1. Why is it hard to manage environments inside docker?
  3. What is conda-forge?
    1. How is conda-forge different from Anaconda's "defaults" distribution?
    2. How is conda-forge similar to Pip?
  4. Summary so far..
  5. What is fastchan?
    1. How does fastchan allow me to have an up-to-date environment?
    2. How does fastchan allow me to install packages from a single channel?
    3. How is fastchan different from conda-forge?

Background on Conda

According to the official docs, conda is both a package manager and an environment manager. This means it can be used to download and update packages, similar to Pip (package manager), and also to create and manage "virtual environments", similar to venv (environment manager). But conda is also very different from Pip and venv.

How is conda different from pip?

There is a really cool blog post, "Understanding Conda and Pip", on the Anaconda website that explains everything in a little more detail. But essentially, conda and pip are both package managers. One of the key differences between the two is this: Pip installs Python packages, whereas conda installs packages which may contain software written in any language.
For example, there are conda packages for cudatoolkit, rust, GCC and more that can't be installed using Pip. This is because Pip installs Python software packaged as wheels or source distributions. The latter may require that the system have compatible compilers, and possibly libraries, installed before pip can succeed. Conda packages, by contrast, are binaries, so there is never a need to have compilers available to install them.

What is a conda channel?

Conda channels are the locations where conda packages are stored.
What is a conda package? It is a compressed file that contains executable programs, metadata, python or other modules. You can learn more about them here.
Unlike Pip, where all packages are on the Python Package Index (PyPI) and therefore package names cannot be the same, conda has multiple channels that are maintained by various organizations or individuals, and each channel can serve as the base for hosting and managing packages. Thus it is possible to have a package with the same name in two different channels.
Since conda has multiple channels, various versions of packages are spread out across different channels. For example, here are the search results when I search for "pytorch" on https://anaconda.org/.
As can be seen, there are multiple versions of the package pytorch on various channels such as "pytorch", "fastchan", "zeus1942", "soumith", "conda-forge" and more!
👉: Please don't confuse packages with channels. Here, the "pytorch" package is hosted inside the "pytorch" channel.
Figure-1: Multiple versions of package PyTorch across various channels
This means that, as a user, I always want to get the pytorch package from a trusted source. In this case, the official pytorch channel is a great place to get the package from.

What is a Linux distribution?

Here is a wonderful blog post that introduces Linux distributions. Linux distributions do the hard work for you, taking all the code from the open-source projects and compiling it for you, combining it into a single operating system you can boot up and install. This means that all packages that form a part of a Linux distribution are well-tested and compatible with one another.
Therefore, we can think of a distribution as a "concept" - where all packages that form a part of a particular distribution are well-tested and compatible with each other.

What is the Anaconda "defaults" channel?

By default, the conda package manager installs packages from the defaults channel. All packages in the defaults channel are thoroughly tested and maintained by the Conda team from Anaconda, Inc. When you use a conda command that involves looking for a package to install or upgrade, by default conda searches the default repository located at https://repo.anaconda.com/pkgs.
Since all packages in the defaults channel are well tested and compatible with one another, one could refer to the defaults channel as a "distribution".

Why do we need to define multiple channels to install some packages?

Below, I copy the command to install pytorch using conda from pytorch's website:
conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia
Basically, -c stands for channel. Specifying the channels pytorch and nvidia tells conda to search for the packages in these two channels, with the pytorch channel having a higher priority (priority is from left to right). But why do we have to specify two channel names instead of just -c pytorch or just -c nvidia? It's because the cudatoolkit=11.1 package is hosted on the nvidia channel and the pytorch package is hosted on the pytorch channel. So, this lets conda know to install the pytorch package from the pytorch channel and cudatoolkit=11.1 from the nvidia channel.
This is also the reason why the command for installing cuDF looks like:
conda create -n rapids-0.19 -c rapidsai -c nvidia -c conda-forge \
    blazingsql=0.19 cudf=0.19 python=3.7 cudatoolkit=11.0
It lets conda know to look for the blazingsql and cudf packages in the rapidsai channel, cudatoolkit in the nvidia channel and python=3.7 in the conda-forge channel. This command would fail if we didn't specify multiple channels, as no single channel hosts all the dependencies required to install cuDF.
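The left-to-right lookup described above can be sketched as a toy model. This is not how conda's solver is implemented: real conda also weighs versions, builds and its channel_priority setting. The function name resolve_channels and the HOSTED table are hypothetical, with the table filled in only from the hosting layout described in this section.

```python
def resolve_channels(requested, channels, hosted):
    """Assign each requested package to the first listed channel hosting it.

    A deliberately simplified model of channel lookup, for illustration only.
    """
    plan = {}
    for package in requested:
        for channel in channels:  # earlier channels take priority
            if package in hosted.get(channel, set()):
                plan[package] = channel
                break
        else:
            # No listed channel hosts this package.
            raise LookupError(
                "{} not found in channels: {}".format(package, ", ".join(channels))
            )
    return plan


# Hosting layout as described in the text above (treated as complete
# only for the purpose of this example).
HOSTED = {
    "rapidsai": {"blazingsql", "cudf"},
    "nvidia": {"cudatoolkit"},
    "conda-forge": {"python"},
}
REQUESTED = ["blazingsql", "cudf", "python", "cudatoolkit"]

if __name__ == "__main__":
    print(resolve_channels(REQUESTED, ["rapidsai", "nvidia", "conda-forge"], HOSTED))
    try:
        resolve_channels(REQUESTED, ["rapidsai"], HOSTED)
    except LookupError as err:
        print("single channel fails:", err)
```

With all three channels every package finds a home; with only rapidsai the lookup fails on python, which mirrors why the install command needs several -c flags.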

Background on Docker

Personally, I like to think of docker as a loosely isolated environment, separate from the host environment that can have its own list of dependencies and packages. Docker is also capable of much more and more information can be found at the official docs.
But generally, this is a more complex way of sorting out environments compared to conda, especially if someone hasn't used docker before.

Why is it hard to manage environments inside docker?

Just the installation instruction for the "cuDF" package for docker looks something like this:
docker pull rapidsai/rapidsai:0.19-cuda11.0-runtime-ubuntu18.04-py3.7
docker run --gpus all --rm -it -p 8888:8888 -p 8787:8787 -p 8786:8786 \
    rapidsai/rapidsai:0.19-cuda11.0-runtime-ubuntu18.04-py3.7
The above commands first pull a docker image with cudatoolkit-11.0 and everything else required to get the "cuDF" package working, and then launch a container from that image with the listed ports forwarded to the host.
This is even more complex than using conda for managing environments because:
  1. I still lose all my existing packages and they have to be re-installed, as was the case when we had to create a new rapids-0.19 conda environment.
  2. I need to worry about port forwarding, especially if I am on a virtual machine.
  3. I need to know about docker or at least have a basic idea about how docker containers work.
Therefore, installing packages using conda is a much simpler and faster process.

What is conda-forge?

From the official conda-forge website, it is:
A community-led collection of recipes, build infrastructure and distributions for the conda package manager.
The project started at SciPy in 2015 and the idea was to create a community-led cross-platform channel for various packages on conda.
Since conda-forge is community-led, anybody can go in and contribute to conda-forge by uploading packages. This is both good and bad. It's good because of the sheer number of packages on conda-forge, over 14,000! It's bad because this usually means that packages are outdated compared to the versions on official vendor channels.
For example, the latest version of pytorch on the conda-forge channel is 1.8.0, whereas on the official pytorch channel it is 1.8.1. Or for cudatoolkit, while the latest available version from the nvidia channel is 11.2.72, the version that's available on conda-forge is 11.2.2.
Figure-2: Outdated versions of cudatoolkit on conda-forge and anaconda channels
This is also the case for torchvision, where the latest available package version from pytorch channel is 0.9.1, but the version that's on conda-forge is 0.9.0.
Figure-3: Outdated version of torchvision on conda-forge when compared to official vendor's version.
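The version gaps quoted above can be checked programmatically. Below is a minimal sketch that compares dotted version strings as tuples of integers, which avoids the pitfall of plain string comparison (where "1.10.0" would sort before "1.9.0"). The helper names are hypothetical, and the parser assumes purely numeric versions; real conda version strings can also contain letters and build tags, which this sketch does not handle.

```python
def parse_version(text):
    """Turn a dotted numeric version such as '11.2.72' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def outdated_packages(vendor, other):
    """Return names whose version in `other` is older than in `vendor`."""
    return sorted(
        name
        for name, version in vendor.items()
        if name in other and parse_version(other[name]) < parse_version(version)
    )


# Versions quoted in the text above (at the time of writing).
VENDOR = {"pytorch": "1.8.1", "cudatoolkit": "11.2.72", "torchvision": "0.9.1"}
CONDA_FORGE = {"pytorch": "1.8.0", "cudatoolkit": "11.2.2", "torchvision": "0.9.0"}

if __name__ == "__main__":
    # All three packages are behind the vendor channels in this data.
    print(outdated_packages(VENDOR, CONDA_FORGE))
```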

How is conda-forge different from Anaconda's "defaults" distribution?

The Anaconda "defaults" channel is an Anaconda distribution, where the Anaconda team ensures that the packages are well-tested and that the versions are compatible with one another.
This is not the case for the "conda-forge" channel since it's community-led. The "conda-forge" channel is not a distribution and, as a user, I am never sure that the versions of the various packages available on conda-forge are compatible with one another.

How is conda-forge similar to Pip?

Like Pip, conda-forge channel also serves as a standard place for over 14,000 packages! But this comes at the cost of outdated versions and less stringent testing.

Summary so far..

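So far we have discussed some problems when it comes to managing environments using conda and keeping them up-to-date. We found the following problems:
  1. Packages are spread across multiple channels, so installing some packages (such as cuDF) requires specifying several channels in one command.
  2. There is no guarantee that packages installed from different channels are compatible with one another.
  3. Channels such as conda-forge and defaults often carry older package versions than the official vendor channels.
  4. Creating a fresh conda environment or Docker container means re-installing all my existing packages.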

What is fastchan?

Introduced by Jeremy Howard, fastchan is a channel on Anaconda that:
✅: Hosts all supported packages such as "cuDF" and their dependencies in a single channel - fastchan.
✅: Always keeps the latest version of the 120 supported packages.
✅: Ensures that all 120 packages can be installed with one another and thus, work with each other without conflicts.
✅: Makes installation commands much easier. For example, for cuDF, one could just do conda install cudf -c fastchan!
Isn't this wonderful? Just by shifting to fastchan for downloading and managing my packages on conda, I have been able to save quite a bit of time when it comes to setting up my deep learning environments for experimentation!

How does fastchan allow me to have an up-to-date environment?

As mentioned, fastchan always keeps the most up-to-date version of the supported packages.
As an example, for the pytorch package, fastchan has the latest version, 1.8.1, matching the official pytorch channel (it was the only other channel to do so at the time of writing).
Figure-4: "fastchan" has an up-to-date version for pytorch
This is also the case for cudatoolkit package as shown below:
Figure-5: "fastchan" has an up-to-date version for cudatoolkit which matches the official vendor's version
❓: So how does this happen? How are "fastchan" packages as up-to-date as the official vendor channel packages?
fastchan was built using fastconda and has scripts that run every 6 hours to fetch the latest version of all 120 supported packages. Any time a new version of a package is released on the official vendor channel, fastconda picks it up and copies this latest version from the official vendor channel to the fastchan channel. Thus, I am always getting up-to-date versions when using fastchan to install my packages.
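The idea of one sync pass can be sketched as follows. This is an illustration of the logic described above, not fastconda's actual code: the function names, the dictionaries and the mirror's starting state are all hypothetical, and the 6-hour schedule is assumed to be handled by an external scheduler.

```python
def packages_to_copy(vendor_latest, mirror_latest):
    """List (name, version) pairs the mirror lacks or holds at another version.

    Hypothetical helper: both arguments map package name -> latest version.
    """
    return [
        (name, version)
        for name, version in sorted(vendor_latest.items())
        if mirror_latest.get(name) != version
    ]


def sync_once(vendor_latest, mirror_latest):
    """Run one sync pass and return the mirror's updated state."""
    updated = dict(mirror_latest)
    for name, version in packages_to_copy(vendor_latest, mirror_latest):
        updated[name] = version  # stands in for copying the package file
    return updated


# Vendor versions are the ones quoted in this post; the mirror's starting
# state is made up to show a pending update.
VENDOR_LATEST = {"pytorch": "1.8.1", "torchvision": "0.9.1"}
MIRROR_LATEST = {"pytorch": "1.8.0", "torchvision": "0.9.1"}

if __name__ == "__main__":
    print(packages_to_copy(VENDOR_LATEST, MIRROR_LATEST))
    print(sync_once(VENDOR_LATEST, MIRROR_LATEST))
```

Only pytorch differs between the two dictionaries, so a single pass copies just that package and leaves the mirror matching the vendor channel.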

How does fastchan allow me to install packages from a single channel?

fastchan always hosts a complete list of supported packages and their dependencies. This means it already has packages like cudatoolkit-11.2.2, pytorch, cuDF and more. Unlike with other channels such as rapidsai, I no longer need to download the cudf package from the rapidsai channel and cudatoolkit from the nvidia channel.
This simple addition really makes installation commands much easier. Instead of having a long and complex command for installing cudf that looked like:
conda create -n rapids-0.19 -c rapidsai -c nvidia -c conda-forge \
    blazingsql=0.19 cudf=0.19 python=3.7 cudatoolkit=11.0
I could now instead just do:
conda install cudf -c fastchan
Since all dependencies for cudf are already on fastchan, this command works! And I no longer need to create a separate environment rapids-0.19 to get things working. :)

How is fastchan different from conda-forge?

Before packages are uploaded to fastchan, all 120 packages on the channel are checked to make sure they can be installed together and are therefore compatible with one another.
This is a step in the right direction with a better-tested environment, and users can rest assured that they will have working environments when using the fastchan channel to install conda packages.
It's different from conda-forge, as it's not community-led, but rather maintained by the fastai team and built using fastconda. If there is an environment conflict or the installation of any one of the 120 packages fails, the fastai team gets notified and no package is uploaded to fastchan. Thus, the user always benefits from having a working and up-to-date environment when using fastchan.
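The "upload nothing unless everything installs together" rule can be sketched as a small gate. This is a hypothetical illustration, not the fastai team's code: publish_if_compatible, toy_check and PUBLISHED are invented names, and the compatibility rule inside toy_check is a toy stand-in for a real "install every package into one environment" test.

```python
def publish_if_compatible(candidate, published, can_install_together):
    """Return (channel_state, was_published).

    `can_install_together` is a caller-supplied callable standing in for
    a real joint-installation test of every package on the channel.
    """
    merged = dict(published)
    merged.update(candidate)
    if can_install_together(merged):
        return merged, True
    # On any failure, keep the channel exactly as it was.
    return dict(published), False


def toy_check(packages):
    # Toy rule for illustration only: the two patch numbers must match.
    return packages["pytorch"].split(".")[-1] == packages["torchvision"].split(".")[-1]


# Made-up starting state of the channel.
PUBLISHED = {"pytorch": "1.8.0", "torchvision": "0.9.0"}

if __name__ == "__main__":
    # Updating only pytorch breaks the toy rule, so nothing is published.
    print(publish_if_compatible({"pytorch": "1.8.1"}, PUBLISHED, toy_check))
    # Updating both together passes, so the channel moves forward.
    print(publish_if_compatible(
        {"pytorch": "1.8.1", "torchvision": "0.9.1"}, PUBLISHED, toy_check))
```

Passing the check in as a callable keeps the gate independent of how compatibility is actually tested.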

Conclusion

I hope that with this blog post I have been able to show how fastchan is a step in the right direction and a faster, more efficient way of setting up up-to-date conda environments.
I have found that while deep learning related packages also exist on the conda-forge and Anaconda defaults channels, they are mostly out of date, and until fastchan the best place to download them was the official vendors' channels.
But since there is no guarantee that packages installed from various channels work with one another, more time was spent on getting a working, up-to-date environment.