Kedro + MLFlow + WANDB?
Starting from a codebase that previously contained only a working end-to-end flow for data churning and experimentation, built on the Kedro framework and augmented with MLflow, we attempt to add WANDB as another layer for data lineage, visual storytelling, and modern experiment tracking.
Created on October 29 | Last edited on October 29
Background of Project
The Problem ✋
- I like Spotify and I like discovering music
- Music generation is heavily popularity-based nowadays
- Discovery/interaction with music happens via playlists mostly nowadays
- Music curation tools for playlists are scarce
- Spotify has a lot of tools for interacting with their data/features
Million Playlist Dataset 🎵
- Spotify hosted the 2018 RecSys Challenge, built around 1 million playlists made by users
- They publicly released the dataset
- No associated track metadata is included
The Approach 🧙‍♂️
- Previously built (and never finished) many Spotify based apps and models
- A lot of the process is essentially the same ->
- Build infrastructure/tooling to easily build and deploy Spotify-based applications and models
- Good data practices and reproducibility as a natural byproduct
- Infrastructure-as-code as much as possible
- Make it straightforward to extend upon with flexible tools
Tools Used 🔧
- Kedro
Kedro is an open-source Python framework for creating reproducible, maintainable and modular data science code. It borrows concepts from software engineering best-practice and applies them to machine-learning code; applied concepts include modularity, separation of concerns and versioning.
- MLflow
MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. MLflow currently offers four components:
MLflow Tracking: record and query experiments (code, data, config, and results)
MLflow Projects: package data science code in a format that reproduces runs on any platform
MLflow Models: deploy machine learning models in diverse serving environments
MLflow Model Registry: store, annotate, discover, and manage models in a central repository
How/Why we incorporated WANDB at this stage
Weights & Biases is the machine learning platform for developers to build better models faster. Use W&B's lightweight, interoperable tools to quickly track experiments, version and iterate on datasets, evaluate model performance, reproduce models, visualize results and spot regressions, and share findings with colleagues.
- Artifacting:
- MLflow
- supports artifacts, but the experience is extremely subpar for anything other than models, especially with respect to the UI
- Kedro
- supports versioning, but with very specific caveats: it usually relies on the dataset's save properties and a standard timestamp-based naming convention
- WANDB
- supports artifacts in a similar way to MLflow, but with more features, such as the wandb.Table artifact, which lets us take advantage of data lineage and EDA together
- EDA:
- MLflow
- Trash UI
- You log EDA yourself; it's not really baked into the tool
- Kedro
- Not supported outside of exploring the codebase
- Our EDA pattern consisted of scheduling dataset creation via Kedro and then using a tool like Streamlit to actually visualize
- WANDB
- 10/10 would write another report
- The different artifact types, alongside general usage of Tables and Reports, make analysis a natural part of the process, packaged together with data lineage
- Collaboration:
- The end goal of this codebase is to make it easy for analysts and engineers to spin up, out of the box, the tools needed to understand key metrics for their Spotify applications
- WANDB
- Collaboration at its core: artifacts combined with Reports make it very easy to expose components between different analysts and to share high-level findings with other teams
Incorporating WANDB
Data Processing w/ WANDB Artifacts

The Kedro pipeline we use to isolate the track IDs we want to scrape. In this step we split the playlist metadata from the track -> playlist relational ID table.
- See if PNG support can be easily added - make simple PR if possible
Above, we can use this table as the basis for joining tracks and playlists together. The position within the playlist will allow us to analyze playlists as sequences.
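To make the join concrete, here is a minimal pandas sketch. The column names (pid for playlist ID, track_id, and pos for position within the playlist) are assumptions and are not confirmed by the pipeline above:

```python
import pandas as pd

# Hypothetical frames mirroring the split described above
playlist_meta = pd.DataFrame({
    "pid": [0, 1],
    "name": ["road trip", "focus"],
})
track_playlist = pd.DataFrame({
    "pid": [0, 0, 1],
    "track_id": ["t1", "t2", "t3"],
    "pos": [0, 1, 0],  # position within the playlist, enabling sequence analysis
})

# Join the relational ID table back onto the playlist metadata
joined = track_playlist.merge(playlist_meta, on="pid", how="left")

# Sorting by (pid, pos) recovers each playlist as an ordered track sequence
sequences = joined.sort_values(["pid", "pos"]).groupby("pid")["track_id"].agg(list)
```

Sorting on the pos column is what preserves the playlist-as-sequence structure after the join.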
💡
- Figure out why the playlist metadata won't load


freEDA
- TODO: Add visualization showing relationship of tables and resultant joined tables
Takeaways
M1 Issues
- Setting up Python and installing certain packages is whack on M1 in its current form
- Steps Taken:
```shell
pyenv install <python_version>

# For M1 chip
CFLAGS="-I$(brew --prefix xz)/include" LDFLAGS="-L$(brew --prefix xz)/lib" pyenv install <python_version>
```
```shell
brew install openblas
export OPENBLAS=$(brew --prefix openblas)
export CFLAGS="-falign-functions=8 ${CFLAGS}"
pip install numpy Cython pybind11 pythran
pip install --no-use-pep517 scipy
```
Kedro + WANDB Issues
- kedro-viz and wandb cannot work together due to conflicting requirements on the graphql-core package version
- Steps Taken:
```shell
pip install wandb
# ImportError: cannot import name 'introspection_query' from 'graphql'
#   (/Users/anishshah/.pyenv/versions/3.8.12/envs/venv-spot-wandb-demo/lib/python3.8/site-packages/graphql/__init__.py)
pip install graphql-core==2.3.2
# strawberry-graphql 0.79.0 requires graphql-core<3.2.0,>=3.1.0, but you have graphql-core 2.3.2 which is incompatible.
pip uninstall kedro-viz
```
- Would love to work on a kedro-wandb package, similar to the existing kedro-mlflow and kedro-neptune packages
Tables Experience
- Not a lot of documentation explicitly mentions how to use pandas with Table
- Wrote this before realizing there was a dataframe parameter available in Tables:
```python
import numpy as np
import wandb

run = wandb.init(project="test_project")
artifact = wandb.Artifact("artifact", type="dataset")

# The wandb Table doesn't accept the pandas NaT type, so we coerce dtypes on each
# column (which also coerces null-style objects to NaT) before replacing all the
# coerced NaT values with None, which wandb accepts
def prep_pandas_for_wandb(df):
    return df.convert_dtypes().fillna(np.nan).replace([np.nan], [None])

def pandas_to_wandb(df):
    return wandb.Table(data=prep_pandas_for_wandb(df).values, columns=df.columns.tolist())

table = pandas_to_wandb(df)
artifact.add(table, "table")
run.log_artifact(artifact)
```
- Haven't tested whether I would still need the NaT typecasting to None that WANDB requires
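For reference, a minimal sketch of the dataframe-parameter route. The null handling is simplified here to a plain replace, which is an assumption that it covers the cases the helper above handled; the wandb call is left as a comment:

```python
import numpy as np
import pandas as pd

# Toy frame with a missing value in the "pos" column
df = pd.DataFrame({
    "track_id": ["t1", "t2"],
    "pos": [0, None],
})

# Normalize NaN-style nulls to plain None (a simplified stand-in for the
# convert_dtypes/fillna chain above; untested against every NaT edge case)
prepped = df.replace({np.nan: None})

# With the dataframe parameter, constructing the Table is then a one-liner:
# table = wandb.Table(dataframe=prepped)
```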
- Worst enemy:
- 2021-10-28 23:36:48,998 - root - WARNING - Truncating wandb.Table object to 200000 rows.
- When using a joined table, how does truncation interact with the join?
- Assumption: truncate t1, truncate t2, then join the truncated t1 and t2
- No easy way to get the shape/size of a wandb Table
```python
len(table)
# TypeError: object of type 'Table' has no len()

# Workaround
len(table.get_index())  # Can be slow, though
```
- Struggles with large datasets
- Probably should've used the method described in https://docs.wandb.ai/guides/artifacts/artifact-creation-modes#how-do-i-log-a-table-in-collaborative-mode to create a partitioned dataset in parallel
- Would partitioned datasets still have the visualization benefits, and how does visualization work with upserted table artifacts?
- Would love to have/make an example showing how to trivially split a large pandas dataframe and store it as a partitioned table
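A minimal sketch of the splitting half of that example. The chunk size, column names, and commented-out logging calls are all assumptions; the actual collaborative/upsert flow is the one described in the linked docs:

```python
import pandas as pd

CHUNK_SIZE = 200_000  # stay under the wandb.Table truncation limit seen above

def split_dataframe(df, chunk_size=CHUNK_SIZE):
    """Return successive row-chunks of df, each small enough for one Table part."""
    return [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]

# Toy-sized demonstration
df = pd.DataFrame({"track_id": range(10)})
parts = split_dataframe(df, chunk_size=4)

# Each part could then be logged as its own Table within one artifact, e.g.
# (hypothetical names; see the collaborative-mode docs linked above):
# for i, part in enumerate(parts):
#     artifact.add(wandb.Table(dataframe=part), f"parts/part-{i}")
```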
- Cannot join joined tables
- At what point should we expect to join a table ourselves vs via WANDB?