Kedro + MLFlow + WANDB?
Starting from a codebase that previously contained only a working end-to-end flow for data churning and experimentation, built on the Kedro framework and augmented with MLflow, we attempt to add WANDB as another layer for data lineage, visual storytelling, and modern experiment tracking.
Created on October 29 | Last edited on October 29
Background of Project
The Problem ✋
- I like Spotify and I like discovering music
- Music generation is heavily popularity-based nowadays
- Discovery/interaction with music happens via playlists mostly nowadays
- Music curation tools for playlists are scarce
- Spotify has a lot of tools for interacting with their data/features
Million Playlist Dataset 🎵
- Spotify hosted the 2018 RecSys Challenge, built around 1 million playlists made by users
- They publicly released the dataset
- No associated track metadata is included
The Approach 🧙‍♂️
- Previously built (and never finished) many Spotify based apps and models
- A lot of the process is essentially the same ->
- Build infrastructure/tooling to easily build and deploy Spotify-based applications and models
- Good data practices and reproducibility as a natural byproduct
- Infrastructure-as-code as much as possible
- Make it straightforward to extend upon with flexible tools
Tools Used 🔧
- Kedro
Kedro is an open-source Python framework for creating reproducible, maintainable and modular data science code. It borrows concepts from software engineering best-practice and applies them to machine-learning code; applied concepts include modularity, separation of concerns and versioning.
- MLflow
MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. MLflow currently offers four components:
MLflow Tracking: record and query experiments (code, data, config, and results)
MLflow Projects: package data science code in a format that reproduces runs on any platform
MLflow Models: deploy machine learning models in diverse serving environments
MLflow Model Registry: store, annotate, discover, and manage models in a central repository
How/Why we incorporated WANDB at this stage
Weights & Biases is the machine learning platform for developers to build better models faster. Use W&B's lightweight, interoperable tools to quickly track experiments, version and iterate on datasets, evaluate model performance, reproduce models, visualize results and spot regressions, and share findings with colleagues.
- Artifacting:
- MLflow
- supports artifacts, but the experience is extremely subpar for anything other than models, especially with respect to the UI
- Kedro
- supports versioning, but with very specific caveats: it usually relies on the dataset's save properties and a standard timestamp-based naming convention
- WANDB
- supports artifacts in a similar way to MLflow, but with more features, such as the wandb.Table artifact, which lets us take advantage of data lineage and EDA together
- EDA:
- MLflow
- Trash UI
- You log EDA yourself; it's not really baked into the tool
- Kedro
- Not supported outside of exploring the codebase
- Our EDA pattern consisted of scheduling dataset creation via Kedro and then using a tool like Streamlit to actually visualize
- WANDB
- 10/10 would write another report
- The different artifact types, alongside general usage of Tables and Reports, make analysis a natural part of the process, packaged together with data lineage
- Collaboration:
- The end goal of this codebase is to make it easy for analysts and engineers to spin up, out of the box, the tools needed to understand key metrics for their Spotify applications
- WANDB
- Collaboration at its core: artifacts combined with Reports make it very easy to expose components between different analysts and to share high-level findings with other teams
Incorporating WANDB
Data Processing w/ WANDB Artifacts

The Kedro pipeline we use to isolate the track IDs we want to scrape. In this step we split the playlist metadata from the track -> playlist relational ID table.
- See if PNG support can be easily added - make simple PR if possible
Above, we can use this table as the basis for joining tracks and playlists together. The position within the playlist will allow us to analyze playlists as sequences.
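To make the join concrete, here is a minimal pandas sketch. The column names (pid for playlist ID, track_id, and pos for position within the playlist) are assumptions and are not confirmed by the pipeline above:

```python
import pandas as pd

# Hypothetical frames mirroring the split described above
playlist_meta = pd.DataFrame({
    "pid": [0, 1],
    "name": ["road trip", "focus"],
})
track_playlist = pd.DataFrame({
    "pid": [0, 0, 1],
    "track_id": ["t1", "t2", "t3"],
    "pos": [0, 1, 0],  # position within the playlist, enabling sequence analysis
})

# Join the relational ID table back onto the playlist metadata
joined = track_playlist.merge(playlist_meta, on="pid", how="left")

# Sorting by (pid, pos) recovers each playlist as an ordered track sequence
sequences = joined.sort_values(["pid", "pos"]).groupby("pid")["track_id"].agg(list)
```

Sorting on the pos column is what preserves the playlist-as-sequence structure after the join.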
💡
- Figure out why the playlist metadata won't load


freEDA
- TODO: Add visualization showing relationship of tables and resultant joined tables
Takeaways
M1 Issues
- Setting up Python and installing certain packages is whack on M1 in its current form
- Steps Taken:
```shell
pyenv install <python_version>

# For M1 chip
CFLAGS="-I$(brew --prefix xz)/include" LDFLAGS="-L$(brew --prefix xz)/lib" pyenv install <python_version>
```
```shell
brew install openblas
export OPENBLAS=$(brew --prefix openblas)
export CFLAGS="-falign-functions=8 ${CFLAGS}"
pip install numpy Cython pybind11 pythran
pip install --no-use-pep517 scipy
```
Kedro + WANDB Issues
- kedro-viz and wandb cannot work together due to conflicting requirements on the graphql-core package version
- Steps Taken:
```shell
pip install wandb
# ImportError: cannot import name 'introspection_query' from 'graphql'
#   (/Users/anishshah/.pyenv/versions/3.8.12/envs/venv-spot-wandb-demo/lib/python3.8/site-packages/graphql/__init__.py)
pip install graphql-core==2.3.2
# strawberry-graphql 0.79.0 requires graphql-core<3.2.0,>=3.1.0, but you have graphql-core 2.3.2 which is incompatible.
pip uninstall kedro-viz
```
- Would love to work on a kedro-wandb package, similar to the existing kedro-mlflow and kedro-neptune packages
Tables Experience
- Not a lot of documentation explicitly mentions how to use pandas with Table
- Wrote this before realizing there was a dataframe parameter available in Tables:
```python
import numpy as np
import wandb

run = wandb.init(project="test_project")
artifact = wandb.Artifact("artifact", type="dataset")

# The wandb Table doesn't accept the pandas NaT type, so we coerce dtypes on each
# column (which also coerces null-style objects to NaT) before replacing all the
# coerced NaT values with None, which wandb accepts
def prep_pandas_for_wandb(df):
    return df.convert_dtypes().fillna(np.nan).replace([np.nan], [None])

def pandas_to_wandb(df):
    return wandb.Table(data=prep_pandas_for_wandb(df).values, columns=df.columns.tolist())

table = pandas_to_wandb(df)
artifact.add(table, "table")
run.log_artifact(artifact)
```
- Haven't tested whether I would still need the NaT typecasting to None that WANDB requires
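For reference, a minimal sketch of the dataframe-parameter route. The null handling is simplified here to a plain replace, which is an assumption that it covers the cases the helper above handled; the wandb call is left as a comment:

```python
import numpy as np
import pandas as pd

# Toy frame with a missing value in the "pos" column
df = pd.DataFrame({
    "track_id": ["t1", "t2"],
    "pos": [0, None],
})

# Normalize NaN-style nulls to plain None (a simplified stand-in for the
# convert_dtypes/fillna chain above; untested against every NaT edge case)
prepped = df.replace({np.nan: None})

# With the dataframe parameter, constructing the Table is then a one-liner:
# table = wandb.Table(dataframe=prepped)
```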
- Worst enemy:
- 2021-10-28 23:36:48,998 - root - WARNING - Truncating wandb.Table object to 200000 rows.
- When using a joined table, how does truncation interact with the join?
- Assumption: truncate t1, truncate t2, then join the truncated t1 and t2
- No easy way to get the shape/size of a wandb Table
```python
len(table)
# TypeError: object of type 'Table' has no len()

# Workaround
len(table.get_index())  # Can be slow, though
```
- Struggles with large datasets
- Probably should've used the method described in https://docs.wandb.ai/guides/artifacts/artifact-creation-modes#how-do-i-log-a-table-in-collaborative-mode to create a partitioned dataset in parallel
- Would partitioned datasets still have the visualization benefits, and how does visualization work with upserted table artifacts?
- Would love to have/make an example showing how to trivially split a large pandas dataframe and store it as a partitioned table
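A minimal sketch of the splitting half of that example. The chunk size, column names, and commented-out logging calls are all assumptions; the actual collaborative/upsert flow is the one described in the linked docs:

```python
import pandas as pd

CHUNK_SIZE = 200_000  # stay under the wandb.Table truncation limit seen above

def split_dataframe(df, chunk_size=CHUNK_SIZE):
    """Return successive row-chunks of df, each small enough for one Table part."""
    return [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]

# Toy-sized demonstration
df = pd.DataFrame({"track_id": range(10)})
parts = split_dataframe(df, chunk_size=4)

# Each part could then be logged as its own Table within one artifact, e.g.
# (hypothetical names; see the collaborative-mode docs linked above):
# for i, part in enumerate(parts):
#     artifact.add(wandb.Table(dataframe=part), f"parts/part-{i}")
```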
- Cannot join joined tables
- At what point should we expect to join a table ourselves vs via WANDB?