
Kedro + MLFlow + WANDB?

Starting from a codebase that previously contained only a working end-to-end flow for data churning and experimentation via the Kedro framework augmented by MLFlow, we attempt to add WANDB as another layer for data lineage, visual storytelling, and modern experiment tracking
Created on October 29|Last edited on October 29

Background of Project

The Problem ✋

  • I like Spotify and I like discovering music
  • Music generation is very popularity-based nowadays
  • Discovery of and interaction with music mostly happens via playlists nowadays
  • Music curation tools for playlists are scarce
  • Spotify has a lot of tools for interacting with their data/features

Million Playlist Dataset 🎵

  • Spotify hosted the 2018 RecSys Challenge, built around 1 million user-made playlists
  • Publicly released the dataset
  • No associated track metadata

The Approach 🧙‍♂️

  • Previously built (and never finished) many Spotify-based apps and models
  • A lot of the process is essentially the same, so:
  • Build infrastructure/tooling to easily build and deploy Spotify-based applications and models
  • Good data practices and reproducibility as a natural byproduct
  • Infrastructure-as-code as much as possible
  • Make it straightforward to extend with flexible tools

Tools Used 🔧

  • Kedro
    • Kedro is an open-source Python framework for creating reproducible, maintainable and modular data science code. It borrows concepts from software engineering best-practice and applies them to machine-learning code; applied concepts include modularity, separation of concerns and versioning.
  • MLFlow
    • MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. MLflow currently offers four components:
      • MLflow Tracking: record and query experiments (code, data, config, and results)
      • MLflow Projects: package data science code in a format to reproduce runs on any platform
      • MLflow Models: deploy machine learning models in diverse serving environments
      • MLflow Model Registry: store, annotate, discover, and manage models in a central repository

How/Why we incorporated WANDB at this stage

Weights & Biases is the machine learning platform for developers to build better models faster. Use W&B's lightweight, interoperable tools to quickly track experiments, version and iterate on datasets, evaluate model performance, reproduce models, visualize results and spot regressions, and share findings with colleagues.
  • Artifacting:
    • MLFlow
      • supports artifacting, however the experience is quite subpar for anything other than models, especially w.r.t. UI elements
    • Kedro
      • supports versioning with very specific caveats that usually rely on the save properties of the dataset and standard naming convention utilizing time
    • WANDB
      • supports artifacting in a similar way to MLflow, however with more features, such as the wandb.Table artifact, which lets us take advantage of data lineage and EDA together
  • EDA:
    • MLFlow
      • Trash UI
      • You can log EDA outputs, but it's not really baked into the tool
    • Kedro
      • Not supported outside exploration of codebase
      • EDA pattern consisted of scheduling dataset creations via kedro and then using a tool like Streamlit to actually visualize
    • WANDB
      • 10/10 would write another report
      • The different artifact types, alongside general usage of Tables and Reports, make analysis a natural part of the process that comes packaged in with data lineage
  • Collaboration:
    • The end goal of this codebase is to make it easy for analysts and engineers to spin up, out of the box, the tools needed to understand key KPIs for their Spotify applications
    • WANDB
      • Collaboration at its core. It is very easy to combine artifacting with Reports to expose components between different analysts and to share high-level findings with other teams

Incorporating WANDB

Data Processing w/ WANDB Artifacts

This is the Kedro pipeline we use to isolate the track IDs we want to scrape. In this step we split out the playlist metadata and the track -> playlist relational ID table.
  • See if PNG support can be easily added - make simple PR if possible
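As a sketch of what this split looks like in plain pandas (the JSON field names here are illustrative stand-ins for the MPD-style schema, not the exact pipeline code):

```python
import pandas as pd

# Hypothetical slice of the Million Playlist Dataset JSON structure
playlists = [
    {"pid": 1, "name": "road trip", "num_tracks": 2,
     "tracks": [{"pos": 0, "track_uri": "spotify:track:aaa"},
                {"pos": 1, "track_uri": "spotify:track:bbb"}]},
    {"pid": 2, "name": "focus", "num_tracks": 1,
     "tracks": [{"pos": 0, "track_uri": "spotify:track:aaa"}]},
]

# Playlist metadata table: one row per playlist, the nested tracks dropped
playlist_meta = pd.DataFrame(
    [{k: v for k, v in p.items() if k != "tracks"} for p in playlists]
)

# Track -> playlist relational ID table: one row per (playlist, position)
relations = pd.DataFrame(
    [{"pid": p["pid"], "pos": t["pos"],
      "track_spid": t["track_uri"].split(":")[-1]}
     for p in playlists for t in p["tracks"]]
)
```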

[Table preview: track -> playlist relational table with columns pos, track_spid, pid]
We can use the table above as the basis for joining tracks and playlists together. Position within the playlist will allow us to analyze playlists as sequences.
💡
  • Figure out reason playlist metadata won't load
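The kind of sequence view this enables can be sketched with pandas (column names follow the relational table above; the data itself is made up):

```python
import pandas as pd

# Illustrative track -> playlist relational table
relations = pd.DataFrame({
    "pid":        [1, 1, 1, 2, 2],
    "pos":        [2, 0, 1, 1, 0],
    "track_spid": ["c", "a", "b", "a", "c"],
})

# Sorting by position inside each playlist recovers the listening order,
# turning each playlist into an ordered track sequence
sequences = (
    relations.sort_values(["pid", "pos"])
             .groupby("pid")["track_spid"]
             .agg(list)
)
# sequences[1] -> ['a', 'b', 'c']
```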


[Table preview: track audio-features table with columns track_danceability, track_energy, track_key, track_loudness, track_mode, track_speechiness, track_acousticness, track_instrumentalness, track_liveness, track_valence, track_tempo, track_type, track_spid, track_uri, track_track_href, track_analysis_url, track_duration_ms, track_time_signature, time_pulled]


[Table preview: track/album/artist metadata table with columns album_type, album_artist_spurl, album_artist_spid, album_artist_name, album_artist_type, album_spurl, album_spid, album_img_url, album_name, album_release_date, album_tracks_count, album_track_type, artist_spurl, artist_spid, artist_name, artist_type, track_duration_ms, track_explicit, track_isrc, track_spurl, track_spid, track_is_local, track_name, track_popularity, track_preview_url, track_number, track_type, time_pulled]

freEDA

  • TODO: Add visualization showing relationship of tables and resultant joined tables
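Pending that visualization, a rough pandas sketch of how the tables relate (tiny made-up stand-ins, joined on the track_spid key shared by the table previews above):

```python
import pandas as pd

# Tiny illustrative stand-ins for the three tables (schemas assumed)
relations = pd.DataFrame({"pid": [1, 1], "pos": [0, 1], "track_spid": ["a", "b"]})
features  = pd.DataFrame({"track_spid": ["a", "b"], "track_danceability": [0.7, 0.4]})
metadata  = pd.DataFrame({"track_spid": ["a", "b"], "track_name": ["Song A", "Song B"]})

# Join everything onto the relational table on track_spid; validate catches
# accidental duplicate track rows in the dimension tables early
joined = (
    relations
    .merge(features, on="track_spid", validate="many_to_one")
    .merge(metadata, on="track_spid", validate="many_to_one")
)
```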

Takeaways

M1 Issues

  • Setting up Python and installing certain packages is whack on M1 in its current form
  • Steps Taken:

    ```shell
    pyenv install <python_version>
    # For M1 chip:
    CFLAGS="-I$(brew --prefix xz)/include" LDFLAGS="-L$(brew --prefix xz)/lib" pyenv install <python_version>
    ```

    ```shell
    brew install openblas
    export OPENBLAS=$(brew --prefix openblas)
    export CFLAGS="-falign-functions=8 ${CFLAGS}"
    pip install numpy Cython pybind11 pythran
    pip install --no-use-pep517 scipy
    ```

Kedro + WANDB Issues

  • Kedro-Viz and wandb cannot work together due to a conflict over the required graphql-core package version
  • Steps Taken:

    ```shell
    pip install wandb
    # ImportError: cannot import name 'introspection_query' from 'graphql' (/Users/anishshah/.pyenv/versions/3.8.12/envs/venv-spot-wandb-demo/lib/python3.8/site-packages/graphql/__init__.py)

    pip install graphql-core==2.3.2
    # strawberry-graphql 0.79.0 requires graphql-core<3.2.0,>=3.1.0, but you have graphql-core 2.3.2 which is incompatible.

    pip uninstall kedro-viz
    ```
  • Would love to work on a kedro-wandb package similar to how there exists a kedro-mlflow and a kedro-neptune

Tables Experience

  • Not a lot of documentation explicitly mentions how to use pandas with Table
  • Wrote this before realizing there is a dataframe parameter available in wandb.Table:

    ```python
    import numpy as np
    import wandb

    run = wandb.init(project="test_project")
    artifact = wandb.Artifact("artifact", type="dataset")

    # wandb.Table doesn't accept the pandas NaT type, so we coerce dtypes on
    # each column (which also coerces null-style objects into NaT/NA) before
    # replacing the coerced nulls with None, which wandb accepts
    def prep_pandas_for_wandb(df):
        return df.convert_dtypes().fillna(np.nan).replace([np.nan], [None])

    def pandas_to_wandb(df):
        return wandb.Table(data=prep_pandas_for_wandb(df).values, columns=df.columns.tolist())

    table = pandas_to_wandb(df)
    artifact.add(table, "table")
    run.log_artifact(artifact)
    ```
  • Haven't tested whether I would still need to do the NaT typecasting to None that WANDB needs
  • Worst enemy:
    • 2021-10-28 23:36:48,998 - root - WARNING - Truncating wandb.Table object to 200000 rows.
  • When using joined table and the truncation, how does this interaction work?
    • Assuming truncate t1 - truncate t2 - join truncated t1 and t2
  • No easy way to get the shape/size of a wandb Table
    • ```python
      len(table)
      # TypeError: object of type 'Table' has no len()
      ```
    • Workaround (can be slow though):

      ```python
      len(table.get_index())
      ```
  • Large dataset struggles
  • Would love to have/make an example showing how to trivially split a large pandas dataframe and store it as a partitioned table
  • Cannot join joined tables
    • At what point should we expect to join a table ourselves vs via WANDB?
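A minimal sketch of what that partitioning could look like in plain pandas, using the 200,000-row limit from the truncation warning above (the wandb logging calls themselves are omitted; chunk names are hypothetical):

```python
import pandas as pd

MAX_ROWS = 200_000  # wandb.Table truncation threshold from the warning above

def partition_dataframe(df, max_rows=MAX_ROWS):
    """Split df into contiguous chunks that each fit under the row limit."""
    return [df.iloc[start:start + max_rows]
            for start in range(0, len(df), max_rows)]

# Example: 450,123 rows -> 3 chunks (200k, 200k, 50,123 rows)
df = pd.DataFrame({"x": range(450_123)})
chunks = partition_dataframe(df)
# Each chunk could then be logged as its own wandb.Table,
# e.g. "table_part_0", "table_part_1", ...
```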