Predicting Song Skipping on Spotify

A classification model to predict song skipping on Spotify. Made by Taha Bouhoun using Weights & Biases


In early 2019, Spotify shared exciting statistics about its platform. One of those stats: out of more than 35 million songs on the service, Spotify users had created more than 2 billion playlists (Oskar Stål, 2019). Personally, I like the analogy that our music taste is like our DNA: very diverse across 7 billion people, yet the building blocks (nucleotides/songs) are the same. Inferring a user's music taste is therefore challenging, and it matters because Spotify's business model relies on its ability to recommend new songs.
Like all entertainment services, Spotify is battling for its users' attention, making it necessary to recommend songs that are less likely to be skipped. This project explores a portion of the 130 million streaming sessions that Spotify shared as part of the Skip Prediction Challenge.
The strategy is to start with a simple model, then gradually add more layers (i.e., more data, feature engineering, and sequential information). By doing so, we formulate different questions at each step of the modeling process.


Exploratory Data Analysis

The data contains a history of 130 million listening sessions covering roughly 4 million unique songs. Spotify didn't share details on how many users are represented in the dataset; however, they did include the following features:
This project uses a subset of the Spotify dataset (2M rows after balancing the labels), split into train (1.44M), validation (360K), and test (200K) sets. The data processing and the modeling are implemented in separate Colab notebooks.
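The balancing and splitting step can be sketched as follows. This is a minimal illustration and not the notebook's actual code: the function name, the `skipped` key, and the choice to downsample the majority class are assumptions. The fractions 0.72 / 0.18 / 0.10 are simply the 1.44M / 360K / 200K sizes divided by 2M. The source does not say whether rows were split individually or by session, so the sketch splits individual rows.

```python
import random


def balance_and_split(rows, label_key="skipped", fractions=(0.72, 0.18, 0.10), seed=0):
    """Downsample the majority class, shuffle, and split into train/val/test.

    `rows` is a list of dicts; `label_key` (hypothetical name) holds a boolean label.
    """
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key]]
    negatives = [r for r in rows if not r[label_key]]

    # Keep the same number of rows from each class.
    n = min(len(positives), len(negatives))
    balanced = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(balanced)

    # The test set takes whatever is left after train and validation.
    n_train = round(len(balanced) * fractions[0])
    n_val = round(len(balanced) * fractions[1])
    train = balanced[:n_train]
    val = balanced[n_train:n_train + n_val]
    test = balanced[n_train + n_val:]
    return train, val, test
```

Fixing the seed keeps the three sets reproducible across notebook runs.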

Audio Characteristics

Each song has a series of audio features as detailed in this Spotify documentation.
Fig 1. Pairplot of audio characteristics, color-coded by whether the song was skipped (a 0.1% sample of the dataset). Very few discernible patterns can be drawn from this visualization (e.g., all songs with liveness above 0.5 and acousticness above 0.5 were not skipped). Overall, it is clear how challenging it is to draw boundaries based on these features, which means that we shouldn't expect high accuracy from audio characteristics alone.

User Behavior

Music taste is diverse across Spotify users, and the context of each user can help discern signal from noise. By adding user attributes, we can pick up on instances where songs are skipped for reasons that don't depend on the song (e.g., a free-tier Spotify user might hold back from skipping a song because their skips are limited, unlike premium users).
Fig 2. Correlation between user behavior and song skipping. The contrast is visible in the pause before playing a song: a long pause has a correlation of about -0.11 with not_skipping, whereas no pause has a correlation of about +0.10.
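For reference, a correlation like the ones in Fig 2 can be computed as a plain Pearson coefficient between a 0/1 behavior flag and the 0/1 target. The sketch below assumes that is how the figure was produced, which the source does not state. The `long_pause` values are made-up toy data, not numbers from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy example (invented values): 1 = long pause before play, 1 = song not skipped.
long_pause = [1, 1, 1, 0, 0, 0, 0, 0]
not_skipping = [0, 0, 1, 1, 1, 0, 1, 1]

# A negative value means long pauses tend to co-occur with skips.
print(pearson(long_pause, not_skipping))
```

With two binary variables this coefficient is the same quantity as the phi coefficient, so a small magnitude such as 0.1 indicates only a weak association.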

Session homogeneity

The dataset is structured as a series of listening sessions, each a chain of songs that a single user listened to in one sitting. The decision to skip a song can depend on the previous songs in the session. The following plot illustrates the assumption that songs within a session tend to be homogeneous (e.g., if a session contains mostly classical music, a rap song might get skipped).
Fig 3. Music characteristics for three randomly selected sessions.
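One rough way to check the homogeneity assumption numerically is to compare the spread of an audio feature inside each session with its spread across all tracks. The sketch below is an illustration under assumptions: the `session_id` and `energy` keys and both function names are hypothetical, and this is not a measure the original analysis is known to have used.

```python
from collections import defaultdict
from statistics import mean, pstdev


def within_session_spread(rows, feature):
    """Average per-session standard deviation of `feature`.

    Sessions with a single track are ignored; at least one session
    with two or more tracks is required.
    """
    by_session = defaultdict(list)
    for row in rows:
        by_session[row["session_id"]].append(row[feature])
    return mean(pstdev(values) for values in by_session.values() if len(values) > 1)


def overall_spread(rows, feature):
    """Standard deviation of `feature` across all rows, ignoring sessions."""
    return pstdev([row[feature] for row in rows])


# Toy data (invented): one high-energy session and one low-energy session.
rows = [
    {"session_id": "a", "energy": 0.80},
    {"session_id": "a", "energy": 0.90},
    {"session_id": "a", "energy": 0.85},
    {"session_id": "b", "energy": 0.20},
    {"session_id": "b", "energy": 0.10},
    {"session_id": "b", "energy": 0.15},
]
print(within_session_spread(rows, "energy"), overall_spread(rows, "energy"))
```

If sessions are homogeneous, the first number should be clearly smaller than the second, as it is for this toy data.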



A crucial step in modeling is to lay out all the assumptions and limitations to properly interpret the results. Some assumptions are due to the data collection process, and others are part of the modeling process:

Predictions based solely on audio features

Put simply, we test whether a model can predict if a song will be skipped based only on its audio characteristics (e.g., loudness, instrumentalness, valence). That is to say, are there any patterns of song skipping that are consistent across users? Are there specific combinations of audio features that correlate with song skipping?
These questions are important because they can guide artists toward the audio characteristics that commonly appeal to music consumers, and they shed some light on how well we can infer skipping from a minimal amount of information.
Evidently, the model does not meaningfully outperform a random coin toss. Still, it's important to establish a baseline accuracy and strive to improve on it. The grid search shows that a smaller learning rate (0.05), a max depth of 5, and 30 leaves gave better overall accuracy than the other hyperparameter combinations.
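The grid search itself amounts to trying every combination of hyperparameter values and keeping the best-scoring one. The sketch below shows that loop in plain Python; it does not reproduce the Weights & Biases sweep used in the project. The `grid_search` helper and the `evaluate` callback are hypothetical names, and apart from 0.05, 5, and 30, the grid values are placeholders, since the source does not list the full grid.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination in `param_grid` and return (best_params, best_score).

    `evaluate` is any callable that takes a dict of parameters and returns
    a score where higher is better (e.g., validation accuracy).
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Placeholder grid: only 0.05, 5, and 30 are taken from the text.
param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [5, 10],
    "num_leaves": [30, 60],
}
```

In practice, `evaluate` would train the model on the training set with the given parameters and return its accuracy on the validation set, so that the test set stays untouched until the final comparison.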

Predictions based on Audio & User features

Adding user attributes (e.g., whether they're a premium user or whether they played the song after a long pause) improves accuracy by 5% over the previous model. Nonetheless, mispredicting 40% of the cases is not ideal.

Feature Engineering

The previous two examples showed that hyperparameter tuning has a slim marginal return in terms of accuracy. Feature engineering is therefore the more promising way to improve the model's performance. Example additions to the data are:
Incorporating information about the session adds a layer of context to song skipping. For example, a listener may skip a song not because it's bad but simply because they've been listening to music for a while.
The model jumps to 76.24% accuracy after running a sweep across the same grid as the previous models. This result is quite good given that we're only working with a simple decision tree algorithm: it is not far from the top submission of the Spotify Challenge (81.2%).
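To make the idea of session context concrete, here is a minimal sketch of two session-level features: how many tracks the listener has already played in the session, and how far the current track's audio feature is from the average of the tracks before it. These particular features, the function name, and the `session_id`, `session_position`, and `energy` keys are assumptions for illustration; the source does not list the features that were actually engineered.

```python
from collections import defaultdict


def add_session_features(rows, audio_feature="energy"):
    """Return copies of `rows` enriched with two session-context features.

    Only earlier tracks in the same session are used, so each feature is
    available at the moment the current track starts playing.
    """
    by_session = defaultdict(list)
    for row in rows:
        by_session[row["session_id"]].append(row)

    enriched_rows = []
    for session_rows in by_session.values():
        ordered = sorted(session_rows, key=lambda r: r["session_position"])
        running_sum = 0.0
        for played_so_far, row in enumerate(ordered):
            enriched = dict(row)
            enriched["tracks_played_so_far"] = played_so_far
            # The first track has no history, so its deviation is defined as 0.
            if played_so_far:
                previous_mean = running_sum / played_so_far
                deviation = row[audio_feature] - previous_mean
            else:
                deviation = 0.0
            enriched[audio_feature + "_diff_from_session"] = deviation
            enriched_rows.append(enriched)
            running_sum += row[audio_feature]
    return enriched_rows
```

A large absolute deviation flags a track that breaks with what the session has contained so far, which is the kind of signal the session homogeneity assumption suggests should matter.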

Conclusion & Further Analysis

The Spotify competition was an interesting challenge to explore. As of today, more than 40.4K people have looked up the dataset, and roughly 700 submissions have been made. Neural networks might be a viable alternative, but training them on a dataset of this size would carry a huge computational cost.
Overall, recommendation engines require both personalized learning about the user and general learning about the songs. I incrementally built a classification model in this project and tracked my experiments using Weights & Biases tools.
Further improvements can be added to this project, such as: