Predicting Song Skipping on Spotify
A classification model to predict song skipping on Spotify. Made by Taha Bouhoun using Weights & Biases
In early 2019, Spotify shared exciting statistics about its platform. One of them: out of more than 35 million songs on the service, Spotify users have created more than 2 billion playlists (Oskar Stål, 2019). Personally, I think of music taste as being like DNA: highly diverse across 7 billion people, yet made of the same building blocks (nucleotides/songs). As a result, inferring a user's music taste is challenging, especially since Spotify's business model relies on its ability to recommend new songs.
Like all entertainment services, Spotify is battling for its users' attention, making it necessary to recommend songs that are less likely to be skipped. This project explores a portion of the 130 million streaming sessions that Spotify shared as part of the Skip Prediction Challenge (Brost et al., 2019).
The strategy is to start with a simple model, then gradually add more layers (i.e., more data, feature engineering, encoding sequential information). By doing so, we formulate different questions at each step of the modeling process.
Exploratory Data Analysis
The data contains a history of 130 million streams of roughly 4 million unique songs. Spotify didn't share how many users are represented in the dataset; however, it did include the following features:
User characteristics: details on the user's activity on the platform (e.g., subscription type, the position of the song in a streaming session, etc.)
Track features: ranging from duration and a US popularity estimate to a breakdown of the track's audio characteristics (e.g., tempo, acousticness, instrumentalness, etc.)
This project takes in a subset of the Spotify dataset (2M rows after balancing the labels), allocated between a train (1.44M), validation (360K), and test set (200K). The data processing and the modeling are implemented in separate Colab Notebooks.
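As a minimal sketch of that 72/18/10 split, stratified sampling keeps the labels balanced across the three sets. The frame below is a synthetic stand-in scaled down to 2,000 rows, and the column names are illustrative, not the dataset's actual schema:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 2M-row balanced subset (illustrative columns).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "danceability": rng.random(2_000),
    "skipped": rng.integers(0, 2, 2_000),
})

# Proportions match the article's split: 72% train, 18% val, 10% test.
rest, test = train_test_split(df, test_size=0.10,
                              stratify=df["skipped"], random_state=42)
train, val = train_test_split(rest, test_size=0.20,
                              stratify=rest["skipped"], random_state=42)
# At this scale: train = 1,440 rows, val = 360, test = 200.
```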
Fig 1. Pairplot of audio characteristic correlations, color-coded by whether the song was skipped (0.1% of the dataset). Very few discernible patterns can be drawn from this visualization (e.g., all songs with both liveness and acousticness above 0.5 were not skipped). Overall, it is quite visible how challenging it is to draw decision boundaries based on these features, which means we shouldn't expect high accuracy from audio characteristics alone.
Music taste is diverse across Spotify users and the context of each user can help discern signals from noise. By adding user attributes, we can pick up on instances where songs are skipped for reasons that don't depend on the song (e.g., a Spotify free user might hold back from skipping a song since they have a limit as compared to premium users).
Fig 2. Correlation between user behavior and song skipping. The contrast is clearest for the pause before playing a song: a long pause is 11% negatively correlated with not_skipping, whereas no pause is 10% positively correlated with not_skipping.
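Correlations like those in Fig 2 reduce to a single pandas call. The frame and column names below are stand-ins for the real user-behavior fields, not the dataset's actual schema:

```python
import numpy as np
import pandas as pd

# Illustrative binary indicators; real columns differ in name and detail.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "long_pause_before_play": rng.integers(0, 2, 1_000),
    "no_pause_before_play":   rng.integers(0, 2, 1_000),
    "not_skipping":           rng.integers(0, 2, 1_000),
})

# Pearson correlation of each behavior indicator with the skip label.
corr = df.corr()["not_skipping"].drop("not_skipping")
```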
The dataset is structured as a series of listening sessions: chains of songs that a single user listened to in one sitting. The decision to skip a song can depend on information about the previous songs in the session. The following plot highlights the assumption that songs within a session ought to be homogeneous (i.e., if a session contains mostly classical music, then a rap song might get skipped).
Fig 3. Plotting music characteristics for three randomly selected sessions.
A crucial step in modeling is to lay out all the assumptions and limitations to properly interpret the results. Some assumptions are due to the data collection process, and others are part of the modeling process:
The users are homogeneous, i.e., the mechanism that leads a user to skip a song is static across the population regardless of their music taste.
Songs are broken down into audio features; hence the lyrics are not interpreted as natural language text. This limitation is important to consider since lyrical meaning can be a strong predictor of song skipping.
Predictions based solely on audio features
Simply put, we test whether a model can predict if a song will be skipped based on its audio characteristics (e.g., loudness, instrumentalness, valence, etc.). That is to say, are there patterns of song skipping that are consistent across users? Are there specific audio feature combinations that correlate with song skipping?
These questions are important because they can guide artists toward audio characteristics that are commonly appealing to music consumers, and they shed light on the extent to which we can infer skipping from a minimal amount of information.
Evidently, the model barely outperforms a random coin toss. Still, it's important to establish a baseline accuracy and strive to improve on it. The grid search shows that a smaller learning rate (0.05), max depth (5), and number of leaves (30) produced a better overall accuracy score than the other hyperparameter combinations.
Predictions based on Audio & User features
Adding user attributes (e.g., whether they're a premium user, whether they played the song after a long pause, etc.) improves the accuracy by 5% compared to the previous model. Nonetheless, mispredicting 40% of the cases is far from ideal.
The previous two examples showed that hyperparameter tuning has slim marginal returns in terms of accuracy. Feature engineering, therefore, is the next lever for improving the model's performance. Example additions to the data are:
Duration since starting the session: How long has a user been listening to music?
The age of the song: the time between its release year and when it was streamed in a session
Characteristics of the previous song in the session: e.g., whether the previous song was skipped, the duration of the previous song, etc.
Incorporating information about the session adds a layer of context to song skipping. For example, one can skip a song not because it's bad but simply because they've been listening to music for a while.
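The three engineered features above can be derived with pandas groupby operations. The toy frame and its column names are assumptions for illustration, not the dataset's real schema (the reference year of 2019 is likewise an assumption):

```python
import pandas as pd

# Illustrative session frame; column names are assumptions.
df = pd.DataFrame({
    "session_id":   [1, 1, 1, 2, 2],
    "position":     [1, 2, 3, 1, 2],
    "duration_s":   [210, 180, 240, 200, 190],
    "release_year": [2001, 2015, 2018, 1999, 2010],
    "skipped":      [0, 1, 1, 0, 0],
})

# Time elapsed since the session started (sum of previous durations).
df["session_elapsed_s"] = (df.groupby("session_id")["duration_s"]
                             .cumsum() - df["duration_s"])
# Age of the song at listening time (assuming 2019 sessions).
df["song_age"] = 2019 - df["release_year"]
# Attributes of the previous track in the same session (NaN for the first).
df["prev_skipped"]  = df.groupby("session_id")["skipped"].shift()
df["prev_duration"] = df.groupby("session_id")["duration_s"].shift()
```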
The model jumps to 76.24% accuracy after running a sweep across the same grid as the previous models. This result is quite good given that we're only working with a simple tree-based algorithm, yet we're not far from the top submission of the Spotify Challenge (81.2%).
Conclusion & Further Analysis
The Spotify competition was an interesting challenge to explore. As of today, more than 40.4K people have viewed the dataset, and roughly 700 submissions have been made. A neural network might be a viable alternative, but the dataset's size implies a huge computational cost.
Overall, recommendation engines require both personalized learning about the user and general learning about the songs. I incrementally built a classification model in this project and tracked my experiments using Weights & Biases tools.
Further improvements can be added to this project, such as:
Interpreting lyrics: using NLP to retrieve skipping patterns related to the artists' word choices.
Artists: details about artists can prove helpful such as the genre of music they make, their level of fame, or their collaboration with other artists.
References
Brian Brost, Rishabh Mehrotra, and Tristan Jehan. 2019. The Music Streaming Sessions Dataset. In Proceedings of the 2019 World Wide Web Conference (WWW ’19), May 13–17, 2019, San Francisco, CA, USA. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3308558.3313641