As we move further into the 2020s, an ever-increasing share of music consumption and discovery is going to be mediated by AI-driven recommendation systems. Back in 2020, as many as 62% of consumers rated platforms like Spotify and YouTube among their top sources of music discovery, and you can be sure that a healthy chunk of that discovery is mediated by recommender systems. On Spotify, for instance, over one third of all new artist discoveries happen through "Made for You" recommendation sessions, according to the recently released Made to be Found report.

Yet, as algorithmic recommendations take center stage in the music discovery landscape, the professional community at large still perceives these recommender algorithms as black boxes. Music professionals rely on recommender systems across platforms like Spotify and YouTube to amplify ad budgets, connect with new audiences, and execute successful release campaigns, often with no clear vision of how these systems operate or how to leverage them to amplify artist discovery.

The topic of unveiling AI-driven recommender systems and providing music professionals with the resources and tools they need to understand and manage these algorithms will be a big focus for Music Tomorrow throughout 2022. A few weeks back, we kicked off the year with an article covering the ins and outs of the famed TikTok "For You" algorithm. This time around, we'll dive deeper into the topic with a breakdown of Spotify's recommender system (which can be, to an extent, extrapolated to other DSP recommendation engines).

How do recommendation and music discovery work on Spotify?

In a lot of ways, Spotify's recommendation engine is dealing with a similar flow as TikTok's "For You" algorithm, playing the matchmaker between the creators (or artists) and users (or fans) on a two-sided marketplace. However, as opposed to TikTok, in this case, we don't have the courtesy of recently leaked internal documentation to uncover the makeup of the system. What we do have is the company's extensive public R&D records, its API, and some common sense. That is not to say that we don't know anything definitive on how the system works — in fact, a healthy chunk of Spotify's recommendation approach has been widely publicized — but we would have to descend into the area of educated guesses when it comes to some of the more granular details. Don't worry, though: we'll make it clear once we depart the land of facts.

Behind the algorithm: understanding music and user tastes

In broad strokes, at the core of any AI recommender system, there's an ML model optimized for the key business goals: user retention, time spent on the platform, and, ultimately, generated revenue. For this recommendation system to work, it needs to understand the content it recommends and the users it recommends it to. On each side of that proposition, Spotify employs several independent ML models and algorithms to generate item representations and user representations. Let's break down exactly how this process works — starting with the track/artist representations:

Generating Track Representations: Content-based and Collaborative filtering

Spotify's approach to track representation is made up of two primary components:

  1. Content-based filtering, aiming to describe the track by examining the content itself
  2. Collaborative filtering, aiming to describe the track in its connection with other tracks on the platform by studying user-generated assets

The recommendation engine needs data generated by both methods to get a holistic view of the content on the platform and solve the cold start problems when dealing with newly uploaded tracks. First, let's take a look at the content-based filtering algorithms:

Analyzing artist-sourced metadata

As soon as Spotify ingests a new track, an algorithm will analyze all the general song metadata provided by the distributor and the metadata specific to Spotify (sourced through the Spotify for Artists pitch form). In the ideal scenario, where all the metadata is filled in correctly and makes its way to the Spotify database, this list should include:

  • Track title
  • Release title
  • Artist name
  • Featured artists
  • Songwriter credits
  • Producer credits
  • Label
  • Release Date
  • Genre & sub-genre tags* 
  • Music culture tags*
  • Mood tags*
  • Style tags*
  • Primary language*
  • Instruments used throughout recording*
  • Track typology (Is it a cover? Is it a remix? Is it an instrumental?)
  • Artist hometown/local market*

*Sourced through Spotify for Artists (S4A)

The artist-sourced metadata is then passed downstream, as input into other content-based models and the recommender system itself.
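To make the list above a little more tangible, here is a minimal sketch of how such a metadata record could be modelled. The class name, field names, and the `missing_pitch_fields` helper are our own illustrative assumptions; they are not Spotify's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TrackMetadata:
    """Illustrative container for artist-sourced metadata (not Spotify's schema)."""

    # Delivered by the distributor
    track_title: str
    release_title: str
    artist_name: str
    label: str
    release_date: str  # ISO date string, e.g. "2022-01-14"
    featured_artists: List[str] = field(default_factory=list)
    songwriters: List[str] = field(default_factory=list)
    producers: List[str] = field(default_factory=list)
    is_cover: bool = False
    is_remix: bool = False
    is_instrumental: bool = False
    # Sourced through the Spotify for Artists pitch form
    genres: List[str] = field(default_factory=list)
    culture_tags: List[str] = field(default_factory=list)
    moods: List[str] = field(default_factory=list)
    styles: List[str] = field(default_factory=list)
    primary_language: Optional[str] = None
    instruments: List[str] = field(default_factory=list)
    hometown: Optional[str] = None

    def missing_pitch_fields(self) -> List[str]:
        """Return the pitch-form fields that were left empty."""
        pitch_fields = ["genres", "culture_tags", "moods", "styles",
                        "primary_language", "instruments", "hometown"]
        return [name for name in pitch_fields if not getattr(self, name)]


# Hypothetical track with a partially completed pitch form
track = TrackMetadata(
    track_title="Example Song",
    release_title="Example EP",
    artist_name="Example Artist",
    label="Example Records",
    release_date="2022-01-14",
    genres=["indie pop"],
    moods=["mellow"],
    primary_language="en",
)
print(track.missing_pitch_fields())
```

A check like this is mostly useful on the artist side: every empty field is one less piece of context that downstream models can work with.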

Analyzing raw audio signals

The second step of the content-based filtering is the raw audio analysis, which runs as soon as the audio files, accompanied by the artist-sourced metadata, are ingested into Spotify's database. The precise way in which that analysis is carried out remains one of the secret sauces of the Spotify recommender system. Yet, here's what we know for sure, and what we can reasonably assume:

Let's begin with the concrete facts. The audio features data available through the Spotify API consists of 12 metrics describing the sonic characteristics of the track. Most of these features have to do with objective sonic descriptions. For example, the metric of "instrumentalness" reflects the algorithm's confidence that the track has no vocals, scored on a scale from 0 to 1. However, on top of these "objective" audio attributes, Spotify generates at least three perceptual, high-level features designed to reflect how the track sounds in a more holistic way:

  1. Danceability, describing how suitable a track is for dancing based on a combination of musical elements, including tempo, rhythm stability, beat strength, and overall regularity.
  2. Energy, representing "a perceptual measure of intensity and activity", based on the track's dynamic range, perceived loudness, timbre, onset rate, and general entropy.
  3. Valence, describing "the musical positiveness of the track". Generally speaking, tracks with high valence sound more positive (e.g., happy, cheerful, euphoric), while songs with low valence sound more negative (e.g., sad, depressed, angry).
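As a rough illustration, here is what a set of audio features for a single track could look like, along with a toy function that turns the perceptual scores into readable labels. The field names follow the public audio features as we understand them; the values, the thresholds, and the `describe` helper are invented for this sketch.

```python
# Hypothetical audio features for one track. The numbers are made up.
features = {
    "danceability": 0.74,
    "energy": 0.70,
    "valence": 0.89,
    "instrumentalness": 0.0,
    "acousticness": 0.02,
    "speechiness": 0.06,
    "liveness": 0.05,
    "loudness": -7.4,      # dB
    "tempo": 150.0,        # BPM
    "key": 3,
    "mode": 0,
    "time_signature": 4,
}


def describe(features, threshold=0.6):
    """Map 0-to-1 scores to coarse labels (thresholds are assumptions)."""
    labels = []
    if features["danceability"] >= threshold:
        labels.append("danceable")
    if features["energy"] >= threshold:
        labels.append("high-energy")
    if features["valence"] >= threshold:
        labels.append("positive")
    if features["instrumentalness"] >= threshold:
        labels.append("likely instrumental")
    return labels


print(describe(features))
```

In a real recommender, the raw numeric values would be fed to a model directly rather than thresholded; the labels here only make the scores easier to read.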

Yet, these audio features are just the first component of Spotify's audio analysis system. In addition to the audio feature extraction, a separate algorithm will also analyze the track's temporal structure and split the audio into different segments of varying granularity: from sections (defined by significant shifts in the song timbre or rhythm, that highlight transitions between key parts of the track such as verse, chorus, bridge, solo, etc.) down to tatums (representing the smallest cognitively meaningful subdivision of the main beat).

Temporal Audio Analysis for Lil Nas X – Industry Baby (feat. Jack Harlow) (Visualization by Spotify Audio Analysis)
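To show how this kind of segmentation could be used, here is a small sketch that walks through a list of sections and tracks how loudness develops across them. The keys (`start`, `duration`, `loudness`) are loosely modelled on the public audio analysis output and should be treated as assumptions; the values and section labels are invented.

```python
# Hypothetical sections of one track (times in seconds, loudness in dB)
sections = [
    {"start": 0.0, "duration": 15.0, "loudness": -14.0},   # intro
    {"start": 15.0, "duration": 45.0, "loudness": -8.5},   # verse
    {"start": 60.0, "duration": 30.0, "loudness": -5.0},   # chorus
    {"start": 90.0, "duration": 20.0, "loudness": -11.0},  # outro
]


def loudest_section(sections):
    """Return the section with the highest average loudness."""
    return max(sections, key=lambda s: s["loudness"])


def loudness_changes(sections):
    """Loudness difference between each section and the one before it."""
    return [
        round(nxt["loudness"] - cur["loudness"], 1)
        for cur, nxt in zip(sections, sections[1:])
    ]


print(loudest_section(sections)["start"])
print(loudness_changes(sections))
```

Positive differences indicate a build-up and negative ones a drop, which is the kind of "development over time" signal a section-level analysis makes available.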

Educated assumption: The combination of data generated by the audio analysis methods should allow Spotify to discern the audio characteristics of the song and follow their development over time and between different sections of the track. Furthermore, the 12 audio features date back to 2013, when they were just a part of the audio analysis output of The Echo Nest (an audio intelligence company acquired by Spotify back in 2014).

The chances are that in almost ten years, Spotify's audio analysis algorithms have come a long way, which would mean that the audio feature data fed into the recommendation system today is much more detailed and granular than what's available through the company's public API. For instance, one of the research papers published by Spotify in 2021 states that the audio features are passed into the model as a 42-dimensional vector, which could mean that Spotify's audio analysis produces 42 different audio features (although that's circumstantial at best).

In addition, the company's extensive research record details Spotify's experiments with ML-based source separation and pitch tracking & melody estimation. If these projects were to make it into production, that would mean that the audio analysis system would be able to slice the track down to isolated instrumental parts, process them separately and define all the melodies and chord progressions used throughout the composition.

In practice, all of the above would mean that Spotify audio analysis can define the recordings uploaded to the platform in great detail. The final output of the system might define the track along the lines of "this song follows a V-C-V-C-B-V-C structure, builds up in energy towards the bridge and features an aggressive, dissonant guitar solo that resolves into a more melancholic and calm outro". Or even something much more detailed — the point is that in all likelihood, Spotify's audio analysis algorithm can reverse-engineer the track almost entirely and extract the broadest range of characteristics from the raw audio files. From the artist's perspective, it might be wise to assume that Spotify has access to the DAW project files.

Analyzing text with Natural Language Processing models

The final component of the content-based track representation is the Natural Language Processing models, employed to extract semantic information describing the track/artist from music-related text content. These models are applied in three primary contexts:

  1. Lyrics analysis. The primary goal here is to establish the prominent themes and the general meaning of the song's lyrics while also scanning for potential "clues" that might be useful down the road, such as locations, brands, or people mentioned throughout the text. 
  2. Web-crawled data (focusing primarily on music blogs and online media outlets). Running NLP models against web-crawled data allows Spotify to uncover how people (and gatekeepers) describe music online by analyzing the terms and adjectives that have the most co-occurrence with the song's title or the artist's name.
  3. User-generated playlists. The NLP algorithms run against the user-generated playlists featuring the track on Spotify to uncover additional insights into the song's mood, style, and genre. "If the song appears on a lot of playlists with "sad" in the title, it is a sad song."

The NLP models allow Spotify to tap into the track's cultural context and expand on the sonic analysis of how the song sounds with a social dimension of how the song is perceived.
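The playlist-title idea from the third context can be sketched in a few lines. This is a deliberately naive word count, not an actual NLP model, and the playlist data, the stopword list, and the function name are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical user-generated playlists
user_playlists = [
    {"title": "Sad songs for rainy days", "tracks": ["t1", "t2"]},
    {"title": "sad indie", "tracks": ["t1", "t3"]},
    {"title": "Workout energy", "tracks": ["t3", "t4"]},
    {"title": "Rainy day sad vibes", "tracks": ["t1"]},
]

STOPWORDS = {"for", "the", "a", "and", "of", "songs"}


def title_terms_for_track(track_id, playlists):
    """Count the words used in titles of playlists that feature the track."""
    counts = Counter()
    for playlist in playlists:
        if track_id in playlist["tracks"]:
            words = set(re.findall(r"[a-z]+", playlist["title"].lower()))
            counts.update(words - STOPWORDS)
    return counts


terms = title_terms_for_track("t1", user_playlists)
print(terms.most_common(1))
```

Here the track "t1" appears on three playlists with "sad" in the title, so "sad" comes out as its strongest descriptor. A production system would presumably rely on learned text representations rather than raw word counts.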

The three components outlined above — artist-sourced metadata, audio analysis, and NLP models — make up the content-based approach of the track representation within Spotify's recommender system. Yet, there's one more key ingredient to Spotify's recipe for track representation: 

Collaborative Filtering

In many ways, collaborative filtering has become synonymous with Spotify's recommender system. The DSP giant has pioneered the application of this so-called "Netflix approach" in the context of music recommendation, and widely publicized collaborative filtering as the driving power behind its recommendation engine. So the chances are, you've heard the process laid out time and again. At least the following version of it:

"We can understand songs to recommend to a user by looking at what other users with similar tastes are listening to." The algorithm simply compares users' listening history: if user A has enjoyed songs X, Y and Z, and user B has enjoyed songs X and Y (but hasn't heard Z yet), we should recommend song Z to them. By maintaining a massive user-item interaction matrix covering all users and tracks on the platform, Spotify can tell if two songs are similar (if similar users listen to them) and if two users are similar (if they listen to the same songs).

Sounds like a silver bullet for music recommendation, doesn't it? In reality, however, this item-user matrix approach comes with a host of issues that have to do with accuracy, scalability, speed, and cold start problems. So, in recent years, Spotify has moved away from consumption-based filtering — or at least drastically downplayed its role in track representation. Instead, the current version of collaborative filtering focuses on the track's organizational similarity: i.e., "two songs are similar if a user puts them on the same playlist".
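For reference, here is a minimal sketch of the classic consumption-based approach with a tiny user-item matrix. It is a textbook user-based collaborative filter with made-up data and helper names, not Spotify's implementation, and it skips everything that makes the real problem hard (scale, sparsity, users with no history).

```python
import numpy as np

# Rows are users A, B, C; columns are songs X, Y, Z, W.
# 1.0 means the user has enjoyed the song.
songs = ["X", "Y", "Z", "W"]
interactions = np.array([
    [1, 1, 1, 0],  # user A
    [1, 1, 0, 0],  # user B
    [0, 0, 0, 1],  # user C
], dtype=float)


def recommend(user_idx, interactions, top_n=1):
    """Recommend unheard songs based on what similar users listen to."""
    # Cosine similarity between every pair of users
    # (assumes every user has at least one interaction)
    norms = np.linalg.norm(interactions, axis=1, keepdims=True)
    normalized = interactions / norms
    user_similarity = normalized @ normalized.T

    # Weight other users' listening by their similarity to this user
    weights = user_similarity[user_idx].copy()
    weights[user_idx] = 0.0
    scores = weights @ interactions

    # Never recommend songs the user has already heard
    scores[interactions[user_idx] > 0] = -np.inf
    ranked = np.argsort(scores)[::-1][:top_n]
    return [songs[i] for i in ranked]


print(recommend(1, interactions))  # recommendation for user B
```

User B shares songs X and Y with user A and nothing with user C, so the only sensible suggestion is song Z.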

By studying playlist and listening session co-occurrence, collaborative filtering algorithms access a deeper level of detail and capture well-defined user signals. Simply put, streaming users often have pretty broad and diverse listening profiles (in fact, building listening diversity is one of Spotify's priorities, as we've covered in our recent article on fairness in music recommender systems), and so the fact that a lot of users listen to song A and song B doesn't automatically mean that these two songs are similar. After all, artists like Metallica and ABBA probably have quite a few shared listeners.

If, on the other hand, a lot of users put song A and song B on the same playlist, that is a much more conclusive sign that these two songs have something in common. On top of that, the playlist-centric approach also offers insight into the context in which these two songs are similar — and with playlist creation being one of the most widespread practices on the platform, Spotify has no shortage of collaborative filtering data to work through.
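The "two songs are similar if a user puts them on the same playlist" principle boils down to counting co-occurrences. The sketch below does exactly that on made-up playlists; the function names are ours, and a production model would learn track embeddings from this kind of data rather than use raw counts.

```python
from collections import Counter
from itertools import combinations

# Hypothetical user-generated playlists, each a list of track IDs
sample_playlists = [
    ["song_a", "song_b", "song_c"],
    ["song_a", "song_b"],
    ["song_b", "song_d"],
    ["song_a", "song_b", "song_d"],
]


def cooccurrence_counts(playlists):
    """Count how many playlists each pair of tracks shares."""
    counts = Counter()
    for tracks in playlists:
        for pair in combinations(sorted(set(tracks)), 2):
            counts[pair] += 1
    return counts


def most_similar(track, counts, top_n=1):
    """Tracks that most often share a playlist with the given track."""
    neighbours = Counter()
    for (first, second), n in counts.items():
        if first == track:
            neighbours[second] += n
        elif second == track:
            neighbours[first] += n
    return neighbours.most_common(top_n)


counts = cooccurrence_counts(sample_playlists)
print(most_similar("song_a", counts))
```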

Today, the Spotify collaborative filtering model is trained on a sample of ~700 million user-generated playlists selected out of the much broader set of all user-generated playlists on the platform. The main principle for choosing the playlists that make it into that sample? "Passion, care, love, and time users put into creating those playlists."

Now, we have finally arrived at the point where the combination of collaborative and content-based approaches allows Spotify's recommender system to develop a holistic representation of the track. At this point, the track profile is further enriched by combining the outputs of several independent algorithms to generate higher-level vectors (think of these as mood, genre, style tags, etc.). In addition, to deal with the cold start problem when processing freshly uploaded releases that don't have enough NLP/playlist data behind them, some of these properties are also extrapolated to develop overarching artist algorithmic profiles.

However, to turn this track- and artist-level data into relevant recommendations, the engine needs to marry it with the data describing the users, which brings us to the next section.

Generating User Taste Profiles

The approach to user profiling on Spotify is quite a bit simpler, at least once we solve the track representations. Essentially, the recommender engine logs all of the user's listening activity, split into separate context-rich listening sessions. This context component is vital when interpreting user activity to generate taste profiles. For instance, if the user engages with Spotify's "What's New" tab, the primary goal of the listening session is often to quickly explore music recently added to the platform. In that context, high skip rates are to be expected, as the user's primary goal is to skim through the feed and save some of the content served for later — which means that a track skip shouldn't be interpreted as a definite negative signal. On the other hand, if the user skips a track when listening to a "Deep Focus" playlist designed to be consumed in the background, that skip is a much stronger sign of user dissatisfaction.
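A minimal way to express this context dependence is to let the weight of a skip vary with the listening context. The context names and penalty values below are pure assumptions chosen to mirror the two examples above.

```python
# Hypothetical penalties: how strongly a skip counts as negative feedback
SKIP_PENALTY = {
    "whats_new": 0.1,            # skimming new releases, skips are expected
    "deep_focus_playlist": 0.8,  # background listening, skips are meaningful
}
DEFAULT_SKIP_PENALTY = 0.4


def skip_signal(context):
    """Return a negative feedback value for a skip in the given context."""
    return -SKIP_PENALTY.get(context, DEFAULT_SKIP_PENALTY)


print(skip_signal("whats_new"))
print(skip_signal("deep_focus_playlist"))
```

The same user action produces a weak signal in one session and a strong one in another, which is the whole point of logging sessions together with their context.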

Generally speaking, the user feedback can be split into two primary categories: 

  • Explicit, or active feedback: library saves, playlist adds, shares, skips, click-through to artist/album page, artist follows, "downstream" plays 
  • Implicit, or passive feedback: listening session length, track playthrough, and repeat listens

In the case of the Spotify recommender system, explicit feedback weighs in more when developing user profiles. Music is often enjoyed as off-screen content, meaning that uninterrupted consumption doesn't always relate to enjoyment. Then, user feedback data is processed to develop the user profile, defined in terms of:

  • Most-played and preferred songs and artists 
  • Saved songs and albums & followed artists
  • Genre, mood, style, and era preferences
  • Popularity and diversity preferences
  • Temporal patterns
  • Demographic & geolocation profile
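As a sketch of how explicit feedback could weigh in more than implicit feedback, the snippet below aggregates a stream of feedback events into per-artist affinity scores. The event names and weights are hypothetical and only meant to illustrate the idea.

```python
from collections import defaultdict

# Hypothetical weights: explicit signals count more than implicit ones
EVENT_WEIGHTS = {
    # explicit feedback
    "save": 3.0,
    "playlist_add": 3.0,
    "follow": 2.5,
    "skip": -1.5,
    # implicit feedback
    "full_play": 0.5,
    "repeat_listen": 1.0,
}

# Hypothetical feedback log: (artist, event)
events = [
    ("artist_1", "full_play"),
    ("artist_1", "save"),
    ("artist_1", "playlist_add"),
    ("artist_2", "full_play"),
    ("artist_2", "full_play"),
    ("artist_2", "skip"),
]


def artist_affinity(events):
    """Sum weighted feedback events into an affinity score per artist."""
    scores = defaultdict(float)
    for artist, event in events:
        scores[artist] += EVENT_WEIGHTS.get(event, 0.0)
    return dict(scores)


affinity = artist_affinity(events)
print(affinity)
```

Even though both artists were played through, the explicit save and playlist add put the first artist far ahead, while a single skip outweighs two passive plays for the second.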

Then the user taste profile is further subdivided based on the consumption context: i.e., the same user might prefer mellow indie-pop on Sunday evenings and high-energy motivational hip-hop on Monday mornings. In the end, Spotify ends up with a context-aware user profile that might look something like this:

Source: Spotify Research

This history-based user profile constantly develops and expands with fresh consumption and interaction data. Recent activity is also prioritized over the historic profile: for instance, if the user gets into a new genre and it scores well in terms of user feedback, the recommender system will try to serve more adjacent music, even if the user's all-time favorite music is wildly different.
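One common way to prioritize recent activity is to decay the weight of each play exponentially with its age. Whether Spotify does it this way is not something we know; the half-life, the function names, and the listening history below are assumptions for illustration.

```python
def recency_weight(days_ago, half_life_days=30.0):
    """Weight of a play that halves every `half_life_days` days."""
    return 0.5 ** (days_ago / half_life_days)


def genre_scores(plays):
    """Sum recency-weighted plays per genre from (genre, days_ago) pairs."""
    scores = {}
    for genre, days_ago in plays:
        scores[genre] = scores.get(genre, 0.0) + recency_weight(days_ago)
    return scores


# Hypothetical history: 50 metal plays over a year ago, 5 ambient plays this week
plays = [("metal", 400)] * 50 + [("ambient", 3)] * 5
scores = genre_scores(plays)
print(max(scores, key=scores.get))
```

With a 30-day half-life, five fresh ambient plays outweigh fifty old metal plays, so the freshly discovered genre takes the lead.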

Recommending music: integrating user and track representations 

Woooh. You've made it. The intertwined constellation of algorithms behind the Spotify recommendations has produced the two core components — track and user representations — required to serve relevant music. Now, we just need the algorithm to make the perfect match between the two and find the right track for the right person (and the right moment).

However, the recommendation landscape on Spotify is much more diverse than on some of the other consumption platforms. Just consider the range of Spotify features that are generated with the help of the recommendation engine:

  1. Discover Weekly & Release Radar playlists
  2. Your Daily Mix playlists
  3. Artist / Decade / Mood / Genre Mix playlists 
  4. Special personalized playlists (Your Time Capsule, On Repeat, Repeat Rewind, etc.)
  5. Personalized editorial playlists
  6. Personalized browse section
  7. Personalized search results
  8. Playlist suggestions & enhance playlist feature
  9. Artist/song radio and autoplay features

In one way or another, all these diverse spaces are mediated by the recommender engine, but each of them runs on a separate algorithm with its own inner logic and reward system. The track and user representations form a sort of universal foundation for these algorithms, providing a shared model layer designed to answer the common questions of feature-specific algorithms, such as:

  • User-entity affinity: "How much does user X like artist A or track B? What are the favorite artists/tracks of user Y?"
  • Item similarity: "How similar are artist A & artist B? What are the 10 tracks most similar to track C?"
  • Item clustering: "How would we split these 50 tracks/artists into separate groups?"

The feature-specific algorithms can then tap into these unified models to generate recommendations optimized for a given consumption space/context. For instance, the algorithm behind Your Time Capsule playlists would primarily engage with user-entity affinity data to try and find the tracks that users love but haven't listened to in a while. On the other hand, Discover Weekly algorithms would employ a mix of affinity and similarity data to find tracks similar to the user's preferences, which they haven't heard yet. Finally, generating Your Daily Mix playlists would involve all three methods — first, clustering the user's preferences into several groups and then expanding these lists with similar tracks. 
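To illustrate how a feature might combine affinity and similarity, here is a toy Discover Weekly-style candidate search: average the vectors of tracks the user likes into a taste vector, then rank unheard tracks by cosine similarity to it. The three-dimensional vectors, track names, and functions are invented; real track representations would have many more dimensions and come out of the models described earlier.

```python
import numpy as np

# Hypothetical track representations
track_vectors = {
    "liked_1": np.array([0.9, 0.1, 0.0]),
    "liked_2": np.array([0.8, 0.2, 0.1]),
    "new_close": np.array([0.85, 0.15, 0.05]),
    "new_far": np.array([0.0, 0.1, 0.9]),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def discover_candidates(liked, heard, track_vectors, top_n=1):
    """Rank unheard tracks by similarity to the user's average taste."""
    taste = np.mean([track_vectors[t] for t in liked], axis=0)
    unheard = [t for t in track_vectors if t not in heard]
    ranked = sorted(
        unheard,
        key=lambda t: cosine(taste, track_vectors[t]),
        reverse=True,
    )
    return ranked[:top_n]


liked = ["liked_1", "liked_2"]
print(discover_candidates(liked, heard=set(liked), track_vectors=track_vectors))
```

A Daily Mix-style feature would add one more step before this: clustering the liked tracks into groups and running the search once per cluster.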

The goals and rewards of Spotify recommendation algorithms

Now, as we mentioned at the beginning of this breakdown, the overarching goal of the Spotify recommender system has to do primarily with retention, time spent on the platform, and general user satisfaction. However, these top-level goals are way too broad to devise a balanced reward system for ML algorithms serving content recommendations across a variety of features and contexts, and so the definition of success for the algorithms will largely depend on where and why the user engages with the system.

For instance, the success of the autoplay queue features is defined mainly in terms of user engagement: explicit/implicit feedback of listen-through and skip rates, library and playlist saves, click-through to the artist profile and/or album, shares, and so on. In the case of Release Radar playlists, however, the set of rewards would be very different, as users would often skim through the playlist rather than listen to it from cover to cover. So, instead of studying engagement with content, the algorithms would optimize for long-term feature retention and feature-specific behavior. "Users are satisfied with the feature if they keep coming back to it every week; users are satisfied with Release Radar if they save tracks to their playlists or libraries."

Finally, in some cases, Spotify would employ yet another set of algorithms just to devise the reward functions for a specific feature. For example, Spotify has trained a separate ML model to predict user satisfaction with Discover Weekly (with the training set sourced from user surveys). This model would look at the entire wealth of user interaction data, the user's past Discover Weekly behavior, and user goal clusters (i.e., whether the user engaged with Discover Weekly as background listening, to search for new music, to save music for later, etc.), and then produce a unified satisfaction metric based on all that activity.

The satisfaction prediction produced by the model is then, in turn, used as the reward for the algorithm that would compose Discover Weekly playlists, thus building a comprehensive reward system that doesn't rely on siloed, potentially ambiguous user signals.
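The "model as reward" idea can be sketched with a tiny logistic scoring function. Everything here is an assumption: the feature names, the weights (which in the setup described above would be learned from survey-labelled sessions), and the way per-session predictions are averaged into a playlist-level reward.

```python
import math

# Hypothetical weights of a satisfaction model
WEIGHTS = {"tracks_saved": 0.8, "skip_rate": -2.0, "minutes_listened": 0.05}
BIAS = -0.5


def predicted_satisfaction(session):
    """Logistic score between 0 and 1 for one listening session."""
    z = BIAS + sum(WEIGHTS[name] * session[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def playlist_reward(sessions):
    """Average predicted satisfaction, used as the reward for the playlist."""
    return sum(predicted_satisfaction(s) for s in sessions) / len(sessions)


# Two hypothetical sessions with the same weekly playlist
engaged = {"tracks_saved": 3, "skip_rate": 0.2, "minutes_listened": 25}
bored = {"tracks_saved": 0, "skip_rate": 0.9, "minutes_listened": 4}

print(round(predicted_satisfaction(engaged), 2))
print(round(predicted_satisfaction(bored), 2))
print(round(playlist_reward([engaged, bored]), 2))
```

The playlist-composing algorithm would then be tuned to maximize this single reward instead of juggling saves, skips, and listening time separately.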

The Spotify recommender system is an extremely complex and intricate system, with dozens (if not hundreds) of algorithms and ML models employed across various levels, all working together to create one of the most advanced recommendation experiences on the music streaming market. This system has been developed and iterated on for close to 12 years now — growing in size, capabilities, and complexity. Yet, as you can probably see, it is far from unexplainable. Even without having direct documentation describing the composition of the recommendation engine and all the secret ingredients, we can get a pretty good understanding of its main parts and the governing principles behind them.

From the music industry perspective, it is possible to use that knowledge to optimize the artist's profile within that recommender system. A meaningful, well-educated algorithmic strategy can maximize your chances of making it onto algorithmic playlists and help ensure that the engine serves your music to the right audiences, amplifying discovery and turning casual listeners into fans.

How would we go about it? We're currently hard at work building a way for artists and their teams to understand and optimize their algorithmic profiles: think of it as something of an SEO tool for streaming services. More on that in the coming months — follow our newsletter to make sure you don't miss it!