Mapping Spotify Artists

May 2023

View the source code for this project here.

Introduction

Music streaming service Spotify transformed music listening when it first launched in 2008, and has since become the go-to method for streaming and listening to new music. Gone were the days of CDs and cassettes; users could now listen to high-quality music at the touch of a button.

The rise of Spotify and other online streaming services has fundamentally changed the way we listen to music. In the past, individual songs were usually played as part of a complete album by a single artist, either on vinyl or, later, via cassette tapes. Streaming services have made it easy for listeners to mix and match individual songs by different artists and to curate and share completely personal and original playlists. In short, playlists have become the default way to listen to music.

The content in Spotify playlists can vary widely. Some Spotify users group music by genre, such as in this k-pop girl group playlist:

Others group songs according to a certain set of “vibes”, such as in this “chill vibes” playlist:

Yet others create playlists in order to set the mood for an activity, whether it be a dance party playlist, a workout playlist, or a “soft pop/rock” playlist to “take the edge off” of labor & delivery (!?):

Others are silly, such as this user-created playlist that takes you along on an emotional journey making banana bread:

or this playlist consisting of 343 hours of “Satan’s Saxophones”:

In sum, playlists are generally organized with a central theme in mind. Therefore, tracks on the same user-created playlists can be expected to have some sort of commonality.

About the Dataset

Spotify’s Million Playlist Dataset contains a million user-created Spotify playlists from between January 2010 and October 2017.

On average, playlists contain 66 tracks from 38 distinct artists and 50 different albums. Here’s a visualization of the distribution of playlist length across all the playlists:

And a visualization of the number of distinct artists represented per playlist:

Both distributions are right-skewed. More curiously, neither resembles a normal distribution: instead, the frequency appears to fall off roughly exponentially as the number of artists or tracks grows.

Artist Interactions

Across the one million playlists, there are a total of 287742 artists represented. For our analysis, however, we decided to focus on the top 2500 artists as determined by the number of occurrences of any of their songs across all playlists.
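The ranking step described above can be sketched as a simple occurrence count. This is only an illustration, not the project's actual code: the function name `top_artists` and the input layout (each playlist as a list of artist names, one entry per track) are assumptions made for the example.

```python
from collections import Counter

def top_artists(playlists, n):
    """Return the n artists with the most track occurrences across all playlists.

    Assumed (hypothetical) layout: `playlists` is a list of playlists,
    and each playlist is a list of artist names with one entry per track.
    """
    counts = Counter()
    for playlist in playlists:
        counts.update(playlist)  # every track counts once toward its artist
    return [artist for artist, _ in counts.most_common(n)]

# Toy data: artist "A" has 3 tracks overall, "B" has 2, "C" has 4.
toy_playlists = [
    ["A", "A", "B"],
    ["B", "C"],
    ["A", "C", "C", "C"],
]
print(top_artists(toy_playlists, 2))  # prints ['C', 'A']
```

In the analysis itself, `n` would be 2500.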

We were interested in the concept of “artist proximity” – in other words, how often two artists appear together in user-created playlists. To make this concept precise, we defined the “Artist Interaction Score” (AIS) between two artists to be

$\operatorname{AIS}(A,B) = \ln\left(\sum_{p\text{ in playlists}}(\text{no. tracks by }A\text{ in }p)(\text{no. tracks by }B\text{ in }p)\right).$

In other words, the more times two artists both have many songs in the same playlist, the higher their Artist Interaction Score will be.
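As a rough sketch of how this score could be computed, the snippet below sums the per-playlist products of track counts and then takes the natural logarithm. The function name `interaction_scores` and the input layout (each playlist as a list of artist names, one per track) are assumptions for illustration. Pairs that never share a playlist have a sum of zero, whose logarithm is undefined; this sketch simply leaves them out, which may differ from how the original project handled them.

```python
import math
from collections import Counter
from itertools import combinations_with_replacement

def interaction_scores(playlists):
    """Map each unordered artist pair (a, b), a <= b, to its AIS.

    Assumed (hypothetical) layout: each playlist is a list of artist
    names, one entry per track. Pairs that never co-occur are omitted.
    """
    totals = Counter()
    for playlist in playlists:
        per_artist = Counter(playlist)  # number of tracks by each artist in this playlist
        for a, b in combinations_with_replacement(sorted(per_artist), 2):
            totals[(a, b)] += per_artist[a] * per_artist[b]
    # Natural log of the summed products, as in the AIS definition.
    return {pair: math.log(total) for pair, total in totals.items()}

toy_playlists = [
    ["A", "A", "B"],       # A: 2 tracks, B: 1 track -> contributes 2 * 1 to (A, B)
    ["B", "C"],
    ["A", "C", "C", "C"],
    ["D"],
]
scores = interaction_scores(toy_playlists)
print(scores[("A", "B")])   # ln(2)
print(("A", "D") in scores)  # False: A and D never share a playlist
```

Since the score is symmetric in the two artists, storing each unordered pair once is enough to fill both halves of a square matrix.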

We then created a 2500 by 2500 matrix containing artist interaction scores for each of the 6250000 ordered pairs of artists. After ordering the rows and columns by artist popularity and coloring cells by value (darker cells correspond to higher values), we obtained the following heatmap:

It makes sense that the highest values occur towards the top and the left of the matrix, since artists are sorted with respect to metrics of playlist inclusion and, broadly speaking, the total artist interaction score increases the more an artist is included in a playlist. However, the fact that the top left corner still contains the highest values indicates that popular artists are still more frequently put in playlists with other popular artists. Furthermore, the moderately high values along the top and left edges of the matrix indicate that even smaller artists appear with large artists more than they do with other small artists.

PCA (Principal Component Analysis)

Principal Component Analysis (PCA) is a technique for analyzing and visualizing high-dimensional data, such as our artist interaction matrix. Broadly speaking, the method attempts to increase data interpretability by reducing the number of dimensions (usually to just 2 dimensions) while still preserving as much information as possible.

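As a general principle, PCA projects the data onto the directions along which it varies the most. In the lower-dimensional projection, points that were close together stay close together, and points that were far apart stay far apart to the extent that two dimensions allow.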

After normalizing our data by standardizing the columns to have a mean of 0 and a standard deviation of 1 to lessen the effects of the large artist biases, we conducted PCA on our dataset to reduce it from a 2500-dimensional dataset to a 2-dimensional map.
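The standardize-then-project step can be sketched with NumPy as follows. This is a minimal illustration and not necessarily how the project implemented it: the function name `pca_2d` is hypothetical, the PCA is done via a singular value decomposition, and the input is assumed to contain only finite values with no constant columns (so every column has a nonzero standard deviation).

```python
import numpy as np

def pca_2d(matrix):
    """Standardize each column, then project the rows onto the top two principal components.

    Assumes every entry is finite and no column is constant.
    """
    X = np.asarray(matrix, dtype=float)
    # Standardize: each column gets mean 0 and standard deviation 1.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # The rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T  # shape: (number of rows, 2)

# Toy stand-in for the interaction matrix: 6 "artists" with 4 features each.
rng = np.random.default_rng(0)
toy_matrix = rng.normal(size=(6, 4))
coords = pca_2d(toy_matrix)
print(coords.shape)  # prints (6, 2)
```

Each row of `coords` is one point on the two-dimensional map; for the real data there would be 2500 rows, one per artist.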

Limitations of our Analysis

One limitation of our analysis is that it defines its measures of artist popularity and artist interaction entirely through user-created playlists. As a result, certain genres of music are potentially underrepresented in the data: for instance, songs from genres such as “album rock” are better listened to in the context of the other songs on their album. Conversely, songs on more recently released albums are more frequently written with playlist inclusion in mind, so they are potentially overrepresented in our data. It is difficult, however, to know how severely these issues affect our data and analysis, both because it is hard to find objective measures of the already-vague notions of “artist popularity” and “artist interaction” and because of the sheer size of the data.

Another limitation of our analysis comes from the limitations of PCA itself. Our implementation of PCA projects high-dimensional data onto a two-dimensional plane and does well at grouping “similar” artists together, but the interpretation of the axes of the PCA map as a whole is unclear.

Finally, the dataset we are working from, while large, is restricted to playlists created by U.S. Spotify users between 2010 and 2017. While a U.S.-focused dataset could potentially show clearer patterns as opposed to a more global dataset because of the relatively homogeneous culture of American music consumption, our analysis does not account for or include artists that rose to prominence after 2017 and is therefore slightly out of date. However, with more data, the analysis could easily be applied to a more recent dataset.

The PCA Map

Finally, we get to the PCA map of the top 2500 artists.

A note on artist colors

To assign artist colors by genre, we created a list of broad genres that would have their own individual colors. We used the Spotify API to generate a list of raw genres for each artist, then used a substring search to categorize these raw genres into the broad genres.

In the case where artists had two or more raw genre tags that fell under different broad genres, we colored their associated data point with the color of the genre that came first alphabetically, which creates a slight bias in the colors; however, given the size of the data it would have been difficult to assign genres manually.
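The coloring rule described in this section can be sketched as below. The broad genre list, the function name `broad_genre`, and the `"other"` fallback for artists with no match are all assumptions made for illustration; the raw genre strings stand in for the genre tags returned by the Spotify API for an artist.

```python
# Hypothetical list of broad genres; the project's actual list may differ.
BROAD_GENRES = ["country", "hip hop", "indie", "pop", "rock"]

def broad_genre(raw_genres, broad_genres=BROAD_GENRES, default="other"):
    """Pick one broad genre for an artist from their raw genre tags.

    A broad genre matches if it appears as a substring of any raw tag.
    If several match, the alphabetically first one wins, which mirrors
    the tie-breaking rule described above. `default` is an assumed
    fallback for artists with no matching tag.
    """
    matches = {b for b in broad_genres for raw in raw_genres if b in raw}
    return min(matches) if matches else default

print(broad_genre(["dance pop", "pop rap"]))          # prints pop
print(broad_genre(["alternative rock", "indie pop"]))  # prints indie
print(broad_genre(["classical"]))                      # prints other
```

The second call shows the alphabetical bias in action: the artist matches "indie", "pop", and "rock", and is colored as "indie" only because it sorts first.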

The Map Itself

Here’s the PCA map of the top 2500 artists in its entirety, with artists colored by genre.

In the top left, we have a pretty large hip hop/rap cluster, with artists such as Flo Rida, T-Pain, and Eminem located at the top part of the cluster.

As we move down the hip hop/rap cluster, we start seeing some more EDM and dance pop-adjacent artists such as Major Lazer, Zedd, and Marshmello.

At the bottom of the graph is the “indie” cluster. The indie cluster features artists in the alternative rock genre, such as Arctic Monkeys and Radiohead, dreampop, indietronica, and shoegaze artists such as Glass Animals, and folk and indie pop artists such as Bon Iver. It’s generally characterized by a slower, guitar-heavy, and sometimes even dreamy sound.

The right side of the map contains a very large and broad rock cluster. Rock is a generally diverse genre, which is reflected in its breadth on the PCA map.

Towards the bottom of the rock cluster are artists that fall under the “art rock” genre such as David Bowie, Pink Floyd, and The Beatles. Like the indie cluster below and to the left of this sub-cluster, these artists have a more mellow and soft rock sound.

A bit to the left of the art rock sub-cluster, closer to the center, are artists in the rock genre that are more alternative and funk rock-adjacent, with artists such as Red Hot Chili Peppers, Weezer, and Nirvana.

A bit higher up are punk rock, skate punk, and grunge artists, such as Green Day and My Chemical Romance.

To the right of these artists are classic rock and album rock artists such as Queen, The Rolling Stones, and the Eagles.

Finally, the top of the rock cluster contains metal-adjacent rock artists such as AC/DC, Guns N’ Roses, and Metallica.

Along the top of the graph, next to the rock cluster, is a small country cluster featuring artists such as Carrie Underwood and Blake Shelton. Many artists who produce Christian music are also located in that cluster.

Finally, artists considered “pop” are scattered all around the map. This likely comes from the fact that “pop” is a pretty generic tag, and encompasses a wide range of musical sounds. Furthermore, many of the genres for which we already have classifications have an associated “pop” version of that genre, so that pop has become a sort of default genre assignment under our genre assignment scheme for some artists.

Interactive PCA map

Finally, here is the interactive PCA map in all its glory.

The interactive defaults to displaying only the top 500 artists for performance reasons; to display more artists, click the settings icon on the bottom left (the interactive may lag).

You can also open the visualization in a new window.

If the data doesn't appear soon, consider reloading the page.

(Also, loading takes longer on mobile for some reason.)