Home..

Metal Music Principle Component Analysis

<!DOCTYPE html>

Metalmusic PCA

In this post, I will explain how I performed a small analysis on the characteristics of songs from albums in the Rolling Stone top 100 greatest metal albums of all times. I was inspired to do this after I followed this tutorial.

Loading libraries

As always with R, the needed libraries need to be loaded before functions of those libraries can be used:

# Load the necessary libraries
library(dplyr)
library(tidyverse)
library(ggplot2)
library(reshape2)
library(broom)
library(stringr)
library(ggrepel)
library(ggtext)
library(glue)
library(showtext)
library(patchwork)

Another thing to load in advance are the fonts needed to create the PCA plot:

# Load fonts
font_add_google("Acme", "acme")
font_add_google("Raleway", "raleway")
font_add_google("Cabin Condensed","cabin")
font_add_google("Barlow Condensed","barlow")
font_add_google("Stint Ultra Condensed","stint")
font_add_google("Poppins","poppins")
font_add_google("Pacifico","pacifico")
font_add_google("Oswald","oswald")

# Automatically use {showtext} for plots
showtext_auto()

Importing data

First, I imported the data from TSV files obtained from a dataset on data.world. These datasets contain data about the Rolling Stone top 100 best metal albums of all time. For each album, metadata such as the artist, name, release year, duration, genre, rating and more is included. For each of the songs on the album, information about the individual characteristics is included, such as BPM, popularity and various metrics about how the song sounds (energy, dance, loud, valence etc.).

# Read in album dataset
raw_albums <- read.delim("RS_metal_albums.txt")

# View results
head(raw_albums)
# Read in song dataset
raw_songs <- read.delim("RS_metal_songs.txt")

# View results
head(raw_songs)

Some of the headers in both datasets contained spaces, which are automatically replaced by a “.” by R. I replaced these with an underscore (“_“):

# Change column names of raw_albums
colnames(raw_albums)[4] <- "Release_Year"
colnames(raw_albums)[11] <- "Total_Seconds"
colnames(raw_albums)[13] <- "Sub_Metal_Genre"
colnames(raw_albums)[15] <- "Rolling_Stone_Rating"

# Change column names of raw_songs
colnames(raw_songs)[1] <- "Song_Index"
colnames(raw_songs)[4] <- "Track_No"

The dataset with the songs does not contain any data on the release year of the album, but the album dataset does. I needed to add a column to the songs dataset that shows the release year of the album on which each of the songs is. I did this by using the “AlbumID_Rank” column which corresponds between the 2 datasets. The method I used for this operation is from the dplyr package, and is called left_join. I first selected the relevant columns from raw_albums (AlbumID_Rank and Release_Year) and then joined it to raw_songs using the AlbumID_Rank column.

# Left join on AlbumID_Rank to add the Release_Year to raw_songs
songs_with_year <- raw_songs %>%
  left_join(raw_albums %>% select(AlbumID_Rank, Release_Year), by = "AlbumID_Rank")

# View result
head(songs_with_year)

This resulted in a new dataset (songs_with_year) with a column appended to the raw_songs dataset. This column contains the release year of the album on which the song is.

Another column that might come in useful is the Sub_Metal_Genre column of the raw_albums dataset; I also performed a left_join on this data:

# Left join on AlbumID_Rank to add the Sub_Metal_Genre to raw_songs
songs_with_year_genre <- songs_with_year %>%
  left_join(raw_albums %>% select(AlbumID_Rank, Sub_Metal_Genre), by = "AlbumID_Rank")

# View result
head(songs_with_year_genre)

To choose which variables are going to be included in the PCA, I wanted to check which variables correlate the most with each other. For this, I created a correlation matrix with the cor function for all numeric columns of the songs_with_year_genre dataset with the code below:

# Select only numeric columns
numeric_data <- songs_with_year_genre %>%
  select_if(is.numeric)

# Calculate correlation matrix
cor_matrix <- cor(numeric_data, use = "complete.obs")

To get a visual insight into the correlation, I created a simple heatmap of the correlation values with ggplot2:

# Convert correlation matrix to a long format for ggplot2
cor_data <- melt(cor_matrix)

# Plot heatmap
ggplot(cor_data, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") + # White space between tiles
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", # Heatmap color range
                       midpoint = 0, limit = c(-1, 1), space = "Lab", # Midpoint 0 and fill between -1 and 1
                       name = "Correlation") # Legend name

This heatmap does not look the best… There are no labels in the plot and the x-axis labels overlap. Let’s improve it:

# Plot heatmap, but pretty
ggplot(cor_data, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") + # White space between tiles
  geom_text(aes(label = round(value, 2)), size = 2, color = "black") + # Add labels on tiles
  scale_fill_gradient2(low = "blue", high = "red", mid = "white",
                       midpoint = 0, limit = c(-1, 1), space = "Lab",
                       name = "Correlation") +
  theme_minimal() + # Set theme
  theme(text = element_text(size = 9),axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) + # Adjust axis labels
  labs(title = "Metalsongs Correlation Matrix Heatmap", # Plot title
       x = "Variables", # Axis title
       y = "Variables") # Axis title

Now, we can see that Song_index and AlbumID_Rank are highly correlated, which makes sense because the song index values were assigned based on the songs sorted based on the album ranking. In addition, Loud and Energy are highly correlated, which was also to be expected because loud songs are also very energetic most of the times. To reduce redundancy, I will only use one of these two categories: Energy for the PCA. 2 combinations that are highly negatively correlated are Acoustic and Loud and Acoustic and Energy. Again, this makes sense because within the metal genre, most acoustic songs are not loud and do not contain a lot of energy.

A bit more unexpected is the correlation between Release_Year and AlbumID_Rank (and Song_Index). This indicates that the year an album was released influences the position on the Rolling Stone top 100. Let’s see if I can further analyze this correlation during the PCA.

Selecting PCA data

To perform a PCA, I had to select the variables with numerical values of interest from the data. These included the ranking place of the album, but also information about the characteristics of each song and the year they were released. I also added Song and Artist because this will be usefull when labeling the different datapoints. To select the variables that are interesting for our PCA from the data, I used the code below:

# New data for PCA
PCA1_data<-songs_with_year_genre%>%
  select(
    # Interesting variables for PCA
    c(AlbumID_Rank,BPM,Dance,
      Loud,Valence,Acoustic,Popularity,
      Release_Year,Song,Artist)
  )%>%
  # Remove NA entries
  drop_na()

Performing the PCA

After these steps I was ready to perform the PCA, which I did using the prcomp function:

# Remove the non-numerical artist and song name data for the PCA
PCA1_rank <- PCA1_data%>%
  select(-Artist,-Song)%>%
  # Perform PCA 
  prcomp(scale=TRUE)

I used another package (broom) to view the results in the tidyverse syntax:

# View results in eigenvalues matrix
PCA1_rank%>%
  tidy(matrix = "eigenvalues")

As the tibble shows, PCA1 accounts for 26% of the overall variability, and PCA2 for 22%. I wanted to visualize how the different categories separate the data, before plotting the actual datapoints of each song.

Plotting variables

To see what variables account for the variability in the plot, I extracted the coordinates of the variables from the “rotation” matrix of the PCA1_var data:

# New dataframe for variables
PCA1_var<-PCA1_rank %>%
  # Extract variable coords
  tidy(matrix = "rotation") %>%
  # Format table long to wide
  pivot_wider(names_from = "PC", names_prefix = "PC", values_from = "value") %>%
  # Rename column with variable names
  rename(Variable=column)
  # Check tibble
  head(PCA1_var)

To actually plot the variables, I used the ggrepel library because this prevents errors when variable names overlap. I chose a different color for the AlbumID_Rank variable, because it is interesting to see how this variable is influenced by the others.

ggplot(data = PCA1_var, aes(PC1, PC2)) +
  # Add variable arrows with conditional color
  geom_segment(
    aes(color = ifelse(Variable == "AlbumID_Rank", "black", "red")),
    xend = 0, yend = 0,
    arrow = arrow(
      length = unit(0.03, "npc"),
      type = "closed",
      ends = "first"
    )
  ) +
  # Add variable names with conditional color
  geom_text_repel(
    aes(label = Variable, color = ifelse(Variable == "AlbumID_Rank", "black", "red")),
    hjust = 1, size = 3,
    min.segment.length = Inf,
    nudge_x = 0.01, nudge_y = 0.08
  ) +
  coord_fixed() +
  labs(
    # Add titles
    title = 'Plot of PCA Variables Top 100 Metal Albums',
    x = 'PC1',
    y = 'PC2',
    color = 'Variable'
  ) +
  scale_color_identity() +
  theme_minimal()

This plot of variables again shows that the AlbumID_Rank is mostly influenced by Release_Year, but also that Acoustic and Loud separate a lot of songs. With a larger plot in mind, I wanted to remove the information about the PC numbers, and make the plot look more visually appealing so it can be shown together with the PCA plot:

pretty_var_plot <- ggplot(data = PCA1_var, aes(PC1, PC2)) +
  # Add variable arrows with conditional color
  geom_segment(
    aes(color = ifelse(Variable == "AlbumID_Rank", "black", "#c93434")), # Black and red arrows
    xend = 0, yend = 0,
    arrow = arrow(
      length = unit(0.05, "npc"),
      type = "closed",
      ends = "first"
    ),
    linewidth = 0.9, # Wide lines
  ) +
  # Add variable names
  geom_text_repel(
    aes(label = Variable, color = ifelse(Variable == "AlbumID_Rank", "black", "#c93434")), # Black and red names
    hjust = 1, size = 3.5,
    min.segment.length = Inf,
    nudge_x = 0.01, nudge_y = 0.08
  ) +
  # Add dashed line for y axis
  annotate(
    geom = "segment", x = 0, xend = 0, y = -Inf, yend = Inf, col = "grey60", lty = "dashed",
    linewidth = 0.5
  ) +
  # Add dashed line for x axis
  annotate(
    geom = "segment", x = Inf, xend = -Inf, y = 0, yend = 0, col = "grey60", lty = "dashed",
    linewidth = 0.5
  ) +
  # Keep fixed aspect ratio
  coord_fixed() +
  # Maintain color identity and remove legend
  scale_color_identity() +
  guides(color = "none") +
  theme_minimal() +
  theme(
    panel.grid = element_blank(),
    # No axis labels
    axis.text.x = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.y = element_blank()
  )
pretty_var_plot

Plotting songs

To actually create a plot of all songs included in the first PCA, I used ggplot again. Of course I used red and black in the color palette, this is a metal data analysis after all :).

# Add rank data back to PCA calculated data
PCA1_indiv_rank<-PCA1_rank%>%
  augment(PCA1_data)

# Plot individuals
ggplot(
  # Based on calculated PCA data
  data=PCA1_indiv_rank,
  # Use PCA values and color based on release year
  aes(.fittedPC1, .fittedPC2,color=Release_Year))+
  geom_point()+
  # Add labels and color based on year
  labs(
    title = 'Plot of individual songs',
    subtitle = 'Color shows year of album release of songs',
    x='PC1 (26%)',
    y='PC2 (22%)',
    color='Release_Year'
  )+
  # Gradient with reds
  scale_color_gradient(low = "black", high = "red") +
  # Light theme
  theme_light()

The plot shows that PC1 is the principal component that is able to distinguish the songs the best based on the album release year.

Now, I wanted to try and make the plot contain more information and also look a bit prettier…

Optimizing the plot

The first bit of information I thought about adding to the plot was the song names. I started with trying to add all labels to the plot with the code below, but as was to be expected, it became a big mess:

# Check if song names are valid and handle special characters
PCA1_indiv_rank <- PCA1_indiv_rank %>%
  # Check that song is a character vector
  mutate(Song = as.character(Song)) %>%
  # Remove not accepted characters
  mutate(Song = iconv(Song, "UTF-8", "ASCII", sub = "")) %>%
  mutate(Song = gsub("[^[:alnum:] ]", "", Song))

# Plot the PCA with all song name labels, allowing overlaps
ggplot(PCA1_indiv_rank, aes(.fittedPC1, .fittedPC2, label = Song)) +
  geom_point(aes(color = Release_Year)) +  # Add points colored by release year
  geom_text(aes(label = Song), size = 3, vjust = -0.5, hjust = 0.5) +  # Add song labels directly (no overlap avoidance)
  # Add labels and color based on year
  labs(
    title = 'PCA of Songs',
    subtitle = 'Color shows year of album release',
    x = 'PC1',
    y = 'PC2',
    color = 'Release Year'
  ) +
  # Color gradient
  scale_color_gradient(low = "black", high = "red") +
  theme_minimal()

Based on this plot, I noticed two major problems. The first one being that labeling every single data point leads to overlaps and clutter. The second one being that some song names are unnecessary long due to them being remastered versions. I started with trying to fix the first problem.

Plotting again

Let’s see if reducing the amount of labels to only one for each album and shortening remastered song names made a difference in making the plot look better:

# Plot the PCA with non-overlapping labels for popular songs
ggplot(PCA1_indiv_rank, aes(.fittedPC1, .fittedPC2)) +
  geom_point(aes(color = Release_Year)) +  # Add points colored by release year
  geom_text_repel(
    data = subset(PCA1_indiv_rank, Popular == TRUE),  # Only label popular songs
    aes(label = Song),
    size = 3, # Text size
    box.padding = 0.35,  # Padding around the text box
    point.padding = 0.5,  # Space between the label and the point
    max.overlaps = Inf,  # Allow all labels to attempt to display
    segment.color = 'grey50',  # Line color
    segment.size = 0.5  # Line thickness
  ) +
  # Add labels and color based on year
  labs(
    title = 'PCA of Songs',
    subtitle = 'Color shows year of album release; labels for popular songs',
    x = 'PC1',
    y = 'PC2',
    color = 'Release Year'
  ) +
  # Colors and theme 
  scale_color_gradient(low = "black", high = "red") +
  theme_minimal()

Well, this still looks very chaotic… Maybe it’s better to manually select songs that have to be labelled in the graph. For this, I selected my personal favorites from the most popular metal songs.

# Manually specify the song names to be labeled
selected_songs <- c("Chop Suey","Breaking the Law","Master Of Puppets","Would","Nothing Else Matters","Blind","People  Shit","Transilvanian Hunger","Walk","Bat Country","Raining Blood","Killing In The Name")

# New dataset to include only the selected songs
popular_favorites <- popular_no_remaster %>%
  filter(Song %in% selected_songs)

# Add a flag for favorite songs in the original PCA dataset
PCA1_indiv_rank <- PCA1_indiv_rank %>%
  mutate(Favorite = ifelse(Song %in% popular_favorites$Song, TRUE, FALSE))

# Plot the PCA with repelled labels for popular songs
neat_pca <- ggplot(PCA1_indiv_rank, aes(.fittedPC1, .fittedPC2)) +
  geom_point(aes(color = Release_Year)) +  # Add points colored by release year
  geom_text_repel(
    data = subset(PCA1_indiv_rank, Favorite == TRUE),  # Only label favorite songs
    aes(label = Song),
    size = 3,
    nudge_x = -0.01,
    nudge_y = -0.01,
    box.padding = 0.4,  # Padding around the text box
    point.padding = 0,  # Space between the label and the point
    max.overlaps = Inf,  # Allow all labels to attempt to display
    segment.color = 'black',  # Line color
    segment.size = 0.6  # Line thickness
  ) +
  labs(
    title = 'PCA of Songs',
    subtitle = 'Color shows year of album release; labels for favorite songs',
    x = 'PC1',
    y = 'PC2',
    color = 'Release Year'
  ) +
  scale_color_gradient(low = "black", high = "red") +
  theme_minimal()
neat_pca

Creating the ideal plot

I already really liked this version of the plot, but I wanted to make it look more styled. In the code below, I played around with different looks for the datapoints, labelling mechanisms and fonts.

pretty_pca_plot1 <- ggplot(PCA1_indiv_rank, aes(.fittedPC1, .fittedPC2, color = Release_Year)) + # Map color to a variable
  geom_point(alpha=0.2,size = 2) +
  scale_color_gradient(low = "black", high = "red") +
  scale_x_continuous(limits = c(-5, 7)) + # Plot size x
  scale_y_continuous(limits = c(-7.8, 10))+ # Plot size y
  # Labels that can not overlap and are not on the datapoint
  geom_text_repel( 
    data = subset(PCA1_indiv_rank, Favorite == TRUE), # Only label favorite songs
    aes(label = paste(Song, "\n", Artist, sep = ""), fontface = "bold.italic"), # Song and artist label
    size = 2.8,  # Text size
    box.padding = 0.6,  # Increase padding around text boxes
    point.padding = 0.1,  # Padding between points and labels
    force = 20,  # Increase repulsion between labels
    force_pull = 0.1,  # Slight pull to the data points, ensuring alignment
    min.segment.length = 0.5,  # Avoid overly short segment lines
    max.overlaps = Inf,  # Ensure all labels are shown
    segment.color = 'black',  # Line color
    segment.size = 0.3  # Thinner segment lines
  )
# View
pretty_pca_plot1

I liked the styling of the points and labels here, but I wanted the axes to run through the middle of the data instead of being on the sides. I did this by creating some arrow styles and adding them manually:

arrow_style <- arrow(
  angle = 20, ends = "first", type = "closed", length = grid::unit(4, "pt")
)

arrow_style_ind <- arrow(
  angle = 45, 
  length = grid::unit(8, "pt")
)

pretty_pca_plot2 <- pretty_pca_plot1 +
  # x-axis line
  annotate(
    geom = "segment", x = -5, xend = 6, y = 0, yend = 0,
    lty = "dashed",
    col = "grey60",
    size = 0.5
  ) +
  # x-axis arrow
  annotate(
    geom = "segment", x = 5.9, xend = 6, y = 0, yend = 0,
    col = "grey60",
    size = 0.5, arrow = arrow_style_ind, lineend = 'round', linejoin = 'round'
  ) +
  # x-axis label
  annotate(
    geom = "text", x = 4.7, y = 0.25, label = "Principal Component 1 (26%)",
    col = "dimgrey", size = 3, family = "sans", fontface = "italic"
  ) +
  # y-axis label
  annotate(
    geom = "text", x = -0.1, y = -7.7, label = "Principal Component 2 (22%)",
    col = "dimgrey", size = 3, family = "sans", angle = 90, vjust = 0, hjust = 0, fontface = "italic"
  ) +
  # y-axis
  annotate(
    geom = "segment", x = 0, xend = 0, y = -7.8, yend = 5, col = "grey60", lty = "dashed",
    size = 0.5
  ) +
  # y-axis arrow
  annotate(
    geom = "segment", x = 0, xend = 0, y = 4.9, yend = 5, col = "grey60",
    size = 0.5, arrow = arrow_style_ind, lineend = 'round', linejoin = 'round'
  )
pretty_pca_plot2

Now, there are too many axes and the background does not fit the style anymore. I fixed it like this:

pretty_pca_plot3 <- pretty_pca_plot2 +
  # Set minimal theme
  theme_minimal() +
  theme(
    panel.grid = element_blank(), # No grid
    # No axis labels
    axis.text.x = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.y = element_blank(),
  )
pretty_pca_plot3

This plot now really misses some additional information, such as what type of data we are looking at, what kind of analysis was performed and the data source. I also incorporated the legend into the subtitle.

# Create title object
maintitle<-cbind.data.frame(
  # Position on left top
  x = -5, y = 9.5,
  # Glue for HTML styling
  label = glue("<span style='color:darkred'>PCA Rolling Stone top 100 metal albums of all time</span>")
)

# Subtitle object
subtitle <- cbind.data.frame(
  # Position below title
  x = -5, y = 8.5,
  # Glue for HTML styling, include legend colors
  label = glue("<span style='color:#2b2a30'>Principal component analysis of metal songs' features from the<br>Rolling Stone top 100 Metal Albums from </span><b style='color:black'> 1970 </b><span style='color:#2b2a30'>to</span><b style='color:red'> 2013 </b><span style='color:#2b2a30'>.<br>Each dot stands for a song, with title and artist for some favorite tracks.</span> ")
)

pretty_pca_plot4 <- pretty_pca_plot3 +
  annotate(
    geom = "text", x = 6, y = -7.5, hjust = 1, color = "dimgrey",
    label = "Data: data.world",
    size = 2, family = "cabin", lineheight = 0.4
  ) +
  geom_richtext(
    data = maintitle,
    aes(x = x, y = y, label = label),
    inherit.aes = FALSE, lineheight = 0.5,
    alpha = 1, size = 7, family = 'cabin', hjust = 0, vjust = 1,
    fill = NA, label.color = NA, 
    label.padding = grid::unit(rep(0, 4), "pt")
  ) +
  geom_richtext(
    data = subtitle,
    aes(x = x, y = y, label = label),
    inherit.aes = FALSE, lineheight = 0.5,
    alpha = 1, size = 5, family = 'cabin', hjust = 0, vjust = 1,
    fill = NA, label.color = NA, 
    label.padding = grid::unit(rep(0, 4), "pt")
  ) +
  # Remove original legend
  theme(
    legend.position = "none"
  )
pretty_pca_plot4

This looks good! But, I have also created a plot of variables that I want to include in this figure. I used patchwork to combine these two plots into one.

# Make plot with patchwork
layout = c(
  area(t = 0, l = 0, b = 10, r = 15), # Large area
  area(t = 0, l = 6, b = 5, r = 17) # Smaller area in right corner
)  
pretty_pca_plot4+pretty_var_plot+plot_layout(design=layout)

Okay, one final touch! I want to help the viewer understand the plot of variables better, so I want to add a description of the variable plot to the figure:

pretty_pca_plot5 <- pretty_pca_plot4 +
  annotate(
  geom = "text", x = 3.2, y = 3.3, hjust = 0.5, label = "Album ranking is\n heavily influenced\n by release year",
  size = 3.8, family = "cabin", lineheight = 1
  )
# Make plot with patchwork
layout2 = c(
  area(t = 0, l = 0, b = 10, r = 15), # Large area
  area(t = 0, l = 6, b = 5, r = 17) # Smaller area in right corner
) 
pretty_pca_plot5+pretty_var_plot+plot_layout(design=layout)

Unfortunately, some of the label positions keep changing due to the use of patchwork, but this is the best I could achieve with my skills.

This plot shows that the release year of the album of the songs correlates the most with the rank of the albums, which shows that Rolling Stone ranked older metal albums higher than the newer metal albums. It also shows that my personal taste differs leans towards the left and darker side of the main cloud of songs, which means that I prefer older songs too new ones.

This document will be improved in the future!

---
title: "Metalmusic PCA"
output: html_notebook
---
In this post, I will explain how I performed a small analysis on the characteristics of songs from albums in the Rolling Stone top 100 greatest metal albums of all times. I was inspired to do this after I followed [this tutorial](https://bjnnowak.netlify.app/2021/09/15/r-pca-with-tidyverse/).

# Loading libraries
As always with R, the needed libraries need to be loaded before functions of those libraries can be used:
```{r libraries, message=FALSE, warning=FALSE}
# Load the necessary libraries
library(dplyr)
library(tidyverse)
library(ggplot2)
library(reshape2)
library(broom)
library(stringr)
library(ggrepel)
library(ggtext)
library(glue)
library(showtext)
library(patchwork)
```

Another thing to load in advance are the fonts needed to create the PCA plot:
```{r fonts, message=FALSE}
# Load fonts
font_add_google("Acme", "acme")
font_add_google("Raleway", "raleway")
font_add_google("Cabin Condensed","cabin")
font_add_google("Barlow Condensed","barlow")
font_add_google("Stint Ultra Condensed","stint")
font_add_google("Poppins","poppins")
font_add_google("Pacifico","pacifico")
font_add_google("Oswald","oswald")

# Automatically use {showtext} for plots
showtext_auto()
```
# Importing data
First, I imported the data from TSV files obtained from a [dataset on data.world](https://data.world/kcmillersean/rs-100-greatest-metal-albums-of-all-time). These datasets contain data about the Rolling Stone top 100 best metal albums of all time. For each album, metadata such as the artist, name, release year, duration, genre, rating and more is included. For each of the songs on the album, information about the individual characteristics is included, such as BPM, popularity and various metrics about how the song sounds (energy, dance, loud, valence etc.).
```{r load_album_data}
# Read in album dataset
raw_albums <- read.delim("RS_metal_albums.txt")

# View results
head(raw_albums)
```
```{r load_song_data}
# Read in song dataset
raw_songs <- read.delim("RS_metal_songs.txt")

# View results
head(raw_songs)
```
Some of the headers in both datasets contained spaces, which are automatically replaced by a "." by R. I replaced these with an underscore ("_"):
```{r remove_underscores}
# Change column names of raw_albums
colnames(raw_albums)[4] <- "Release_Year"
colnames(raw_albums)[11] <- "Total_Seconds"
colnames(raw_albums)[13] <- "Sub_Metal_Genre"
colnames(raw_albums)[15] <- "Rolling_Stone_Rating"

# Change column names of raw_songs
colnames(raw_songs)[1] <- "Song_Index"
colnames(raw_songs)[4] <- "Track_No"
```
The dataset with the songs does not contain any data on the release year of the album, but the album dataset does. I needed to add a column to the songs dataset that shows the release year of the album on which each of the songs is. I did this by using the "AlbumID_Rank" column which corresponds between the 2 datasets. The method I used for this operation is from the `dplyr` package, and is called `left_join`. I first selected the relevant columns from raw_albums (`AlbumID_Rank` and `Release_Year`) and then joined it to `raw_songs` using the `AlbumID_Rank` column. 
```{r ljoin_years}
# Left join on AlbumID_Rank to add the Release_Year to raw_songs
songs_with_year <- raw_songs %>%
  left_join(raw_albums %>% select(AlbumID_Rank, Release_Year), by = "AlbumID_Rank")

# View result
head(songs_with_year)
```
This resulted in a new dataset (`songs_with_year`) with a column appended to the `raw_songs` dataset. This column contains the release year of the album on which the song is. 

Another column that might come in useful is the `Sub_Metal_Genre` column of the `raw_albums` dataset; I also performed a `left_join` on this data: 
```{r ljoin_genre}
# Left join on AlbumID_Rank to add the Sub_Metal_Genre to raw_songs
songs_with_year_genre <- songs_with_year %>%
  left_join(raw_albums %>% select(AlbumID_Rank, Sub_Metal_Genre), by = "AlbumID_Rank")

# View result
head(songs_with_year_genre)
```
To choose which variables are going to be included in the PCA, I wanted to check which variables correlate the most with each other. For this, I created a correlation matrix with the `cor` function for all numeric columns of the `songs_with_year_genre` dataset with the code below:

```{r calculate_cormatrix}
# Select only numeric columns
numeric_data <- songs_with_year_genre %>%
  select_if(is.numeric)

# Calculate correlation matrix
cor_matrix <- cor(numeric_data, use = "complete.obs")
```
To get a visual insight into the correlation, I created a simple heatmap of the correlation values with `ggplot2`:
```{r plot_cormatrix}
# Convert correlation matrix to a long format for ggplot2
cor_data <- melt(cor_matrix)

# Plot heatmap
ggplot(cor_data, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") + # White space between tiles
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", # Heatmap color range
                       midpoint = 0, limit = c(-1, 1), space = "Lab", # Midpoint 0 and fill between -1 and 1
                       name = "Correlation") # Legend name
```
This heatmap does not look the best... There are no labels in the plot and the x-axis labels overlap. Let's improve it:
```{r plot_cormatrix_pretty}
# Plot heatmap, but pretty
ggplot(cor_data, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") + # White space between tiles
  geom_text(aes(label = round(value, 2)), size = 2, color = "black") + # Add labels on tiles
  scale_fill_gradient2(low = "blue", high = "red", mid = "white",
                       midpoint = 0, limit = c(-1, 1), space = "Lab",
                       name = "Correlation") +
  theme_minimal() + # Set theme
  theme(text = element_text(size = 9),axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) + # Adjust axis labels
  labs(title = "Metalsongs Correlation Matrix Heatmap", # Plot title
       x = "Variables", # Axis title
       y = "Variables") # Axis title
```
Now, we can see that `Song_index` and `AlbumID_Rank` are highly correlated, which makes sense because the song index values were assigned based on the songs sorted based on the album ranking. In addition, `Loud` and `Energy` are highly correlated, which was also to be expected because loud songs are also very energetic most of the times. To reduce redundancy, I will only use one of these two categories: `Energy` for the PCA. 2 combinations that are highly negatively correlated are `Acoustic` and `Loud` and `Acoustic` and `Energy`. Again, this makes sense because within the metal genre, most acoustic songs are not loud and do not contain a lot of energy. 

A bit more unexpected is the correlation between `Release_Year` and `AlbumID_Rank` (and `Song_Index`). This indicates that the year an album was released influences the position on the Rolling Stone top 100. Let's see if I can further analyze this correlation during the PCA.

# Selecting PCA data
To perform a PCA, I had to select the variables with numerical values of interest from the data. These included the ranking place of the album, but also information about the characteristics of each song and the year they were released. I also added `Song` and `Artist` because this will be usefull when labeling the different datapoints.
To select the variables that are interesting for our PCA from the data, I used the code below:
```{r selecting}
# New data for PCA
PCA1_data<-songs_with_year_genre%>%
  select(
    # Interesting variables for PCA
    c(AlbumID_Rank,BPM,Dance,
      Loud,Valence,Acoustic,Popularity,
      Release_Year,Song,Artist)
  )%>%
  # Remove NA entries
  drop_na()
```
# Performing the PCA
After these steps I was ready to perform the PCA, which I did using the `prcomp` function:

```{r PCA1_rank}
# Remove the non-numerical artist and song name data for the PCA
PCA1_rank <- PCA1_data%>%
  select(-Artist,-Song)%>%
  # Perform PCA 
  prcomp(scale=TRUE)
```

I used another package (`broom`) to view the results in the `tidyverse` syntax:

```{r access_PCA1_rank}
# View results in eigenvalues matrix
PCA1_rank%>%
  tidy(matrix = "eigenvalues")
```
As the tibble shows, PCA1 accounts for 26% of the overall variability, and PCA2 for 22%. I wanted to visualize how the different categories separate the data, before plotting the actual datapoints of each song.

# Plotting variables
To see what variables account for the variability in the plot, I extracted the coordinates of the variables from the "rotation" matrix of the `PCA1_var` data:
```{r PCA1_variables}
# New dataframe for variables
PCA1_var<-PCA1_rank %>%
  # Extract variable coords
  tidy(matrix = "rotation") %>%
  # Format table long to wide
  pivot_wider(names_from = "PC", names_prefix = "PC", values_from = "value") %>%
  # Rename column with variable names
  rename(Variable=column)
  # Check tibble
  head(PCA1_var)
```
To actually plot the variables, I used the `ggrepel` library because this prevents errors when variable names overlap.  I chose a different color for the `AlbumID_Rank` variable, because it is interesting to see how this variable is influenced by the others.
```{r plot_variables}
ggplot(data = PCA1_var, aes(PC1, PC2)) +
  # Add variable arrows with conditional color
  geom_segment(
    aes(color = ifelse(Variable == "AlbumID_Rank", "black", "red")),
    xend = 0, yend = 0,
    arrow = arrow(
      length = unit(0.03, "npc"),
      type = "closed",
      ends = "first"
    )
  ) +
  # Add variable names with conditional color
  geom_text_repel(
    aes(label = Variable, color = ifelse(Variable == "AlbumID_Rank", "black", "red")),
    hjust = 1, size = 3,
    min.segment.length = Inf,
    nudge_x = 0.01, nudge_y = 0.08
  ) +
  coord_fixed() +
  labs(
    # Add titles
    title = 'Plot of PCA Variables Top 100 Metal Albums',
    x = 'PC1',
    y = 'PC2',
    color = 'Variable'
  ) +
  scale_color_identity() +
  theme_minimal()
```
This plot of variables again shows that the `AlbumID_Rank` is mostly influenced by `Release_Year`, but also that `Acoustic` and `Loud` separate a lot of songs. With a larger plot in mind, I wanted to remove the information about the PC numbers, and make the plot look more visually appealing so it can be shown together with the PCA plot:
```{r plot_variables_pretty}
pretty_var_plot <- ggplot(data = PCA1_var, aes(PC1, PC2)) +
  # Add variable arrows with conditional color
  geom_segment(
    aes(color = ifelse(Variable == "AlbumID_Rank", "black", "#c93434")), # Black and red arrows
    xend = 0, yend = 0,
    arrow = arrow(
      length = unit(0.05, "npc"),
      type = "closed",
      ends = "first"
    ),
    linewidth = 0.9, # Wide lines
  ) +
  # Add variable names
  geom_text_repel(
    aes(label = Variable, color = ifelse(Variable == "AlbumID_Rank", "black", "#c93434")), # Black and red names
    hjust = 1, size = 3.5,
    min.segment.length = Inf,
    nudge_x = 0.01, nudge_y = 0.08
  ) +
  # Add dashed line for y axis
  annotate(
    geom = "segment", x = 0, xend = 0, y = -Inf, yend = Inf, col = "grey60", lty = "dashed",
    linewidth = 0.5
  ) +
  # Add dashed line for x axis
  annotate(
    geom = "segment", x = Inf, xend = -Inf, y = 0, yend = 0, col = "grey60", lty = "dashed",
    linewidth = 0.5
  ) +
  # Keep fixed aspect ratio
  coord_fixed() +
  # Maintain color identity and remove legend
  scale_color_identity() +
  guides(color = "none") +
  theme_minimal() +
  theme(
    panel.grid = element_blank(),
    # No axis labels
    axis.text.x = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.y = element_blank()
  )
pretty_var_plot
```
# Plotting songs
To actually create a plot of all songs included in the first PCA, I used `ggplot` again. Of course I used red and black in the color palette, this is a metal data analysis after all :).
```{r PCA1_plot}
# Add rank data back to PCA calculated data
PCA1_indiv_rank<-PCA1_rank%>%
  augment(PCA1_data)

# Plot individuals
ggplot(
  # Based on calculated PCA data
  data=PCA1_indiv_rank,
  # Use PCA values and color based on release year
  aes(.fittedPC1, .fittedPC2,color=Release_Year))+
  geom_point()+
  # Add labels and color based on year
  labs(
    title = 'Plot of individual songs',
    subtitle = 'Color shows year of album release of songs',
    x='PC1 (26%)',
    y='PC2 (22%)',
    color='Release_Year'
  )+
  # Gradient with reds
  scale_color_gradient(low = "black", high = "red") +
  # Light theme
  theme_light()
```
The plot shows that PC1 is the principal component that is able to distinguish the songs the best based on the album release year.

Now, I wanted to try and make the plot contain more information and also look a bit prettier...

# Optimizing the plot
The first bit of information I thought about adding to the plot was the song names. I started with trying to add all labels to the plot with the code below, but as was to be expected, it became a big mess:
```{r PCA1_allnames}
# Check if song names are valid and handle special characters
PCA1_indiv_rank <- PCA1_indiv_rank %>%
  # Check that song is a character vector
  mutate(Song = as.character(Song)) %>%
  # Remove not accepted characters
  mutate(Song = iconv(Song, "UTF-8", "ASCII", sub = "")) %>%
  mutate(Song = gsub("[^[:alnum:] ]", "", Song))

# Plot the PCA with all song name labels, allowing overlaps
ggplot(PCA1_indiv_rank, aes(.fittedPC1, .fittedPC2, label = Song)) +
  geom_point(aes(color = Release_Year)) +  # Add points colored by release year
  geom_text(aes(label = Song), size = 3, vjust = -0.5, hjust = 0.5) +  # Add song labels directly (no overlap avoidance)
  # Add labels and color based on year
  labs(
    title = 'PCA of Songs',
    subtitle = 'Color shows year of album release',
    x = 'PC1',
    y = 'PC2',
    color = 'Release Year'
  ) +
  # Color gradient
  scale_color_gradient(low = "black", high = "red") +
  theme_minimal()

```
Based on this plot, I noticed two major problems. The first one being that labeling every single data point leads to overlaps and clutter. The second one being that some song names are unnecessary long due to them being remastered versions. I started with trying to fix the first problem.

## Selecting most popular song from each album
To reduce the amount of songs with a label, I decided to make a new dataset that contains only the most popular song of each album in the dataset:
```{r popular_songs_only}
# New dataset
highest_popularity_per_album <- PCA1_indiv_rank %>%
  # Group based on rank
  group_by(AlbumID_Rank) %>%
  # Filter rows with max Popularity
  filter(Popularity == max(Popularity)) %>%
  # Handle ties by selecting the first row
  slice(1) %>%  
  ungroup()
```
## Removing remaster info from popular song names
After making a subset of only the most popular songs, I wanted to shorten some of the song names. A lot of the songs are remastered after the initial album release, which means that some information is added to the song title.
I created a `regex` pattern to filter out all songs that have some version of "Remaster", "Remastered" and a year in the song title.
All text that matches this pattern is removed in the code below:
```{r no_remaster_cleaning}
# Clean the song names column
popular_no_remaster <- highest_popularity_per_album %>%
  mutate(Song = str_remove_all(Song, "(?i)\\s*\\b(remaster(ed)?( \\d{4})?|\\d{4} remaster(ed)?)\\b.*$"))

# Add a flag for popular songs in the original PCA dataset
PCA1_indiv_rank <- PCA1_indiv_rank %>%
  mutate(Popular = ifelse(Song %in% popular_no_remaster$Song, TRUE, FALSE))
```
To make the downstream plotting of this data easier, I added a flag to the original PCA data to indicate if the song is the most popular of an album.

## Plotting again
Let's see if reducing the amount of labels to only one for each album and shortening remastered song names made a difference in making the plot look better:
```{r PCA1_popular_no_remaster}
# Plot the PCA with non-overlapping labels for popular songs
ggplot(PCA1_indiv_rank, aes(.fittedPC1, .fittedPC2)) +
  geom_point(aes(color = Release_Year)) +  # Add points colored by release year
  geom_text_repel(
    data = subset(PCA1_indiv_rank, Popular == TRUE),  # Only label popular songs
    aes(label = Song),
    size = 3, # Text size
    box.padding = 0.35,  # Padding around the text box
    point.padding = 0.5,  # Space between the label and the point
    max.overlaps = Inf,  # Allow all labels to attempt to display
    segment.color = 'grey50',  # Line color
    segment.size = 0.5  # Line thickness
  ) +
  # Add labels and color based on year
  labs(
    title = 'PCA of Songs',
    subtitle = 'Color shows year of album release; labels for popular songs',
    x = 'PC1',
    y = 'PC2',
    color = 'Release Year'
  ) +
  # Colors and theme 
  scale_color_gradient(low = "black", high = "red") +
  theme_minimal()
```
Well, this still looks very chaotic... Maybe it's better to manually select songs that have to be labelled in the graph. For this, I selected my personal favorites from the most popular metal songs. 
```{r PCA_favorites}
# Manually specify the song names to be labeled
selected_songs <- c("Chop Suey","Breaking the Law","Master Of Puppets","Would","Nothing Else Matters","Blind","People  Shit","Transilvanian Hunger","Walk","Bat Country","Raining Blood","Killing In The Name")

# New dataset to include only the selected songs
popular_favorites <- popular_no_remaster %>%
  filter(Song %in% selected_songs)

# Add a flag for favorite songs in the original PCA dataset
PCA1_indiv_rank <- PCA1_indiv_rank %>%
  mutate(Favorite = ifelse(Song %in% popular_favorites$Song, TRUE, FALSE))

# Plot the PCA with repelled labels for popular songs
neat_pca <- ggplot(PCA1_indiv_rank, aes(.fittedPC1, .fittedPC2)) +
  geom_point(aes(color = Release_Year)) +  # Add points colored by release year
  geom_text_repel(
    data = subset(PCA1_indiv_rank, Favorite == TRUE),  # Only label favorite songs
    aes(label = Song),
    size = 3,
    nudge_x = -0.01,
    nudge_y = -0.01,
    box.padding = 0.4,  # Padding around the text box
    point.padding = 0,  # Space between the label and the point
    max.overlaps = Inf,  # Allow all labels to attempt to display
    segment.color = 'black',  # Line color
    segment.size = 0.6  # Line thickness
  ) +
  labs(
    title = 'PCA of Songs',
    subtitle = 'Color shows year of album release; labels for favorite songs',
    x = 'PC1',
    y = 'PC2',
    color = 'Release Year'
  ) +
  scale_color_gradient(low = "black", high = "red") +
  theme_minimal()
neat_pca
```
# Creating the ideal plot
I already really liked this version of the plot, but I wanted to make it look more styled. In the code below, I played around with different looks for the datapoints, labelling mechanisms and fonts. 
```{r pretty_pca_plot_part1}
pretty_pca_plot1 <- ggplot(PCA1_indiv_rank, aes(.fittedPC1, .fittedPC2, color = Release_Year)) + # Map color to a variable
  geom_point(alpha=0.2,size = 2) +
  scale_color_gradient(low = "black", high = "red") +
  scale_x_continuous(limits = c(-5, 7)) + # Plot size x
  scale_y_continuous(limits = c(-7.8, 10))+ # Plot size y
  # Labels that can not overlap and are not on the datapoint
  geom_text_repel( 
    data = subset(PCA1_indiv_rank, Favorite == TRUE), # Only label favorite songs
    aes(label = paste(Song, "\n", Artist, sep = ""), fontface = "bold.italic"), # Song and artist label
    size = 2.8,  # Text size
    box.padding = 0.6,  # Increase padding around text boxes
    point.padding = 0.1,  # Padding between points and labels
    force = 20,  # Increase repulsion between labels
    force_pull = 0.1,  # Slight pull to the data points, ensuring alignment
    min.segment.length = 0.5,  # Avoid overly short segment lines
    max.overlaps = Inf,  # Ensure all labels are shown
    segment.color = 'black',  # Line color
    segment.size = 0.3  # Thinner segment lines
  )
# View
pretty_pca_plot1
```
I liked the styling of the points and labels here, but I wanted the axes to run through the middle of the data instead of being on the sides. I did this by creating some arrow styles and adding them manually:
```{r pretty_pca_plot_part2}
arrow_style <- arrow(
  angle = 20, ends = "first", type = "closed", length = grid::unit(4, "pt")
)

arrow_style_ind <- arrow(
  angle = 45, 
  length = grid::unit(8, "pt")
)

pretty_pca_plot2 <- pretty_pca_plot1 +
  # x-axis line
  annotate(
    geom = "segment", x = -5, xend = 6, y = 0, yend = 0,
    lty = "dashed",
    col = "grey60",
    size = 0.5
  ) +
  # x-axis arrow
  annotate(
    geom = "segment", x = 5.9, xend = 6, y = 0, yend = 0,
    col = "grey60",
    size = 0.5, arrow = arrow_style_ind, lineend = 'round', linejoin = 'round'
  ) +
  # x-axis label
  annotate(
    geom = "text", x = 4.7, y = 0.25, label = "Principal Component 1 (26%)",
    col = "dimgrey", size = 3, family = "sans", fontface = "italic"
  ) +
  # y-axis label
  annotate(
    geom = "text", x = -0.1, y = -7.7, label = "Principal Component 2 (22%)",
    col = "dimgrey", size = 3, family = "sans", angle = 90, vjust = 0, hjust = 0, fontface = "italic"
  ) +
  # y-axis
  annotate(
    geom = "segment", x = 0, xend = 0, y = -7.8, yend = 5, col = "grey60", lty = "dashed",
    size = 0.5
  ) +
  # y-axis arrow
  annotate(
    geom = "segment", x = 0, xend = 0, y = 4.9, yend = 5, col = "grey60",
    size = 0.5, arrow = arrow_style_ind, lineend = 'round', linejoin = 'round'
  )
pretty_pca_plot2

```
Now, there are too many axes and the background does not fit the style anymore. I fixed it like this: 
```{r pretty_pca_plot_part3}
pretty_pca_plot3 <- pretty_pca_plot2 +
  # Set minimal theme
  theme_minimal() +
  theme(
    panel.grid = element_blank(), # No grid
    # No axis labels
    axis.text.x = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.y = element_blank(),
  )
pretty_pca_plot3
```
This plot now really misses some additional information, such as what type of data we are looking at, what kind of analysis was performed and the data source. I also incorporated the legend into the subtitle.
```{r pretty_pca_plot_part4}
# Create title object
maintitle<-cbind.data.frame(
  # Position on left top
  x = -5, y = 9.5,
  # Glue for HTML styling
  label = glue("<span style='color:darkred'>PCA Rolling Stone top 100 metal albums of all time</span>")
)

# Subtitle object
subtitle <- cbind.data.frame(
  # Position below title
  x = -5, y = 8.5,
  # Glue for HTML styling, include legend colors
  label = glue("<span style='color:#2b2a30'>Principal component analysis of metal songs' features from the<br>Rolling Stone top 100 Metal Albums from </span><b style='color:black'> 1970 </b><span style='color:#2b2a30'>to</span><b style='color:red'> 2013 </b><span style='color:#2b2a30'>.<br>Each dot stands for a song, with title and artist for some favorite tracks.</span> ")
)

pretty_pca_plot4 <- pretty_pca_plot3 +
  annotate(
    geom = "text", x = 6, y = -7.5, hjust = 1, color = "dimgrey",
    label = "Data: data.world",
    size = 2, family = "cabin", lineheight = 0.4
  ) +
  geom_richtext(
    data = maintitle,
    aes(x = x, y = y, label = label),
    inherit.aes = FALSE, lineheight = 0.5,
    alpha = 1, size = 7, family = 'cabin', hjust = 0, vjust = 1,
    fill = NA, label.color = NA, 
    label.padding = grid::unit(rep(0, 4), "pt")
  ) +
  geom_richtext(
    data = subtitle,
    aes(x = x, y = y, label = label),
    inherit.aes = FALSE, lineheight = 0.5,
    alpha = 1, size = 5, family = 'cabin', hjust = 0, vjust = 1,
    fill = NA, label.color = NA, 
    label.padding = grid::unit(rep(0, 4), "pt")
  ) +
  # Remove original legend
  theme(
    legend.position = "none"
  )
pretty_pca_plot4
```
This looks good! But, I have also created a plot of variables that I want to include in this figure. I used `patchwork` to combine these two plots into one.
```{r combined_plots_part1}
# Make plot with patchwork
layout = c(
  area(t = 0, l = 0, b = 10, r = 15), # Large area
  area(t = 0, l = 6, b = 5, r = 17) # Smaller area in right corner
)  
pretty_pca_plot4+pretty_var_plot+plot_layout(design=layout)
```
Okay, one final touch! I want to help the viewer understand the plot of variables better, so I want to add a description of the variable plot to the figure:
```{r combined_plots_part2}
pretty_pca_plot5 <- pretty_pca_plot4 +
  # Add note for variable plot
  annotate(
  geom = "text", x = 3.2, y = 3.3, hjust = 0.5, label = "Album ranking is\n heavily influenced\n by release year",
  size = 3.8, family = "cabin", lineheight = 1
  )
pretty_pca_plot5+pretty_var_plot+plot_layout(design=layout)
```
Unfortunately, some of the label positions keep changing due to the use of `patchwork`, but this is the best I could achieve with my skills. 

This plot shows that the release year of the album of the songs correlates the most with the rank of the albums, which shows that Rolling Stone ranked older metal albums higher than the newer metal albums. It also shows that my personal taste differs leans towards the left and darker side of the main cloud of songs, which means that I prefer older songs too new ones. 

This document will be improved in the future!

© 2025 Birgit Rijvers   •