In this post, I will explain how I performed a small analysis on the
characteristics of songs from albums in the Rolling Stone top 100
greatest metal albums of all times. I was inspired to do this after I
followed this
tutorial.
Loading libraries
As always with R, the needed libraries need to be loaded before
functions of those libraries can be used:
# Load the necessary libraries
library(dplyr)
library(tidyverse)
library(ggplot2)
library(reshape2)
library(broom)
library(stringr)
library(ggrepel)
library(ggtext)
library(glue)
library(showtext)
library(patchwork)
Another thing to load in advance are the fonts needed to create the
PCA plot:
# Load fonts
font_add_google("Acme", "acme")
font_add_google("Raleway", "raleway")
font_add_google("Cabin Condensed","cabin")
font_add_google("Barlow Condensed","barlow")
font_add_google("Stint Ultra Condensed","stint")
font_add_google("Poppins","poppins")
font_add_google("Pacifico","pacifico")
font_add_google("Oswald","oswald")
# Automatically use {showtext} for plots
showtext_auto()
Importing data
First, I imported the data from TSV files obtained from a dataset
on data.world. These datasets contain data about the Rolling Stone
top 100 best metal albums of all time. For each album, metadata such as
the artist, name, release year, duration, genre, rating and more is
included. For each of the songs on the album, information about the
individual characteristics is included, such as BPM, popularity and
various metrics about how the song sounds (energy, dance, loud, valence
etc.).
# Read in album dataset
raw_albums <- read.delim("RS_metal_albums.txt")
# View results
head(raw_albums)
# Read in song dataset
raw_songs <- read.delim("RS_metal_songs.txt")
# View results
head(raw_songs)
Some of the headers in both datasets contained spaces, which are
automatically replaced by a “.” by R. I replaced these with an
underscore (“_“):
# Change column names of raw_albums
colnames(raw_albums)[4] <- "Release_Year"
colnames(raw_albums)[11] <- "Total_Seconds"
colnames(raw_albums)[13] <- "Sub_Metal_Genre"
colnames(raw_albums)[15] <- "Rolling_Stone_Rating"
# Change column names of raw_songs
colnames(raw_songs)[1] <- "Song_Index"
colnames(raw_songs)[4] <- "Track_No"
The dataset with the songs does not contain any data on the release
year of the album, but the album dataset does. I needed to add a column
to the songs dataset that shows the release year of the album on which
each of the songs is. I did this by using the “AlbumID_Rank” column
which corresponds between the 2 datasets. The method I used for this
operation is from the dplyr
package, and is called
left_join
. I first selected the relevant columns from
raw_albums (AlbumID_Rank
and Release_Year
) and
then joined it to raw_songs
using the
AlbumID_Rank
column.
# Left join on AlbumID_Rank to add the Release_Year to raw_songs
songs_with_year <- raw_songs %>%
left_join(raw_albums %>% select(AlbumID_Rank, Release_Year), by = "AlbumID_Rank")
# View result
head(songs_with_year)
This resulted in a new dataset (songs_with_year
) with a
column appended to the raw_songs
dataset. This column
contains the release year of the album on which the song is.
Another column that might come in useful is the
Sub_Metal_Genre
column of the raw_albums
dataset; I also performed a left_join
on this data:
# Left join on AlbumID_Rank to add the Sub_Metal_Genre to raw_songs
songs_with_year_genre <- songs_with_year %>%
left_join(raw_albums %>% select(AlbumID_Rank, Sub_Metal_Genre), by = "AlbumID_Rank")
# View result
head(songs_with_year_genre)
To choose which variables are going to be included in the PCA, I
wanted to check which variables correlate the most with each other. For
this, I created a correlation matrix with the cor
function
for all numeric columns of the songs_with_year_genre
dataset with the code below:
# Select only numeric columns
numeric_data <- songs_with_year_genre %>%
select_if(is.numeric)
# Calculate correlation matrix
cor_matrix <- cor(numeric_data, use = "complete.obs")
To get a visual insight into the correlation, I created a simple
heatmap of the correlation values with ggplot2
:
# Convert correlation matrix to a long format for ggplot2
cor_data <- melt(cor_matrix)
# Plot heatmap
ggplot(cor_data, aes(Var1, Var2, fill = value)) +
geom_tile(color = "white") + # White space between tiles
scale_fill_gradient2(low = "blue", high = "red", mid = "white", # Heatmap color range
midpoint = 0, limit = c(-1, 1), space = "Lab", # Midpoint 0 and fill between -1 and 1
name = "Correlation") # Legend name

This heatmap does not look the best… There are no labels in the plot
and the x-axis labels overlap. Let’s improve it:
# Plot heatmap, but pretty
ggplot(cor_data, aes(Var1, Var2, fill = value)) +
geom_tile(color = "white") + # White space between tiles
geom_text(aes(label = round(value, 2)), size = 2, color = "black") + # Add labels on tiles
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1, 1), space = "Lab",
name = "Correlation") +
theme_minimal() + # Set theme
theme(text = element_text(size = 9),axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) + # Adjust axis labels
labs(title = "Metalsongs Correlation Matrix Heatmap", # Plot title
x = "Variables", # Axis title
y = "Variables") # Axis title

Now, we can see that Song_index
and
AlbumID_Rank
are highly correlated, which makes sense
because the song index values were assigned based on the songs sorted
based on the album ranking. In addition, Loud
and
Energy
are highly correlated, which was also to be expected
because loud songs are also very energetic most of the times. To reduce
redundancy, I will only use one of these two categories:
Energy
for the PCA. 2 combinations that are highly
negatively correlated are Acoustic
and Loud
and Acoustic
and Energy
. Again, this makes
sense because within the metal genre, most acoustic songs are not loud
and do not contain a lot of energy.
A bit more unexpected is the correlation between
Release_Year
and AlbumID_Rank
(and
Song_Index
). This indicates that the year an album was
released influences the position on the Rolling Stone top 100. Let’s see
if I can further analyze this correlation during the PCA.
Selecting PCA data
To perform a PCA, I had to select the variables with numerical values
of interest from the data. These included the ranking place of the
album, but also information about the characteristics of each song and
the year they were released. I also added Song
and
Artist
because this will be usefull when labeling the
different datapoints. To select the variables that are interesting for
our PCA from the data, I used the code below:
# New data for PCA
PCA1_data<-songs_with_year_genre%>%
select(
# Interesting variables for PCA
c(AlbumID_Rank,BPM,Dance,
Loud,Valence,Acoustic,Popularity,
Release_Year,Song,Artist)
)%>%
# Remove NA entries
drop_na()
Plotting variables
To see what variables account for the variability in the plot, I
extracted the coordinates of the variables from the “rotation” matrix of
the PCA1_var
data:
# New dataframe for variables
PCA1_var<-PCA1_rank %>%
# Extract variable coords
tidy(matrix = "rotation") %>%
# Format table long to wide
pivot_wider(names_from = "PC", names_prefix = "PC", values_from = "value") %>%
# Rename column with variable names
rename(Variable=column)
# Check tibble
head(PCA1_var)
To actually plot the variables, I used the ggrepel
library because this prevents errors when variable names overlap. I
chose a different color for the AlbumID_Rank
variable,
because it is interesting to see how this variable is influenced by the
others.
ggplot(data = PCA1_var, aes(PC1, PC2)) +
# Add variable arrows with conditional color
geom_segment(
aes(color = ifelse(Variable == "AlbumID_Rank", "black", "red")),
xend = 0, yend = 0,
arrow = arrow(
length = unit(0.03, "npc"),
type = "closed",
ends = "first"
)
) +
# Add variable names with conditional color
geom_text_repel(
aes(label = Variable, color = ifelse(Variable == "AlbumID_Rank", "black", "red")),
hjust = 1, size = 3,
min.segment.length = Inf,
nudge_x = 0.01, nudge_y = 0.08
) +
coord_fixed() +
labs(
# Add titles
title = 'Plot of PCA Variables Top 100 Metal Albums',
x = 'PC1',
y = 'PC2',
color = 'Variable'
) +
scale_color_identity() +
theme_minimal()

This plot of variables again shows that the AlbumID_Rank
is mostly influenced by Release_Year
, but also that
Acoustic
and Loud
separate a lot of songs.
With a larger plot in mind, I wanted to remove the information about the
PC numbers, and make the plot look more visually appealing so it can be
shown together with the PCA plot:
pretty_var_plot <- ggplot(data = PCA1_var, aes(PC1, PC2)) +
# Add variable arrows with conditional color
geom_segment(
aes(color = ifelse(Variable == "AlbumID_Rank", "black", "#c93434")), # Black and red arrows
xend = 0, yend = 0,
arrow = arrow(
length = unit(0.05, "npc"),
type = "closed",
ends = "first"
),
linewidth = 0.9, # Wide lines
) +
# Add variable names
geom_text_repel(
aes(label = Variable, color = ifelse(Variable == "AlbumID_Rank", "black", "#c93434")), # Black and red names
hjust = 1, size = 3.5,
min.segment.length = Inf,
nudge_x = 0.01, nudge_y = 0.08
) +
# Add dashed line for y axis
annotate(
geom = "segment", x = 0, xend = 0, y = -Inf, yend = Inf, col = "grey60", lty = "dashed",
linewidth = 0.5
) +
# Add dashed line for x axis
annotate(
geom = "segment", x = Inf, xend = -Inf, y = 0, yend = 0, col = "grey60", lty = "dashed",
linewidth = 0.5
) +
# Keep fixed aspect ratio
coord_fixed() +
# Maintain color identity and remove legend
scale_color_identity() +
guides(color = "none") +
theme_minimal() +
theme(
panel.grid = element_blank(),
# No axis labels
axis.text.x = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank()
)
pretty_var_plot

Plotting songs
To actually create a plot of all songs included in the first PCA, I
used ggplot
again. Of course I used red and black in the
color palette, this is a metal data analysis after all :).
# Add rank data back to PCA calculated data
PCA1_indiv_rank<-PCA1_rank%>%
augment(PCA1_data)
# Plot individuals
ggplot(
# Based on calculated PCA data
data=PCA1_indiv_rank,
# Use PCA values and color based on release year
aes(.fittedPC1, .fittedPC2,color=Release_Year))+
geom_point()+
# Add labels and color based on year
labs(
title = 'Plot of individual songs',
subtitle = 'Color shows year of album release of songs',
x='PC1 (26%)',
y='PC2 (22%)',
color='Release_Year'
)+
# Gradient with reds
scale_color_gradient(low = "black", high = "red") +
# Light theme
theme_light()

The plot shows that PC1 is the principal component that is able to
distinguish the songs the best based on the album release year.
Now, I wanted to try and make the plot contain more information and
also look a bit prettier…
Optimizing the plot
The first bit of information I thought about adding to the plot was
the song names. I started with trying to add all labels to the plot with
the code below, but as was to be expected, it became a big mess:
# Check if song names are valid and handle special characters
PCA1_indiv_rank <- PCA1_indiv_rank %>%
# Check that song is a character vector
mutate(Song = as.character(Song)) %>%
# Remove not accepted characters
mutate(Song = iconv(Song, "UTF-8", "ASCII", sub = "")) %>%
mutate(Song = gsub("[^[:alnum:] ]", "", Song))
# Plot the PCA with all song name labels, allowing overlaps
ggplot(PCA1_indiv_rank, aes(.fittedPC1, .fittedPC2, label = Song)) +
geom_point(aes(color = Release_Year)) + # Add points colored by release year
geom_text(aes(label = Song), size = 3, vjust = -0.5, hjust = 0.5) + # Add song labels directly (no overlap avoidance)
# Add labels and color based on year
labs(
title = 'PCA of Songs',
subtitle = 'Color shows year of album release',
x = 'PC1',
y = 'PC2',
color = 'Release Year'
) +
# Color gradient
scale_color_gradient(low = "black", high = "red") +
theme_minimal()

Based on this plot, I noticed two major problems. The first one being
that labeling every single data point leads to overlaps and clutter. The
second one being that some song names are unnecessary long due to them
being remastered versions. I started with trying to fix the first
problem.
Selecting most popular song from each album
To reduce the amount of songs with a label, I decided to make a new
dataset that contains only the most popular song of each album in the
dataset:
# New dataset
highest_popularity_per_album <- PCA1_indiv_rank %>%
# Group based on rank
group_by(AlbumID_Rank) %>%
# Filter rows with max Popularity
filter(Popularity == max(Popularity)) %>%
# Handle ties by selecting the first row
slice(1) %>%
ungroup()
Removing remaster info from popular song names
After making a subset of only the most popular songs, I wanted to
shorten some of the song names. A lot of the songs are remastered after
the initial album release, which means that some information is added to
the song title. I created a regex
pattern to filter out all
songs that have some version of “Remaster”, “Remastered” and a year in
the song title. All text that matches this pattern is removed in the
code below:
# Clean the song names column
popular_no_remaster <- highest_popularity_per_album %>%
mutate(Song = str_remove_all(Song, "(?i)\\s*\\b(remaster(ed)?( \\d{4})?|\\d{4} remaster(ed)?)\\b.*$"))
# Add a flag for popular songs in the original PCA dataset
PCA1_indiv_rank <- PCA1_indiv_rank %>%
mutate(Popular = ifelse(Song %in% popular_no_remaster$Song, TRUE, FALSE))
To make the downstream plotting of this data easier, I added a flag
to the original PCA data to indicate if the song is the most popular of
an album.
Plotting again
Let’s see if reducing the amount of labels to only one for each album
and shortening remastered song names made a difference in making the
plot look better:
# Plot the PCA with non-overlapping labels for popular songs
ggplot(PCA1_indiv_rank, aes(.fittedPC1, .fittedPC2)) +
geom_point(aes(color = Release_Year)) + # Add points colored by release year
geom_text_repel(
data = subset(PCA1_indiv_rank, Popular == TRUE), # Only label popular songs
aes(label = Song),
size = 3, # Text size
box.padding = 0.35, # Padding around the text box
point.padding = 0.5, # Space between the label and the point
max.overlaps = Inf, # Allow all labels to attempt to display
segment.color = 'grey50', # Line color
segment.size = 0.5 # Line thickness
) +
# Add labels and color based on year
labs(
title = 'PCA of Songs',
subtitle = 'Color shows year of album release; labels for popular songs',
x = 'PC1',
y = 'PC2',
color = 'Release Year'
) +
# Colors and theme
scale_color_gradient(low = "black", high = "red") +
theme_minimal()

Well, this still looks very chaotic… Maybe it’s better to manually
select songs that have to be labelled in the graph. For this, I selected
my personal favorites from the most popular metal songs.
# Manually specify the song names to be labeled
selected_songs <- c("Chop Suey","Breaking the Law","Master Of Puppets","Would","Nothing Else Matters","Blind","People Shit","Transilvanian Hunger","Walk","Bat Country","Raining Blood","Killing In The Name")
# New dataset to include only the selected songs
popular_favorites <- popular_no_remaster %>%
filter(Song %in% selected_songs)
# Add a flag for favorite songs in the original PCA dataset
PCA1_indiv_rank <- PCA1_indiv_rank %>%
mutate(Favorite = ifelse(Song %in% popular_favorites$Song, TRUE, FALSE))
# Plot the PCA with repelled labels for popular songs
neat_pca <- ggplot(PCA1_indiv_rank, aes(.fittedPC1, .fittedPC2)) +
geom_point(aes(color = Release_Year)) + # Add points colored by release year
geom_text_repel(
data = subset(PCA1_indiv_rank, Favorite == TRUE), # Only label favorite songs
aes(label = Song),
size = 3,
nudge_x = -0.01,
nudge_y = -0.01,
box.padding = 0.4, # Padding around the text box
point.padding = 0, # Space between the label and the point
max.overlaps = Inf, # Allow all labels to attempt to display
segment.color = 'black', # Line color
segment.size = 0.6 # Line thickness
) +
labs(
title = 'PCA of Songs',
subtitle = 'Color shows year of album release; labels for favorite songs',
x = 'PC1',
y = 'PC2',
color = 'Release Year'
) +
scale_color_gradient(low = "black", high = "red") +
theme_minimal()
neat_pca

Creating the ideal plot
I already really liked this version of the plot, but I wanted to make
it look more styled. In the code below, I played around with different
looks for the datapoints, labelling mechanisms and fonts.
pretty_pca_plot1 <- ggplot(PCA1_indiv_rank, aes(.fittedPC1, .fittedPC2, color = Release_Year)) + # Map color to a variable
geom_point(alpha=0.2,size = 2) +
scale_color_gradient(low = "black", high = "red") +
scale_x_continuous(limits = c(-5, 7)) + # Plot size x
scale_y_continuous(limits = c(-7.8, 10))+ # Plot size y
# Labels that can not overlap and are not on the datapoint
geom_text_repel(
data = subset(PCA1_indiv_rank, Favorite == TRUE), # Only label favorite songs
aes(label = paste(Song, "\n", Artist, sep = ""), fontface = "bold.italic"), # Song and artist label
size = 2.8, # Text size
box.padding = 0.6, # Increase padding around text boxes
point.padding = 0.1, # Padding between points and labels
force = 20, # Increase repulsion between labels
force_pull = 0.1, # Slight pull to the data points, ensuring alignment
min.segment.length = 0.5, # Avoid overly short segment lines
max.overlaps = Inf, # Ensure all labels are shown
segment.color = 'black', # Line color
segment.size = 0.3 # Thinner segment lines
)
# View
pretty_pca_plot1

I liked the styling of the points and labels here, but I wanted the
axes to run through the middle of the data instead of being on the
sides. I did this by creating some arrow styles and adding them
manually:
arrow_style <- arrow(
angle = 20, ends = "first", type = "closed", length = grid::unit(4, "pt")
)
arrow_style_ind <- arrow(
angle = 45,
length = grid::unit(8, "pt")
)
pretty_pca_plot2 <- pretty_pca_plot1 +
# x-axis line
annotate(
geom = "segment", x = -5, xend = 6, y = 0, yend = 0,
lty = "dashed",
col = "grey60",
size = 0.5
) +
# x-axis arrow
annotate(
geom = "segment", x = 5.9, xend = 6, y = 0, yend = 0,
col = "grey60",
size = 0.5, arrow = arrow_style_ind, lineend = 'round', linejoin = 'round'
) +
# x-axis label
annotate(
geom = "text", x = 4.7, y = 0.25, label = "Principal Component 1 (26%)",
col = "dimgrey", size = 3, family = "sans", fontface = "italic"
) +
# y-axis label
annotate(
geom = "text", x = -0.1, y = -7.7, label = "Principal Component 2 (22%)",
col = "dimgrey", size = 3, family = "sans", angle = 90, vjust = 0, hjust = 0, fontface = "italic"
) +
# y-axis
annotate(
geom = "segment", x = 0, xend = 0, y = -7.8, yend = 5, col = "grey60", lty = "dashed",
size = 0.5
) +
# y-axis arrow
annotate(
geom = "segment", x = 0, xend = 0, y = 4.9, yend = 5, col = "grey60",
size = 0.5, arrow = arrow_style_ind, lineend = 'round', linejoin = 'round'
)
pretty_pca_plot2

Now, there are too many axes and the background does not fit the
style anymore. I fixed it like this:
pretty_pca_plot3 <- pretty_pca_plot2 +
# Set minimal theme
theme_minimal() +
theme(
panel.grid = element_blank(), # No grid
# No axis labels
axis.text.x = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
)
pretty_pca_plot3

This plot now really misses some additional information, such as what
type of data we are looking at, what kind of analysis was performed and
the data source. I also incorporated the legend into the subtitle.
# Create title object
maintitle<-cbind.data.frame(
# Position on left top
x = -5, y = 9.5,
# Glue for HTML styling
label = glue("<span style='color:darkred'>PCA Rolling Stone top 100 metal albums of all time</span>")
)
# Subtitle object
subtitle <- cbind.data.frame(
# Position below title
x = -5, y = 8.5,
# Glue for HTML styling, include legend colors
label = glue("<span style='color:#2b2a30'>Principal component analysis of metal songs' features from the<br>Rolling Stone top 100 Metal Albums from </span><b style='color:black'> 1970 </b><span style='color:#2b2a30'>to</span><b style='color:red'> 2013 </b><span style='color:#2b2a30'>.<br>Each dot stands for a song, with title and artist for some favorite tracks.</span> ")
)
pretty_pca_plot4 <- pretty_pca_plot3 +
annotate(
geom = "text", x = 6, y = -7.5, hjust = 1, color = "dimgrey",
label = "Data: data.world",
size = 2, family = "cabin", lineheight = 0.4
) +
geom_richtext(
data = maintitle,
aes(x = x, y = y, label = label),
inherit.aes = FALSE, lineheight = 0.5,
alpha = 1, size = 7, family = 'cabin', hjust = 0, vjust = 1,
fill = NA, label.color = NA,
label.padding = grid::unit(rep(0, 4), "pt")
) +
geom_richtext(
data = subtitle,
aes(x = x, y = y, label = label),
inherit.aes = FALSE, lineheight = 0.5,
alpha = 1, size = 5, family = 'cabin', hjust = 0, vjust = 1,
fill = NA, label.color = NA,
label.padding = grid::unit(rep(0, 4), "pt")
) +
# Remove original legend
theme(
legend.position = "none"
)
pretty_pca_plot4

This looks good! But, I have also created a plot of variables that I
want to include in this figure. I used patchwork
to combine
these two plots into one.
# Make plot with patchwork
layout = c(
area(t = 0, l = 0, b = 10, r = 15), # Large area
area(t = 0, l = 6, b = 5, r = 17) # Smaller area in right corner
)
pretty_pca_plot4+pretty_var_plot+plot_layout(design=layout)

Okay, one final touch! I want to help the viewer understand the plot
of variables better, so I want to add a description of the variable plot
to the figure:
pretty_pca_plot5 <- pretty_pca_plot4 +
annotate(
geom = "text", x = 3.2, y = 3.3, hjust = 0.5, label = "Album ranking is\n heavily influenced\n by release year",
size = 3.8, family = "cabin", lineheight = 1
)
# Make plot with patchwork
layout2 = c(
area(t = 0, l = 0, b = 10, r = 15), # Large area
area(t = 0, l = 6, b = 5, r = 17) # Smaller area in right corner
)
pretty_pca_plot5+pretty_var_plot+plot_layout(design=layout)

Unfortunately, some of the label positions keep changing due to the
use of patchwork
, but this is the best I could achieve with
my skills.
This plot shows that the release year of the album of the songs
correlates the most with the rank of the albums, which shows that
Rolling Stone ranked older metal albums higher than the newer metal
albums. It also shows that my personal taste differs leans towards the
left and darker side of the main cloud of songs, which means that I
prefer older songs too new ones.
This document will be improved in the future!
---
title: "Metalmusic PCA"
output: html_notebook
---
In this post, I will explain how I performed a small analysis on the characteristics of songs from albums in the Rolling Stone top 100 greatest metal albums of all times. I was inspired to do this after I followed [this tutorial](https://bjnnowak.netlify.app/2021/09/15/r-pca-with-tidyverse/).

# Loading libraries
As always with R, the needed libraries need to be loaded before functions of those libraries can be used:
```{r libraries, message=FALSE, warning=FALSE}
# Load the necessary libraries
library(dplyr)
library(tidyverse)
library(ggplot2)
library(reshape2)
library(broom)
library(stringr)
library(ggrepel)
library(ggtext)
library(glue)
library(showtext)
library(patchwork)
```

Another thing to load in advance are the fonts needed to create the PCA plot:
```{r fonts, message=FALSE}
# Load fonts
font_add_google("Acme", "acme")
font_add_google("Raleway", "raleway")
font_add_google("Cabin Condensed","cabin")
font_add_google("Barlow Condensed","barlow")
font_add_google("Stint Ultra Condensed","stint")
font_add_google("Poppins","poppins")
font_add_google("Pacifico","pacifico")
font_add_google("Oswald","oswald")

# Automatically use {showtext} for plots
showtext_auto()
```
# Importing data
First, I imported the data from TSV files obtained from a [dataset on data.world](https://data.world/kcmillersean/rs-100-greatest-metal-albums-of-all-time). These datasets contain data about the Rolling Stone top 100 best metal albums of all time. For each album, metadata such as the artist, name, release year, duration, genre, rating and more is included. For each of the songs on the album, information about the individual characteristics is included, such as BPM, popularity and various metrics about how the song sounds (energy, dance, loud, valence etc.).
```{r load_album_data}
# Read in album dataset
raw_albums <- read.delim("RS_metal_albums.txt")

# View results
head(raw_albums)
```
```{r load_song_data}
# Read in song dataset
raw_songs <- read.delim("RS_metal_songs.txt")

# View results
head(raw_songs)
```
Some of the headers in both datasets contained spaces, which are automatically replaced by a "." by R. I replaced these with an underscore ("_"):
```{r remove_underscores}
# Change column names of raw_albums
colnames(raw_albums)[4] <- "Release_Year"
colnames(raw_albums)[11] <- "Total_Seconds"
colnames(raw_albums)[13] <- "Sub_Metal_Genre"
colnames(raw_albums)[15] <- "Rolling_Stone_Rating"

# Change column names of raw_songs
colnames(raw_songs)[1] <- "Song_Index"
colnames(raw_songs)[4] <- "Track_No"
```
The dataset with the songs does not contain any data on the release year of the album, but the album dataset does. I needed to add a column to the songs dataset that shows the release year of the album on which each of the songs is. I did this by using the "AlbumID_Rank" column which corresponds between the 2 datasets. The method I used for this operation is from the `dplyr` package, and is called `left_join`. I first selected the relevant columns from raw_albums (`AlbumID_Rank` and `Release_Year`) and then joined it to `raw_songs` using the `AlbumID_Rank` column. 
```{r ljoin_years}
# Left join on AlbumID_Rank to add the Release_Year to raw_songs
songs_with_year <- raw_songs %>%
  left_join(raw_albums %>% select(AlbumID_Rank, Release_Year), by = "AlbumID_Rank")

# View result
head(songs_with_year)
```
This resulted in a new dataset (`songs_with_year`) with a column appended to the `raw_songs` dataset. This column contains the release year of the album on which the song is. 

Another column that might come in useful is the `Sub_Metal_Genre` column of the `raw_albums` dataset; I also performed a `left_join` on this data: 
```{r ljoin_genre}
# Left join on AlbumID_Rank to add the Sub_Metal_Genre to raw_songs
songs_with_year_genre <- songs_with_year %>%
  left_join(raw_albums %>% select(AlbumID_Rank, Sub_Metal_Genre), by = "AlbumID_Rank")

# View result
head(songs_with_year_genre)
```
To choose which variables are going to be included in the PCA, I wanted to check which variables correlate the most with each other. For this, I created a correlation matrix with the `cor` function for all numeric columns of the `songs_with_year_genre` dataset with the code below:

```{r calculate_cormatrix}
# Select only numeric columns
numeric_data <- songs_with_year_genre %>%
  select_if(is.numeric)

# Calculate correlation matrix
cor_matrix <- cor(numeric_data, use = "complete.obs")
```
To get a visual insight into the correlation, I created a simple heatmap of the correlation values with `ggplot2`:
```{r plot_cormatrix}
# Convert correlation matrix to a long format for ggplot2
cor_data <- melt(cor_matrix)

# Plot heatmap
ggplot(cor_data, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") + # White space between tiles
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", # Heatmap color range
                       midpoint = 0, limit = c(-1, 1), space = "Lab", # Midpoint 0 and fill between -1 and 1
                       name = "Correlation") # Legend name
```
This heatmap does not look the best... There are no labels in the plot and the x-axis labels overlap. Let's improve it:
```{r plot_cormatrix_pretty}
# Plot heatmap, but pretty
ggplot(cor_data, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") + # White space between tiles
  geom_text(aes(label = round(value, 2)), size = 2, color = "black") + # Add labels on tiles
  scale_fill_gradient2(low = "blue", high = "red", mid = "white",
                       midpoint = 0, limit = c(-1, 1), space = "Lab",
                       name = "Correlation") +
  theme_minimal() + # Set theme
  theme(text = element_text(size = 9),axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) + # Adjust axis labels
  labs(title = "Metalsongs Correlation Matrix Heatmap", # Plot title
       x = "Variables", # Axis title
       y = "Variables") # Axis title
```
Now, we can see that `Song_index` and `AlbumID_Rank` are highly correlated, which makes sense because the song index values were assigned based on the songs sorted based on the album ranking. In addition, `Loud` and `Energy` are highly correlated, which was also to be expected because loud songs are also very energetic most of the times. To reduce redundancy, I will only use one of these two categories: `Energy` for the PCA. 2 combinations that are highly negatively correlated are `Acoustic` and `Loud` and `Acoustic` and `Energy`. Again, this makes sense because within the metal genre, most acoustic songs are not loud and do not contain a lot of energy. 

A bit more unexpected is the correlation between `Release_Year` and `AlbumID_Rank` (and `Song_Index`). This indicates that the year an album was released influences the position on the Rolling Stone top 100. Let's see if I can further analyze this correlation during the PCA.

# Selecting PCA data
To perform a PCA, I had to select the variables with numerical values of interest from the data. These included the ranking place of the album, but also information about the characteristics of each song and the year they were released. I also added `Song` and `Artist` because this will be usefull when labeling the different datapoints.
To select the variables that are interesting for our PCA from the data, I used the code below:
```{r selecting}
# New data for PCA
PCA1_data<-songs_with_year_genre%>%
  select(
    # Interesting variables for PCA
    c(AlbumID_Rank,BPM,Dance,
      Loud,Valence,Acoustic,Popularity,
      Release_Year,Song,Artist)
  )%>%
  # Remove NA entries
  drop_na()
```
# Performing the PCA
After these steps I was ready to perform the PCA, which I did using the `prcomp` function:

```{r PCA1_rank}
# Remove the non-numerical artist and song name data for the PCA
PCA1_rank <- PCA1_data%>%
  select(-Artist,-Song)%>%
  # Perform PCA 
  prcomp(scale=TRUE)
```

I used another package (`broom`) to view the results in the `tidyverse` syntax:

```{r access_PCA1_rank}
# View results in eigenvalues matrix
PCA1_rank%>%
  tidy(matrix = "eigenvalues")
```
As the tibble shows, PCA1 accounts for 26% of the overall variability, and PCA2 for 22%. I wanted to visualize how the different categories separate the data, before plotting the actual datapoints of each song.

# Plotting variables
To see what variables account for the variability in the plot, I extracted the coordinates of the variables from the "rotation" matrix of the `PCA1_var` data:
```{r PCA1_variables}
# New dataframe for variables
PCA1_var<-PCA1_rank %>%
  # Extract variable coords
  tidy(matrix = "rotation") %>%
  # Format table long to wide
  pivot_wider(names_from = "PC", names_prefix = "PC", values_from = "value") %>%
  # Rename column with variable names
  rename(Variable=column)
  # Check tibble
  head(PCA1_var)
```
To actually plot the variables, I used the `ggrepel` library because this prevents errors when variable names overlap.  I chose a different color for the `AlbumID_Rank` variable, because it is interesting to see how this variable is influenced by the others.
```{r plot_variables}
ggplot(data = PCA1_var, aes(PC1, PC2)) +
  # Add variable arrows with conditional color
  geom_segment(
    aes(color = ifelse(Variable == "AlbumID_Rank", "black", "red")),
    xend = 0, yend = 0,
    arrow = arrow(
      length = unit(0.03, "npc"),
      type = "closed",
      ends = "first"
    )
  ) +
  # Add variable names with conditional color
  geom_text_repel(
    aes(label = Variable, color = ifelse(Variable == "AlbumID_Rank", "black", "red")),
    hjust = 1, size = 3,
    min.segment.length = Inf,
    nudge_x = 0.01, nudge_y = 0.08
  ) +
  coord_fixed() +
  labs(
    # Add titles
    title = 'Plot of PCA Variables Top 100 Metal Albums',
    x = 'PC1',
    y = 'PC2',
    color = 'Variable'
  ) +
  scale_color_identity() +
  theme_minimal()
```
This plot of variables again shows that the `AlbumID_Rank` is mostly influenced by `Release_Year`, but also that `Acoustic` and `Loud` separate a lot of songs. With a larger plot in mind, I wanted to remove the information about the PC numbers, and make the plot look more visually appealing so it can be shown together with the PCA plot:
```{r plot_variables_pretty}
pretty_var_plot <- ggplot(data = PCA1_var, aes(PC1, PC2)) +
  # Add variable arrows with conditional color
  geom_segment(
    aes(color = ifelse(Variable == "AlbumID_Rank", "black", "#c93434")), # Black and red arrows
    xend = 0, yend = 0,
    arrow = arrow(
      length = unit(0.05, "npc"),
      type = "closed",
      ends = "first"
    ),
    linewidth = 0.9, # Wide lines
  ) +
  # Add variable names
  geom_text_repel(
    aes(label = Variable, color = ifelse(Variable == "AlbumID_Rank", "black", "#c93434")), # Black and red names
    hjust = 1, size = 3.5,
    min.segment.length = Inf,
    nudge_x = 0.01, nudge_y = 0.08
  ) +
  # Add dashed line for y axis
  annotate(
    geom = "segment", x = 0, xend = 0, y = -Inf, yend = Inf, col = "grey60", lty = "dashed",
    linewidth = 0.5
  ) +
  # Add dashed line for x axis
  annotate(
    geom = "segment", x = Inf, xend = -Inf, y = 0, yend = 0, col = "grey60", lty = "dashed",
    linewidth = 0.5
  ) +
  # Keep fixed aspect ratio
  coord_fixed() +
  # Maintain color identity and remove legend
  scale_color_identity() +
  guides(color = "none") +
  theme_minimal() +
  theme(
    panel.grid = element_blank(),
    # No axis labels
    axis.text.x = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.y = element_blank()
  )
pretty_var_plot
```
# Plotting songs
To actually create a plot of all songs included in the first PCA, I used `ggplot` again. Of course I used red and black in the color palette, this is a metal data analysis after all :).
```{r PCA1_plot}
# Add rank data back to PCA calculated data
PCA1_indiv_rank<-PCA1_rank%>%
  augment(PCA1_data)

# Plot individuals
ggplot(
  # Based on calculated PCA data
  data=PCA1_indiv_rank,
  # Use PCA values and color based on release year
  aes(.fittedPC1, .fittedPC2,color=Release_Year))+
  geom_point()+
  # Add labels and color based on year
  labs(
    title = 'Plot of individual songs',
    subtitle = 'Color shows year of album release of songs',
    x='PC1 (26%)',
    y='PC2 (22%)',
    color='Release_Year'
  )+
  # Gradient with reds
  scale_color_gradient(low = "black", high = "red") +
  # Light theme
  theme_light()
```
The plot shows that PC1 is the principal component that is able to distinguish the songs the best based on the album release year.

Now, I wanted to try and make the plot contain more information and also look a bit prettier...

# Optimizing the plot
The first bit of information I thought about adding to the plot was the song names. I started with trying to add all labels to the plot with the code below, but as was to be expected, it became a big mess:
```{r PCA1_allnames}
# Check if song names are valid and handle special characters
PCA1_indiv_rank <- PCA1_indiv_rank %>%
  # Check that song is a character vector
  mutate(Song = as.character(Song)) %>%
  # Remove not accepted characters
  mutate(Song = iconv(Song, "UTF-8", "ASCII", sub = "")) %>%
  mutate(Song = gsub("[^[:alnum:] ]", "", Song))

# Plot the PCA with all song name labels, allowing overlaps
ggplot(PCA1_indiv_rank, aes(.fittedPC1, .fittedPC2, label = Song)) +
  geom_point(aes(color = Release_Year)) +  # Add points colored by release year
  geom_text(aes(label = Song), size = 3, vjust = -0.5, hjust = 0.5) +  # Add song labels directly (no overlap avoidance)
  # Add labels and color based on year
  labs(
    title = 'PCA of Songs',
    subtitle = 'Color shows year of album release',
    x = 'PC1',
    y = 'PC2',
    color = 'Release Year'
  ) +
  # Color gradient
  scale_color_gradient(low = "black", high = "red") +
  theme_minimal()

```
Based on this plot, I noticed two major problems. The first one being that labeling every single data point leads to overlaps and clutter. The second one being that some song names are unnecessary long due to them being remastered versions. I started with trying to fix the first problem.

## Selecting most popular song from each album
To reduce the amount of songs with a label, I decided to make a new dataset that contains only the most popular song of each album in the dataset:
```{r popular_songs_only}
# New dataset
highest_popularity_per_album <- PCA1_indiv_rank %>%
  # Group based on rank
  group_by(AlbumID_Rank) %>%
  # Filter rows with max Popularity
  filter(Popularity == max(Popularity)) %>%
  # Handle ties by selecting the first row
  slice(1) %>%  
  ungroup()
```
## Removing remaster info from popular song names
After making a subset of only the most popular songs, I wanted to shorten some of the song names. A lot of the songs are remastered after the initial album release, which means that some information is added to the song title.
I created a `regex` pattern to filter out all songs that have some version of "Remaster", "Remastered" and a year in the song title.
All text that matches this pattern is removed in the code below:
```{r no_remaster_cleaning}
# Clean the song names column
popular_no_remaster <- highest_popularity_per_album %>%
  mutate(Song = str_remove_all(Song, "(?i)\\s*\\b(remaster(ed)?( \\d{4})?|\\d{4} remaster(ed)?)\\b.*$"))

# Add a flag for popular songs in the original PCA dataset
PCA1_indiv_rank <- PCA1_indiv_rank %>%
  mutate(Popular = ifelse(Song %in% popular_no_remaster$Song, TRUE, FALSE))
```
To make the downstream plotting of this data easier, I added a flag to the original PCA data to indicate if the song is the most popular of an album.

## Plotting again
Let's see if reducing the amount of labels to only one for each album and shortening remastered song names made a difference in making the plot look better:
```{r PCA1_popular_no_remaster}
# Plot the PCA with non-overlapping labels for popular songs
ggplot(PCA1_indiv_rank, aes(.fittedPC1, .fittedPC2)) +
  geom_point(aes(color = Release_Year)) +  # Add points colored by release year
  geom_text_repel(
    data = subset(PCA1_indiv_rank, Popular == TRUE),  # Only label popular songs
    aes(label = Song),
    size = 3, # Text size
    box.padding = 0.35,  # Padding around the text box
    point.padding = 0.5,  # Space between the label and the point
    max.overlaps = Inf,  # Allow all labels to attempt to display
    segment.color = 'grey50',  # Line color
    segment.size = 0.5  # Line thickness
  ) +
  # Add labels and color based on year
  labs(
    title = 'PCA of Songs',
    subtitle = 'Color shows year of album release; labels for popular songs',
    x = 'PC1',
    y = 'PC2',
    color = 'Release Year'
  ) +
  # Colors and theme 
  scale_color_gradient(low = "black", high = "red") +
  theme_minimal()
```
Well, this still looks very chaotic... Maybe it's better to manually select songs that have to be labelled in the graph. For this, I selected my personal favorites from the most popular metal songs. 
```{r PCA_favorites}
# Manually specify the song names to be labeled
selected_songs <- c("Chop Suey","Breaking the Law","Master Of Puppets","Would","Nothing Else Matters","Blind","People  Shit","Transilvanian Hunger","Walk","Bat Country","Raining Blood","Killing In The Name")

# New dataset to include only the selected songs
popular_favorites <- popular_no_remaster %>%
  filter(Song %in% selected_songs)

# Add a flag for favorite songs in the original PCA dataset
PCA1_indiv_rank <- PCA1_indiv_rank %>%
  mutate(Favorite = ifelse(Song %in% popular_favorites$Song, TRUE, FALSE))

# Plot the PCA with repelled labels for popular songs
neat_pca <- ggplot(PCA1_indiv_rank, aes(.fittedPC1, .fittedPC2)) +
  geom_point(aes(color = Release_Year)) +  # Add points colored by release year
  geom_text_repel(
    data = subset(PCA1_indiv_rank, Favorite == TRUE),  # Only label favorite songs
    aes(label = Song),
    size = 3,
    nudge_x = -0.01,
    nudge_y = -0.01,
    box.padding = 0.4,  # Padding around the text box
    point.padding = 0,  # Space between the label and the point
    max.overlaps = Inf,  # Allow all labels to attempt to display
    segment.color = 'black',  # Line color
    segment.size = 0.6  # Line thickness
  ) +
  labs(
    title = 'PCA of Songs',
    subtitle = 'Color shows year of album release; labels for favorite songs',
    x = 'PC1',
    y = 'PC2',
    color = 'Release Year'
  ) +
  scale_color_gradient(low = "black", high = "red") +
  theme_minimal()
neat_pca
```
# Creating the ideal plot
I already really liked this version of the plot, but I wanted to make it look more styled. In the code below, I played around with different looks for the datapoints, labelling mechanisms and fonts. 
```{r pretty_pca_plot_part1}
pretty_pca_plot1 <- ggplot(PCA1_indiv_rank, aes(.fittedPC1, .fittedPC2, color = Release_Year)) + # Map color to a variable
  geom_point(alpha=0.2,size = 2) +
  scale_color_gradient(low = "black", high = "red") +
  scale_x_continuous(limits = c(-5, 7)) + # Plot size x
  scale_y_continuous(limits = c(-7.8, 10))+ # Plot size y
  # Labels that can not overlap and are not on the datapoint
  geom_text_repel( 
    data = subset(PCA1_indiv_rank, Favorite == TRUE), # Only label favorite songs
    aes(label = paste(Song, "\n", Artist, sep = ""), fontface = "bold.italic"), # Song and artist label
    size = 2.8,  # Text size
    box.padding = 0.6,  # Increase padding around text boxes
    point.padding = 0.1,  # Padding between points and labels
    force = 20,  # Increase repulsion between labels
    force_pull = 0.1,  # Slight pull to the data points, ensuring alignment
    min.segment.length = 0.5,  # Avoid overly short segment lines
    max.overlaps = Inf,  # Ensure all labels are shown
    segment.color = 'black',  # Line color
    segment.size = 0.3  # Thinner segment lines
  )
# View
pretty_pca_plot1
```
I liked the styling of the points and labels here, but I wanted the axes to run through the middle of the data instead of being on the sides. I did this by creating some arrow styles and adding them manually:
```{r pretty_pca_plot_part2}
arrow_style <- arrow(
  angle = 20, ends = "first", type = "closed", length = grid::unit(4, "pt")
)

arrow_style_ind <- arrow(
  angle = 45, 
  length = grid::unit(8, "pt")
)

pretty_pca_plot2 <- pretty_pca_plot1 +
  # x-axis line
  annotate(
    geom = "segment", x = -5, xend = 6, y = 0, yend = 0,
    lty = "dashed",
    col = "grey60",
    size = 0.5
  ) +
  # x-axis arrow
  annotate(
    geom = "segment", x = 5.9, xend = 6, y = 0, yend = 0,
    col = "grey60",
    size = 0.5, arrow = arrow_style_ind, lineend = 'round', linejoin = 'round'
  ) +
  # x-axis label
  annotate(
    geom = "text", x = 4.7, y = 0.25, label = "Principal Component 1 (26%)",
    col = "dimgrey", size = 3, family = "sans", fontface = "italic"
  ) +
  # y-axis label
  annotate(
    geom = "text", x = -0.1, y = -7.7, label = "Principal Component 2 (22%)",
    col = "dimgrey", size = 3, family = "sans", angle = 90, vjust = 0, hjust = 0, fontface = "italic"
  ) +
  # y-axis
  annotate(
    geom = "segment", x = 0, xend = 0, y = -7.8, yend = 5, col = "grey60", lty = "dashed",
    size = 0.5
  ) +
  # y-axis arrow
  annotate(
    geom = "segment", x = 0, xend = 0, y = 4.9, yend = 5, col = "grey60",
    size = 0.5, arrow = arrow_style_ind, lineend = 'round', linejoin = 'round'
  )
pretty_pca_plot2

```
Now, there are too many axes and the background does not fit the style anymore. I fixed it like this: 
```{r pretty_pca_plot_part3}
pretty_pca_plot3 <- pretty_pca_plot2 +
  # Set minimal theme
  theme_minimal() +
  theme(
    panel.grid = element_blank(), # No grid
    # No axis labels
    axis.text.x = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.y = element_blank(),
  )
pretty_pca_plot3
```
This plot now really misses some additional information, such as what type of data we are looking at, what kind of analysis was performed and the data source. I also incorporated the legend into the subtitle.
```{r pretty_pca_plot_part4}
# Create title object
maintitle<-cbind.data.frame(
  # Position on left top
  x = -5, y = 9.5,
  # Glue for HTML styling
  label = glue("<span style='color:darkred'>PCA Rolling Stone top 100 metal albums of all time</span>")
)

# Subtitle object
subtitle <- cbind.data.frame(
  # Position below title
  x = -5, y = 8.5,
  # Glue for HTML styling, include legend colors
  label = glue("<span style='color:#2b2a30'>Principal component analysis of metal songs' features from the<br>Rolling Stone top 100 Metal Albums from </span><b style='color:black'> 1970 </b><span style='color:#2b2a30'>to</span><b style='color:red'> 2013 </b><span style='color:#2b2a30'>.<br>Each dot stands for a song, with title and artist for some favorite tracks.</span> ")
)

pretty_pca_plot4 <- pretty_pca_plot3 +
  annotate(
    geom = "text", x = 6, y = -7.5, hjust = 1, color = "dimgrey",
    label = "Data: data.world",
    size = 2, family = "cabin", lineheight = 0.4
  ) +
  geom_richtext(
    data = maintitle,
    aes(x = x, y = y, label = label),
    inherit.aes = FALSE, lineheight = 0.5,
    alpha = 1, size = 7, family = 'cabin', hjust = 0, vjust = 1,
    fill = NA, label.color = NA, 
    label.padding = grid::unit(rep(0, 4), "pt")
  ) +
  geom_richtext(
    data = subtitle,
    aes(x = x, y = y, label = label),
    inherit.aes = FALSE, lineheight = 0.5,
    alpha = 1, size = 5, family = 'cabin', hjust = 0, vjust = 1,
    fill = NA, label.color = NA, 
    label.padding = grid::unit(rep(0, 4), "pt")
  ) +
  # Remove original legend
  theme(
    legend.position = "none"
  )
pretty_pca_plot4
```
This looks good! But, I have also created a plot of variables that I want to include in this figure. I used `patchwork` to combine these two plots into one.
```{r combined_plots_part1}
# Make plot with patchwork
layout = c(
  area(t = 0, l = 0, b = 10, r = 15), # Large area
  area(t = 0, l = 6, b = 5, r = 17) # Smaller area in right corner
)  
pretty_pca_plot4+pretty_var_plot+plot_layout(design=layout)
```
Okay, one final touch! I want to help the viewer understand the plot of variables better, so I want to add a description of the variable plot to the figure:
```{r combined_plots_part2}
pretty_pca_plot5 <- pretty_pca_plot4 +
  # Add note for variable plot
  annotate(
  geom = "text", x = 3.2, y = 3.3, hjust = 0.5, label = "Album ranking is\n heavily influenced\n by release year",
  size = 3.8, family = "cabin", lineheight = 1
  )
pretty_pca_plot5+pretty_var_plot+plot_layout(design=layout)
```
Unfortunately, some of the label positions keep changing due to the use of `patchwork`, but this is the best I could achieve with my skills. 

This plot shows that the release year of the album of the songs correlates the most with the rank of the albums, which shows that Rolling Stone ranked older metal albums higher than the newer metal albums. It also shows that my personal taste differs leans towards the left and darker side of the main cloud of songs, which means that I prefer older songs too new ones. 

This document will be improved in the future!
