The questions and exercises below all include examples using a dataset put together in the Tottenham Lab of many Billboard Hot 100 songs from 1946-1983. Today in this challenge you'll work to answer several questions, fitting models with rstanarm::stan_glm() and plotting with ggplot().
You don’t have to write code in this section, but it will be helpful to read through to understand the data a little better.
Load the needed packages (install if necessary, and ASK AN INSTRUCTOR if you're getting stuck here).
library(tidyverse)
library(car)
library(rstanarm)
library(tidybayes)
library(bayesplot)
Now, we can load in the data from GitHub using read_csv()
music = read_csv('../../../DANLAB/Projects/AMFM/music_databases/familiar_music_database.csv')
## Parsed with column specification:
## cols(
## .default = col_double(),
## title = col_character(),
## artist = col_character(),
## id = col_character(),
## uri = col_character(),
## genre = col_character(),
## cover = col_logical(),
## start_time = col_time(format = "")
## )
## See spec(...) for full column specifications.
names(music)
## [1] "year" "title" "artist" "id"
## [5] "uri" "rank" "genre" "cover"
## [9] "danceability" "energy" "key" "loudness"
## [13] "speechiness" "acousticness" "instrumentalness" "liveness"
## [17] "valence" "tempo" "mode" "time_signature"
## [21] "duration_ms" "start_time"
We have a lot of columns here we could look at, but for today's challenge we'll just focus on genre, loudness, and year. We'll also only look at the Pop and Rock genres for today.
# select just the relevant columns
music = dplyr::select(music, title, artist, year, genre, loudness) %>%
# filter the data to only include Pop/Rock genre
dplyr::filter(genre %in% c('Pop', 'Rock')) %>%
# scale/center loudness
dplyr::mutate(loudness = as.vector(scale(loudness, center = TRUE, scale = TRUE)))
head(music)
## # A tibble: 6 x 5
## title artist year genre loudness
## <chr> <chr> <dbl> <chr> <dbl>
## 1 I Can't Begin To Tell You - Single Version Bing Crosby 1946 Pop -2.96
## 2 Let It Snow! Let It Snow! Let It Snow! Vaughn Mon… 1946 Pop 1.24
## 3 Doctor, Lawyer, Indian Chief Betty Hutt… 1946 Pop -0.461
## 4 Personality Johnny Mer… 1946 Pop -1.68
## 5 Oh! What It Seemed to Be Frankie Ca… 1946 Pop -1.13
## 6 Prisoner of Love (with Russ Case & His Orche… Perry Como 1946 Pop -0.495
Use rstanarm::stan_glm() with 1 continuous predictor to understand relationships between year and loudness in the music dataset! Plot the model predictions using tidybayes and ggplot().
Fit the model
year_loud_model = rstanarm::stan_glm(data = music, loudness ~ year)
# use probs = c(.025, .975) to get the middle 95% of the posterior
# use digits=3 to see more digits after the decimal for each term
year_loud_summary = summary(year_loud_model, probs = c(.025, .975), digits = 3)
year_loud_summary
##
## Model Info:
## function: stan_glm
## family: gaussian [identity]
## formula: loudness ~ year
## algorithm: sampling
## sample: 4000 (posterior sample size)
## priors: see help('prior_summary')
## observations: 2135
## predictors: 2
##
## Estimates:
## mean sd 2.5% 97.5%
## (Intercept) -43.769 4.219 -52.002 -35.474
## year 0.022 0.002 0.018 0.026
## sigma 0.976 0.015 0.947 1.006
##
## Fit Diagnostics:
## mean sd 2.5% 97.5%
## mean_PPD 0.000 0.030 -0.059 0.058
##
## The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
##
## MCMC diagnostics
## mcse Rhat n_eff
## (Intercept) 0.064 0.999 4356
## year 0.000 0.999 4356
## sigma 0.000 1.000 3623
## mean_PPD 0.000 1.000 3830
## log-posterior 0.029 1.003 1912
##
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
- On average, music got louder in later years: each year was associated with an average loudness increase of 0.0222496, 95% PI [0.018034, 0.0264386] (in standardized units).
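If you want to see where numbers like these come from, here's a minimal sketch (not part of the original challenge code) that pulls the posterior draws for the year coefficient with as.matrix() and summarizes them directly:
# a sketch: extract the posterior draws for the 'year' coefficient
# (assumes year_loud_model fit above)
year_draws = as.matrix(year_loud_model, pars = 'year')
mean(year_draws)                      # posterior mean slope
quantile(year_draws, c(.025, .975))   # middle 95% posterior interval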
# Make a dataframe of years, then get model predictions for average loudness for each year in the dataset
year_df = data.frame(year = unique(music$year))
# get model predictions for each year
year_loudness_predictions = tidybayes::add_fitted_draws(model = year_loud_model, newdata = year_df)
head(year_loudness_predictions)
## # A tibble: 6 x 6
## # Groups: year, .row [1]
## year .row .chain .iteration .draw .value
## <dbl> <int> <int> <int> <int> <dbl>
## 1 1946 1 NA NA 1 -0.445
## 2 1946 1 NA NA 2 -0.469
## 3 1946 1 NA NA 3 -0.492
## 4 1946 1 NA NA 4 -0.438
## 5 1946 1 NA NA 5 -0.412
## 6 1946 1 NA NA 6 -0.519
What's going on in this dataframe of predictions?
- The predictions themselves are in the .value column. In general with the add_fitted_draws() function, the .value column is the outcome variable in your regression model.
- We can mostly ignore the .row, .chain, and .iteration columns. The .draw column is helpful though, because it shows that there are exactly 4000 'draws' (estimates 'drawn' from the posterior distribution) for each value of year. This will be true in general - tidybayes will generate 4000 draws for every combination of predictors in the dataframe given as the newdata argument.
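As a quick sanity check (this snippet isn't from the original walkthrough), we can count the rows per year with dplyr::count() to confirm there really are 4000 draws each:
# count posterior draws per year -- each year should have exactly 4000
year_loudness_predictions %>%
  dplyr::ungroup() %>%
  dplyr::count(year) %>%
  head()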
Now, let's plot the predictions using tidybayes::stat_lineribbon(), which gives us a nice 'ribbon band' that summarizes the 4000 predictions at each value of year and makes a smooth 'ribbon' for different levels of posterior uncertainty. Here, the darker inner band shows the 80% posterior interval and the outer lighter band shows the 95% posterior interval.
ggplot(data = year_loudness_predictions, aes(x = year, y = .value)) +
tidybayes::stat_lineribbon(.width = c(.8, .95)) +
scale_fill_brewer() +
labs(fill = 'Posterior Interval Level', x = 'Year', y = 'Song Loudness (standardized)') +
theme_bw()
Now, with points for the raw data too. This helps us see how large any changes in average loudness are on the scale of the distribution of all songs.
ggplot(data = year_loudness_predictions, aes(x = year, y = .value)) +
geom_point(data = music, aes(x = year, y = loudness), alpha = .4) +
tidybayes::stat_lineribbon(.width = c(.8, .95)) +
scale_fill_brewer() +
labs(fill = 'Posterior Interval Level', x = 'Year', y = 'Song Loudness (standardized)') +
theme_bw()
Use rstanarm::stan_glm() with 1 categorical predictor to understand relationships between genre and loudness in the music dataset! Plot the model predictions using tidybayes and ggplot().
genre_loud_model = rstanarm::stan_glm(data = music, loudness ~ genre)
# use probs = c(.025, .975) to get the middle 95% of the posterior
# use digits=3 to see more digits after the decimal for each term
genre_loud_summary = summary(genre_loud_model, probs = c(.025, .975), digits = 3)
genre_loud_summary
##
## Model Info:
## function: stan_glm
## family: gaussian [identity]
## formula: loudness ~ genre
## algorithm: sampling
## sample: 4000 (posterior sample size)
## priors: see help('prior_summary')
## observations: 2135
## predictors: 2
##
## Estimates:
## mean sd 2.5% 97.5%
## (Intercept) -0.116 0.028 -0.168 -0.062
## genreRock 0.286 0.044 0.201 0.370
## sigma 0.991 0.015 0.962 1.020
##
## Fit Diagnostics:
## mean sd 2.5% 97.5%
## mean_PPD 0.001 0.030 -0.057 0.062
##
## The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
##
## MCMC diagnostics
## mcse Rhat n_eff
## (Intercept) 0.000 1.000 3919
## genreRock 0.001 1.000 3664
## sigma 0.000 1.000 4156
## mean_PPD 0.000 1.000 3990
## log-posterior 0.028 1.001 1950
##
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
- On average, rock songs were louder than pop songs. This is indicated by the genreRock term, which describes the difference in loudness for rock songs relative to pop songs. On average, being a rock song was associated with an average loudness increase of 0.2863652, 95% PI [0.2009516, 0.3702407] (in standardized units) compared to pop music.
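If we wanted to quantify how confident the model is that rock songs are louder, one option (a sketch, not part of the original materials) is to look at the proportion of posterior draws where the genreRock coefficient is positive:
# posterior draws for the rock - pop loudness difference
rock_diff_draws = as.matrix(genre_loud_model, pars = 'genreRock')
# proportion of draws where rock songs are estimated to be louder than pop
mean(rock_diff_draws > 0)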
# Make a dataframe of genres, then get model predictions for average loudness for each genre in the dataset
# We use stringsAsFactors=FALSE because the genre variable was treated as a string, not as a factor, in the raw data fed into the model
genre_df = data.frame(genre = c('Rock', 'Pop'), stringsAsFactors = FALSE)
# get model predictions for each genre
genre_loudness_predictions = tidybayes::add_fitted_draws(model = genre_loud_model, newdata = genre_df)
head(genre_loudness_predictions)
## # A tibble: 6 x 6
## # Groups: genre, .row [1]
## genre .row .chain .iteration .draw .value
## <chr> <int> <int> <int> <int> <dbl>
## 1 Rock 1 NA NA 1 0.171
## 2 Rock 1 NA NA 2 0.181
## 3 Rock 1 NA NA 3 0.195
## 4 Rock 1 NA NA 4 0.0819
## 5 Rock 1 NA NA 5 0.114
## 6 Rock 1 NA NA 6 0.220
Like with the first model, we get 4000 posterior predictions for each level of the predictors in newdata. Here, the dataframe is 8000 rows: 4000 predictions in the .value column for the average loudness of Rock and Pop songs, respectively.
One way of plotting this: we use tidybayes::stat_halfeye() to get plots of the full posterior distributions, plus points (posterior mean) and error bars (we use the .width parameter to request an 80% (thicker bar) and a 95% (thinner bar) posterior interval).
ggplot(data = genre_loudness_predictions, aes(x = .value, fill = genre)) +
tidybayes::stat_halfeye(.width = c(.8, .95)) +
labs(y = 'Relative Posterior Density',
x = 'Estimated Loudness (standardized)',
fill = 'Song Genre') +
theme_bw()
Or we could flip the axes the other way
ggplot(data = genre_loudness_predictions, aes(x = genre, y = .value, fill = genre)) +
tidybayes::stat_halfeye(.width = c(.8, .95)) +
labs(y = 'Estimated Loudness (standardized)',
x = 'Song Genre',
fill = 'Song Genre') +
theme_bw() +
theme(legend.position = 'none')
We can also add the raw data in as jittered points, or a boxplot.
ggplot(data = genre_loudness_predictions, aes(x = genre, y = .value, fill = genre)) +
geom_jitter(data = music, aes(x = genre, y = loudness), width = .4, alpha = .2) +
tidybayes::stat_halfeye(.width = c(.8, .95)) +
labs(y = 'Estimated Loudness (standardized)',
x = 'Song Genre',
fill = 'Song Genre') +
theme_bw() +
theme(legend.position = 'none')
Hmm, looks kinda weird - why?
With a boxplot instead
ggplot(data = genre_loudness_predictions, aes(x = genre, y = .value, fill = genre)) +
geom_boxplot(data = music, aes(x = genre, y = loudness), width = .3, alpha = .2, position = position_nudge(-.3)) +
tidybayes::stat_halfeye(.width = c(.8, .95)) +
labs(y = 'Estimated Loudness (standardized)',
x = 'Song Genre',
fill = 'Song Genre') +
theme_bw() +
theme(legend.position = 'none')
Use rstanarm::stan_glm() with 1 categorical predictor, 1 continuous predictor, and an interaction term between them to understand loudness as a function of genre, year, and their interaction. Plot the model predictions using tidybayes and ggplot().
# here we center year first to make the interaction term more interpretable
music = dplyr::mutate(music, year_center = as.vector(scale(year, center = TRUE, scale = FALSE)))
# the mean year in the dataset, so year_center=0 represents this year
mean(music$year)
## [1] 1967.181
interaction_model = rstanarm::stan_glm(data = music, loudness ~ year_center*genre)
# use probs = c(.025, .975) to get the middle 95% of the posterior
# use digits=3 to see more digits after the decimal for each term
interaction_summary = summary(interaction_model, probs = c(.025, .975), digits = 3)
interaction_summary
##
## Model Info:
## function: stan_glm
## family: gaussian [identity]
## formula: loudness ~ year_center * genre
## algorithm: sampling
## sample: 4000 (posterior sample size)
## priors: see help('prior_summary')
## observations: 2135
## predictors: 4
##
## Estimates:
## mean sd 2.5% 97.5%
## (Intercept) -0.048 0.027 -0.102 0.006
## year_center 0.028 0.003 0.023 0.033
## genreRock 0.225 0.045 0.136 0.312
## year_center:genreRock -0.030 0.005 -0.039 -0.021
## sigma 0.964 0.015 0.936 0.994
##
## Fit Diagnostics:
## mean sd 2.5% 97.5%
## mean_PPD 0.000 0.030 -0.059 0.056
##
## The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
##
## MCMC diagnostics
## mcse Rhat n_eff
## (Intercept) 0.000 1.001 4641
## year_center 0.000 1.000 2961
## genreRock 0.001 1.001 4050
## year_center:genreRock 0.000 0.999 3196
## sigma 0.000 1.000 4554
## mean_PPD 0.000 1.000 4227
## log-posterior 0.036 1.001 1759
##
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
From this model, the year_center term now describes the estimated difference in loudness associated with 1 year specifically for pop songs. The genreRock term describes the estimated difference in rock > pop music loudness at year_center = 0 (i.e., 1967, since we centered year), and the year_center:genreRock term describes how different the associations between year and loudness are for rock songs compared to pop songs. In other words, this interaction term describes how the slope of year (where x is year, y is loudness) differs for rock > pop music. For each of these terms, the 95% posterior interval excludes 0, so we might interpret these as consistent associations in the dataset.
Overall, the estimates indicate:
- pop songs tended to be louder in later years, as indicated by the year_center parameter
- at year_center=0, rock songs tended to be louder than pop songs
- rock songs had a more negative association between year and loudness compared to pop songs
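To make that last point concrete, one hedged sketch (not part of the original challenge code) is to combine the year_center and year_center:genreRock draws, which gives the year slope implied for rock songs:
# combine posterior draws to get the year slope implied for Rock songs
interaction_draws = as.data.frame(interaction_model)
rock_year_slope = interaction_draws$year_center +
  interaction_draws$`year_center:genreRock`
mean(rock_year_slope)                      # posterior mean slope for Rock songs
quantile(rock_year_slope, c(.025, .975))   # 95% posterior interval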
However, this is a little hard to parse from the model output alone, so let’s plot!
First, make a dataframe of years AND genre using expand.grid() – now our dataframe is ALL combinations of genre and year present in the dataset
interaction_df = expand.grid(genre = c('Rock', 'Pop'),
year_center = unique(music$year_center),
stringsAsFactors = FALSE)
# for plotting purposes we'll want the actual year back in
interaction_df$year = interaction_df$year_center + mean(music$year)
head(interaction_df)
## genre year_center year
## 1 Rock -21.1808 1946
## 2 Pop -21.1808 1946
## 3 Rock -20.1808 1947
## 4 Pop -20.1808 1947
## 5 Rock -19.1808 1948
## 6 Pop -19.1808 1948
# get model predictions for each combination of genre and year
interaction_predictions = tidybayes::add_fitted_draws(model = interaction_model, newdata = interaction_df)
head(interaction_predictions)
## # A tibble: 6 x 8
## # Groups: genre, year_center, year, .row [1]
## genre year_center year .row .chain .iteration .draw .value
## <chr> <dbl> <dbl> <int> <int> <int> <int> <dbl>
## 1 Rock -21.2 1946 1 NA NA 1 0.210
## 2 Rock -21.2 1946 1 NA NA 2 0.135
## 3 Rock -21.2 1946 1 NA NA 3 0.249
## 4 Rock -21.2 1946 1 NA NA 4 0.144
## 5 Rock -21.2 1946 1 NA NA 5 0.114
## 6 Rock -21.2 1946 1 NA NA 6 0.165
Great, now we have 4000 posterior predictions for loudness for every combination of genre and year_center, with the predictions stored in the .value column. We can make a plot very similar to the first one we made with tidybayes::stat_lineribbon(), only this time we'll set fill = genre.
ggplot(data = interaction_predictions, aes(x = year, y = .value, fill = genre)) +
tidybayes::stat_lineribbon(.width = c(.95), alpha = .8) +
labs(fill = 'Song Genre', x = 'Year', y = 'Song Loudness (standardized)') +
theme_bw()
This gives us a nice clear visualization showing that while pop songs tended to be louder in later years, this wasn’t true for rock songs. However, rock songs in the earlier years were louder than pop, with pop songs eventually catching up in loudness around 1975. Now let’s add the raw data in too
ggplot(data = interaction_predictions, aes(x = year, y = .value, fill = genre)) +
geom_point(data = music, aes(x = year, y = loudness, color = genre)) +
tidybayes::stat_lineribbon(.width = c(.95), alpha = .8) +
labs(fill = 'Song Genre', x = 'Year', y = 'Song Loudness (standardized)') +
theme_bw() +
guides(color = FALSE)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
Apart from contextualizing these trends on the scale of the data, this graph also reveals something else important when we see the raw data: there really weren't many rock songs in the dataset at all before the late 1950s! So, how might this impact our conclusions if we hoped to say that pop music increased in loudness over time while rock did not? We might have to scale back our inference and dig a little deeper into what kinds of songs were released over time, and what processes led to different songs being labeled as 'Rock' versus 'Pop' in the first place (e.g., record label and radio promotion strategies, geographical locations and identities of the musical artists, who was doing the genre 'labeling').
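If you want to check that claim about sparse early rock songs directly, a quick tabulation works (a sketch, not part of the original challenge; it just counts songs per genre by decade):
# count Pop vs. Rock songs by decade to see how sparse early Rock is
music %>%
  dplyr::mutate(decade = floor(year / 10) * 10) %>%
  dplyr::count(decade, genre)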
Congrats! You’ve finished the challenge!