We will be using data from the same study (PANDAS) as from last week’s challenge to practice making longitudinal multilevel models. As a reminder, we are not focusing on running models yet. These data include a variable for ID and gender, and then assess self-worth (SPPA_SWORTH), communication quality with mothers (MPACS), and generalized anxiety (SCARED_GAD). Self-worth, communication and GAD were assessed every 6-months at 6 different timepoints. The csv file is the same as the lesson mlm_datacleaning.csv and can be found here: https://columbia-sipps.github.io/workshops/coding_advanced/
Our question for today will be how did self worth change over time? We’ll use longitudinal multilevel models in rstanarm
to approach this
library(tidyverse)
library(rstanarm)
library(bayesplot)
library(tidybayes)
d = read_csv('https://columbia-sipps.github.io/workshop_files/coding-adv-04/mlm-data-cleaning.csv')
d <- d %>%
filter(ID <= 100)
Being: id, gender, and all of the self worth scores
d <- d %>%
select(ID, contains("SPPA"))
Rename them so that they are easier to understand
d <- d %>%
# rename vectors
rename(
id = 1,
sworth_1 = 2,
sworth_2 = 3,
sworth_3 = 4,
sworth_4 = 5,
sworth_5 = 6,
sworth_6 = 7
)
Hint: use the pivot longer function. Note, we want there to be one column for time, and another column for self worth scores. There should be one row of data for each individual’s self worth score at each timepoint (e.g., 6 rows per person if no missing data)
d_long <- d %>%
tidyr :: pivot_longer(., cols = contains("sworth"), names_to = "time", values_to = "self_worth")
Drop the rows with NA values (the observations where there is no self worth score). Then recode time into a factor and rename the variables so that they are only numeric values that range from time 0 to time 5 instead of time 1 to time 6.
d_long <- d_long %>%
dplyr::filter(!is.na(self_worth)) %>%
dplyr::mutate(time = as.numeric(gsub('sworth_', '', time)))
ggplot(data = d_long, aes(x = time, y = self_worth)) +
geom_line(aes(group = id), color = "black") + geom_point(aes(group = id), size = .4) +
facet_wrap("id") + # Group variable gives text labels to panels
labs(x = "Time",
y = "Self-worth (self-reported)",
title = "")
## geom_path: Each group consists of only one observation. Do you need to adjust
## the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to adjust
## the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to adjust
## the group aesthetic?
(1|id)
gives us random intercepts - intercepts for each person
m1 = rstanarm::stan_glmer(data = d_long, self_worth ~ time + (1|id), cores = 4)
# point estimates
fixef(m1)
## (Intercept) time
## 3.46465180 -0.08943226
# summary of full model
m1_summary = summary(m1, probs = c(.025, .975))
m1_summary[1:10, ]
## mean mcse sd 2.5%
## (Intercept) 3.46388003 0.0014967630 0.06467047 3.338117858
## time -0.08942295 0.0002135406 0.01524420 -0.119868637
## b[(Intercept) id:1] -0.16016513 0.0025823953 0.16985973 -0.492089150
## b[(Intercept) id:2] -0.76832437 0.0028386596 0.20350379 -1.165416272
## b[(Intercept) id:3] 0.34686894 0.0026190080 0.17195948 0.009442167
## b[(Intercept) id:4] 0.43773540 0.0028000295 0.18139829 0.086202225
## b[(Intercept) id:5] -0.32916274 0.0031829013 0.22428210 -0.773475638
## b[(Intercept) id:6] 0.29266237 0.0025694379 0.17440615 -0.043970021
## b[(Intercept) id:7] -0.89665622 0.0028051782 0.20525850 -1.292143626
## b[(Intercept) id:8] -0.76718368 0.0029469372 0.19643870 -1.148060354
## 97.5% n_eff Rhat
## (Intercept) 3.58842888 1867 1.0003305
## time -0.05893875 5096 0.9994887
## b[(Intercept) id:1] 0.17402994 4326 0.9998193
## b[(Intercept) id:2] -0.36703143 5139 0.9995735
## b[(Intercept) id:3] 0.69155946 4311 0.9994173
## b[(Intercept) id:4] 0.79989640 4197 1.0001505
## b[(Intercept) id:5] 0.10494675 4965 1.0000878
## b[(Intercept) id:6] 0.62290630 4607 1.0002792
## b[(Intercept) id:7] -0.49704225 5354 0.9992195
## b[(Intercept) id:8] -0.36809148 4443 0.9998679
- Woah! Why so much output here? Well, all of the
b[(Intercept) id:1]
parameters in the model are the intercepts for each different person (each id), so there are a lot of parameterss in this model!
Note that we have to add the re_formula = NA
argument in the add_fitted_draws()
function here to tell the model we want just the fixed effects (not predictions for specific participants)
# define a data frame of timepoints to make predictions for for the group average
pred_frame = data.frame(time = 1:6)
# get predictions using tidybayes!
# again this dataframe looks a little confuisng at first, but the outcome (self worth) is always in the .value column
fixed_preds_m1 = tidybayes::add_fitted_draws(m1, newdata = pred_frame, re_formula = NA)
ggplot(fixed_preds_m1, aes(x = time, y= .value)) +
tidybayes::stat_lineribbon(alpha = .4) +
labs(x = 'Time', y = 'Self-reported self worth')
a little nicer looking…
ggplot(fixed_preds_m1, aes(x = time, y= .value)) +
tidybayes::stat_lineribbon(alpha = .4) +
labs(x = 'Time', y = 'Self-reported self worth') +
scale_fill_brewer()
We can get predictions for each point in the original dataset by specificying the original dataset d_long
as newdata
We also use the median_qi
function to summarize and get the median and a posterior interval for each prediction. Now the .value
column will be the median model prediction for each datapoint. We won’t use .upper
and .lower
yet but we could later
participant_preds_m1 = tidybayes::add_fitted_draws(model = m1, newdata = d_long) %>%
median_qi(.value, .width = .95)
## Warning: `combine()` was deprecated in dplyr 1.0.0.
## Please use `vctrs::vec_c()` instead.
Now, add participant ‘spaghetti’ to our plot. We can see all the lines are parallel – they have different intercepts (random intercepts!) but the same exact slope as the fixed effect. This is because there are no random slopes – slopes cannot vary for different participants under this model.
ggplot(fixed_preds_m1, aes(x = time, y= .value)) +
geom_line(data = participant_preds_m1, aes(group = id), alpha = .5) +
tidybayes::stat_lineribbon(alpha = .4) +
labs(x = 'Time', y = 'Self-reported self worth') +
scale_fill_brewer()
m2 = rstanarm::stan_glmer(data = d_long, self_worth ~ time + (time|id), cores = 4)
# point estimates
fixef(m1)
## (Intercept) time
## 3.46465180 -0.08943226
# summary of full model (a good way to shorten output)
g = summary(m2, probs = c(.025, .975))
g[1:5,]
## mean mcse sd 2.5% 97.5%
## (Intercept) 3.45436219 0.0009549072 0.06268023 3.3325389 3.57788559
## time -0.08559143 0.0004424263 0.02061516 -0.1252687 -0.04494921
## b[(Intercept) id:1] -0.01269774 0.0030898036 0.23607912 -0.4845167 0.45622626
## b[time id:1] -0.04951025 0.0009222081 0.06387882 -0.1822634 0.07523683
## b[(Intercept) id:2] -0.56752480 0.0047533576 0.27912209 -1.1455183 -0.05639859
## n_eff Rhat
## (Intercept) 4309 1.0011004
## time 2171 1.0029477
## b[(Intercept) id:1] 5838 0.9992595
## b[time id:1] 4798 0.9995458
## b[(Intercept) id:2] 3448 0.9996078
Now, we have parameters like b[time id:2]
for each participant! These are the random slopes!
Just like with the previous model
fixed_preds_m2 = tidybayes::add_fitted_draws(m2, newdata = pred_frame, re_formula = NA)
ggplot(fixed_preds_m2, aes(x = time, y= .value)) +
tidybayes::stat_lineribbon(alpha = .4) +
labs(x = 'Time', y = 'Self-reported self worth') +
scale_fill_brewer()
participant_preds_m2 = tidybayes::add_fitted_draws(model = m2, newdata = d_long) %>%
median_qi(.value, .width = .95)
ggplot(fixed_preds_m2, aes(x = time, y= .value)) +
geom_line(data = participant_preds_m2, aes(group = id), alpha = .5) +
tidybayes::stat_lineribbon(alpha = .4) +
labs(x = 'Time', y = 'Self-reported self worth') +
scale_fill_brewer()
- Now, we can see the lines for the model predictions for different participants are not parallel! Random intercepts AND random slopes
- Now, we might want to make a panel plot to see how well our model fits each participant, comparing the model predicitons with the raw data
- First we plot the raw data
- Then we add in our predictions from m2 & m1
panel_plt = ggplot(data = d_long, aes(x = time, y = self_worth)) +
geom_line(aes(group = id, color = "Raw Daata")) + geom_point(aes(group = id), size = .4) +
geom_line(data = participant_preds_m1 , aes(x = time, y = .value, color = 'Random Intercept Model Prediction')) +
geom_line(data = participant_preds_m2, aes(x = time, y = .value, color = 'Random Slope Model Prediction')) +
facet_wrap("id") + # Group variable gives text labels to panels
labs(x = "Time",
y = "Self-reported self worth",
title = "")
# save to a file since it's a little hard to see inline
ggsave(panel_plt, file = 'panel_plt.png', height = 8, width = 15)
It’s a little hard to see because the panels are small, but we can look to visually inspect which model seems to fit better!