- Convince you that Bayesian modeling is helpful
- Convince you that, whatever you have heard, Bayesian modeling isn’t any harder than Frequentist modeling
- Help you feel comfortable and confident doing Bayesian modeling with regression
We’ll use exactly the same data as last week!
?function_name
in the console (or using Google!). Note: sometimes these packages can take a little time and effort to install. If you are having trouble, please reach out to your instructors or mentor right away for help getting them set up. Installation is often the hardest part of using R packages!
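For example, to pull up the documentation for a function like stan_glm (a quick sketch, once the packages below are loaded):
# open the help page for stan_glm
?stan_glm
# or, equivalently, without loading rstanarm first
help('stan_glm', package = 'rstanarm')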
# load some packages
library(tidyverse) # contains a bunch of packages for working with data, plotting, etc.
library(rstanarm)
library(tidybayes)
library(bayesplot)
library(car)
# if you need to install these packages, run the following lines of code:
# install.packages('tidyverse')
# install.packages('car')
# install.packages('rstanarm')
# install.packages('tidybayes')
# install.packages('bayesplot')
But please reach out to your instructors — or check out materials associated with the basic track coding workshops — if you need a refresher! We’re happy to help.
# first, let's load in some data
# we're going to use a potentially-familiar dataset from within the `car` package called Prestige
# this contains information about various types of jobs (e.g., average income, education, etc.)
df <- Prestige
# for simplicity, we just want to look at blue collar and white collar jobs
# we'll also get rid of 'census' since we're not using it
df <- df %>%
dplyr::filter(type != 'prof') %>%
dplyr::select(-census)
# the head() function shows the first few rows of the dataframe
head(df)
## education income women prestige type
## nursing.aides 9.45 3485 76.14 34.9 bc
## medical.technicians 12.79 5180 76.04 67.5 wc
## radio.tv.announcers 12.71 7562 11.15 57.6 wc
## secretaries 11.59 4036 97.51 46.0 wc
## typists 11.49 3148 95.97 41.9 wc
## bookkeepers 11.32 4348 68.24 49.4 wc
Notice that we have all our variables of interest in separate columns here! In order to use the lm() (and related) functions, it’s important that all of your variables of interest exist in separate columns. In our case, the Prestige dataset is already set up like this, so there’s no need to make any adjustments. But it’s not uncommon for datasets to require some restructuring in advance.
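If your data did need restructuring, here is a hypothetical sketch (not needed for Prestige; the long_df object is made up purely for illustration) of how tidyr::pivot_wider() from the tidyverse could spread a 'long' dataset into one column per variable:
# hypothetical 'long' data: one row per job-variable pair
long_df <- data.frame(job = rep(c('secretaries', 'typists'), each = 2),
                      variable = rep(c('income', 'education'), times = 2),
                      value = c(4036, 11.59, 3148, 11.49))
# spread each variable into its own column, one row per job
wide_df <- long_df %>%
  tidyr::pivot_wider(names_from = variable, values_from = value)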
Coding of Xs (variables), including continuous and categorical Xs, matters for interpreting our results, so we probably want to check the coding and centering of our variables in advance. This is especially important for when we have interaction effects in our model. But even in models without interaction effects, how we code variables will affect our interpretation of the intercept.
Typically, continuous Xs will be mean-centered (subtract the sample mean from each observation); here we will also divide by the sample standard deviation (i.e., standardize), which is what scale() does below.
# scale education, women, and prestige
df = df %>%
  dplyr::mutate(education = as.vector(scale(education, center = TRUE, scale = TRUE)),
                women = as.vector(scale(women, center = TRUE, scale = TRUE)),
                prestige = as.vector(scale(prestige, center = TRUE, scale = TRUE)))
head(df)
## education income women prestige type
## nursing.aides 0.1057896 3485 0.1057896 0.1057896 bc
## medical.technicians 2.1052494 5180 2.1052494 2.1052494 wc
## radio.tv.announcers 2.0573582 7562 2.0573582 2.0573582 wc
## secretaries 1.3868806 4036 1.3868806 1.3868806 wc
## typists 1.3270166 3148 1.3270166 1.3270166 wc
## bookkeepers 1.2252477 4348 1.2252477 1.2252477 wc
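As a quick sanity check (a sketch), each scaled column should now have a mean of approximately 0 and a standard deviation of 1:
# means should be ~0 and SDs should be ~1 after scaling
round(colMeans(df[, c('education', 'women', 'prestige')]), 3)
sapply(df[, c('education', 'women', 'prestige')], sd)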
- There are ways of turning categorical variables into numbers (which you need to do in order to run a regression model), and many statistical packages will do this for you “under the hood.” However, unless you recode your variables yourself, you might not know what different R packages/functions are actually doing, which may make it tricky to accurately interpret your results.
- For categorical variables, it’s particularly important that zero is meaningful, as this will allow us to interpret the intercept of our model (since the intercept = the value of our outcome variable when all predictors/moderators are set to 0). For simplicity, we will work with examples of categorical Xs involving two levels only (e.g., blue collar and white collar).
- In effect coding, all of your levels/categories of X should sum up to 0.
- In a case where your categorical variable has two levels, a common choice is to pick the values -0.5 and 0.5, or -1 and 1. Neither of these options will change the overall results (i.e., the significance) of your model, but they will impact your interpretation of the slope.
# let's also effect code the job `type` variable
# blue collar (bc) = -0.5, and white collar (wc) = 0.5
df$type.e <- dplyr::recode(df$type, "bc" = -0.5, "wc" = 0.5)
# note: the 'dplyr::' tells R which package to look in for the recode function
# while this isn't always necessary to do before calling a function, it's good practice
# it IS necessary to do, however, when multiple packages you've loaded have a function with the same name
head(df)
## education income women prestige type type.e
## nursing.aides 0.1057896 3485 0.1057896 0.1057896 bc -0.5
## medical.technicians 2.1052494 5180 2.1052494 2.1052494 wc 0.5
## radio.tv.announcers 2.0573582 7562 2.0573582 2.0573582 wc 0.5
## secretaries 1.3868806 4036 1.3868806 1.3868806 wc 0.5
## typists 1.3270166 3148 1.3270166 1.3270166 wc 0.5
## bookkeepers 1.2252477 4348 1.2252477 1.2252477 wc 0.5
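It's always worth double-checking a recode like this. A quick cross-tab (a sketch) confirms that every bc row got -0.5 and every wc row got 0.5:
# cross-tabulate the original and effect-coded versions of job type
table(df$type, df$type.e)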
Regression with lm() and stan_glm()
- lm() code is equation based, and so is the stan_glm() command, which uses Bayesian inference ‘under the hood’
- Both follow the basic equation y ~ x, where y is the outcome variable and x is the predictor variable
- For example, let’s say we want to predict income as a function of prestige:
- The model syntax to run the Bayesian model using rstanarm::stan_glm() is almost exactly the same: everything inside the model call is the same
- Note: the Bayesian model will produce a lot more output as it is sampling – this is a good way to check how far along it is towards being finished running. By default, the model will run 4 ‘Chains’ with 2000 iterations each. So, if your model says Chain 2: Iteration 1800 / 2000, for example, you know you’re a little less than halfway through.
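If you ever want to change these sampling settings (or set a seed so your results are exactly reproducible), you can pass the relevant arguments straight to stan_glm(). A sketch, not run here:
# same model as below, but with the sampling settings spelled out explicitly
# m1_bayes <- rstanarm::stan_glm(income ~ prestige, data = df,
#                                chains = 4, iter = 2000, seed = 123)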
# frequentist model
m1_freq <- lm(income ~ prestige, data = df)
# bayesian model
m1_bayes = rstanarm::stan_glm(income ~ prestige, data = df)
Comparing the lm() and rstanarm::stan_glm() model outputs
What is consistent across both models?
- The intercept (i.e., (Intercept)) Estimate describes the estimated value of the outcome variable when all predictors are set to 0. In this case, this means the estimated income when prestige is set to 0. Since prestige has been mean-centered, this would represent the estimated income for a job of mean prestige.
- The prestige Estimate, or the ‘beta estimate’ for prestige, represents the estimated average difference in the outcome variable (income) associated with a 1-unit increase in prestige. So, in this case, if we increase prestige by 1 (1 standard deviation, since we’ve scaled it), income is expected to increase on average by this amount.
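To make those interpretations concrete, here's a quick sketch using the frequentist coefficients (the numbers will line up with the summary output below):
# predicted income for a job of average prestige (prestige = 0): just the intercept
coef(m1_freq)['(Intercept)']
# predicted income for a job 1 SD above average prestige: intercept + slope * 1
coef(m1_freq)['(Intercept)'] + coef(m1_freq)['prestige']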
summary(m1_freq)
##
## Call:
## lm(formula = income ~ prestige, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3449.9 -1449.9 -215.4 1518.2 3858.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5263.7 238.3 22.09 <2e-16 ***
## prestige 391.4 240.1 1.63 0.108
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1951 on 65 degrees of freedom
## Multiple R-squared: 0.03926, Adjusted R-squared: 0.02448
## F-statistic: 2.656 on 1 and 65 DF, p-value: 0.108
If we inspect the output from the Bayesian model, we can see that the first column of output for the Estimates, mean, gives almost exactly the same values. That is because the Estimate from the frequentist model is, in many cases, roughly equivalent to the mean posterior estimate in the Bayesian model – both try to provide a single ‘best guess’ for the value of each parameter.
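As a quick side-by-side check (a sketch), you can compare the point estimates directly. Note that coef() on a stanreg object returns posterior medians by default, so it matches the 50% column of the Bayesian summary rather than the mean exactly:
# frequentist estimates vs. Bayesian posterior medians
coef(m1_freq)
coef(m1_bayes)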
Now, with the Bayesian model, you might notice a few things:
- There are no t-values or p-values! Depending on your point of view, this could either be a great thing or an awful thing. Either way though, this will mean that we have to think about our model in other ways.
- Instead of t-values and p-values, the summary command by default gives us an 80% posterior interval – that is, the model’s best guess, with 80% confidence, about the true value of each parameter. We can see this 80% interval has a minimum value labeled 10% and a maximum value labeled 90%. So, the middle 80% of the posterior values for each parameter lie in this interval.
summary(m1_bayes)
##
## Model Info:
## function: stan_glm
## family: gaussian [identity]
## formula: income ~ prestige
## algorithm: sampling
## sample: 4000 (posterior sample size)
## priors: see help('prior_summary')
## observations: 67
## predictors: 2
##
## Estimates:
## mean sd 10% 50% 90%
## (Intercept) 5266.2 245.7 4961.4 5262.2 5572.9
## prestige 390.1 238.8 83.8 391.3 697.1
## sigma 1968.6 176.8 1753.5 1953.0 2203.4
##
## Fit Diagnostics:
## mean sd 10% 50% 90%
## mean_PPD 5267.1 343.1 4826.9 5266.2 5702.5
##
## The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
##
## MCMC diagnostics
## mcse Rhat n_eff
## (Intercept) 3.9 1.0 4034
## prestige 3.8 1.0 4035
## sigma 2.9 1.0 3847
## mean_PPD 5.4 1.0 4085
## log-posterior 0.0 1.0 1589
##
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
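As an aside (a sketch using the bayesplot package we loaded earlier): since the summary reports MCMC diagnostics like Rhat and n_eff, a quick way to eyeball convergence is a trace plot, where well-mixed, overlapping chains are what you want to see:
# trace plots for the intercept and slope -- the 4 chains should overlap like fuzzy caterpillars
bayesplot::mcmc_trace(as.array(m1_bayes), pars = c('(Intercept)', 'prestige'))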
What if we want a 95% posterior interval though? We can set exactly what interval we want in our summary with the probs argument:
# we use .025 and .975 because the MIDDLE 95% falls between these two values
summary(m1_bayes, probs = c(.025, .975))
##
## Model Info:
## function: stan_glm
## family: gaussian [identity]
## formula: income ~ prestige
## algorithm: sampling
## sample: 4000 (posterior sample size)
## priors: see help('prior_summary')
## observations: 67
## predictors: 2
##
## Estimates:
## mean sd 2.5% 97.5%
## (Intercept) 5266.2 245.7 4782.4 5761.4
## prestige 390.1 238.8 -80.6 853.5
## sigma 1968.6 176.8 1660.3 2339.8
##
## Fit Diagnostics:
## mean sd 2.5% 97.5%
## mean_PPD 5267.1 343.1 4613.6 5952.1
##
## The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
##
## MCMC diagnostics
## mcse Rhat n_eff
## (Intercept) 3.9 1.0 4034
## prestige 3.8 1.0 4035
## sigma 2.9 1.0 3847
## mean_PPD 5.4 1.0 4085
## log-posterior 0.0 1.0 1589
##
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
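An alternative (a sketch) is rstanarm's posterior_interval() function, which returns just the credible intervals without the rest of the summary:
# 95% posterior intervals for each parameter
rstanarm::posterior_interval(m1_bayes, prob = 0.95)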
Regression with a categorical predictor (type)
Questions for model interpretation:
- For the frequentist model: what does the (Intercept) tell us? What does type.e tell us?
- For the Bayesian model: what does the (Intercept) tell us? What does type.e tell us?
m2_freq <- lm(income ~ type.e, data = df)
m2_bayes <- rstanarm::stan_glm(income ~ type.e, data = df)
As with the previous model, the summary outputs are very similar, except that we get a posterior interval from the Bayesian model.
summary(m2_freq)
##
## Call:
## lm(formula = income ~ type.e, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3718.1 -1610.6 -240.1 1398.9 3727.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5213.2 255.3 20.42 <2e-16 ***
## type.e -321.8 510.5 -0.63 0.531
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1984 on 65 degrees of freedom
## Multiple R-squared: 0.006076, Adjusted R-squared: -0.009215
## F-statistic: 0.3974 on 1 and 65 DF, p-value: 0.5307
summary(m2_bayes)
##
## Model Info:
## function: stan_glm
## family: gaussian [identity]
## formula: income ~ type.e
## algorithm: sampling
## sample: 4000 (posterior sample size)
## priors: see help('prior_summary')
## observations: 67
## predictors: 2
##
## Estimates:
## mean sd 10% 50% 90%
## (Intercept) 5205.4 254.6 4882.7 5205.2 5531.3
## type.e -319.8 528.7 -992.9 -329.2 358.8
## sigma 2007.8 179.1 1783.9 1994.3 2239.1
##
## Fit Diagnostics:
## mean sd 10% 50% 90%
## mean_PPD 5256.9 346.7 4814.0 5259.3 5707.2
##
## The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
##
## MCMC diagnostics
## mcse Rhat n_eff
## (Intercept) 4.2 1.0 3640
## type.e 8.4 1.0 3982
## sigma 2.9 1.0 3819
## mean_PPD 5.6 1.0 3835
## log-posterior 0.0 1.0 1755
##
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
- Often, it is hard to understand what our model is telling us until we visualize our model’s predictions, especially with the raw data
- Bayesian models give us special power and flexibility for visualizing our model predictions. For this, we’ll use the tidybayes package!
With this step, we now ask our models to predict what the expected mean value of income should be for each level of prestige, as well as an uncertainty interval. This is super helpful for visualizing what our models are telling us.
# create a dataframe with a range of prestige values to make predictions for -- in this case going from the minimum to the maximum value, with 20 total values
continuous_predict_df = data.frame(prestige = seq(min(df$prestige), max(df$prestige), length.out = 20))
# now, ask the models to PREDICT the estimated income value for each level of prestige, with a confidence interval
m1_freq_predictions = data.frame(predict(m1_freq, newdata = continuous_predict_df, interval = 'confidence')) %>%
cbind(continuous_predict_df, .)
# this looks similar, but a *little* different using the tidybayes package
m1_bayes_predictions = tidybayes::add_fitted_draws(m1_bayes, newdata = continuous_predict_df)
- For the frequentist model, as we saw last week, we can get an estimate and an upr and lwr bound of a 95% confidence interval for each level of prestige
head(m1_freq_predictions)
## prestige fit lwr upr
## 1 -1.7320372 4585.775 3628.431 5543.118
## 2 -1.5300747 4664.819 3790.184 5539.453
## 3 -1.3281123 4743.862 3948.736 5538.989
## 4 -1.1261498 4822.906 4103.025 5542.787
## 5 -0.9241874 4901.950 4251.571 5552.328
## 6 -0.7222249 4980.994 4392.335 5569.652
- What we get for the Bayesian model is a little more detailed. We actually have 4000 different values for each level of prestige, representing the 4000 posterior ‘draws’ (they are differentiated in the .draw column) for estimating mean income (the outcome variable, which is stored in the .value column) at each level of prestige. So, when we plot these, we’ll summarize the posterior predictions in different ways (one simple way is sketched after the output below)
- While this output is slightly more complicated, it allows us to more easily view the posterior distribution in lots of different ways
head(m1_bayes_predictions)
## # A tibble: 6 x 6
## # Groups: prestige, .row [1]
## prestige .row .chain .iteration .draw .value
## <dbl> <int> <int> <int> <int> <dbl>
## 1 -1.73 1 NA NA 1 4534.
## 2 -1.73 1 NA NA 2 4324.
## 3 -1.73 1 NA NA 3 5315.
## 4 -1.73 1 NA NA 4 4396.
## 5 -1.73 1 NA NA 5 4934.
## 6 -1.73 1 NA NA 6 4682.
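One simple way to summarize these draws (a sketch) is tidybayes::median_qi(), which collapses the 4000 draws at each level of prestige into a posterior median plus an interval:
# posterior median and 95% interval of the fitted value at each level of prestige
m1_bayes_predictions %>%
  tidybayes::median_qi(.value, .width = .95)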
By using the stat_lineribbon function from tidybayes, we can generate a graph very similar to the one we made last week
ggplot(m1_bayes_predictions, aes(x = prestige, y = .value)) +
tidybayes::stat_lineribbon(.width = .95)
ggplot(m1_freq_predictions, aes(x = prestige, y = fit)) +
geom_line() +
geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = .5)
Or with a little more style - we can plot both the 80% and 95% posterior intervals! Note: a lot of researchers stick to 95% intervals because these have, somewhat arbitrarily, become traditional, but if you want to break out of the mold of the 95% this is a great way to do it!
ggplot(m1_bayes_predictions, aes(x = prestige, y = .value)) +
tidybayes::stat_lineribbon(.width = c(.80, .95)) +
theme_bw() +
labs(y = 'Model-estimated income', x = 'Prestige (Mean-Centered)') +
scale_fill_brewer(palette = "Blues")
Or with the raw data:
ggplot(m1_bayes_predictions, aes(x = prestige, y = .value)) +
tidybayes::stat_lineribbon(.width = c(.80, .95)) +
theme_bw() +
labs(y = 'Model-estimated income', x = 'Prestige (Mean-Centered)') +
scale_fill_brewer(palette = "Blues") +
geom_point(data = df, aes(x = prestige, y = income))
Now let’s plot model predictions for the model with one categorical variable – income as a function of type.e – whether jobs were white-collar or blue-collar
# these are the only values we need since there were just two categories
categorical_predict_df = data.frame(type.e = c(-.5, .5))
m2_freq_predictions = data.frame(predict(m2_freq, newdata = categorical_predict_df, interval = 'confidence')) %>%
cbind(categorical_predict_df, .)
# this looks similar, but a *little* different using the tidybayes package
m2_bayes_predictions = tidybayes::add_fitted_draws(m2_bayes, newdata = categorical_predict_df)
# recode back to informative labels
m2_freq_predictions$type = dplyr::recode(m2_freq_predictions$type.e, '-0.5' ='Blue Collar', '.5'= 'White Collar')
m2_bayes_predictions$type = dplyr::recode(m2_bayes_predictions$type.e, '-0.5' ='Blue Collar', '.5'= 'White Collar')
Now that we have the predictions from each model, we can plot them:
ggplot(m2_freq_predictions, aes(x = type, y = fit)) +
geom_point() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0)
Again, we have more options with plotting the Bayesian model fits, namely:
- We can plot the entire posterior distributions here using tidybayes::stat_halfeye(), not just the 95% intervals
- In this case, these distributions are fairly normal-looking and don’t have much skew. BUT, if they were skewed, we would know, and we could adjust our interpretations accordingly
- We can again use the .width option to get intervals of multiple widths – below we’ll try 80% and 95%
ggplot(m2_bayes_predictions, aes(y = .value, x = type)) +
tidybayes::stat_halfeye(.width = c(.80, .95)) +
labs(y = 'Model-estimated income')
- Just like last time, so far we have explored the relationship between income and prestige, and between income and type, separately. In addition to being interested in main effects, we sometimes want to look at how two variables may interact with one another in their associations with an outcome.
- For example, we might hypothesize that the association between prestige and income is stronger for white collar jobs and weaker for blue collar jobs. This is what we call a moderation, or an interaction (although some researchers prefer to stick with the term ‘interaction’ because it does not imply a specific causal mechanism, whereas ‘moderation’ does)
- For this next regression, we want to test both the main effects of type and prestige on income (y ~ X1 + X2) and their interaction (y ~ X1 * X2)
- Note that running an interaction automatically gives us the main effects as well
- What do the intercept, prestige, type.e, and prestige:type.e now tell us? We will go through each term one by one.
Run interaction model syntax
# frequentist
m3_freq <- lm(income ~ prestige * type.e, data = df)
# bayesian
m3_bayes <- rstanarm::stan_glm(income ~ prestige * type.e, data = df)
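As an aside, the * in the formula is shorthand: prestige * type.e expands to prestige + type.e + prestige:type.e, so the sketch below (with a made-up object name) fits exactly the same model as m3_freq:
# equivalent model: '*' expands to both main effects plus the interaction
m3_freq_expanded <- lm(income ~ prestige + type.e + prestige:type.e, data = df)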
Inspect model summaries
summary(m3_freq)
##
## Call:
## lm(formula = income ~ prestige * type.e, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4364.3 -1151.1 -32.4 1120.6 3695.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5349.5 451.0 11.861 <2e-16 ***
## prestige 971.2 405.3 2.396 0.0195 *
## type.e -1632.2 902.1 -1.809 0.0752 .
## prestige:type.e -950.9 810.5 -1.173 0.2451
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1828 on 63 degrees of freedom
## Multiple R-squared: 0.1824, Adjusted R-squared: 0.1435
## F-statistic: 4.686 on 3 and 63 DF, p-value: 0.005119
summary(m3_bayes)
##
## Model Info:
## function: stan_glm
## family: gaussian [identity]
## formula: income ~ prestige * type.e
## algorithm: sampling
## sample: 4000 (posterior sample size)
## priors: see help('prior_summary')
## observations: 67
## predictors: 4
##
## Estimates:
## mean sd 10% 50% 90%
## (Intercept) 5385.4 453.6 4812.0 5381.4 5970.5
## prestige 940.8 400.4 448.1 942.6 1436.0
## type.e -1561.6 898.2 -2689.6 -1574.0 -417.8
## prestige:type.e -997.9 830.1 -2023.4 -1001.1 53.6
## sigma 1848.1 163.5 1649.2 1835.7 2063.0
##
## Fit Diagnostics:
## mean sd 10% 50% 90%
## mean_PPD 5277.9 317.0 4876.3 5280.3 5671.6
##
## The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
##
## MCMC diagnostics
## mcse Rhat n_eff
## (Intercept) 9.7 1.0 2165
## prestige 9.0 1.0 1991
## type.e 20.8 1.0 1869
## prestige:type.e 17.4 1.0 2288
## sigma 3.1 1.0 2849
## mean_PPD 5.4 1.0 3451
## log-posterior 0.0 1.0 1615
##
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
Questions for model interpretation:
- For the frequentist model: what does the (Intercept) tell us? What does prestige tell us? What does type.e tell us? What does prestige:type.e tell us?
- For the Bayesian model: what does the (Intercept) tell us? What does prestige tell us? What does type.e tell us? What does prestige:type.e tell us?
- Here we can use the expand.grid command to set up a grid of all combinations of prestige and type.e
interaction_prediction_df = expand.grid(prestige = seq(min(df$prestige), max(df$prestige), length.out = 20),
type.e = c(-.5, .5))
# frequentist predictions
m3_freq_predictions = data.frame(predict(m3_freq, newdata = interaction_prediction_df, interval = 'confidence')) %>%
cbind(interaction_prediction_df, .)
# bayesian predictions
m3_bayes_predictions = tidybayes::add_fitted_draws(m3_bayes, newdata = interaction_prediction_df)
# recode back to informative labels
m3_freq_predictions$type = dplyr::recode(m3_freq_predictions$type.e, '-0.5' ='Blue Collar', '.5'= 'White Collar')
m3_bayes_predictions$type = dplyr::recode(m3_bayes_predictions$type.e, '-0.5' ='Blue Collar', '.5'= 'White Collar')
ggplot(m3_freq_predictions, aes(x = prestige, y = fit)) +
geom_line(aes(color = type)) +
geom_ribbon(aes(ymin = lwr, ymax = upr, fill = type), alpha = .5)
ggplot(m3_bayes_predictions, aes(x = prestige, y = .value)) +
tidybayes::stat_lineribbon(.width = .95, aes(fill = type))
Or with a little more glamour
ggplot(m3_bayes_predictions, aes(x = prestige, y = .value)) +
tidybayes::stat_lineribbon(.width = c(.95, .80), aes(fill = type), alpha = .5) +
theme_bw() +
labs(x = 'Prestige (Mean-Centered)', y = 'Estimated Income')
- What are your questions?