A Brief Introduction To Linear Models in R
0. Introduction
Many bioinformatics applications involve repeatedly fitting linear models to data. Examples include differential gene expression analysis, genome-wide association studies (GWAS), and differential protein abundance analysis.
Scope

Covered here:
Basics of linear models
R model syntax
Understanding contrasts
Models with continuous covariates
Diagnostic plots
Data-driven model selection

Not covered:
Anything that doesn’t scale well when applied to 1000’s of genes/SNPs/proteins
1. Linear models
A linear model is a model for a continuous outcome Y of the form

Y = β₀ + β₁X₁ + β₂X₂ + ⋯ + βₚXₚ + ϵ
The covariates X can be continuous (e.g., temperature) or categorical (e.g., treatment group).
2. Linear models in R
R uses the function lm to fit linear models.
Read in `lm_example_data.csv`:
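A minimal sketch, assuming the file is in the current working directory (the data frame name dat matches the calls that follow):

# Load the example data into a data frame called 'dat'
dat <- read.csv("lm_example_data.csv")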
Fit a linear model using expression as the outcome and treatment as a categorical covariate:
In R model syntax, the outcome is on the left side, with covariates (separated by +) following the ~
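For this example:

oneway.model <- lm(expression ~ treatment, data = dat)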
oneway.model
##
## Call:
## lm(formula = expression ~ treatment, data = dat)
##
## Coefficients:
## (Intercept) treatmentB treatmentC treatmentD treatmentE
## 1.1725 0.4455 0.9028 2.5537 7.4140
class(oneway.model)
## [1] "lm"
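The summary function gives the full coefficient table along with overall fit statistics:

summary(oneway.model)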
In the output:
“Coefficients” refer to the β’s
“Estimate” is the estimate of each coefficient
“Std. Error” is the standard error of the estimate
“t value” is the coefficient divided by its standard error
“Pr(>|t|)” is the p-value for the coefficient
The residual standard error is the estimate of the standard deviation of ϵ
Degrees of freedom is the sample size minus the number of coefficients estimated
R-squared is (roughly) the proportion of variance in the outcome explained by the model
The F-statistic compares the fit of the model as a whole to the null model (with no covariates)
coef(oneway.model)
## (Intercept) treatmentB treatmentC treatmentD treatmentE
## 1.1724940 0.4455249 0.9027755 2.5536669 7.4139642
With the default reference group coding, the intercept estimates the mean expression in the reference group (treatment A), and each treatment coefficient estimates the difference between that group’s mean and the reference group’s mean. What if you don’t want reference group coding? Another option is to fit a model without an intercept:
no.intercept.model <- lm(expression ~ 0 + treatment, data = dat) # '0' means 'no intercept' here
summary(no.intercept.model)
##
## Call:
## lm(formula = expression ~ 0 + treatment, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9310 -0.5353 0.1790 0.7725 3.6114
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## treatmentA 1.1725 0.7783 1.506 0.147594
## treatmentB 1.6180 0.7783 2.079 0.050717 .
## treatmentC 2.0753 0.7783 2.666 0.014831 *
## treatmentD 3.7262 0.7783 4.787 0.000112 ***
## treatmentE 8.5865 0.7783 11.032 5.92e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.74 on 20 degrees of freedom
## Multiple R-squared: 0.8878, Adjusted R-squared: 0.8598
## F-statistic: 31.66 on 5 and 20 DF, p-value: 7.605e-09
coef(no.intercept.model)
## treatmentA treatmentB treatmentC treatmentD treatmentE
## 1.172494 1.618019 2.075270 3.726161 8.586458
Without the intercept, the coefficients here estimate the mean in each level of treatment:
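One way to compute the per-group means directly (a sketch using tapply; assumes dat has columns expression and treatment):

# Mean expression within each level of treatment
treatmentmeans <- tapply(dat$expression, dat$treatment, mean)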
treatmentmeans
## A B C D E
## 1.172494 1.618019 2.075270 3.726161 8.586458
The no-intercept model is the SAME model as the reference group coded model, in the sense that it gives the same estimate for any comparison between groups:
Treatment B - treatment A, reference group coded model:
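coef(oneway.model)["treatmentB"]

Treatment B - treatment A, no-intercept model:

coef(no.intercept.model)["treatmentB"] - coef(no.intercept.model)["treatmentA"]

Both give 0.4455249 (from the coefficient tables above: 1.618019 - 1.172494 = 0.445525).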
Under the hood, lm turns the model formula into a design matrix. The design matrix X has one row for each observation and one column for each model coefficient.
Sound complicated? The good news is that the design matrix can be specified through the model.matrix function using the same syntax as for lm, just without a response:
Design matrix for reference group coded model:
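# Same formula as the lm call, but with no response to the left of the ~
model.matrix(~ treatment, data = dat)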
(Note that “contr.treatment”, or treatment contrasts, is how R refers to reference group coding)
The first column will always be 1 in every row if your model has an intercept
The column treatmentB is 1 if an observation has treatment B and 0 otherwise
The column treatmentC is 1 if an observation has treatment C and 0 otherwise
etc.
Exercises and Things to Think About
Use ?formula to explore specifying models in R.
Use ?lm.fit to see how lm uses the design matrix internally.
If the response y is log gene expression, the model coefficients are often referred to as log fold-changes. Why does this make sense? (Hint: log(x/y) = log(x) - log(y))
3. Models with Multiple Factors

For a model with more than one covariate, summary provides estimates and tests for each coefficient adjusted for all the other coefficients in the model.
The notation treatment*time refers to treatment, time, and the interaction effect of treatment by time. (This differs from the notation used by some other statistical software.)
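The two-way model used below can be fit as follows (a sketch; the covariate name time and its levels time1 and time2 are inferred from the coefficient names in the output):

# Main effects of treatment and time, plus their interaction
twoway.model <- lm(expression ~ treatment*time, data = dat)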
Interpretation of coefficients:
Each coefficient for treatment represents the difference between the indicated group and the reference group at the reference level for the other covariates
For example, “treatmentB” is the difference in expression between treatment B and treatment A at time 1
Similarly, “timetime2” is the difference in expression between time2 and time1 for treatment A
The interaction effects (coefficients with “:”) estimate the difference between treatment groups in the effect of time
The interaction effects ALSO estimate the difference between times in the effect of treatment
To estimate the difference between treatment B and treatment A at time 2, we need to include the interaction effect:
# B - A at time 2
coefs <- coef(twoway.model)
coefs["treatmentB"] + coefs["treatmentB:timetime2"]
## treatmentB
## 0.3109271
We can see from summary that one of the interaction effects is significant. Graphically, an interaction shows up as non-parallel lines when mean expression is plotted against time for each treatment (e.g., with interaction.plot).
Another way to parameterize this model is to combine treatment and time into a single covariate. Next, fit a one-way ANOVA model with the new covariate. Don’t include an intercept in the model.
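A sketch of this step; the covariate name tx.time and its levels (e.g., A.time1, B.time2) are taken from the coefficient names shown below:

# Combine treatment and time into one factor with levels like "A.time1"
dat$tx.time <- interaction(dat$treatment, dat$time)
# One-way ANOVA with no intercept: each coefficient is a group mean
other.2way.model <- lm(expression ~ 0 + tx.time, data = dat)

We get the same estimates for the effect of treatment B vs. A at time 1: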
c1 <- coef(twoway.model)
c1["treatmentB"]
## treatmentB
## 0.4063679
c2 <- coef(other.2way.model)
c2["tx.timeB.time1"] - c2["tx.timeA.time1"]
## tx.timeB.time1
## 0.4063679
We get the same estimates for the effect of treatment B vs. A at time 2:
c1 <- coef(twoway.model)
c1["treatmentB"] + c1["treatmentB:timetime2"]
## treatmentB
## 0.3109271
c2 <- coef(other.2way.model)
c2["tx.timeB.time2"] - c2["tx.timeA.time2"]
## tx.timeB.time2
## 0.3109271
And we get the same estimates for the interaction effect (remembering that an interaction effect here is a difference of differences):
c1 <- coef(twoway.model)
c1["treatmentB:timetime2"]
## treatmentB:timetime2
## -0.09544075
c2 <- coef(other.2way.model)
(c2["tx.timeB.time2"] - c2["tx.timeA.time2"]) - (c2["tx.timeB.time1"] - c2["tx.timeA.time1"])
## tx.timeB.time2
## -0.09544075
4. Continuous Covariates
Linear models with continuous covariates (“regression models”) are fitted in much the same way:
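# Fit expression as a linear function of the continuous covariate temperature
continuous.model <- lm(expression ~ temperature, data = dat)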
For the above model, the intercept is the expected expression at a temperature of 0, and the temperature coefficient is the slope: how much expression increases for each one-unit increase in temperature.
The slope from a linear regression model is related to but not identical to the Pearson correlation coefficient:
cor.test(dat$expression, dat$temperature)
##
## Pearson's product-moment correlation
##
## data: dat$expression and dat$temperature
## t = 14.063, df = 23, p-value = 8.768e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8807176 0.9764371
## sample estimates:
## cor
## 0.9464761
summary(continuous.model)
##
## Call:
## lm(formula = expression ~ temperature, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.87373 -0.67875 -0.07922 1.00672 1.89564
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.40718 0.93724 -10.04 7.13e-10 ***
## temperature 0.97697 0.06947 14.06 8.77e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.054 on 23 degrees of freedom
## Multiple R-squared: 0.8958, Adjusted R-squared: 0.8913
## F-statistic: 197.8 on 1 and 23 DF, p-value: 8.768e-13
Notice that the p-values for the correlation and the regression slope are identical.
Scaling and centering both variables yields a regression slope equal to the correlation coefficient:
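A quick check, using scale to center and standardize both variables before refitting:

# After standardizing both variables, the slope equals the Pearson
# correlation reported above (0.9464761)
scaled.model <- lm(scale(expression) ~ scale(temperature), data = dat)
coef(scaled.model)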