H-409 Multivariate Analysis With R and Stata
Md. Mostakim
Session: 2018-19
Department of Statistics
University of Dhaka
Published Date:
Acknowledgements:
I would like to acknowledge our course teacher, Dr. Md. Belal Hossain Sir, for helping us learn
Multivariate Analysis. Thanks also to Amina Siddika for her lecture.
N.B. You may share this PDF book as much as you like, but do not use it for any unethical purpose.
For any kind of feedback, please contact mostakimbd2016@gmail.com. Your feedback will be very inspiring for me.
Problem 01: Testing Vector Means
a. Evaluate T² for testing the hypothesis Ho: µ = (7, 11) vs Ha: µ ≠ (7, 11) using the two columns
of the data matrix X.
b. Specify the distribution of T² for the situation in (a).
c. Using (a) and (b), test Ho at the level ⍺ = 0.05. What conclusion do you reach?
For each problem we provide the code, results, and interpretation using R. We provide the Stata
code starting with a dot (.).
Answer:
With R:
a.
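The computation for part (a) is reconstructed below as a sketch: the original data are not shown, so X is assumed to be the already-entered 4 × 2 data matrix (n = 4 follows from the F(2,2) degrees of freedom in the Stata output later).
> n <- nrow(X); p <- ncol(X)   # n = 4 observations, p = 2 variables (X assumed already entered)
xbar <- colMeans(X)            # sample mean vector
s <- cov(X)                    # sample covariance matrix
mu <- c(7, 11)                 # hypothesized mean vector
T2 <- n * t(xbar-mu) %*% solve(s) %*% (xbar-mu)   # Hotelling's T2
T2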
> ## [,1]
## [1,] 13.63636
b.
> ((n-1)*p)/(n-p)
> ## [1] 3
> c(df1=p,df2=n-p)
> ## df1 df2
##    2   2
c.
> ((n-1)*p)/(n-p)*qf(0.95, p, n-p)
> ## [1] 57
Since T² = 13.64 < 57, we fail to reject Ho at the ⍺ = 0.05 level. Therefore, we may conclude
that the means of column 1 and column 2 of X do not differ significantly from the values 7 and 11,
respectively, at the 5% level of significance.
> 1 - pf(T2*(n-p)/((n-1)*p), p, n-p)
> ## [,1]
## [1,] 0.1803279
Since the p-value = 0.18 > 0.05, we fail to reject Ho at the ⍺ = 0.05 level. Therefore, we may
conclude that the means of column 1 and column 2 of X do not differ significantly from the
values 7 and 11, respectively, at the 5% level of significance.
With Stata:
Type the data into the Data Editor in Stata.
Code:
. matrix mu = (7, 11)
. mvtest means X1 X2, equals(mu)
Hotelling T2 = 13.64
Hotelling F(2,2) = 4.55
Prob > F = 0.1803
Menu:
Statistics > Multivariate Analysis > Manova, Multivariate Regression and Related >
Multivariate test of means, covariances and matrices
Problem 02: Testing Vector Means
a. Evaluate T² for testing the hypothesis Ho: µ = (20, 200, 150, 3) vs Ha: µ ≠ (20, 200, 150,
3) using the variables (“mpg”, “disp”, “hp”, “wt”) from the data “mtcars”.
b. Specify the distribution of T² for the situation in (a).
c. Using (a) and (b), test Ho at the level ⍺ = 0.01. What conclusion do you reach?
Answer:
a.
# Create a matrix X with the variables "mpg", "disp", "hp", and "wt"
X <- matrix(c(mtcars$mpg, mtcars$disp, mtcars$hp, mtcars$wt), ncol = 4)
p <- ncol(X)
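n <- nrow(X)               # these steps are not shown in the original; reconstructed as a sketch
xbar <- colMeans(X)        # sample mean vector
s <- cov(X)                # sample covariance matrix
mu <- c(20, 200, 150, 3)   # hypothesized mean vector
T2 <- n * t(xbar-mu) %*% solve(s) %*% (xbar-mu)   # Hotelling's T2
T2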
> ## [,1]
## [1,] 10.78587
b.
> ((n-1)*p)/(n-p)
> ## [1] 4.428571
> c(df1=p,df2=n-p)
> ## df1 df2
##    4  28
c.
Since T² = 10.79 < 22.97, we fail to reject Ho at the ⍺ = 0.01 level. Therefore, we may conclude
that the means of “mpg”, “disp”, “hp”, and “wt” do not differ significantly from the values 20, 200,
150, and 3, respectively, at the 1% level of significance.
> 1 - pf(T2*(n-p)/((n-1)*p), p, n-p)
> ## [,1]
## [1,] 0.07058328
Since the p-value = 0.07 > 0.01, we fail to reject Ho at the ⍺ = 0.01 level. Therefore, we may
conclude that the means of “mpg”, “disp”, “hp”, and “wt” do not differ significantly from the values
20, 200, 150, and 3, respectively, at the 1% level of significance.
With Stata:
> library("foreign")   # export the data from R so Stata can read them
data("mtcars")
write.dta(mtcars, "C:/Users/Mostakim/Documents/mtcars.dta")
. matrix mu = (20, 200, 150, 3)
. mvtest means mpg disp hp wt, equals(mu)
Hotelling T2 = 10.79
Problem 03: Testing Vector Means
Let X1 = (measured radiation with door closed)^(1/4) and X2 = (measured radiation with door open)^(1/4).
a. Evaluate T² for testing the hypothesis Ho: µ = (0.562, 0.589) vs Ha: µ ≠ (0.562, 0.589)
using the variables X1 and X2.
b. Specify the distribution of T² for the situation in (a).
c. Using (a) and (b), test Ho at the level ⍺ = 0.05. What conclusion do you reach?
Answer:
a.
> radc<-
c(0.15,0.09,0.18,0.1,0.05,0.12,0.08,0.05,0.08,0.1,0.07,0.02,.01,0.1,0.1
,0.1,0.02,0.1,0.01,0.4,0.1,0.05,0.03,0.05,0.15,0.1,0.15,0.09,0.08,0.18,
0.1,0.2,0.11,0.3,0.02,0.2,0.2,0.3,0.3,0.4,0.3,0.05)
rado<-
c(0.30,0.09,0.30,0.10,0.10,0.12,0.09,0.10,0.09,0.10,0.07,0.05,0.01,0.45
,0.12,0.2,0.04,0.1,0.01,0.6,0.12,0.1,0.05,0.05,0.15,0.3,0.15,0.09,0.09,
0.28,0.1,0.1,0.1,0.3,0.12,0.25,0.2,0.4,0.33,0.32,0.12,0.12)
x1<-matrix(radc^(1/4))
x2<-matrix(rado^(1/4))
x<-cbind(x1,x2)
xbar <-colMeans(x)
s<-cov(x)
p <- 2
n <- 42
mu<-c(.562,.589)
T2<-n*t(xbar-mu)%*%solve(s)%*%(xbar-mu)
T2
> ## [,1]
## [1,] 1.2573
b.
> ((n-1)*p)/(n-p)
> ## [1] 2.05
> c(df1=p,df2=n-p)
> ## df1 df2
##    2  40
c.
> ((n-1)*p)/(n-p)*qf(0.95, p, n-p)
> ## [1] 6.62504
Since the value of T² = 1.26 < 6.62, we conclude that µ = (0.562, 0.589) lies inside the confidence
region. Equivalently, a test of Ho: µ = (0.562, 0.589) would not be rejected in favor of H1: µ ≠
(0.562, 0.589) at the ⍺ = 0.05 significance level. Therefore, the mean values of X1 and X2 do not
differ significantly from the values 0.562 and 0.589, respectively.
> 1 - pf(T2*(n-p)/((n-1)*p), p, n-p)
> ## [,1]
## [1,] 0.5465654
The p-value = 0.55 > 0.05, which indicates the same conclusion as above.
d.
> e <- eigen(s)   # eigenvalues and eigenvectors of the sample covariance matrix
e$vectors
> lambda1 <- e$values[1]
lambda2 <- e$values[2]
e.
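The half-lengths follow from the standard formula for the axes of the 95% confidence ellipse; the code below is a sketch reconstructed from the quantities (lambda1, lambda2, n, p) defined above.
> cval <- (p*(n-1))/(n*(n-p))*qf(0.95, p, n-p)   # scaling constant of the 95% confidence ellipse
v1 <- sqrt(lambda1)*sqrt(cval)   # half-length of the major axis
v2 <- sqrt(lambda2)*sqrt(cval)   # half-length of the minor axis
c(v1, v2)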
The half-length of the major axis is 0.06 and the half-length of the minor axis is 0.02.
f.
> v1/v2
The length of the major axis is 3.1 times the length of the minor axis.
Problem 04: Testing Vector Means
Improved anesthetics are often developed by first studying their effects on animals. In one study,
19 dogs were initially given the drug pentobarbital. Each dog was then administered carbon
dioxide (CO2) at each of two pressure levels. Next, halothane (H) was added, and the
administration of CO2 was repeated. The response, milliseconds between heartbeats, was
measured for the four treatment combinations: Sleep dog data link
Treatment 4 = low CO2 pressure with H
Answer:
With STATA:
. use "F:\Mostakim\4th Year\Stat H-409; Statistical computing VII;
Multivariate Analysis and Experimental Design\Data\sleepdog.dta", clear
. mvtest means T1 T2 T3 T4
Hotelling T2 = 116.02
Hotelling F(3,16) = 34.38
Prob > F = 0.0000
Since the p-value < 0.05, we may reject the null hypothesis and conclude that at least two of the
treatment means are not equal at the 5% level of significance.
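The same test can be reproduced in R (a sketch; it assumes the sleep-dog data have been read into a data frame named dog with columns T1–T4, where dog is a hypothetical name):
> X <- as.matrix(dog[, c("T1","T2","T3","T4")])
n <- nrow(X); q <- ncol(X)   # n = 19 dogs, q = 4 treatments
C <- matrix(c(-1, 1, 0, 0,
              0, -1, 1, 0,
              0, 0, -1, 1), nrow = 3, byrow = TRUE)   # contrasts of successive treatments
xbar <- colMeans(X); S <- cov(X)
T2 <- n * t(C %*% xbar) %*% solve(C %*% S %*% t(C)) %*% (C %*% xbar)   # Hotelling's T2 (116.02)
Fstat <- (n-q+1)/((n-1)*(q-1)) * T2   # ~ F(q-1, n-q+1) = F(3, 16) under Ho (34.38)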
Problem 05: Testing Vector Means
a. Evaluate T² for testing the hypothesis Ho: µ = (0.8, 2, 60, 30, 2, 200) vs Ha: µ ≠ (0.8, 2, 60,
30, 2, 200) using the following data.
b. Specify the distribution of T² for the situation in (a).
c. Using (a) and (b), test Ho at the level ⍺ = 0.01. What conclusion do you reach?
Problem 06: Testing Vector Means
a. Evaluate T² for testing the hypothesis Ho: µ = (4, 50, 10) vs Ha: µ ≠ (4, 50, 10) using the
following data.
Sweat data
b. Specify the distribution of T² for the situation in (a).
c. Using (a) and (b), test Ho at the level ⍺ = 0.01. What conclusion do you reach?
Problem 07: Generating Multivariate Normal Samples
Draw a sample from a multivariate normal distribution N(µ, Σ) where µ′ = (−3, 2, 1) is the mean
vector and Σ = (3 −1 0; −1 3 0; 0 0 3) is the covariance matrix. Then draw scatter plots and box
plots of the variables and discuss the aspects of multivariate data.
Answer:
With R:
> library("mvtnorm")   # provides rmvnorm() for multivariate normal sampling
mu <- c(-3, 2, 1)      # mean vector
sigma <- matrix(c(3, -1, 0,
                  -1, 3, 0,
                  0, 0, 3), ncol = 3, byrow = TRUE)   # covariance matrix
sample <- rmvnorm(500, mean = mu, sigma = sigma)   # sampling step reconstructed; n = 500 is assumed
# Scatter plots
pairs(sample, main = "Scatter Plots")
> # Boxplot
boxplot(sample, main="Box plot")
By examining the scatter plots, we can observe patterns and correlations between the variables. If
the points cluster tightly around a straight line, it suggests a strong linear relationship; if
the points are scattered with no apparent pattern, it indicates a weak or no linear relationship.
In our case most pairs show no clear relationship, although there appears to be a negative
relation between the first and second variables, consistent with the covariance of −1 specified in Σ.
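These visual impressions can be checked numerically (a quick sketch, reusing the sample drawn above; with the Σ above, the correlation between the first two variables should be near −1/3):
> round(cor(sample), 2)   # pairwise sample correlations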
The box plots provide information about the distribution of each variable, including measures
such as the median, quartiles, and potential outliers. They help us understand the central
tendency, variability, and skewness of the individual variables.
With Stata:
. matrix mu = (-3, 2, 1)
. matrix s = (3,-1,0\-1,3,0\0,0,3)
. drawnorm p q r, n(500) means(mu) cov(s) clear   // drawnorm call reconstructed; n(500) is an assumed sample size
. graph matrix p q r
. graph box p q r
Menu:
Data > Create or change data > Other variable-creation commands > Draw sample from
normal distribution
Problem 08: Generating Multivariate Normal Samples
Draw a sample of 500 observations from a multivariate normal distribution N(µ, Σ) where µ′ =
(5, −6, 0.5) is the mean vector and Σ = (9 5 2; 5 4 1; 2 1 1) is the covariance matrix. Then draw
scatter plots and box plots of the variables and discuss the aspects of multivariate data.
Problem 09: Generating Multivariate Normal Samples
Draw a sample of 1000 observations from a multivariate normal distribution N(µ, Σ) where µ′ =
(−3, 1, 4) is the mean vector and Σ = (1 −2 0; −2 5 0; 0 0 2) is the covariance matrix. Then draw
scatter plots and box plots of the variables and discuss the aspects of multivariate data.
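No worked answer is shown for Problems 08 and 09; a minimal R sketch for Problem 08, reusing rmvnorm() from the mvtnorm package as in Problem 07, would be:
> library("mvtnorm")
mu <- c(5, -6, 0.5)
sigma <- matrix(c(9, 5, 2,
                  5, 4, 1,
                  2, 1, 1), ncol = 3, byrow = TRUE)
x <- rmvnorm(500, mean = mu, sigma = sigma)   # 500 draws from N(mu, sigma)
pairs(x, main = "Scatter Plots")   # pairwise scatter plots
boxplot(x, main = "Box plot")      # box plot of each variable
Problem 09 is solved identically with 1000 observations, mu <- c(-3, 1, 4), and its covariance matrix.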
Problem 10: Test for Multivariate Normality
Construct a Q-Q plot for the “mtcars” data set for the variables “mpg”,“disp”,“hp”,“wt” and
carry out a test for normality. What do you conclude?
Answer:
With R:
> data("mtcars")
x <- mtcars[,c("mpg","disp","hp","wt")]
n <- nrow(x)
p <- ncol(x)
xbar <- colMeans(x)
Sx <- cov(x)
D2 <- mahalanobis(x, xbar, Sx)   # squared Mahalanobis distances
# Chi-square Q-Q plot of the distances (the plotting step is reconstructed)
qqplot(qchisq(ppoints(n), df = p), D2,
       xlab = "Chi-square quantiles", ylab = "Ordered Mahalanobis D2")
abline(0, 1)
We observe from the Q-Q plot that the majority of the points closely align with the reference
line, indicating a reasonable approximation to normality. However, there is one point that
deviates noticeably from the line, which introduces some uncertainty in our decision-making
process. To further assess the multivariate normality, we can perform the “mshapiro.test” to
obtain a formal statistical test.
> library("mvnormtest")
mshapiro.test(t(x)) #Shapiro-Wilk Multivariate Normality Test
> ##
## Shapiro-Wilk normality test
##
## data: Z
## W = 0.82883, p-value = 0.0001491
Since the p-value = 0.0001 < 0.01, we may reject the null hypothesis at the 1% level of significance
and conclude that the variables “mpg”, “disp”, “hp”, and “wt” may not follow a multivariate normal
distribution.
With Stata:
> library("foreign")   # export the data from R so Stata can read them
data("mtcars")
write.dta(mtcars, "C:/Users/Mostakim/Documents/mtcars.dta")
Test for multivariate normality:
. mvtest normality mpg disp hp wt
Menu:
File > Open > mtcars.dta;
Statistics > Multivariate Analysis > Manova, Multivariate Regression and Related >
Multivariate test of means, covariances and matrices
For the Q-Q plot we construct a separate Q-Q plot for each variable, as there is no straightforward
way to compute Mahalanobis distances for four variables in Stata.
Code:
. qnorm mpg
. qnorm disp
. qnorm hp
. qnorm wt
Menu:
Statistics > Summaries, tables, and tests > Distributional plots and tests > Normal quantile
plot
Problem 11: MANOVA
Construct a MANOVA for the gear and carb factors of the data “mtcars”, considering mpg, disp, and
wt as dependent variables, and comment on the results.
Answer:
H10: gear has no significant effect on the model vs H11: gear has a significant effect on the model.
H20: carb has no significant effect on the model vs H21: carb has a significant effect on the model.
H30: the interaction effect of gear and carb is not significant in the model vs H31: the interaction
effect of gear and carb is significant in the model.
With R:
> data(mtcars)
Y <- cbind(mtcars[,1], mtcars[,3], mtcars[,6])   # defining the dependent variables (mpg, disp, wt) as Y
gear<-mtcars[,10]
carb<-mtcars[,11]
fit <- manova(Y ~ gear*carb)
summary(fit, test="Pillai")
Comment:
The p-values for both gear and carb are less than 0.05, so we may reject the null hypotheses at the
5% level of significance and conclude that gear and carb each have a significant effect on the model.
But the p-value for the interaction effect of gear and carb is 0.4355 > 0.05, so we fail to reject
the null hypothesis of no interaction effect at the 5% level of significance and conclude that there
is no interaction effect of gear and carb on the model.
Therefore, the MANOVA results indicate that both gear and carb have significant main effects
on the dependent variables (mpg, disp, and wt). However, the interaction effect between gear and
carb is not statistically significant. These findings suggest that gear and carb independently
influence the dependent variables but do not interact significantly with each other.
With Stata:
Code:
. manova mpg disp wt = gear##carb
Menu:
Statistics > Multivariate Analysis > Manova, Multivariate Regression and Related > MANOVA
Problem 12: Multivariate Regression
Use the dataset “mtcars” to fit multivariate regression model. Use “mpg”, “disp”, “wt” as
dependent variables and “gear” and “carb” as independent variables.
Answer:
> data(mtcars)
y<-cbind(mtcars[,1],mtcars[,3],mtcars[,6])
gear<-mtcars[,10]
carb<-mtcars[,11]
fit <- lm(y ~ gear + carb)
summary(fit)
> ## Response Y1 :
##
## Call:
## lm(formula = Y1 ~ gear + carb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3385 -2.5873 0.3211 1.3742 7.0758
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.2756 2.9465 2.469 0.0197 *
## gear 5.5756 0.8129 6.859 1.56e-07 ***
## carb -2.7537 0.3713 -7.416 3.59e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.211 on 29 degrees of freedom
## Multiple R-squared: 0.7344, Adjusted R-squared: 0.7161
## F-statistic: 40.09 on 2 and 29 DF, p-value: 4.481e-09
##
##
## Response Y2 :
##
## Call:
## lm(formula = Y2 ~ gear + carb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -111.22 -46.33 -12.41 45.84 224.61
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 547.621 71.279 7.683 1.80e-08 ***
## gear -120.567 19.664 -6.131 1.11e-06 ***
## carb 45.402 8.982 5.055 2.18e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 77.69 on 29 degrees of freedom
## Multiple R-squared: 0.6325, Adjusted R-squared: 0.6071
## F-statistic: 24.95 on 2 and 29 DF, p-value: 4.977e-07
##
##
## Response Y3 :
##
## Call:
## lm(formula = Y3 ~ gear + carb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.97574 -0.33262 -0.03964 0.24969 1.05929
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.83881 0.49694 11.750 1.51e-12 ***
## gear -1.00441 0.13709 -7.326 4.53e-08 ***
## carb 0.38478 0.06262 6.144 1.07e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5416 on 29 degrees of freedom
## Multiple R-squared: 0.7134, Adjusted R-squared: 0.6936
## F-statistic: 36.09 on 2 and 29 DF, p-value: 1.352e-08
Coefficient of gear: For a one-unit increase in the number of forward gears (gear), holding the
other regressor constant, miles per gallon (mpg) increases by 5.57 units on average. The p-value
< 0.05 indicates that gear significantly affects mpg at the 5% level of significance.
Similarly, for a one-unit increase in the number of forward gears, holding the other regressor
constant, displacement (disp) decreases by 120.57 units on average. The p-value < 0.05 indicates
that gear significantly affects disp at the 5% level of significance.
Furthermore, for a one-unit increase in the number of forward gears, holding the other regressor
constant, weight (wt) decreases by 1.00 unit on average. The p-value < 0.05 indicates that gear
significantly affects wt at the 5% level of significance.
The other coefficients are interpreted similarly.
> model1 <- lm(mpg ~ gear + carb, data = mtcars)   # reconstructed: the diagnostics below use model1, assumed to be the univariate fit for mpg
library(olsrr)   # residual diagnostic plots, e.g., ols_plot_resid_fit(model1)
Homoscedasticity:
The residuals-vs-fitted plot shows a mild pattern, so it is unclear whether the constant-variance
assumption is met. We conduct the Breusch-Pagan test to check for homoscedasticity.
> library(lmtest)
> bptest(model1)
> ##
## studentized Breusch-Pagan test
##
## data: model1
## BP = 6.0304, df = 2, p-value = 0.04904
Since the Breusch-Pagan p-value (0.049) is less than 0.05, we may reject the null hypothesis and
conclude that heteroscedasticity is present in the regression model. Therefore, the
homoscedasticity assumption is not fulfilled.
Normality:
From the Q-Q plot we can see that most of the residuals lie along the reference line; however,
some of the residuals deviate from it, indicating that they may not come from a normal
distribution. We check this result with the Shapiro-Wilk normality test:
> shapiro.test(model1$residuals)
> ##
## Shapiro-Wilk normality test
##
## data: model1$residuals
## W = 0.94706, p-value = 0.1189
Since the p-value = 0.119 > 0.05, we fail to reject the null hypothesis and may conclude that the
residuals are approximately normally distributed, so the normality assumption is satisfied.
Independence:
The plot of residuals against time order does not exhibit any systematic pattern, indicating that
there is no autocorrelation among the residuals, i.e., the residuals are independent.
Therefore, the independence assumption is satisfied.
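A formal complement to this visual check is the Durbin-Watson test (a sketch using dwtest() from the lmtest package loaded above):
> dwtest(model1)   # Ho: no first-order autocorrelation in the residuals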
Stata:
. use "C:\Users\Mostakim\Documents\mtcars.dta", clear
. mvreg mpg disp wt = gear carb
. mvreg, notable noheader corr
Menu:
Statistics > Multivariate Analysis > Manova, Multivariate Regression and Related >
Multivariate regression;
Statistics > Linear models and related > Regression diagnostics
Satellite applications motivated the development of a silver-zinc battery. Table 7.5 contains
failure data collected to characterize the performance of the battery during its life cycle. Use
these data to answer the following: Battery life data link
(a) Find the estimated linear regression of y on an appropriate (“best”) subset of predictor
variables.
Answer:
> ##
## Call:
## lm(formula = y ~ z1 + z2 + z3 + z4 + z5, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -184.715 -30.446 2.968 26.375 147.850
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2937.7571 4040.6401 -0.727 0.47918
## z1 -33.7934 43.3653 -0.779 0.44879
## z2 -0.1798 13.9073 -0.013 0.98987
## z3 -1.7397 1.3414 -1.297 0.21564
## z4 7.0627 1.9728 3.580 0.00302 **
## z5 1529.2897 2020.2396 0.757 0.46161
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 84.49 on 14 degrees of freedom
## Multiple R-squared: 0.5201, Adjusted R-squared: 0.3487
## F-statistic: 3.034 on 5 and 14 DF, p-value: 0.04627
From the above table, we see that the p-value for z2 is the highest, so we remove z2 from
our model.
> model1<-lm(y~z1+z4+z3+z5,data=data)
summary(model1)
> ##
## Call:
## lm(formula = y ~ z1 + z4 + z3 + z5, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -184.735 -30.235 3.275 26.424 147.452
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2934.986 3898.158 -0.753 0.46315
## z1 -33.776 41.875 -0.807 0.43251
## z4 7.063 1.906 3.706 0.00211 **
## z3 -1.743 1.276 -1.365 0.19227
## z5 1527.713 1948.190 0.784 0.44515
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 81.62 on 15 degrees of freedom
## Multiple R-squared: 0.5201, Adjusted R-squared: 0.3921
## F-statistic: 4.064 on 4 and 15 DF, p-value: 0.01991
From the above table, we see that the p-value for z5 is the highest, so we remove z5 from
our model.
> model2<-lm(y~z1+z4+z3,data=data)
summary(model2)
> ##
## Call:
## lm(formula = y ~ z1 + z4 + z3, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -179.602 -26.148 -2.675 21.164 166.585
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 120.591 109.814 1.098 0.28840
## z1 -33.771 41.368 -0.816 0.42629
## z4 6.891 1.870 3.685 0.00201 **
## z3 -1.716 1.260 -1.361 0.19223
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 80.64 on 16 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4067
## F-statistic: 5.342 on 3 and 16 DF, p-value: 0.009653
From the above table, we see that the p-value for z1 is the highest, so we remove z1 from
our model.
> model3<-lm(y~z3+z4,data=data)
summary(model3)
> ##
## Call:
## lm(formula = y ~ z3 + z4, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -160.32 -26.94 -11.31 31.07 171.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.224 82.528 0.754 0.46118
## z3 -1.399 1.187 -1.178 0.25496
## z4 7.075 1.838 3.849 0.00129 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 79.84 on 17 degrees of freedom
## Multiple R-squared: 0.4796, Adjusted R-squared: 0.4184
## F-statistic: 7.833 on 2 and 17 DF, p-value: 0.003881
From the above table, we see that the p-value for z3 is the highest, so we remove z3 from
our model.
> model4<-lm(y~z4,data=data)
summary(model4)
> ##
## Call:
## lm(formula = y ~ z4, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -153.37 -43.61 -11.61 31.08 200.93
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -22.842 40.401 -0.565 0.5788
## z4 6.930 1.854 3.739 0.0015 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 80.7 on 18 degrees of freedom
## Multiple R-squared: 0.4371, Adjusted R-squared: 0.4058
## F-statistic: 13.98 on 1 and 18 DF, p-value: 0.001504
Therefore, our final model with the best regressor is: ŷ = −22.84 + 6.93 z4.
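The same backward elimination can be automated (a sketch using step(); note that step() drops terms by AIC rather than by p-value, so it may retain a different subset):
> full <- lm(y ~ z1 + z2 + z3 + z4 + z5, data = data)
step(full, direction = "backward")   # AIC-based backward elimination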
Residual Diagnostics:
i. Plot the residuals against the predicted values: no systematic pattern indicates equal
variances and no dependence on ŷ, i.e., the model is good.
ii. Plot the residuals against a predictor variable: a systematic pattern suggests the need for
more terms in the model; no systematic pattern indicates the model is fine (see the sketch below).
iii. Construct a Q-Q plot of the residuals: if the residuals lie along the reference line, they are
normally distributed and the model is adequate (see the sketch below).
iv. Plot the residuals versus time: no systematic pattern indicates that the residuals are
independent and the model is adequate.
> plot(model4$fitted.values,model4$residuals); abline(h=0)
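Plots (ii) and (iii) from the list above are not shown; a minimal sketch to produce them:
> plot(data$z4, model4$residuals); abline(h=0)     # (ii) residuals vs the predictor z4
qqnorm(model4$residuals); qqline(model4$residuals) # (iii) normal Q-Q plot of the residuals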
> plot(1:20, model4$residuals); abline(h=0)   # (iv) residuals in observation order (proxy for time order)
We leave the interpretation of the above plots as an exercise for the reader!
Similarly, find the best predictors for the response variable Y2.
Z1 = Gender: 1 if female, 0 if male (GEN)
Answer:
> data<-read.csv("F:/Mostakim/4th Year/Stat H-409; Statistical computing
VII; Multivariate Analysis and Experimental
Design/Data/amitriptyline.csv", header=TRUE)
Y<-cbind(data$y1,data$y2)
model<-lm(Y~z1+z2+z3+z4+z5, data=data)
summary(model)
> ## Response Y1 :
##
## Call:
## lm(formula = Y1 ~ z1 + z2 + z3 + z4 + z5, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -399.2 -180.1 4.5 164.1 366.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.879e+03 8.933e+02 -3.224 0.008108 **
## z1 6.757e+02 1.621e+02 4.169 0.001565 **
## z2 2.848e-01 6.091e-02 4.677 0.000675 ***
## z3 1.027e+01 4.255e+00 2.414 0.034358 *
## z4 7.251e+00 3.225e+00 2.248 0.046026 *
## z5 7.598e+00 3.849e+00 1.974 0.074006 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 281.2 on 11 degrees of freedom
## Multiple R-squared: 0.8871, Adjusted R-squared: 0.8358
## F-statistic: 17.29 on 5 and 11 DF, p-value: 6.983e-05
##
##
## Response Y2 :
##
## Call:
## lm(formula = Y2 ~ z1 + z2 + z3 + z4 + z5, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -373.85 -247.29 -83.74 217.13 462.72
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.729e+03 9.288e+02 -2.938 0.013502 *
## z1 7.630e+02 1.685e+02 4.528 0.000861 ***
## z2 3.064e-01 6.334e-02 4.837 0.000521 ***
## z3 8.896e+00 4.424e+00 2.011 0.069515 .
## z4 7.206e+00 3.354e+00 2.149 0.054782 .
## z5 4.987e+00 4.002e+00 1.246 0.238622
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 292.4 on 11 degrees of freedom
## Multiple R-squared: 0.8764, Adjusted R-squared: 0.8202
## F-statistic: 15.6 on 5 and 11 DF, p-value: 0.0001132
For both response variables Y1 and Y2, z5 is not significant, so we drop z5 from the model.
> model2 <- lm(Y ~ z1 + z2 + z3 + z4, data = data)   # refit without z5 (code reconstructed)
summary(model2)
## z4 6.352e+00 3.358e+00 1.892 0.082896 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 299.1 on 12 degrees of freedom
## Multiple R-squared: 0.8589, Adjusted R-squared: 0.8119
## F-statistic: 18.27 on 4 and 12 DF, p-value: 4.847e-05
For both responses Y1 and Y2, z4 is not significant, so we drop z4 from the model.
> model3 <- lm(Y ~ z1 + z2 + z3, data = data)   # refit without z4 (code reconstructed)
summary(model3)
## z3 6.998e+00 4.809e+00 1.455 0.169336
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 327.4 on 13 degrees of freedom
## Multiple R-squared: 0.8169, Adjusted R-squared: 0.7746
## F-statistic: 19.33 on 3 and 13 DF, p-value: 4.498e-05
For both responses Y1 and Y2, z3 is not significant, so we drop z3 from the model.
> model4 <- lm(Y ~ z1 + z2, data = data)   # refit without z3 (code reconstructed)
summary(model4)
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 340.2 on 14 degrees of freedom
## Multiple R-squared: 0.787, Adjusted R-squared: 0.7566
## F-statistic: 25.87 on 2 and 14 DF, p-value: 1.986e-05
Both z1 and z2 are significant in our model. Therefore, the backward elimination terminates,
and our final model contains z1 and z2 as predictors.
Type the data into an Excel/SPSS file and read it into R/Stata to answer the following questions:
(a) Perform a regression analysis using only the first response Y1.
(i) Suggest and fit appropriate (with best predictors) linear regression models.
(iii) Construct a 95% prediction interval for NO2 corresponding to z1 = 10 and z2 = 80.
(b) Perform a multivariate multiple regression analysis using both responses Y1 and Y2.
(iii) Construct a 95% prediction ellipse for both NO2 and O3 for z1 = 10 and z2 = 80.
Compare this ellipse with the prediction interval in Part a (iii). Comment.
Answer:
Try yourself!
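As a starting hint for part a(iii), a sketch (the data-set name airdat and the variable names y1, z1, z2 are assumed placeholders):
> fit1 <- lm(y1 ~ z1 + z2, data = airdat)   # univariate model for NO2
predict(fit1, newdata = data.frame(z1 = 10, z2 = 80),
        interval = "prediction", level = 0.95)   # 95% prediction interval for NO2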
References and Further Reading