
VIETNAM NATIONAL UNIVERSITY, HO CHI MINH CITY

UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING

CC04

Assignment

Probabilities and Statistics

Advisor: Nguyễn Tiến Dũng


Students: Nguyễn Đình Sáng (CC04) - 1952955
Vũ Minh Long (CC04) - 1950024
Trần Nhật Tân (CC04) - 1852732
Đinh Hoàng Anh (CC04) - 1952553
Nguyễn Bình Minh (CC04) - 1952846

HO CHI MINH CITY, MAY 2021


University of Technology, Ho Chi Minh City
Faculty of Computer Science and Engineering

Contents
1 Member list & Workload

2 Project 1
2.1 Topic 8-1
2.1.1 Requirement
2.1.2 Solving method
2.1.3 Theory base
2.1.4 Calculation in manual
2.1.5 Solving the problem in R
2.1.6 Conclusion
2.2 Topic 8-2
2.2.1 Requirement
2.2.2 Solving method
2.2.3 Theory base
2.2.4 Calculate in manual
2.2.5 Display the result
2.2.6 Solving the problem in R
2.2.7 Conclusion
2.3 Topic 8-3
2.3.1 Requirement
2.3.2 Solving method
2.3.3 Theory base
2.3.4 Calculation in manual
2.3.5 Solving the problem in R
2.3.6 Conclusion
2.4 Topic 8-4
2.4.1 Requirement
2.4.2 Solving method
2.4.3 Theory base
2.4.4 Calculation in manual
2.4.5 Solving the problem in R
2.4.6 Conclusion

3 Project 2
3.1 Import data: Diet.csv
3.2 Cleaning data: NA
3.3 Data visualization
3.3.1 Descriptive statistics for each of the variables
3.3.2 Boxplot
3.4 t.test: between pre.weight and weight6weeks
3.5 One way ANOVA: What is the best diet for weight loss?
3.6 Two way ANOVA: How do Diet and gender affect weightLOST?

4 References

Assignment for Probabilities and Statistics - Academic year 2020 - 2021 Page 1/34

1 Member list & Workload


No.  Full name          Student ID  Problems        Percentage of work
1    Nguyễn Đình Sáng   1952955     P1.8.4 & P2.2   100%
2    Vũ Minh Long       1950024     P1.8.2 & P2.3   100%
3    Đinh Hoàng Anh     1952553     P1.8.3 & P2.4   100%
4    Trần Nhật Tân      1852732     P1.8.1 & P2.5   100%
5    Nguyễn Bình Minh   1952846     P1.8.1 & P2.6   100%

2 Project 1
2.1 Topic 8 - 1
2.1.1 Requirement

The following table shows the diameter X and the height Y of 20 trees:

X  2.3  2.5  2.6  3.1  3.4  3.7  7.3  3.9  4  4.1  4.1  4.2  4.4  4.7  5.1  5.5  5.8  6.2  6.9  6.9
Y  7    8    4    4    6    6    14   12   8  5    7    8    7    9    10   13   7    11   11   16

Fit the linear regression model and check all assumptions.

2.1.2 Solving method

a) Fit the linear regression

- We can find the linear regression model manually or by using R

b) Check all assumptions


There are four assumptions we should check:
- Linearity: checked with a scatter plot.
- Independence of errors: checked with a plot of residuals versus fitted values or the Durbin-Watson test.
- Normality of errors: checked with a Q-Q plot or the Shapiro-Wilk test.
- Homoscedasticity (constant variance): checked with a non-constant variance test.


2.1.3 Theory base

a) Linear regression
Linear regression attempts to model the relationship between two variables by fitting a
linear equation to observed data. One variable is considered to be an explanatory variable, and
the other is considered to be a dependent variable. For example, a modeler might want to relate
the weights of individuals to their heights using a linear regression model.
A linear regression line has an equation of the form Y = a + bX, where X is the explanatory
variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the
value of y when x = 0).
b) Four assumptions
- Linearity: The response variable y should be linearly related to the explanatory variable X.
- Independence of errors: After fitting the model, the residual errors should be independent and
identically distributed random variables. If the residual errors are not independent, they will
likely show some sort of pattern (which is not always obvious to the naked eye). Such a pattern
carries information that the regression model was not able to capture, which makes the model
sub-optimal.
- Normality of errors: If the residual errors are not N(0, σ²) distributed, then significance tests
and confidence intervals that rely on the errors being N(0, σ²) are no longer reliable.
- Homoscedasticity (constant variance): Besides being normally distributed with mean zero, the
residual errors should have a constant variance σ². In particular, σ² should not be a function of
the response variable y, and thereby indirectly of the explanatory variable X.

2.1.4 Calculation in manual

- Linear regression model: Y = a + bX


$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = 42.6055$
$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = 71.445$
$S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = 208.55$

- Therefore
$b = \frac{S_{xy}}{S_{xx}} = 1.676896175$
$a = \bar{y} - b\bar{x} = 1.045275845$

- So the linear regression model is: y = 1.045 + 1.677x
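The manual calculation can be cross-checked with a few lines of base R. This is only a sketch: it uses the data vectors as they are entered in the R session of section 2.1.5, and the variable names (Sxx, Sxy, Syy, a, b) are chosen here to mirror the formulas.

```r
# Tree data as entered in the R session of section 2.1.5
x <- c(2.3, 2.5, 2.6, 3.1, 3.4, 3.7, 7.3, 3.9, 4, 4.1, 4.1, 4.2, 4.4, 4.7, 5.1,
       5.5, 5.8, 6.2, 6.9, 6.9)
y <- c(7, 8, 4, 4, 6, 6, 14, 12, 8, 5, 7, 8, 7, 9, 10, 13, 7, 11, 11, 16)

# Corrected sums of squares and cross-products
Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Syy <- sum((y - mean(y))^2)

# Least-squares slope and intercept
b <- Sxy / Sxx
a <- mean(y) - b * mean(x)

round(c(Sxx = Sxx, Sxy = Sxy, Syy = Syy, a = a, b = b), 4)
```

The rounded values should agree with the hand calculation: Sxx = 42.6055, Sxy = 71.445, Syy = 208.55, a = 1.0453 and b = 1.6769.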


2.1.5 Solving the problem in R

a) Fit the linear regression

- First, we assign the diameters to the x variable and the heights to the y variable:

> x <- c(2.3, 2.5, 2.6, 3.1, 3.4, 3.7, 7.3, 3.9, 4, 4.1, 4.1, 4.2, 4.4, 4.7, 5.1, 5.5,
+        5.8, 6.2, 6.9, 6.9)
> y <- c(7, 8, 4, 4, 6, 6, 14, 12, 8, 5, 7, 8, 7, 9, 10, 13, 7, 11, 11, 16)

- Second, we find the linear regression model for x and y:


> relation <- lm(y ~ x)
> print(relation)

- We get:
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
1.045 1.677
So the linear regression model is: y = 1.045 + 1.677x
We plot the data and add the fitted regression line:
> plot(x, y, col = "blue", main = "Diameter and Height regression",
+      cex = 1.3, pch = 16, xlab = "Diameter", ylab = "Height")
> abline(lm(y ~ x))


b) Check all assumptions


- Linearity: The scatter plot shows that the relationship between diameter and height is
approximately linear.

- Independence of errors: Plot of residuals versus fitted values:

> plot(relation, 1)

The plot of residuals versus fitted values shows that the correlation between them is
approximately 0 and that there is no systematic pattern, thus this assumption is met.


We use the Durbin-Watson test (from the car package):

> library(car)
> durbinWatsonTest(relation)

We get:
 lag Autocorrelation D-W Statistic p-value
   1     0.009385456      1.802587     0.5
 Alternative hypothesis: rho != 0
As the p-value is larger than 0.05, we cannot reject the null hypothesis of no autocorrelation,
so the errors can be treated as independent.
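For reference, the Durbin-Watson statistic itself is simple to compute from the residuals: it is the sum of squared successive differences divided by the sum of squared residuals, and values close to 2 indicate little first-order autocorrelation. The sketch below uses only base R and the same data vectors as the session above. It does not reproduce the p-value, and the names fit, e and dw are chosen for this illustration.

```r
# Same data as in the regression session above
x <- c(2.3, 2.5, 2.6, 3.1, 3.4, 3.7, 7.3, 3.9, 4, 4.1, 4.1, 4.2, 4.4, 4.7, 5.1,
       5.5, 5.8, 6.2, 6.9, 6.9)
y <- c(7, 8, 4, 4, 6, 6, 14, 12, 8, 5, 7, 8, 7, 9, 10, 13, 7, 11, 11, 16)

fit <- lm(y ~ x)
e <- residuals(fit)

# Durbin-Watson statistic: sum of squared successive differences of the
# residuals (in observation order) over the residual sum of squares
dw <- sum(diff(e)^2) / sum(e^2)
dw
```

Because the statistic depends on the order of the observations, it is only meaningful when that order reflects how the data were collected.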

- Normality of errors: We check the normality of errors using a Q-Q plot:
> plot(relation, 2)

The Q-Q plot shows that the standardized residuals are approximately normally distributed.


Shapiro-Wilk test:
> library(MASS)
> shapiro <- studres(relation)
> shapiro.test(shapiro)

We get the result:

Shapiro-Wilk normality test

data: shapiro
W = 0.96613, p-value = 0.6719
The p-value is larger than 0.05, so we cannot reject the hypothesis that the errors are normally
distributed.
- Equal variance (Homoscedasticity):
Non-constant variance test (ncvTest from the car package):
Let H0 denote the hypothesis that the variance of the errors is constant
H1 denote the hypothesis that the variance of the errors is not constant

> library(car)
> ncvTest(relation)


We get:
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 0.01590465, Df = 1, p = 0.89964
The p-value is larger than 0.05, so we cannot reject the null hypothesis of constant variance;
the variance of the errors can be treated as constant.
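To make the idea behind a score test for non-constant variance concrete, here is a base-R sketch of the textbook Breusch-Pagan form. The squared residuals, scaled by the maximum-likelihood variance estimate, are regressed on the fitted values, and half of the explained sum of squares is compared with a chi-square distribution with 1 degree of freedom. This is an illustration under that assumption, not a statement of how any particular package computes its result. The names u, aux, ess, chisq and p_value are introduced here.

```r
# Same data as in the regression session above
x <- c(2.3, 2.5, 2.6, 3.1, 3.4, 3.7, 7.3, 3.9, 4, 4.1, 4.1, 4.2, 4.4, 4.7, 5.1,
       5.5, 5.8, 6.2, 6.9, 6.9)
y <- c(7, 8, 4, 4, 6, 6, 14, 12, 8, 5, 7, 8, 7, 9, 10, 13, 7, 11, 11, 16)

fit <- lm(y ~ x)
e <- residuals(fit)
n <- length(e)

# Squared residuals scaled by the ML variance estimate (their mean is 1)
u <- e^2 / (sum(e^2) / n)

# Auxiliary regression of the scaled squared residuals on the fitted values
aux <- lm(u ~ fitted(fit))

# Explained sum of squares of the auxiliary regression
ess <- sum((fitted(aux) - mean(u))^2)

# Score statistic and its p-value (1 degree of freedom: one variance predictor)
chisq <- ess / 2
p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)
c(chisq = chisq, p_value = p_value)
```

A large p-value means there is no evidence that the error variance changes with the fitted values.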

2.1.6 Conclusion

After checking all assumptions, we can conclude that the diameter and the height of those
trees are linearly related, and that all four assumptions are met.

2.2 Topic 8 - 2
2.2.1 Requirement
A company wants to target 3 overseas markets. To compare the potential market share of the
company and that of some competitors in the markets, their products were sold to potential
customers. The following results were obtained.

                    Markets
                    A    B    C
The company         55   38   24
Competitor 1        28   30   21
Competitor 2        20   18   31
Other competitors   47   64   74

Draw a conclusion at the significance level of 5%.

2.2.2 Solving method


- Two-way ANOVA: to test whether the number of products sold depends on the company and on
the market.
Set up the hypotheses:
H0a : The company has no effect on the number of products sold.
H0b : The market has no effect on the number of products sold.
H1a : The company has an effect on the number of products sold.
H1b : The market has an effect on the number of products sold.
- Tukey HSD: to find which pairs of companies differ significantly.

2.2.3 Theory base

ANOVA tests for significance using the F-test for statistical significance. The F-test is a
groupwise comparison test, which means it compares the variance in each group mean to the
overall variance in the dependent variable.
If the variance within groups is smaller than the variance between groups, the F-test will
find a higher F-value, and therefore a higher likelihood that the difference observed is real and
not due to chance.


A two-way ANOVA with interaction tests three null hypotheses at the same time:

• There is no difference in group means at any level of the first independent variable.
• There is no difference in group means at any level of the second independent variable.
• The effect of one independent variable does not depend on the effect of the other
independent variable (a.k.a. no interaction effect).

A two-way ANOVA without interaction (a.k.a. an additive two-way ANOVA) only tests the
first two of these hypotheses.
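With only one observation per cell, as in the data of section 2.2.1, the interaction model uses up every degree of freedom, which is why an additive model is the only one that leaves an error term to test against. The following sketch illustrates this with base R. The object names additive and with_interaction are chosen for this example.

```r
# Data of section 2.2.1: one observation per company/market cell
products  <- c(55, 38, 24, 28, 30, 21, 20, 18, 31, 47, 64, 74)
companies <- gl(4, 3, 12)   # 4 companies, 3 consecutive observations each
markets   <- gl(3, 1, 12)   # markets A, B, C cycling within each company

additive         <- aov(products ~ companies + markets)
with_interaction <- aov(products ~ companies * markets)

# Residual degrees of freedom of each model
df.residual(additive)          # (12 - 1) - 3 - 2 = 6
df.residual(with_interaction)  # 0: the model is saturated, no F-test is possible
```

With replicated observations in each cell the second model would keep a positive number of residual degrees of freedom and the interaction could be tested.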

Adj MS

The calculations for the mean square for the factors, interaction, and error follow:

$MS(A) = \frac{SS(A)}{df(A)}$, $MS(B) = \frac{SS(B)}{df(B)}$
$MS(AB) = \frac{SS(AB)}{df(AB)}$, $MS(Error) = \frac{SS(Error)}{df(Error)}$

Notation

MS: Mean square
SS: Sum of squares
df: Degrees of freedom

Adj SS

The sum of squared distances. SS Total is the total variation in the data. SS (A) and SS (B)
are the amount of variation of the estimated factor level mean around the overall mean. These
statistics are also known as the sum of squares for factor A or factor B. SS Error is the amount
of variation of the observations from their fitted values. The calculations follow:

Source of variation | Df | Sum of squares | Mean square | F
Factor A (between groups) | a − 1 | $SS_A = \sum_{i=1}^{a} n_i (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2$ | $MS_A = \frac{SS_A}{a - 1}$ | $\frac{MS_A}{MS_E}$
Factor B (between groups) | b − 1 | $SS_B = \sum_{j=1}^{b} n_j (\bar{y}_{\cdot j} - \bar{y}_{\cdot\cdot})^2$ | $MS_B = \frac{SS_B}{b - 1}$ | $\frac{MS_B}{MS_E}$
Error (within groups) | (a − 1)(b − 1) | $SS_E = SS_T - SS_A - SS_B$ | $MS_E = \frac{SS_E}{(a - 1)(b - 1)}$ |
Total | N − 1 | $SS_T = \sum_{i=1}^{a} \sum_{j=1}^{b} (y_{ij} - \bar{y}_{\cdot\cdot})^2$ | |

Here n_i (n_j) is the number of observations at level i of factor A (level j of factor B).

P-value – Analysis of variance table


The degrees of freedom for the F statistic that you use to calculate the p-value depend on
the term that is in the test.

When you test a term, the denominator degrees of freedom are always the degrees of freedom
for error. The degrees of freedom for error depend on whether the interaction term is in the
model or not.

• With no interaction in the model, DF Error = (n − 1) − (a − 1) − (b − 1)
• With the interaction in the model, DF Error = (n − 1) − (a − 1) − (b − 1) − (a − 1)(b − 1)

When you test a term, the numerator degrees of freedom depend on the term.

• For F (A), the degrees of freedom for the numerator are a − 1
• For F (B), the degrees of freedom for the numerator are b − 1
• For F (AB), the degrees of freedom for the numerator are (a − 1)(b − 1)

For the lack-of-fit test, the degrees of freedom follow:
• Denominator DF = n − c
• Numerator DF = c − p
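As a small illustration of how these degrees of freedom turn into a p-value, the sketch below takes a = 4 and b = 3 (the layout of the data in this topic, with one observation per cell) and the F value 5.542 reported by R in section 2.2.6. The variable names are chosen for this example.

```r
a <- 4           # levels of factor A (companies)
b <- 3           # levels of factor B (markets)
n <- a * b       # one observation per cell

# Error degrees of freedom without an interaction term
df_error <- (n - 1) - (a - 1) - (b - 1)

# Upper-tail probability of the F distribution for the test of factor A
f_value <- 5.542
p_value <- pf(f_value, df1 = a - 1, df2 = df_error, lower.tail = FALSE)

c(df_error = df_error, p_value = round(p_value, 4))
```

The numerator degrees of freedom come from the term being tested and the denominator degrees of freedom from the error term, as described above.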

2.2.4 Calculate in manual

- We have 2 factors: Companies and Markets

a, Calculate the mean of each factor

                    Markets
                    A      B      C      Average
The company         55     38     24     39
Competitor 1        28     30     21     26.33
Competitor 2        20     18     31     23
Other competitors   47     64     74     61.67
Average             37.5   37.5   37.5
b, Calculate the Sum of Squares
- We have the formula: $SS = \sum_{i=1}^{n} (y_i - \bar{y})^2$
Where:
SS = Sum of Squares
n = number of observations
y_i = value in a sample
ȳ = mean value of a sample
- Therefore:
*Companies


The company
Products   The company mean   Grand mean   (The company mean - grand mean)^2
55         39                 37.5         2.25
38         39                 37.5         2.25
24         39                 37.5         2.25
Sum of Squares                             6.75

Competitor 1
Products   Competitor 1 mean   Grand mean   (Competitor 1 mean - grand mean)^2
28         26.33               37.5         124.77
30         26.33               37.5         124.77
21         26.33               37.5         124.77
Sum of Squares                              374.31

Competitor 2
Products   Competitor 2 mean   Grand mean   (Competitor 2 mean - grand mean)^2
20         23                  37.5         210.25
18         23                  37.5         210.25
31         23                  37.5         210.25
Sum of Squares                              630.75

Other competitors
Products   Other competitors mean   Grand mean   (Other competitors mean - grand mean)^2
47         61.67                    37.5         584.19
64         61.67                    37.5         584.19
74         61.67                    37.5         584.19
Sum of Squares                                   1752.57

Total sum of squares of companies   2764.38

*Markets
- We can see that the mean of each markets is equal to each other and also equal to the
grand mean
=> The total sum of square of markets is equal to 0
*Total
- Formula: $SST = \sum_{i=1}^{a} \sum_{j=1}^{b} y_{ij}^2 - \frac{y_t^2}{ab}$
Where
SST = Total Sum of Squares
a, b = The number of rows and columns
y_t = Sum of all observations


- Therefore
$SST = (55^2 + 38^2 + ... + 74^2) - \frac{450^2}{12} = 3761$
*Error
- Formula: SSE = SST − SS(Companies) − SS(Markets)
- Therefore
SSE = 3761 − 2764.38 − 0 = 996.62
c, Calculate the Degree of Freedom


- Formula: DF = n - 1
Where:
DF = Degrees of Freedom
n = The number of levels of the factor (rows or columns)
- Therefore
DF (Companies) = 4 - 1 = 3
DF (Markets) = 3 - 1 = 2
DF (Error) = 3 x 2 = 6

d, Calculate the mean square

- Formula: $MS = \frac{SS}{DF}$
Where:
SS = Sum of Squares
DF = Degrees of Freedom
MS = Mean Square
- Therefore
MS (Companies) = 2764.38 / 3 = 921.46
MS (Markets) = 0
MS (Error) = 996.62 / 6 = 166.1
e, Calculate the F ratio
- We have formula: $F = \frac{MS}{MS(Error)}$
- Therefore
F (Companies) = 921.46 / 166.1 = 5.55
F (Markets) = 0


2.2.5 Display the result

SS DF MS F
Companies 2764.38 3 921.46 5.55
Markets 0 2 0 0
Error 996.62 6 166.1
Total 3761
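
The table above can be reproduced without rounding the group means, using the matrix form of the same sums of squares. This is a sketch in base R; the object names (sold, grand, ss_companies and so on) are chosen for this illustration.

```r
# Products sold, rows = companies, columns = markets (section 2.2.1)
sold <- matrix(c(55, 38, 24,
                 28, 30, 21,
                 20, 18, 31,
                 47, 64, 74),
               nrow = 4, byrow = TRUE,
               dimnames = list(c("The company", "Competitor 1",
                                 "Competitor 2", "Other competitors"),
                               c("A", "B", "C")))

grand <- mean(sold)

# Each row mean is based on 3 observations, each column mean on 4
ss_companies <- ncol(sold) * sum((rowMeans(sold) - grand)^2)
ss_markets   <- nrow(sold) * sum((colMeans(sold) - grand)^2)
ss_total     <- sum((sold - grand)^2)
ss_error     <- ss_total - ss_companies - ss_markets

f_companies <- (ss_companies / 3) / (ss_error / 6)

round(c(ss_companies = ss_companies, ss_markets = ss_markets,
        ss_error = ss_error, ss_total = ss_total, F = f_companies), 2)
```

At full precision SS(Companies) is about 2763.67 and SS(Error) about 997.33. The small gap to the hand-calculated 2764.38 and 996.62 comes from rounding the means to two decimals, and the unrounded values match the R output in section 2.2.6.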

2.2.6 Solving the problem in R

a, Import data
> companies <- gl(4, 3, 12)
> markets <- gl(3, 1, 12)
> id <- 1:12
> products <- c(55, 38, 24, 28, 30, 21, 20, 18, 31, 47, 64, 74)
> data <- data.frame(companies, markets, id, products)
> data

b, Result

   companies markets id products
1          1       1  1       55
2          1       2  2       38
3          1       3  3       24
4          2       1  4       28
5          2       2  5       30
6          2       3  6       21
7          3       1  7       20
8          3       2  8       18
9          3       3  9       31
10         4       1 10       47
11         4       2 11       64
12         4       3 12       74

c, Two-way ANOVA test

> x <- aov(products ~ companies + markets)
> summary(x)

=> Result:
            Df Sum Sq Mean Sq F value Pr(>F)
companies    3 2763.7   921.2   5.542 0.0365
markets      2    0.0     0.0   0.0   1.0
Residuals    6  997.3   166.2

d, Tukey HSD test

> TukeyHSD(x)

=> Result: Tukey multiple comparisons of means, 95% family-wise confidence level
*companies


      diff     lwr     upr   p adj
2-1  -12.67  -49.11   23.77  0.657
3-1  -16.00  -52.44   20.44  0.482
4-1   22.67  -13.77   59.11  0.238
3-2   -3.33  -39.77   33.11  0.988
4-2   35.33   -1.11   71.77  0.056
4-3   38.67    2.26   75.11  0.039

*markets
diff lwr upr p adj
2-1 0 -27.97 27.97 1
3-1 0 -27.97 27.97 1
3-2 0 -27.97 27.97 1

2.2.7 Conclusion
- Because F (Companies) = 5.55 > 4.76 = F(0.05; 3, 6), we reject H0a. Therefore, the company
has a significant effect on the number of products sold. (From ANOVA)
- Because F (Markets) = 0, we cannot reject H0b. Therefore, there is no evidence that the market
affects the number of products sold. (From ANOVA)
- Only the pair Competitor 2 and Other competitors has an adjusted p-value below α = 0.05
(0.039), so this is the only significant pairwise difference. (From Tukey HSD)
- The total number of products sold in each market is equal: 150 products.
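
The critical value 4.76 used above can be obtained from the F distribution with qf() instead of a printed table. The sketch below assumes α = 0.05 and the degrees of freedom 3 and 6 from the ANOVA table. The names alpha, f_crit and f_companies are chosen for this example.

```r
alpha <- 0.05

# Upper 5% point of the F distribution with 3 and 6 degrees of freedom
f_crit <- qf(1 - alpha, df1 = 3, df2 = 6)

# F ratio for the companies factor from the manual calculation
f_companies <- 5.55

round(f_crit, 2)       # about 4.76
f_companies > f_crit   # TRUE means H0a is rejected at the 5% level
```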

2.3 Topic 8 - 3
2.3.1 Requirement
The river water level was measured at several locations in a province on the same day and
the data follow.

Measurement time   Measurement location
                   F1    F2    F3    F4
1                  5.5   4.9   4.6   4.5
2                  4.6   5.1   4.8   6.2
3                  5.8   6.5   5.8   4.8
4                  5.9   5.4   5.1   4.8
5                  6.0   6.1   6.2   6.5
6                  6.7         7.1   6.8
7                  7.2

At the significance level α = 2%, are the average river water levels each day at these locations
significantly different?

2.3.2 Solving method


The method we use is one-way ANOVA with Unequal Sample Sizes.


2.3.3 Theory base


We may describe the observations in the table by the linear statistical model

$Y_{ij} = \mu + \tau_i + \epsilon_{ij}, \quad i = 1, 2, ..., a, \quad j = 1, 2, ..., n_i$

where Y_ij is a random variable denoting the (ij)th observation, µ is a parameter common to
all treatments called the overall mean, τ_i is a parameter associated with the ith treatment called
the ith treatment effect, and ε_ij is a random error component.

We assume that the errors ε_ij are normally and independently distributed with mean zero
and variance σ². Therefore, each treatment can be thought of as a normal population with mean
µ_i and variance σ².

We have to use n_i instead of n because the number of measurement times is not the same for
every treatment.

In the fixed-effects model, the treatment effects τ_i are usually defined as deviations from
the overall mean µ, so that

$\sum_{i=1}^{a} \tau_i = 0$

Let y_i· represent the total of the observations under the ith treatment and ȳ_i· represent the
average of the observations under the ith treatment. Similarly, let y_·· represent the grand total of
all observations and ȳ_·· represent the grand mean of all observations. Expressed mathematically,

$y_{i\cdot} = \sum_{j=1}^{n_i} y_{ij}$
$\bar{y}_{i\cdot} = \frac{y_{i\cdot}}{n_i}$ with i = 1, 2, ..., a
$y_{\cdot\cdot} = \sum_{i=1}^{a} \sum_{j=1}^{n_i} y_{ij}$
$\bar{y}_{\cdot\cdot} = \frac{y_{\cdot\cdot}}{N}$ with $N = \sum_{i=1}^{a} n_i$

We are interested in testing the equality of the a treatment means µ_1, µ_2, ..., µ_a. We find that
this is equivalent to testing the hypotheses
H0 : τ_1 = τ_2 = ... = τ_a = 0


H1 : At least one τ_i ≠ 0

Thus, if the null hypothesis is true, each observation consists of the overall mean µ plus a
realization of the random error component ε_ij. This is equivalent to saying that all N
observations are taken from a normal distribution with mean µ and variance σ². Therefore, if the
null hypothesis is true, changing the levels of the factor has no effect on the mean response.

The computing formulas for the sums of squares in the ANOVA with unequal sample sizes n_i
in each treatment are:
$SS_T = \sum_{i=1}^{a} \sum_{j=1}^{n_i} y_{ij}^2 - \frac{y_{\cdot\cdot}^2}{N}$
$SS_{Treatments} = \sum_{i=1}^{a} \frac{y_{i\cdot}^2}{n_i} - \frac{y_{\cdot\cdot}^2}{N}$
$SS_E = SS_T - SS_{Treatments}$
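
These computing formulas translate directly into base R. The sketch below applies them to the river data of section 2.3.1, with one vector per location so that the unequal n_i are handled automatically. The names levels_by_location, n_i, tot_i and so on are chosen for this illustration.

```r
# River water levels per location (section 2.3.1); group sizes differ
levels_by_location <- list(
  F1 = c(5.5, 4.6, 5.8, 5.9, 6.0, 6.7, 7.2),
  F2 = c(4.9, 5.1, 6.5, 5.4, 6.1),
  F3 = c(4.6, 4.8, 5.8, 5.1, 6.2, 7.1),
  F4 = c(4.5, 6.2, 4.8, 4.8, 6.5, 6.8)
)

n_i   <- lengths(levels_by_location)       # observations per treatment
tot_i <- sapply(levels_by_location, sum)   # treatment totals y_i.
N     <- sum(n_i)                          # total number of observations
total <- sum(tot_i)                        # grand total y..
a     <- length(n_i)                       # number of treatments

ss_t  <- sum(unlist(levels_by_location)^2) - total^2 / N
ss_tr <- sum(tot_i^2 / n_i) - total^2 / N
ss_e  <- ss_t - ss_tr

f_ratio <- (ss_tr / (a - 1)) / (ss_e / (N - a))

round(c(SS_T = ss_t, SS_Treatments = ss_tr, SS_E = ss_e, F = f_ratio), 4)
```

The treatment sum of squares has a − 1 = 3 degrees of freedom and the error sum of squares has N − a = 20, which gives the F ratio used in the test.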

2.3.4 Calculation in manual


Measurement time   Measurement location
                   F1       F2    F3     F4
1                  5.5      4.9   4.6    4.5
2                  4.6      5.1   4.8    6.2
3                  5.8      6.5   5.8    4.8
4                  5.9      5.4   5.1    4.8
5                  6.0      6.1   6.2    6.5
6                  6.7            7.1    6.8
7                  7.2
SUM                41.7     28    33.6   33.6
AVG                5.9571   5.6   5.6    5.6

$SS_T = (5.5^2 + 4.9^2 + ... + 7.2^2) - \frac{136.9^2}{24} = 16.2896$
$SS_{Treatments} = \left(\frac{41.7^2}{7} + \frac{28^2}{5} + \frac{33.6^2}{6} + \frac{33.6^2}{6}\right) - \frac{136.9^2}{24} = 0.6324$
$SS_E = 16.2896 - 0.6324 = 15.6572$
$MS_{Treatments} = \frac{SS_{Treatments}}{df(SS_{Treatments})} = \frac{0.6324}{4 - 1} = 0.2108$
$MS_E = \frac{SS_E}{df(SS_E)} = \frac{15.6572}{24 - 4} = 0.7829$
$F = \frac{MS_{Treatments}}{MS_E} = \frac{0.2108}{0.7829} = 0.2693$

Source of variation   Df   Sum of squares   Mean square   F
Treatment             3    0.6324           0.2108        0.2693
Error                 20   15.6572          0.7829
Total                 23   16.2896


2.3.5 Solving the problem in R


First, we create the data frame in R
> F1 <- c(5.5, 4.6, 5.8, 5.9, 6.0, 6.7, 7.2)
> F2 <- c(4.9, 5.1, 6.5, 5.4, 6.1)
> F3 <- c(4.6, 4.8, 5.8, 5.1, 6.2, 7.1)
> F4 <- c(4.5, 6.2, 4.8, 4.8, 6.5, 6.8)

> Data <- data.frame(Y = c(F1, F2, F3, F4),
+                    Locations = factor(rep(c("F1", "F2", "F3", "F4"),
+                        times = c(length(F1), length(F2), length(F3), length(F4)))))

Then, we compute the one-way ANOVA test

> oneway <- aov(Y ~ Locations, data = Data)
> anova(oneway)

Analysis of Variance Table

Response: Y
          Df  Sum Sq Mean Sq F value Pr(>F)
Locations  3  0.6324 0.21081  0.2693 0.8468
Residuals 20 15.6571 0.78286

Then, applying the function qf() (critical value of the F distribution), we get:
> f = qf(p = .02, df1 = 3, df2 = 20, lower.tail = FALSE)
> f
[1] 4.113404

2.3.6 Conclusion
Using the critical value of the F distribution: F = 0.2693 < F(0.02; 3, 20) = 4.1134, so we fail
to reject H0.

Therefore, we can conclude that there are no significant differences in the average river water
levels each day at these locations at significance level α = 2%.

2.4 Topic 8 - 4
2.4.1 Requirement
The following table shows the average cost, measured in 1000 VND, spent on each service in
each city.

           Services
           I    II   III
City I     61   52   69
City II    58   51   61
City III   68   64   79

At the significance level α = 5%, compare the costs for three types of services in three cities.


2.4.2 Solving method


A two-way ANOVA is used to estimate how the mean of a quantitative variable changes according
to the levels of two categorical variables. Using a two-way ANOVA helps us understand how two
independent variables, in combination, affect a dependent variable.

According to the data table, we have 2 factors, Cities and Services, each with 3 levels and
without replication, and a dependent variable Cost. Therefore, two-way ANOVA is chosen.

2.4.3 Theory base


Theory base in this exercise is similar to 2.2.3

2.4.4 Calculation in manual


We have two factors: Cities and Services

City I:    Service I = 61, Service II = 52, Service III = 69
City II:   Service I = 58, Service II = 51, Service III = 61
City III:  Service I = 68, Service II = 64, Service III = 79

Step 1: We calculate the average of each factor in the following table

           Services
           Service I   Service II   Service III   Average
City I     61          52           69            60.67
City II    58          51           61            56.67
City III   68          64           79            70.33
Average    62.33       55.67        69.67         62.56

There are two null hypotheses: one for the rows and the other for the columns. Let’s look first
at the rows:
H0 : There is no significant difference in cost between the cities
The null hypothesis for the columns is:
H0 : There is no significant difference in cost between the services

Step 2: Calculating the sum of squares

- We have the formula: $SS = \sum_{i=1}^{n} (y_i - \bar{y})^2$
Where:
SS = Sum of Squares
n = number of observations

y_i = value in a sample
ȳ = mean value of a sample

- Sum of squares for 1st factor: Cities

City I
Cost   City I Mean   Grand mean   (City I Mean - Grand mean)^2
61     60.67         62.56        3.57
52     60.67         62.56        3.57
69     60.67         62.56        3.57
Sum of Squares                    10.70
City II
Cost   City II Mean   Grand mean   (City II Mean - Grand mean)^2
58     56.67          62.56        34.68
51     56.67          62.56        34.68
61     56.67          62.56        34.68
Sum of Squares                     104.04
City III
Cost   City III Mean   Grand mean   (City III Mean - Grand mean)^2
68     70.33           62.56        60.49
64     70.33           62.56        60.49
79     70.33           62.56        60.49
Sum of Squares                      181.48
Sum of squares for 1st factor: Cities   296.22
- Sum of squares for 2nd factor: Services

Service I
Cost   Service I Mean   Grand mean   (Service I Mean - Grand mean)^2
61     62.33            62.56        0.049
58     62.33            62.56        0.049
68     62.33            62.56        0.049
Sum of Squares                       0.15
Service II
Cost   Service II Mean   Grand mean   (Service II Mean - Grand mean)^2
52     55.67             62.56        47.46
51     55.67             62.56        47.46
64     55.67             62.56        47.46
Sum of Squares                        142.37
Service III
Cost   Service III Mean   Grand mean   (Service III Mean - Grand mean)^2
69     69.67              62.56        50.57
61     69.67              62.56        50.57
79     69.67              62.56        50.57
Sum of Squares                         151.70


Sum of squares for 2nd factor: Services   294.22

- Total sum of squares
Cost   Grand Mean   (Cost - Grand Mean)^2
61     62.56        2.42
52     62.56        111.42
69     62.56        41.53
58     62.56        20.75
51     62.56        133.53
61     62.56        2.42
68     62.56        29.64
64     62.56        2.09
79     62.56        270.42
Total sum of squares   614.22
Error = 614.22 − 296.22 − 294.22 = 23.78
Step 3: Calculate the degree of freedom
- Formula: DF = n - 1
Where:
DF = Degrees of Freedom
n = The number of levels of the factor (rows or columns)
- Therefore
DF (Cities) = 3 - 1 = 2
DF (Services) = 3 - 1 = 2
DF (Error) = 2 x 2 = 4

Step 4: Calculate the mean square

- Formula: $MS = \frac{SS}{DF}$
Where:
SS = Sum of Squares
DF = Degrees of Freedom
MS = Mean Square
- Therefore
MS (Cities) = 296.22 / 2 = 148.11
MS (Services) = 294.22 / 2 = 147.11
MS (Error) = 23.78 / 4 = 5.945
Step 5: Calculate the F ratio
- We have formula: $F = \frac{MS}{MS(Error)}$
- Therefore
F (Cities) = 148.11 / 5.945 = 24.91
F (Services) = 147.11 / 5.945 = 24.74


Step 6: Display the result

                       Sum of squares   Df   Mean square   F-ratio
1st Factor: Cities     296.22           2    148.11        24.91
2nd Factor: Services   294.22           2    147.11        24.74
Error                  23.78            4    5.945
Total                  614.22           8    76.78
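
As a cross-check of the table, the sums of squares, F ratios and p-values can be recomputed from the cost matrix at full precision. The sketch below is base R only. The object names (cost, ms_error, p_cities and so on) are chosen for this illustration.

```r
# Average cost (1000 VND), rows = cities, columns = services (section 2.4.1)
cost <- matrix(c(61, 52, 69,
                 58, 51, 61,
                 68, 64, 79),
               nrow = 3, byrow = TRUE,
               dimnames = list(city = c("I", "II", "III"),
                               service = c("I", "II", "III")))

grand <- mean(cost)
ss_cities   <- ncol(cost) * sum((rowMeans(cost) - grand)^2)
ss_services <- nrow(cost) * sum((colMeans(cost) - grand)^2)
ss_total    <- sum((cost - grand)^2)
ss_error    <- ss_total - ss_cities - ss_services

df_factor <- 2                                    # 3 levels - 1, for both factors
df_error  <- (nrow(cost) - 1) * (ncol(cost) - 1)  # 2 x 2 = 4
ms_error  <- ss_error / df_error

f_cities   <- (ss_cities / df_factor) / ms_error
f_services <- (ss_services / df_factor) / ms_error

# Upper-tail p-values of the two F tests
p_cities   <- pf(f_cities, df_factor, df_error, lower.tail = FALSE)
p_services <- pf(f_services, df_factor, df_error, lower.tail = FALSE)

round(c(F_cities = f_cities, p_cities = p_cities,
        F_services = f_services, p_services = p_services), 5)
```

Small differences in the last digit between the hand calculation (24.91, 24.74) and the full-precision values (about 24.92, 24.75) come from rounding the mean squares before dividing.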

2.4.5 Solving the problem in R


After importing the data
> data
  cities services cost
1      I        I   61
2      I       II   52
3      I      III   69
4     II        I   58
5     II       II   51
6     II      III   61
7    III        I   68
8    III       II   64
9    III      III   79

We set Cities and Services as factors


> # as.factor()
> data$cities <- as.factor(data$cities)
> data$services <- as.factor(data$services)
> str(data)
Classes 'tbl_df', 'tbl' and 'data.frame': 9 obs. of 3 variables:
 $ cities  : Factor w/ 3 levels "I","II","III": 1 1 1 2 2 2 3 3 3
 $ services: Factor w/ 3 levels "I","II","III": 1 2 3 1 2 3 1 2 3
 $ cost    : num 61 52 69 58 51 61 68 64 79

We use two-way ANOVA to analyze the data


> anova1 <- aov(cost ~ cities + services, data = data)
> summary(anova1)
            Df Sum Sq Mean Sq F value  Pr(>F)
cities       2 296.22  148.11   24.92 0.00552 **
services     2 294.22  147.11   24.75 0.00559 **
Residuals    4  23.78    5.94
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA test cannot show which pairs of levels differ, so we use the Tukey test to investigate
the data with the command "TukeyHSD"
> TukeyHSD(anova1)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = cost ~ cities + services, data = data)

$cities
            diff        lwr       upr     p adj
II-I   -4.000000 -11.094901  3.094901 0.2253564
III-I   9.666667   2.571766 16.761568 0.0180671
III-II 13.666667   6.571766 20.761568 0.0052028


12
13 $services
14 diff lwr upr p adj
15 II - I -6.666667 -13.7615676 0.4282342 0.0604773
16 III - I 7.333333 0.2384324 14.4282342 0.0450964
17 III - II 14.000000 6.9050991 21.0949009 0.0047572
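
The width of these confidence intervals can be reproduced from the ANOVA table. The sketch below assumes the usual Tukey HSD formula for equal group sizes, half-width = q / sqrt(2) * sqrt(MSE * 2 / n), where q is the studentized range quantile. The variable names are illustrative only.

```r
# Values taken from the two-way ANOVA table.
ms_error    <- 23.78 / 4  # error mean square
df_error    <- 4          # residual degrees of freedom
n_levels    <- 3          # levels per factor
n_per_level <- 3          # observations behind each level mean

# Studentized range quantile for a 95% family-wise confidence level.
q_crit <- qtukey(0.95, nmeans = n_levels, df = df_error)

# Half-width shared by every pairwise interval of a factor.
half_width <- q_crit / sqrt(2) * sqrt(ms_error * 2 / n_per_level)

# Interval for the city difference III - I (mean difference 9.666667).
diff_III_I <- 9.666667
c(lwr = diff_III_I - half_width, upr = diff_III_I + half_width)
```

Every interval in the output has this same half-width because each level mean is based on three observations.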

We can draw a bar graph to show the relation between the three variables. We first fit the model
with the interaction term so that model.tables() prints the cell means. With one observation per
cell this model leaves no residual degrees of freedom, so no F values are reported.

> anova2 <- aov(cost ~ services + cities + services:cities, data = data)
> summary(anova2)
                Df Sum Sq Mean Sq
services         2 294.22  147.11
cities           2 296.22  148.11
services:cities  4  23.78    5.94

> model.tables(anova2)
Tables of means
Grand mean

62.55556

 services
services
    I    II   III
62.33 55.67 69.67

 cities
cities
    I    II   III
60.67 56.67 70.33

 services:cities
        cities
services I  II III
     I   61 58 68
     II  52 51 64
     III 69 61 79

> Input = "services I II III
+ I 61 58 68
+ II 52 51 64
+ III 69 61 79"

> # without header and row.names, every cell is read as text
> Means = as.matrix(read.table(textConnection(Input)))
> Means
     V1         V2   V3   V4
[1,] "services" "I"  "II" "III"
[2,] "I"        "61" "58" "68"
[3,] "II"       "52" "51" "64"
[4,] "III"      "69" "61" "79"

> # read the first line as column names and the first column as row names
> Means = as.matrix(read.table(textConnection(Input), header = TRUE, row.names = 1))
> barplot(Means, beside = TRUE, legend.text = TRUE, ylim = c(0, 80), xlab = "Cities", ylab = "Cost")

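
As an alternative to typing the cell means by hand for the bar graph, the matrix of means can be built directly from the data. This is only a sketch: the data frame is re-entered here so that the block is self-contained, and tapply() is used in place of the textConnection() approach shown in the report.

```r
# The nine observations from the problem, entered directly.
data <- data.frame(
  cities   = factor(rep(c("I", "II", "III"), each = 3)),
  services = factor(rep(c("I", "II", "III"), times = 3)),
  cost     = c(61, 52, 69, 58, 51, 61, 68, 64, 79)
)

# Rows are services and columns are cities, matching the table of means.
Means <- tapply(data$cost,
                list(services = data$services, cities = data$cities),
                mean)
Means

# Same bar graph as before: one group of bars per city, one bar per service.
barplot(Means, beside = TRUE, legend.text = TRUE,
        ylim = c(0, 80), xlab = "Cities", ylab = "Cost")
```

Building the matrix from the data avoids copy errors if the costs change.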

2.4.6 Conclusion
The manual calculation and the R code give the same result, up to rounding.

The F-ratio of Cities is 24.91 > F(2,4) = 6.94.

The F-ratio of Services is 24.74 > F(2,4) = 6.94 (the critical value at the 5% significance level).
Furthermore, the p-values of both factors are below 0.05.

We can therefore reject both null hypotheses and conclude that both services and cities
affect the average cost.

In addition, the TukeyHSD results and the bar graph show that, for both factors, the differences
III-I and III-II are significant while II-I is not. Based on the mean difference between the
groups, we can see that:

The costs of services I and II are not significantly different, while the cost of
service III is higher than that of the other two.

The result is similar for the cities:

The first and second cities have nearly the same cost, while the third city has the
highest cost of the three.
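
The critical value and the p-values quoted in this conclusion can be recomputed with R's F-distribution functions. This is a small sketch that uses the F values from the R output; the variable names are illustrative.

```r
alpha <- 0.05

# Critical value F(2, 4) at the 5% significance level.
f_crit <- qf(1 - alpha, df1 = 2, df2 = 4)

# F values reported by summary(anova1).
f_cities   <- 24.92
f_services <- 24.75

# Upper-tail probabilities, i.e. the p-values of the two factors.
p_cities   <- pf(f_cities, df1 = 2, df2 = 4, lower.tail = FALSE)
p_services <- pf(f_services, df1 = 2, df2 = 4, lower.tail = FALSE)

round(f_crit, 2)
round(c(cities = p_cities, services = p_services), 5)
```

Comparing the F-ratio with f_crit and comparing the p-value with alpha always lead to the same decision.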

3 Project 2
3.1 Import data: Diet.csv
The first command (setwd, where wd means working directory) tells R to use the folder
“c://diet” as the working directory, so files are read from and saved to that folder.

The second command reads the CSV file Diet.csv into an object (a data frame) named gh.

The third command (save) tells R to store the data in the gh object in a file named
“gh.rda”. After running these commands, a file named gh.rda will be present in that
directory. The last command prints gh.

> setwd("c://diet")
> gh <- read.csv("Diet.csv", header = TRUE)
> save(gh, file = "gh.rda")
> gh

The output:
   Person gender Age Height pre.weight Diet weight6weeks
 1     25     NA  41    171         60    2         60.0
 2     26     NA  32    174        103    2        103.0
 3      1      0  22    159         58    1         54.2
 4      2      0  46    192         60    1         54.0
 5      3      0  55    170         64    1         63.3
 6      4      0  33    171         64    1         61.1
 7      5      0  50    170         65    1         62.2
 8      6      0  50    201         66    1         64.0
 9      7      0  37    174         67    1         65.0
10      8      0  28    176         69    1         60.5
11      9      0  28    165         70    1         68.1
12     10      0  45    165         70    1         66.9
13     11      0  60    173         72    1         70.5
14     12      0  48    156         72    1         69.0
15     13      0  41    163         72    1         68.4
16     14      0  37    167         82    1         81.1
17     27      0  44    174         58    2         60.1
18     28      0  37    172         58    2         56.0
19     29      0  41    165         59    2         57.3
20     30      0  43    171         61    2         56.7
21     31      0  20    169         62    2         55.0
22     32      0  51    174         63    2         62.4
23     33      0  31    163         63    2         60.3
24     34      0  54    173         63    2         59.4
25     35      0  50    166         65    2         62.0
26     36      0  48    163         66    2         64.0
27     37      0  16    165         68    2         63.8
28     38      0  37    167         68    2         63.3
29     39      0  30    161         76    2         72.7
30     40      0  29    169         77    2         77.5
31     52      0  51    165         60    3         53.0
32     53      0  35    169         62    3         56.4
33     54      0  21    159         64    3         60.6
34     55      0  22    169         65    3         58.2
35     56      0  36    160         66    3         58.2
36     57      0  20    169         67    3         61.6
37     58      0  35    163         67    3         60.2
38     59      0  45    155         69    3         61.8
39     60      0  58    141         70    3         63.0
40     61      0  37    170         70    3         62.7
41     62      0  31    170         72    3         71.1
42     63      0  35    171         72    3         64.4
43     64      0  56    171         73    3         68.9
44     65      0  48    153         75    3         68.7
45     66      0  41    157         76    3         71.0
46     15      1  39    168         71    1         71.6
47     16      1  31    158         72    1         70.9
48     17      1  40    173         74    1         69.5
49     18      1  50    160         78    1         73.9
50     19      1  43    162         80    1         71.0
51     20      1  25    165         80    1         77.6
52     21      1  52    177         83    1         79.1
53     22      1  42    166         85    1         81.5
54     23      1  39    166         87    1         81.9
55     24      1  40    190         88    1         84.5
56     41      1  51    191         71    2         66.8
57     42      1  38    199         75    2         72.6
58     43      1  54    196         75    2         69.2
59     44      1  33    190         76    2         72.5
60     45      1  45    160         78    2         72.7
61     46      1  37    194         78    2         76.3
62     47      1  44    163         79    2         73.6
63     48      1  40    171         79    2         72.9
64     49      1  37    198         79    2         71.1
65     50      1  39    180         80    2         81.4
66     51      1  31    182         80    2         75.7
67     67      1  36    155         71    3         68.5
68     68      1  47    179         73    3         72.1
69     69      1  29    166         76    3         72.5
70     70      1  37    173         78    3         77.5
71     71      1  31    177         78    3         75.2
72     72      1  26    179         78    3         69.4
73     73      1  40    179         79    3         74.5
74     74      1  35    183         83    3         80.2
75     75      1  49    177         84    3         79.9
76     76      1  28    164         85    3         79.7
77     77      1  40    167         87    3         77.8
78     78      1  51    175         88    3         81.9

3.2 Cleaning data: NA


Rows 1 and 2 have a missing gender value (NA), so we remove them with the command:
> gh <- na.omit(gh)

The output:
> gh
   Person gender Age Height pre.weight Diet weight6weeks
 3      1      0  22    159         58    1         54.2
 4      2      0  46    192         60    1         54.0
 5      3      0  55    170         64    1         63.3
 6      4      0  33    171         64    1         61.1
 7      5      0  50    170         65    1         62.2
 8      6      0  50    201         66    1         64.0
 9      7      0  37    174         67    1         65.0
10      8      0  28    176         69    1         60.5
11      9      0  28    165         70    1         68.1
12     10      0  45    165         70    1         66.9
13     11      0  60    173         72    1         70.5
14     12      0  48    156         72    1         69.0
15     13      0  41    163         72    1         68.4
16     14      0  37    167         82    1         81.1
17     27      0  44    174         58    2         60.1
18     28      0  37    172         58    2         56.0
19     29      0  41    165         59    2         57.3
20     30      0  43    171         61    2         56.7
21     31      0  20    169         62    2         55.0
22     32      0  51    174         63    2         62.4
23     33      0  31    163         63    2         60.3
24     34      0  54    173         63    2         59.4
25     35      0  50    166         65    2         62.0
26     36      0  48    163         66    2         64.0
27     37      0  16    165         68    2         63.8
28     38      0  37    167         68    2         63.3
29     39      0  30    161         76    2         72.7
30     40      0  29    169         77    2         77.5
31     52      0  51    165         60    3         53.0
32     53      0  35    169         62    3         56.4
33     54      0  21    159         64    3         60.6
34     55      0  22    169         65    3         58.2
35     56      0  36    160         66    3         58.2
36     57      0  20    169         67    3         61.6
37     58      0  35    163         67    3         60.2
38     59      0  45    155         69    3         61.8
39     60      0  58    141         70    3         63.0
40     61      0  37    170         70    3         62.7
41     62      0  31    170         72    3         71.1
42     63      0  35    171         72    3         64.4
43     64      0  56    171         73    3         68.9
44     65      0  48    153         75    3         68.7
45     66      0  41    157         76    3         71.0
46     15      1  39    168         71    1         71.6
47     16      1  31    158         72    1         70.9
48     17      1  40    173         74    1         69.5
49     18      1  50    160         78    1         73.9
50     19      1  43    162         80    1         71.0
51     20      1  25    165         80    1         77.6
52     21      1  52    177         83    1         79.1
53     22      1  42    166         85    1         81.5
54     23      1  39    166         87    1         81.9
55     24      1  40    190         88    1         84.5
56     41      1  51    191         71    2         66.8
57     42      1  38    199         75    2         72.6
58     43      1  54    196         75    2         69.2
59     44      1  33    190         76    2         72.5
60     45      1  45    160         78    2         72.7
61     46      1  37    194         78    2         76.3
62     47      1  44    163         79    2         73.6
63     48      1  40    171         79    2         72.9
64     49      1  37    198         79    2         71.1
65     50      1  39    180         80    2         81.4
66     51      1  31    182         80    2         75.7
67     67      1  36    155         71    3         68.5
68     68      1  47    179         73    3         72.1
69     69      1  29    166         76    3         72.5
70     70      1  37    173         78    3         77.5
71     71      1  31    177         78    3         75.2
72     72      1  26    179         78    3         69.4
73     73      1  40    179         79    3         74.5
74     74      1  35    183         83    3         80.2
75     75      1  49    177         84    3         79.9
76     76      1  28    164         85    3         79.7
77     77      1  40    167         87    3         77.8
78     78      1  51    175         88    3         81.9
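
The effect of na.omit() can be seen on a small piece of the data. The sketch below re-enters the first four rows of the original listing as gh_head (a name chosen for this illustration) and shows which rows are dropped.

```r
# First four rows of the imported data, including the two rows with gender NA.
gh_head <- data.frame(
  Person       = c(25, 26, 1, 2),
  gender       = c(NA, NA, 0, 0),
  Age          = c(41, 32, 22, 46),
  Height       = c(171, 174, 159, 192),
  pre.weight   = c(60, 103, 58, 60),
  Diet         = c(2, 2, 1, 1),
  weight6weeks = c(60.0, 103.0, 54.2, 54.0)
)

# Number of missing values in each column.
colSums(is.na(gh_head))

# TRUE for rows without any missing value.
complete.cases(gh_head)

# na.omit() keeps only the complete rows and preserves their row names.
cleaned <- na.omit(gh_head)
cleaned
```

This is why the cleaned listing starts at row 3: the original row names are kept after the incomplete rows are removed.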


3.3 Data visualization


3.3.1 Descriptive statistics for each of the variables
The summary command gives the descriptive statistics of each variable; the values are explained
below the output.
> summary(gh)
     Person          gender            Age            Height
 Min.   : 1.00   Min.   :0.0000   Min.   :16.00   Min.   :141.0
 1st Qu.:19.75   1st Qu.:0.0000   1st Qu.:32.50   1st Qu.:163.8
 Median :40.50   Median :0.0000   Median :39.00   Median :169.0
 Mean   :39.87   Mean   :0.4342   Mean   :39.22   Mean   :170.8
 3rd Qu.:59.25   3rd Qu.:1.0000   3rd Qu.:47.25   3rd Qu.:175.2
 Max.   :78.00   Max.   :1.0000   Max.   :60.00   Max.   :201.0
   pre.weight         Diet        weight6weeks
 Min.   :58.00   Min.   :1.000   Min.   :53.00
 1st Qu.:66.00   1st Qu.:1.000   1st Qu.:61.95
 Median :72.00   Median :2.000   Median :68.95
 Mean   :72.29   Mean   :2.039   Mean   :68.34
 3rd Qu.:78.00   3rd Qu.:3.000   3rd Qu.:73.67
 Max.   :88.00   Max.   :3.000   Max.   :84.50
** 1st Qu.: first quartile (the 25% position)
** 3rd Qu.: third quartile (the 75% position)
** Mean: the average value
** Min.: the smallest value
** Max.: the largest value
** Median: the median

We define a helper function desc that returns the mean, the standard deviation (SD) and the
standard error (SE) of a variable:

> desc <- function(x)
+ { av <- mean(x, na.rm = TRUE)
+   s  <- sd(x, na.rm = TRUE)
+   n  <- sum(!is.na(x))     # count only the non-missing values
+   se <- s / sqrt(n)
+   c(MEAN = av, SD = s, SE = se) }

The output:
> desc(gh$Person)
     MEAN        SD        SE
39.868421 22.843434  2.620322
> desc(gh$gender)
      MEAN         SD         SE
0.43421053 0.49894626 0.05723306
> desc(gh$Age)
     MEAN        SD        SE
39.223684  9.908379  1.136569
> desc(gh$Height)
      MEAN         SD         SE
170.776316  11.419981   1.309962
> desc(gh$pre.weight)
      MEAN         SD         SE
72.2894737  7.9746528  0.9147554
> desc(gh$Diet)
      MEAN         SD         SE
2.03947368 0.82366319 0.09448065
> desc(gh$weight6weeks)
      MEAN         SD         SE
68.3434211  8.0589385  0.9244236
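
The SE column follows from the SD column through SE = SD / sqrt(n). As a quick check, this sketch recomputes two of the standard errors from the printed standard deviations, assuming n = 76 rows after cleaning (78 rows minus the 2 rows with NA).

```r
n <- 76  # rows left after na.omit(): 78 - 2

# Standard deviations copied from the desc() output.
sd_person <- 22.843434
sd_weight <- 8.0589385

# Standard error = standard deviation / sqrt(sample size)
se_person <- sd_person / sqrt(n)
se_weight <- sd_weight / sqrt(n)

round(c(Person = se_person, weight6weeks = se_weight), 6)
```

Both values agree with the SE column above to the printed precision.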

3.3.2 Boxplot

> boxplot(gh$Person, main = "PERSON", ylab = "people")

The output: [Figure: boxplot of Person]

> boxplot(gh$Age, main = "AGE", ylab = "years")

The output: [Figure: boxplot of Age, in years]

> boxplot(gh$Height, main = "HEIGHT", ylab = "cm")

The output: [Figure: boxplot of Height, in cm]

> boxplot(gh$pre.weight, gh$weight6weeks, main = "PRE & AFTER", ylab = "kg")

The output: [Figure: side-by-side boxplots of pre.weight and weight6weeks, in kg]

3.4 t.test: between pre.weight and weight6weeks


Save pre.weight, weight6weeks and Diet as pre, after and diet, respectively:
> pre <- gh$pre.weight
> after <- gh$weight6weeks
> diet <- gh$Diet

Use the command t.test with paired = TRUE to compare pre.weight and weight6weeks:

> t.test(pre, after, paired = TRUE)

The output:
        Paired t-test

data:  pre and after
t = 13.728, df = 75, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 3.373452 4.518653
sample estimates:
mean of the differences
               3.946053

CONCLUSION

Because the p-value is below 0.05, we reject the null hypothesis that the mean of pre.weight is
equal to the mean of weight6weeks. The average difference between them (pre minus after) is 3.95 kg.
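
A paired t-test is the same as a one-sample t-test on the differences. The sketch below illustrates this on the first six rows of the cleaned data only, so its numbers differ from the full-data output above. The names paired_fit, one_sample_fit and t_manual are illustrative.

```r
# First six rows of the cleaned data (pre.weight and weight6weeks).
pre   <- c(58, 60, 64, 64, 65, 66)
after <- c(54.2, 54.0, 63.3, 61.1, 62.2, 64.0)

# Paired test, and the equivalent one-sample test on the differences.
paired_fit     <- t.test(pre, after, paired = TRUE)
one_sample_fit <- t.test(pre - after, mu = 0)

# The same statistic by hand: mean difference / standard error of the differences.
d <- pre - after
t_manual <- mean(d) / (sd(d) / sqrt(length(d)))

c(paired     = unname(paired_fit$statistic),
  one_sample = unname(one_sample_fit$statistic),
  manual     = t_manual)
```

The three values are identical, which is why the output reports a single "mean of the differences" with df = n - 1.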

3.5 One way ANOVA: What is the best diet for weight loss?
We name the weight loss loss, where loss = pre - after:
> loss <- pre - after

To analyze variance, we need to define diet as a factor (so that R treats it as three groups
rather than as a number) and then put diet and loss into the same data frame called data.
> diet <- as.factor(diet)
> data <- data.frame(diet, loss)
> attach(data)
The following objects are masked _by_ .GlobalEnv:

    diet, loss
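
Whether diet is a factor changes the model that aov() fits. The following sketch uses hypothetical toy data (three diets, three people each, invented numbers) to show the difference in degrees of freedom.

```r
# Hypothetical toy data, not taken from Diet.csv.
diet <- rep(c(1, 2, 3), each = 3)
loss <- c(3.1, 2.4, 3.8, 2.9, 3.3, 2.2, 5.0, 5.6, 4.7)

# Numeric diet: R fits a single slope, so the term has 1 degree of freedom.
fit_numeric <- aov(loss ~ diet)

# Factor diet: R compares three group means, so the term has 2 degrees of freedom.
fit_factor <- aov(loss ~ as.factor(diet))

df_numeric <- summary(fit_numeric)[[1]]$Df[1]
df_factor  <- summary(fit_factor)[[1]]$Df[1]
c(numeric = df_numeric, factor = df_factor)
```

A diet term with 2 degrees of freedom in an ANOVA table therefore indicates that diet was treated as a factor.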

Compute the one-way ANOVA test:

> # loss and diet are not columns of gh, so R takes them from the global environment
> oneway <- aov(loss ~ diet, data = gh)
> anova(oneway)
Analysis of Variance Table

Response: loss
          Df Sum Sq Mean Sq F value   Pr(>F)
diet       2  60.53 30.2635  5.3831 0.006596 **
Residuals 73 410.40  5.6219
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
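
The F value and the p-value in this table can be recomputed from the two mean squares. This is a small sketch with illustrative variable names.

```r
# Mean squares and degrees of freedom from the one-way ANOVA table.
ms_diet  <- 30.2635
ms_resid <- 5.6219
df_diet  <- 2
df_resid <- 73

# F value = between-group mean square / residual mean square
f_value <- ms_diet / ms_resid

# p-value = probability of an F at least this large under the null hypothesis
p_value <- pf(f_value, df1 = df_diet, df2 = df_resid, lower.tail = FALSE)

round(c(F = f_value, p = p_value), 6)
```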

There was therefore a statistically significant effect of Diet on weight loss (p = 0.006596 <
0.05). To find out which diets differ, we then use TukeyHSD(oneway):

> TukeyHSD(oneway)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = loss ~ diet, data = gh)

$diet
         diff        lwr      upr     p adj
2-1 -0.032000 -1.6530850 1.589085 0.9987711
3-1  1.848148  0.2567422 3.439554 0.0188047
3-2  1.880148  0.3056826 3.454614 0.0152020

CONCLUSION

Diet 3 was significantly different from diet 1 (p = 0.0188) and diet 2 (p = 0.0152). People on
diet 3 lost on average 1.85 kg more than those on diet 1 and 1.88 kg more than those on diet 2,
so the best diet for weight loss is diet 3.

3.6 Two way ANOVA: How do Diet and gender affect weight loss?
To carry out a two way ANOVA with an interaction, use:
aov(dependent ~ as.factor(independent1) * as.factor(independent2), data = filename)

We give the ANOVA model a name, anova2:

> anova2 <- aov(loss ~ as.factor(gender) * diet, data = gh)

To view the ANOVA table, use the anova() command:

> anova(anova2)
Analysis of Variance Table

Response: loss
                       Df Sum Sq Mean Sq F value   Pr(>F)
as.factor(gender)       1   0.28  0.2785  0.0518 0.820623
diet                    2  60.42 30.2086  5.6190 0.005456 **
as.factor(gender):diet  2  33.90 16.9520  3.1532 0.048842 *
Residuals              70 376.33  5.3761
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

CONCLUSION

There was a statistically significant interaction between the effects of Diet and Gender on
weight loss [F(2, 70)=3.153, p = 0.049]. The ‘Diet’ effect cannot be generalised for both males
and females because the interaction effect is substantial (p =0.049).
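
The * operator in the model formula expands to both main effects plus their interaction, and the interaction F test can be recomputed from the table. The sketch below uses placeholder variable names in the formulas; only the formula structure matters, so no data are needed.

```r
# a * b is shorthand for a + b + a:b (placeholder names gender and diet).
full     <- loss ~ gender * diet
expanded <- loss ~ gender + diet + gender:diet

labels_full     <- attr(terms(full), "term.labels")
labels_expanded <- attr(terms(expanded), "term.labels")
labels_full

# Interaction F value and p-value from the mean squares in the ANOVA table.
f_inter <- 16.9520 / 5.3761
p_inter <- pf(f_inter, df1 = 2, df2 = 70, lower.tail = FALSE)
round(c(F = f_inter, p = p_inter), 6)
```

The p-value is only just below 0.05, so the interaction is significant at the 5% level by a small margin.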

The TukeyHSD(anova2) command will produce post hoc tests for the main effects and interactions.
Only interpret post hoc tests for the significant factors from the ANOVA. If the interaction
is NOT significant, interpret the post hoc tests for significant main effects, but if it is
significant, only interpret the interactions post hoc tests.

Post hoc tests for the main effects of diet and gender and for their interaction:

> TukeyHSD(anova2)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = loss ~ as.factor(gender) * diet, data = gh)

$`as.factor(gender)`
         diff        lwr      upr     p adj
1-0 0.1221283 -0.9480861 1.192343 0.8206233

$diet
           diff        lwr      upr     p adj
2-1 -0.03484966 -1.6215073 1.551808 0.9984761
3-1  1.84475570  0.2871469 3.402365 0.0162482
3-2  1.87960536  0.3385771 3.420634 0.0128844

$`as.factor(gender):diet`
              diff        lwr       upr     p adj
1:1-0:1  0.6000000 -2.2129628 3.4129628 0.9887997
0:2-0:1 -0.4428571 -3.0107291 2.1250148 0.9958151
1:2-0:1  1.0590909 -1.6782698 3.7964516 0.8656520
0:3-0:1  2.8300000  0.3052886 5.3547114 0.0191170
1:3-0:1  1.1833333 -1.4893925 3.8560592 0.7855223
0:2-1:1 -1.0428571 -3.8558199 1.7701056 0.8852416
1:2-1:1  0.4590909 -2.5093998 3.4275816 0.9975014
0:3-1:1  2.2300000 -0.5436187 5.0036187 0.1863470
1:3-1:1  0.5833333 -2.3256625 3.4923292 0.9915569
1:2-0:2  1.5019481 -1.2354126 4.2393087 0.5963201
0:3-0:2  3.2728571  0.7481458 5.7975685 0.0040103
1:3-0:2  1.6261905 -1.0465354 4.2989163 0.4833188
0:3-1:2  1.7709091 -0.9260048 4.4678230 0.3965102
1:3-1:2  0.1242424 -2.7117126 2.9601974 0.9999949
1:3-0:3 -1.6466667 -4.2779524 0.9846191 0.4513580

The following output for post hoc interactions tests has been adjusted in Excel to make it easier
to read.

The interactions post hoc tests compare each pair of combinations. This shows that the only
significant differences are for females and are between diets 1 and 3 (p=0.0191) and diets 2 and
3 (p=0.004). Women on diet 3 lose on average 2.83kg more than those on diet 1 and 3.27kg more
than those on diet 2.
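
The significant pairs can also be picked out programmatically instead of by eye. This sketch copies the adjusted p-values of the interaction comparisons into a named vector; the labels have the form gender:diet, with gender 0 being female according to the interpretation above.

```r
# Adjusted p-values of the 15 interaction comparisons (labels are gender:diet).
p_adj <- c("1:1-0:1" = 0.9887997, "0:2-0:1" = 0.9958151, "1:2-0:1" = 0.8656520,
           "0:3-0:1" = 0.0191170, "1:3-0:1" = 0.7855223, "0:2-1:1" = 0.8852416,
           "1:2-1:1" = 0.9975014, "0:3-1:1" = 0.1863470, "1:3-1:1" = 0.9915569,
           "1:2-0:2" = 0.5963201, "0:3-0:2" = 0.0040103, "1:3-0:2" = 0.4833188,
           "0:3-1:2" = 0.3965102, "1:3-1:2" = 0.9999949, "1:3-0:3" = 0.4513580)

# Keep the comparisons that are significant at the 5% level.
significant <- names(p_adj)[p_adj < 0.05]
significant
```

Only the two comparisons within gender 0 (diet 3 against diet 1, and diet 3 against diet 2) pass the 5% threshold.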


CONCLUSION

A two way ANOVA was carried out on weight lost by diet type and gender. There was a
statistically significant interaction between the effects of Diet and Gender on weight loss [F(2,
70)=3.153, p = 0.049]. Tukey’s HSD post hoc tests were carried out. For females, diet 3 was
significantly different to diet 1 (p = 0.0191) and diet 2 (p = 0.004) but there is no evidence to
suggest that any diets differed for males. Women on diet 3 lost on average 2.83kg more than
those on diet 1 and 3.27kg more than those on diet 2.

