PS Assignments

The document describes a statistics assignment involving analyzing linear regression models. It includes: 1. A requirement to fit a linear regression model to data on tree diameters and heights, and check assumptions. 2. An overview of the solving method, including manually or using R, and checking the assumptions of linearity, independence of errors, normality of errors, and homoscedasticity. 3. A background section on linear regression models and the four assumptions that are checked.

Uploaded by

phuong.cao1310


VIETNAM NATIONAL UNIVERSITY, HO CHI MINH CITY

UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING

CC04

Assignment

Probabilities and Statistics

Advisor: Nguyễn Tiến Dũng

Students: Nguyễn Đình Sáng (CC04) - 1952955
Vũ Minh Long (CC04) - 1950024
Trần Nhật Tân (CC04) - 1852732
Đinh Hoàng Anh (CC04) - 1952553
Nguyễn Bình Minh (CC04) - 1952846

HO CHI MINH CITY, MAY 2021

University of Technology, Ho Chi Minh City
Faculty of Computer Science and Engineering

Contents
1 Member list & Workload

2 Project 1
2.1 Topic 8-1
2.1.1 Requirement
2.1.2 Solving method
2.1.3 Theory base
2.1.4 Calculation in manual
2.1.5 Solving the problem in R
2.1.6 Conclusion
2.2 Topic 8-2
2.2.1 Requirement
2.2.2 Solving method
2.2.3 Theory base
2.2.4 Calculate in manual
2.2.5 Display the result
2.2.6 Solving the problem in R
2.2.7 Conclusion
2.3 Topic 8-3
2.3.1 Requirement
2.3.2 Solving method
2.3.3 Theory base
2.3.4 Calculation in manual
2.3.5 Solving the problem in R
2.3.6 Conclusion
2.4 Topic 8-4
2.4.1 Requirement
2.4.2 Solving method
2.4.3 Theory base
2.4.4 Calculation in manual
2.4.5 Solving the problem in R
2.4.6 Conclusion

3 Project 2
3.1 Import data: Diet.csv
3.2 Cleaning data: NA
3.3 Data visualization
3.3.1 Descriptive statistics for each of the variables
3.3.2 Boxplot
3.4 t.test: between pre.weight and weight6weeks
3.5 One way ANOVA: What is the best diet for weight loss?
3.6 Two way ANOVA: How do Diet and gender affect weightLOST?

4 References

Assignment for Probabilities and Statistics - Academic year 2020 - 2021

1 Member list & Workload

No.   Full name            Student ID   Problems          Percentage of work
1     Nguyễn Đình Sáng     1952955      P1.8.4 & P2.2     100%
2     Vũ Minh Long         1950024      P1.8.2 & P2.3     100%
3     Đinh Hoàng Anh       1952553      P1.8.3 & P2.4     100%
4     Trần Nhật Tân        1852732      P1.8.1 & P2.5     100%
5     Nguyễn Bình Minh     1952846      P1.8.1 & P2.6     100%

2 Project 1
2.1 Topic 8 - 1
2.1.1 Requirement

The following table shows the diameter X and the height Y of 20 trees:

X   2.3  2.5  2.6  3.1  3.4  3.7  7.3  3.9  4.0  4.1  4.1  4.2  4.4  4.7  5.1  5.5  5.8  6.2  6.9  6.9
Y   7    8    4    4    6    6    14   12   8    5    7    8    7    9    10   13   7    11   11   16

Fit the linear regression model and check all assumptions.

2.1.2 Solving method

a) Fit the linear regression

- We can find the linear regression model manually or by using R

b) Check all assumptions

There are four assumptions we should check:
- Linearity: checked with a scatter plot of Y against X.
- Independence of errors: checked with a plot of residuals versus fitted values or with the Durbin-Watson test.
- Normality of errors: checked with a Q-Q plot or the Shapiro-Wilk test.
- Homoscedasticity (constant variance of the errors): checked with a non-constant variance test.


2.1.3 Theory base

a) Linear regression
Linear regression attempts to model the relationship between two variables by fitting a
linear equation to observed data. One variable is considered to be an explanatory variable, and
the other is considered to be a dependent variable. For example, a modeler might want to relate
the weights of individuals to their heights using a linear regression model.
A linear regression line has an equation of the form Y = a + bX, where X is the explanatory
variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the
value of Y when X = 0).
b) Four assumptions
- Linearity: The response variable Y should be linearly related to the explanatory variable X.
- Independence of errors: After fitting the model, the residual errors should be independent and identically distributed random variables. If the residual errors are not independent, they will likely show some sort of pattern (which is not always obvious to the naked eye). Such a pattern carries information that the regression model was not able to capture, which makes the model sub-optimal.
- Normality of errors: If the residual errors are not $N(0, \sigma^2)$ distributed, then statistical tests of significance that rely on the errors having an $N(0, \sigma^2)$ distribution are no longer valid.
- Homoscedasticity (constant variance): Besides being normally distributed with mean zero, the residual errors should have a constant variance $\sigma^2$. In particular, $\sigma^2$ should not be a function of the response variable Y, and thereby indirectly of the explanatory variable X.

2.1.4 Calculation in manual

- Linear regression model: Y = a + bX

$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = 42.6055$$
$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = 71.445$$
$$S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = 208.55$$

- Therefore
$$b = \frac{S_{xy}}{S_{xx}} = 1.676896175$$
$$a = \bar{y} - b\bar{x} = 1.045275845$$

- So the linear regression model is: y = 1.045 + 1.677x
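As a cross-check, the sums above can be reproduced with a few lines of base R. This is a minimal sketch, assuming the diameter and height vectors as they are entered in the R session of section 2.1.5 (with 4.7 as the 14th diameter); the names Sxx, Sxy, Syy, a and b simply mirror the notation used in this section.

```r
# Tree data: x = diameter, y = height (values as entered in the R session)
x <- c(2.3, 2.5, 2.6, 3.1, 3.4, 3.7, 7.3, 3.9, 4, 4.1, 4.1, 4.2, 4.4, 4.7, 5.1,
       5.5, 5.8, 6.2, 6.9, 6.9)
y <- c(7, 8, 4, 4, 6, 6, 14, 12, 8, 5, 7, 8, 7, 9, 10, 13, 7, 11, 11, 16)

# Corrected sums of squares and cross-products
Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Syy <- sum((y - mean(y))^2)

# Least-squares slope and intercept
b <- Sxy / Sxx
a <- mean(y) - b * mean(x)

round(c(Sxx = Sxx, Sxy = Sxy, Syy = Syy, a = a, b = b), 4)
```

The rounded coefficients should agree with the fitted line y = 1.045 + 1.677x.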


2.1.5 Solving the problem in R

a) Fit the linear regression

- First, we assign the diameters to the variable x and the heights to the variable y:

> x <- c(2.3, 2.5, 2.6, 3.1, 3.4, 3.7, 7.3, 3.9, 4, 4.1, 4.1, 4.2, 4.4, 4.7, 5.1,
+        5.5, 5.8, 6.2, 6.9, 6.9)
> y <- c(7, 8, 4, 4, 6, 6, 14, 12, 8, 5, 7, 8, 7, 9, 10, 13, 7, 11, 11, 16)

- Second, we fit the linear regression model of y on x:

> relation <- lm(y ~ x)
> print(relation)

- We get:
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept)            x
      1.045        1.677
So the linear regression model is: y = 1.045 + 1.677x
We plot the data and add the fitted line:
> plot(x, y, col = "blue", main = "Diameter and Height regression",
+      cex = 1.3, pch = 16, xlab = "Diameter", ylab = "Height")
> abline(relation)  # draw the fitted regression line on the scatter plot


b) Check all assumptions

- Linearity: The scatter plot shows that the relationship between diameter and height is
approximately linear.

- Independence of errors: Plot of residuals versus fitted values:

> plot(relation, 1)

The plot of residuals versus fitted values shows no clear pattern (the correlation between them
is approximately 0), so this assumption appears to be met.


We also use the Durbin-Watson test (function durbinWatsonTest from the car package):

> library(car)
> durbinWatsonTest(relation)

We get:
 lag Autocorrelation D-W Statistic p-value
   1     0.009385456      1.802587     0.5
 Alternative hypothesis: rho != 0
As the p-value is larger than 0.05, we cannot reject the null hypothesis that the errors are not autocorrelated.
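The Durbin-Watson statistic itself is easy to compute from the residuals: it is the sum of squared successive differences of the residuals divided by the sum of squared residuals. The sketch below does this in base R, assuming the x and y vectors from the R session above. It only illustrates the statistic and does not reproduce the p-value reported by the car package.

```r
x <- c(2.3, 2.5, 2.6, 3.1, 3.4, 3.7, 7.3, 3.9, 4, 4.1, 4.1, 4.2, 4.4, 4.7, 5.1,
       5.5, 5.8, 6.2, 6.9, 6.9)
y <- c(7, 8, 4, 4, 6, 6, 14, 12, 8, 5, 7, 8, 7, 9, 10, 13, 7, 11, 11, 16)
relation <- lm(y ~ x)

e <- residuals(relation)          # residuals in observation order
d <- sum(diff(e)^2) / sum(e^2)    # Durbin-Watson statistic, always between 0 and 4
d
```

Values of d close to 2 indicate little first-order autocorrelation, values near 0 indicate positive autocorrelation, and values near 4 indicate negative autocorrelation.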

- Normality of errors: We check the normality of errors using a Q-Q plot:
> plot(relation, 2)

The Q-Q plot shows that the standardized residuals are approximately normally distributed.


Shapiro-Wilk test (applied to the studentized residuals):
> library(MASS)
> shapiro <- studres(relation)
> shapiro.test(shapiro)

We get the result:

Shapiro-Wilk normality test

data: shapiro
W = 0.96613, p-value = 0.6719
The p-value is larger than 0.05, so we cannot reject the hypothesis that the errors are normally distributed.
- Equal variance (Homoscedasticity):
Non-constant variance test (function ncvTest from the car package):
Let H0 denote the hypothesis that the variance of the errors is constant
H1 denote the hypothesis that the variance of the errors is not constant

> ncvTest(relation)

We get:
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 0.01590465, Df = 1, p = 0.89964
The p-value is larger than 0.05, so we cannot reject the null hypothesis. There is no evidence
that the variance of the errors is non-constant.
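For readers without the car package, a score test of the same kind can be sketched in base R. The version below follows the usual Breusch-Pagan / Cook-Weisberg construction (regress the scaled squared residuals on the fitted values and halve the regression sum of squares). It is an illustration under that assumption, not a copy of the ncvTest source, and it reuses the x and y vectors from the R session above.

```r
x <- c(2.3, 2.5, 2.6, 3.1, 3.4, 3.7, 7.3, 3.9, 4, 4.1, 4.1, 4.2, 4.4, 4.7, 5.1,
       5.5, 5.8, 6.2, 6.9, 6.9)
y <- c(7, 8, 4, 4, 6, 6, 14, 12, 8, 5, 7, 8, 7, 9, 10, 13, 7, 11, 11, 16)
relation <- lm(y ~ x)

e <- residuals(relation)
n <- length(e)
u <- e^2 / (sum(e^2) / n)                  # squared residuals scaled to have mean 1
aux <- lm(u ~ fitted(relation))            # auxiliary regression on the fitted values
reg_ss <- sum((fitted(aux) - mean(u))^2)   # regression sum of squares of the auxiliary model
chisq <- reg_ss / 2                        # score statistic with 1 degree of freedom
p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)

c(Chisquare = chisq, p = p_value)
```

A large p-value here leads to the same reading as above: the data give no reason to doubt constant variance.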

2.1.6 Conclusion

After checking all assumptions, we can conclude that the diameter and the height of those
trees are linearly related (y = 1.045 + 1.677x), and all four assumptions are met.

2.2 Topic 8 - 2
2.2.1 Requirement
A company wants to target 3 overseas markets. To compare the potential market share of the
company and that of some competitors in the markets, their products were sold to potential
customers. The following results were obtained.

                        Markets
                    A     B     C
The company         55    38    24
Competitor 1        28    30    21
Competitor 2        20    18    31
Other competitors   47    64    74

Draw a conclusion at the significance level of 5%.

2.2.2 Solving method

- Two-way ANOVA: to test whether the number of products sold depends on the company and on the market
Set up the hypotheses
H0a : The mean number of products sold does not differ between companies.
H0b : The mean number of products sold does not differ between markets.
H1a : The mean number of products sold differs between at least two companies.
H1b : The mean number of products sold differs between at least two markets.
- Tukey HSD: to find which pairs of companies differ significantly

2.2.3 Theory base

ANOVA tests for significance using the F-test for statistical significance. The F-test is a
groupwise comparison test, which means it compares the variance in each group mean to the
overall variance in the dependent variable.
If the variance within groups is smaller than the variance between groups, the F-test will
find a higher F-value, and therefore a higher likelihood that the difference observed is real and
not due to chance.


A two-way ANOVA with interaction tests three null hypotheses at the same time:

• There is no difference in group means at any level of the first independent variable.
• There is no difference in group means at any level of the second independent variable.
• The effect of one independent variable does not depend on the effect of the other independent variable (a.k.a. no interaction effect).

A two-way ANOVA without interaction (a.k.a. an additive two-way ANOVA) only tests the
first two of these hypotheses.

Adj MS

The calculations for the mean square for the factors, interaction, and error follow:

$$MS(A) = \frac{SS(A)}{df(A)} \qquad MS(B) = \frac{SS(B)}{df(B)}$$
$$MS(AB) = \frac{SS(AB)}{df(AB)} \qquad MS(Error) = \frac{SS(Error)}{df(Error)}$$

Notation

MS: Mean square
SS: Sum of squares
df: Degrees of freedom
Adj SS

The sum of squared distances. SS Total is the total variation in the data. SS (A) and SS (B)
are the amount of variation of the estimated factor level mean around the overall mean. These
statistics are also known as the sum of squares for factor A or factor B. SS Error is the amount
of variation of the observations from their fitted values. The calculations follow:

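Source of variation         Df               Sum of squares   Mean square                      F
Factor A (between groups)   a − 1            SS_A             MS_A = SS_A / (a − 1)            MS_A / MS_E
Factor B (between groups)   b − 1            SS_B             MS_B = SS_B / (b − 1)            MS_B / MS_E
Error (within groups)       (a − 1)(b − 1)   SS_E             MS_E = SS_E / ((a − 1)(b − 1))
Total                       N − 1            SS_T

where
$$SS_A = \sum_{i=1}^{a} n_i (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2 \qquad SS_B = \sum_{j=1}^{b} n_j (\bar{y}_{\cdot j} - \bar{y}_{\cdot\cdot})^2$$
$$SS_T = \sum_{i=1}^{a} \sum_{j=1}^{b} (y_{ij} - \bar{y}_{\cdot\cdot})^2 \qquad SS_E = SS_T - SS_A - SS_B$$
Here $n_i$ and $n_j$ are the numbers of observations at level i of factor A and at level j of factor B.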

P-value – Analysis of variance table


The degrees of freedom for the F statistic that you use to calculate the p-value depend on
the term that is in the test.

When you test a term, the denominator degrees of freedom are always the degrees of freedom
for error. The degrees of freedom for error depend on whether the interaction term is in the
model or not.

• With no interaction in the model, DF Error = (n − 1) − (a − 1) − (b − 1)

• With the interaction in the model, DF Error = (n−1)−(a−1)−(b−1)−(a−1)(b−1)
When you test a term, the numerator degrees of freedom depend on the term.

• For F (A), the degrees of freedom for the numerator are a − 1

• For F (B), the degrees of freedom for the numerator are b − 1
• For F (AB), the degrees of freedom for the numerator are (a − 1)(b − 1)
For the lack-of-fit test, the degrees of freedom follow:
• Denominator DF = n − c
• Numerator DF = c − p
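As a quick illustration of these rules, the sketch below evaluates the degrees of freedom for a design with a = 4 and b = 3 levels and one observation per cell, which is the layout used in the manual calculation of this topic. The helper name df_two_way is hypothetical and only serves this example.

```r
# Degrees of freedom for a two-way ANOVA with n observations in total,
# a levels of factor A and b levels of factor B
df_two_way <- function(n, a, b, interaction = FALSE) {
  df_a  <- a - 1
  df_b  <- b - 1
  df_ab <- if (interaction) (a - 1) * (b - 1) else 0
  df_error <- (n - 1) - df_a - df_b - df_ab
  c(A = df_a, B = df_b, AB = df_ab, Error = df_error)
}

df_two_way(n = 12, a = 4, b = 3)                      # additive model
df_two_way(n = 12, a = 4, b = 3, interaction = TRUE)  # model with interaction
```

With a single observation per cell, the interaction model leaves no degrees of freedom for error, which is why an additive model is the natural choice for such data.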

2.2.4 Calculate in manual

- We have 2 factors: Companies and Markets

a, Calculate the mean of each factor

                        Markets
                    A      B      C      Average
The company         55     38     24     39
Competitor 1        28     30     21     26.33
Competitor 2        20     18     31     23
Other competitors   47     64     74     61.67
Average             37.5   37.5   37.5
b, Calculate the Sum of Squares
- We have the formula: $SS = \sum_{i=1}^{n} (y_i - \bar{y})^2$
Where:
SS = Sum of Squares
n = number of observations
$y_i$ = value in a sample
$\bar{y}$ = mean value of a sample
- Therefore:
*Companies

The company
Products   The company mean   Grand mean   (The company mean − grand mean)²
55         39                 37.5         2.25
38         39                 37.5         2.25
24         39                 37.5         2.25
Sum of Squares                             6.75

Competitor 1
Products   Competitor 1 mean   Grand mean   (Competitor 1 mean − grand mean)²
28         26.33               37.5         124.77
30         26.33               37.5         124.77
21         26.33               37.5         124.77
Sum of Squares                              374.31

Competitor 2
Products   Competitor 2 mean   Grand mean   (Competitor 2 mean − grand mean)²
20         23                  37.5         210.25
18         23                  37.5         210.25
31         23                  37.5         210.25
Sum of Squares                              630.75

Other competitors
Products   Other competitors mean   Grand mean   (Other competitors mean − grand mean)²
47         61.67                    37.5         584.19
64         61.67                    37.5         584.19
74         61.67                    37.5         584.19
Sum of Squares                                   1752.57

Total sum of squares of companies: 2764.38

*Markets
- The mean of each market is the same and is equal to the grand mean (37.5)
=> The sum of squares of markets is equal to 0
*Total
- Formula: $SS_T = \sum_{i=1}^{a} \sum_{j=1}^{b} y_{ij}^2 - \frac{y_t^2}{ab}$
Where
$SS_T$ = Total Sum of Squares
a, b = The number of rows and columns
$y_t$ = Sum of all observations

- Therefore
$$SS_T = (55^2 + 38^2 + \dots + 74^2) - \frac{450^2}{12} = 3761$$
*Error
- Formula: $SS_E = SS_T - SS(Companies) - SS(Markets)$
- Therefore
$$SS_E = 3761 - 2764.38 - 0 = 996.62$$

c, Calculate the Degrees of Freedom

- Formula: DF = n − 1 for each factor, and DF (Error) = DF (Companies) × DF (Markets)
Where:
DF = Degrees of Freedom
n = The number of levels of the factor (rows or columns)
- Therefore
DF (Companies) = 4 − 1 = 3
DF (Markets) = 3 − 1 = 2
DF (Error) = 3 × 2 = 6

d, Calculate the mean square

- Formula: $MS = \frac{SS}{DF}$
Where:
SS = Sum of Squares
DF = Degrees of Freedom
MS = Mean Square
- Therefore
$$MS(Companies) = \frac{2764.38}{3} = 921.46$$
MS (Markets) = 0
$$MS(Error) = \frac{996.62}{6} = 166.1$$
e, Calculate the F ratio
- We have formula: $F = \frac{MS}{MS(Error)}$
- Therefore
$$F(Companies) = \frac{921.46}{166.1} = 5.55$$
F (Markets) = 0


2.2.5 Display the result

             SS        DF   MS       F
Companies    2764.38   3    921.46   5.55
Markets      0         2    0        0
Error        996.62    6    166.1
Total        3761
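The manual steps can be mirrored in base R. The sketch below is an illustration that assumes the 4 × 3 layout of the requirement table (rows are companies, columns are markets); the variable names are chosen for this example. Because it works with unrounded means, its sums of squares differ slightly from the hand-rounded values (about 2763.67 rather than 2764.38 for companies).

```r
sales <- matrix(c(55, 38, 24,
                  28, 30, 21,
                  20, 18, 31,
                  47, 64, 74),
                nrow = 4, byrow = TRUE,
                dimnames = list(c("The company", "Competitor 1",
                                  "Competitor 2", "Other competitors"),
                                c("A", "B", "C")))
a <- nrow(sales)       # number of companies
b <- ncol(sales)       # number of markets
grand <- mean(sales)   # grand mean

# Sums of squares for a two-way layout with one observation per cell
ss_companies <- b * sum((rowMeans(sales) - grand)^2)
ss_markets   <- a * sum((colMeans(sales) - grand)^2)
ss_total     <- sum((sales - grand)^2)
ss_error     <- ss_total - ss_companies - ss_markets

ms_companies <- ss_companies / (a - 1)
ms_error     <- ss_error / ((a - 1) * (b - 1))
f_companies  <- ms_companies / ms_error

round(c(SS_companies = ss_companies, SS_markets = ss_markets,
        SS_error = ss_error, SS_total = ss_total, F_companies = f_companies), 3)
```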

2.2.6 Solving the problem in R

a, Import data
> companies <- gl(4, 3, 12)
> markets <- gl(3, 1, 12)
> id <- 1:12
> products <- c(55, 38, 24, 28, 30, 21, 20, 18, 31, 47, 64, 74)
> data <- data.frame(companies, markets, id, products)
> data

b, Result

   companies markets id products
1          1       1  1       55
2          1       2  2       38
3          1       3  3       24
4          2       1  4       28
5          2       2  5       30
6          2       3  6       21
7          3       1  7       20
8          3       2  8       18
9          3       3  9       31
10         4       1 10       47
11         4       2 11       64
12         4       3 12       74

c, Two-way ANOVA test

> x <- aov(products ~ companies + markets)
> summary(x)

=> Result:
            Df  Sum Sq  Mean Sq  F value  Pr(>F)
Companies    3  2763.7    921.2    5.542  0.0365
Markets      2     0.0      0.0    0.0    1.0
Residuals    6   997.3    166.2

d, Tukey HSD test

> TukeyHSD(x)

=> Result: Tukey multiple comparisons of means, 95% family-wise confidence level
*companies
        diff      lwr      upr    p adj
2-1   -12.67   -49.11    23.77    0.657
3-1   -16.00   -52.44    20.44    0.482
4-1    22.67   -13.77    59.11    0.238
3-2    -3.33   -39.77    33.11    0.988
4-2    35.33    -1.11    71.77    0.056
4-3    38.67     2.26    75.11    0.039

*markets
        diff      lwr      upr    p adj
2-1        0   -27.97    27.97    1
3-1        0   -27.97    27.97    1
3-2        0   -27.97    27.97    1
2.2.7 Conclusion
- Because F (Companies) = 5.55 > 4.76 = $F_{0.05;3,6}$ => Reject H0a. Therefore, the number of
products sold differs between companies. (From ANOVA)
- Because F (Markets) = 0 => Cannot reject H0b. Therefore, there is no evidence that the number
of products sold differs between markets. (From ANOVA)
- Only the comparison between Competitor 2 and Other competitors has an adjusted p-value below
α = 0.05 (0.039), so this is the only significant pairwise difference. (From Tukey HSD)
- The total number of products sold in each market is equal: 150 products.
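The critical value and the p-value used in this conclusion can be checked with R's F-distribution functions. The sketch below assumes the F statistic 5.542 reported by summary() for companies, with 3 and 6 degrees of freedom; the variable names are illustrative.

```r
alpha  <- 0.05
f_stat <- 5.542   # F value for companies from the ANOVA table
df1    <- 3       # degrees of freedom for companies
df2    <- 6       # degrees of freedom for error

f_crit  <- qf(1 - alpha, df1, df2)                   # critical value at the 5% level
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)  # right-tail probability of f_stat

c(f_crit = f_crit, p_value = p_value)
f_stat > f_crit   # TRUE means H0a is rejected at the 5% level
```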

2.3 Topic 8 - 3
2.3.1 Requirement
The river water level was measured at several locations in a province on the same day and
the data follow.
                    Measurement location
Measurement time    F1     F2     F3     F4
1                   5.5    4.9    4.6    4.5
2                   4.6    5.1    4.8    6.2
3                   5.8    6.5    5.8    4.8
4                   5.9    5.4    5.1    4.8
5                   6.0    6.1    6.2    6.5
6                   6.7           7.1    6.8
7                   7.2

At the significance level α = 2%, are the average river water levels each day at these locations
significantly different?

2.3.2 Solving method

The method we use is one-way ANOVA with Unequal Sample Sizes.


2.3.3 Theory base

We may describe the observations in the table by the linear statistical model

$$Y_{ij} = \mu + \tau_i + \epsilon_{ij}, \qquad i = 1, 2, \dots, a, \quad j = 1, 2, \dots, n_i$$

where $Y_{ij}$ is a random variable denoting the (ij)th observation, $\mu$ is a parameter common to
all treatments called the overall mean, $\tau_i$ is a parameter associated with the ith treatment called
the ith treatment effect, and $\epsilon_{ij}$ is a random error component.

We assume that the errors $\epsilon_{ij}$ are normally and independently distributed with mean zero
and variance $\sigma^2$. Therefore, each treatment can be thought of as a normal population with mean
$\mu_i$ and variance $\sigma^2$.

We have to use $n_i$ instead of n because the number of measurement times is not the same
for every treatment.

In the fixed-effects model, the treatment effects $\tau_i$ are usually defined as deviations from
the overall mean $\mu$, so that
$$\sum_{i=1}^{a} \tau_i = 0$$

Let $y_{i\cdot}$ represent the total of the observations under the ith treatment and $\bar{y}_{i\cdot}$ represent the
average of the observations under the ith treatment. Similarly, let $y_{\cdot\cdot}$ represent the grand total of
all observations and $\bar{y}_{\cdot\cdot}$ represent the grand mean of all observations. Expressed mathematically,
$$y_{i\cdot} = \sum_{j=1}^{n_i} y_{ij}, \qquad \bar{y}_{i\cdot} = \frac{y_{i\cdot}}{n_i}, \qquad i = 1, 2, \dots, a$$
$$y_{\cdot\cdot} = \sum_{i=1}^{a} \sum_{j=1}^{n_i} y_{ij}, \qquad \bar{y}_{\cdot\cdot} = \frac{y_{\cdot\cdot}}{N}, \qquad N = \sum_{i=1}^{a} n_i$$

We are interested in testing the equality of the a treatment means $\mu_1, \mu_2, \dots, \mu_a$. We find that
this is equivalent to testing the hypotheses
$H_0: \tau_1 = \tau_2 = \dots = \tau_a = 0$
$H_1$: At least one $\tau_i \neq 0$

Thus, if the null hypothesis is true, each observation consists of the overall mean $\mu$ plus a
realization of the random error component $\epsilon_{ij}$. This is equivalent to saying that all N observations
are taken from a normal distribution with mean $\mu$ and variance $\sigma^2$. Therefore, if the null
hypothesis is true, changing the levels of the factor has no effect on the mean response.

The computing formulas for the sums of squares in the ANOVA with unequal sample sizes $n_i$
in each treatment are:
$$SS_T = \sum_{i=1}^{a} \sum_{j=1}^{n_i} y_{ij}^2 - \frac{y_{\cdot\cdot}^2}{N}$$
$$SS_{Treatments} = \sum_{i=1}^{a} \frac{y_{i\cdot}^2}{n_i} - \frac{y_{\cdot\cdot}^2}{N}$$
$$SS_E = SS_T - SS_{Treatments}$$

2.3.4 Calculation in manual

                    Measurement location
Measurement time    F1       F2     F3     F4
1                   5.5      4.9    4.6    4.5
2                   4.6      5.1    4.8    6.2
3                   5.8      6.5    5.8    4.8
4                   5.9      5.4    5.1    4.8
5                   6.0      6.1    6.2    6.5
6                   6.7             7.1    6.8
7                   7.2
SUM                 41.7     28     33.6   33.6
AVG                 5.9571   5.6    5.6    5.6

$$SS_T = (5.5^2 + 4.9^2 + \dots + 7.2^2) - \frac{136.9^2}{24} = 16.2896$$
$$SS_{Treatments} = \left(\frac{41.7^2}{7} + \frac{28^2}{5} + \frac{33.6^2}{6} + \frac{33.6^2}{6}\right) - \frac{136.9^2}{24} = 0.6324$$
$$SS_E = 16.2896 - 0.6324 = 15.6572$$
$$MS_{Treatments} = \frac{SS_{Treatments}}{df(SS_{Treatments})} = \frac{0.6324}{4 - 1} = 0.2108$$
$$MS_E = \frac{SS_E}{df(SS_E)} = \frac{15.6572}{24 - 4} = 0.7829$$
$$F = \frac{MS_{Treatments}}{MS_E} = \frac{0.2108}{0.7829} = 0.2693$$

Source of variation   Df   Sum of squares   Mean square   F
Treatment             3    0.6324           0.2108        0.2693
Error                 20   15.6572          0.7829
Total                 23   16.2896
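The computing formulas for unequal sample sizes translate almost line by line into base R. The sketch below is one way to do it, assuming the four location vectors exactly as listed in the table; the variable names are illustrative.

```r
levels_by_location <- list(
  F1 = c(5.5, 4.6, 5.8, 5.9, 6.0, 6.7, 7.2),
  F2 = c(4.9, 5.1, 6.5, 5.4, 6.1),
  F3 = c(4.6, 4.8, 5.8, 5.1, 6.2, 7.1),
  F4 = c(4.5, 6.2, 4.8, 4.8, 6.5, 6.8)
)
y_all <- unlist(levels_by_location)
N   <- length(y_all)                        # total number of observations
a   <- length(levels_by_location)           # number of treatments (locations)
n_i <- sapply(levels_by_location, length)   # sample size of each treatment
y_i <- sapply(levels_by_location, sum)      # treatment totals

correction <- sum(y_all)^2 / N              # grand total squared over N
ss_total <- sum(y_all^2) - correction
ss_treat <- sum(y_i^2 / n_i) - correction
ss_error <- ss_total - ss_treat

f_stat <- (ss_treat / (a - 1)) / (ss_error / (N - a))
round(c(SS_T = ss_total, SS_Treatments = ss_treat, SS_E = ss_error, F = f_stat), 4)
```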


2.3.5 Solving the problem in R

First, we have to create a data frame in R
> F1 <- c(5.5, 4.6, 5.8, 5.9, 6.0, 6.7, 7.2)
> F2 <- c(4.9, 5.1, 6.5, 5.4, 6.1)
> F3 <- c(4.6, 4.8, 5.8, 5.1, 6.2, 7.1)
> F4 <- c(4.5, 6.2, 4.8, 4.8, 6.5, 6.8)
> Data <- data.frame(Y = c(F1, F2, F3, F4),
+                    Locations = factor(rep(c("F1", "F2", "F3", "F4"),
+                                           times = c(length(F1), length(F2), length(F3), length(F4)))))

Then, compute the one-way ANOVA test

> oneway <- aov(Y ~ Locations, data = Data)
> anova(oneway)
Analysis of Variance Table

Response: Y
          Df  Sum Sq Mean Sq F value Pr(>F)
Locations  3  0.6324 0.21081  0.2693 0.8468
Residuals 20 15.6571 0.78286

Then we apply the function qf() to get the critical value of the F distribution:
> f <- qf(p = .02, df1 = 3, df2 = 20, lower.tail = FALSE)
> f
[1] 4.113404

2.3.6 Conclusion
Using the critical value of the F distribution: F = 0.2693 < $F_{0.02;3,20}$ = 4.1134 ⟹ we fail to reject the hypothesis H0

Therefore, we can conclude that the average river water levels at these locations are not
significantly different at the significance level α = 2%.
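The same decision can be reached through the p-value instead of the critical value. A minimal sketch, assuming the F statistic 0.2693 with 3 and 20 degrees of freedom from the ANOVA table of this topic:

```r
alpha  <- 0.02
f_stat <- 0.2693   # F value for locations from the ANOVA table

# Right-tail probability of observing an F value at least this large under H0
p_value <- pf(f_stat, df1 = 3, df2 = 20, lower.tail = FALSE)
p_value
p_value < alpha   # FALSE: H0 is not rejected at the 2% level
```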

2.4 Topic 8 - 4
2.4.1 Requirement
The following table shows the average cost, measured in 1000 VND, spent on each service in
each city.

                Services
            I     II    III
City I      61    52    69
City II     58    51    61
City III    68    64    79

At the significance level α = 5%, compare the costs of the three types of services in the three cities.


2.4.2 Solving method

A two-way ANOVA is used to estimate how the mean of a quantitative variable changes according
to the levels of two categorical variables. Using a two-way ANOVA helps us understand how two
independent variables, in combination, affect a dependent variable.

According to the data table, we have 2 factors, Cities and Services, each with 3 levels and
without replication, and a dependent variable, Cost. Therefore, two-way ANOVA is chosen.

2.4.3 Theory base

Theory base in this exercise is similar to 2.2.3

2.4.4 Calculation in manual

We have two factors: Cities and Services

            Service I   Service II   Service III
City I      61          52           69
City II     58          51           61
City III    68          64           79

Step 1: We calculate the average of each factor in the following table

                        Services
            Service I   Service II   Service III   Average
City I      61          52           69            60.67
City II     58          51           61            56.67
City III    68          64           79            70.33
Average     62.33       55.67        69.67         62.56

There are two null hypotheses: one for the rows and the other for the columns. Let's look first
at the rows:
H0 : There is no significant difference in cost between the cities
The null hypothesis for the columns is:
H0 : There is no significant difference in cost between the services

Step 2: Calculating the sum of squares

- We have the formula: $SS = \sum_{i=1}^{n} (y_i - \bar{y})^2$
Where:
SS = Sum of Squares
n = number of observations
$y_i$ = value in a sample
$\bar{y}$ = mean value of a sample

- Sum of squares for 1st factor: Cities

City I
Cost   City I mean   Grand mean   (City I mean − Grand mean)²
61     60.67         62.56        3.57
52     60.67         62.56        3.57
69     60.67         62.56        3.57
Sum of Squares                    10.70
City II
Cost   City II mean   Grand mean   (City II mean − Grand mean)²
58     56.67          62.56        34.68
51     56.67          62.56        34.68
61     56.67          62.56        34.68
Sum of Squares                     104.04
City III
Cost   City III mean   Grand mean   (City III mean − Grand mean)²
68     70.33           62.56        60.49
64     70.33           62.56        60.49
79     70.33           62.56        60.49
Sum of Squares                      181.48
Sum of squares for 1st factor: Cities 296.22

- Sum of squares for 2nd factor: Services

Service I
Cost   Service I mean   Grand mean   (Service I mean − Grand mean)²
61     62.33            62.56        0.049
58     62.33            62.56        0.049
68     62.33            62.56        0.049
Sum of Squares                       0.15
Service II
Cost   Service II mean   Grand mean   (Service II mean − Grand mean)²
52     55.67             62.56        47.46
51     55.67             62.56        47.46
64     55.67             62.56        47.46
Sum of Squares                        142.37
Service III
Cost   Service III mean   Grand mean   (Service III mean − Grand mean)²
69     69.67              62.56        50.57
61     69.67              62.56        50.57
79     69.67              62.56        50.57
Sum of Squares                         151.70
Sum of squares for 2nd factor: Services 294.22

- Total sum of squares and error sum of squares
Cost   Grand mean   (Cost − Grand mean)²
61     62.56        2.42
52     62.56        111.42
69     62.56        41.53
58     62.56        20.75
51     62.56        133.53
61     62.56        2.42
68     62.56        29.64
64     62.56        2.09
79     62.56        270.42
Total               614.22
Error = 614.22 − 296.22 − 294.22 = 23.78
Step 3: Calculate the degrees of freedom
- Formula: DF = n − 1 for each factor, and DF (Error) = DF (Cities) × DF (Services)
Where:
DF = Degrees of Freedom
n = The number of levels of the factor (rows or columns)
- Therefore
DF (Cities) = 3 − 1 = 2
DF (Services) = 3 − 1 = 2
DF (Error) = 2 × 2 = 4

Step 4: Calculate the mean square

- Formula: $MS = \frac{SS}{DF}$
Where:
SS = Sum of Squares
DF = Degrees of Freedom
MS = Mean Square
- Therefore
$$MS(Cities) = \frac{296.22}{2} = 148.11$$
$$MS(Services) = \frac{294.22}{2} = 147.11$$
$$MS(Error) = \frac{23.78}{4} = 5.945$$
Step 5: Calculate the F ratio
- We have formula: $F = \frac{MS}{MS(Error)}$
- Therefore
$$F(Cities) = \frac{148.11}{5.945} = 24.91$$
$$F(Services) = \frac{147.11}{5.945} = 24.74$$


Step 6: Display the result

                       Sum of squares   Df   Mean square   F-ratio
1st Factor: Cities     296.22           2    148.11        24.91
2nd Factor: Services   294.22           2    147.11        24.74
Error                  23.78            4    5.945
Total                  614.22           8    76.79
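The six manual steps can be condensed into a short base R sketch. It assumes the 3 × 3 cost table of the requirement (rows are cities, columns are services); the variable names are illustrative. Since it uses unrounded means, the F ratios come out as about 24.92 and 24.75, matching the R output rather than the hand-rounded 24.91 and 24.74.

```r
cost <- matrix(c(61, 52, 69,
                 58, 51, 61,
                 68, 64, 79),
               nrow = 3, byrow = TRUE,
               dimnames = list(city = c("I", "II", "III"),
                               service = c("I", "II", "III")))
grand <- mean(cost)   # grand mean of all nine costs

# Sums of squares for a two-way layout with one observation per cell
ss_cities   <- ncol(cost) * sum((rowMeans(cost) - grand)^2)
ss_services <- nrow(cost) * sum((colMeans(cost) - grand)^2)
ss_total    <- sum((cost - grand)^2)
ss_error    <- ss_total - ss_cities - ss_services

df_error <- (nrow(cost) - 1) * (ncol(cost) - 1)
ms_error <- ss_error / df_error
f_cities   <- (ss_cities / (nrow(cost) - 1)) / ms_error
f_services <- (ss_services / (ncol(cost) - 1)) / ms_error

round(c(SS_cities = ss_cities, SS_services = ss_services, SS_error = ss_error,
        SS_total = ss_total, F_cities = f_cities, F_services = f_services), 2)
```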

2.4.5 Solving the problem in R

After importing the data
> data
  cities services cost
1      I        I   61
2      I       II   52
3      I      III   69
4     II        I   58
5     II       II   51
6     II      III   61
7    III        I   68
8    III       II   64
9    III      III   79

We set cities and services as factors

> # as.factor()
> data$cities <- as.factor(data$cities)
> data$services <- as.factor(data$services)
> str(data)
Classes 'tbl_df', 'tbl' and 'data.frame': 9 obs. of 3 variables:
 $ cities  : Factor w/ 3 levels "I","II","III": 1 1 1 2 2 2 3 3 3
 $ services: Factor w/ 3 levels "I","II","III": 1 2 3 1 2 3 1 2 3
 $ cost    : num 61 52 69 58 51 61 68 64 79

We use two-way ANOVA to analyze the data

> anova1 <- aov(cost ~ cities + services, data = data)
> summary(anova1)
            Df Sum Sq Mean Sq F value  Pr(>F)
cities       2 296.22  148.11   24.92 0.00552 **
services     2 294.22  147.11   24.75 0.00559 **
Residuals    4  23.78    5.94
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA test cannot show us which individual levels differ, so we use the Tukey test
to investigate the data further with the command "TukeyHSD"
> TukeyHSD(anova1)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = cost ~ cities + services, data = data)

$cities
            diff        lwr       upr     p adj
II-I   -4.000000 -11.094901  3.094901 0.2253564
III-I   9.666667   2.571766 16.761568 0.0180671
III-II 13.666667   6.571766 20.761568 0.0052028

$services
            diff         lwr        upr     p adj
II-I   -6.666667 -13.7615676  0.4282342 0.0604773
III-I   7.333333   0.2384324 14.4282342 0.0450964
III-II 14.000000   6.9050991 21.0949009 0.0047572
We can draw a bar graph to show the relation between the 3 variables. With a single observation
per cell, the interaction model anova2 leaves no residual degrees of freedom, so its summary
shows no F values.

> anova2 <- aov(cost ~ services + cities + services:cities, data = data)
> summary(anova2)
                Df Sum Sq Mean Sq
services         2 294.22  147.11
cities           2 296.22  148.11
services:cities  4  23.78    5.94

> model.tables(anova2)
Tables of means
Grand mean

62.55556

 services
services
    I    II   III
62.33 55.67 69.67

 cities
cities
    I    II   III
60.67 56.67 70.33

 services:cities
        cities
services I  II III
     I   61 58 68
     II  52 51 64
     III 69 61 79

> Input = "services I II III
+ I 61 58 68
+ II 52 51 64
+ III 69 61 79"

> Means = as.matrix(read.table(textConnection(Input)))
> Means
     V1         V2   V3   V4
[1,] "services" "I"  "II" "III"
[2,] "I"        "61" "58" "68"
[3,] "II"       "52" "51" "64"
[4,] "III"      "69" "61" "79"

> Means = as.matrix(read.table(textConnection(Input), header = TRUE, row.names = 1))
> barplot(Means, beside = TRUE, legend.text = TRUE, ylim = c(0, 80), xlab = "Cities", ylab = "Cost")


2.4.6 Conclusion
The manual calculation and the R code give the same result.

The F-ratio of Cities is 24.92 > F(2,4) = 6.94 (5% significance level).

The F-ratio of Services is 24.75 > F(2,4) = 6.94 (5% significance level).
Furthermore, the p-values of both factors are below 0.05.

We can therefore reject both null hypotheses and conclude that both services and cities
affect the average cost.

Besides, from the TukeyHSD results and the bar graph we can also conclude that, for both
factors, the differences III-I and III-II are significant while II-I is not. Based on the
differences between the group means:

The costs of services I and II do not differ significantly, while the cost of
service III is higher than both.

The same holds for the cities:

The first and second cities have nearly the same cost, and the third city has the
highest cost of the three.
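
The critical value and the p-values quoted in this conclusion can be reproduced with the F distribution functions in base R. This is a small sketch that only uses the F values and degrees of freedom reported in the ANOVA table; the variable names are ours.

```r
# Critical value of F(2, 4) at the 5% significance level
f_crit <- qf(0.95, df1 = 2, df2 = 4)

# Upper-tail probabilities for the two observed F values
p_cities   <- pf(24.92, df1 = 2, df2 = 4, lower.tail = FALSE)
p_services <- pf(24.75, df1 = 2, df2 = 4, lower.tail = FALSE)

round(f_crit, 2)
round(c(cities = p_cities, services = p_services), 5)
```

Both observed F values lie above the critical value, which is the same decision as comparing the p-values with 0.05.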

3 Project 2
3.1 Import data: Diet.csv
The first command (setwd; wd stands for working directory) tells R to use the folder "c://diet"
as the working directory, where the data files are read and saved.


The second command reads the CSV file Diet.csv into an object named gh.

The third command (save) tells R to store the data of the gh object in a file named
"gh.rda". After running these three commands, a file named gh.rda will be present in that
directory. The fourth command prints gh.
> setwd("c://diet")
> gh <- read.csv("Diet.csv", header = TRUE)
> save(gh, file = "gh.rda")
> gh

The output:
   Person gender Age Height pre.weight Diet weight6weeks
1      25     NA  41    171         60    2         60.0
2      26     NA  32    174        103    2        103.0
3       1      0  22    159         58    1         54.2
4       2      0  46    192         60    1         54.0
5       3      0  55    170         64    1         63.3
6       4      0  33    171         64    1         61.1
7       5      0  50    170         65    1         62.2
8       6      0  50    201         66    1         64.0
9       7      0  37    174         67    1         65.0
10      8      0  28    176         69    1         60.5
11      9      0  28    165         70    1         68.1
12     10      0  45    165         70    1         66.9
13     11      0  60    173         72    1         70.5
14     12      0  48    156         72    1         69.0
15     13      0  41    163         72    1         68.4
16     14      0  37    167         82    1         81.1
17     27      0  44    174         58    2         60.1
18     28      0  37    172         58    2         56.0
19     29      0  41    165         59    2         57.3
20     30      0  43    171         61    2         56.7
21     31      0  20    169         62    2         55.0
22     32      0  51    174         63    2         62.4
23     33      0  31    163         63    2         60.3
24     34      0  54    173         63    2         59.4
25     35      0  50    166         65    2         62.0
26     36      0  48    163         66    2         64.0
27     37      0  16    165         68    2         63.8
28     38      0  37    167         68    2         63.3
29     39      0  30    161         76    2         72.7
30     40      0  29    169         77    2         77.5
31     52      0  51    165         60    3         53.0
32     53      0  35    169         62    3         56.4
33     54      0  21    159         64    3         60.6
34     55      0  22    169         65    3         58.2
35     56      0  36    160         66    3         58.2
36     57      0  20    169         67    3         61.6
37     58      0  35    163         67    3         60.2
38     59      0  45    155         69    3         61.8
39     60      0  58    141         70    3         63.0
40     61      0  37    170         70    3         62.7
41     62      0  31    170         72    3         71.1
42     63      0  35    171         72    3         64.4
43     64      0  56    171         73    3         68.9
44     65      0  48    153         75    3         68.7
45     66      0  41    157         76    3         71.0
46     15      1  39    168         71    1         71.6
47     16      1  31    158         72    1         70.9
48     17      1  40    173         74    1         69.5
49     18      1  50    160         78    1         73.9
50     19      1  43    162         80    1         71.0
51     20      1  25    165         80    1         77.6
52     21      1  52    177         83    1         79.1
53     22      1  42    166         85    1         81.5
54     23      1  39    166         87    1         81.9
55     24      1  40    190         88    1         84.5
56     41      1  51    191         71    2         66.8
57     42      1  38    199         75    2         72.6
58     43      1  54    196         75    2         69.2
59     44      1  33    190         76    2         72.5
60     45      1  45    160         78    2         72.7
61     46      1  37    194         78    2         76.3
62     47      1  44    163         79    2         73.6
63     48      1  40    171         79    2         72.9
64     49      1  37    198         79    2         71.1
65     50      1  39    180         80    2         81.4
66     51      1  31    182         80    2         75.7
67     67      1  36    155         71    3         68.5
68     68      1  47    179         73    3         72.1
69     69      1  29    166         76    3         72.5
70     70      1  37    173         78    3         77.5
71     71      1  31    177         78    3         75.2
72     72      1  26    179         78    3         69.4
73     73      1  40    179         79    3         74.5
74     74      1  35    183         83    3         80.2
75     75      1  49    177         84    3         79.9
76     76      1  28    164         85    3         79.7
77     77      1  40    167         87    3         77.8
78     78      1  51    175         88    3         81.9
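
The read.csv / save steps can be tried without the original file. The sketch below is only an illustration: it writes a three-row demo CSV (values copied from the table above, with hypothetical file names) into a temporary folder, reads it back, and checks that save() and load() restore the same object.

```r
# Small demo data frame; the values are copied from the first rows of the output above
demo <- data.frame(Person = c(25, 1, 15),
                   gender = c(NA, 0, 1),
                   pre.weight = c(60, 58, 71))

# Hypothetical file names inside a temporary folder (not the c://diet folder of the report)
csv_path <- file.path(tempdir(), "Diet_demo.csv")
rda_path <- file.path(tempdir(), "gh_demo.rda")

write.csv(demo, csv_path, row.names = FALSE)

# Same pattern as in the report: read the CSV, then store the object in an .rda file
gh_demo <- read.csv(csv_path, header = TRUE)
save(gh_demo, file = rda_path)

# Remove the object and restore it from the .rda file
rm(gh_demo)
restored_names <- load(rda_path)   # load() returns the names of the restored objects
restored_names
str(gh_demo)
```

Note that load() recreates the object under its original name, so no assignment is needed.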

3.2 Cleaning data: NA

The first two rows have a missing gender value (NA). We remove the rows that contain NA with the command:
> gh <- na.omit(gh)

The output:
> gh
   Person gender Age Height pre.weight Diet weight6weeks
3       1      0  22    159         58    1         54.2
4       2      0  46    192         60    1         54.0
5       3      0  55    170         64    1         63.3
6       4      0  33    171         64    1         61.1
7       5      0  50    170         65    1         62.2
8       6      0  50    201         66    1         64.0
9       7      0  37    174         67    1         65.0
10      8      0  28    176         69    1         60.5
11      9      0  28    165         70    1         68.1
12     10      0  45    165         70    1         66.9
13     11      0  60    173         72    1         70.5
14     12      0  48    156         72    1         69.0
15     13      0  41    163         72    1         68.4
16     14      0  37    167         82    1         81.1
17     27      0  44    174         58    2         60.1
18     28      0  37    172         58    2         56.0
19     29      0  41    165         59    2         57.3
20     30      0  43    171         61    2         56.7
21     31      0  20    169         62    2         55.0
22     32      0  51    174         63    2         62.4
23     33      0  31    163         63    2         60.3
24     34      0  54    173         63    2         59.4
25     35      0  50    166         65    2         62.0
26     36      0  48    163         66    2         64.0
27     37      0  16    165         68    2         63.8
28     38      0  37    167         68    2         63.3
29     39      0  30    161         76    2         72.7
30     40      0  29    169         77    2         77.5
31     52      0  51    165         60    3         53.0
32     53      0  35    169         62    3         56.4
33     54      0  21    159         64    3         60.6
34     55      0  22    169         65    3         58.2
35     56      0  36    160         66    3         58.2
36     57      0  20    169         67    3         61.6
37     58      0  35    163         67    3         60.2
38     59      0  45    155         69    3         61.8
39     60      0  58    141         70    3         63.0
40     61      0  37    170         70    3         62.7
41     62      0  31    170         72    3         71.1
42     63      0  35    171         72    3         64.4
43     64      0  56    171         73    3         68.9
44     65      0  48    153         75    3         68.7
45     66      0  41    157         76    3         71.0
46     15      1  39    168         71    1         71.6
47     16      1  31    158         72    1         70.9
48     17      1  40    173         74    1         69.5
49     18      1  50    160         78    1         73.9
50     19      1  43    162         80    1         71.0
51     20      1  25    165         80    1         77.6
52     21      1  52    177         83    1         79.1
53     22      1  42    166         85    1         81.5
54     23      1  39    166         87    1         81.9
55     24      1  40    190         88    1         84.5
56     41      1  51    191         71    2         66.8
57     42      1  38    199         75    2         72.6
58     43      1  54    196         75    2         69.2
59     44      1  33    190         76    2         72.5
60     45      1  45    160         78    2         72.7
61     46      1  37    194         78    2         76.3
62     47      1  44    163         79    2         73.6
63     48      1  40    171         79    2         72.9
64     49      1  37    198         79    2         71.1
65     50      1  39    180         80    2         81.4
66     51      1  31    182         80    2         75.7
67     67      1  36    155         71    3         68.5
68     68      1  47    179         73    3         72.1
69     69      1  29    166         76    3         72.5
70     70      1  37    173         78    3         77.5
71     71      1  31    177         78    3         75.2
72     72      1  26    179         78    3         69.4
73     73      1  40    179         79    3         74.5
74     74      1  35    183         83    3         80.2
75     75      1  49    177         84    3         79.9
76     76      1  28    164         85    3         79.7
77     77      1  40    167         87    3         77.8
78     78      1  51    175         88    3         81.9
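
Before dropping rows it can help to see where the missing values are. The following sketch uses a four-row excerpt of the table (the two rows with NA and the first two complete rows); the object names demo and cleaned are ours.

```r
# Excerpt of the Diet data: rows 1 and 2 have a missing gender
demo <- data.frame(Person = c(25, 26, 1, 2),
                   gender = c(NA, NA, 0, 0),
                   Age    = c(41, 32, 22, 46))

# Number of missing values in each column
na_per_column <- colSums(is.na(demo))
na_per_column

# Keep only the complete rows, as na.omit(gh) does in the report
cleaned <- na.omit(demo)
nrow(cleaned)

# na.omit records which rows were dropped
attr(cleaned, "na.action")
```

In the full data set the same call removes rows 1 and 2, which is why the cleaned output starts at row 3 and has 76 rows.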


3.3 Data visualization

3.3.1 Descriptive statistics for each of the variables
The summary command gives the following values; the notes below the output explain them.
> summary(gh)
     Person          gender            Age            Height
 Min.   : 1.00   Min.   :0.0000   Min.   :16.00   Min.   :141.0
 1st Qu.:19.75   1st Qu.:0.0000   1st Qu.:32.50   1st Qu.:163.8
 Median :40.50   Median :0.0000   Median :39.00   Median :169.0
 Mean   :39.87   Mean   :0.4342   Mean   :39.22   Mean   :170.8
 3rd Qu.:59.25   3rd Qu.:1.0000   3rd Qu.:47.25   3rd Qu.:175.2
 Max.   :78.00   Max.   :1.0000   Max.   :60.00   Max.   :201.0
   pre.weight         Diet        weight6weeks
 Min.   :58.00   Min.   :1.000   Min.   :53.00
 1st Qu.:66.00   1st Qu.:1.000   1st Qu.:61.95
 Median :72.00   Median :2.000   Median :68.95
 Mean   :72.29   Mean   :2.039   Mean   :68.34
 3rd Qu.:78.00   3rd Qu.:3.000   3rd Qu.:73.67
 Max.   :88.00   Max.   :3.000   Max.   :84.50
** 1st Qu.: first quartile (the 25% position)
** 3rd Qu.: third quartile (the 75% position)
** Mean: average value
** Min.: smallest value
** Max.: largest value
** Median: median
We define a function that returns the mean, the standard deviation (SD) and the standard error (SE) of a variable.

> desc <- function(x)
+ { av <- mean(x, na.rm = TRUE)   # mean, ignoring missing values
+   s <- sd(x, na.rm = TRUE)      # sample standard deviation
+   n <- sum(!is.na(x))           # number of non-missing values
+   se <- s / sqrt(n)             # standard error of the mean
+   c(MEAN = av, SD = s, SE = se) }

The output:
> desc(gh$Person)
     MEAN        SD        SE
39.868421 22.843434  2.620322
> desc(gh$gender)
      MEAN         SD         SE
0.43421053 0.49894626 0.05723306
> desc(gh$Age)
     MEAN        SD        SE
39.223684  9.908379  1.136569
> desc(gh$Height)
      MEAN         SD         SE
170.776316  11.419981   1.309962
> desc(gh$pre.weight)
      MEAN         SD         SE
72.2894737  7.9746528  0.9147554
> desc(gh$Diet)
      MEAN         SD         SE
2.03947368 0.82366319 0.09448065
> desc(gh$weight6weeks)
      MEAN         SD         SE
68.3434211  8.0589385  0.9244236
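
Instead of calling the function once per variable, it can be applied to several columns at a time with sapply. The sketch below is an illustration on the first six rows of the cleaned data; the function is a standalone variant that counts only non-missing values, and the names demo and stats are ours.

```r
# Descriptive function: mean, standard deviation and standard error
desc <- function(x) {
  av <- mean(x, na.rm = TRUE)
  s  <- sd(x, na.rm = TRUE)
  n  <- sum(!is.na(x))          # number of non-missing values
  c(MEAN = av, SD = s, SE = s / sqrt(n))
}

# First six rows of the cleaned Diet data (rows 3 to 8 of the report)
demo <- data.frame(Age        = c(22, 46, 55, 33, 50, 50),
                   Height     = c(159, 192, 170, 171, 170, 201),
                   pre.weight = c(58, 60, 64, 64, 65, 66))

# One column of results per variable; transpose to get one row per variable
stats <- sapply(demo, desc)
round(t(stats), 3)
```

On the full data the same call, for example sapply(gh[, c("Age", "Height", "pre.weight", "weight6weeks")], desc), should reproduce the values listed above.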

3.3.2 Boxplot


> boxplot(gh$Person, main = "PERSON", ylab = "people")

The output:

> boxplot(gh$Age, main = "AGE", ylab = "years")

The output:


> boxplot(gh$Height, main = "HEIGHT", ylab = "cm")

The output:


> boxplot(gh$pre.weight, gh$weight6weeks, main = "PRE & AFTER", ylab = "kg")

The output:


3.4 t.test: between pre.weight and weight6weeks

Save pre.weight, weight6weeks and Diet as pre, after and diet respectively:
> pre <- gh$pre.weight
> after <- gh$weight6weeks
> diet <- gh$Diet

Use the command t.test with paired = TRUE to compare pre.weight and weight6weeks:

> t.test(pre, after, paired = TRUE)

The output:
        Paired t-test

data:  pre and after
t = 13.728, df = 75, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 3.373452 4.518653
sample estimates:
mean of the differences
               3.946053

CONCLUSION

Because the p-value is below 0.05, we reject the null hypothesis that the mean of pre.weight equals the mean of weight6weeks.
The average difference between them is 3.95 kg (95% confidence interval: 3.37 kg to 4.52 kg).
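
The paired t statistic is the mean of the differences divided by its standard error. As an illustration, the sketch below recomputes it by hand on the first six complete rows of the data and compares the result with t.test; the variable names are ours and the six-row excerpt is used only to keep the example short.

```r
# First six complete rows of the Diet data (rows 3 to 8 of the report)
pre   <- c(58, 60, 64, 64, 65, 66)
after <- c(54.2, 54.0, 63.3, 61.1, 62.2, 64.0)

d <- pre - after          # paired differences
n <- length(d)
se <- sd(d) / sqrt(n)     # standard error of the mean difference

# t statistic and 95% confidence interval computed by hand
t_manual  <- mean(d) / se
ci_manual <- mean(d) + c(-1, 1) * qt(0.975, df = n - 1) * se

# The same quantities from t.test
fit <- t.test(pre, after, paired = TRUE)
c(manual = t_manual, t.test = unname(fit$statistic))
rbind(manual = ci_manual, t.test = as.numeric(fit$conf.int))
```

With the full 76 rows the same formulas should give the t value and the interval shown in the output above.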

3.5 One way ANOVA: What is the best diet for weight loss?
We define the weight loss as loss = pre - after:
> loss <- pre - after

To analyze variance, we need to define diet as a factor and then put diet and loss into the
same data frame called data.
> diet <- as.factor(diet)
> data <- data.frame(diet, loss)

Compute the one-way ANOVA test:

> oneway <- aov(loss ~ diet, data = gh)
> anova(oneway)
Analysis of Variance Table

Response: loss
          Df Sum Sq Mean Sq F value   Pr(>F)
diet       2  60.53 30.2635  5.3831 0.006596 **
Residuals 73 410.40  5.6219
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
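
The F value and the p-value of this table follow directly from the sums of squares and the degrees of freedom. A short check, using only the rounded numbers printed above (so small rounding differences are expected); the variable names are ours:

```r
# Mean squares from the printed sums of squares and degrees of freedom
ms_diet  <- 60.53 / 2
ms_resid <- 410.40 / 73

# F ratio and its upper-tail probability under F(2, 73)
f_value <- ms_diet / ms_resid
p_value <- pf(f_value, df1 = 2, df2 = 73, lower.tail = FALSE)

round(c(F = f_value, p = p_value), 4)
```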

The ANOVA shows a statistically significant effect of Diet on weight loss [F(2, 73) = 5.383,
p = 0.006596 < 0.05], so at least one diet differs from the others.
To find out which diets differ, we use TukeyHSD(oneway).


> TukeyHSD(oneway)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = loss ~ diet, data = gh)

$diet
         diff        lwr      upr     p adj
2-1 -0.032000 -1.6530850 1.589085 0.9987711
3-1  1.848148  0.2567422 3.439554 0.0188047
3-2  1.880148  0.3056826 3.454614 0.0152020

CONCLUSION

Diet 3 was significantly different from diet 1 (p = 0.0188) and diet 2 (p = 0.0152), while diets 1
and 2 did not differ (p = 0.9988). Participants on diet 3 lost on average 1.85 kg more than those
on diet 1 and 1.88 kg more than those on diet 2, so the best diet for weight loss is diet 3.

3.6 Two way ANOVA: How do Diet and gender affect weight loss?
To carry out a two way ANOVA with an interaction, use:
aov(dependent ~ as.factor(independent1) * as.factor(independent2), data = filename)

Give the ANOVA model a name, anova2:

> anova2 <- aov(loss ~ as.factor(gender) * diet, data = gh)
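
In an R formula the * operator is shorthand for both main effects plus their interaction. The sketch below only inspects the formulas, so no data are needed; gender, diet and loss are used as plain symbols here, and the object names f_star and f_long are ours.

```r
# Short form with * and the equivalent long form with + and :
f_star <- loss ~ gender * diet
f_long <- loss ~ gender + diet + gender:diet

# terms() expands a formula into its individual model terms
labels_star <- attr(terms(f_star), "term.labels")
labels_long <- attr(terms(f_long), "term.labels")

labels_star
identical(labels_star, labels_long)
```

This is why the ANOVA table of anova2 has three rows besides the residuals: one per main effect and one for the interaction.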

To view the ANOVA table use the anova() command

> anova(anova2)
Analysis of Variance Table

Response: loss
                       Df Sum Sq Mean Sq F value   Pr(>F)
as.factor(gender)       1   0.28  0.2785  0.0518 0.820623
diet                    2  60.42 30.2086  5.6190 0.005456 **
as.factor(gender):diet  2  33.90 16.9520  3.1532 0.048842 *
Residuals              70 376.33  5.3761
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

CONCLUSION

There was a statistically significant interaction between the effects of Diet and Gender on
weight loss [F(2, 70) = 3.153, p = 0.049]. The 'Diet' effect cannot be generalised for both males
and females because the interaction effect is significant (p = 0.049).

The TukeyHSD(anova2) command will produce post hoc tests for the main effects and interac-
tions. Only interpret post hoc tests for the significant factors from the ANOVA. If the interaction
is NOT significant, interpret the post hoc tests for significant main effects but if it is significant,
only interpret the interactions post hoc tests.

Post hoc tests for the main effects and the interaction of diet and gender:

> TukeyHSD(anova2)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = loss ~ as.factor(gender) * diet, data = gh)

$`as.factor(gender)`
         diff        lwr      upr     p adj
1-0 0.1221283 -0.9480861 1.192343 0.8206233

$diet
           diff        lwr      upr     p adj
2-1 -0.03484966 -1.6215073 1.551808 0.9984761
3-1  1.84475570  0.2871469 3.402365 0.0162482
3-2  1.87960536  0.3385771 3.420634 0.0128844

$`as.factor(gender):diet`
              diff        lwr       upr     p adj
1:1-0:1  0.6000000 -2.2129628 3.4129628 0.9887997
0:2-0:1 -0.4428571 -3.0107291 2.1250148 0.9958151
1:2-0:1  1.0590909 -1.6782698 3.7964516 0.8656520
0:3-0:1  2.8300000  0.3052886 5.3547114 0.0191170
1:3-0:1  1.1833333 -1.4893925 3.8560592 0.7855223
0:2-1:1 -1.0428571 -3.8558199 1.7701056 0.8852416
1:2-1:1  0.4590909 -2.5093998 3.4275816 0.9975014
0:3-1:1  2.2300000 -0.5436187 5.0036187 0.1863470
1:3-1:1  0.5833333 -2.3256625 3.4923292 0.9915569
1:2-0:2  1.5019481 -1.2354126 4.2393087 0.5963201
0:3-0:2  3.2728571  0.7481458 5.7975685 0.0040103
1:3-0:2  1.6261905 -1.0465354 4.2989163 0.4833188
0:3-1:2  1.7709091 -0.9260048 4.4678230 0.3965102
1:3-1:2  0.1242424 -2.7117126 2.9601974 0.9999949
1:3-0:3 -1.6466667 -4.2779524 0.9846191 0.4513580

The following output for post hoc interactions tests has been adjusted in Excel to make it easier
to read.

The interaction post hoc tests compare each pair of gender:diet combinations. They show that the
only significant differences are for females (gender = 0): between diets 1 and 3 (p = 0.0191) and
between diets 2 and 3 (p = 0.004). Women on diet 3 lose on average 2.83 kg more than those on
diet 1 and 3.27 kg more than those on diet 2.


CONCLUSION

A two way ANOVA was carried out on weight lost by diet type and gender. There was a
statistically significant interaction between the effects of Diet and Gender on weight loss [F(2,
70)=3.153, p = 0.049]. Tukey’s HSD post hoc tests were carried out. For females, diet 3 was
significantly different to diet 1 (p = 0.0191) and diet 2 (p = 0.004) but there is no evidence to
suggest that any diets differed for males. Women on diet 3 lost on average 2.83kg more than
those on diet 1 and 3.27kg more than those on diet 2.
