PS Assignments
UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING
CC04
Assignment
Contents
1 Member list & Workload
2 Project 1
2.1 Topic 8-1
2.1.1 Requirement
2.1.2 Solving method
2.1.3 Theory base
2.1.4 Calculation in manual
2.1.5 Solving the problem in R
2.1.6 Conclusion
2.2 Topic 8-2
2.2.1 Requirement
2.2.2 Solving method
2.2.3 Theory base
2.2.4 Calculate in manual
2.2.5 Display the result
2.2.6 Solving the problem in R
2.2.7 Conclusion
2.3 Topic 8-3
2.3.1 Requirement
2.3.2 Solving method
2.3.3 Theory base
2.3.4 Calculation in manual
2.3.5 Solving the problem in R
2.3.6 Conclusion
2.4 Topic 8-4
2.4.1 Requirement
2.4.2 Solving method
2.4.3 Theory base
2.4.4 Calculation in manual
2.4.5 Solving the problem in R
2.4.6 Conclusion
3 Project 2
3.1 Import data: Diet.csv
3.2 Cleaning data: NA
3.3 Data visualization
3.3.1 Descriptive statistics for each of the variables
3.3.2 Boxplot
3.4 t.test: between pre.weight and weight6weeks
3.5 One way ANOVA: What is the best diet for weight loss?
3.6 Two way ANOVA: How do Diet and gender affect weightLOST?
4 References
Assignment for Probabilities and Statistics - Academic year 2020 - 2021 Page 1/34
University of Technology, Ho Chi Minh City
Faculty of Computer Science and Engineering
2 Project 1
2.1 Topic 8 - 1
2.1.1 Requirement
The following table shows the diameter X and the height Y of 20 trees.
X 2.3 2.5 2.6 3.1 3.4 3.7 7.3 3.9 4 4.1 4.1 4.2 4.4 4.7 5.1 5.5 5.8 6.2 6.9 6.9
Y 7 8 4 4 6 6 14 12 8 5 7 8 7 9 10 13 7 11 11 16
Fit a linear regression model and check all of its assumptions.
a) Linear regression
Linear regression attempts to model the relationship between two variables by fitting a
linear equation to observed data. One variable is considered to be an explanatory variable, and
the other is considered to be a dependent variable. For example, a modeler might want to relate
the weights of individuals to their heights using a linear regression model.
A linear regression line has an equation of the form Y = a + bX, where X is the explanatory
variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the
value of Y when X = 0).
b) Four assumptions
- Linearity: The response variable y should be linearly related to the explanatory variable X.
- Independence of errors: After fitting the model, the residual errors should be independent and
identically distributed random variables. If the residual errors are not independent, they tend to
show some sort of pattern (which is not always obvious to the naked eye). Such a pattern carries
information that the regression model was not able to capture, which makes the model sub-optimal.
- Normality of errors: If the residual errors are not N(0, σ²) distributed, statistical tests of
significance that depend on the errors having an N(0, σ²) distribution are no longer valid.
- Homoscedasticity (constant variance): Besides being normally distributed with mean zero, the
residual errors should have a constant variance σ². In particular, σ² should not be a function of
the response variable y, and thereby indirectly of the explanatory variable X.
- Therefore
$b = \dfrac{S_{xy}}{S_{xx}} = 1.676896175$
$a = \bar{y} - b\bar{x} = 1.045275845$
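The manual calculation can be sketched in R as follows. This is only an illustration: the names `Sxx`, `Sxy`, `a` and `b` are chosen here, and the 14th diameter is entered as 4.7, which is an assumption made because that value reproduces the coefficients reported in this section.

```r
# Diameter (x) and height (y) of the 20 trees from section 2.1.1.
# Assumption: the 14th diameter is 4.7 (this value reproduces the reported fit).
x <- c(2.3, 2.5, 2.6, 3.1, 3.4, 3.7, 7.3, 3.9, 4.0, 4.1,
       4.1, 4.2, 4.4, 4.7, 5.1, 5.5, 5.8, 6.2, 6.9, 6.9)
y <- c(7, 8, 4, 4, 6, 6, 14, 12, 8, 5,
       7, 8, 7, 9, 10, 13, 7, 11, 11, 16)

# Corrected sums of squares and cross-products
Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))

# Least-squares slope and intercept
b <- Sxy / Sxx
a <- mean(y) - b * mean(x)
c(intercept = a, slope = b)

# The same estimates from the built-in fitting function
coef(lm(y ~ x))
```

Both routes should give the same two numbers, since `lm()` solves the same least-squares problem.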
- We get:
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
1.045 1.677
So the linear regression model is: y = 1.045 + 1.677x
We plot the data:
> plot(x, y, col = "blue", main = "Diameter and Height regression",
+      cex = 1.3, pch = 16, xlab = "Diameter", ylab = "Height")
> abline(lm(y ~ x))  # add the fitted regression line to the scatter plot
The plot of residuals versus fitted values below shows no systematic pattern (the residuals scatter
around 0), thus this assumption has been met.
We get:
lag Autocorrelation D-W Statistic p-value
1 0.009385456 1.802587 0.5
Alternative hypothesis: rho != 0
As the p-value is larger than 0.05, we cannot reject the null hypothesis of no autocorrelation
(rho = 0), so the errors can be treated as independent.
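For reference, the Durbin-Watson statistic itself can be computed by hand from the residuals. The sketch below uses only base R. The names `e` and `dw` are illustrative, the 14th diameter is again assumed to be 4.7, and the value obtained depends on the order of the observations, so it matches the reported statistic only if the rows are in the same order as in the original analysis.

```r
# Tree data from section 2.1.1 (assumption: 14th diameter is 4.7)
x <- c(2.3, 2.5, 2.6, 3.1, 3.4, 3.7, 7.3, 3.9, 4.0, 4.1,
       4.1, 4.2, 4.4, 4.7, 5.1, 5.5, 5.8, 6.2, 6.9, 6.9)
y <- c(7, 8, 4, 4, 6, 6, 14, 12, 8, 5,
       7, 8, 7, 9, 10, 13, 7, 11, 11, 16)

# Residuals of the fitted line, in observation order
e <- residuals(lm(y ~ x))

# Durbin-Watson statistic: squared successive differences over squared residuals.
# Values near 2 indicate little lag-1 autocorrelation.
dw <- sum(diff(e)^2) / sum(e^2)
dw
```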
- Normality of errors: We check the normality of the errors with a Q-Q plot:
> plot(relation, 2)
The Q-Q plot below shows that the standardized residuals are approximately normally distributed.
Shapiro-Wilk Test:
> library(MASS)
> shapiro <- studres(relation)  # studentized residuals of the fitted model
> shapiro.test(shapiro)
data: shapiro
W = 0.96613, p-value = 0.6719
The p-value is larger than 0.05, so we cannot reject the hypothesis that the errors are normally
distributed.
- Equal variance (Homoscedasticity):
Non-constant variance test:
Let H0 denote the hypothesis that the variance of the errors is constant
H1 denote the hypothesis that the variance of the errors is not constant
> ncvTest(relation)  # ncvTest() is provided by the car package
We get:
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 0.01590465, Df = 1, p = 0.89964
The p-value is larger than 0.05, so we cannot reject the null hypothesis; the variance of the
errors can be treated as constant.
2.1.6 Conclusion
After checking all assumptions, we can conclude that the diameter and the height of those
trees are linearly related and that all assumptions of the linear regression model are met.
2.2 Topic 8 - 2
2.2.1 Requirement
A company wants to target 3 overseas markets. To compare the potential market share of the
company and that of some competitors in the markets, their products were sold to potential
customers. The following results were obtained.
                     Markets
                     A      B      C
The company          55     38     24
Competitor 1         28     30     21
Competitor 2         20     18     31
Other competitors    47     64     74
ANOVA uses the F-test to assess statistical significance. The F-test is a
groupwise comparison test, which means it compares the variance in each group mean to the
overall variance in the dependent variable.
If the variance within groups is smaller than the variance between groups, the F-test will
find a higher F-value, and therefore a higher likelihood that the difference observed is real and
not due to chance.
A two-way ANOVA with interaction tests three null hypotheses at the same time:
• There is no difference in group means at any level of the first independent variable.
• There is no difference in group means at any level of the second independent variable.
• The effect of one independent variable does not depend on the effect of the other
independent variable (a.k.a. no interaction effect).
A two-way ANOVA without interaction (a.k.a. an additive two-way ANOVA) only tests the
first two of these hypotheses.
Adj MS
The calculations for the mean square for the factors, interaction, and error follow:
$MS(A) = \dfrac{SS(A)}{df(A)}, \quad MS(B) = \dfrac{SS(B)}{df(B)}$
$MS(AB) = \dfrac{SS(AB)}{df(AB)}, \quad MS(Error) = \dfrac{SS(Error)}{df(Error)}$
Notation
The sum of squared distances. SS Total is the total variation in the data. SS (A) and SS (B)
are the amount of variation of the estimated factor level mean around the overall mean. These
statistics are also known as the sum of squares for factor A or factor B. SS Error is the amount
of variation of the observations from their fitted values. The calculations follow:
The degrees of freedom for the F statistic that you use to calculate the p-value depend on
the term that is in the test.
When you test a term, the denominator degrees of freedom are always the degrees of freedom
for error. The degrees of freedom for error depend on whether the interaction term is in the
model or not.
                     Markets
                     A      B      C      Average
The company          55     38     24     39
Competitor 1         28     30     21     26.33
Competitor 2         20     18     31     23
Other competitors    47     64     74     61.67
Average              37.5   37.5   37.5
b, Calculate the Sum of Squares
- We have the formula: $SS = \sum_{i=1}^{n} (y_i - \bar{y})^2$
Where:
SS = Sum of Squares
n = number of observations
$y_i$ = value in a sample
$\bar{y}$ = mean value of a sample
- Therefore:
*Companies
The company
Products   The company mean   Grand mean   (The company mean - grand mean)^2
55         39                 37.5         2.25
38         39                 37.5         2.25
24         39                 37.5         2.25
Sum of Squares                             6.75
Competitor 1
Products   Competitor 1 mean   Grand mean   (Competitor 1 mean - grand mean)^2
28         26.33               37.5         124.77
30         26.33               37.5         124.77
21         26.33               37.5         124.77
Sum of Squares                              374.31
Competitor 2
Products   Competitor 2 mean   Grand mean   (Competitor 2 mean - grand mean)^2
20         23                  37.5         210.25
18         23                  37.5         210.25
31         23                  37.5         210.25
Sum of Squares                              630.75
Other competitors
Products   Other competitors mean   Grand mean   (Other competitors mean - grand mean)^2
47         61.67                    37.5         584.19
64         61.67                    37.5         584.19
74         61.67                    37.5         584.19
Sum of Squares                                   1752.57
*Markets
- We can see that the mean of each market is the same and is also equal to the grand mean
=> The sum of squares for markets is equal to 0
*Total
- Formula: $SST = \sum_{i=1}^{a} \sum_{j=1}^{b} y_{ij}^2 - \dfrac{y_t^2}{ab}$
Where
SST = Total Sum of Squares
a, b = the number of rows and columns
$y_t$ = sum of all observations
- Therefore
$SST = (55^2 + 38^2 + \dots + 74^2) - \dfrac{450^2}{12} = 3761$
*Error
- Formula: $SSE = SST - SS(Companies) - SS(Markets)$
- Therefore
$SSE = 3761 - 2764.38 - 0 = 996.62$
             SS        DF   MS       F
Companies    2764.38   3    921.46   5.55
Markets      0         2    0        0
Error        996.62    6    166.1
Total        3761
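The sums of squares in this table can be cross-checked with a short base R sketch. The object names (`sales`, `ss_companies`, and so on) are illustrative. Because the sketch uses unrounded row means, it gives SS(Companies) ≈ 2763.67 rather than the 2764.38 obtained above from means rounded to two decimals.

```r
# Products sold: rows are companies, columns are markets A, B, C
sales <- matrix(c(55, 38, 24,
                  28, 30, 21,
                  20, 18, 31,
                  47, 64, 74),
                nrow = 4, byrow = TRUE,
                dimnames = list(c("The company", "Competitor 1",
                                  "Competitor 2", "Other competitors"),
                                c("A", "B", "C")))

grand <- mean(sales)

# Factor sums of squares: (observations per level) * squared deviation of level means
ss_companies <- ncol(sales) * sum((rowMeans(sales) - grand)^2)
ss_markets   <- nrow(sales) * sum((colMeans(sales) - grand)^2)

# Total and error sums of squares
ss_total <- sum((sales - grand)^2)
ss_error <- ss_total - ss_companies - ss_markets

# F statistic for companies: df = 4 - 1 = 3, error df = (4 - 1) * (3 - 1) = 6
f_companies <- (ss_companies / 3) / (ss_error / 6)
c(SS_companies = ss_companies, SS_markets = ss_markets,
  SS_error = ss_error, SS_total = ss_total, F_companies = f_companies)
```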
a, Import data
> companies <- gl(4, 3, 12)  # 4 companies, each repeated for the 3 markets
> markets <- gl(3, 1, 12)    # markets 1, 2, 3 cycling within each company
> id <- 1:12
> products <- c(55, 38, 24, 28, 30, 21, 20, 18, 31, 47, 64, 74)
> data <- data.frame(companies, markets, id, products)
> data
b, Result
=> Result:
            Df   Sum Sq   Mean Sq   F value   Pr(>F)
Companies    3   2763.7     921.2     5.542   0.0365
Markets      2      0.0       0         0.0   1.0
Residuals    6    997.3     166.2
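A table of this form is what an additive two-way `aov()` fit produces. The call that was actually used is not shown in this section, so the following is only a sketch of one way to obtain it. The object names `fit` and `tab` are chosen here for illustration.

```r
# Rebuild the data frame from the import step
companies <- gl(4, 3, 12)
markets   <- gl(3, 1, 12)
products  <- c(55, 38, 24, 28, 30, 21, 20, 18, 31, 47, 64, 74)
data <- data.frame(companies, markets, products)

# Additive two-way ANOVA (no interaction term, since there is one value per cell)
fit <- aov(products ~ companies + markets, data = data)
tab <- summary(fit)[[1]]
tab

# Pairwise comparisons between companies
TukeyHSD(fit, "companies")
```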
=> Result: Tukey multiple comparisons of means 95% family-wise confidence level
*companies
*markets
diff lwr upr p adj
2-1 0 -27.97 27.97 1
3-1 0 -27.97 27.97 1
3-2 0 -27.97 27.97 1
2.2.7 Conclusion
- Because F(Companies) = 5.55 > 4.76 = F(0.05; 3, 6) => Reject H0a. Therefore, the number of
products sold differs significantly between the companies. (From ANOVA)
- Because F(Markets) = 0 => Cannot reject H0b. Therefore, the number of products sold does not
differ between the markets. (From ANOVA)
- Among the pairwise comparisons, only Competitor 2 versus Other competitors has an adjusted
p-value below α = 0.05, so this is the only significant difference. (From Tukey HSD)
- The total number of products in each market is equal: 150 products.
2.3 Topic 8 - 3
2.3.1 Requirement
The river water level was measured at several locations in a province on the same day, with the
following results.
Measurement time   Measurement location
                   F1    F2    F3    F4
1                  5.5   4.9   4.6   4.5
2                  4.6   5.1   4.8   6.2
3                  5.8   6.5   5.8   4.8
4                  5.9   5.4   5.1   4.8
5                  6.0   6.1   6.2   6.5
6                  6.7         7.1   6.8
7                  7.2
At the significance level α = 2%, are the average river water levels each day at these locations
significantly different?
The model is $y_{ij} = \mu + \tau_i + \epsilon_{ij}$,
where $y_{ij}$ is a random variable denoting the (ij)th observation, $\mu$ is a parameter common to
all treatments called the overall mean, $\tau_i$ is a parameter associated with the ith treatment
called the ith treatment effect, and $\epsilon_{ij}$ is a random error component.
We assume that the errors $\epsilon_{ij}$ are normally and independently distributed with mean zero
and variance $\sigma^2$. Therefore, each treatment can be thought of as a normal population with
mean $\mu_i$ and variance $\sigma^2$.
We have to use $n_i$ instead of $n$ because the number of measurements is not the same for every
treatment.
In the fixed-effects model, the treatment effects $\tau_i$ are usually defined as deviations from
the overall mean $\mu$, so that
$\sum_{i=1}^{a} \tau_i = 0$
Let $y_{i\cdot}$ represent the total of the observations under the ith treatment and
$\bar{y}_{i\cdot}$ represent the average of the observations under the ith treatment. Similarly, let
$y_{\cdot\cdot}$ represent the grand total of all observations and $\bar{y}_{\cdot\cdot}$ represent
the grand mean of all observations. Expressed mathematically,
$y_{i\cdot} = \sum_{j=1}^{n_i} y_{ij}$
$\bar{y}_{i\cdot} = \dfrac{y_{i\cdot}}{n_i}$ with $i = 1, 2, \dots, a$
$y_{\cdot\cdot} = \sum_{i=1}^{a} \sum_{j=1}^{n_i} y_{ij}$
$\bar{y}_{\cdot\cdot} = \dfrac{y_{\cdot\cdot}}{N}$ with $N = \sum_{i=1}^{a} n_i$
We are interested in testing the equality of the a treatment means $\mu_1, \mu_2, \dots, \mu_a$.
We find that this is equivalent to testing the hypotheses
$H_0: \tau_1 = \tau_2 = \dots = \tau_a = 0$
$H_1$: At least one $\tau_i \neq 0$
Thus, if the null hypothesis is true, each observation consists of the overall mean $\mu$ plus a
realization of the random error component $\epsilon_{ij}$. This is equivalent to saying that all N
observations are taken from a normal distribution with mean $\mu$ and variance $\sigma^2$.
Therefore, if the null hypothesis is true, changing the levels of the factor has no effect on the
mean response.
The computing formulas for the sums of squares in the ANOVA with unequal sample sizes $n_i$
in each treatment are:
$SS_T = \sum_{i=1}^{a} \sum_{j=1}^{n_i} y_{ij}^2 - \dfrac{y_{\cdot\cdot}^2}{N}$
$SS_{Treatments} = \sum_{i=1}^{a} \dfrac{y_{i\cdot}^2}{n_i} - \dfrac{y_{\cdot\cdot}^2}{N}$
$SS_E = SS_T - SS_{Treatments}$
$SS_T = (5.5^2 + 4.9^2 + \dots + 7.2^2) - \dfrac{136.9^2}{24} = 16.2896$
$SS_{Treatments} = \left(\dfrac{41.7^2}{7} + \dfrac{28^2}{5} + \dfrac{33.6^2}{6} + \dfrac{33.6^2}{6}\right) - \dfrac{136.9^2}{24} = 0.6324$
$SS_E = 16.2896 - 0.6324 = 15.6572$
$MS_{Treatments} = \dfrac{SS_{Tr}}{df(SS_{Tr})} = \dfrac{0.6324}{4 - 1} = 0.2108$
$MS_E = \dfrac{SS_E}{df(SS_E)} = \dfrac{15.6572}{24 - 4} = 0.7829$
$F = \dfrac{MS_{Tr}}{MS_E} = \dfrac{0.2108}{0.7829} = 0.2693$
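The manual calculation above can be checked with a one-way `aov()` fit. This is a sketch under one assumption: the three values at measurement time 6 belong to F1, F3 and F4, which is what the sample sizes (7, 5, 6, 6) and treatment totals (41.7, 28, 33.6, 33.6) used above imply. The names `level`, `location`, `fit` and `tab` are illustrative.

```r
# Water levels grouped by location (unequal sample sizes: 7, 5, 6, 6)
level <- c(5.5, 4.6, 5.8, 5.9, 6.0, 6.7, 7.2,   # F1
           4.9, 5.1, 6.5, 5.4, 6.1,             # F2
           4.6, 4.8, 5.8, 5.1, 6.2, 7.1,        # F3
           4.5, 6.2, 4.8, 4.8, 6.5, 6.8)        # F4
location <- factor(rep(c("F1", "F2", "F3", "F4"), times = c(7, 5, 6, 6)))

# One-way ANOVA
fit <- aov(level ~ location)
tab <- summary(fit)[[1]]
tab

# Compare the F statistic with the critical value at alpha = 0.02
f_stat <- tab[["F value"]][1]
f_crit <- qf(0.02, df1 = 3, df2 = 20, lower.tail = FALSE)
f_stat < f_crit   # TRUE means H0 is not rejected
```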
Then, applying the function qf() (critical value of the F distribution), we get:
> f = qf(p = .02, df1 = 3, df2 = 20, lower.tail = FALSE)
> f
[1] 4.113404
2.3.6 Conclusion
Using the critical value of the F distribution: F = 0.2693 < F(0.02; 3, 20) = 4.1134 => we fail to
reject H0.
Therefore, we can conclude that the average river water levels at these locations are not
significantly different at the significance level α = 2%.
2.4 Topic 8 - 4
2.4.1 Requirement
The following table shows the average cost, in units of 1000 VND, of each service in each city.
            Services
            I     II    III
City I      61    52    69
City II     58    51    61
City III    68    64    79
At the significance level α = 5%, compare the costs for the three types of services in the three
cities.
According to the data tables, we have 2 factors which are Cities and Services and each factor
has 3 levels without replication and a dependent variable Cost. Therefore, two-way ANOVA is
chosen.
City III
Service I Service II Service III
68 64 79
Services
Service I Service II Service III Average
City I 61 52 69 60.67
City II 58 51 61 56.67
City III 68 64 79 70.33
Average 62.33 55.67 69.67 62.56
There are two null hypotheses: one for the rows and the other for the columns. Let’s look first
at the rows:
H0 : There is no significant difference in cost between the cities
The null hypothesis for the columns is:
H0 : There is no significant difference in cost between the services
yi = value in a sample
ȳ = mean value of a sample
City I
Cost   City I Mean   Grand mean   (City I Mean - Grand mean)^2
61     60.67         62.56        3.57
52     60.67         62.56        3.57
69     60.67         62.56        3.57
Sum of Squares                    10.70
City II
Cost   City II Mean   Grand mean   (City II Mean - Grand mean)^2
58     56.67          62.56        34.68
51     56.67          62.56        34.68
61     56.67          62.56        34.68
Sum of Squares                     104.04
City III
Cost   City III Mean   Grand mean   (City III Mean - Grand mean)^2
68     70.33           62.56        60.49
64     70.33           62.56        60.49
79     70.33           62.56        60.49
Sum of Squares                      181.48
Sum of squares for 1st factor: Cities 296.22
- Sum of squares for 2nd factor: Services
Service I
Cost   Service I Mean   Grand mean   (Service I Mean - Grand mean)^2
61     62.33            62.56        0.049
58     62.33            62.56        0.049
68     62.33            62.56        0.049
Sum of Squares                       0.15
Service II
Cost   Service II Mean   Grand mean   (Service II Mean - Grand mean)^2
52     55.67             62.56        47.46
51     55.67             62.56        47.46
64     55.67             62.56        47.46
Sum of Squares                        142.37
Service III
Cost   Service III Mean   Grand mean   (Service III Mean - Grand mean)^2
69     69.67              62.56        50.57
61     69.67              62.56        50.57
79     69.67              62.56        50.57
Sum of Squares                         151.70
Sum of squares for 2nd factor: Services 294.22
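The two factor sums of squares can be cross-checked, and the F statistics obtained, with an additive two-way `aov()` fit. The formula `cost ~ cities + services` and the object name `anova1` are the ones that appear in the Tukey output of this topic. The way the factors are built with `gl()` and the names `tab` and `f_crit` are assumptions made for this sketch.

```r
# Costs listed city by city (City I, City II, City III), services I-III within each
cost     <- c(61, 52, 69, 58, 51, 61, 68, 64, 79)
cities   <- gl(3, 3, 9, labels = c("I", "II", "III"))
services <- gl(3, 1, 9, labels = c("I", "II", "III"))
data <- data.frame(cities, services, cost)

# Two-way ANOVA without replication, so no interaction term
anova1 <- aov(cost ~ cities + services, data = data)
tab <- summary(anova1)[[1]]
tab

# Critical value at alpha = 0.05 with (2, 4) degrees of freedom
f_crit <- qf(0.05, df1 = 2, df2 = 4, lower.tail = FALSE)
tab[["F value"]][1:2] > f_crit   # TRUE means the corresponding H0 is rejected
```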
The ANOVA test cannot show us which pairs of levels differ, so we decided to use the Tukey test to
investigate the data with the command “TukeyHSD”
> TukeyHSD(anova1)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = cost ~ cities + services, data = data)

$cities
            diff        lwr       upr     p adj
II-I   -4.000000 -11.094901  3.094901 0.2253564
III-I   9.666667   2.571766 16.761568 0.0180671
III-II 13.666667   6.571766 20.761568 0.0052028

$services
            diff         lwr        upr     p adj
II-I   -6.666667 -13.7615676  0.4282342 0.0604773
III-I   7.333333   0.2384324 14.4282342 0.0450964
III-II 14.000000   6.9050991 21.0949009 0.0047572
2.4.6 Conclusion
We obtained the same result from both the manual calculation and the R code.
We can reject both null hypotheses and conclude that both services and cities affect
the average cost.
In addition, from the TukeyHSD result and the bar graph, we can conclude that the differences
III-I and III-II are significant while II-I is not. Based on the mean differences of each group:
The costs of services I and II do not differ significantly, while the cost of service III is
higher than the others.
The first and second cities have nearly the same cost, while the third city has the highest cost.
3 Project 2
3.1 Import data: Diet.csv
The first command (setwd; wd means working directory) tells R to use the folder “c://diet” as the
working directory, that is, the folder where data files are read and saved.
The second command reads the file Diet.csv into an object named gh.
The third command (save) tells R to store the data in the gh object in a file named “gh.rda”.
After running these commands, a file named gh.rda will be present in that directory.
> setwd("c://diet")
> gh <- read.csv("Diet.csv", header = TRUE)
> save(gh, file = "gh.rda")
> gh
The output:
   Person gender Age Height pre.weight Diet weight6weeks
 1     25     NA  41    171         60    2         60.0
 2     26     NA  32    174        103    2        103.0
 3      1      0  22    159         58    1         54.2
 4      2      0  46    192         60    1         54.0
 5      3      0  55    170         64    1         63.3
 6      4      0  33    171         64    1         61.1
 7      5      0  50    170         65    1         62.2
 8      6      0  50    201         66    1         64.0
 9      7      0  37    174         67    1         65.0
10      8      0  28    176         69    1         60.5
11      9      0  28    165         70    1         68.1
12     10      0  45    165         70    1         66.9
13     11      0  60    173         72    1         70.5
14     12      0  48    156         72    1         69.0
15     13      0  41    163         72    1         68.4
16     14      0  37    167         82    1         81.1
17     27      0  44    174         58    2         60.1
18     28      0  37    172         58    2         56.0
19     29      0  41    165         59    2         57.3
20     30      0  43    171         61    2         56.7
21     31      0  20    169         62    2         55.0
22     32      0  51    174         63    2         62.4
23     33      0  31    163         63    2         60.3
24     34      0  54    173         63    2         59.4
25     35      0  50    166         65    2         62.0
26     36      0  48    163         66    2         64.0
27     37      0  16    165         68    2         63.8
28     38      0  37    167         68    2         63.3
29     39      0  30    161         76    2         72.7
30     40      0  29    169         77    2         77.5
31     52      0  51    165         60    3         53.0
32     53      0  35    169         62    3         56.4
33     54      0  21    159         64    3         60.6
34     55      0  22    169         65    3         58.2
35     56      0  36    160         66    3         58.2
36     57      0  20    169         67    3         61.6
37     58      0  35    163         67    3         60.2
38     59      0  45    155         69    3         61.8
39     60      0  58    141         70    3         63.0
40     61      0  37    170         70    3         62.7
41     62      0  31    170         72    3         71.1
42     63      0  35    171         72    3         64.4
43     64      0  56    171         73    3         68.9
44     65      0  48    153         75    3         68.7
45     66      0  41    157         76    3         71.0
46     15      1  39    168         71    1         71.6
47     16      1  31    158         72    1         70.9
48     17      1  40    173         74    1         69.5
49     18      1  50    160         78    1         73.9
50     19      1  43    162         80    1         71.0
51     20      1  25    165         80    1         77.6
52     21      1  52    177         83    1         79.1
53     22      1  42    166         85    1         81.5
54     23      1  39    166         87    1         81.9
55     24      1  40    190         88    1         84.5
56     41      1  51    191         71    2         66.8
57     42      1  38    199         75    2         72.6
58     43      1  54    196         75    2         69.2
59     44      1  33    190         76    2         72.5
60     45      1  45    160         78    2         72.7
61     46      1  37    194         78    2         76.3
62     47      1  44    163         79    2         73.6
63     48      1  40    171         79    2         72.9
64     49      1  37    198         79    2         71.1
65     50      1  39    180         80    2         81.4
66     51      1  31    182         80    2         75.7
67     67      1  36    155         71    3         68.5
68     68      1  47    179         73    3         72.1
69     69      1  29    166         76    3         72.5
70     70      1  37    173         78    3         77.5
71     71      1  31    177         78    3         75.2
72     72      1  26    179         78    3         69.4
73     73      1  40    179         79    3         74.5
74     74      1  35    183         83    3         80.2
75     75      1  49    177         84    3         79.9
76     76      1  28    164         85    3         79.7
77     77      1  40    167         87    3         77.8
78     78      1  51    175         88    3         81.9
The output:
> gh
   Person gender Age Height pre.weight Diet weight6weeks
 3      1      0  22    159         58    1         54.2
 4      2      0  46    192         60    1         54.0
 5      3      0  55    170         64    1         63.3
 6      4      0  33    171         64    1         61.1
 7      5      0  50    170         65    1         62.2
 8      6      0  50    201         66    1         64.0
 9      7      0  37    174         67    1         65.0
10      8      0  28    176         69    1         60.5
11      9      0  28    165         70    1         68.1
12     10      0  45    165         70    1         66.9
13     11      0  60    173         72    1         70.5
14     12      0  48    156         72    1         69.0
15     13      0  41    163         72    1         68.4
16     14      0  37    167         82    1         81.1
17     27      0  44    174         58    2         60.1
18     28      0  37    172         58    2         56.0
19     29      0  41    165         59    2         57.3
20     30      0  43    171         61    2         56.7
21     31      0  20    169         62    2         55.0
22     32      0  51    174         63    2         62.4
23     33      0  31    163         63    2         60.3
24     34      0  54    173         63    2         59.4
25     35      0  50    166         65    2         62.0
26     36      0  48    163         66    2         64.0
27     37      0  16    165         68    2         63.8
28     38      0  37    167         68    2         63.3
29     39      0  30    161         76    2         72.7
30     40      0  29    169         77    2         77.5
31     52      0  51    165         60    3         53.0
32     53      0  35    169         62    3         56.4
33     54      0  21    159         64    3         60.6
34     55      0  22    169         65    3         58.2
35     56      0  36    160         66    3         58.2
36     57      0  20    169         67    3         61.6
37     58      0  35    163         67    3         60.2
38     59      0  45    155         69    3         61.8
39     60      0  58    141         70    3         63.0
40     61      0  37    170         70    3         62.7
41     62      0  31    170         72    3         71.1
42     63      0  35    171         72    3         64.4
43     64      0  56    171         73    3         68.9
44     65      0  48    153         75    3         68.7
45     66      0  41    157         76    3         71.0
46     15      1  39    168         71    1         71.6
47     16      1  31    158         72    1         70.9
48     17      1  40    173         74    1         69.5
49     18      1  50    160         78    1         73.9
50     19      1  43    162         80    1         71.0
51     20      1  25    165         80    1         77.6
52     21      1  52    177         83    1         79.1
53     22      1  42    166         85    1         81.5
54     23      1  39    166         87    1         81.9
55     24      1  40    190         88    1         84.5
56     41      1  51    191         71    2         66.8
57     42      1  38    199         75    2         72.6
58     43      1  54    196         75    2         69.2
59     44      1  33    190         76    2         72.5
60     45      1  45    160         78    2         72.7
61     46      1  37    194         78    2         76.3
62     47      1  44    163         79    2         73.6
63     48      1  40    171         79    2         72.9
64     49      1  37    198         79    2         71.1
65     50      1  39    180         80    2         81.4
66     51      1  31    182         80    2         75.7
67     67      1  36    155         71    3         68.5
68     68      1  47    179         73    3         72.1
69     69      1  29    166         76    3         72.5
70     70      1  37    173         78    3         77.5
71     71      1  31    177         78    3         75.2
72     72      1  26    179         78    3         69.4
73     73      1  40    179         79    3         74.5
74     74      1  35    183         83    3         80.2
75     75      1  49    177         84    3         79.9
76     76      1  28    164         85    3         79.7
77     77      1  40    167         87    3         77.8
78     78      1  51    175         88    3         81.9
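The two rows with a missing gender value (rows 1 and 2 of the imported data) no longer appear in this output. The cleaning command itself is not shown here. One common way to drop incomplete rows is `na.omit()`, sketched below on a four-row excerpt of the data. The names `gh_demo` and `gh_clean` are hypothetical and exist only for this illustration.

```r
# Four rows copied from the imported data: two with gender = NA, two complete
gh_demo <- data.frame(
  Person       = c(25, 26, 1, 2),
  gender       = c(NA, NA, 0, 0),
  Age          = c(41, 32, 22, 46),
  Height       = c(171, 174, 159, 192),
  pre.weight   = c(60, 103, 58, 60),
  Diet         = c(2, 2, 1, 1),
  weight6weeks = c(60.0, 103.0, 54.2, 54.0)
)

# Count missing values per column, then drop every row that contains an NA
colSums(is.na(gh_demo))
gh_clean <- na.omit(gh_demo)
gh_clean
```

Note that `na.omit()` keeps the original row names, which is consistent with the cleaned output above starting at row 3.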
The output:
> desc(gh$Person)
     MEAN        SD        SE
39.868421 22.843434  2.620322
> desc(gh$gender)
      MEAN         SD         SE
0.43421053 0.49894626 0.05723306
> desc(gh$Age)
     MEAN        SD        SE
39.223684  9.908379  1.136569
> desc(gh$Height)
      MEAN         SD         SE
170.776316  11.419981   1.309962
> desc(gh$pre.weight)
      MEAN         SD         SE
72.2894737  7.9746528  0.9147554
> desc(gh$Diet)
      MEAN         SD         SE
2.03947368 0.82366319 0.09448065
> desc(gh$weight6weeks)
      MEAN         SD         SE
68.3434211  8.0589385  0.9244236
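The definition of `desc()` is not shown in this section. A minimal sketch of a helper that produces the same three columns is given below, assuming SE means the standard error of the mean, SD / sqrt(n). That assumption is consistent with the figures above (for example 22.843434 / sqrt(76) ≈ 2.6203).

```r
# Hypothetical reconstruction of desc(): mean, standard deviation, standard error
desc <- function(v) {
  v <- v[!is.na(v)]   # ignore missing values
  c(MEAN = mean(v), SD = sd(v), SE = sd(v) / sqrt(length(v)))
}

# Example on the first five pre.weight values of the cleaned data
desc(c(58, 60, 64, 64, 65))
```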
3.3.2 Boxplot
> boxplot(gh$Person, main = "PERSON", ylab = "people")
The output:
> boxplot(gh$Age, main = "AGE", ylab = "years")
The output:
> boxplot(gh$Height, main = "HEIGHT", ylab = "cm")
The output:
> boxplot(gh$pre.weight, gh$weight6weeks, main = "PRE & AFTER", ylab = "kg")
The output:
The output:
	Paired t-test

data:  pre and after
t = 13.728, df = 75, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 3.373452 4.518653
sample estimates:
mean of the differences
               3.946053
CONCLUSION
Because the p-value < 0.05, we reject the null hypothesis that pre.weight does not differ from
weight6weeks. The average of the differences between them is 3.95 kg.
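The output above has the form produced by `t.test()` with `paired = TRUE`. The sketch below shows such a call on the first six rows of the cleaned data only, so its numbers differ from the full-sample result reported above. The name `res` is illustrative.

```r
# First six persons of the cleaned data: weight before and after six weeks
pre   <- c(58, 60, 64, 64, 65, 66)
after <- c(54.2, 54.0, 63.3, 61.1, 62.2, 64.0)

# Paired t-test: H0 is that the mean of (pre - after) equals 0
res <- t.test(pre, after, paired = TRUE)
res

# The estimate is simply the mean of the pairwise differences
mean(pre - after)
```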
3.5 One way ANOVA: What is the best diet for weight loss?
We call the weight loss loss, defined as loss = pre - after:
> loss <- pre - after
To analyze variance, we need to define diet as a factor and then put diet and loss into the
same data frame called data.
> diet2 <- as.factor(diet)
> data <- data.frame(diet = diet2, loss)  # store the factor version of diet
> attach(data)
The following objects are masked _by_ .GlobalEnv:

    diet, loss
There was a statistically significant effect of Diet on weight loss (p = 0.006596 < 0.05).
Then we use TukeyHSD(oneway)
CONCLUSION
Diet 3 was significantly different from diet 1 (p = 0.0188) and diet 2 (p = 0.0152). People on
diet 3 lost on average 1.85 kg more than those on diet 1 and 1.88 kg more than those on diet 2,
so the best diet for weight loss is diet 3.
3.6 Two way ANOVA: How do Diet and gender affect weightLOST?
To carry out a two way ANOVA with an interaction, use:
aov(dependent ~ as.factor(independent1) * as.factor(independent2), data = filename)
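To make the shape of such a call concrete, here is a small sketch on made-up data. The data frame `toy` and its values are hypothetical and are not taken from Diet.csv, so only the structure of the resulting table is meaningful, not its numbers. The name `anova2` follows the one used in this section.

```r
# Hypothetical toy data: 2 genders x 3 diets, 2 persons per cell
set.seed(1)
toy <- data.frame(
  gender = rep(c(0, 1), each = 6),
  Diet   = rep(c(1, 2, 3), times = 4),
  loss   = round(rnorm(12, mean = 3, sd = 1), 1)
)

# The * operator expands to both main effects plus their interaction
anova2 <- aov(loss ~ as.factor(gender) * as.factor(Diet), data = toy)
tab <- summary(anova2)[[1]]
tab
```

The table has one row per main effect, one for the interaction and one for the residuals.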
CONCLUSION
There was a statistically significant interaction between the effects of Diet and Gender on
weight loss [F(2, 70)=3.153, p = 0.049]. The ‘Diet’ effect cannot be generalised for both males
and females because the interaction effect is substantial (p =0.049).
The TukeyHSD(anova2) command will produce post hoc tests for the main effects and interac-
tions. Only interpret post hoc tests for the significant factors from the ANOVA. If the interaction
is NOT significant, interpret the post hoc tests for significant main effects but if it is significant,
only interpret the interactions post hoc tests.

Fit: aov(formula = loss ~ as.factor(gender) * diet, data = gh)

$`as.factor(gender)`
         diff        lwr      upr     p adj
1-0 0.1221283 -0.9480861 1.192343 0.8206233

$diet
           diff        lwr      upr     p adj
2-1 -0.03484966 -1.6215073 1.551808 0.9984761
3-1  1.84475570  0.2871469 3.402365 0.0162482
3-2  1.87960536  0.3385771 3.420634 0.0128844

$`as.factor(gender):diet`
              diff        lwr       upr     p adj
1:1-0:1  0.6000000 -2.2129628 3.4129628 0.9887997
0:2-0:1 -0.4428571 -3.0107291 2.1250148 0.9958151
1:2-0:1  1.0590909 -1.6782698 3.7964516 0.8656520
0:3-0:1  2.8300000  0.3052886 5.3547114 0.0191170
1:3-0:1  1.1833333 -1.4893925 3.8560592 0.7855223
0:2-1:1 -1.0428571 -3.8558199 1.7701056 0.8852416
1:2-1:1  0.4590909 -2.5093998 3.4275816 0.9975014
0:3-1:1  2.2300000 -0.5436187 5.0036187 0.1863470
1:3-1:1  0.5833333 -2.3256625 3.4923292 0.9915569
1:2-0:2  1.5019481 -1.2354126 4.2393087 0.5963201
0:3-0:2  3.2728571  0.7481458 5.7975685 0.0040103
1:3-0:2  1.6261905 -1.0465354 4.2989163 0.4833188
0:3-1:2  1.7709091 -0.9260048 4.4678230 0.3965102
1:3-1:2  0.1242424 -2.7117126 2.9601974 0.9999949
1:3-0:3 -1.6466667 -4.2779524 0.9846191 0.4513580
The following output for post hoc interactions tests has been adjusted in Excel to make it easier
to read.
The interactions post hoc tests compare each pair of combinations. This shows that the only
significant differences are for females and are between diets 1 and 3 (p=0.0191) and diets 2 and
3 (p=0.004). Women on diet 3 lose on average 2.83kg more than those on diet 1 and 3.27kg more
than those on diet 2.
CONCLUSION
A two way ANOVA was carried out on weight lost by diet type and gender. There was a
statistically significant interaction between the effects of Diet and Gender on weight loss [F(2,
70)=3.153, p = 0.049]. Tukey’s HSD post hoc tests were carried out. For females, diet 3 was
significantly different to diet 1 (p = 0.0191) and diet 2 (p = 0.004) but there is no evidence to
suggest that any diets differed for males. Women on diet 3 lost on average 2.83kg more than
those on diet 1 and 3.27kg more than those on diet 2.
4 References
1. Douglas C. Montgomery, Applied Statistics and Probability for Engineers, 2014
2. Scribbr website
3. Investopedia website
4. Khan Academy website
5. Intro to R (Vietnamese), PDF file