Stat 231 A1
20815290
Stat 231 Assignment 1
23/5/2021
Question 1
a) E(X) = (1/2)(a + b) = (1/2)(2 + 5) = 7/2 = 3.5
Var(X) = (1/12)(b - a)^2 = (1/12)(5 - 2)^2 = 0.75
b) The sample mean and variance are 3.499 and 0.753 respectively, which differ from the
theoretical mean of 3.5 and variance of 0.75 by only 0.001 and 0.003.
c) Since the sample follows a uniform distribution, which is symmetrical, its skewness
should be close to zero. The skewness of the generated sample is 0.044.
d) The sample kurtosis is less than 3, since the sample is distributed uniformly and does not
have a peak. The kurtosis of the generated sample is 1.720, close to the theoretical
kurtosis of 1.8 for a uniform distribution.
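As a rough check on parts a) to d), the theoretical moments of a Uniform(2, 5) distribution can be compared with those of a simulated sample. The sketch below is not the code used for the assignment: the seed, the sample size of 1000, and the helper functions `sample_skewness()` and `sample_kurtosis()` (moment-based definitions) are assumptions made for illustration.

```r
# Moment-based sample skewness and kurtosis (assumed definitions,
# not necessarily the ones supplied with the course).
sample_skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^(3 / 2)
}
sample_kurtosis <- function(x) {
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2
}

a <- 2
b <- 5
theoretical_mean <- (a + b) / 2     # (2 + 5) / 2
theoretical_var  <- (b - a)^2 / 12  # (5 - 2)^2 / 12

set.seed(231)                       # arbitrary seed (assumption)
x <- runif(1000, min = a, max = b)  # sample size 1000 is an assumption

# Sample statistics to compare with the theoretical values
# (theoretical skewness 0 and kurtosis 1.8 for a uniform distribution).
round(c(mean = mean(x), variance = var(x),
        skewness = sample_skewness(x), kurtosis = sample_kurtosis(x)), 3)
```

The exact numbers depend on the seed, so they will not match the values reported above, but they should land near 3.5, 0.75, 0 and 1.8.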
e)
The normal distribution is not a good approximation for the uniform distribution. Even though
the normal distribution is symmetrical, which gives it zero skewness, its density is
highest at the centre and lowest in the tails, giving it a bell shape and a kurtosis of 3,
well above the sample kurtosis of 1.720. The uniform distribution has a constant density for
all x in its range, therefore the normal distribution does not model the uniform distribution well.
Question 2
a) E(X) = 0.5
Var(X) = 0.5^2 = 0.25
The mean and variance from the sample are 0.478 and 0.254 respectively, which differ only
slightly from the theoretical mean and variance.
b) The distribution is positively skewed since the data are concentrated on the left,
leaving the right tail longer than the left. The skewness of the sample is 1.525.
c)
The screenshot above shows the cumulative distribution function for the sample and its
normal approximation. The normal distribution is not a good approximation for the
sample. The sample follows an exponential distribution, meaning its data are positively
skewed with higher density on the left, while the normal distribution is not skewed at all.
This explains the difference in the shape of the CDFs: the CDF of the exponential
distribution rises steeply from zero and then flattens out because of the positive skewness,
while the normal CDF is S-shaped because its density is symmetrical about its peak.
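The comparison of the two CDFs can also be made numerically. The following is a minimal sketch under stated assumptions: the sample is simulated from an exponential distribution with mean 0.5, the seed and the sample size of 1000 are arbitrary, and the evaluation points in `grid` are chosen only for illustration.

```r
set.seed(231)                      # arbitrary seed (assumption)
theta <- 0.5                       # exponential mean from part a)
x <- rexp(1000, rate = 1 / theta)  # sample size 1000 is an assumption

emp_cdf <- ecdf(x)                 # empirical CDF of the sample
grid <- c(0.1, 0.5, 1, 2)          # illustrative evaluation points

comparison <- data.frame(
  x = grid,
  empirical = emp_cdf(grid),
  exponential = pexp(grid, rate = 1 / theta),
  normal = pnorm(grid, mean = mean(x), sd = sd(x))
)
print(round(comparison, 3))

# The normal approximation places noticeable probability below zero,
# where an exponential variate cannot fall.
below_zero <- pnorm(0, mean = mean(x), sd = sd(x))
below_zero
```

The empirical column should track the exponential column closely, while the normal column departs from both, most visibly near zero.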
Question 3
a)
Mean = 0.364
Variance = 0.920
Skewness = - 0.059
Kurtosis = 2.311
b)
Mean = 0.008
Variance = 0.889
Skewness = 0.065
Kurtosis = 2.951
c)
Mean = 0.015
Variance = 0.976
Skewness = 0.052
Kurtosis = 2.982
d) As the sample size increases, the sample mean moves closer to 0, which is the
theoretical mean, and the variance moves closer to 1, which is the theoretical variance.
The skewness and kurtosis approach 0 and 3 respectively as the sample size increases,
so larger samples are better approximated by the normal distribution.
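The pattern described in part d) can be reproduced with a short simulation. This is a sketch only: the sample sizes in `sizes`, the seed, and the moment-based helper functions are assumptions and are not taken from the assignment.

```r
# Moment-based sample skewness and kurtosis (assumed definitions).
sample_skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^(3 / 2)
}
sample_kurtosis <- function(x) {
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2
}

set.seed(231)                 # arbitrary seed (assumption)
sizes <- c(10, 100, 10000)    # hypothetical sample sizes

# One column of summary statistics per sample size.
summaries <- sapply(sizes, function(n) {
  x <- rnorm(n)               # standard normal sample
  c(mean = mean(x), variance = var(x),
    skewness = sample_skewness(x), kurtosis = sample_kurtosis(x))
})
colnames(summaries) <- paste0("n=", sizes)
round(summaries, 3)
```

For the largest sample, the four statistics should sit close to 0, 1, 0 and 3, while the smallest sample can be noticeably off.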
Question 4
a) The discount offered is a continuous variate.
b) My R commands and output:
> min(dataset$Discount_offered)
[1] 1
> max(dataset$Discount_offered)
[1] 64
> median(dataset$Discount_offered)
[1] 7
> quantile(dataset$Discount_offered,0.25)
25%
4
> quantile(dataset$Discount_offered,0.75)
75%
10.25
c) The sample mean and the standard deviation are close, while the skewness is greater
than zero, meaning the data are concentrated at low values, leaving a longer right
tail. These features are similar to those of an exponential distribution.
d)
My R commands and output:
> hist(dataset$Discount_offered,breaks=50, main="Discount Offered",
col="seashell",xlab="Amount",freq=FALSE)
> curve(dexp(x,log=FALSE),col="red", add = TRUE)
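Note that `dexp()` uses `rate = 1` unless a rate is supplied. One common choice, sketched below, is to set the rate to the reciprocal of the sample mean so that the overlaid curve is on the same scale as the data. The vector `discount` is a simulated stand-in for `dataset$Discount_offered`, and the seed, sample size and variable names are assumptions.

```r
set.seed(231)                          # arbitrary seed (assumption)
discount <- rexp(500, rate = 1 / 13)   # hypothetical stand-in for dataset$Discount_offered

# Estimate the exponential rate as 1 / sample mean.
rate_hat <- 1 / mean(discount)

# Relative-frequency histogram, so the bar heights are comparable to a density.
hist(discount, breaks = 50, main = "Discount Offered",
     col = "seashell", xlab = "Amount", freq = FALSE)

# dexp() defaults to rate = 1, so the estimated rate is passed explicitly.
curve(dexp(x, rate = rate_hat), col = "red", add = TRUE)
```

With the real data, `discount` would be replaced by `dataset$Discount_offered`.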
e) The sample mean and the sample standard deviation are 12.92 and 15.361 respectively.
These are quite close, as the difference between the values is 2.441.
f) The general shape of the exponential distribution and the histogram is similar, with the
data densest on the left and fewer observations falling on the right, leaving a long right
tail. The calculated skewness for the sample is 1.84, which is close to the skewness of
the exponential distribution, which is 2. Also, as mentioned in the previous part, the sample
mean and standard deviation are quite close, which is consistent with the exponential
distribution, whose mean and standard deviation are equal.
Question 5
a) The cost of the product is a continuous variate.
b) My R commands and output:
> round(mean(dataset$Cost_of_the_Product),3)
[1] 211.474
> round(median(dataset$Cost_of_the_Product),3)
[1] 218
> round(sd(dataset$Cost_of_the_Product),3)
[1] 47.851
> round(skewness(dataset$Cost_of_the_Product),3)
[1] -0.206
> round(kurtosis(dataset$Cost_of_the_Product),3)
[1] 2.015
c)
My R commands and output:
> hist(dataset$Cost_of_the_Product,breaks=50, main="Cost of the Product",
col="seashell",xlab="Amount",freq=FALSE)
d) From a numerical standpoint, the skewness of the sample data is -0.206, which is fairly
close to the 0 skewness of the normal distribution. The kurtosis of the sample is 2.015,
which is below the normal distribution's 3, so the sample is somewhat flatter than a normal
distribution, although it still has a peak in the middle. The majority of the
sample data are located in the middle of the histogram, between around 150 and 250, forming a
roughly bell-shaped curve similar to the normal distribution. Also the tails of the histogram
are roughly the same length and the overall shape of the sample is symmetrical, meaning
the normal distribution can serve as a rough approximation for the sample data.
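A numerical complement to the histogram is to check how much of the data falls within one and two standard deviations of the mean, which for a normal distribution is about 68.3% and 95.4%. The sketch below uses a simulated vector `cost` as a stand-in for `dataset$Cost_of_the_Product`; the seed, sample size and parameters are assumptions chosen to resemble the summary statistics above.

```r
set.seed(231)                               # arbitrary seed (assumption)
cost <- rnorm(500, mean = 211, sd = 48)     # hypothetical stand-in for the cost data

m <- mean(cost)
s <- sd(cost)

# Observed proportions within 1 and 2 standard deviations of the mean.
within_1sd <- mean(abs(cost - m) <= s)
within_2sd <- mean(abs(cost - m) <= 2 * s)

# Proportions expected under a normal distribution.
expected_1sd <- pnorm(1) - pnorm(-1)
expected_2sd <- pnorm(2) - pnorm(-2)

round(c(within_1sd = within_1sd, expected_1sd = expected_1sd,
        within_2sd = within_2sd, expected_2sd = expected_2sd), 3)
```

Large gaps between the observed and expected proportions would count against the normal approximation.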
Question 6
a)
My R commands and outputs:
plot(dataset$Weight_in_gms, dataset$Cost_of_the_Product,
xlab="Weight (g)",
ylab="Cost of product ($)",
pch=19,
col="darkblue",
cex.axis=1.25,
cex.lab=1.5)
> x<-dataset$Weight_in_gms
> y<-dataset$Cost_of_the_Product
> RegModel <- lm(y~x)
> abline(RegModel)
c) As illustrated in the scatterplot, the sample data are scattered all across the graph. The
fitted linear regression line, y = -0.002447x + 220.196, has a small negative slope,
suggesting a weak negative relationship. The correlation calculated in part b), -0.085,
also supports a weak negative correlation.
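The link between the correlation and the fitted slope can be illustrated on a small example. The `weight` and `cost` vectors below are hypothetical values made up for illustration, not the assignment data. The sketch relies on the standard identity that the least-squares slope equals the correlation multiplied by sd(y)/sd(x).

```r
# Hypothetical weights (g) and costs ($), for illustration only.
weight <- c(1000, 2000, 3000, 4000, 5000, 6000)
cost   <- c(230, 200, 225, 190, 215, 205)

r <- cor(weight, cost)                 # sample correlation
fit <- lm(cost ~ weight)               # simple linear regression
slope <- unname(coef(fit)["weight"])   # fitted slope

# Least-squares slope = r * sd(y) / sd(x), so the slope and r share a sign.
slope_from_r <- r * sd(cost) / sd(weight)

c(correlation = r, slope = slope, slope_from_r = slope_from_r)
```

Because the slope depends on the units of both variables, the correlation is the better measure of how strong the relationship is.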
Question 7
a) The mode of shipment is a categorical variate.
b) My R commands and outputs:
> table(dataset$Mode_of_Shipment)
c)
My R commands and outputs:
> boxplot(formula = Weight_in_gms ~ Mode_of_Shipment,
data = dataset,
outline=TRUE,
frame=TRUE,
col="seashell",
ylab="Weight (g)",
cex.axis=1.15,
cex.lab=1.5)
d) The 3 data sets have a lot of similarities. From the box plot, the interquartile ranges of
the 3 modes of shipment are similar, with flight having the highest maximum and minimum
values. The data sets are all negatively skewed, with the median located near the top
of the box for all 3 of them. The range of weights is large for all the modes of
shipment, spanning from 1000 g to around 6000 g. Finally, there are no outliers in the
weights for any mode of shipment.
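The quantities read off the box plot can also be computed directly. The sketch below uses a small hypothetical data frame, `shipments`, with made-up weights; only the column names follow the assignment's data set. It computes the five-number summary and interquartile range for each mode, and lists the points that `boxplot()` would draw as outliers.

```r
# Hypothetical data, for illustration only.
shipments <- data.frame(
  Mode_of_Shipment = rep(c("Flight", "Road", "Ship"), each = 5),
  Weight_in_gms = c(1200, 3400, 4600, 5100, 5900,
                    1100, 2900, 4300, 5000, 5600,
                    1000, 3100, 4500, 5200, 5800)
)

# Split the weights by mode of shipment.
by_mode <- split(shipments$Weight_in_gms, shipments$Mode_of_Shipment)

# Five-number summary (min, lower hinge, median, upper hinge, max) per mode.
five_num <- lapply(by_mode, fivenum)

# Interquartile range per mode.
iqr_by_mode <- sapply(by_mode, IQR)

# Points beyond 1.5 box lengths from the hinges, as flagged by boxplot().
outliers <- lapply(by_mode, function(w) boxplot.stats(w)$out)

five_num
iqr_by_mode
outliers
```

A median that sits closer to the upper hinge than to the lower hinge, as in these made-up values, corresponds to the median sitting near the top of the box.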
Question 8
This homework allowed me to understand basic concepts of statistics, for example
the five-number summary, as well as how to interpret the shape of a distribution and its
variability. The assignment also served as my introduction to the software R, as we were
instructed to perform basic tasks such as importing data sets and plotting histograms.
During the exercise, I learnt how to model data with different distributions; for instance, I
was instructed to overlay an exponential distribution on a histogram. After this exercise,
I feel much more confident using R, having had limited prior knowledge of it. It also
refreshed my memory of the basics of empirical studies in statistics.