HWK1 324 SS
Jonathan Nolden
*Submit your homework to Canvas by the due date and time. Email your lecturer if you
have extenuating circumstances and need to request an extension.
*If an exercise asks you to use R, include a copy of the code and output. Please edit your
code and output to be only the relevant portions.
*If a problem does not specify how to compute the answer, you may use any appropriate
method. I may ask you to use R or manual calculations on your exams, so practice
accordingly.
*You must include an explanation and/or intermediate calculations for an exercise to be
complete.
*Be sure to submit the HWK1 Autograde Quiz which will give you ~20 of your 40 accuracy
points.
*50 points total: 40 points accuracy, and 10 points completion
## [1] 0.8925
b. Is it possible to determine the direction in which the median changes? Or how much
the median changes? If so, by how much does it change? If not, why not?
Yes, it is possible to determine how the median changes: it does not change at all. The
value being changed is the minimum, and its replacement is still the minimum, so the order
of the remaining values and the sample size (12) stay the same. The median depends only on
the middle of the ordered values, so it changes by 0.
c. Is it possible to predict the direction in which the standard deviation changes? If so,
does it get larger or smaller? If not, why not? Describe why it is difficult to predict by
how much the standard deviation will change in this case.
Yes, the standard deviation will get larger. The standard deviation measures how far the
data points typically deviate from the mean, and changing 11.9 to 1.19 moves that value
farther from the mean. It is difficult to predict by how much the standard deviation will
change because the calculation requires all of the data points (the mean itself also
shifts), and in this example we are given only 1 of the 12 values.
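As a quick check of the reasoning in parts (b) and (c), here is a minimal sketch. The twelve values are hypothetical (they are not the exercise's data) and were chosen only so that 11.9 is the minimum; the names `x_original` and `x_changed` are made up for this illustration.

```r
# Hypothetical sample of 12 values whose minimum is 11.9 (NOT the exercise's data)
x_original <- c(11.9, 12.4, 12.8, 13.1, 13.3, 13.6,
                13.9, 14.2, 14.5, 14.9, 15.3, 15.8)

# Same sample with the minimum replaced by 1.19
x_changed <- x_original
x_changed[which.min(x_changed)] <- 1.19

# The middle two ordered values are untouched, so the medians are identical
median(x_original) == median(x_changed)

# The new minimum sits farther from the mean, so the standard deviation grows
sd(x_changed) > sd(x_original)
```

Both comparisons should come out TRUE for this made-up sample. How much the standard deviation grows depends on the other eleven values, which is why it cannot be predicted from the one changed value alone.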
Exercise 3: After manufacture, computer disks are tested for errors. The table below
tabulates the number of errors detected on each of the 100 disks produced in a day.
The median is 1 and the mean is 1.05, a difference of only 0.05, so the two are almost the
same number. The mean is slightly larger than the median, as expected for right-skewed
data. Both being essentially 1 is consistent with what we would guess based on the shape of
the distribution: most disks have 0 or 1 errors, while comparatively few have 2, 3, or 4.
The calculations that support these numbers are shown below.
#Hand Calculations
mean_hand = sum(Defects)/length(Defects) #105/100 = 1.05
mean_hand
## [1] 1.05
#R Calculations
mean_R = mean(Defects) #mean_R = 1.05
mean_R
## [1] 1.05
## [1] 1
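The mean can also be computed from a frequency table with multiplication instead of repeated addition. The sketch below is only an illustration: the counts in `counts` are an assumed table, chosen so that they reproduce the summary values reported in this write-up (100 disks, total of 105 errors, median 1), and the names `errors`, `counts`, and `defects_assumed` are hypothetical.

```r
# Assumed frequency table (hypothetical counts, chosen to match the
# reported summaries; not copied from the exercise)
errors <- 0:4
counts <- c(41, 31, 15, 8, 5)

n <- sum(counts)                        # number of disks
mean_freq <- sum(errors * counts) / n   # (0*41 + 1*31 + 2*15 + 3*8 + 4*5) / n
mean_freq

# Expand the table back into raw values to check the median
defects_assumed <- rep(errors, times = counts)
median_freq <- median(defects_assumed)
median_freq
```

With these assumed counts the weighted sum is 105 over 100 disks, which agrees with the mean of 1.05 and median of 1 reported above.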
e. Calculate the sample standard deviation "by hand" and using R. Are the values
consistent between the two methods? How would our calculation differ if instead
we know that these 100 values were the whole population? [hint: use multiplication
instead of repeated addition]
My hand calculation of the sample standard deviation matches the value from R, so the two
methods are consistent. If the 100 values were the whole population rather than a sample, a
different formula would apply. The key difference is the denominator of the variance: it is
(n-1) for a sample and n for a population. Treating the data as a population lowers the
standard deviation by about 0.006, from 1.158 to 1.152.
#Hand Calculation
#-> sd_sample = 1.158
# -> sd_pop = 1.152
#R Calculations
# Standard deviation of the sample (denominator n - 1)
sd_sample <- sd(Defects)  # avoid naming the result `sd`, which masks the function
sd_sample
## [1] 1.157976
## [1] 1.152172
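The hand calculation can be organized around a frequency table, multiplying each squared deviation by its count. This is a sketch under an assumption: the counts below are a hypothetical table chosen to be consistent with the summaries reported here, and the variable names are made up for illustration.

```r
# Assumed frequency table (hypothetical counts, chosen to match the
# reported summaries; not copied from the exercise)
errors <- 0:4
counts <- c(41, 31, 15, 8, 5)

n <- sum(counts)
x_bar <- sum(errors * counts) / n

# Sum of squared deviations, using each count as a multiplier
ss <- sum(counts * (errors - x_bar)^2)

sd_sample <- sqrt(ss / (n - 1))   # sample formula: divide by n - 1
sd_pop <- sqrt(ss / n)            # population formula: divide by n
c(sample = sd_sample, population = sd_pop)

# R's sd() always uses n - 1, so the population value can also be
# obtained by rescaling the sample value
sd_pop_rescaled <- sd_sample * sqrt((n - 1) / n)
```

Under these assumed counts the two results round to 1.158 and 1.152, in line with the values above. The rescaling on the last line is one common way to get a population standard deviation in R, since base R has no separate function for it.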
f. Construct a boxplot for the number of errors data using R with helpful labels.
Explain how the shape of the data (identified in (c)) can be seen from the boxplot
using words such as minimum, 1st quartile, median, 3rd quartile, and maximum.
The boxplot shown below matches the shape of the data from the histogram in part c. The
histogram is right skewed, and the boxplot shows this because there is no whisker on the
left side of the box while a whisker extends to the right. The minimum value in the boxplot
is 0, which matches the histogram's minimum value. The maximum, at the end of the right
whisker, is 4, which is also the maximum in the histogram. The median in the boxplot is 1,
which matches the median of the data set. The first quartile is 0, the same as the minimum,
which reflects the large cluster of disks with 0 errors at the left end of the histogram.
The third quartile is 2, so at least 75% of the disks have 2 or fewer errors, which again
agrees with the right-skewed histogram.
# With horizontal = TRUE the error counts run along the x-axis, so that axis gets the label
boxplot(Defects, main = "Boxplot of Number of Errors on Disks",
        xlab = "Number of Errors", ylab = "Disks", horizontal = TRUE)
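The five values read off the boxplot can be checked numerically with `fivenum()`. This is only a sketch: `defects_assumed` is built from a hypothetical frequency table chosen to be consistent with the summaries in this write-up, not from the original `Defects` vector.

```r
# Assumed frequency table (hypothetical counts, chosen to match the
# reported summaries; not copied from the exercise)
errors <- 0:4
counts <- c(41, 31, 15, 8, 5)
defects_assumed <- rep(errors, times = counts)

# Tukey's five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(defects_assumed)
names(five) <- c("min", "Q1", "median", "Q3", "max")
five

# When the minimum equals the first quartile, the left whisker has zero length
no_left_whisker <- unname(five["min"] == five["Q1"])
no_left_whisker
```

The hinges returned by `fivenum()` are what `boxplot()` draws as the edges of the box; other quantile definitions can give slightly different quartiles on other data sets.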
g. Explain why the histogram is better able to show the discrete nature of the data than
a boxplot.
Histograms are better than boxplots at showing discrete data such as this. A histogram
displays the frequency of each value, so you can easily see how many disks had 0, 1, 2, 3,
or 4 defects. The boxplot doesn't display this information; it summarizes the distribution
as a whole rather than the individual frequencies. For example, from the boxplot we know
that about 75% of the disks have 2 or fewer defects, but we don't know how many disks had
exactly 2 defects. Also, a histogram shows the full shape of the distribution, whereas a
boxplot shows only whether, and roughly how much, the data are skewed left or right.
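The difference in the information each plot carries can be seen by comparing the numbers behind them. In this sketch, `defects_assumed` comes from a hypothetical frequency table that is consistent with the summaries reported above; it is not the original data.

```r
# Assumed frequency table (hypothetical counts, chosen to match the
# reported summaries; not copied from the exercise)
errors <- 0:4
counts <- c(41, 31, 15, 8, 5)
defects_assumed <- rep(errors, times = counts)

# Behind a histogram: one frequency for every distinct value
freq <- table(defects_assumed)
freq

# Behind a boxplot: only five summary numbers (whisker ends, box edges, median)
box <- boxplot.stats(defects_assumed)$stats
box
```

The table answers "how many disks had exactly 2 errors?" directly, while the five boxplot numbers cannot.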