HWK1 324 SS

The document outlines the requirements for Statistics 324 Homework 1, including submission guidelines, the importance of including R code, and the need for explanations in exercises. It contains exercises focused on statistical concepts such as sampling methods, mean and median calculations, standard deviation, and data visualization through histograms and boxplots. The exercises emphasize understanding the implications of sample selection and the characteristics of data distributions.

Uploaded by

jonathanolden9

Statistics 324 Homework 1

Jonathan Nolden

*Submit your homework to Canvas by the due date and time. Email your lecturer if you
have extenuating circumstances and need to request an extension.
*If an exercise asks you to use R, include a copy of the code and output. Please edit your
code and output to be only the relevant portions.
*If a problem does not specify how to compute the answer, you may use any appropriate
method. I may ask you to use R or use manual calculations on your exams, so practice
accordingly.
*You must include an explanation and/or intermediate calculations for an exercise to be
complete.
*Be sure to submit the HWK1 Autograde Quiz which will give you ~20 of your 40 accuracy
points.
*50 points total: 40 points accuracy, and 10 points completion

Basics of Statistics and Summarizing Data Numerically and Graphically


(I)
Exercise 1. A number of individuals are interested in the proportion of citizens within a
county who will vote to use tax money to upgrade a professional baseball stadium in the
upcoming vote. Consider the following methods:
The Baseball Team Owner surveyed 8,000 people attending one of the baseball games
held in the stadium. Seventy-eight percent (78%) of respondents said they supported the
use of tax money to upgrade the stadium.
The Pollster generated 1,000 random numbers between 1 and 52,661 (number of county
voters in last election) and surveyed the 1,000 citizens who corresponded to those
numbers on the voting roll. Forty-three percent (43%) of respondents said they supported
the use of tax money to upgrade the stadium.
a. What is the population of interest? What is the parameter of interest? Will this
parameter ever be calculated?
The population of interest is the collection of Yes/No responses from all county citizens
who will vote in the upcoming election (roughly 52,661 people, going by the number of
voters in the last election). The parameter of interest is the proportion of those voters
who will vote to use tax money to upgrade the professional baseball stadium. This
parameter will be calculated: the question is on the ballot in the upcoming vote, so the
official result will give the true proportion.
b. What were the sample sizes used and statistics calculated from those samples?
Are these simple random samples from the population of interest?
The sample size for the baseball team owner's poll was 8,000 people attending one of the
baseball games. The statistic is that 78% of respondents said they supported the use of tax
money on the stadium. This is not a simple random sample from the population of interest:
the owner only surveyed people who were at a game, so most county voters had no chance
of being selected. The sample is also biased, because people who attend baseball games
are more likely to vote yes to upgrade the stadium.
The sample size for the Pollster was 1,000 people randomly selected from the population of
interest. The statistic is that 43% of respondents supported the use of tax money to
upgrade the stadium. This poll was a simple random sample from the population of
interest, because every voter on the voting roll had the same chance of being selected and
the pollster did nothing else to narrow down the population.
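The pollster's selection step can be sketched in R. This is only an illustration: the seed and the variable names are assumptions, and the only figures taken from the exercise are the 52,661 voters and the sample of 1,000.

```r
set.seed(324)  # arbitrary seed (an assumption) so the draw is reproducible

n_voters <- 52661  # size of the voting roll, from the exercise
n_sample <- 1000   # pollster's sample size, from the exercise

# sample() draws without replacement by default, so no voter is picked twice
# and every set of 1,000 voters is equally likely: a simple random sample.
sampled_ids <- sample(seq_len(n_voters), size = n_sample)

head(sort(sampled_ids))  # first few positions on the voting roll to survey
```

The owner's survey has no equivalent of this step, since only people at the game could be chosen.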
c. The baseball team owner claims that the survey done at the baseball stadium will
better predict the voting outcome because the sample size was much larger. What
is your response?
The baseball team owner is incorrect. Even though his sample is larger than the
pollster's, he did not randomly select it from the population of interest. He surveyed a
predefined group of people who are likely to be biased towards allocating the money, since
they were attending a baseball game. A larger sample does not remove that bias; it only
gives a more precise estimate of the wrong quantity. Therefore the pollster's estimate
should be more accurate, even though he surveyed fewer people, because his survey was a
simple random sample and the owner's was not.
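A small simulation can illustrate why sample size does not fix a biased sample. Everything about the population here is made up for illustration: the 20% share of baseball fans and the support rates of 78% among fans and 34% among everyone else are assumptions, not data from the exercise. Only the 52,661 voters and the two sample sizes come from the problem.

```r
set.seed(324)  # arbitrary seed (an assumption) for reproducibility

n_voters <- 52661  # from the exercise

# ASSUMPTIONS, chosen only for illustration:
# 20% of voters are baseball fans; fans support at 78%, others at 34%.
is_fan    <- runif(n_voters) < 0.20
p_yes     <- ifelse(is_fan, 0.78, 0.34)
votes_yes <- runif(n_voters) < p_yes

true_prop <- mean(votes_yes)  # the parameter in this made-up population

# Owner: large sample, but drawn only from fans (a biased sample)
owner_est <- mean(sample(votes_yes[is_fan], size = 8000))

# Pollster: small simple random sample from all voters
pollster_est <- mean(sample(votes_yes, size = 1000))

round(c(truth = true_prop, owner = owner_est, pollster = pollster_est), 3)
```

Under these assumptions the owner's estimate stays near the fans' support rate no matter how many fans are surveyed, while the pollster's smaller sample lands close to the population proportion.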
Exercise 2. There are 12 numbers in a sample, and the mean is 𝑥‾ = 24. The minimum of
the sample is accidentally changed from 11.9 to 1.19.
a. Is it possible to determine the direction in which (increase/decrease) the mean
(𝑥‾) changes? Or how much the mean changes? If so, by how much does it change? If
not, why not?
Yes, both the direction and the amount can be determined. The mean decreases because
one value gets smaller while the other 11 values stay the same, so the sum of the sample
decreases. The sum drops by 11.9 - 1.19 = 10.71, so the mean drops by 10.71/12 = 0.8925
(from 24 to 23.1075), as shown in the calculation below. This calculation is possible
because we know the sample size and both the original and the changed value.
(11.9-1.19)/12

## [1] 0.8925

b. Is it possible to determine the direction in which the median changes? Or how much
the median changes? If so, by how much does it change? If not, why not?
Yes, it is possible: the median does not change at all. With 12 values, the median is the
average of the 6th and 7th ordered values. The value that was changed was the minimum
before the change and is still the minimum after it, so the order of the other 11 values,
including the two middle ones, is unaffected. The change in the median is therefore 0.
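The answers to (a) and (b) can be checked on a concrete sample. The 12 values below are hypothetical (the exercise does not give the data); they were chosen only so that the mean is 24 and the minimum is 11.9.

```r
# Hypothetical sample (an assumption): 12 values, mean 24, minimum 11.9
x <- c(11.9, 12, 14, 16, 18, 20, 24, 28, 32, 36, 36.1, 40)

x_changed <- x
x_changed[which.min(x)] <- 1.19  # the accidental change

mean(x) - mean(x_changed)        # drop in the mean: (11.9 - 1.19) / 12
median(x) == median(x_changed)   # the median is unaffected
```

Any other sample with the same size and the same minimum would give the same drop in the mean, since the drop depends only on the two values and on n.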
c. Is it possible to predict the direction in which the standard deviation changes? If so,
does it get larger or smaller? If not, why not? Describe why it is difficult to predict by
how much the standard deviation will change in this case.
The standard deviation will increase. The standard deviation measures how far the data
points typically are from the mean. The minimum (11.9) was already below the mean (24),
and changing it to 1.19 moves it even further from the rest of the data, which increases
the spread. It is difficult to predict by how much the standard deviation will change
because the standard deviation depends on all of the data points, and in this example we
are only given 1 point out of 12.
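To see why the size of the change cannot be pinned down, compare two samples that both fit the description. Both samples are hypothetical (assumptions for illustration): each has 12 values, a mean of 24 and a minimum of 11.9, but the other 11 values differ.

```r
# Two hypothetical samples (assumptions): 12 values, mean 24, minimum 11.9
tight  <- c(11.9, rep(25.1, 11))
spread <- c(11.9, 12, 14, 16, 18, 20, 24, 28, 32, 36, 36.1, 40)

# Replace the minimum with 1.19 and return the change in the standard deviation
sd_change <- function(x) {
  x_changed <- x
  x_changed[which.min(x)] <- 1.19
  sd(x_changed) - sd(x)
}

sd_change(tight)   # positive: the SD increases
sd_change(spread)  # also positive, but by a different amount
```

The direction is the same in both cases, but the amount depends on the values we were not given.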
Exercise 3. After manufacture, computer disks are tested for errors. The table below
tabulates the number of errors detected on each of the 100 disks produced in a day.

Number of Defects    Number of Disks
0                    41
1                    31
2                    15
3                    8
4                    5
a. Describe the type of data that is being recorded about the sample of 100 disks,
being as specific as possible.
The type of data being recorded is quantitative and discrete. It is quantitative because
each observation is a count (a number), not a category. It is discrete because a disk
cannot have 1.5 defects; the count is either 1 or 2, so the data can only take whole-number
values.
b. A frequency histogram showing the number of errors on the 100 disks is given
below. Write the R code to produce this frequency histogram. Be sure to create
useful labels. Hints: use the rep() function to define your defect data. Also use ylim
and breaks to format your graph.
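Defects <- c(rep(0, 41), rep(1, 31), rep(2, 15), rep(3, 8), rep(4, 5))
hist(Defects, breaks = seq(-0.5, 4.5, by = 1), ylim = c(0, 50), labels = TRUE,
     main = "Defect Histogram", xlab = "Number of Defects", ylab = "Number of Disks")
[Figure: Defect Histogram]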
c. What is the shape of the histogram for the number of defects observed in this
sample? Why does that make sense in the context of the question?
The histogram is right skewed. This shape makes sense because the factory is trying to
produce disks with no defects, so a disk with 4 defects is much less likely than a disk
with 0 defects. The count cannot go below 0, and as the number of defects increases, the
number of disks with that many defects decreases, which produces the long tail to the
right.
d. Calculate the mean and median number of errors detected on the 100 disks by
hand and with R. How do the mean and median values compare and is that
consistent with what we would guess based on the shape? [You can use LaTeX such
as $\bar{x} = \frac{value1}{value2}$ to help you show your work neatly.]

The mean is 1.05 and the median is 1, so the mean is 0.05 larger than the median. This is
consistent with what we would guess based on the right-skewed shape: the few disks with
3 or 4 defects pull the mean up, while the median depends only on the middle of the
ordered data, which falls among the many 0's and 1's. The calculations are shown below.
#Hand Calculations
mean_hand <- sum(Defects)/length(Defects) #105/100 = 1.05
mean_hand

## [1] 1.05

#median_hand = 1 since the 50th and 51st ordered values are both 1
#(positions 1-41 are 0's, positions 42-72 are 1's)

#R Calculations
mean_R <- mean(Defects) #mean_R = 1.05
mean_R

## [1] 1.05

Median_R <- median(Defects) #Median_R = 1
Median_R

## [1] 1
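The hand calculation can also be written straight from the frequency table, using multiplication instead of expanding all 100 values. This is a sketch; the names `values` and `counts` are introduced here for illustration and are not part of the original solution.

```r
values <- 0:4                   # number of defects, from the table
counts <- c(41, 31, 15, 8, 5)   # number of disks, from the table

# Mean: weighted sum of the values divided by the number of disks
sum(values * counts) / sum(counts)   # (0*41 + 1*31 + 2*15 + 3*8 + 4*5) / 100

# Median: with n = 100, average the 50th and 51st ordered values.
# Cumulative counts show which value each ordered position holds.
cumsum(counts)   # positions 1-41 are 0s, positions 42-72 are 1s, ...
```

Since positions 50 and 51 both fall in the block of 1s, the median is 1.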

e. Calculate the sample standard deviation "by hand" and using R. Are the values
consistent between the two methods? How would our calculation differ if instead
we know that these 100 values were the whole population? [hint: use multiplication
instead of repeated addition]
My hand calculation of the sample standard deviation matches the value from R, so the
two methods are consistent. If the 100 values were the whole population instead of a
sample, a different formula would apply. The key difference is in the denominator of the
variance: for a sample the denominator is (n-1), while for a population it is n. Treating
the data as a population lowers the standard deviation by about 0.006 (from 1.158 to
1.152).
#Hand Calculation (sum of squared deviations = 132.75)
# -> sd_sample = sqrt(132.75/99) = 1.158
# -> sd_pop = sqrt(132.75/100) = 1.152

#R Calculations
# Standard deviation of sample (named sd_sample so it does not mask the sd() function)
sd_sample <- sd(Defects)
sd_sample

## [1] 1.157976

#standard deviation of population
sd_pop <- sqrt(sum((Defects - mean(Defects))^2) / length(Defects))
sd_pop

## [1] 1.152172
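Following the hint, the "by hand" standard deviation can be laid out with multiplication over the frequency table rather than repeated addition. This is a sketch of one way to organize the work; the variable names are introduced here for illustration.

```r
values <- 0:4                   # number of defects, from the table
counts <- c(41, 31, 15, 8, 5)   # number of disks, from the table
n      <- sum(counts)           # 100 disks
x_bar  <- sum(values * counts) / n

# Sum of squared deviations: each squared deviation is multiplied by
# the number of disks that share that value.
ss <- sum(counts * (values - x_bar)^2)

sd_sample_hand <- sqrt(ss / (n - 1))  # sample formula, denominator n - 1
sd_pop_hand    <- sqrt(ss / n)        # population formula, denominator n
c(sample = sd_sample_hand, population = sd_pop_hand)
```

Only the denominator differs between the two formulas, which is why the two results are so close when n = 100.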

f. Construct a boxplot for the number of errors data using R with helpful labels.
Explain how the shape of the data (identified in (c)) can be seen from the boxplot
using words such as minimum, 1st quartile, median,3rd quartile, and maximum.
The boxplot shown below matches the shape of the data from the histogram in part c. The
histogram is right skewed, and the boxplot shows this because there is no whisker on the
left side of the box: the minimum and the 1st quartile are both 0, so at least 25% of the
disks sit at the smallest possible value. The median is 1, which matches the median
calculated in part d. The 3rd quartile is 2, so at least 75% of the disks have 2 or fewer
defects. The maximum is 4, at the end of the right whisker, which is also the largest value
in the histogram. The right whisker (from 2 to 4) is much longer than the missing left
whisker, which is how the long right tail of the histogram appears in the boxplot.
# With horizontal = TRUE the data values run along the x-axis, so xlab carries the units
boxplot(Defects, main = "Boxplot of Number of Errors on Disks",
        xlab = "Number of Errors", ylab = "Disks", horizontal = TRUE)
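The five numbers that the boxplot draws can also be printed directly, which makes the description above easy to check. This sketch uses base R's `fivenum()`, the summary that `boxplot()` builds its box from; the name `five` is introduced here for illustration.

```r
Defects <- c(rep(0, 41), rep(1, 31), rep(2, 15), rep(3, 8), rep(4, 5))

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(Defects)
names(five) <- c("min", "Q1", "median", "Q3", "max")
five
```

The minimum and the 1st quartile coincide, which is why the left whisker is missing.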

g. Explain why the histogram is better able to show the discrete nature of the data than
a boxplot.
Histograms are better at showing discrete data (like this example) than boxplots. This is
because a histogram displays the frequency of each value, so you can easily see how many
disks had 0, 1, 2, 3, or 4 defects, and that no values fall in between. The boxplot does not
display this information. It summarizes the distribution with only five numbers and draws
them on a continuous scale, so the individual frequencies are hidden. For example, from
the boxplot we know that at least 75% of the disks have 2 or fewer defects, but we cannot
tell how many disks had exactly 2 defects. The histogram also shows the full shape of the
distribution, whereas the boxplot only indicates the direction and rough extent of the skew.
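The frequencies that the histogram displays, and that the boxplot hides, can be tabulated directly. This is a small sketch using base R's `table()`; the name `freq` is introduced here for illustration.

```r
Defects <- c(rep(0, 41), rep(1, 31), rep(2, 15), rep(3, 8), rep(4, 5))

freq <- table(Defects)   # count of disks at each number of defects
freq
prop.table(freq)         # the same counts as proportions of the 100 disks
```

None of these counts can be recovered from the five numbers a boxplot shows.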
