
STAT 1000 - Assignment 2

Tonqun Lado (007999730)

2023-10-18

Instructions
To properly view the assignment questions, knit this file to .PDF and view the output.
To enter the R-based questions, add code as needed into the R code chunks given below the question, and,
where applicable, replace the “Delete me; . . . ” with your own text response. Be sure when adding in
text responses to never copy-paste symbols from outside of the document. Only use the symbols on your
keyboard. Do not delete the question text, or modify any other part of the code except for the “author” in
Line 3. All numerical and graphical answers must be done using R, unless stated otherwise.
You will have a link in your email that takes you to the Crowdmark submission page. Once you have
completed the worksheet, knit it to .PDF and upload your output to Crowdmark. Also, upload your .Rmd
file to Crowdmark where prompted. To see where your .Rmd file is saved, click File > Save As in the top-left
of your screen. Make sure you set your Name and Student Number in the Author section of this document
(Line 3). Do not alter the title or the date.
After you knit your assignment to PDF, check your code chunks. If your code at any point runs off the page,
find the nearest comma, click to the right of it, and press Enter (or Return if you are on a Mac). This will
force a break in the code so that it goes onto the next line. All of your code must be readable in the final
submission.
For the R-based questions, all calculations and output must be visible in the final knit PDF, and all text
responses should be in complete English sentences. Your work should be done using the same formatting,
functions, and packages as in your labs and course notes, unless otherwise specified. You may speak to
your classmates about ideas and what functions/optional arguments you may need to use but you may not
directly show your code/output to your classmates.
Several questions will require you write up your work on paper (or a tablet). Note that illegible work may
receive a mark of zero. Use lined paper if possible, and clean up your work before you submit it. You will
need to either scan your written work, screenshot / save to PDF if your work is done on a tablet, or take a
picture of your work. Ensure that your submissions are rotated correctly in Crowdmark before you submit
(any incorrectly rotated submissions may receive a grade of zero). Crowdmark will let you preview your
submission and rotate any images before you submit.
Each Question and each Part must be clearly separated. To this end, make sure that each Part is separated
by at least four lines, and that each question starts on a new page.
Whether or not it is specified, all written questions require you to show your full work and reasoning. Answers
given without proper reasoning / work may receive a grade of zero. All calculations should be given to two
decimal places, unless otherwise stated.
Your full submission is due by 11:59 p.m., 9 days after the start of your assignment. Crowdmark may allow
you to submit late, but you will be given an automatic grade of zero if you do. If you have an issue that you
can’t resolve without someone looking at your work (e.g., you get an error when knitting your document),
please see the Help Centre in 311 Machray Hall.

Questions [50 Marks]

Question 1 [8 Marks]

Part (a) [1 mark] The FHA dataset contains the Heights, Armspans, and Footlengths of a sample of
Grade 12 students across the United States. All measurements are in centimeters.
Import it below.

FHA <- read.csv("~/Documents/School/Stats/Assignment 2/FHA.csv")

Leonardo Da Vinci believed that a person’s armspan is roughly equal to their height. To investigate this,
create a scatterplot comparing Height (X) to Armspan (Y), based on the FHA dataset.
To create a scatterplot in R, we use the plot function. In particular, you will use the code plot(x, y),
where x and y are your x- and y-variables. Use the main, xlab, and ylab arguments to give meaningful
titles to the graph, the x-axis, and the y-axis.

plot(FHA$Height, FHA$Armspan, xlab = "Height (cm)", ylab = "Armspan (cm)",
     main = "Height vs Armspan")

[Figure: scatterplot "Height vs Armspan"; x-axis: Height (cm), y-axis: Armspan (cm)]

Part (b) [1 mark] Create a similar plot to compare the relationship between Footlength (X) and Armspan
(Y). Again, set meaningful titles for the graph, the x-axis, and the y-axis.

plot(FHA$Footlength, FHA$Armspan, xlab = "Foot length (cm)",
ylab = "Armspan (cm)", main = "Foot length vs Armspan")

[Figure: scatterplot "Foot length vs Armspan"; x-axis: Foot length (cm), y-axis: Armspan (cm)]

Compare these two scatterplots to each other. Do you see a linear relationship in each? If so, compare these
relationships in terms of direction and strength.
Both scatterplots show a positive linear relationship: armspan tends to increase with height and with foot
length. The relationship between Height and Armspan appears to be the stronger of the two.

Part (c) [1 mark] To calculate the correlation between two variables in R, we can use the cor function.
In particular, you will use the code cor(x, y), where x and y are your x- and y-variables.
Calculate the correlation between Height and Armspan, as well as the correlation between Footlength and
Armspan.

cor(FHA$Height, FHA$Armspan)

## [1] 0.7899087

cor(FHA$Footlength, FHA$Armspan)

## [1] 0.6132733

Do the values of the correlation you calculated above confirm your suspicion from Part (b)?

Both correlations are positive, which confirms the positive direction seen in the scatterplots. The correlation
between Height and Armspan (0.79) indicates a fairly strong linear relationship, while the correlation between
Footlength and Armspan (0.61) indicates a moderate one, so the relationship with Height is the stronger of the two.

Part (d) [1 mark] Below, calculate the r2 between Height and Armspan, as well as the r2 between
Footlength and Armspan.

cor(FHA$Height, FHA$Armspan)^2

## [1] 0.6239557

cor(FHA$Footlength, FHA$Armspan)^2

## [1] 0.3761041

Provide an interpretation of each of these terms below:


62% of the variation in armspan is accounted for by its regression on height. 38% of the variation in armspan
is accounted for by its regression on footlength.

Part (e) [1 mark] To find the least squares regression equation between two variables in R, we use the
function lm (which stands for linear model). In particular, the code used is lm(y ~ x), where x and y are
your x- and y-variables.
Use lm to find the least-squares regression equation for predicting Armspan (Y) from Height (X). Also, find
the least-squares regression equation for predicting Armspan (Y) from Footlength (X).

lm(FHA$Armspan ~ FHA$Height)

##
## Call:
## lm(formula = FHA$Armspan ~ FHA$Height)
##
## Coefficients:
## (Intercept) FHA$Height
## 0.1431 0.9954

lm(FHA$Armspan ~ FHA$Footlength)

##
## Call:
## lm(formula = FHA$Armspan ~ FHA$Footlength)
##
## Coefficients:
## (Intercept) FHA$Footlength
## 92.254 3.109

Part (f) [1 mark] Use the regression lines calculated above to calculate the predicted armspan of a student
with a height of 170cm, as well as the predicted armspan of a student with a foot length of 25cm.

0.1431 + 0.9954*170

## [1] 169.3611

92.254 + 3.109*25

## [1] 169.979
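
The calculations above plug the rounded coefficients printed by lm into the regression equation. As a minimal sketch, the same kind of prediction can be obtained from a fitted model with predict. The data frame below is a made-up stand-in for FHA (the real file is not reproduced here), and the names fha_demo, fit, and new_student are illustrative assumptions.

```r
# Made-up stand-in for the FHA data; these values are NOT from the real file.
fha_demo <- data.frame(
  Height  = c(150, 160, 170, 180, 190),
  Armspan = c(149, 162, 168, 181, 192)
)

# Fit Armspan (Y) on Height (X); data = lets the formula use bare column names.
fit <- lm(Armspan ~ Height, data = fha_demo)

# Predicted armspan for a hypothetical student who is 170 cm tall.
new_student <- data.frame(Height = 170)
pred <- predict(fit, newdata = new_student)

# The same value computed from the unrounded intercept and slope.
manual <- unname(coef(fit)[1] + coef(fit)[2] * 170)
print(c(predict = unname(pred), manual = manual))
```

Working from the stored coefficients avoids the small rounding differences that come from retyping the printed values.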

Part (g) [1 mark] Give an interpretation of the slopes of each of these regression lines.
For each additional 1 cm of height, the predicted armspan increases by about 1.00 cm (slope 0.9954). For
each additional 1 cm of foot length, the predicted armspan increases by about 3.11 cm (slope 3.109).

Part (h) [1 mark] Below, save the linear models created above as objects in R.

lm.HvA <- lm(FHA$Armspan ~ FHA$Height)


lm.FvA <- lm(FHA$Armspan ~ FHA$Footlength)

With the linear models saved as objects, you can use the abline function to print the lines on top of a
scatterplot. For example, after calling plot, you can enter abline(myLM), where myLM is the name of your
linear model object and it will print the regression line on top of the scatterplot.
Recreate the scatterplots you created earlier. Use abline to print the regression lines on top of them. Add lwd
= 4 as an argument to abline to widen the line, to make it more readable. Use the col argument in abline
to set the colour to tomato2.

plot(FHA$Height, FHA$Armspan, xlab = "Height (cm)", ylab = "Armspan (cm)",
     main = "Height vs Armspan")
abline(lm.HvA, col = "tomato2", lwd = 4)

[Figure: scatterplot "Height vs Armspan" with the regression line overlaid; x-axis: Height (cm), y-axis: Armspan (cm)]

plot(FHA$Footlength, FHA$Armspan, xlab = "Foot Length (cm)",
     ylab = "Armspan (cm)", main = "Foot Length vs Armspan")
abline(lm.FvA, col = "tomato2", lwd = 4)

[Figure: scatterplot "Foot Length vs Armspan" with the regression line overlaid; x-axis: Foot Length (cm), y-axis: Armspan (cm)]

Question 2 [4 marks]

Part (a) [1 mark] Import the AQ2 dataset. This dataset contains samples of 827 concentrations for
various chemicals, measured by sensors in a large city. Two of the variables included are CO, measuring the
concentration of Carbon Monoxide in grams per cubic meter, and Benzene, measuring the concentration of
Benzene in micrograms per cubic meter.

AQ2 <- read.csv("~/Documents/School/Stats/Assignment 2/AQ2.csv")

Make a scatterplot comparing the CO (X) and the Benzene (Y) variables. Use xlab, ylab and main to set
the axis labels and titles of the graph.

plot(AQ2$CO, AQ2$Benzene, xlab = "CO (grams per cubic meter)",
     ylab = "Benzene (micrograms per cubic meter)",
     main = "Concentration of CO vs Benzene")

[Figure: scatterplot of CO concentration (x-axis, grams per cubic meter) against Benzene concentration (y-axis, micrograms per cubic meter)]

Part (b) [1 mark] Use the lm function to determine the linear regression equation for comparing the CO
(X) and the Benzene (Y) variables.

lm(AQ2$CO ~ AQ2$Benzene)

##
## Call:
## lm(formula = AQ2$CO ~ AQ2$Benzene)

##
## Coefficients:
## (Intercept) AQ2$Benzene
## 0.3629 0.1848
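
In lm, the variable to the left of ~ is the response (Y) and the variable to the right is the explanatory variable (X), so a model for predicting Benzene from CO would be written with Benzene on the left. The sketch below uses made-up vectors x and y (not the AQ2 data) to show that swapping the two sides gives a different line; the object names are illustrative assumptions.

```r
# Made-up data for illustration only (not from AQ2).
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit_y_on_x <- lm(y ~ x)  # predicts y from x
fit_x_on_y <- lm(x ~ y)  # predicts x from y: a different line

slope_yx <- unname(coef(fit_y_on_x)[2])
slope_xy <- unname(coef(fit_x_on_y)[2])

# The product of the two slopes equals r^2, so the two lines
# coincide only when the correlation is exactly 1 or -1.
print(c(slope_yx = slope_yx, slope_xy = slope_xy, r_squared = cor(x, y)^2))
```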

Part (c) [1 mark] Use the regression equation above to calculate the predicted Benzene concentration
when the CO concentration is 11 grams per cubic meter.

0.3629+0.1848*11

## [1] 2.3957

Part (d) [1 mark] Recreate the scatterplot from Part (a), and use abline to overlay the regression line
on top of the scatterplot.

plot(AQ2$CO, AQ2$Benzene, xlab = "CO (grams per cubic meter)",
     ylab = "Benzene (micrograms per cubic meter)",
     main = "Concentration of CO vs Benzene")
abline(lm(AQ2$Benzene ~ AQ2$CO), col = "tomato2", lwd = 4)

[Figure: scatterplot of CO concentration (x-axis) against Benzene concentration (y-axis) with the regression line overlaid]

Question 3 [15 Marks]

Part (a) [1 mark] Follow This link to access recent weather statistics for Winnipeg. Enter the daily min
temperatures and the daily max temperatures for July 1st through July 5th into R, as vectors. You may
have to click the “10x” at the top of the page to access the data range.
Make sure that you enter the data values in order, starting from July 1st and ending at July 5th.

DailyMinimum <- c(13.1, 17.8, 14.0, 9.9, 6.4)
DailyMaximum <- c(30.4, 31.3, 24.6, 21.4, 21.2)

Part (b) [2 marks] Before making any calculations, do you expect the correlation between daily minimum
temperature and daily maximum temperature to be positive or negative? Why?
The correlation between daily minimum temperature and daily maximum temperature should be positive, because
days with a warmer minimum tend to have a warmer maximum as well, and days with a cooler minimum tend to
have a cooler maximum.

Part (c) [1 mark] Use R to calculate the means and standard deviations of the daily min temperatures and
the daily max temperatures. Also, calculate the correlation between the daily minimum and daily maximum
temperatures.

mean(DailyMinimum)

## [1] 12.24

mean(DailyMaximum)

## [1] 25.78

sd(DailyMinimum)

## [1] 4.3108

sd(DailyMaximum)

## [1] 4.831356

cor(DailyMinimum, DailyMaximum)

## [1] 0.8352615

Part (d) [2 marks] Confirm the R output by calculating the correlation by hand. You can use the R
outputs for the means and standard deviations, but the rest must be done by hand.
Do this work separately on paper. Reference the assignment instructions for details on
how to format your work. Show all of your work for all written questions.
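
As a sketch of what the hand calculation might look like, take x to be the daily minimum and y the daily maximum, use the means and standard deviations from Part (c), and apply the usual formula for the sample correlation. Intermediate values are rounded.

```latex
\begin{aligned}
\sum_{i=1}^{5}(x_i-\bar{x})(y_i-\bar{y})
 &= (0.86)(4.62) + (5.56)(5.52) + (1.76)(-1.18) + (-2.34)(-4.38) + (-5.84)(-4.58) \\
 &= 3.9732 + 30.6912 - 2.0768 + 10.2492 + 26.7472 = 69.584, \\
r &= \frac{1}{n-1}\cdot\frac{\sum_{i}(x_i-\bar{x})(y_i-\bar{y})}{s_x\, s_y}
   = \frac{69.584}{4 \times 4.3108 \times 4.8314} \approx 0.84 .
\end{aligned}
```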

Part (e) [3 marks] Using the calculations from Part (c), calculate the least-squares regression equation
for predicting Daily Max temperature from Daily Min temperature. Show all of your work.
Do this work separately on paper.
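
One way to check the paper calculation is to apply the standard least-squares formulas, slope = r * s_y / s_x and intercept = ybar - slope * xbar, to the summary statistics reported in Part (c). The sketch below does this in R; the variable names are illustrative assumptions.

```r
# Summary statistics reported in Part (c).
mean_min <- 12.24
sd_min   <- 4.3108
mean_max <- 25.78
sd_max   <- 4.831356
r        <- 0.8352615

# Least-squares slope and intercept from the summary statistics:
# b1 = r * s_y / s_x, b0 = ybar - b1 * xbar
b1 <- r * sd_max / sd_min
b0 <- mean_max - b1 * mean_min

print(round(c(intercept = b0, slope = b1), 2))
```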

Part (f) [1 mark] Confirm your results in Part (e) by using lm to calculate the least-squares regression
equation. Make sure that you save your linear model as an object, and then type its name to print it out.

temperatures <- lm(DailyMaximum ~ DailyMinimum)
temperatures

##
## Call:
## lm(formula = DailyMaximum ~ DailyMinimum)
##
## Coefficients:
## (Intercept) DailyMinimum
## 14.3218 0.9361

Part (g) [2 marks] Calculate the residual for the July 3rd data point.
Do this work separately on paper.
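
A sketch of the residual calculation for July 3rd (minimum 14.0, maximum 24.6), using the coefficients printed in Part (f) and the definition residual = observed minus predicted:

```latex
\begin{aligned}
\hat{y} &= 14.3218 + 0.9361 \times 14.0 = 27.4272, \\
\text{residual} &= y - \hat{y} = 24.6 - 27.4272 \approx -2.83 .
\end{aligned}
```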

Part (h) [1 mark] In R, once you have created a linear model, you can access the residuals of each data
point by entering myLM$residuals, where myLM is the name of your linear model.
Use the technique explained above to confirm your result in Part (g) by using the linear model calculated in
Part (f).

temperatures$residuals

## 1 2 3 4 5
## 3.8149330 0.3151483 -2.8275790 -2.1894689 0.8869666

Part (i) [2 marks] Use the regression line to calculate the predicted maximum temperatures for two
days with minimum temperatures of 11C and -5C. Is one of these predictions more reliable than the other?
Explain.
Do this work separately on paper.
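
As a rough check on the paper calculation, the sketch below refits the model from the Part (a) data, predicts the maximum temperature for minimums of 11C and -5C, and flags any value that lies outside the range of observed minimums (a prediction there is an extrapolation and is generally less reliable). The names fit, new_days, and outside are illustrative assumptions.

```r
DailyMinimum <- c(13.1, 17.8, 14.0, 9.9, 6.4)
DailyMaximum <- c(30.4, 31.3, 24.6, 21.4, 21.2)
fit <- lm(DailyMaximum ~ DailyMinimum)

# Two new days with minimum temperatures of 11C and -5C.
new_days <- data.frame(DailyMinimum = c(11, -5))
pred <- predict(fit, newdata = new_days)

# Flag values outside the observed range of minimum temperatures (extrapolation).
outside <- new_days$DailyMinimum < min(DailyMinimum) |
  new_days$DailyMinimum > max(DailyMinimum)

print(data.frame(new_days, predicted_max = round(pred, 2), extrapolation = outside))
```

The observed minimums run from 6.4C to 17.8C, so 11C falls inside the data while -5C falls well outside it.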

Question 4 [11 marks]

Part (a) [1 mark] The WeatherAP4 dataset contains weather data on a sample of 50 days in Winnipeg.
The measured variables are Average Hourly Temperature (measured in degrees Celsius), and Air Pressure
(measured in kPa, i.e., kiloPascals).
The least-squares regression equation for predicting Air Pressure from Temperature is given below:

ŷ = 102.15 − 0.0407x

We also find that 37.54% of the variance in Air Pressure can be explained by the regression on Temperature.
Import this dataset below as WeatherAP.

WeatherAP4 <- read.csv("~/Documents/School/Stats/Assignment 2/WeatherAP4.csv")

Part (b) [3 marks] Provide an interpretation of the slope, as well as an interpretation of the intercept.
Slope: for each 1C increase in temperature, the predicted air pressure decreases by 0.0407 kPa. Intercept:
when the temperature is 0C, the predicted air pressure is 102.15 kPa.

Part (c) [1 mark] Calculate the predicted air pressure when Temperature is 15C.
Do this work separately on paper.
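
A sketch of the calculation, substituting x = 15 into the regression equation given in Part (a):

```latex
\hat{y} = 102.15 - 0.0407 \times 15 = 102.15 - 0.6105 = 101.5395 \approx 101.54 \text{ kPa}
```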

Part (d) [2 marks] Using only the information given above, and without using R, what is the correlation
between Air Pressure and Temperature for this sample?
Do this work separately on paper.
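
A sketch of the reasoning: the proportion of variance explained is r squared, and the correlation takes the same sign as the slope of the regression line, which is negative here.

```latex
r^2 = 0.3754
\quad\Longrightarrow\quad
r = -\sqrt{0.3754} \approx -0.61
```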

Part (e) [1 marks] Suppose that we wish to convert the temperature from Celsius to Fahrenheit (re-
member that F = 32 + 1.8 × C). Without making any calculations, explain whether the correlation would
increase, decrease, or stay the same.
After changing the temperature from Celsius to Fahrenheit, the correlation will stay the same, because
correlation is not affected by a linear change of units (adding a constant and multiplying by a positive constant).
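
A minimal sketch of this property, using made-up temperature and pressure values rather than the WeatherAP4 data (the vector names are illustrative assumptions):

```r
# Made-up Celsius temperatures and pressures, for illustration only.
temp_c   <- c(-10, -2, 5, 12, 20, 27)
pressure <- c(102.6, 102.1, 102.0, 101.5, 101.4, 101.0)

# Same conversion as in the question: F = 32 + 1.8 * C
temp_f <- 32 + 1.8 * temp_c

r_celsius    <- cor(temp_c, pressure)
r_fahrenheit <- cor(temp_f, pressure)
print(c(celsius = r_celsius, fahrenheit = r_fahrenheit))
```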

Part (f) [1 mark] Below, change the Temperature measurements in WeatherAP to Fahrenheit. To change
a variable x of Celsius measurements in a dataset called DATA to Fahrenheit in R, you would enter DATA$x
<- 32 + 1.8*DATA$x

WeatherAP4$Temp <- 32 + 1.8*WeatherAP4$Temp

Part (g) [1 mark] Use the new vector of Fahrenheit measurements to recalculate the correlation, as well
as the slope and intercept. Compare this to your expectations from Part (e).

cor(WeatherAP4$Temp, WeatherAP4$AP)

## [1] -0.6127113

lm(WeatherAP4$AP ~ WeatherAP4$Temp)

##
## Call:
## lm(formula = WeatherAP4$AP ~ WeatherAP4$Temp)
##
## Coefficients:
## (Intercept) WeatherAP4$Temp
## 102.8717 -0.0226

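The correlation is -0.61, the same value implied by the Celsius results (r squared of 0.3754 with a negative
slope), so it is unchanged, as expected from Part (e). The slope and intercept do change with the units: the
slope becomes -0.0226 (the Celsius slope divided by 1.8) and the intercept becomes 102.87.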
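
As a sketch of why the coefficients change, substitute C = (F - 32)/1.8 into the Celsius equation from Part (a). The result agrees with the R output up to rounding of the given coefficients.

```latex
\begin{aligned}
\hat{y} &= 102.15 - 0.0407\,C, \qquad C = \frac{F - 32}{1.8} \\
\hat{y} &= 102.15 - 0.0407\cdot\frac{F-32}{1.8}
        = \left(102.15 + \frac{0.0407 \times 32}{1.8}\right) - \frac{0.0407}{1.8}\,F \\
        &\approx 102.87 - 0.0226\,F .
\end{aligned}
```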

Part (h) [1 mark] Previously, we calculated the predicted pressure for a temperature of 15C. Using the
relationship between Fahrenheit and Celsius, we can find that a temperature of 15C is equal to 59F. Use this
new regression line in terms of Fahrenheit to find the predicted air pressure for a temperature of 59F. What
do you notice when you compare this answer to your predicted air pressure in Part (c)?
Do this work separately on paper.
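
A sketch of the comparison, using the Fahrenheit coefficients printed in Part (g) and the Celsius equation from Part (a):

```latex
\begin{aligned}
\text{Fahrenheit line at } 59\text{F}: \quad \hat{y} &= 102.8717 - 0.0226 \times 59 = 101.5383 \approx 101.54 \text{ kPa} \\
\text{Celsius line at } 15\text{C}: \quad \hat{y} &= 102.15 - 0.0407 \times 15 = 101.5395 \approx 101.54 \text{ kPa}
\end{aligned}
```

To two decimal places the two predictions agree, since both lines describe the same fitted relationship in different units.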

Question 5 [6 Marks]

Below is a sample of temperature measurements from Winnipeg, Manitoba and Regina, Saskatchewan across
13 days. Measurements are in degrees Celsius. Run the code below to load these data into your R Session
(code can be run by pressing the green arrow in the top-right of the code chunk).

WeatherWPG <- c(0.3, -2.5, -12.4, -1.2, -8.5, 4.5, -2.4, -0.7, 1.2, -12.7, 1.3, -10.1, -6.8)
WeatherREG <- c(-0.5, -1.2, -14.9, -3.6, -7.6, 11, -1.7, -2.7, 3.5, -21.9, -0.4, -15.2, -7.6)

Part (a) [2 marks] Make a scatterplot of these datasets, with Winnipeg on the x-axis and Regina on the
y-axis. Use xlab, ylab, and main to set appropriate titles and axis labels.

plot(WeatherWPG, WeatherREG, xlab = "Temperature of Winnipeg (Celsius)",
     ylab = "Temperature of Regina (Celsius)", main = "Winnipeg vs Regina")

[Figure: scatterplot "Winnipeg vs Regina"; x-axis: Temperature of Winnipeg (Celsius), y-axis: Temperature of Regina (Celsius)]

Part (b) [2 marks] Use lm to create a linear model for predicting Regina’s temperatures from Winnipeg’s
temperatures. Save this linear model as an object and use it with abline to add the regression line to the
plot created earlier (note: you will have to copy your code from Part (a) to recreate the initial plot). Add
the lwd = 2 argument to abline, and use col to help distinguish the line from the plot.

temp <- lm(WeatherREG ~ WeatherWPG)
temp

##
## Call:
## lm(formula = WeatherREG ~ WeatherWPG)
##
## Coefficients:
## (Intercept) WeatherWPG
## 0.802 1.465

plot(WeatherWPG, WeatherREG, xlab = "Temperature of Winnipeg (Celsius)",
     ylab = "Temperature of Regina (Celsius)", main = "Winnipeg vs Regina")
abline(temp, col = "tomato2", lwd = 2)

[Figure: scatterplot "Winnipeg vs Regina" with the regression line overlaid; x-axis: Temperature of Winnipeg (Celsius), y-axis: Temperature of Regina (Celsius)]

Part (c) [2 marks] Two new days of observation are to be added to this dataset. One day (January 27,
2019) reads a temperature of -24.9C for Winnipeg and -3.5C for Regina. Another day (January 7, 2017)
reads a temperature of -10.8C for Winnipeg and -15.4C for Regina. Using the plot and line above, do you
expect either of these two observations to be influential? Explain your reasoning.
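The observation from January 27, 2019 is likely to be influential. Its Winnipeg temperature of -24.9C is an
outlier in the x-direction (the other Winnipeg values range from -12.7C to 4.5C), and its Regina temperature
of -3.5C lies far above the value the line predicts there (about -35.7C), so it would pull the line toward itself.
The observation from January 7, 2017 is not likely to be influential: -10.8C is within the range of the other
Winnipeg values, and -15.4C is close to the line's prediction of about -15.0C.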
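
One way to check this reasoning numerically is to refit the line with each new day added and compare the slopes. The sketch below does this with the data given above; the helper name slope_with is an illustrative assumption, not part of the assignment.

```r
WeatherWPG <- c(0.3, -2.5, -12.4, -1.2, -8.5, 4.5, -2.4,
                -0.7, 1.2, -12.7, 1.3, -10.1, -6.8)
WeatherREG <- c(-0.5, -1.2, -14.9, -3.6, -7.6, 11, -1.7,
                -2.7, 3.5, -21.9, -0.4, -15.2, -7.6)

# Hypothetical helper: slope of Regina on Winnipeg after appending one new day.
# Called with no arguments, it returns the slope for the original 13 days.
slope_with <- function(wpg_new = NULL, reg_new = NULL) {
  x <- c(WeatherWPG, wpg_new)
  y <- c(WeatherREG, reg_new)
  unname(coef(lm(y ~ x))[2])
}

slope_original <- slope_with()
slope_jan27    <- slope_with(-24.9, -3.5)   # January 27, 2019
slope_jan7     <- slope_with(-10.8, -15.4)  # January 7, 2017

print(round(c(original = slope_original,
              with_jan27 = slope_jan27,
              with_jan7 = slope_jan7), 3))
```

A large change in slope after adding a single point is what marks that point as influential.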

Question 6 [6 marks]

The manager of an apartment building would like to conduct a survey of tenants. There are eight floors in
the building, each with ten apartments. The apartments are numbered as follows:

Floor 1: 101 102 103 104 105 106 107 108 109 110
Floor 2: 201 202 203 204 205 206 207 208 209 210
Floor 3: 301 302 303 304 305 306 307 308 309 310
Floor 4: 401 402 403 404 405 406 407 408 409 410
Floor 5: 501 502 503 504 505 506 507 508 509 510
Floor 6: 601 602 603 604 605 606 607 608 609 610
Floor 7: 701 702 703 704 705 706 707 708 709 710
Floor 8: 801 802 803 804 805 806 807 808 809 810

The manager would like to take a sample of 16 apartments. What type of sample is obtained using each of
the following sampling methods?

Part (a) [1 mark] The manager randomly selects one of the first five apartments listed above, then sends
the survey to that apartment, and every fifth apartment after that on the list.
This is an example of systematic sampling.

Part (b) [1 mark] The manager stands in front of the apartment building one Saturday afternoon and
gives the survey to the first 16 tenants she sees entering the building.
This is an example of convenience sampling.

Part (c) [1 mark] The manager randomly selects two apartments on each floor of the building and delivers
the survey to the selected tenants.
This is an example of stratified sampling.

Part (d) [1 mark] The manager randomly selects four floors of the apartment building, and for each
selected floor, she randomly selects four apartments. The surveys are sent to the selected tenants.
This is an example of multistage sampling.

Part (e) [1 mark] Explain why none of the above procedures produces a simple random sample. (You
do not need to provide a separate explanation for all four procedures; just explain what a simple random
sample guarantees that none of these procedures achieve.)
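None of these procedures produces a simple random sample, because a simple random sample guarantees that
every possible group of 16 apartments has the same chance of being selected. Each procedure above makes some
groups impossible or more likely than others; for example, the stratified sample in Part (c) can never include
three apartments from the same floor.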
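
To give a sense of scale, the sketch below counts how many different samples of 16 apartments each random procedure can produce, compared with the number of groups of 16 that a simple random sample draws from. The convenience sample in Part (b) is left out because it involves no random selection; the variable names are illustrative assumptions.

```r
n_srs        <- choose(80, 16)                  # any 16 of the 80 apartments
n_systematic <- 5                               # one sample per random starting point
n_stratified <- choose(10, 2)^8                 # 2 of 10 apartments on each of 8 floors
n_multistage <- choose(8, 4) * choose(10, 4)^4  # 4 of 8 floors, then 4 of 10 on each

print(c(srs = n_srs, systematic = n_systematic,
        stratified = n_stratified, multistage = n_multistage))
```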

Part (f) [1 mark] In R, we can use the sample function to take a random sample from a set of numbers.
In particular, the syntax is sample(1:N, n), where N is your total number of observations you’re sampling
from, and n is the size of the sample you wish to take.
Assuming that the apartments are numbered from 1 to 80 (e.g., a sample of 5, 31, 56 would mean apartments
105, 401, and 606), use the sample function to take a sample of 16 apartments.

sample(1:80, 16)

## [1] 31 80 76 61 14 66 27 52 62 68 72 5 48 53 3 33
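
As a sketch of how the sampled numbers could be turned back into apartment numbers, the helper below assumes the apartments are counted floor by floor in the order listed (1 to 10 on Floor 1, 11 to 20 on Floor 2, and so on). The function name to_apartment and the seed are illustrative assumptions.

```r
# Assumed numbering: 1-10 are Floor 1 (101-110), 11-20 are Floor 2 (201-210), and so on.
to_apartment <- function(k) {
  floor_number <- (k - 1) %/% 10 + 1
  unit_number  <- (k - 1) %% 10 + 1
  floor_number * 100 + unit_number
}

set.seed(1000)  # arbitrary seed so the draw can be reproduced
picked <- sample(1:80, 16)
apartments <- to_apartment(picked)
print(sort(apartments))
```

Because sample draws without replacement by default, the 16 apartments are all distinct.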
