
Chapter 4

STATS 10 Introduction to Statistical Reasoning


Maria Cha
Review
• In Chapter 2, we talked about how we graphically
summarize numerical / categorical data.

• In Chapter 3, we talked about how to describe


ONE numerical variable with center and spread
when its distribution is symmetric or skewed.

• In Chapter 4, we will talk about the linear


relationship between TWO numerical variables
with 1) a new plot and 2) a new measure and 3) a
linear equation.

2
Scatterplot

3
Scatterplot
• Scatterplots are the best way to start observing the
relationship and the ideal way to picture associations
between two numerical variables.
• In a scatterplot, you can see patterns, trends,
relationships, and even the occasional extraordinary
value sitting apart from the others.

• The variable on the x-axis is called the explanatory
(independent) variable, and the variable on the y-axis
is called the response (dependent) variable.

• When looking at scatterplots, we look for direction,


shape, strength, and unusual features.
4
Scatterplot - Direction
• A pattern that runs from lower left to upper right is
said to have a positive direction (as x increases, y
increases as well).
• A trend running from upper left to lower right has a
negative direction (as x increases, y decreases).

5
Scatterplot - Shape
• If the points appear as a cloud or swarm of
points stretched out in a generally consistent,
straight form, the form of the relationship is
linear.

• Otherwise we categorize the relationship as


non-linear.

6
Scatterplot - Shape
• Linear shape?

A. B.

C. D.

7
Scatterplot - Strength
• If there does not appear to be much scatter, there
is a strong relationship between the two variables.
• If there appears to be some scatter, there is a weak
relationship between the two variables.
• If there appears to be lots of scatter, there is no
relationship between the two variables.

8
Exercise
• Interpret the scatterplot in terms of shape,
direction, and strength.

9
Correlation Coefficient
• The correlation coefficient (𝑟) (also called
Pearson’s correlation coefficient) gives us a
numerical measurement of the strength of the
linear relationship between the explanatory and
response variables.

• Formula (in terms of z-scores of 𝑥 and 𝑦):

𝑟 = (1/(𝑛−1)) Σᵢ [(𝑥ᵢ − 𝑥̄)/𝑠ₓ][(𝑦ᵢ − 𝑦̄)/𝑠ᵧ]
10
Correlation Coefficient : Properties
• The sign of a correlation coefficient gives the
direction of the association.

• Correlation is always between -1 and +1


• Correlation can be exactly equal to -1 or +1, but
these values are unusual in real data because
they mean that all the data fall exactly on a
single straight line (perfect linear relationship)
• A correlation near zero corresponds to a weak
linear association.

11
Correlation Coefficient : Properties

12
Correlation Coefficient : Properties
• Correlation has no units.
• Whether arm length and height are measured in
centimeters or inches, the correlation coefficient is
still the same.
• Correlation measures the strength of the “linear”
association between the two numerical variables.

13
Correlation Coefficient : Properties
• Correlation is sensitive to outliers. A single
outlying value can make a small correlation
large, or make a large one small.

14
Correlation vs. Causation
• Mexican lemon imports prevent highway deaths?

15
Correlation vs. Causation
• A high correlation between two variables does
NOT imply causation.

16
Exercise
Suppose that we have 𝑥 and 𝑦 as shown below.
Calculate and interpret their correlation coefficient.
Note that the mean of 𝑥 (𝑥̄) is 3 and the SD of 𝑥 (𝑠ₓ)
is 1.83, while the mean of 𝑦 (𝑦̄) is 4 and the SD of 𝑦
(𝑠ᵧ) is 2.58.

𝑖   𝑥ᵢ   𝑦ᵢ
1    1    1
2    2    3
3    4    5
4    5    7

𝑥̄ = 3,  𝑦̄ = 4,  𝑠ₓ = 1.83,  𝑠ᵧ = 2.58

17
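The computation above can be checked with a short, plain-Python sketch of the correlation formula (the helper name `pearson_r` is ours, not from the slides):

```python
# Pearson correlation for the four (x, y) pairs from the exercise.
# Uses n-1 in both the SDs and the final average, matching the chapter's formula.

def pearson_r(xs, ys):
    """Sample correlation coefficient r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sx = (sum((x - mean_x) ** 2 for x in xs) / (n - 1)) ** 0.5
    sy = (sum((y - mean_y) ** 2 for y in ys) / (n - 1)) ** 0.5
    # Sum of products of z-scores, divided by n-1
    return sum((x - mean_x) / sx * ((y - mean_y) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

x = [1, 2, 4, 5]
y = [1, 3, 5, 7]
print(round(pearson_r(x, y), 2))  # → 0.99: a very strong positive linear association
```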
Modeling Linear Trends
• Correlation says “there seems to be a linear
association between these two variables,” but it
doesn't tell us what that association is.

• We want to answer questions like:


• How much more do people tend to weigh for
each additional inch in height?
• Can we predict how much space a book will take
on a bookshelf by knowing how many pages are
in the book?

18
Modeling Linear Trends
• Modeling linear trends with a linear equation:
• If the model fits the data well, the model can
describe the data and be used to make predictions
about the real-world situation.
• The line that describes the relationship between 𝑥 and
𝑦 is called the regression line.

19
Regression Line
• It is a tool for making predictions about future
observed values.
• It provides us with a useful way of summarizing
a linear relationship.

• We can have many possible regression lines for
the data, but we want to find the line that best
describes and predicts the data. How do we find
the best regression line?
E.g., suppose we have three candidate regression lines:
- we want the deviations between the line and the points to be small

20
Residual
• The linear model won't be perfect, regardless of the
line we draw.
• Some points will be above the line and some will be
below the line.
• The estimate made from a linear model is the
predicted value, denoted as 𝑦̂ (“y-hat”).
• The difference between the observed value and its
associated predicted value is called the residual.

• To find the residuals, we always subtract the


predicted value from the observed one:
residual = observed - predicted

21
Residual
• Blue dots : observed (actual) 𝑦 values
• Red dots : predicted 𝑦 values (𝑦̂)
• Residual : 𝑦 − 𝑦̂

22
Residual
• A negative residual
means the
predicted value is
bigger than the
actual value (an
overestimate).
• A positive residual
means the
predicted value is
smaller than the
actual value (an
underestimate).
23
Finding the “Best” Regression Line
• Ultimately, we want to find the best-fitted model,
which minimizes the residuals.
• We need a good ‘measure’ of the overall amount of residuals:
• 1) Sum of residuals? : Some residuals are positive,
others are negative; on average, they cancel each
other out. – Not a good way!
• 2) Sum of squared residuals : Similar to what we
did with deviations, we square the residuals and
add the squares.
• The smaller the sum, the better the fit.

• The line of best fit is the line for which the sum of the
squared residuals is smallest.
24
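The sum-of-squared-residuals measure can be sketched in a few lines of Python; the data points and the two candidate lines here are made up for illustration:

```python
# Sum of squared residuals (SSR) as a measure of how well a line fits.
# Smaller SSR = better fit. Data and lines below are hypothetical.

def ssr(xs, ys, intercept, slope):
    """Sum of squared residuals for the line y-hat = intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

x = [1, 2, 3, 4]
y = [2.1, 3.9, 6.2, 7.8]

print(ssr(x, y, 0, 2))     # line y-hat = 2x: residuals are tiny, SSR ≈ 0.10
print(ssr(x, y, 5, 0.5))   # line y-hat = 5 + 0.5x: much larger SSR, worse fit
```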
Finding the “Best” Regression Line
• Among possible regression lines, find the line that
minimizes the sum of squared residuals.
[Scatterplot of Mothers' age (x-axis, 0–50) vs. Fathers' age (y-axis, 0–60), with three candidate regression lines]
• Which one looks the best?
25
Finding the “Best” Regression Line
[The same scatterplot of Mothers' age vs. Fathers' age, with the three candidate lines]

Line Color | Linear Equation      | Sum of squared residuals
Black      | 𝑦̂ = 11.54 + 0.68𝑥    | 38905.1
Green      | 𝑦̂ = 1 + 𝑥            | 54545
Red        | 𝑦̂ = 15 + 0.6𝑥        | 47887.32
26
Regression Line
• Ultimately, we want to write the regression line
as:
𝒚̂ = 𝒊𝒏𝒕𝒆𝒓𝒄𝒆𝒑𝒕 + 𝒔𝒍𝒐𝒑𝒆 × 𝒙
by finding the intercept and slope that give the
regression line with the smallest sum of squared
residuals.

• This linear model says that our predictions from


our model follow a straight line.
• If the model is a good one, the data values will
scatter closely around it.

27
Regression Line : Slope
• The slope can be calculated using the correlation
coefficient, 𝑟, and the standard deviations of the
explanatory (independent) variable (𝒔𝒙 ) and the
response (dependent) variable (𝒔𝒚 ):

𝒔𝒍𝒐𝒑𝒆 = 𝒓 × 𝒔𝒚/𝒔𝒙

• Interpretation of the slope: For each unit


increase in 𝑥, we expect 𝑦 to increase/decrease on
average by the value of the slope.

28
Regression Line : Slope
𝒔𝒍𝒐𝒑𝒆 = 𝒓 × 𝒔𝒚/𝒔𝒙

• When 𝑟 is positive the slope will be positive and


when 𝑟 is negative the slope will be negative.
• A positive correlation and a positive slope both
indicate that as 𝑥 increases 𝑦 is expected to
increase as well.
• A negative correlation and a negative slope both
indicate that as 𝑥 increases, 𝑦 is expected to
decrease.

29
Regression Line : Slope
• Example : The mean fat content of the 30 Burger
King menu items is 23.5g with a standard deviation
of 16.4g, and the mean protein content of these
items is 17.2g with a standard deviation of 14g. If the
correlation between protein and fat contents is
0.83, calculate the slope of the linear model for
predicting fat content from protein content.

• Which of the two variables, between fat and protein,


should be considered as 𝑦?

30
Regression Line : Slope
                    Fat (𝑦)   Protein (𝑥)
Mean                 23.5       17.2
Standard deviation   16.4       14
Correlation coefficient: 0.83
• Find a slope of the linear model for predicting fat
content from protein content.
𝒔𝒍𝒐𝒑𝒆 = 𝒓 × 𝒔𝒚/𝒔𝒙 = 0.83 × 16.4/14 ≈ 0.97

• Interpret the slope in context. : For one gram increase in


protein content, we would expect the fat content to
increase on average by 0.97 grams.
31
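The slope computation on this slide, as a quick Python check:

```python
# Slope of the fat-from-protein model: slope = r * (s_y / s_x)
r, s_y, s_x = 0.83, 16.4, 14   # correlation, SD of fat (g), SD of protein (g)
slope = r * s_y / s_x
print(round(slope, 2))  # → 0.97 grams of fat per extra gram of protein
```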
Regression Line : Intercept
• Once we find the slope, we can calculate the
intercept of the equation.

𝒊𝒏𝒕𝒆𝒓𝒄𝒆𝒑𝒕 = 𝒚̄ − 𝒔𝒍𝒐𝒑𝒆 × 𝒙̄

• Note: 𝒚̄ denotes the mean of 𝑦, and 𝒙̄ denotes
the mean of 𝑥.

• Interpretation of the intercept: When 𝑥


equals 0, we expect 𝑦 to equal the intercept.

32
Regression Line : Intercept
• Example (continued) :
Fat (𝑦) Protein(𝑥)
Mean 23.5 17.2

• We calculated the slope as 0.97.

• Calculate the intercept.


𝒊𝒏𝒕𝒆𝒓𝒄𝒆𝒑𝒕 = 𝒚̄ − 𝒔𝒍𝒐𝒑𝒆 × 𝒙̄
= 23.5 − 0.97 × 17.2 ≈ 6.8

• Interpret the intercept in context. : Burger King


menu items with no protein are expected to have a fat
content of 6.8 grams.
33
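The intercept computation, continuing the same worked example in Python:

```python
# Intercept of the fat-from-protein model: intercept = y-bar − slope × x-bar
y_bar, x_bar, slope = 23.5, 17.2, 0.97   # mean fat (g), mean protein (g), slope
intercept = y_bar - slope * x_bar
print(round(intercept, 1))  # → 6.8 grams of fat predicted at 0 g of protein
```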
Regression Line : Intercept
• Sometimes the intercept by itself may not make
sense. In these cases the intercept serves only to
adjust the height of the line and is meaningless by
itself.

• The scatterplot shows


the relationship between
height and weight

• People who are


0 centimeters tall are
expected to weigh -105.0113 kilograms. This is
obviously not possible.
34
Regression Line
• Once we have found the slope and intercept of the
regression model, we can write the model as an
equation.

• For the Burger King menu items the slope was


found to be 0.97 and the intercept was found to be
6.8. Which of the below is the correct equation for
the regression model?
(a) 𝑓𝑎𝑡̂ = 6.8 + 0.97 × 𝑝𝑟𝑜𝑡𝑒𝑖𝑛
(b) 𝑝𝑟𝑜𝑡𝑒𝑖𝑛̂ = 6.8 + 0.97 × 𝑓𝑎𝑡
(c) 𝑝𝑟𝑜𝑡𝑒𝑖𝑛 = 6.8 + 0.97 × 𝑓𝑎𝑡
(d) 𝑓𝑎𝑡̂ = 0.97 + 6.8 × 𝑝𝑟𝑜𝑡𝑒𝑖𝑛

35
Regression Line

𝑓𝑎𝑡̂ = 6.8 + 0.97 × 𝑝𝑟𝑜𝑡𝑒𝑖𝑛

• If a new menu item comes with 1 g of protein, how
much fat do we expect it to have?

36
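Plugging 1 g of protein into the fitted line (the function name `predict_fat` is ours, not from the slides):

```python
# Prediction from the fitted line: fat-hat = 6.8 + 0.97 × protein
def predict_fat(protein_g):
    return 6.8 + 0.97 * protein_g

print(round(predict_fat(1), 2))  # → 7.77 grams of fat expected
```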
Exercise
• The table shows the heights and weights of some
people. We want to predict weights from height.

Height (inches) Weight (pounds)


60 105
66 140
72 185
70 145
62 120

a) Draw a scatterplot for the given data. Do you


observe any linear association?
37
Exercise
Height (inches) Weight (pounds)
60 105
66 140
72 185
70 145
62 120

b) Find the mean, SD, and correlation coefficient


of the two variables. What does the correlation
coefficient imply?

38
Exercise
Height (inches) Weight (pounds)
60 105
66 140
72 185
70 145
62 120

c) Find the best fitted line and plot the line on the
scatterplot. Interpret the slope and the intercept
in the context of the data.

39
Exercise
Height (inches) Weight (pounds)
60 105
66 140
72 185
70 145
62 120

d) Compare the predicted weights and the actual
weights. What are the residuals?

40
Exercise
Height (inches) Weight (pounds)
60 105
66 140
72 185
70 145
62 120

e) If a person's height is 68 inches, what do we
expect the person's weight to be?

41
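A worked sketch of parts (b)–(e), using only the chapter's formulas in plain Python (treat it as one possible solution, not an official answer key):

```python
# Height/weight exercise: correlation, best-fit line, and a prediction,
# computed with the formulas from this chapter (no libraries).

heights = [60, 66, 72, 70, 62]
weights = [105, 140, 185, 145, 120]
n = len(heights)

x_bar = sum(heights) / n      # 66.0 inches
y_bar = sum(weights) / n      # 139.0 pounds
s_x = (sum((x - x_bar) ** 2 for x in heights) / (n - 1)) ** 0.5
s_y = (sum((y - y_bar) ** 2 for y in weights) / (n - 1)) ** 0.5
r = sum((x - x_bar) / s_x * ((y - y_bar) / s_y)
        for x, y in zip(heights, weights)) / (n - 1)

slope = r * s_y / s_x               # chapter formula: slope = r × s_y / s_x
intercept = y_bar - slope * x_bar   # chapter formula: y-bar − slope × x-bar

print(round(r, 2))                  # → 0.94: strong positive linear association
print(round(slope, 2))              # → 5.58 pounds per extra inch of height
print(round(intercept, 2))          # → -229.08 (not meaningful by itself)
print(round(intercept + slope * 68, 1))  # part (e): → 150.2 pounds
```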
Measure of Goodness of fit
• If the correlation between 𝑥 and 𝑦 is 𝑟 = 1 or 𝑟 = −1,
then the model would predict 𝑦 perfectly, and the
residuals would all be zero. Hence, all of the variability
in the response variable would be explained by the
explanatory variable.

42
R² : Coefficient of determination
• In the Burger King menu model, 𝑟 = 0.83, which
is not perfect. In other words, protein cannot
explain 100% of fat’s variation. Then how much?

• The squared correlation (R², r-squared) gives the


percentage of the variance of 𝑦 accounted for by
the regression model, or in other words explained
by 𝑥.

• BK model : 69% (= 0.83²) of the variation in fat
content is accounted for by the model, i.e.,
explained by the protein content of the burger
43
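The 69% figure is just the square of the correlation, which takes one line to check:

```python
# R² for the Burger King model is the square of r
r = 0.83
r_squared = r ** 2
print(round(r_squared * 100, 1))  # → 68.9 (% of fat variation explained by protein)
```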
R² : Coefficient of determination
• A) R² = 0% : the
variation in y is not
explained by x at all.

• B) R² = 100% : the
variation in y is
perfectly explained by
x.

44
R² : Coefficient of determination
• C) R² = 60.5% : Some
portion (60.5%) of the
variation in y is
explained by x

45
R² : Properties
• R² = 0 means that none of the variance in y is
explained by x.
• R² = 1 means that all of the variance in y is
explained by x.
• While the correlation coefficient is between -1 and
1, R² is between 0 and 1:
0 ≤ R² ≤ 1

• We would like R² to be as close to 100% as


possible.

46
Exercise
• The correlation between weight and gas mileage of
cars is -0.96. Which of the below is the correct value
and interpretation of R² for the linear model
predicting gas mileage from weight?

a) 8% of the variation in gas mileage is explained by


the weight of the cars.
b) 92% of the variation in weight is explained by the
gas mileage of cars.
c) 92% of the variation in gas mileage is explained by
the weight of cars.
d) 96% of the variation in gas mileage is explained by
the weight of cars.
47
Exercise
• The following scatterplot shows the relationship
between number of hours students watch TV
and their score on an exam. If R² = 50%, what is
the correlation coefficient?

a) -0.71
b) 0.71
c) 0.25
d) 7.07

48
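Recovering 𝑟 from 𝑅² takes a square root; note that this only gives |𝑟|, and the sign must be read off the scatterplot's direction:

```python
# |r| = sqrt(R²); the SIGN of r comes from the direction of the scatter
import math

r_squared = 0.50
r_magnitude = math.sqrt(r_squared)
print(round(r_magnitude, 2))  # → 0.71; r = -0.71 if the trend runs downhill
```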
Residual Plot
• The linear model assumes that the relationship
between the two variables is a perfect straight
line. The residuals are the part of the data that
has not been modeled.
Data = Model + Residual
or equivalently
Residual = Data – Model.

• Residuals help us see whether the model makes


sense (fits well).
• When a regression model is appropriate (is a
good fit), nothing interesting should be left
behind.
49
Residual Plot
The double hamburger,
with a protein content of
31 grams, has a predicted
fat content of

𝑓𝑎𝑡̂ = 6.8 + 0.97 × 31 = 36.87 grams

The actual fat content
of a double hamburger is
26 grams, so the residual
is

26 − 36.87 = −10.87 grams.
50
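The same residual computation, as a one-off Python check:

```python
# Residual for the double hamburger: observed − predicted
predicted = 6.8 + 0.97 * 31   # 36.87 g of fat predicted from 31 g of protein
observed = 26                 # actual fat content in grams
residual = observed - predicted
print(round(residual, 2))  # → -10.87: the model overestimates this item's fat
```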
Residual Plot : Model Assessment
• We can check if our regression model is a good fit by
looking at what is left behind, i.e., the residuals
• Good model : The scatterplot of residuals vs. x should
show a random scatter around 0.
• Some data points will be above the regression line and
some will be below; hence some of the residuals will be
negative and some positive.
• There should be no apparent patterns in the residual plot
as x increases.

51
Residual Plot : Model Assessment
• Bad examples (not fit to the linear model)
• Fan shaped residuals: Variance of the residuals is
large when x becomes large (or vice versa)
• Curved residuals: Residuals are negative for small x,
then positive, and finally slightly negative again for
large x.

52
Extrapolation
• Do not extrapolate beyond the data, i.e., do not make a
prediction for an x value outside the range of the data –
the linear model may no longer hold outside that range
• For example, if the BK Broiler chicken sandwich has a
protein content of 75 grams, what is the predicted fat
content?

We should not use the linear


model to predict the amount
of fat in a burger with 75g of
protein since the data we used
to create the model is for
burgers with approximately 0 to
50g of protein.
53
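One way to guard against extrapolation in code is to refuse predictions outside the range of the data used to fit the model. This is a sketch, not part of the slides; the range [0, 50] grams comes from the protein values described above, and the function name is ours:

```python
# Guarding against extrapolation: only predict for x inside the fitted range.

def predict_fat(protein_g, x_min=0, x_max=50):
    """Predict fat from protein, but refuse x values outside the data's range."""
    if not (x_min <= protein_g <= x_max):
        raise ValueError(f"{protein_g} g is outside the fitted range "
                         f"[{x_min}, {x_max}]; refusing to extrapolate")
    return 6.8 + 0.97 * protein_g

print(round(predict_fat(31), 2))   # → 36.87, inside the data's range
# predict_fat(75) would raise ValueError: 75 g is beyond the observed data
```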
What can go wrong
• Do not fit a straight line to a nonlinear
relationship.
• Beware of extraordinary points (y-values that
stand off from the linear pattern or extreme x-
values).
• Do not extrapolate beyond the data – the linear
model may no longer hold outside the range of
the data.
• Do not infer that x causes y just because there is a
good linear model for their relationship –
correlation not causation.

54
