Stats10_Chapter 4
Scatterplot
• Scatterplots are the best way to start observing the
relationship and the ideal way to picture associations
between two numerical variables.
• In a scatterplot, you can see patterns, trends,
relationships, and even the occasional extraordinary
value sitting apart from the others.
Scatterplot - Shape
• If the points appear as a cloud or swarm of
points stretched out in a generally consistent,
straight form, the form of the relationship is
linear.
Scatterplot - Shape
• Linear shape?
[Four example scatterplots, labeled A through D]
Scatterplot - Strength
• If there does not appear to be a lot of scatter, there is a strong relationship between the two variables.
• If there appears to be some scatter, there is a weak relationship between the two variables.
• If there appears to be a lot of scatter, there is little or no relationship between the two variables.
Exercise
• Interpret the scatterplot in terms of shape,
direction, and strength.
Correlation Coefficient
• The correlation coefficient (𝑟) (also called
Pearson’s correlation coefficient) gives us a
numerical measurement of the strength of the
linear relationship between the explanatory and
response variables.
• Formula:
r = (1 / (n − 1)) Σ z_x z_y , where z_x = (x − x̄) / s_x and z_y = (y − ȳ) / s_y
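As a rough Python sketch (not part of the original slides), the z-score form of the formula can be computed directly for a small data set; the numbers below are placeholders:

    from math import sqrt

    def pearson_r(x, y):
        # Sample means and sample standard deviations (dividing by n - 1).
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        s_x = sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
        s_y = sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))
        # z-score form: r = (1 / (n - 1)) * sum of z_x * z_y
        return sum(((xi - mean_x) / s_x) * ((yi - mean_y) / s_y)
                   for xi, yi in zip(x, y)) / (n - 1)

    print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # about 1.0 for a perfectly linear pattern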
Correlation Coefficient : Properties
• The sign of a correlation coefficient gives the
direction of the association.
Correlation Coefficient : Properties
• Correlation has no units.
• Whether arm length and height are measured in centimeters or in inches, the correlation coefficient is the same.
• Correlation measures the strength of the “linear”
association between the two numerical variables.
Correlation Coefficient : Properties
• Correlation is sensitive to outliers. A single outlying value can make a small correlation large or make a large one small.
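A quick illustration in Python (made-up numbers, not from the slides) of how one outlying point can drag a strong correlation down:

    import numpy as np

    x = [1, 2, 3, 4, 5]
    y = [2.0, 4.1, 5.9, 8.2, 9.9]       # nearly linear: r is close to 1
    x_with_outlier = x + [6]
    y_with_outlier = y + [0.0]          # one extreme point added

    print(np.corrcoef(x, y)[0, 1])                            # strong positive correlation
    print(np.corrcoef(x_with_outlier, y_with_outlier)[0, 1])  # much weaker after the outlier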
Correlation vs. Causation
• Mexican lemon imports prevent highway deaths?
Correlation vs. Causation
• A high correlation between two variables does
NOT imply causation.
Exercise
Suppose that we have x and y as shown in the table below. Calculate and interpret their correlation coefficient. Note that the mean of x (x̄) is 3 and the SD of x (s_x) is 1.83, while the mean of y (ȳ) is 4 and the SD of y (s_y) is 2.58.

i   x_i   y_i
1   1     1
2   2     3
3   4     5
4   5     7

x̄ = 3, ȳ = 4, s_x = 1.83, s_y = 2.58
Modeling Linear Trends
• Correlation says “there seems to be a linear association between these two variables,” but it doesn't tell us what that association is.
Modeling Linear Trends
• Modeling linear trends with a linear equation:
• If the model fits the data well, the model can describe the data and be used to make predictions about real-world situations.
• The line that describes the relationship between 𝑥 and
𝑦 is called the regression line.
Regression Line
• It is a tool for making predictions about future
observed values.
• It provides us with a useful way of summarizing
a linear relationship.
Residual
• The linear model won't be perfect, regardless of the
line we draw.
• Some points will be above the line and some will be
below the line.
• The estimate made from a linear model is the predicted value, denoted as ŷ.
• The difference between the observed value and its
associated predicted value is called the residual.
Residual
• Blue dots: observed (actual) y values
• Red dots: predicted y values (ŷ)
• Residual = y − ŷ (observed minus predicted)
Residual
• A negative residual means the predicted value is bigger than the actual value (an overestimate).
• A positive residual means the predicted value is smaller than the actual value (an underestimate).
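A tiny sketch (made-up numbers, not from the slides) showing how the sign of a residual matches this interpretation:

    # Residual for a single observation: observed y minus predicted y.
    intercept, slope = 2.0, 0.5
    x_obs, y_obs = 10, 8

    y_hat = intercept + slope * x_obs   # predicted value from the line: 7.0
    residual = y_obs - y_hat            # 1.0: positive, so the model underestimated y
    print(y_hat, residual)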
Finding the “Best” Regression Line
• Ultimately, we want to find the best-fitting model, which minimizes the residuals.
• We need a good ‘measure’ of the overall amount of residuals:
• 1) Sum of residuals? Some residuals are positive, others are negative, and on average they cancel each other out. Not a good way!
• 2) Sum of squared residuals: similar to what we did with deviations, we square the residuals and add the squares (see the sketch after this list).
• The smaller the sum, the better the fit.
• The line of best fit is the line for which the sum of the squared residuals is smallest.
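As a sketch (placeholder data, not from the slides), the comparison can be written as a small helper that scores any candidate line by its sum of squared residuals:

    # Sum of squared residuals for a candidate line y-hat = intercept + slope * x.
    # The smaller the sum, the better the line fits the points.
    def sum_squared_residuals(x, y, intercept, slope):
        return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

    x = [1, 2, 4, 5]
    y = [1, 3, 5, 7]
    print(sum_squared_residuals(x, y, 0.0, 1.5))   # a reasonable guess
    print(sum_squared_residuals(x, y, -0.2, 1.4))  # the least-squares line: smaller sum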
Finding the “Best” Regression Line
• Among possible regression lines, find the line that minimizes the sum of squared residuals.
[Scatterplot of Mothers' age vs. Fathers' age, with Fathers' age (0 to 60) on the vertical axis and Mothers' age (0 to 50) on the horizontal axis, with several candidate regression lines drawn through the points]
• Which one looks the best?
Finding the “Best” Regression Line
Line Color   Linear Equation        Sum of squared residuals
Black        ŷ = 11.54 + 0.68x      38905.1
Green        ŷ = 1 + x              54545
Red          ŷ = 15 + 0.6x          47887.32
Regression Line
• Ultimately, we want to write the regression line as:
ŷ = intercept + slope × x
by finding the intercept and slope that give the line with the smallest sum of squared residuals.
Regression Line : Slope
• The slope can be calculated using the correlation
coefficient, 𝑟, and the standard deviations of the
explanatory (independent) variable (𝒔𝒙 ) and the
response (dependent) variable (𝒔𝒚 ):
slope = r × (s_y / s_x)
Regression Line : Slope
• Example : The mean fat content of the 30 Burger
King menu items is 23.5g with a standard deviation
of 16.4g, and the mean protein content of these
items is 17.2g with a standard deviation of 14g. If the
correlation between protein and fat contents is
0.83, calculate the slope of the linear model for
predicting fat content from protein content.
Regression Line : Slope
                          Fat (y)   Protein (x)
Mean                      23.5      17.2
Standard deviation        16.4      14
Correlation coefficient        0.83
• Find the slope of the linear model for predicting fat content from protein content:
slope = r × (s_y / s_x) = 0.83 × (16.4 / 14) ≈ 0.97
• The intercept is then found from the means:
intercept = ȳ − slope × x̄
Regression Line : Intercept
• Example (continued): with ȳ = 23.5 (mean fat) and x̄ = 17.2 (mean protein),
intercept = ȳ − slope × x̄ = 23.5 − 0.97 × 17.2 ≈ 6.8
Regression Line
• Putting the slope and intercept together:
predicted fat = 6.8 + 0.97 × protein
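A short Python sketch (not in the slides) that reproduces the slope and intercept from the summary statistics:

    # Least-squares slope and intercept from the Burger King protein/fat summaries.
    r = 0.83             # correlation between protein (x) and fat (y)
    s_x, s_y = 14, 16.4  # standard deviations of protein and fat (grams)
    x_bar, y_bar = 17.2, 23.5

    slope = r * s_y / s_x              # about 0.97 g of fat per gram of protein
    intercept = y_bar - slope * x_bar  # about 6.8 g of fat when protein is 0
    print(slope, intercept)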
Exercise
• The table shows the heights and weights of some
people. We want to predict weights from height.
Height (inches) Weight (pounds)
60 105
66 140
72 185
70 145
62 120
c) Find the best-fitting line and plot it on the scatterplot. Interpret the slope and the intercept in the context of the data.
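Purely as an illustration (not the exercise solution from the slides), the same steps can be carried out in Python using the slope and intercept formulas above:

    # Least-squares line for predicting weight (pounds) from height (inches).
    from statistics import mean, stdev

    height = [60, 66, 72, 70, 62]
    weight = [105, 140, 185, 145, 120]
    n = len(height)

    h_bar, w_bar = mean(height), mean(weight)
    s_h, s_w = stdev(height), stdev(weight)
    r = sum(((h - h_bar) / s_h) * ((w - w_bar) / s_w)
            for h, w in zip(height, weight)) / (n - 1)

    slope = r * s_w / s_h              # pounds of weight per extra inch of height
    intercept = w_bar - slope * h_bar  # predicted weight at height 0 (not meaningful by itself)
    print(r, slope, intercept)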
Measure of Goodness of fit
• If the correlation between 𝑥 and 𝑦 is 𝑟 = 1 or 𝑟 = −1,
then the model would predict 𝑦 perfectly, and the
residuals would all be zero. Hence, all of the variability
in the response variable would be explained by the
explanatory variable.
R² : Coefficient of determination
• In the Burger King menu model, r = 0.83, which is not perfect. In other words, protein cannot explain 100% of fat’s variation. Then how much?
• B) R² = 100%: the variation in y is perfectly explained by x.
• C) R² = 60.5%: some portion (60.5%) of the variation in y is explained by x.
R² : Properties
• R² = 0 means that none of the variance in y is explained by x.
• R² = 1 means that all of the variance in y is explained by x.
• While the correlation coefficient is between -1 and 1, R² is between 0 and 1:
0 ≤ R² ≤ 1
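As a sketch (reusing the small data set from the earlier correlation exercise and its least-squares line, computed here for illustration), R² can be obtained from the residuals:

    # R^2 = 1 - (sum of squared residuals) / (total sum of squares of y);
    # for a simple linear regression this equals the squared correlation r^2.
    x = [1, 2, 4, 5]
    y = [1, 3, 5, 7]
    y_bar = sum(y) / len(y)

    intercept, slope = -0.2, 1.4  # least-squares line for these points
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    print(1 - ss_res / ss_tot)    # close to 1: x explains most of the variation in y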
Exercise
• The correlation between weight and gas mileage of
cars is -0.96. Which of the following is the correct value
and interpretation of the linear model for predicting
gas mileage from weight?
a) -0.71
b) 0.71
c) 0.25
d) 7.07
Residual Plot
• The linear model assumes that the relationship
between the two variables is a perfect straight
line. The residuals are the part of the data that
has not been modeled.
Data = Model + Residual
or equivalently
Residual = Data – Model.
Residual Plot : Model Assessment
• Bad examples (the linear model does not fit):
• Fan-shaped residuals: the variance of the residuals grows as x gets larger (or vice versa).
• Curved residuals: residuals are negative for small x, then positive, and finally slightly negative again for large x.
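For reference, a minimal matplotlib sketch (not from the slides) of how a residual plot is drawn; fan shapes or curves in such a plot signal a poor linear fit:

    # Residual plot: residuals (data - model) plotted against x.
    # A well-fitting linear model leaves no pattern, just random scatter around zero.
    import matplotlib.pyplot as plt

    x = [1, 2, 4, 5]
    y = [1, 3, 5, 7]
    intercept, slope = -0.2, 1.4  # fitted line for these points

    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

    plt.scatter(x, residuals)
    plt.axhline(0)                # reference line at residual = 0
    plt.xlabel("x")
    plt.ylabel("Residual")
    plt.show()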
Extrapolation
• Do not extrapolate beyond the data, i.e., do not make a
prediction for an x value outside the range of the data –
the linear model may no longer hold outside that range
• For example, if the BK Broiler chicken sandwich has a
protein content of 75 grams, what is the predicted fat
content?
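A small sketch of why this is a problem: the arithmetic still produces a number, but 75 grams of protein lies far beyond the menu items used to fit the model:

    # Plugging protein = 75 g into the Burger King model gives a numeric answer,
    # but it is an extrapolation and should not be trusted.
    intercept, slope = 6.8, 0.97
    protein = 75
    predicted_fat = intercept + slope * protein
    print(predicted_fat)  # about 79.6 g of fat: a number, but not a reliable one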