L3 Bivariate Worksheet
This is adapted from University of Auckland Statistics Department material. The original can be
found at http://www.stat.auckland.ac.nz/~teachers/2003/regression.php
Types of variables
Quantitative (measurable/countable)
  Continuous: measurable to any precision
  Discrete: exact numbers
Qualitative (grouped)
Trend
straight line
Scatter
Anything unusual
Exercise:
What do I see in these scatter plots? Try to say as many (correct, useful) things as you can:
[Scatter plot: unlabelled y-axis (14 to 19) vs Latitude (°S)]
[Scatter plot: unlabelled y-axis (0 to 70) vs GDP per capita (thousands of dollars)]
[Scatter plot: Age vs Year (1930 to 1990)]
Exercise:
Rank these relationships from weakest (1) to strongest (4):
_______________________________________________________________
_______________________________________________________________
_______________________________________________________________
_______________________________________________________________
Correlation
Correlation measures the strength of the linear association between two quantitative variables
Get the correlation coefficient (r) from your calculator or computer
The correlation coefficient has no units
r has a value between –1 and +1:
[Example scatter plots: r = 0 and r = 0.9]
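As a rough illustration of what a calculator or computer does when it reports r, the sketch below computes the correlation coefficient from its usual definition. The function name `correlation` and the small data sets are made up for illustration; they do not come from the worksheet.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r for paired quantitative data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data (not from the worksheet)
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]     # points lie exactly on an increasing line
y_down = [10, 8, 6, 4, 2]   # points lie exactly on a decreasing line

print(correlation(x, y_up))
print(correlation(x, y_down))

# Changing the units of x (for example metres to centimetres) leaves r unchanged
x_cm = [100 * v for v in x]
print(correlation(x_cm, y_up))
```

The last call is one way to see why r has no units: rescaling a variable changes the numerator and denominator by the same factor.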
Causation
Two variables may be strongly associated (as measured by the correlation coefficient for linear
associations) but may not have a cause-and-effect relationship between them. The explanation
may be that both variables are related to a third variable not being measured: a “lurking” or
“confounding” variable.
These variables are positively correlated:
Number of fire trucks vs amount of fire damage
Teachers’ salaries vs price of alcohol
Number of storks seen vs population of Oldenburg, Germany, over a 6-year period
Number of policemen vs number of crimes
Only talk about causation if you have well designed and carefully carried out experiments. That way
confounding variables can be excluded.
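A small simulation can make the idea of a lurking variable concrete. The sketch below is only an illustration under made-up assumptions: `town_size`, `var_a` and `var_b` are hypothetical variables, and the numbers are random, not real data. Both `var_a` and `var_b` are built from `town_size` plus unrelated noise, and neither is computed from the other.

```python
import math
import random

def correlation(xs, ys):
    """Correlation coefficient r for paired quantitative data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical lurking variable (standardised, so values centre on 0)
town_size = [random.gauss(0, 1) for _ in range(2000)]

# Two variables that each depend on the lurking variable plus their own noise
var_a = [z + random.gauss(0, 0.5) for z in town_size]
var_b = [z + random.gauss(0, 0.5) for z in town_size]

r = correlation(var_a, var_b)
print(round(r, 2))  # strongly positive, although neither variable causes the other
```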
If you do suspect that one variable is causing the other, then that variable should go along the x-axis and
is called the “explanatory” variable. If you suspect a causal link, but don’t know which way, then the one
that you are controlling in your experiment, the “control” variable, is placed along the x-axis and the
measured variable is placed along the y-axis. In other cases it makes no difference.
Exercise:
Tick the plots where it would be OK to use a correlation coefficient to describe the strength of the
relationship:
[Scatter plot: unlabelled y-axis (0 to 4000) vs Position Number]
[Scatter plot: Dominant Hand vs Non-dominant Hand]
[Scatter plot: unlabelled y-axis (14 to 19) vs Latitude (°S)]
[Scatter plot: Male ($) vs Female ($)]
Exercise:
What do I see in this scatter plot?
[Scatter plot: Height (cm) vs Foot size (cm)]
What will happen to the correlation coefficient if the tallest Year 10 student is removed? Tick your
answer:
(Remember the correlation coefficient answers the question: “For a linear relationship, how well do the
data fall on a straight line?”)
Exercise:
What do I see in this scatter plot?
[Scatter plot with one labelled point: Elephant]
Exercise:
Life Expectancy and Availability of Doctors for a Sample of 40 Countries
[Scatter plot: Life Expectancy vs People per Doctor]
Using the information in the plot, can you suggest what needs to be done in a country to
increase the life expectancy? Explain.
[Scatter plot: Life Expectancy vs People per Television]
Can you suggest another variable that is linked to life expectancy and the availability of doctors (and
televisions) which explains the association between the life expectancy and the availability of doctors
(and televisions)?
______________________________________________________________________________
Data Sources
http://www.niwa.cri.nz/edu/resources/climate
http://www.cia.gov/cia/publications/factbook
http://www.stats.govt.nz
http://www.censusatschool.org.nz
http://www.amstat.org/publications/jse/jse_data_archive.html
Regression
(The theory on this page is not required for this unit, but helps explain what calculations are done.)
[Diagram: the line y = 5 + 2x and the data point (8, 25). The line predicts 21 at x = 8, so the
prediction error is 25 - 21 = 4.]
Least squares: minimise ∑ (prediction errors)²
• There is one and only one least squares regression line for every linear regression
• The sum of the prediction errors is zero for the least squares line but it is also true for many
other lines
• The line includes the means of x and y, i.e. the point (x̄, ȳ) is on the least squares line
• Calculator or computer gives the equation of the least squares line
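The properties listed above can be checked numerically. The following is a minimal sketch, assuming the standard formulas for the least squares slope and intercept; the function name `least_squares` and the five data points are hypothetical, chosen only so the arithmetic is easy to follow.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data (not from the worksheet)
xs = [2, 4, 6, 8, 10]
ys = [10, 12, 18, 25, 24]
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

intercept, slope = least_squares(xs, ys)

# Prediction errors for the least squares line: they sum to zero (up to rounding)
errors = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
sum_sq = sum(e ** 2 for e in errors)

# A different line through (mean_x, mean_y): its errors also sum to zero,
# but its sum of squared errors is larger
other_slope = slope + 1
other_errors = [y - (mean_y + other_slope * (x - mean_x)) for x, y in zip(xs, ys)]
other_sum_sq = sum(e ** 2 for e in other_errors)

print(intercept, slope)
print(sum(errors), sum(other_errors))
print(sum_sq, other_sum_sq)
```

This is why the criterion is the sum of squared prediction errors rather than the plain sum: many lines make the plain sum zero, but only one minimises the squares.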
R-squared (R²)
On a scatter plot Excel has options for displaying the equation of the fitted line and the value of R².
Four scatter plots with fitted lines are shown below. The equation of the fitted line and the value of R² are
given for each plot. Compare the values for R² to the scatter seen.
[Four scatter plots with fitted lines: Energy (calories/100g) vs Salt (mg/100g); Energy (calories/100g) vs
Total Fat (%); Energy (calories/100g) vs Number of crackers per 100g; Total Fat (%) vs Salt (mg/100g)]
• R² gives the fraction of the variability of the y values accounted for by the linear regression
(considering the variability in the x values).
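To see what "fraction of the variability accounted for" means in terms of a calculation, the sketch below computes R² as 1 minus the ratio of the variability left around the fitted line to the total variability of the y values. The function name `r_squared` and the data are hypothetical; Excel is assumed to report the same quantity for a straight-line fit.

```python
def r_squared(xs, ys):
    """R-squared for the least squares line fitted to (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Variability of y left over around the fitted line
    ss_residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Total variability of y around its mean
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_residual / ss_total

# Hypothetical data (not from the worksheet)
xs = [2, 4, 6, 8, 10]
scattered = [10, 12, 18, 25, 24]   # some scatter about a line
exact = [9, 13, 17, 21, 25]        # points exactly on y = 5 + 2x

print(round(r_squared(xs, scattered), 3))
print(round(r_squared(xs, exact), 3))
```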
Exercise:
List the plots from greatest R² to least R².
[Four scatter plots with y-axes: Hardness (units); Optical absorbance; Time (seconds); Pavement
condition index]
Source: Chance Encounters: A First Course in Data Analysis and Inference by Christopher
J. Wild and George A. F. Seber
Exercise:
For each scatter plot, use the value of R² to write a sentence about the variability of the y-
values accounted for by the linear regression.
[Scatter plot: Energy (calories/100g) vs Salt (mg/100g)]
_____________________________________
_____________________________________
_____________________________________
[Scatter plot: Energy (calories/100g) vs Total Fat (%)]
_____________________________________
_____________________________________
_____________________________________
[Scatter plot: Energy (calories/100g) vs Number of crackers per 100g]
_____________________________________
_____________________________________
_____________________________________
[Scatter plot: Total Fat (%) vs Salt (mg/100g)]
_____________________________________
_____________________________________
_____________________________________
Outliers in a regression context
An outlier, in a regression context, is a point that is unusually far from the trend.
The effect can be quite different for outliers in the y dimension compared to the x dimension.
Outliers in y
This is when the data point is a great distance above or below the trend line. An example:
The following table shows the winning distances in the men’s long jump in the Olympic Games for
years after the Second World War.
[Scatter plot: Distance (metres) vs Years since 1944]
Exercise:
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
Excel output for a linear regression on all 15 observations
[Scatter plot with fitted line: Distance (metres) vs Years since 1944]
We estimate that for every 4-year increase in years (from one Olympic Games to the next) the
___________
Using this linear regression we predict that the winning distance in 2004 will be ___________.
[Second scatter plot with fitted line: Distance (metres) vs Years since 1944]
We estimate that for every 4-year increase in years (from one Olympic Games to the next) the
___________
Using this linear regression we predict that the winning distance in 2004 will be ___________.
What effect did the 1968 observation have on the:
a) fitted line?
____________________________________________________________________________
____________________________________________________________________________
c) value of r?
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
We see how an outlier in y can affect the R² value a lot although not necessarily the trend line itself.
Such an outlier should be checked out to see if it is a mistake or an actual unusual observation.
• If it is a mistake then it should either be corrected or removed.
• If it is an actual unusual observation then try to understand why it is so different from the other
observations.
If it is an actual unusual observation (or we don’t know if it is a mistake or an actual observation) then
carry out two linear regressions; one with the outlier included and one with the outlier excluded.
Investigate the amount of influence the outlier has on the fitted line and discuss the differences.
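The "two regressions" comparison can be sketched as follows. The data here are hypothetical (they are not the long jump distances): eleven points lie exactly on a line, and one point in the middle of the x range is then pushed upwards to act as a y-outlier. The function name `fit` is also an assumption for this sketch.

```python
def fit(xs, ys):
    """Return (intercept, slope, r_squared) for the least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope, sxy ** 2 / (sxx * syy)

# Hypothetical data: points on y = 1 + 0.5x, with one y-outlier at x = 5
xs = list(range(11))
ys = [1 + 0.5 * x for x in xs]
ys[5] += 5  # the unusual observation

# Regression 1: outlier included
with_outlier = fit(xs, ys)

# Regression 2: outlier excluded
xs_without = xs[:5] + xs[6:]
ys_without = ys[:5] + ys[6:]
without_outlier = fit(xs_without, ys_without)

print("included:", with_outlier)
print("excluded:", without_outlier)
```

In this made-up case the outlier sits at the mean of x, so the slope is the same in both fits while R² changes a great deal. An outlier elsewhere along the x range would generally shift the slope as well.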
Outliers in x (or x-outliers)
The effect of an outlier that is distant along the x axis can be quite different. An example:
We often talk about a person’s “blood pressure” as though it is an inherent characteristic of that
person. In fact, a person’s blood pressure is different each time you measure it. One thing it reacts
to is stress. The following table gives two systolic blood pressure readings for each of 20 people
sampled from those participating in a large study. The first was taken five minutes after they came
in for the interview, and the second some time later.
Note: The systolic phase of the heartbeat is when the heart contracts and drives the blood out.
Source: Chance Encounters: A First Course in Data Analysis and Inference by Christopher J. Wild and
George A. F. Seber (Exercise for Section 3.1.2., Question 3, p113).
Observation     1    2    3    4    5    6    7    8    9   10
1st reading   116  122  136  132  128  124  110  110  128  126
2nd reading   114  120  134  126  128  118  112  102  126  124

Observation    11   12   13   14   15   16   17   18   19   20
1st reading   130  122  134  132  136  142  134  140  134  160
2nd reading   128  124  122  130  126  130  128  136  134  160
[Scatter plot: Blood Pressure, Second reading vs First reading]
Exercise:
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
Excel output for a linear regression on all 20 observations
[Scatter plot with fitted line: Blood Pressure, Second reading vs First reading;
y = 0.9337x + 4.9068, R² = 0.8673]
We estimate that for every 10-unit increase in the first blood pressure reading the second reading
___________
For a person with a first reading of 140 units we predict that the second reading will be
________________
[Second scatter plot with fitted line: Second reading vs First reading]
We estimate that for every 10-unit increase in the first blood pressure reading the second reading
___________
For a person with a first reading of 140 units we predict that the second reading will be
________________
What effect did observation 20 have on:
________________________________________________________________________________
________________________________________________________________________________
________________________________________________________________________________
________________________________________________________________________________
We see that an extreme x value can have the effect of artificially making the R² value too high.
The fitted line may say more about the x-outlier than about the overall relationship between the two
variables. Such an outlier is sometimes called a high-leverage point.
• If it is a mistake then it should either be corrected or removed.
• If it is an actual unusual observation then try to understand why it is so different from the other
observations.
If a data set has an x-outlier which does not appear to be in error then carry out two linear regressions;
one with the x-outlier included and one with the outlier excluded. Investigate the amount of influence the
outlier has on the fitted line and discuss the differences.
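The two regressions for the blood pressure example can be sketched in code using the 20 pairs of readings from the table above. The function name `fit` is an assumption of this sketch; the formulas are the standard least squares ones, which should agree with the Excel output quoted for all 20 observations.

```python
def fit(xs, ys):
    """Return (intercept, slope, r_squared) for the least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope, sxy ** 2 / (sxx * syy)

# Readings from the blood pressure table (observations 1 to 20)
first = [116, 122, 136, 132, 128, 124, 110, 110, 128, 126,
         130, 122, 134, 132, 136, 142, 134, 140, 134, 160]
second = [114, 120, 134, 126, 128, 118, 112, 102, 126, 124,
          128, 124, 122, 130, 126, 130, 128, 136, 134, 160]

# Regression 1: all 20 observations
all_fit = fit(first, second)

# Regression 2: observation 20 (the x-outlier) excluded
reduced_fit = fit(first[:-1], second[:-1])

for label, (intercept, slope, r2) in [("all 20", all_fit), ("without 20", reduced_fit)]:
    print(f"{label}: y = {slope:.4f}x + {intercept:.4f}, R^2 = {r2:.4f}")
```

Comparing the two printed lines shows how much of the fitted slope and R² depends on the single high-leverage point.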
Groupings
In the 1930s Dr. Edgar Anderson collected data on 150 iris specimens. This data set was published
in 1936 by R. A. Fisher, the well-known British statistician.
[Scatter plot: Petal width (mm) vs Petal length (mm)]
Exercise:
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
[Scatter plot: Petal width (mm) vs Petal length (mm)]
The data were actually on fifty iris specimens from each of three species: Iris setosa, Iris versicolor and
Iris virginica. The scatter plot below identifies the different species by using different plotting symbols (+
for setosa, • for versicolor, × for virginica).
[Scatter plot with fitted line: Fisher's Iris Data (Iris setosa), Petal width (mm) vs Petal length (mm);
y = 0.2012x - 0.4822, R² = 0.11]
[Scatter plot with fitted line: Fisher's Iris Data (Iris versicolor), Petal width (mm) vs Petal length (mm);
y = 0.2273x + 3.437, R² = 0.3797]
[Scatter plot: Petal width (mm) vs Petal length (mm), petal lengths 40 to 75 mm]
Exercise:
Comment.
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
This shows we need to watch for different groupings in our data.
• If there are groupings in your data that behave differently then consider fitting a different linear
regression line for each grouping.
• An x-outlier or data that has groupings can make the value of R² seem large when the linear
regression is just not appropriate.
• On the other hand, a low value of R² may be caused by the presence of a single y-outlier while all
other points have a reasonably strong linear relationship.
• Groupings of different kinds of objects may give a value of R² that does not hold for the individual
types.
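The points above about groupings can be illustrated with a small made-up example. The two groups below are hypothetical (they are not the iris measurements), and `fit` is a helper name assumed for this sketch. Within each group the relationship is weak, yet pooling the groups gives a large R² because the groups sit far apart.

```python
def fit(xs, ys):
    """Return (intercept, slope, r_squared) for the least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope, sxy ** 2 / (sxx * syy)

# Hypothetical data: two groups of different kinds of objects
groups = {
    "group A": ([1, 2, 3, 4], [2, 1, 2, 1]),
    "group B": ([11, 12, 13, 14], [12, 11, 12, 11]),
}

# One regression per grouping
per_group = {name: fit(xs, ys) for name, (xs, ys) in groups.items()}

# One regression on everything pooled together
pooled_x = [x for xs, _ in groups.values() for x in xs]
pooled_y = [y for _, ys in groups.values() for y in ys]
pooled = fit(pooled_x, pooled_y)

for name, (intercept, slope, r2) in per_group.items():
    print(f"{name}: slope = {slope:.3f}, R^2 = {r2:.3f}")
print(f"pooled: slope = {pooled[1]:.3f}, R^2 = {pooled[2]:.3f}")
```

In this made-up case the pooled slope even has the opposite sign to the slope within each group, which is a reminder to plot the data with the groups identified before trusting a single fitted line.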
Prediction
[Scatter plot: Concentration (units/litre) vs Time (hours), 0 to 18 hours]
Exercise:
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
Suppose that a patient had a heart attack 17 hours ago. Predict the creatine kinase concentration in the
blood for this patient.
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
The complete data set is displayed in the scatter plot below.
[Scatter plot: Concentration (units/litre) vs Time (hours), 0 to 60 hours]
The removal of an x outlier will mean that the range of observed x values is reduced. This should be
discussed in the comparison between the two linear regressions (x outlier included and x outlier
excluded). It is possible that the supposed “outlier” may actually indicate the start of a change in the
pattern.
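Because the range of observed x values matters so much for prediction, one simple habit is to check whether a new x value lies inside that range before trusting the prediction. The sketch below shows one way to do this; the function name `predict`, the fitted line and the observed times are all hypothetical and are not taken from the creatine kinase data.

```python
def predict(intercept, slope, x_new, observed_xs):
    """Return (prediction, is_extrapolation) for a fitted line.

    is_extrapolation is True when x_new lies outside the range of x values
    that were used to fit the line.
    """
    lowest, highest = min(observed_xs), max(observed_xs)
    is_extrapolation = not (lowest <= x_new <= highest)
    return intercept + slope * x_new, is_extrapolation

# Hypothetical fitted line and observed times (hours)
observed_hours = [1, 2, 4, 6, 8, 10]
intercept, slope = 100.0, 80.0

inside = predict(intercept, slope, 5, observed_hours)
outside = predict(intercept, slope, 17, observed_hours)

print(inside)   # within the observed range
print(outside)  # flagged: the line has not been checked this far out
```

A flagged prediction is not necessarily wrong, but the data give no evidence that the same pattern continues beyond the observed range.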
Non-Linearity
[Scatter plot: Time (minutes) vs Years since 1 Jan 1946]
Exercise:
Concerns:
____________________________________________________________________________
____________________________________________________________________________
Possible solutions:
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
[Scatter plot with fitted curve: Men's Marathon Fastest Times, Time (minutes) vs Years since 1 Jan 1946;
y = 0.0078x² - 0.769x + 144.68, R² = 0.9592]
[Scatter plot with fitted curve: Men's Marathon Fastest Times, Time (minutes) vs Years since 1 Jan 1946;
y = 139.99e^(-0.0022x), R² = 0.8521]
[Scatter plot: Time (minutes), years 0 to 60]
[Two scatter plots: Time (minutes) vs Years since 1 Jan 1946, for years 0 to 25 and for years 20 to 60]
Exercise:
Comments:
____________________________________________________________________________
____________________________________________________________________________
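One way to compare the two fitted curves quoted on the marathon plots is to evaluate them at a few x values, including values beyond the data. The sketch below uses the coefficients printed on the plots; reading the second equation as an exponential with exponent -0.0022x is an assumption, and the function names are made up for this illustration.

```python
import math

def quadratic_fit(x):
    """Quadratic curve quoted on the plot: 0.0078x^2 - 0.769x + 144.68."""
    return 0.0078 * x ** 2 - 0.769 * x + 144.68

def exponential_fit(x):
    """Exponential curve quoted on the plot: 139.99 * e^(-0.0022x)."""
    return 139.99 * math.exp(-0.0022 * x)

# A quadratic a*x^2 + b*x + c with a > 0 stops decreasing at x = -b / (2a)
turning_point = 0.769 / (2 * 0.0078)
print(round(turning_point, 1))

for years in (0, 30, 60, 80):
    print(years, round(quadratic_fit(years), 1), round(exponential_fit(years), 1))
```

The quadratic has the higher R² on the plots, but it predicts slower times once x passes its turning point, which may not be a sensible pattern for fastest times. A good fit within the data does not guarantee sensible predictions outside it.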
The data in the scatter plot below comes from a random sample of 60 models of new cars taken from all
models on the market in New Zealand in May 2000. We want to use the engine size to predict the weight
of a car.
[Scatter plot: Weight (kg) vs Engine size (cc)]
Exercise:
Concerns:
____________________________________________________________________________
____________________________________________________________________________
Possible solutions:
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
Note: The solution need not be to exclude all linear models. It might be to restrict the range of values
to which the linear model is applied.