L3 Bivariate Worksheet

The document discusses bivariate data analysis, focusing on the use of scatter plots to investigate relationships between two quantitative variables. It covers key concepts such as trends, correlation, causation, and regression, emphasizing the importance of understanding the nature of the data and potential outliers. Additionally, it provides exercises to reinforce the concepts and includes data sources for further exploration.

Uploaded by

madnesin24
Copyright
© © All Rights Reserved

Bivariate Data Analysis

This is adapted from University of Auckland Statistics Department material. The original can be
found at http://www.stat.auckland.ac.nz/~teachers/2003/regression.php

Types of variables
Quantitative: measurable/countable
Qualitative: grouped
Continuous: measurable to any precision
Discrete: exact numbers

The scatter plot is the basic tool used to investigate relationships between two quantitative variables.

What do I look for in scatter plots?

Trend
• a linear trend (straight line) or a non-linear trend?
• a positive association (as one variable gets bigger, so does the other) or a negative association (as one variable gets bigger, the other gets smaller)?

Scatter
• a strong relationship (little scatter) or a weak relationship (lots of scatter)?
• constant scatter (roughly the same amount of scatter across the plot) or non-constant scatter (the scatter is shaped like a fan or funnel)?

Anything unusual
• any outliers or any groupings?

Exercise:
What do I see in these scatter plots? Try to say as many (correct, useful) things as you can:

[Scatter plot: Mean January Air Temperatures for 30 New Zealand Locations. x-axis: Latitude (°S); y-axis: Temperature (°C).]

[Scatter plot: % of population who are Internet Users vs GDP per capita for 202 Countries. x-axis: GDP per capita (thousands of dollars); y-axis: Internet Users (%).]

[Scatter plot: Average Age New Zealanders are First Married. x-axis: Year (1930 to 1990); y-axis: Age.]

Exercise:
Rank these relationships from weakest (1) to strongest (4):

What features do you see?

_______________________________________________________________

_______________________________________________________________

_______________________________________________________________

_______________________________________________________________

Correlation
• Correlation measures the strength of the linear association between two quantitative variables
• Get the correlation coefficient (r) from your calculator or computer
• The correlation coefficient has no units
• r has a value between –1 and +1:

[Figure: seven scatter plots with r = –1, r = –0.7, r = –0.4, r = 0, r = 0.3, r = 0.8 and r = 1. When r = –1 or r = 1 the points fall exactly on a straight line; when r = 0 there is no linear relationship (uncorrelated).]
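
A calculator or computer does the following arithmetic to get r. This is a minimal sketch in Python; the function name `correlation` and the small data sets are made up for illustration and are not part of the original worksheet.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data for illustration only
xs = [1, 2, 3, 4, 5]
on_a_rising_line = [2 * x + 1 for x in xs]    # points exactly on a rising line
on_a_falling_line = [10 - 3 * x for x in xs]  # points exactly on a falling line
scattered = [2, 1, 4, 3, 6]                   # positive trend with some scatter

r_rising = correlation(xs, on_a_rising_line)
r_falling = correlation(xs, on_a_falling_line)
r_scattered = correlation(xs, scattered)
print(r_rising, r_falling, r_scattered)
```

The first two values sit at the ends of the scale (+1 and –1) and the third falls between 0 and 1. Notice that r has no units: the x and y units cancel in the division.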

What can go wrong?


• Use correlation only if you have two quantitative variables
  There is an association between gender and weight but there isn’t a correlation between gender and weight!
• The variables should be continuous (or nearly continuous) for an accurate r value.
• Use correlation only if the relationship is linear
• Beware of outliers!
• Always plot the data before looking at the correlation

[Figure: two scatter plots, one with r = 0 and one with r = 0.9. The caption for each reads: No linear relationship, but there is a relationship!]
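
The two warnings about non-linear relationships and outliers can be tried out with a few lines of Python. This is only a sketch: the data are invented to make the point and the helper `correlation` is a hypothetical name, not something from the worksheet.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Invented data 1: a perfect curve (y = x squared), symmetric about x = 0
curve_x = [-3, -2, -1, 0, 1, 2, 3]
curve_y = [x ** 2 for x in curve_x]
r_curve = correlation(curve_x, curve_y)

# Invented data 2: four points in a square (no relationship at all) ...
square_x = [1, 1, 2, 2]
square_y = [1, 2, 1, 2]
r_square = correlation(square_x, square_y)

# ... and the same four points plus one far-away outlier
r_with_outlier = correlation(square_x + [10], square_y + [10])

print(r_curve, r_square, r_with_outlier)
```

The curve gives r = 0 even though y is completely determined by x, and a single outlier lifts r from 0 to above 0.9. Neither problem is visible from r alone, which is why the plot comes first.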

Causation
Two variables may be strongly associated (as measured by the correlation coefficient for linear associations) but may not have a cause and effect relationship between them. The explanation may be that both variables are related to a third variable that is not being measured: a “lurking” or “confounding” variable.
These variables are positively correlated:
• Number of fire trucks vs amount of fire damage
• Teachers’ salaries vs price of alcohol
• Number of storks seen vs population of Oldenburg, Germany, over a 6-year period
• Number of policemen vs number of crimes
Only talk about causation if you have well-designed and carefully carried out experiments. That way confounding variables can be excluded.

If you do suspect that one variable is causing the other, then that variable should go along the x axis and is called the “explanatory” variable. If you suspect a causal link, but don’t know which way, then the one that you are controlling in your experiment, the “control” variable, is placed along the x axis and the measured variable is the y. In other cases it makes no difference.

Exercise:
Tick the plots where it would be OK to use a correlation coefficient to describe the strength of the
relationship:

[Scatter plot: Distances of Planets from the Sun. x-axis: Position Number; y-axis: Distance (million miles).]

[Scatter plot: Reaction Times (seconds) for 30 Year 10 Students. x-axis: Non-dominant Hand; y-axis: Dominant Hand.]

[Scatter plot: Mean January Air Temperatures for 30 New Zealand Locations. x-axis: Latitude (°S); y-axis: Temperature (°C).]

[Scatter plot: Average Weekly Income for Employed New Zealanders in 2001. x-axis: Female ($); y-axis: Male ($).]

Exercise:
What do I see in this scatter plot?

[Scatter plot: Height and Foot Size for 30 Year 10 Students. x-axis: Foot size (cm); y-axis: Height (cm).]

What will happen to the correlation coefficient if the tallest Year 10 student is removed? Tick your
answer:

(Remember the correlation coefficient answers the question: “For a linear relationship, how well do the
data fall on a straight line?”)

It will get smaller / It won’t change / It will get bigger

Exercise:
What do I see in this scatter plot?

[Scatter plot: Life Expectancies and Gestation Period for a sample of non-human Mammals. x-axis: Gestation (Days); y-axis: Life Expectancy (Years). One point is labelled “Elephant”.]

What will happen to the correlation coefficient if the elephant is removed? Tick your answer:

It will get smaller / It won’t change / It will get bigger

Exercise:
[Scatter plot: Life Expectancy and Availability of Doctors for a Sample of 40 Countries. x-axis: People per Doctor; y-axis: Life Expectancy.]

Using the information in the plot, can you suggest what needs to be done in a country to increase the life expectancy? Explain.

[Scatter plot: Life Expectancy and Availability of Televisions for a Sample of 40 Countries. x-axis: People per Television; y-axis: Life Expectancy.]

Using the information in this plot, can you make another suggestion as to what needs to be done in a country to increase life expectancy?

Can you suggest another variable that is linked to life expectancy and the availability of doctors (and
televisions) which explains the association between the life expectancy and the availability of doctors
(and televisions)?

______________________________________________________________________________

Data Sources

http://www.niwa.cri.nz/edu/resources/climate
http://www.cia.gov/cia/publications/factbook
http://www.stats.govt.nz
http://www.censusatschool.org.nz
http://www.amstat.org/publications/jse/jse_data_archive.html

Regression

(The theory on this page is not required for this unit, but helps explain what calculations are done.)

Regression relationship = trend + scatter

Observed value = predicted value + prediction error

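[Figure: a data point at (8, 25) and the line y = 5 + 2x. At x = 8 the predicted value is 21 and the observed value is 25; the vertical gap between them (25 − 21 = 4) is the prediction error.]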

The Least Squares Regression Line

Which line? Choose the line with the smallest sum of squared prediction errors:

Minimise ∑ (prediction errors)²

• There is one and only one least squares regression line for every linear regression
• The sum of the prediction errors is zero for the least squares line, but this is also true for many other lines

• The line includes the means of x and y, i.e. the point (x̄, ȳ) is on the least squares line
• Calculator or computer gives the equation of the least squares line
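
The properties in the list above can be checked numerically. The sketch below is not the worksheet's own method, just the standard least squares formulas written out in Python; the function names and the five data points are invented for illustration (the data deliberately include the point (8, 25) and the line y = 5 + 2x from the diagram).

```python
def least_squares(xs, ys):
    """Slope and intercept of the least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # forces the line through (mean x, mean y)
    return slope, intercept

def prediction_errors(xs, ys, slope, intercept):
    """Observed value minus predicted value for each point."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

def sum_of_squares(errors):
    return sum(e ** 2 for e in errors)

# Made-up data for illustration; it includes the point (8, 25)
xs = [2, 4, 6, 8, 10]
ys = [10, 12, 18, 25, 24]

slope, intercept = least_squares(xs, ys)
ls_errors = prediction_errors(xs, ys, slope, intercept)

# Two other candidate lines: y = 5 + 2x, and the flat line y = mean of y
other_errors = prediction_errors(xs, ys, 2, 5)
flat_errors = prediction_errors(xs, ys, 0, sum(ys) / len(ys))

print("least squares line: slope", slope, "intercept", intercept)
print("sums of errors:", sum(ls_errors), sum(other_errors), sum(flat_errors))
print("sums of squared errors:", sum_of_squares(ls_errors),
      sum_of_squares(other_errors), sum_of_squares(flat_errors))
```

The flat line through the mean of y also has prediction errors that sum to zero, yet its sum of squared errors is far larger. That is why the criterion is the sum of squared errors, not the plain sum.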

R-squared (R2)
On a scatter plot Excel has options for displaying the equation of the fitted line and the value of R2.
Four scatter plots with fitted lines are shown below. The equation of the fitted line and the value of R2 are
given for each plot. Compare the values for R2 to the scatter seen.

[Scatter plot with fitted line: Common Cracker Brands. x-axis: Salt (mg/100g); y-axis: Energy (calories/100g). y = 0.1112x + 372.37, R2 = 0.4257]

[Scatter plot with fitted line: Common Cracker Brands. x-axis: Total Fat (%); y-axis: Energy (calories/100g). y = 4.9844x + 380.82, R2 = 0.982]

[Scatter plot with fitted line: Common Cracker Brands. x-axis: Number of crackers per 100g; y-axis: Energy (calories/100g). y = 0.3717x + 440.06, R2 = 0.0166]

[Scatter plot with fitted line: Common Cracker Brands. x-axis: Salt (mg/100g); y-axis: Total Fat (%). y = 0.0237x - 2.6556, R2 = 0.4892]

• R2 gives the fraction of the variability of the y values accounted for by the linear regression
(considering the variability in the x values).

• R2 is often expressed as a percentage.


• If the assumptions (straightness of line) appear to be satisfied then R2 gives an overall measure of
how successful the regression is in linearly relating y to x.

• R2 lies from 0 to 1 (0% to 100%).


• The smaller the scatter about the regression line the larger the value of R2.
• Therefore the larger the value of R2 the greater the faith we have in any estimates using the
equation of the regression line.
• R2 is the square of the sample correlation coefficient, r.
• For example, an R2 of 0.866 means that the linear regression accounts for 86.6% of the variability in the y values from the variability in the x values.
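
To see where the number comes from, R2 can be computed directly as the fraction of the variability in y that is left over after subtracting the squared prediction errors. The following is a sketch under assumptions: the function names and the five data points are invented for illustration, and Excel may organise the calculation differently.

```python
import math

def r_squared(xs, ys):
    """R2 = 1 - (sum of squared prediction errors) / (total variability in y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    squared_errors = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    total_variability = sum((y - mean_y) ** 2 for y in ys)
    return 1 - squared_errors / total_variability

def correlation(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data for illustration only
xs = [2, 4, 6, 8, 10]
ys = [10, 12, 18, 25, 24]

r2 = r_squared(xs, ys)
r = correlation(xs, ys)
print(f"R2 = {r2:.4f} ({r2:.1%} of the variability in y), r squared = {r ** 2:.4f}")
```

The two printed values agree, which illustrates the bullet stating that R2 is the square of the sample correlation coefficient.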

Exercise:
List the plots from greatest R2 to least R2.

[Scatter plot A: Cement hardness. x-axis: Cement (grams); y-axis: Hardness (units).]

[Scatter plot B: Optical absorbance versus dissolved carbon. x-axis: Dissolved organic carbon (mg/L); y-axis: Optical absorbance.]

[Scatter plot C: Reaction times, Olympic 100m. x-axis: Reaction time (seconds); y-axis: Time (seconds).]

[Scatter plot D: Road conditions in Canada. x-axis: Age (years); y-axis: Pavement condition index.]

Greatest to least R2:


_________________________________________________________

Source: Chance Encounters: A First Course in Data Analysis and Inference by Christopher
J. Wild and George A. F. Seber

Exercise:
For each scatter plot, use the value of R2 to write a sentence about the variability of the y-values accounted for by the linear regression.

[Scatter plot with fitted line: Common Cracker Brands. x-axis: Salt (mg/100g); y-axis: Energy (calories/100g). y = 0.1112x + 372.37, R2 = 0.4257]

_____________________________________

_____________________________________

_____________________________________

_____________________________________

[Scatter plot with fitted line: Common Cracker Brands. x-axis: Total Fat (%); y-axis: Energy (calories/100g). y = 4.9844x + 380.82, R2 = 0.982]

_____________________________________

_____________________________________

_____________________________________

_____________________________________

[Scatter plot with fitted line: Common Cracker Brands. x-axis: Number of crackers per 100g; y-axis: Energy (calories/100g). y = 0.3717x + 440.06, R2 = 0.0166]

_____________________________________

_____________________________________

_____________________________________

_____________________________________

[Scatter plot with fitted line: Common Cracker Brands. x-axis: Salt (mg/100g); y-axis: Total Fat (%). y = 0.0237x - 2.6556, R2 = 0.4892]

_____________________________________

_____________________________________

_____________________________________

_____________________________________

Outliers in a regression context
An outlier, in a regression context, is a point that is unusually far from the trend.
The effect can be quite different for outliers in the y dimension compared to the x dimension.

Outliers in y
This is when the data point is a great distance above or below the trend line. An example:
The following table shows the winning distances in the men’s long jump in the Olympic Games for
years after the Second World War.

Year  Winner                  Distance (m)
1948  Willie Steele (USA)     7.82
1952  Jerome Biffle (USA)     7.57
1956  Gregory Bell (USA)      7.83
1960  Ralph Boston (USA)      8.12
1964  Lynn Davies (GBR)       8.07
1968  Bob Beamon (USA)        8.90
1972  Randy Williams (USA)    8.24
1976  Arnie Robinson (USA)    8.35
1980  Lutz Dombrowski (GDR)   8.54
1984  Carl Lewis (USA)        8.54
1988  Carl Lewis (USA)        8.72
1992  Carl Lewis (USA)        8.67
1996  Carl Lewis (USA)        8.50
2000  Ivan Pedroso (Cuba)     8.55
2004  Dwight Phillips (USA)   8.59
Source: http://www.sporting-heroes/stats_athletics/olympics_trackandfield/trackfield.asp

[Scatter plot: Men's Long Jump Winning Distances, Olympic Games, 1948-2004. x-axis: Years since 1944; y-axis: Distance (metres).]

Exercise:

Comment on the scatter plot.

____________________________________________________________________________

____________________________________________________________________________

____________________________________________________________________________

____________________________________________________________________________

Excel output for a linear regression on all 15 observations

[Scatter plot with fitted line: Men's Long Jump Winning Distances, Olympic Games, 1948-2004. x-axis: Years since 1944; y-axis: Distance (metres). y = 0.0164x + 7.8097, R2 = 0.5914]

We estimate that for every 4-year increase in years (from one Olympic Games to the next) the winning distance increases by _________________, on average.

Using this linear regression we predict that the winning distance in 2004 will be ___________.

Excel output for a linear regression on 14 observations (with the 1968 observation removed)

[Scatter plot with fitted line: Men's Long Jump Winning Distances, Olympic Games, 1948-2004 (excl. 1968). x-axis: Years since 1944; y-axis: Distance (metres). y = 0.0177x + 7.7158, R2 = 0.8213]

We estimate that for every 4-year increase in years (from one Olympic Games to the next) the winning distance increases by _________________, on average.

Using this linear regression we predict that the winning distance in 2004 will be ___________.

What effect did the 1968 observation have on the:

a) fitted line?

____________________________________________________________________________

b) predicted winning distance in 2004?

____________________________________________________________________________

c) value of r?

____________________________________________________________________________

So what does this mean, altogether?

____________________________________________________________________________

____________________________________________________________________________

We see that an outlier in y can change the R2 value a lot without necessarily changing the trend line much.

Such an outlier should be checked to see if it is a mistake or an actual unusual observation.
• If it is a mistake then it should either be corrected or removed.
• If it is an actual unusual observation then try to understand why it is so different from the other observations.
If it is an actual unusual observation (or we don’t know if it is a mistake or an actual observation) then carry out two linear regressions: one with the outlier included and one with the outlier excluded. Investigate the amount of influence the outlier has on the fitted line and discuss the differences.
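
The “two regressions” comparison for the long jump data can be reproduced outside Excel. The sketch below uses the winning distances from the table and the standard least squares formulas; the function name `fit_line` and the variable names are hypothetical choices for this illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept, r_squared) of the least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

# Winning distances (metres) from the table, 1948 to 2004
years = list(range(1948, 2005, 4))
distances = [7.82, 7.57, 7.83, 8.12, 8.07, 8.90, 8.24, 8.35,
             8.54, 8.54, 8.72, 8.67, 8.50, 8.55, 8.59]
xs = [year - 1944 for year in years]  # years since 1944, as on the plots

all_fit = fit_line(xs, distances)

# Drop the 1968 observation (the y-outlier) and refit
kept = [(x, y) for x, y, year in zip(xs, distances, years) if year != 1968]
no_1968_fit = fit_line([x for x, _ in kept], [y for _, y in kept])

for label, (slope, intercept, r2) in [("all 15", all_fit),
                                      ("without 1968", no_1968_fit)]:
    print(f"{label}: y = {slope:.4f}x + {intercept:.4f}, R2 = {r2:.4f}")
```

Both printed lines should agree with the Excel output shown earlier to the displayed precision. Comparing the two makes the point of this section concrete: the slope and intercept move only slightly, while R2 changes a great deal.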

Outliers in x (or x-outliers)
The effect of an outlier that is distant along the x axis can be quite different. An example:
We often talk about a person’s “blood pressure” as though it is an inherent characteristic of that
person. In fact, a person’s blood pressure is different each time you measure it. One thing it reacts
to is stress. The following table gives two systolic blood pressure readings for each of 20 people
sampled from those participating in a large study. The first was taken five minutes after they came
in for the interview, and the second some time later.
Note: The systolic phase of the heartbeat is when the heart contracts and drives the blood out.
Source: Chance Encounters: A First Course in Data Analysis and Inference by Christopher J. Wild and
George A. F. Seber (Exercise for Section 3.1.2., Question 3, p113).

Observation 1 2 3 4 5 6 7 8 9 10
1st reading 116 122 136 132 128 124 110 110 128 126
2nd reading 114 120 134 126 128 118 112 102 126 124
Observation 11 12 13 14 15 16 17 18 19 20
1st reading 130 122 134 132 136 142 134 140 134 160
2nd reading 128 124 122 130 126 130 128 136 134 160

[Scatter plot: Blood Pressure. x-axis: First reading; y-axis: Second reading.]

Exercise:

Comment on the scatter plot.

____________________________________________________________________________

____________________________________________________________________________

____________________________________________________________________________

____________________________________________________________________________

Excel output for a linear regression on all 20 observations

[Scatter plot with fitted line: Blood Pressure. x-axis: First reading; y-axis: Second reading. y = 0.9337x + 4.9068, R2 = 0.8673]

We estimate that for every 10-unit increase in the first blood pressure reading the second reading increases by _________________, on average.

For a person with a first reading of 140 units we predict that the second reading will be ________________

Excel output for a linear regression on 19 observations (#20 removed)

[Scatter plot with fitted line: Blood pressure (Obs 20 removed). x-axis: First reading; y-axis: Second reading. y = 0.8124x + 20.152, R2 = 0.7844]

We estimate that for every 10-unit increase in the first blood pressure reading the second reading increases by _________________, on average.

For a person with a first reading of 140 units we predict that the second reading will be ________________

What effect did observation 20 have on:

a) the fitted line?

________________________________________________________________________________

b) the predicted second reading (for a first reading of 140)?

________________________________________________________________________________

c) the value of R2?

________________________________________________________________________________

So what does this mean overall?

________________________________________________________________________________

We see that an extreme x value can artificially inflate the R2 value.
The fitted line may say more about the x-outlier than about the overall relationship between the two variables. Such an outlier is sometimes called a high-leverage point.
• If it is a mistake then it should either be corrected or removed.
• If it is an actual unusual observation then try to understand why it is so different from the other observations.
If a data set has an x-outlier which does not appear to be in error then carry out two linear regressions: one with the x-outlier included and one with the x-outlier excluded. Investigate the amount of influence the outlier has on the fitted line and discuss the differences.
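
The same with-and-without comparison can be run on the blood pressure readings. This sketch uses the 20 pairs of readings from the table and the standard least squares formulas; `fit_line` and the variable names are hypothetical choices for this illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept, r_squared) of the least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

# Systolic blood pressure readings from the table (observations 1 to 20)
first = [116, 122, 136, 132, 128, 124, 110, 110, 128, 126,
         130, 122, 134, 132, 136, 142, 134, 140, 134, 160]
second = [114, 120, 134, 126, 128, 118, 112, 102, 126, 124,
          128, 124, 122, 130, 126, 130, 128, 136, 134, 160]

all_fit = fit_line(first, second)

# Observation 20 (first reading 160) is the x-outlier; refit without it
without_20_fit = fit_line(first[:-1], second[:-1])

for label, (slope, intercept, r2) in [("all 20", all_fit),
                                      ("without obs 20", without_20_fit)]:
    print(f"{label}: y = {slope:.4f}x + {intercept:.3f}, R2 = {r2:.4f}")
```

Here, unlike the long jump example, removing one point changes the slope and intercept noticeably as well as lowering R2, which is the behaviour described above for a high-leverage point.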

Groupings

In the 1930s Dr. Edgar Anderson collected data on 150 iris specimens. This data set was published
in 1936 by R. A. Fisher, the well-known British statistician.

This is sourced from: http://lib.stat.cmu.edu/DASL/Stories/Fisher’sIrises.html

[Scatter plot: Fisher's Iris Data. x-axis: Petal length (mm); y-axis: Petal width (mm).]

Exercise:

Comment on the scatter plots.

____________________________________________________________________________

____________________________________________________________________________

____________________________________________________________________________

____________________________________________________________________________

____________________________________________________________________________

[Scatter plot with fitted line: Fisher's Iris Data. x-axis: Petal length (mm); y-axis: Petal width (mm). y = 0.407x - 3.4532, R2 = 0.9137]

The data were actually on fifty iris specimens from each of three species: Iris setosa, Iris versicolor and Iris virginica. The scatter plot below identifies the different species by using different plotting symbols (+ for setosa, • for versicolor, × for virginica).

Let’s see what happens when we look at the groups separately.

[Scatter plot with fitted line: Fisher's Iris Data (Iris setosa). x-axis: Petal length (mm); y-axis: Petal width (mm). y = 0.2012x - 0.4822, R2 = 0.11]

[Scatter plot with fitted line: Fisher's Iris Data (Iris versicolor). x-axis: Petal length (mm); y-axis: Petal width (mm). y = 0.2273x + 3.437, R2 = 0.3797]

[Scatter plot with fitted line: Fisher's Iris Data (Iris virginica). x-axis: Petal length (mm); y-axis: Petal width (mm). y = 0.1839x + 9.8509, R2 = 0.1222]

Exercise:

Comment.

____________________________________________________________________________

____________________________________________________________________________

____________________________________________________________________________

This shows we need to watch for different groupings in our data.
• If there are groupings in your data that behave differently then consider fitting a different linear
regression line for each grouping.

Conclusions about R2 and outliers/groupings.


• A large value of R2 does not mean the linear regression is appropriate.

• An x-outlier or data that has groupings can make the value of R2 seem large when the linear
regression is just not appropriate.

• On the other hand, a low value of R2 may be caused by a single y-outlier, even though all the other points have a reasonably strong linear relationship.
• Groupings of different kinds of objects may give a value of R2 that does not hold for the individual
types.
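
The effect of groupings on R2 can be imitated with a tiny invented data set. The numbers below are not the iris measurements; they are two made-up clusters chosen so that there is no relationship at all inside either group. The function name `r_squared` is also just a label for this sketch.

```python
def r_squared(xs, ys):
    """R2 of the least squares line, computed as the squared correlation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)

# Invented group A: four points in a small square near the origin
a_x, a_y = [1, 1, 2, 2], [1, 2, 1, 2]
# Invented group B: the same square shifted up and to the right
b_x, b_y = [11, 11, 12, 12], [11, 12, 11, 12]

r2_group_a = r_squared(a_x, a_y)
r2_group_b = r_squared(b_x, b_y)
r2_pooled = r_squared(a_x + b_x, a_y + b_y)

print("group A:", r2_group_a)
print("group B:", r2_group_b)
print("pooled :", round(r2_pooled, 3))
```

Within each group R2 is 0, yet the pooled value is close to 1. The pooled line is describing the gap between the groups, not a relationship that holds inside either of them.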

Prediction

The purpose of a lot of regression analyses is to make predictions.


The data in the scatter plot below were collected from a set of heart attack patients. The response
variable is the creatine kinase concentration in the blood (units per litre) and the explanatory
variable is the time (in hours) since the heart attack.
Source: Chance Encounters: A First Course in Data Analysis and Inference by Christopher J. Wild and
George A. F. Seber, p514.

[Scatter plot: Creatine kinase concentration. x-axis: Time (hours), 0 to 18; y-axis: Concentration (units/litre).]

Exercise:

Comment on the scatter plot.

____________________________________________________________________________

____________________________________________________________________________

____________________________________________________________________________

____________________________________________________________________________

Suppose that a patient had a heart attack 17 hours ago. Predict the creatine kinase concentration in the
blood for this patient.

____________________________________________________________________________

In fact their creatine kinase concentration was 990 units/litre. Comment.

____________________________________________________________________________

____________________________________________________________________________

The complete data set is displayed in the scatter plot below.

[Scatter plot: Creatine kinase concentration. x-axis: Time (hours), 0 to 60; y-axis: Concentration (units/litre).]

Beware of extrapolating beyond the data.


• A fitted line will often do a good job of summarising a relationship for the range of the observed x
values.
• Predicting y values for x values that lie beyond the observed x values is dangerous. The linear
relationship may not be valid for those x values.

The removal of an x outlier will mean that the range of observed x values is reduced. This should be
discussed in the comparison between the two linear regressions (x outlier included and x outlier
excluded). It is possible that the supposed “outlier” may actually indicate the start of a change in the
pattern.

Non-Linearity

Sometimes a non-linear model is more appropriate. An example:


The data in the scatter plot below shows the progression of the fastest times for the men’s
marathon since the Second World War. We may want to use this data to predict the fastest time at
1 January 2010 (i.e. 64 years after 1 January 1946).
Source: http://www.athletix.org/

[Scatter plot: Men's Marathon Fastest Times. x-axis: Years since 1 Jan 1946; y-axis: Time (minutes).]

Exercise:

Concerns:

____________________________________________________________________________

____________________________________________________________________________

Possible solutions:

____________________________________________________________________________

____________________________________________________________________________

____________________________________________________________________________

____________________________________________________________________________

[Scatter plot with fitted quadratic curve: Men's Marathon Fastest Times. x-axis: Years since 1 Jan 1946; y-axis: Time (minutes). y = 0.0078x² - 0.769x + 144.68, R2 = 0.9592]

[Scatter plot with fitted exponential curve: Men's Marathon Fastest Times. x-axis: Years since 1 Jan 1946; y-axis: Time (minutes). y = 139.99e^(-0.0022x), R2 = 0.8521]

[Scatter plot with fitted power curve: Men's Marathon Fastest Times. x-axis: Years since 1 Jan 1946; y-axis: Time (minutes). y = 151.08x^(-0.0452), R2 = 0.9401]

[Scatter plot with fitted line: Men's Marathon Fastest Times, 1946-69. x-axis: Years since 1 Jan 1946; y-axis: Time (minutes). y = -0.6537x + 144.69, R2 = 0.9366]

[Scatter plot with fitted line: Men's Marathon Fastest Times, 1969-2003. x-axis: Years since 1 Jan 1946; y-axis: Time (minutes). y = -0.1089x + 131.66, R2 = 0.9024]

Exercise:

Comments:

____________________________________________________________________________

____________________________________________________________________________
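
One way to compare the fitted models is to evaluate each equation at x = 64 (1 January 2010), the prediction target mentioned earlier. The sketch below simply plugs x = 64 into the equations reported on the plots; the dictionary keys and variable names are labels chosen for this illustration, and the coefficients are rounded as displayed, so the results are approximate.

```python
import math

x = 64  # years since 1 Jan 1946, i.e. 1 January 2010

# Fitted equations as reported on the plots
predictions = {
    "quadratic": 0.0078 * x ** 2 - 0.769 * x + 144.68,
    "exponential": 139.99 * math.exp(-0.0022 * x),
    "power": 151.08 * x ** -0.0452,
    "linear, 1969-2003 only": -0.1089 * x + 131.66,
}

for name, minutes in predictions.items():
    print(f"{name}: {minutes:.1f} minutes")

# A quadratic a*x^2 + b*x + c turns at x = -b / (2a);
# past that point the fitted curve starts to rise again.
turning_point = 0.769 / (2 * 0.0078)
print(f"quadratic turning point at x = {turning_point:.1f}")
```

The models disagree by several minutes at x = 64 even though most have a high R2 over the observed range. The quadratic has the highest R2, but its turning point lies before x = 64, so it predicts that the fastest time gets slower, which cannot happen for a record progression. A high R2 does not by itself make a model suitable for extrapolation.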

The data in the scatter plot below comes from a random sample of 60 models of new cars taken from all
models on the market in New Zealand in May 2000. We want to use the engine size to predict the weight
of a car.

[Scatter plot: Models of New Zealand Cars. x-axis: Engine size (cc); y-axis: Weight (kg).]

Exercise:

Concerns:

____________________________________________________________________________

____________________________________________________________________________

Possible solutions:

____________________________________________________________________________

____________________________________________________________________________

____________________________________________________________________________

____________________________________________________________________________

Note: The solution need not be to exclude all linear models. It might be to restrict the range of values to which the linear model is applied.
