0 - MTH 4272 - Notes and Exercises
Contents
Analyzing Data – Review
Regression Analysis – overview
Scatter plots
Correlation Coefficient (IXL grade 12 - AA.6)
Outliers in Scatter Plots (IXL grade 12 - AA.5)
Finding the Linear Correlation Coefficient using Box Method [IXL grade 12 - AA.7]
*Scale
Regression Line
1. Mayer Line method
2. Median-Median Line method
Finding the Regression Line and Correlation Coefficient using a Calculator
Practice
More practice (IXL grade 12 – AA.7 and AA.8)
Interpolating and Extrapolating values (IXL grade 12 - AA.9)
Quadratic vs. Linear Regressions - Example
Graphing Linear and Quadratic functions – Review
Linear function (IXL grade 10 – L.6)
Quadratic function (IXL grade 10 – R.6)
Analyzing Data – Review
When analyzing your data, you look at 4 primary characteristics:
1. Sample size,
2. Shape,
3. Spread, and
4. Location (Central tendency)
1. Sample Size
The larger the sample, the more certain you are that the results will be representative of the population.
[Figures: small sample size, larger sample size, even larger sample size]
2. Shape
If the data is clumped around one particular value
If each outcome is equally likely
If the outcomes trail off more in one direction than another
3. Spread
One measure of spread is the range (the difference between the highest value and the smallest value in a data set).
The age of students in a class: 17, 18, 18, 19, 19, 19, 20, 22, 29, 36, 43, 54, 61
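As a quick check of the definition, the range of this class can be computed directly. This is only a small sketch in Python; the variable names are illustrative and not part of the notes.

```python
# Ages of the students in the class (from the example above)
ages = [17, 18, 18, 19, 19, 19, 20, 22, 29, 36, 43, 54, 61]

# Range = highest value - smallest value
data_range = max(ages) - min(ages)
print(data_range)  # 61 - 17 = 44
```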
4. Location or Central Tendency (Mean, Median, Mode)
We will use 3 measures to locate the bulk of the data (central tendency): the mean, the median and the mode.
Example data (the age of students in a class): 17, 17, 17, 17, 18, 18, 18, 19, 19, 20, 20, 20, 22

Mean
Definition: The mean is the average value.
How to find it: Add up all the values and divide by the number of values.
Example: 242/13 ≈ 18.6
When to use: The mean is often the best measure of central tendency. However, the mean can be influenced by outliers (a number that is very far from the other values).

Median
Definition: The median is the value of the middle number.
How to find it: First find the location of the middle value using (n + 1)/2.
Example: (13 + 1)/2 = 7, so the median is the 7th value, which is 18.
When to use: The median is the best measure when there are outlier(s) because it gives a better sense of a “typical value”.

Mode
Definition: The mode is the value that repeats the most often.
How to find it: Find the value with the highest frequency.
Example: 17 repeats 4 times.
When to use: The mode is used when the data is qualitative (words, not numbers). It is also used for finding the winner of a vote.
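The three calculations above can be sketched in Python as follows. This is an illustration only: it follows the steps described in the notes (sum divided by count, position (n + 1)/2, highest frequency), and the variable names are assumptions made for the example.

```python
from collections import Counter

ages = [17, 17, 17, 17, 18, 18, 18, 19, 19, 20, 20, 20, 22]

# Mean: add up all the values and divide by the number of values
mean = sum(ages) / len(ages)             # 242 / 13

# Median: sort the data, then (n + 1) / 2 gives the position of the middle value
ordered = sorted(ages)
n = len(ordered)
position = (n + 1) / 2                   # 7.0 when n = 13
if position.is_integer():
    median = ordered[int(position) - 1]  # positions count from 1, list indexes from 0
else:
    # Even n: the position ends in .5, so average the two values on either side
    median = (ordered[int(position) - 1] + ordered[int(position)]) / 2

# Mode: the value with the highest frequency
mode, frequency = Counter(ages).most_common(1)[0]

print(round(mean, 1), median, mode, frequency)  # 18.6 18 17 4
```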
For example, suppose we add a student who is 100 years old to our class.
Original class: 17, 17, 17, 17, 18, 18, 18, 19, 19, 20, 20, 20, 22
New class: 17, 17, 17, 17, 18, 18, 18, 19, 19, 20, 20, 20, 22, 100
[Figures: two histograms titled "Age of Students", one for the original class and one for the new class, with Age on the horizontal axis. Marked values: Mean = 24.4, Median = 18.5]
The mean changed from 18.6 to 24.4, whereas the median only changed from 18 to 18.5. The median is less affected by
the “outlier”.
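The effect of the outlier can be reproduced with Python's standard `statistics` module. This is a sketch for checking the numbers above; the variable names are illustrative.

```python
from statistics import mean, median

original = [17, 17, 17, 17, 18, 18, 18, 19, 19, 20, 20, 20, 22]
new_class = original + [100]   # add the 100-year-old student

for label, ages in (("Original class", original), ("New class", new_class)):
    print(label, "mean:", round(mean(ages), 1), "median:", median(ages))

# How far each measure moved when the outlier was added
mean_shift = mean(new_class) - mean(original)
median_shift = median(new_class) - median(original)
print(round(mean_shift, 1), median_shift)
```

The mean moves by several years while the median moves by only half a year, which is why the median is preferred when outliers are present.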
Regression Analysis – overview
Your community of 70,000 people has only 13 sports facilities and you think there should be more. You researched
various surrounding communities and came up with the following data table. Your plan is to use the data to argue that
your community needs more sports facilities.
Population (thousands)   Number of sports facilities
66   17
78   23
80   27
92   32
97   36
98   40
Step 1 – Put the data into a Scatter Plot (*label axes and Title)
The scatter plot allows us to visualize the data. We can see how scattered the points are (the strength of the correlation) and what type of correlation exists between the two variables (positive or negative).
Step 2 – Find the Linear Regression Line (𝑦 = 𝐵𝑥 + 𝐴) and the Linear Correlation Coefficient (r)
𝑦 = 0.535𝑥 − 16.1
Using the equation 𝑦 = 0.535𝑥 − 16.1, we can find two points and then connect them with a line.
Let x = 40
𝑦 = 0.535(40) − 16.1
𝑦 = 5.3
Let x = 100
𝑦 = 0.535(100) − 16.1
𝑦 = 37.4
Step 4 – Analyse the data
In addition, the r-value (correlation coefficient) tells us the strength and type of correlation.
An r greater than 0.95 indicates a very strong correlation. Because the r-value is positive, the correlation is positive: as the population increases, the number of sports facilities also increases.
Once we have the linear regression line, if the correlation is strong, we can use the equation to interpolate (find values within our data set) or extrapolate (find values outside of our data set).
Since the correlation between population and sports facilities is very strong, we can use the equation of the regression
line to find the number of sports facilities there should be for a population of 70 thousand.
𝑦 = 0.535𝑥 − 16.1
Let x = 70
𝑦 = 0.535(70) − 16.1
𝑦 = 21.35
By using the linear regression line, we find that for a population of 70 thousand, there should be approximately 21 sports facilities.
Therefore, we could use this data to argue that your community should have more sports facilities. With a population of 70 thousand, you should have approximately 21; however, you only have 13.
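The substitutions in this example can be checked with a few lines of Python. This is a sketch that only evaluates the regression equation given in the notes; the function name `predicted_facilities` is made up for the illustration.

```python
def predicted_facilities(population_thousands):
    """Regression line from the notes: y = 0.535x - 16.1 (x in thousands of people)."""
    return 0.535 * population_thousands - 16.1

# The two points used to draw the line, then the value for our community
for x in (40, 100, 70):
    print(x, round(predicted_facilities(x), 2))

# Compare the model's value for 70 thousand people with the 13 existing facilities
current_facilities = 13
shortfall = round(predicted_facilities(70)) - current_facilities
print(shortfall)  # 21 - 13 = 8
```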
Scatter plots
1. For each scatter plot, draw freehand the curve that best represents it, and determine which functional model (linear, quadratic or greatest integer function) seems to be the most representative of the situation.
Intensity: The more tightly the points cluster around a clearly identifiable linear trend, the stronger the correlation will be.
Sign: If the cluster of points forms an ascending linear trend, the correlation will be qualified as positive. Conversely, for
a descending trend, the correlation will be qualified as negative.
The following scale identifies the strength of the correlation based on the r-value.
1. Find the Linear Correlation Coefficients of the following:
3.
4.
Answers
1. a) r = 0.48 b) r = 0.53 c) r = 0.70
2. a) negative b) r = -0.74 c) weak
3. r = 0.61, weak and positive correlation
4. r = -0.28, zero correlation
*Scale
Choose an appropriate scale for the graphs
1.
2.
3.
4.
5.
6.
Regression Line
A regression line is a line (equation) that best describes the behavior of a set of data (line of best fit). By using the regression line, an analyst can forecast future behaviors. Regression lines are widely used in the financial sector (stock prices), in business (sales, inventories), and in science.
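The Mayer line method named in this section is commonly described as follows: order the points by x-coordinate, split them into two groups, find the mean point of each group, and take the line through those two mean points. The sketch below assumes that description. The function name and the sample data are hypothetical, and the way an odd number of points is split (extra point in the second group) is an assumption, since conventions differ.

```python
def mayer_line(points):
    """Return (slope, intercept) of the Mayer line for a list of (x, y) points."""
    ordered = sorted(points)                    # order the points by x-coordinate
    half = len(ordered) // 2
    groups = (ordered[:half], ordered[half:])   # odd n: extra point goes to the 2nd group (assumption)

    # Mean point of each group: P1 and P2
    (x1, y1), (x2, y2) = [
        (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
        for g in groups
    ]

    slope = (y2 - y1) / (x2 - x1)               # slope of the line through P1 and P2
    intercept = y1 - slope * x1                 # solve y1 = slope * x1 + intercept
    return slope, intercept

# Hypothetical data, not taken from the exercises
sample = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 6), (6, 8)]
a, b = mayer_line(sample)
print(round(a, 2), round(b, 2))
```

For this sample the mean points are P1 (2, 3.33) and P2 (5, 6), so the line has a slope of 8/9 and a y-intercept of 14/9.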
1.
Using the Mayer line method, determine the equation of the regression line for each of the following situations.
2.
3.
4.
5. *Warning: check for outliers before calculating
Answers
2. Median-Median Line method
1.
2.
Determine the equation of the regression line using the median-median line method.
3.
Answers
1. M1 (328, 155) M2 (365, 150) M3 (394, 143) Pavg (362.33, 149.33) y = - 0.18x + 214.55
2. M1 (413, 42) M2 (309.5, 25) M3 (116, 13) Pavg (279.5, 26.7) y = 0.1x – 1.28
3. M1 (427.5, 741) M2 (472, 810.5) M3 (558.5, 988.5) Pavg (486, 846.67) y = 1.89x – 71.87
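The answers above are consistent with the following reading of the last steps of the median-median method: the slope is taken from M1 to M3, and the line passes through Pavg, the average of the three median points. The sketch below assumes that reading. It also assumes the slope is rounded to two decimals before the y-intercept is computed, which is inferred from the answer key rather than stated in the notes. The function name is hypothetical.

```python
def median_median_line(m1, m2, m3, slope_digits=2):
    """Return (slope, intercept, p_avg) from the three median points M1, M2, M3."""
    # Slope of the line through M1 and M3, rounded as in the answer key (assumption)
    slope = round((m3[1] - m1[1]) / (m3[0] - m1[0]), slope_digits)

    # Pavg: average of the three median points
    px = (m1[0] + m2[0] + m3[0]) / 3
    py = (m1[1] + m2[1] + m3[1]) / 3

    # The line has the slope above and passes through Pavg
    intercept = py - slope * px
    return slope, intercept, (px, py)

# Median points from answer 1
slope, intercept, p_avg = median_median_line((328, 155), (365, 150), (394, 143))
print(slope, round(intercept, 2), tuple(round(c, 2) for c in p_avg))
# -0.18 214.55 (362.33, 149.33)
```

Finding M1, M2 and M3 from the raw data (the median x and median y of each third of the ordered points) is not shown here because the exercise data is not reproduced in this section.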
Finding the Regression Line and Correlation Coefficient using a Calculator
Use your calculator to find the linear regression line (𝑦 = 𝐵𝑥 + 𝐴) and the linear correlation coefficient (r).
The buttons you need to press depend on your calculator. Here are a few examples:
Practice
Use your calculator to find the parameters (r, B, and A) based on the following data:
1.
x   y
1   7
2   15
3   31

2.
x   y
1   130
2   175
3   223
4   321

3.
x   y
1   0.33
2   0.27
3   0.06
4   0.14
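To check a calculator result, the parameters can also be computed from their usual formulas. The sketch below assumes the calculator performs an ordinary least-squares fit and reports the Pearson correlation coefficient; the function name `linear_regression` is made up for the illustration. It is applied to the first data set (x = 1, 2, 3 and y = 7, 15, 31).

```python
from math import sqrt

def linear_regression(xs, ys):
    """Return (r, B, A) for the least-squares line y = Bx + A."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of squared deviations and of cross-products
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    B = sxy / sxx                 # slope
    A = mean_y - B * mean_x       # y-intercept
    r = sxy / sqrt(sxx * syy)     # linear correlation coefficient
    return r, B, A

r, B, A = linear_regression([1, 2, 3], [7, 15, 31])
print(round(r, 3), round(B, 2), round(A, 2))  # 0.982 12.0 -6.33
```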
Interpolating and Extrapolating values (IXL grade 12 - AA.9)
Interpolate: you are seeking to estimate a value located within the interval of the x-coordinates of the scatter plot.
Extrapolate: you are seeking to estimate a value located outside the interval of the x-coordinates of the scatter plot
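The two definitions come down to a simple test on the x-coordinate, sketched below. The function name and the list of years are hypothetical and serve only to illustrate the test.

```python
def estimate_type(x, xs):
    """Say whether estimating at x is an interpolation or an extrapolation."""
    # Inside the interval of the x-coordinates: interpolation; outside: extrapolation
    return "interpolate" if min(xs) <= x <= max(xs) else "extrapolate"

# Hypothetical x-coordinates (years) of a scatter plot
years = [1994, 1998, 2002, 2006]
print(estimate_type(2000, years))  # interpolate
print(estimate_type(2010, years))  # extrapolate
```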
1.
c) From the trend of the scatter plot reflected by the regression line, estimate the percentage of secondary students
who smoked in Québec in 2010.
d) According to the trend reflected by the regression line, in what year did approximately 8% of secondary students
smoke in Québec?
2.
d) Can this mathematical model be used to extrapolate based on the population of Québec in the 19th century or on the
population in a hundred years? Explain your answer.
Answers
2. y = 0.044x – 81.4 a) 6.73 million b) 2002 c) 7.92 million d) No, those years are too far in the past and future
Quadratic vs. Linear Regressions - Example
In some cases, a Linear Regression line may not be the best model to represent a correlation between 2 variables.
Let’s say we want to predict the stopping distance for a car travelling at 125 miles/hour (that’s 200 km/h). We do several
tests and we collect the following data:
Initial speed (miles/hour)   Stopping distance (m)
10   1
15   2
20   5
25   8
30   15
35   18
40   22
45   27
50   35
55   48
60   55
65   54
70   80
75   94
80   120
Step 1 – Put the data into a Scatter Plot (*label axes and Title)
Step 2 – Find the Linear Regression Line (𝑦 = 𝐵𝑥 + 𝐴) and the Linear Correlation Coefficient (r)
Using your calculator you’ll find Linear Regression: 𝑦 = 1.53𝑥 − 29.8 𝑟 = 0.946
*Note: We can use our calculators to find the quadratic regression curve; however, they will not give us the quadratic correlation coefficient. For this course, the quadratic equation and correlation coefficient will be given to you.
Let x = 80
𝑦 = 0.0247(80)² − 0.700(80) + 8.80
𝑦 = 110.9
(80 , 110.9)
Step 4 – Analyse the data
When we are comparing 2 models (Linear vs. Quadratic), we must compare the correlation coefficients and the scatter
plots/graphs.
Correlation coefficient r
Linear: 0.946 (strong)
Quadratic: 0.991 (very strong; stronger than the linear)
Scatter plot
By comparing the two models, we can conclude that the Quadratic function best represents the correlation between the
speed of a car and the stopping distance because:
Once we have chosen which model best represents the data, we use that model (in this case the quadratic) to find the stopping distance of a car travelling at 125 miles/h.
Let x = 125
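The substitution for x = 125 can be sketched in Python using the two equations given in this example. As an extra check that is not part of the notes, the sketch also compares the sum of squared errors of each model on the measured data; this is an assumption about how one might quantify "better fit", added alongside the correlation coefficients rather than replacing them. The function names are illustrative.

```python
speeds = list(range(10, 85, 5))   # 10, 15, ..., 80 miles/hour
distances = [1, 2, 5, 8, 15, 18, 22, 27, 35, 48, 55, 54, 80, 94, 120]  # metres

def linear(x):
    return 1.53 * x - 29.8                      # linear regression from the notes

def quadratic(x):
    return 0.0247 * x**2 - 0.700 * x + 8.80     # quadratic regression from the notes

def sum_squared_errors(model):
    """Total squared gap between the measured and the modelled distances."""
    return sum((y - model(x)) ** 2 for x, y in zip(speeds, distances))

sse_linear = sum_squared_errors(linear)
sse_quadratic = sum_squared_errors(quadratic)
print(sse_quadratic < sse_linear)   # True: the quadratic stays closer to the data

# Stopping distance predicted by the chosen (quadratic) model at 125 miles/hour
print(round(quadratic(125), 1))     # 307.2
```

Since 125 miles/hour lies well outside the tested speeds (10 to 80 miles/hour), this value is an extrapolation.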
Graphing Linear and Quadratic functions – Review
or Find two points by choosing values of x or y (ex. Let x=0 and solve for y)
Find extra point(s) by choosing a value of x and finding the point symmetric to it.