Statistics Notes
MATHEMATICS
Analysis and Approaches (SL and HL)
Lecture Notes
TOPIC 4
STATISTICS
Data are either discrete or continuous:
Discrete data take values in a finite or countable set, e.g. {10,20,30} or {0,1,2,3,…}
Continuous data take values in an interval, e.g. [40,100] or R
TOPIC 4: STATISTICS AND PROBABILITY
Colored balls Frequency
Blue 13
Green 8
Red 10
Yellow 3
Age Frequency
[0,10) 7
[10,20) 5
[20,30) 1
[30,40) 3
10, 20, 20, 20, 30, 30, 40, 50, 70, 70, 80
Here
$$\text{mean} = \frac{10+20+20+20+30+30+40+50+70+70+80}{11} = \frac{440}{11} = 40$$
median = 30 (the 6th of the 11 ordered entries)
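As a quick check of the two values above, here is a minimal sketch using Python's standard `statistics` module. The variable names are illustrative and not part of the notes.

```python
from statistics import mean, median

# The data set used throughout this section (already in ascending order).
data = [10, 20, 20, 20, 30, 30, 40, 50, 70, 70, 80]

data_mean = mean(data)      # sum of the entries divided by n = 11
data_median = median(data)  # middle entry of the ordered list

print("mean =", data_mean)
print("median =", data_median)
```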
NOTICE
For the data 10, 20, 30
Median = 20
For the data 10, 20, 30, 40
Median = 25
That is, for an even number of data,
median = the mean of the two middle values
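The median is not the $\frac{n}{2}$-th entry, as one might expect; it is the $\frac{n+1}{2}$-th entry (for an even number of data this position falls halfway between the two middle values).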
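The position rule can be sketched in a few lines of Python. The function name `median_by_position` is hypothetical; it simply locates position (n+1)/2 and averages the neighbouring entries when that position is not a whole number.

```python
import math

def median_by_position(values):
    """Median as the entry at 1-based position (n + 1) / 2 of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2                     # 2 for n = 3, 2.5 for n = 4
    lower = ordered[math.floor(position) - 1]  # entry just below (or at) the position
    upper = ordered[math.ceil(position) - 1]   # entry just above (or at) the position
    return (lower + upper) / 2                 # equal entries when the position is whole

print(median_by_position([10, 20, 30]))
print(median_by_position([10, 20, 30, 40]))
```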
10, 20, 30, 40, 50, 60, 70, 80, 90, 100
$$\mu = \frac{\sum x_i}{n}$$
EXAMPLE 1
Find
a) the integers $a \le b \le c$, given that mean=4, mode=5, median=5.
The median implies that b=5. The mode implies that also c=5.
Then
$$\frac{a+5+5}{3} = 4 \;\Rightarrow\; a + 10 = 12 \;\Rightarrow\; a = 2$$
MEASURES OF SPREAD
We use the same set of data
10, 20, 20, 20, 30, 30, 40, 50, 70, 70, 80
A) STANDARD DEVIATION
In fact,
the Greek letter σ is used for the whole population;
the Latin letter $s_n$ is used for a sample of the population.
As the estimation of the values Q1, Q2, Q3 is quite tricky, let us see
some extra cases in the following example.
b) For n=8 entries: 10, 20, 30, 40, 50, 60, 70, 80
The median is Q2=45 (the 4.5th entry). Hence Q1=25, Q3=65.
c) For n=9 entries: 10, 20, 30, 40, 50, 60, 70, 80, 90
The median is Q2=50 (the 5th entry). Hence Q1=25, Q3=75.
d) For n=10 entries: 10, 20, 30, 40, 50, 60, 70, 80, 90, 100
Then Q2=55 (the 5.5th entry). Hence Q1=30, Q3=80.
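The three cases above all follow the same recipe: find the median, then take the median of the entries below it and of the entries above it. A minimal Python sketch of that recipe follows. The function name `quartiles_by_halves` is hypothetical, and other conventions for quartiles exist and may give slightly different values.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention.

    For odd n the middle entry belongs to neither half, as in case c) above.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]      # entries below the median position
    upper_half = ordered[n - half:]  # entries above the median position
    return median(lower_half), median(ordered), median(upper_half)

for n in (8, 9, 10):
    entries = [10 * k for k in range(1, n + 1)]  # 10, 20, ..., 10n
    print(n, quartiles_by_halves(entries))
```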
NOTICE
The square of the standard deviation is called the variance. That is,
variance = $\sigma^2$ or $s_n^2$
USE OF GDC
Notice that
The standard deviation in the GDC is denoted by σx
The variance is not given; it is simply the square of σx
[Box-and-whisker plot marking min, Q1, Q2, Q3, max]
MORE DETAILS
1) Percentiles
The values Q1, Q2, Q3 are also called
Q1 : 25th-percentile
Q2 : 50th-percentile
Q3 : 75th-percentile
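2) Outliers
An outlier is a value below Q1 - 1.5×IQR or above Q3 + 1.5×IQR.
Such a value is viewed as being too far from the central values to
be reasonable. In our example,
Q1 - 1.5×IQR = 20 - 1.5×50 = -55
Q3 + 1.5×IQR = 70 + 1.5×50 = 145
so the data contain no outliers.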
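The outlier check can be sketched as below, assuming the usual 1.5×IQR rule that the calculation above applies. The value Q3 = 70 is inferred from Q1 = 20 and IQR = 50, and the function name `outlier_fences` is hypothetical.

```python
def outlier_fences(q1, q3):
    """Lower and upper limits beyond which a value counts as an outlier."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [10, 20, 20, 20, 30, 30, 40, 50, 70, 70, 80]

# Q1 = 20 is taken from the notes; Q3 = 70 is inferred from IQR = 50.
low, high = outlier_fences(20, 70)
outliers = [x for x in data if x < low or x > high]

print("fences:", low, high)
print("outliers:", outliers)
```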
$$\sigma^2 = \frac{\sum x_i^2}{n} - \mu^2$$
For our example, we have $\bar{x} = 40$ and
$$\frac{\sum x_i^2}{n} = \frac{10^2 + 20^2 + 20^2 + \cdots + 80^2}{11} = \frac{23400}{11} = 2127.27$$
Hence
$$\sigma^2 = 2127.27 - 40^2 = 527.27$$
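$$\begin{aligned}
\sigma^2 &= \frac{\sum (x_i - \mu)^2}{n} = \frac{\sum (x_i^2 - 2\mu x_i + \mu^2)}{n} \\
&= \frac{\sum x_i^2}{n} - 2\mu\,\frac{\sum x_i}{n} + \frac{\sum \mu^2}{n} = \frac{\sum x_i^2}{n} - 2\mu\mu + \frac{n\mu^2}{n} \\
&= \frac{\sum x_i^2}{n} - 2\mu^2 + \mu^2 = \frac{\sum x_i^2}{n} - \mu^2
\end{aligned}$$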
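The identity can also be checked numerically on the running data set. The sketch below computes the variance both from the definition and from the shortcut formula; the variable names are illustrative only.

```python
import math

data = [10, 20, 20, 20, 30, 30, 40, 50, 70, 70, 80]
n = len(data)
mu = sum(data) / n

# Definition: mean of the squared deviations from the mean.
variance_definition = sum((x - mu) ** 2 for x in data) / n

# Shortcut: mean of the squares minus the square of the mean.
variance_shortcut = sum(x ** 2 for x in data) / n - mu ** 2

sigma = math.sqrt(variance_shortcut)

print(variance_definition, variance_shortcut, sigma)
```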
Data Frequency
x f
10 1
20 3
30 2
40 1
50 1
70 2
80 1
n=11
$$\mu = \frac{f_1 x_1 + f_2 x_2 + f_3 x_3 + \cdots}{n}$$
or otherwise
$$\mu = \frac{\sum f_i x_i}{n}$$
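A minimal sketch of the weighted-mean formula applied to the frequency table above; the dictionary layout (value mapped to frequency) is just one convenient choice, not something the notes prescribe.

```python
# Frequency table from the notes: data value -> frequency.
table = {10: 1, 20: 3, 30: 2, 40: 1, 50: 1, 70: 2, 80: 1}

n = sum(table.values())                          # total number of entries
mu = sum(f * x for x, f in table.items()) / n    # sum of f_i * x_i over n

print("n =", n, "mean =", mu)
```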
It helps here to add an extra column in the table above with the
so-called cumulative frequencies:
Data x   Frequency f   Cumulative frequency
10       1             1
20       3             4
30       2             6
40       1             7
50       1             8
70       2             10
80       1             11
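The cumulative frequencies are running totals of the frequency column, and they tell us which data value sits at a given position in the ordered list. A small sketch, with the hypothetical helper `entry_at`, that uses them to locate the median:

```python
from itertools import accumulate

values = [10, 20, 30, 40, 50, 70, 80]
freqs = [1, 3, 2, 1, 1, 2, 1]

cumulative = list(accumulate(freqs))  # running totals of the frequencies
n = cumulative[-1]                    # the last total is the number of entries

def entry_at(position):
    """Value of the entry at the given 1-based position of the ordered data."""
    for value, cf in zip(values, cumulative):
        if position <= cf:
            return value
    raise ValueError("position exceeds n")

# Average the entries on either side of position (n + 1) / 2;
# for odd n both lookups return the same entry.
median = (entry_at((n + 1) // 2) + entry_at((n + 2) // 2)) / 2

print(cumulative, median)
```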
MEASURES OF SPREAD
A) STANDARD DEVIATION
USE OF GDC
GROUPED DATA
Suppose that 100 students took an exam and obtained scores from
1 to 60 (full marks), according to the following table:
Score x     Midpoint   Frequency   Cumulative frequency
0 to 10     5          8           8
10 to 20    15         12          20
20 to 30    25         10          30
30 to 40    35         25          55
40 to 50    45         35          90
50 to 60    55         10          100
n=100
x: up to 10 20 30 40 50 60
y: c.f 8 20 30 55 90 100
Below that graph we can easily draw the box-and-whisker plot.
In the same way we can find any percentile. For example, for the
40th percentile:
Estimate 40% of n: here 40% of 100 students is 40;
Draw a horizontal line at y=40 until you meet the curve;
Then draw a vertical line down to the x-axis and read off the value.
Hence
40th percentile = 35.
There are no scores lower than -6.5 or greater than 77.5, that is
there are no outliers.
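The graphical reading can be imitated by interpolating along the cumulative frequency graph. The sketch below assumes straight-line segments between the plotted points and a starting point (0, 0), neither of which the notes state; a hand-drawn smooth curve, as used for the value 35 above, can give a slightly different reading. The function name `percentile` is hypothetical.

```python
# Points of the cumulative frequency graph; (0, 0) is an assumed starting point.
upper_bounds = [0, 10, 20, 30, 40, 50, 60]
cum_freq = [0, 8, 20, 30, 55, 90, 100]

def percentile(p):
    """Estimate the p-th percentile by linear interpolation on the graph."""
    target = p * cum_freq[-1] / 100  # the height y at which to read the graph
    for i in range(1, len(cum_freq)):
        if target <= cum_freq[i]:
            x0, x1 = upper_bounds[i - 1], upper_bounds[i]
            y0, y1 = cum_freq[i - 1], cum_freq[i]
            return x0 + (target - y0) / (y1 - y0) * (x1 - x0)
    raise ValueError("p must be between 0 and 100")

print(percentile(25), percentile(40), percentile(50), percentile(75))
```

With straight segments the 40th percentile comes out as 34, close to the value 35 read from the curve.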
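Given that $f_i$ is the frequency of the entry $x_i$, the formulas now
become:
$$\sigma^2 = \frac{\sum f_i (x_i - \mu)^2}{n}$$
thus
$$\sigma = \sqrt{\frac{\sum f_i (x_i - \mu)^2}{n}}$$
In our example,
$$\sigma^2 = \frac{1(10-40)^2 + 3(20-40)^2 + \cdots + 1(80-40)^2}{11} = \frac{5800}{11} = 527.27$$
and then
standard deviation = $\sqrt{527.27} = 22.96$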
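Alternatively, the variance can be computed as
$$s_n^2 = \frac{\sum f_i x_i^2}{n} - \bar{x}^2$$
For our example, we have $\bar{x} = 40$ and
$$\frac{\sum f_i x_i^2}{n} = \frac{1 \cdot 10^2 + 3 \cdot 20^2 + 2 \cdot 30^2 + \cdots}{11} = \frac{23400}{11} = 2127.27$$
Hence
$$\sigma^2 = 2127.27 - 40^2 = 527.27$$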
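Both frequency-table formulas can be checked against each other with a short sketch. The dictionary representation of the table and the variable names are assumptions made for illustration.

```python
import math

# Frequency table from the notes: data value -> frequency.
table = {10: 1, 20: 3, 30: 2, 40: 1, 50: 1, 70: 2, 80: 1}

n = sum(table.values())
mean = sum(f * x for x, f in table.items()) / n

# Definition: weighted mean of the squared deviations.
variance = sum(f * (x - mean) ** 2 for x, f in table.items()) / n

# Shortcut: weighted mean of the squares minus the square of the mean.
variance_alt = sum(f * x ** 2 for x, f in table.items()) / n - mean ** 2

std_dev = math.sqrt(variance)

print(variance, variance_alt, std_dev)
```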
4.4 REGRESSION
x 10 12 15 20 23 28 30
y 120 135 174 213 270 301 305
[Scatter diagram of the data: y against x]
The strength of the linear relation is measured by the correlation coefficient r, which lies between -1 and +1.
The closer r is to the ends ±1, the more our data are linearly related.
(-1 implies a negative slope while +1 implies a positive slope)
The closer r is to 0, the less our data are linearly related.
There is also a line y=ax+b that best fits our data; it is known as
the regression line. We can easily obtain these details by using a GDC.
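For readers without a GDC to hand, the same quantities can be sketched in Python. This assumes the usual Pearson formula for r and the least-squares formulas for a and b, which the notes leave to the calculator; all variable names are illustrative.

```python
import math

xs = [10, 12, 15, 20, 23, 28, 30]
ys = [120, 135, 174, 213, 270, 301, 305]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Sums of products of deviations from the means.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

r = s_xy / math.sqrt(s_xx * s_yy)  # correlation coefficient
a = s_xy / s_xx                    # slope of the least-squares line
b = y_bar - a * x_bar              # intercept of the least-squares line

print("r =", round(r, 3), " a =", round(a, 2), " b =", round(b, 1))
```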
USE OF GDC
[Scatter diagram with the regression line y = 9.83x + 23.1]
Notice that x=18 is within the range of our list while x=40 is not.
The estimate f(18)=200 is known as an interpolation, f(40)=416 as an extrapolation.
In general, interpolations are more reliable than extrapolations.
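The distinction can be made explicit in code. The sketch below refits the line with the standard least-squares formulas (an assumption, since the notes use the GDC) and labels each estimate according to whether x lies inside the range of the data. The helper names `fit_line` and `estimate` are hypothetical.

```python
xs = [10, 12, 15, 20, 23, 28, 30]
ys = [120, 135, 174, 213, 270, 301, 305]

def fit_line(xs, ys):
    """Least-squares slope and intercept of y = a x + b."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    a = s_xy / s_xx
    return a, y_bar - a * x_bar

a, b = fit_line(xs, ys)

def estimate(x):
    """Predicted y together with the kind of estimate it is."""
    kind = "interpolation" if min(xs) <= x <= max(xs) else "extrapolation"
    return a * x + b, kind

for x in (18, 40):
    value, kind = estimate(x)
    print(x, round(value), kind)
```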
CALC
2VAR: We obtain all the statistics, separately for x’s and y’s
In our example
$\bar{x} = 19.7$, $\bar{y} = 216.9$
The regression line always passes through the mean point $(\bar{x}, \bar{y})$.
Thus the line passes through the point M(19.7, 216.9).
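A short worked equation may help explain why the line passes through M. Assuming the GDC fits the usual least-squares line (an assumption; the notes do not state the method), the intercept satisfies $b = \bar{y} - a\bar{x}$, so substituting $x = \bar{x}$ returns $\bar{y}$:

```latex
b = \bar{y} - a\bar{x}
\quad\Longrightarrow\quad
a\bar{x} + b = \bar{y}
```

With the rounded values a = 9.83, b = 23.1 and $\bar{x}$ = 19.7 this gives 9.83 × 19.7 + 23.1 ≈ 216.8, which agrees with $\bar{y}$ = 216.9 up to the rounding of a and b.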
In each of the following examples, x = 1, 2, 3, 4, 5.
y values          r        Correlation                     Regression line
2, 4, 6, 8, 10    1        perfect positive correlation    y = 2x
10, 8, 6, 4, 2    -1       perfect negative correlation    y = -2x + 12
2, 3, 7, 8, 10    0.98     strong positive correlation     y = 2.1x - 0.3
10, 8, 7, 3, 2    -0.98    strong negative correlation     y = -2.1x + 12.3
8, 2, 5, 2, 8     0        no correlation at all           y = 5
[Each example is shown with a scatter diagram and its regression line]
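The r values quoted for the five examples can be reproduced with a short sketch, assuming the usual Pearson formula for the correlation coefficient. The function name `correlation` and the dictionary labels are illustrative.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

xs = [1, 2, 3, 4, 5]
examples = {
    "perfect positive": [2, 4, 6, 8, 10],
    "perfect negative": [10, 8, 6, 4, 2],
    "strong positive": [2, 3, 7, 8, 10],
    "strong negative": [10, 8, 7, 3, 2],
    "no correlation": [8, 2, 5, 2, 8],
}

# Round to two decimals, as in the notes.
r_values = {name: round(correlation(xs, ys), 2) for name, ys in examples.items()}

for name, r in r_values.items():
    print(name, r)
```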