Statistics Notes

International Baccalaureate

MATHEMATICS
Analysis and Approaches (SL and HL)
Lecture Notes

TOPIC 4
STATISTICS

4.1 BASIC CONCEPTS OF STATISTICS

4.2 MEASURES OF CENTRAL TENDENCY AND SPREAD

4.3 FREQUENCY TABLES – GROUPED DATA

4.4 REGRESSION


TOPIC 4: STATISTICS AND PROBABILITY

4.1 BASIC CONCEPTS OF STATISTICS

In Statistics we deal with data collection, presentation, analysis and interpretation of results. Data can come from a

Population (the entire list of a specified group)

Sample (a subset of the Population)

We usually investigate a small sample of the population to draw conclusions about the whole population.

Numerical data can be

Discrete (finite or countable set), e.g. {10,20,30} or {0,1,2,3,…}
Continuous (interval), e.g. [40,100] or R

Data can be organized in several ways. We present some examples below.

Frequency table

Colored Balls   Frequency
Blue            13
Green           8
Red             10
Yellow          3

[Figure: pie chart of the same data]


Bar graph (for discrete data)

Colored Balls   Freq
Blue            13
Green           8
Red             10
Yellow          3

Histogram (for continuous data)

Age Frequency
[0,10) 7
[10,20) 5
[20,30) 1
[30,40) 3

Stem and leaf Diagram

Data
12, 14, 16, 16, 20, 21
21, 21, 25, 32, 39, 40
43, 44, 47, 48, 49, 53

Key: 1|3 represents 13

Stem   Leaf
1      2, 4, 6, 6
2      0, 1, 1, 1, 5
3      2, 9
4      0, 3, 4, 7, 8, 9
5      3
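The diagram above can be rebuilt mechanically. The following is only a sketch, assuming two-digit data where the tens digit is the stem and the units digit is the leaf; the function name stem_and_leaf is ours.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group two-digit values by tens digit (stem) and units digit (leaf)."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 13 -> stem 1, leaf 3
        rows[stem].append(leaf)
    return dict(rows)

data = [12, 14, 16, 16, 20, 21, 21, 21, 25, 32, 39, 40, 43, 44, 47, 48, 49, 53]
diagram = stem_and_leaf(data)

# Print one line per stem, in the same layout as the diagram
for stem in sorted(diagram):
    print(stem, "|", ", ".join(str(leaf) for leaf in diagram[stem]))
```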


As far as sampling is concerned, it is crucial to select a sample which is not biased. There are several sampling techniques which address this bias.

Suppose that we have a population of 100,000 people and wish to select a sample of 1000 people. If we select the first 1000 in a list, or the youngest 1000, there is certainly a bias in our selection.

Simple random sampling: We select 1000 people "out of a hat". Each member has an equal probability of being selected.

Systematic sampling: Since 100000/1000=100 (=period), we pick a random starting point (e.g. the 20th person) and then pick every 100th person (i.e. 20th, 120th, 220th, …).

Stratified sampling: We divide the population into subgroups (say men and women, or under and over 40 years old). We pick a sample from each group.

Quota sampling: As in stratified sampling, but we pick proportional samples according to the proportion of the subgroups in the population.

There are advantages and disadvantages to each method. Simple random sampling is fair but it may be very time consuming compared to systematic sampling. In systematic sampling, though, a periodic pattern in the population may introduce a bias. Suppose that the 100000 people are in groups of 100. If the first person of each group is the leader, then selecting every 100th person may provide a sample of only leaders or no leaders at all.
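The first two techniques, and the "leaders" problem, can be tried out with a short sketch. It assumes the people are simply numbered 1 to 100000 and that the leader of each group of 100 is the person numbered 1, 101, 201, …; these labels and the function names are our own.

```python
import random

def simple_random_sample(population, k, rng):
    """Every member has the same chance of being selected."""
    return rng.sample(population, k)

def systematic_sample(population, k, rng):
    """Pick a random start within the first period, then every period-th member."""
    period = len(population) // k
    start = rng.randrange(period)
    return population[start::period][:k]

rng = random.Random(0)                # fixed seed so the sketch is reproducible
population = list(range(1, 100001))   # people numbered 1..100000 (assumed labels)

srs = simple_random_sample(population, 1000, rng)
sys_sample = systematic_sample(population, 1000, rng)

# Leaders are assumed to be the first person of each group of 100
leaders_picked = sum(1 for person in sys_sample if person % 100 == 1)
print(len(srs), len(sys_sample), leaders_picked)
```

Whatever the starting point, the systematic sample here contains either all 1000 leaders or none, which is exactly the periodic bias described above.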


4.2 MEASURES OF CENTRAL TENDENCY AND SPREAD

Consider the following numerical data¹:

10, 20, 20, 20, 30, 30, 40, 50, 70, 70, 80

The total number of entries is n=11.

In order to describe these data we use

• 3 measures of central tendency
• 3 measures of spread

The first three measures indicate a representative central value which best describes the data, while the other three measures indicate whether our data are close to each other or widely dispersed.

MEASURES OF CENTRAL TENDENCY (The 3 M’s)

A) MEAN = The sum of all values divided by n.

Here
$$\text{mean} = \frac{10+20+20+20+30+30+40+50+70+70+80}{11} = \frac{440}{11} = 40$$

B) MODE = the most frequent value

Here
mode = 20

C) MEDIAN = The value in the middle (provided the values have been placed in ascending order).

Here, it is the sixth number in the list

median = 30

¹ This set of values is either a population or a sample.


NOTICE
• For the data 10, 20, 30
Median = 20
For the data 10, 20, 30, 40
Median = 25
That is, for an even number of data,
median = the mean of the two middle values

• The median is not the $\frac{n}{2}$-th entry as one would possibly expect; the median is the $\frac{n+1}{2}$-th entry.

For example,
if n=11, $\frac{n+1}{2}=6$, thus the median is the 6th entry (see the example above);
if n=10, $\frac{n+1}{2}=5.5$, thus the median is the mean of the 5th and 6th entries; for the 10 entries

10, 20, 30, 40, 50, 60, 70, 80, 90, 100

the median is the mean of 50 and 60. Hence median = 55

The median is also denoted by Q2 (the index 2 will be clarified soon)

• The mean is denoted by μ (or by $\bar{x}$). In fact, we use

the Greek letter μ for the whole population;
the Latin letter $\bar{x}$ for a sample of the population.

If our data are denoted by $x_1, x_2, \ldots, x_n$, the mean is given by

$$\mu = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}$$

or otherwise

$$\mu = \frac{\sum x_i}{n}$$
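The three measures can be checked without a GDC. This is a minimal sketch that uses the statistics module of Python's standard library on the data of this paragraph.

```python
from statistics import mean, median, mode

data = [10, 20, 20, 20, 30, 30, 40, 50, 70, 70, 80]

data_mean = mean(data)      # sum of all values divided by n
data_mode = mode(data)      # most frequent value
data_median = median(data)  # middle value of the sorted data

# For an even number of entries, median() averages the two middle values
even_median = median([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])

print(data_mean, data_mode, data_median, even_median)
```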


EXAMPLE 1
Find
a) the integers a ≤ b ≤ c, given that mean=4, mode=5, median=5.

The median implies that b=5. The mode implies that also c=5.
Then
$$\frac{a+5+5}{3} = 4 \Leftrightarrow a+10 = 12 \Leftrightarrow a = 2$$

Therefore, the numbers are 2,5,5.

b) the integers a ≤ b ≤ c ≤ d, given that mean=5, mode=7, median=6.

The median implies that b+c=12, for example b=c=6 or (b=5 and c=7).

Since the mode is 7, the value 7 must appear at least twice; as a ≤ b ≤ 6, we obtain c=d=7 and hence b=5.
Then
$$\frac{a+5+7+7}{4} = 5 \Leftrightarrow a+19 = 20 \Leftrightarrow a = 1$$

Therefore, the numbers are 1,5,7,7.

MEASURES OF SPREAD
We use the same set of data

10, 20, 20, 20, 30, 30, 40, 50, 70, 70, 80

A) STANDARD DEVIATION

The standard deviation is perhaps the most “reliable” measure of spread, as it takes all data into consideration. It measures how far the entries are from the mean. It can be found by using the GDC (directions will be given later on).

The standard deviation is denoted² either by σ or by $s_n$.

For our example the GDC gives σ = 22.96.

² In fact,
the Greek letter σ is used for the whole population;
the Latin letter $s_n$ is used for a sample of the population


B) RANGE = (maximum value) - (minimum value)

Here
range = 80-10 = 70

C) INTERQUARTILE RANGE = IQR = Q3 – Q1

where
Q1 = LOWER QUARTILE = the median of the values before Q2
Q3 = UPPER QUARTILE = the median of the values after Q2

Here, before the median Q2=30, we have 5 numbers, hence
Q1=20 (this is the 3rd entry)
Also,
Q3=70 (it is the 3rd entry from the end)
Therefore,
IQR = 70-20 = 50

As the estimation of the values Q1, Q2, Q3 is quite tricky, let us see some extra cases in the following example.

EXAMPLE 2
Remember that
• for the value of the median Q2 we consider the $\frac{n+1}{2}$-th entry.
• for the values of Q1 and Q3 we consider only the entries before and the entries after the median respectively.

a) For n=7 entries: 10, 20, 30, 40, 50, 60, 70
The median is Q2=40 (the 4th entry). Hence Q1=20, Q3=60.

b) For n=8 entries: 10, 20, 30, 40, 50, 60, 70, 80
The median is Q2=45 (the 4.5th entry). Hence Q1=25, Q3=65.

c) For n=9 entries: 10, 20, 30, 40, 50, 60, 70, 80, 90
The median is Q2=50 (the 5th entry). Hence Q1=25, Q3=75.

d) For n=10 entries: 10, 20, 30, 40, 50, 60, 70, 80, 90, 100
Then Q2=55 (the 5.5th entry). Hence Q1=30, Q3=80.
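The rule of Example 2 can be written as a short function. This is a sketch of the convention used in these notes (Q1 and Q3 are the medians of the entries strictly before and after the median position); other software may use a different quartile convention and give slightly different values. The function name quartiles is ours and at least 2 entries are assumed.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3); Q1 and Q3 are medians of the entries before/after Q2."""
    ordered = sorted(values)
    half = len(ordered) // 2   # number of entries before the median position
    lower = ordered[:half]     # entries before the median
    upper = ordered[-half:]    # entries after the median
    return median(lower), median(ordered), median(upper)

print(quartiles([10, 20, 30, 40, 50, 60, 70]))          # case a), n=7
print(quartiles([10, 20, 30, 40, 50, 60, 70, 80]))      # case b), n=8
print(quartiles([10, 20, 20, 20, 30, 30, 40, 50, 70, 70, 80]))
```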


NOTICE
The square of the standard deviation is called variance. That is
variance = σ² or $s_n^2$

For our example, σ² = 22.96² ≈ 527.27 (using the unrounded value of σ)

USE OF GDC

We can use the GDC to easily obtain all these measures.

For Casio CFX we select
• MENU
• STAT
• Complete List 1 with values of x (our data)
• CALC
• (1VAR): We obtain all the statistics.

Notice that
The standard deviation in the GDC is denoted by σx
The variance is not given; it is simply the square of σx

BOX AND WHISKER PLOT

Consider again the initial example
10, 20, 20, 20, 30, 30, 40, 50, 70, 70, 80

On an appropriate horizontal scale we mark 5 figures:

min, Q1, Q2, Q3, max

[Figure: box and whisker plot marking min, Q1, Q2, Q3, max]


This diagram is helpful, particularly when we have a large number of entries. It shows the “density” of data within the whole range. In fact, the box plot splits the whole range of data into 4 intervals. Generally speaking, each interval contains 25% of the entries. Thus the following conclusions can be drawn:

The lowest 25% is below Q1
The upper 25% is above Q3
The lowest 50% is below Q2
The upper 50% is above Q2
The middle 50% is between Q1 and Q3

MORE DETAILS

1) Percentiles
The values Q1, Q2, Q3 are also called
Q1 : 25th-percentile
Q2 : 50th-percentile
Q3 : 75th-percentile

Other percentiles may also be defined in a similar way; we will give further examples in the next paragraph.

2) Outliers

Very extreme values in a set of data (that is, very small or very large) may give a false impression of our data. They are known as outliers. We agree that

an outlier is any value
below Q1 – 1.5×IQR
or above Q3 + 1.5×IQR.

Such a value is viewed as being too far from the central values to be reasonable. In our example,

Q1 - 1.5×IQR = 20 - 1.5×50 = - 55

Q3 + 1.5×IQR = 70 + 1.5×50 = 145

i.e. there are no outliers.
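The outlier test is easy to automate. The sketch below takes Q1 and Q3 as given (here 20 and 70, as found above); the function names are ours, and the value 200 is a made-up extra entry added only to show an outlier being detected.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) limits Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values, q1, q3):
    """List the values lying strictly outside the two limits."""
    lower, upper = outlier_fences(q1, q3)
    return [v for v in values if v < lower or v > upper]

data = [10, 20, 20, 20, 30, 30, 40, 50, 70, 70, 80]
fences = outlier_fences(20, 70)
outliers = find_outliers(data, 20, 70)
with_extreme = find_outliers(data + [200], 20, 70)  # 200 is a made-up extra value

print(fences, outliers, with_extreme)
```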


MORE ON VARIANCE - STANDARD DEVIATION (only for HL)

If our data are $x_1, x_2, \ldots, x_n$

the variance is given by
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{n}$$

the standard deviation is given by
$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{n}}$$

For our example,

$$\text{variance} = \frac{(10-40)^2 + (20-40)^2 + (20-40)^2 + \cdots + (80-40)^2}{11} = 527.27$$

$$\text{standard deviation} = \sqrt{527.27} = 22.96$$

The variance measures the spread of the data as in fact we find
o the distance of each entry from the mean
o the squares of these distances
o the average of all these squared distances

An alternative and more practical formula for the variance is given by

$$\sigma^2 = \frac{\sum x_i^2}{n} - \mu^2$$

For our example, we have μ=40 and

$$\frac{\sum x_i^2}{n} = \frac{10^2 + 20^2 + 20^2 + \cdots + 80^2}{11} = \frac{23400}{11} = 2127.27$$
Hence
$$\sigma^2 = 2127.27 - 40^2 = 527.27$$

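Proof of the alternative formula

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{n} = \frac{\sum (x_i^2 - 2\mu x_i + \mu^2)}{n}$$

$$= \frac{\sum x_i^2}{n} - 2\mu\,\frac{\sum x_i}{n} + \frac{\sum \mu^2}{n} = \frac{\sum x_i^2}{n} - 2\mu\mu + \frac{n\mu^2}{n}$$

$$= \frac{\sum x_i^2}{n} - 2\mu^2 + \mu^2 = \frac{\sum x_i^2}{n} - \mu^2$$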
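As a numerical check of the two formulas, the following sketch computes the variance of our example both ways. It uses the population formulas (division by n), as in these notes.

```python
from math import sqrt, isclose

data = [10, 20, 20, 20, 30, 30, 40, 50, 70, 70, 80]
n = len(data)
mu = sum(data) / n

# Definition: average of the squared distances from the mean
variance_definition = sum((x - mu) ** 2 for x in data) / n

# Alternative formula: mean of the squares minus the square of the mean
variance_alternative = sum(x ** 2 for x in data) / n - mu ** 2

sigma = sqrt(variance_definition)
print(round(variance_definition, 2), round(variance_alternative, 2), round(sigma, 2))
```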
n n


4.3 FREQUENCY TABLES – GROUPED DATA

Consider again the numerical data:

10, 20, 20, 20, 30, 30, 40, 50, 70, 70, 80

The total number of entries is n=11.

An alternative way of presentation is the frequency table:

Data   Frequency
x      f
10     1
20     3
30     2
40     1
50     1
70     2
80     1
       n=11

Let us study again the basic measures for these data.

MEASURES OF CENTRAL TENDENCY (The 3 M’s)

A) MEAN = The sum of all values divided by n.

The MEAN is given by
$$\text{mean} = \frac{1\cdot 10 + 3\cdot 20 + 2\cdot 30 + 1\cdot 40 + 1\cdot 50 + 2\cdot 70 + 1\cdot 80}{11} = 40$$

In general, given that $f_i$ is the frequency of the entry $x_i$, the formula is

$$\mu = \frac{f_1 x_1 + f_2 x_2 + f_3 x_3 + \cdots}{n} \quad \text{or otherwise} \quad \mu = \frac{\sum f_i x_i}{n}$$


B) MODE = the most frequent value

It is very obvious now. The entry x of the highest frequency is
mode = 20

C) MEDIAN = The value in the middle

It is still the entry in position $\frac{n+1}{2}$, that is the 6th entry.

We can easily see that this is 30.

It helps here to add an extra column in the table above with the so-called cumulative frequencies:

Data   Frequency   Cumulative
x      f           frequency (c.f.)
10     1           1
20     3           4
30     2           6
40     1           7
50     1           8
70     2           10
80     1           11
       n=11

It simply gives the total number of entries up to each row. For example, the total number of entries up to 20 is 1+3=4.
The MEDIAN, i.e. the 6th entry, is 30.
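The cumulative frequency column, and the way it locates the median, can be sketched as follows. The helper entry_at is our own name; it returns the entry found in a given position of the ordered data by scanning the cumulative frequencies.

```python
from itertools import accumulate

values = [10, 20, 30, 40, 50, 70, 80]
freqs = [1, 3, 2, 1, 1, 2, 1]

n = sum(freqs)
cumulative = list(accumulate(freqs))  # running totals of the frequencies

def entry_at(position):
    """Return the entry in the given 1-based position of the ordered data."""
    for x, cf in zip(values, cumulative):
        if position <= cf:
            return x
    raise ValueError("position exceeds the number of entries")

mean = sum(f * x for x, f in zip(values, freqs)) / n
median = entry_at((n + 1) // 2)  # n is odd here, so (n+1)/2 is a whole position

print(cumulative, mean, median)
```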

MEASURES OF SPREAD

A) STANDARD DEVIATION

Again, it can be directly obtained by the GDC.

For our example the GDC gives σ = 22.96.

Thus the variance is σ² = 527.27


B) RANGE = (maximum value of x) - (minimum value of x)

It is very obvious here
range = 80-10 = 70

C) INTERQUARTILE RANGE = IQR = Q3 – Q1

The cumulative frequency table helps here as well.

The median Q2=30 is in the 6th position.

Thus, before the median we have 5 entries. Since, for these 5 entries, $\frac{n+1}{2} = \frac{5+1}{2} = 3$,
Q1=20 (this is the 3rd entry)
and
Q3=70 (this is the 3rd entry from the end)
Therefore,
IQR = 70-20 = 50

USE OF GDC

We can use the GDC to easily obtain all these measures.

For Casio CFX we select
• MENU
• STAT
• Complete List 1 with values of x (our data) and List 2 with frequencies
• CALC
• SET: we check the first two lines
  The first line is OK. (1Var XList :List1)
  For the second line (1Var Freq :----), select between
  F1: enter 1, if there are no frequencies
  F2: enter List 2 to consider frequencies
• Go back (EXIT)
• 1VAR: We obtain all the statistics.

Check the value of n first (number of entries), to ensure that all data have been considered.


NOTICE (for the GDC)

• The variance is not given; it is simply the square of σx
• Since the GDC gives minX, Q1, Med, Q3, maxX remember that

Range = maxX – minX
Interquartile Range = Q3 – Q1

The box and whisker plot uses exactly those 5 measures

• Extra information given:
Σx : the sum of all entries, i.e. x1+x2+x3+…
Σx²: the sum of the squares, i.e. x1²+x2²+x3²+…
sx : it is known as unbiased st. deviation (not in the syllabus!)

GROUPED DATA

Suppose that 100 students took an exam and obtained scores from 1 to 60 (full marks), according to the following table:

Score          Midpoint   No of students   Cumulative
(x)            (for x)    (frequency f)    frequency (cf)
0 < x ≤ 10     5          8                8
10 < x ≤ 20    15         12               20
20 < x ≤ 30    25         10               30
30 < x ≤ 40    35         25               55
40 < x ≤ 50    45         35               90
50 < x ≤ 60    55         10               100
                          n=100

i.e. 8 students obtained scores from 1 up to 10, and so on.

• The mean and the standard deviation are still calculated as in a usual frequency table, but now x1,x2,x3,… are the midpoints of the intervals.
For example,
$$\mu = \frac{8\cdot 5 + 12\cdot 15 + 10\cdot 25 + 25\cdot 35 + 35\cdot 45 + 10\cdot 55}{100} = 34.7$$


These measures may also be obtained by the GDC, where LIST1 contains the midpoints of x. Here,
μ=34.7 σ=14.31
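The same two values can be reproduced with a short sketch, where each interval is replaced by its midpoint. The list groups simply restates the table above as (lower bound, upper bound, frequency).

```python
from math import sqrt

# (lower bound, upper bound, frequency) for each score interval
groups = [(0, 10, 8), (10, 20, 12), (20, 30, 10),
          (30, 40, 25), (40, 50, 35), (50, 60, 10)]

midpoints = [(low + high) / 2 for low, high, _ in groups]
freqs = [f for _, _, f in groups]
n = sum(freqs)

# Mean and variance of a frequency table, with midpoints in place of x
mu = sum(f * x for x, f in zip(midpoints, freqs)) / n
variance = sum(f * (x - mu) ** 2 for x, f in zip(midpoints, freqs)) / n
sigma = sqrt(variance)

print(round(mu, 1), round(sigma, 2))
```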

• Moreover, instead of the mode we have the modal group here. That is the interval of the highest frequency. In our example, the modal group is 40 < x ≤ 50.

• For the median Q2 and the quartiles Q1 and Q3 we need to draw the so-called cumulative frequency diagram
x-axis: values of x (we consider upper bounds of intervals)
y-axis: cumulative frequencies

x: up to   ≤10   ≤20   ≤30   ≤40   ≤50   ≤60
y: c.f.    8     20    30    55    90    100

[Figure: cumulative frequency curve, with the readings Q1=25, Q2=38, Q3=46]

For the estimation of Q1, Q2, Q3 follow

Step 1: Divide the y-axis into four equal parts
(Here we divide at y=25, y=50, y=75)
Step 2: Draw three horizontal lines until you meet the curve
Step 3: Draw three vertical lines from the intersection points
Obtain Q1, Q2, Q3 on the x-axis (as in the graph above)

Below that graph we can easily draw the box and whisker plot:

Min=0 Q1=25 Q2=38 Q3=46 Max=60
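The three steps amount to reading the curve "backwards", from y to x. The sketch below imitates this with straight-line segments between the plotted points (linear interpolation), which is our assumption; a hand-drawn smooth curve may give slightly different readings, e.g. about 45.7 here instead of the 46 read from the graph. The function name read_from_curve is ours, and the cumulative frequencies are assumed to be strictly increasing.

```python
def read_from_curve(points, target_cf):
    """Find x where the cumulative frequency reaches target_cf (linear interpolation)."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 <= target_cf <= y1:
            return x0 + (target_cf - y0) * (x1 - x0) / (y1 - y0)
    raise ValueError("target lies outside the cumulative frequency range")

# (upper bound of interval, cumulative frequency), starting from (0, 0)
points = [(0, 0), (10, 8), (20, 20), (30, 30), (40, 55), (50, 90), (60, 100)]
n = points[-1][1]

q1 = read_from_curve(points, 0.25 * n)  # horizontal line at y=25
q2 = read_from_curve(points, 0.50 * n)  # horizontal line at y=50
q3 = read_from_curve(points, 0.75 * n)  # horizontal line at y=75

print(q1, q2, round(q3, 1))
```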

• Remember that the values Q1, Q2, Q3 are also called
Q1 : 25th-percentile
Q2 : 50th-percentile
Q3 : 75th-percentile

In the same way we can find any percentile. For example, for the 40th-percentile
Estimate 40% of n: here 40% of 100 students is 40;
Draw a horizontal line at y=40 until you meet the curve;
Then draw a vertical line;
Hence
40th-percentile = 35.

In other words, 40% of the students have scores below 35.

• Let us check if there are outliers:

IQR = 46-25 = 21

Q1 - 1.5×IQR = 25 - 1.5×21 = - 6.5

Q3 + 1.5×IQR = 46 + 1.5×21 = 77.5

There are no scores lower than -6.5 or greater than 77.5, that is, there are no outliers.


MORE ON VARIANCE - STANDARD DEVIATION (only for HL)

Given that $f_i$ is the frequency of the entry $x_i$, the formulas now become:

$$\sigma^2 = \frac{\sum f_i (x_i - \mu)^2}{n}$$
thus
$$\sigma = \sqrt{\frac{\sum f_i (x_i - \mu)^2}{n}}$$

In our example,

$$\text{variance} = \frac{1\cdot(10-40)^2 + 3\cdot(20-40)^2 + 2\cdot(30-40)^2 + \cdots}{11} = 527.27$$

and then
$$\text{standard deviation} = \sqrt{527.27} = 22.96$$

The alternative formula for the variance takes the form

$$\sigma^2 = \frac{\sum f_i x_i^2}{n} - \mu^2$$

For our example, we have μ=40 and

$$\frac{\sum f_i x_i^2}{n} = \frac{1\cdot 10^2 + 3\cdot 20^2 + 2\cdot 30^2 + \cdots}{11} = \frac{23400}{11} = 2127.27$$
Hence
$$\sigma^2 = 2127.27 - 40^2 = 527.27$$

(The proof of the alternative formula is very similar to what we have seen in section 4.2)


4.4 REGRESSION

We have a list of paired data. For example

x   10    12    15    20    23    28    30
y   120   135   174   213   270   301   305

We assume that x is the independent variable and y is the dependent variable. Let us also see these points (x,y) on a scatter diagram.

[Figure: scatter diagram of the points (x,y)]

The main question here is whether there is a linear relationship between the values of x and the corresponding values of y.

There is a parameter r, called the correlation coefficient³, that measures the extent of this relationship. It takes values
-1 ≤ r ≤ 1

The closer r is to the ends ±1, the more strongly our data are linearly related.
(r close to -1 implies a negative slope while r close to +1 implies a positive slope)
The closer r is to 0, the less our data are linearly related.

There is also a line y=ax+b that best fits our data; it is known as the regression line. We can easily obtain these details by using a GDC.

³ It is known as Pearson’s product-moment correlation coefficient


USE OF GDC

For Casio CFX we select
• MENU
• STAT
• Complete List 1 with values of x; List 2 with values of y
• CALC
• REG
• X
• aX+b : look at the values of a,b,r.

For our example,

r = 0.99 (there is a very strong correlation between x and y)
a = 9.83
b = 23.1
The regression line is y = 9.83x+23.1

[Figure: scatter diagram with the regression line y = 9.83x+23.1]

By using the regression line y=f(x) we may predict values of y corresponding to values of x that are not in the list. For example

for x=18, we estimate y = 9.83·18+23.1 ≈ 200
for x=40, we estimate y = 9.83·40+23.1 ≈ 416

Notice that x=18 is within the range of our list while x=40 is not. The estimate f(18)=200 is known as an interpolation, f(40)=416 as an extrapolation. In general, interpolations are more reliable than extrapolations.

Notice. In order to predict a value of x corresponding to a given y we do not use the same regression line. We find a new regression line for x on y. In our example, the GDC gives x = 0.0997y-1.92
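The values given by the GDC can be reproduced with the usual least-squares formulas, which are not derived in these notes; the sketch below is therefore only an illustration, and the function name linear_regression is ours. It returns the slope a, the intercept b and the correlation coefficient r, and is applied twice: once for y on x and once for x on y.

```python
from math import sqrt

def linear_regression(xs, ys):
    """Least-squares line y = a*x + b and Pearson's r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    a = sxy / sxx              # slope
    b = mean_y - a * mean_x    # the line passes through (mean_x, mean_y)
    r = sxy / sqrt(sxx * syy)
    return a, b, r

xs = [10, 12, 15, 20, 23, 28, 30]
ys = [120, 135, 174, 213, 270, 301, 305]

a, b, r = linear_regression(xs, ys)  # regression line of y on x
c, d, _ = linear_regression(ys, xs)  # regression line of x on y: x = c*y + d

predict_18 = a * 18 + b  # interpolation
predict_40 = a * 40 + b  # extrapolation

print(round(a, 2), round(b, 1), round(r, 2), round(c, 4), round(d, 2))
```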


CHARACTERISTICS OF THE REGRESSION LINE y=ax+b

The regression line
• passes through the point M($\bar{x}$, $\bar{y}$), where
$\bar{x}$ = the mean of the values of x
$\bar{y}$ = the mean of the values of y

• separates the points into (almost) two halves: roughly half of the points are above and half below the line.

The values of $\bar{x}$, $\bar{y}$ can also be obtained by the GDC (together with other statistics). In the STAT mode, after inserting the values of x and y, select

• CALC
• 2VAR: We obtain all the statistics, separately for x’s and y’s

In our example
$\bar{x}$ = 19.7, $\bar{y}$ = 216.9
Thus the line passes through the point M(19.7, 216.9).

CHARACTERISTICS OF THE CORRELATION COEFFICIENT r

The correlation between x and y is characterised according to the value of r as follows:

Value of r         Correlation
-1 to -0.75        strong negative correlation
-0.75 to -0.5      moderate negative correlation
-0.5 to -0.25      weak negative correlation
-0.25 to 0.25      very weak or no correlation
0.25 to 0.5        weak positive correlation
0.5 to 0.75        moderate positive correlation
0.75 to 1          strong positive correlation

To better understand the correlation coefficient r, let us see some characteristic cases (find the results below in your GDC for practice).


Data and results (a scatter diagram accompanies each case)

x   1   2   3   4   5
y   2   4   6   8   10
r = 1, perfect positive correlation
Regression line: y=2x

x   1   2   3   4   5
y   10  8   6   4   2
r = -1, perfect negative correlation
Regression line: y=-2x+12

Let us slightly modify our data

x   1   2   3   4   5
y   2   3   7   8   10
r = 0.98, strong positive correlation
Regression line: y=2.1x-0.3

x   1   2   3   4   5
y   10  8   7   3   2
r = -0.98, strong negative correlation
Regression line: y=-2.1x+12.3

and a final extreme case

x   1   2   3   4   5
y   8   2   5   2   8
r = 0, no correlation at all
Regression line: y=5
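For practice without a GDC, the five cases above can be checked with a few lines of code. This is a sketch based on the standard least-squares formulas (an assumption, since the notes rely on the GDC); the function name fit and the rounding to the precision shown above are our own choices.

```python
from math import sqrt

def fit(xs, ys):
    """Return (a, b, r) for the line y = a*x + b, rounded as in the notes."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    a = sxy / sxx
    b = mean_y - a * mean_x
    r = sxy / sqrt(sxx * syy)
    return round(a, 1), round(b, 1), round(r, 2)

xs = [1, 2, 3, 4, 5]
cases = [
    [2, 4, 6, 8, 10],   # perfect positive correlation
    [10, 8, 6, 4, 2],   # perfect negative correlation
    [2, 3, 7, 8, 10],   # strong positive correlation
    [10, 8, 7, 3, 2],   # strong negative correlation
    [8, 2, 5, 2, 8],    # no correlation at all
]

results = [fit(xs, ys) for ys in cases]
for a, b, r in results:
    print("a =", a, " b =", b, " r =", r)
```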
