Descriptive Statistics

Describing data
Data matrix

Data collected on students in a statistics class on a variety of variables
(each column is a variable, each row is an observation):

#    gender   sleep   bedtime   countries   dread
1    male     5       12-2      13          medium
2    female   7.15    10-12     7           low
3    female   5.5     12-2      1           very high
4    female   3       10-12     0           low
...
86   male     8       12-2      5           very low

24
Types of variables

all variables
• quantitative (numerical): continuous, discrete
• qualitative (categorical): nominal, ordinal

25
Types of variables (cont.)

From the previous data matrix:

#   gender   sleep   bedtime   countries   dread
1   male     5       12-2      13          medium
2   female   7.15    10-12     7           low
...

• gender: categorical, nominal
• sleep: numerical, continuous
• bedtime: categorical, ordinal
• countries: numerical, discrete
• dread: categorical, ordinal

26
From raw data to effective representations

• Once the data have been collected (in a data matrix) and the types of variables we are working with are known, the data should be displayed and visualized
• The presentation must be clear and concise: one should quickly get a “feel” for the essential characteristics of the data
• Different graphical tools are used depending on the type of variables
• Visualizing data often reveals interesting features and can be very informative: plot your data before analysing it!

27
Case I: one single variable
Frequency tables

• Number of accidents per day at 2 intersections:

  I1: 0,3,0,5,1,0,4,2,2,1,5,5,1,5,2,3,0,2,5,0
  I2: 3,2,6,5,4,3,1,4,4,3,3,2,1,2,5,4,3,3,2,0

• Small number of distinct, discrete values → it is convenient to represent the data in an (absolute) frequency table, which presents each distinct value along with its frequency of occurrence
• Step by step (a short code sketch follows the list):
  1. Sort the data
  2. Count the occurrences of each distinct value
  3. Group in tabular form

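For concreteness, here is a minimal Python sketch (our own illustration, not part of the slides) that reproduces the absolute and relative frequency table for I1; the variable names are ours.

```python
# Illustrative sketch: absolute and relative frequency table for intersection I1.
from collections import Counter

i1 = [0, 3, 0, 5, 1, 0, 4, 2, 2, 1, 5, 5, 1, 5, 2, 3, 0, 2, 5, 0]

counts = Counter(i1)          # value -> absolute frequency
n = len(i1)

print("value  freq  rel. freq")
for value in sorted(counts):
    f = counts[value]
    print(f"{value:>5}  {f:>4}  {f / n:.2f}")
```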
28
Frequency tables (cont.)

1. Sorted number of accidents per day:

I1 0,0,0,0,0,1,1,1,2,2,2,2,3,3,4,5,5,5,5,5
I2 0,1,1,2,2,2,2,3,3,3,3,3,3,4,4,4,4,5,5,6

2. Occurrences:

I1: value 0 occurs 5 times, 1 occurs 3 times, 2 occurs 4 times, 3 occurs 2 times, 4 occurs 1 time, 5 occurs 5 times
I2: value 0 occurs 1 time, 1 occurs 2 times, 2 occurs 4 times, 3 occurs 6 times, 4 occurs 4 times, 5 occurs 2 times, 6 occurs 1 time

29
Frequency tables (cont.)

3. Absolute frequency table:


I1:  Value      0  1  2  3  4  5
     Frequency  5  3  4  2  1  5

I2:  Value      0  1  2  3  4  5  6
     Frequency  1  2  4  6  4  2  1

• How often are there no accidents?
  5 days out of 20 at I1 and 1 day out of 20 at I2
• How many days are there with more than one accident?
  4 + 2 + 1 + 5 = 12 out of 20 at I1
  4 + 6 + 4 + 2 + 1 = 17 out of 20 at I2
30
Visualize the frequency table: the line graph

[Figure: line graphs of the frequency tables for Intersection 1 and Intersection 2 (x-axis: number of accidents/day, y-axis: number of days)]


31
Visualize the frequency table: the bar plot

[Figure: bar plots of the frequency tables for Intersection 1 and Intersection 2 (x-axis: number of accidents/day, y-axis: number of days)]


32
Visualize the frequency table: the frequency polygon

[Figure: frequency polygons for Intersection 1 and Intersection 2 (x-axis: number of accidents/day, y-axis: number of days)]


33
Symmetry

• A set of data is said to be symmetric about the value x0 if the frequencies of the values x0 − c and x0 + c are the same for all c
• That is, for every constant c, there are just as many data points that are c less than x0 as there are c greater than x0
• Data that are “close to” being symmetric are said to be approximately symmetric → judge from the graphical representation
• For the number of accidents/day:
  • for I2 we have symmetry around x0 = 3
  • for I1 we do not have symmetry

34
Relative frequency graphs

• Let f represent the frequency of occurrence of some data value x
• The relative frequency of x is the ratio f/n, where n is the total number of available observations
• Relative frequency table for the number of accidents/day (n = 20 for each intersection):

  I1:  Value x          0     1     2     3     4     5
       Frequency f      5     3     4     2     1     5
       Rel. freq. f/n   0.25  0.15  0.20  0.10  0.05  0.25    (n = 20, sum = 1)

  I2:  Value x          0     1     2     3     4     5     6
       Frequency f      1     2     4     6     4     2     1
       Rel. freq. f/n   0.05  0.10  0.20  0.30  0.20  0.10  0.05    (n = 20, sum = 1)

35
Visualize the relative frequency table: the line graph

[Figure: line graphs of the relative frequencies for Intersection 1 and Intersection 2 (x-axis: number of accidents/day, y-axis: relative frequency)]


36
Visualize the relative frequency table: the bar plot

[Figure: bar plots of the relative frequencies for Intersection 1 and Intersection 2 (x-axis: number of accidents/day, y-axis: relative frequency)]


37
Visualize the relative frequency table: the frequency polygon

[Figure: relative-frequency polygons for Intersection 1 and Intersection 2 (x-axis: number of accidents/day, y-axis: relative frequency)]


38
Displays for qualitative data

So far the variable (number of accidents) took values in a small set of integers: it is a discrete numerical variable with few distinct values.

• Suppose now that for I1 you also have information on the traffic light at that intersection, i.e., you know whether the traffic light was working on each day
• Such information is provided as a label, for example “W” (working) or “NW” (not working)
• The sample is given below:
  W,NW,W,W,NW,W,NW,NW,W,W,NW,NW,W,NW,NW,NW,W,NW,NW,W
• This is categorical data → bar chart (absolute/relative frequency) or pie chart (relative frequency)

39
Pie chart

Value x   Frequency f   Rel. frequency f/n
W         9             9/20 = 0.45
NW        11            11/20 = 0.55
          n = 20        sum = 1

[Figure: pie chart of the relative frequencies, “Working” vs “Not Working”]

• A few cons of pie charts:
  • Use them only with few categories (no more than 5 or 6)
  • Hard to read/process for most people (areas are hard to compare)
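As a small aside (ours, assuming matplotlib is available), here is a sketch of the usually more readable alternative: a bar chart of the relative frequencies of the two traffic-light states.

```python
# Sketch: bar chart of the relative frequencies of the traffic-light states at I1.
from collections import Counter
import matplotlib.pyplot as plt

states = ["W", "NW", "W", "W", "NW", "W", "NW", "NW", "W", "W",
          "NW", "NW", "W", "NW", "NW", "NW", "W", "NW", "NW", "W"]

counts = Counter(states)
n = len(states)

plt.bar(["Working", "Not working"], [counts["W"] / n, counts["NW"] / n])
plt.ylabel("Relative frequency")
plt.title("Traffic light state at I1 over 20 days")
plt.show()
```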
40
Frequencies

• Relative frequencies are, in a sense, standardised: they will be useful again when comparing two samples/populations
• In our example we observed the two intersections for the same number of days, so comparison is easy. That is not always the case.
• Frequency tables are very common but can only be constructed when there is a small number of possible values: what happens when we measure continuous outcomes or variables that can take many values?

41
Grouped data

• For some quantitative data sets, the number of distinct values is too large to use a line graph/bar plot
• In such cases,
  • we divide the values into groupings called class intervals or bins, and then
  • we plot the number of data values falling in each interval
• How to choose the number of classes (width of bins)? → trade-off between
  • choosing too few classes (bigger bins), at the cost of losing information about the actual data values within a class
  • choosing too many classes (smaller bins), at the cost of the frequencies of each class being too small for a pattern to be discernible

42
Visualizing grouped data: the histogram

• Step by step:
  1. Arrange the data in increasing order
  2. Choose class intervals (bins) so that all data points are covered
  3. Construct a (relative) frequency table
  4. Draw adjacent bars with heights determined by the frequencies in step 3
• Why? → To draw attention to important features of the data:
  • How symmetric the data are
  • How spread out the data are
  • Whether there are intervals with high levels of data concentration (modality and skewness)
  • Whether there are gaps in the data
  • Whether some data values are far apart from others (outliers)

43
The problem of number of classes

[Figure: three histograms of the same data drawn with different numbers of classes (bin widths), showing different apparent patterns]

The choice of classes (and their width) shows different patterns in the data.
A first rule of thumb: same-size classes are usually to be preferred. Methods to choose the “optimal” number of classes do exist (but we do not discuss them).
44
In practice

• Suppose that you know how many pedestrians crossed I1 on each of those 20 days:
  164, 142, 194, 180, 151, 200, 158, 168, 209, 169,
  200, 201, 205, 157, 161, 168, 210, 197, 211, 182

• Step by step:
  1. Sorted data: 142, 151, 157, 158, 161, 164, 168, 168, 169, 180, 182, 194, 197, 200, 200, 201, 205, 209, 210, 211
  2. Class intervals of width 10:
     142 → [140,150); 151, 157, 158 → [150,160); 161, 164, 168, 168, 169 → [160,170); none → [170,180);
     180, 182 → [180,190); 194, 197 → [190,200); 200, 200, 201, 205, 209 → [200,210); 210, 211 → [210,220)

45
In practice (cont.)

3. Frequency table:

Class intervals   Frequency f   Rel. frequency f/n
[140, 150)        1             1/20 = 0.05
[150, 160)        3             3/20 = 0.15
[160, 170)        5             5/20 = 0.25
[170, 180)        0             0/20 = 0.00
[180, 190)        2             2/20 = 0.10
[190, 200)        2             2/20 = 0.10
[200, 210)        5             5/20 = 0.25
[210, 220)        2             2/20 = 0.10
                  n = 20        sum = 1

46
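A short sketch (ours, assuming NumPy is available) that reproduces this grouped frequency table with bins of width 10 over [140, 220).

```python
# Sketch: grouped (absolute and relative) frequencies of the pedestrian counts.
import numpy as np

pedestrians = [164, 142, 194, 180, 151, 200, 158, 168, 209, 169,
               200, 201, 205, 157, 161, 168, 210, 197, 211, 182]

edges = np.arange(140, 230, 10)                  # 140, 150, ..., 220
freq, _ = np.histogram(pedestrians, bins=edges)
rel_freq = freq / len(pedestrians)

for lo, hi, f, w in zip(edges[:-1], edges[1:], freq, rel_freq):
    print(f"[{lo}, {hi})  f = {f}  f/n = {w:.2f}")
```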
In practice (cont.)

4. Histogram plot (absolute and relative frequency):

[Figure: histograms of the pedestrian counts at I1 with class intervals of width 10 (left: absolute frequency, right: relative frequency; x-axis: number of pedestrians crossing intersection 1)]

Symmetric? Spread of the data? Gaps?


47
In practice (cont.)

3. Frequency table with different class intervals:

Class intervals   Frequency f   Rel. frequency f/n
[140, 160)        4             4/20 = 0.20
[160, 180)        5             5/20 = 0.25
[180, 200)        4             4/20 = 0.20
[200, 220)        7             7/20 = 0.35
                  n = 20        sum = 1

48
In practice (cont.)

4. Histogram plot with different class intervals:

[Figure: histograms of the pedestrian counts at I1 with class intervals of width 20 (left: absolute frequency, right: relative frequency; x-axis: number of pedestrians crossing intersection 1)]

Symmetric? Spread of the data? Gaps?


49
Case II: Paired variables
Pairs of variables

• Sometimes a data set consists of pairs of values that have some relationship to each other → (x, y)
• If x is qualitative and y is quantitative → distinct histograms, one for each category of x. (Overlayed) frequency polygons are also an option.
• If both x and y are quantitative → scatter plot

50
Example 1: distinct histograms

• x: whether the traffic light is working or not at I1
• y: number of pedestrians crossing I1
• As a table:

  days  1    2    3    4    5    6    7    8    9    10
  x     W    NW   W    W    NW   W    NW   NW   W    W
  y     164  142  194  180  151  200  158  168  209  169

  days  11   12   13   14   15   16   17   18   19   20
  x     NW   NW   W    NW   NW   NW   W    NW   NW   W
  y     200  201  205  157  161  168  210  197  211  182

51
Example 1: distinct histograms (cont.)

[Figure: histograms of the number of pedestrians crossing I1, one for days when the traffic light is working and one for days when it is not working]

52
Example 1: distinct frequency polygons

Very similar information to histograms

[Figure: frequency polygons of the number of pedestrians crossing I1, one for days when the traffic light is working and one for days when it is not working]

53
Example 1: overlayed frequency polygons

It is easy to overlay frequency polygons to compare groups:

[Figure: overlayed relative-frequency polygons for working and not-working days (x-axis: number of pedestrians crossing I1)]


54
Example 2: scatter plot

• x: daily energy consumption (in kWh) for the production of traffic lights
• y: daily production (in 1000 pieces) of traffic lights
• As a table:

  days  1    2    3    4    5    6    7    8    9    10
  x     754  814  749  787  759  754  773  732  712  764
  y     164  141  194  180  151  200  158  168  209  169

  days  11   12   13   14   15   16   17   18   19   20
  x     759  769  750  806  751  740  733  728  751  749
  y     200  201  205  157  161  168  210  197  211  182

55
Example 2: scatter plot (cont.)

[Figure: scatter plot of daily production (in 1000 pieces) against daily energy consumption (in kWh), highlighting the first point (x1,y1) = (754,164)]

56


Example 2: scatter plot (cont.)

[Figure: the same scatter plot, now highlighting (x1,y1) = (754,164) and (x2,y2) = (814,141)]

57


Example 2: scatter plot (cont.)

[Figure: the same scatter plot, now highlighting (x1,y1) = (754,164), (x2,y2) = (814,141) and (x3,y3) = (749,194)]

58


Example 2: scatter plot (cont.)

[Figure: complete scatter plot of all 20 (x, y) pairs: daily production (in 1000 pieces) against daily energy consumption (in kWh)]

59


Summary statistics
Sample statistics

• Suppose that we have in our possession sample data from some underlying population
• Examples: number of accidents/day at I1 and I2, condition of the traffic lights, number of pedestrians crossing I1, ...
• Up to now we showed how to describe and portray sample data in their entirety
• Now we want to determine summary measures of the sample → enter the sample statistics
• Sample statistics are numerical quantities computed from sample data
• Why? → To provide a simple way to characterise the sample
• Why? → A sample statistic is a point estimate of a specific population property, called a parameter, for a specific variable (statistical inference)
• The type of summary statistic depends on the type of data and on what we are trying to characterise

60
Types of statistics

• Central tendency of the data:
  • sample mean
  • sample median
  • sample mode
• Amount of variation/spread inside the data:
  • sample variance
  • sample standard deviation
• Relations between pairs of quantitative data:
  • sample covariance
  • sample correlation coefficient

61
Central tendency measures
Sample mean

• The population mean, denoted with µ, is a parameter defined by

  µ = (x1 + … + xN)/N = (Σ_{i=1}^N xi)/N,

  but typically it is impossible to observe the population units x1, …, xN (except in very rare cases)
• Let (x1, …, xn) be a sample of dimension n
• The sample mean, denoted as x̄, is a sample statistic defined by

  x̄ = (x1 + … + xn)/n = (Σ_{i=1}^n xi)/n

• Example: energy consumption in 5 days (EG, in kWh)

  x = (x1, x2, x3, x4, x5) = (164, 141, 194, 180, 151)

  x̄ = (164 + 141 + 194 + 180 + 151)/5 = 830/5 = 166 kWh (average EG)

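A minimal sketch (ours, not from the slides) of the same computation in Python:

```python
# Sample mean of the 5-day energy-consumption data.
x = [164, 141, 194, 180, 151]

x_bar = sum(x) / len(x)
print(x_bar)   # 166.0 (kWh)
```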
62
Properties of sample mean

• Property I. Define yi = xi + c, i = 1, …, n. Then

  ȳ = x̄ + c

• Property II. Define yi = a·xi, i = 1, …, n. Then

  ȳ = a·x̄

• What happens if yi = a·xi + c?

63
Weighted sample mean

• Is it possible to compute x̄ from (relative) frequency tables? Yes! Remember we could build frequency tables from sequences of measurements.
• What we need:
  • xi – the i-th distinct value in the sample, i = 1, …, k (k distinct values)
  • fi – absolute frequency of xi
  • n – sample size (n = f1 + … + fk = Σ_{i=1}^k fi)
  • wi – relative frequency of xi (wi = fi/n and Σ_{i=1}^k wi = 1)

64
Weighted sample mean (cont.)

Values xi   Frequency fi   Rel. frequency wi = fi/n
x1          f1             w1
x2          f2             w2
...         ...            ...
xk          fk             wk
Total       n              1

• Then

  x̄ = (Σ_{i=1}^k fi·xi)/n = Σ_{i=1}^k wi·xi

65
Example: number of accidents/day at I1

• x = (0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5, 5)
• k = 6 different values: 0, 1, 2, 3, 4, 5

  Values xi   Frequency fi   Rel. frequency wi = fi/n
  0           5              0.25
  1           3              0.15
  2           4              0.20
  3           2              0.10
  4           1              0.05
  5           5              0.25
  Total       20             1

  x̄ = (Σ_{i=1}^k fi·xi)/n = (0×5 + 1×3 + 2×4 + 3×2 + 4×1 + 5×5)/20 = 46/20 = 2.3

  or, equivalently,

  x̄ = 0×0.25 + 1×0.15 + 2×0.20 + 3×0.10 + 4×0.05 + 5×0.25 = 2.3

66
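A small sketch (ours) of the weighted sample mean computed from the frequency table; it matches the mean computed directly from the raw data.

```python
# Weighted sample mean from a frequency table.
values = [0, 1, 2, 3, 4, 5]
freqs = [5, 3, 4, 2, 1, 5]

n = sum(freqs)
x_bar = sum(f * x for x, f in zip(values, freqs)) / n
print(x_bar)   # 2.3

# Equivalent form using relative frequencies w_i = f_i / n
weights = [f / n for f in freqs]
print(sum(w * x for x, w in zip(values, weights)))   # 2.3 (up to rounding)
```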
Deviation from the sample mean

• Let (x1, …, xn) be a sample of size n
• Let x̄ be the sample mean
• The deviations are the differences between the data values and the sample mean: xi − x̄
• Key property:

  Σ_{i=1}^n (xi − x̄) = 0

67
In practice

• x = (164, 141, 194, 180, 151) – energy consumption in 5 days (EG, in kWh)
• x̄ = 166 kWh

  (values sorted around the mean: 141, 151, 164, [x̄ = 166], 180, 194)

• Deviations:
  x1 − x̄ = 164 − 166 = −2,  x2 − x̄ = 141 − 166 = −25,
  x3 − x̄ = 194 − 166 = 28,  x4 − x̄ = 180 − 166 = 14,
  x5 − x̄ = 151 − 166 = −15

• Is the sum 0?

  Σ_{i=1}^5 (xi − x̄) = −2 − 25 + 28 + 14 − 15 = 0
68
Sample median

• The sample mean is deeply affected by extreme values:

  x = (1, 1, 1, 1, 1, 50) → x̄ = 9.17

• We need an index that indicates the center of a sample but is not affected by extreme values
• Such a measure is the sample median, denoted with m, which is the middle value when the data are ranked from the smallest to the largest value:

  #(xi > m) = #(xi < m)

69
Sample median (cont.)

• Step by step:
  • Order the sample values from smallest to largest, for a sample of size n
  • If n is odd, m is the element of the ordered sample in position (n − 1)/2 + 1
  • If n is even, m is the average of the elements of the ordered sample in positions n/2 and n/2 + 1, respectively

70
Sample median: examples

Example 1:

• Energy consumption in 5 days (EG, in kWh): x = (164, 141, 194, 180, 151)
• Sorted sample: x̃ = (141, 151, 164, 180, 194)
• n = 5 → value in position 3 → m = x̃3 = 164

Example 2:

• Extreme values: x = (1, 1, 1, 1, 1, 50) (already sorted)
• n = 6 → average of the values in positions 3 and 4 → m = (x3 + x4)/2 = (1 + 1)/2 = 1

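A small sketch (ours) implementing the rule above:

```python
# Sample median: middle value for odd n, average of the two middle values for even n.
def sample_median(data):
    xs = sorted(data)
    n = len(xs)
    if n % 2 == 1:
        return xs[(n - 1) // 2]                # 1-based position (n - 1)/2 + 1
    return (xs[n // 2 - 1] + xs[n // 2]) / 2   # 1-based positions n/2 and n/2 + 1

print(sample_median([164, 141, 194, 180, 151]))   # 164
print(sample_median([1, 1, 1, 1, 1, 50]))         # 1.0
```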
71
Other properties: symmetry and skewness

• If the sample has a symmetric behaviour, the mean and the median are the same (e.g., number of accidents/day at I2)
• When the mean is greater than the median, i.e., x̄ > m, the sample is right-skewed (bulk of the data on the left ↔ long “tail” on the right)
• When the mean is smaller than the median, i.e., x̄ < m, the sample is left-skewed (bulk of the data on the right ↔ long “tail” on the left)

72
Other properties: symmetry and skewness (cont.)

[Figure: three histograms illustrating right-skewed, left-skewed and symmetric samples, with the mean (x̄) and median (m) marked in each panel]

73
Sample percentiles

The median is a special case of a more general concept: sample percentiles.

• Let p be any number between 0 and 1
• Let (x1, …, xn) be a sample of size n
• Let x̃ = (x(1), …, x(n)) be the ordered sample
• The sample 100p percentile, denoted as x̃(100p), is the value such that
  • at least 100p% of the n data values (= np values) are less than or equal to it:

    #(xi ≤ x̃(100p)) ≥ np

  • AND at least 100(1 − p)% of the n data values (= n(1 − p) values) are greater than or equal to it:

    #(xi ≥ x̃(100p)) ≥ n(1 − p) = n − np

74
Sample percentiles (cont.)

• Step by step:
  • Arrange the sample in increasing order
  • If np is not an integer, determine the smallest integer greater than np → the value in that position is x̃(100p)
  • If np is an integer, then x̃(100p) = (x̃(np) + x̃(np+1))/2
  • (If two data values satisfy the condition, x̃(100p) is the arithmetic average of these values)
• Special percentiles are called quartiles: the 25th (Q1), the 50th (sample median) and the 75th (Q3)

75
Sample percentiles (cont.) - Examples

Let’s consider a classroom of 20 students and their scores on a recent math test in the USA. The scores are as follows:

x = (75, 82, 90, 65, 88, 72, 94, 78, 60, 85, 70, 92, 68, 89, 76, 80, 98, 84, 79, 87)

Calculate the 70th percentile of these scores. The percentile calculation involves sorting the scores in ascending order and then finding the score below which 70% of the data falls.

- Step 1: Sort the scores in ascending order:
  x̃ = (60, 65, 68, 70, 72, 75, 76, 78, 79, 80, 82, 84, 85, 87, 88, 89, 90, 92, 94, 98)
- Step 2: Calculate the index for the 70th percentile:
  np = 0.70 × 20 = 14. Since np is an integer, the 70th percentile is the average of x̃(14) and x̃(15):
  70th percentile = (x̃(14) + x̃(15))/2 = (87 + 88)/2 = 87.5

And what is the 35th percentile of the following sample?
x = (15, 22, 18, 30, 25, 21, 28, 20, 24)

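A sketch (ours) of a small helper implementing the percentile rule described above; the function name is our own and np is guarded against floating-point noise.

```python
# Sample 100p percentile: ceiling rule when np is not an integer,
# average of positions np and np + 1 when it is.
import math

def sample_percentile(data, p):
    xs = sorted(data)
    n = len(xs)
    np_ = n * p
    k = round(np_)
    if math.isclose(np_, k):              # np is (numerically) an integer
        return (xs[k - 1] + xs[k]) / 2    # 1-based positions np and np + 1
    return xs[math.ceil(np_) - 1]         # smallest integer greater than np

scores = [75, 82, 90, 65, 88, 72, 94, 78, 60, 85,
          70, 92, 68, 89, 76, 80, 98, 84, 79, 87]
print(sample_percentile(scores, 0.70))    # 87.5
# The same function can be applied to the 35th-percentile exercise above.
```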
76
Sample mode

• Let (x1, …, xn) be a sample of size n
• The data value that occurs most frequently in the data set is called the mode
• At a graphical level, search for peaks:
  • if there is only one peak, we talk about unimodal data – case of the number of accidents/day at I2 (the mode is 3)
  • if there are several peaks, we talk about bimodal / multimodal data – case of the production of traffic lights (the modes are 168 and 200)
  • if there are no apparent peaks, we talk about uniform data

[Figure: relative-frequency line graphs for Intersection 1 and Intersection 2 (x-axis: number of accidents/day, y-axis: relative frequency), illustrating the peaks]


77
Some caution for measures of centrality

Not all centrality measures are appropriate for all types of variables:

• Sample means can only be computed (in a sensible way) for numerical variables (not for qualitative/categorical variables)
• Sample medians can be computed for numerical and ordinal variables, but not for nominal ones
• For nominal variables only the mode can be computed. On the other hand, the mode for continuous variables is often derived from grouped data, as there are infinitely many values a continuous variable can take

This is often something that students get wrong in exams!

78
Variability measures
Variability

• Variability refers to the extent to which values differ from one another
• Example: same mean, different dispersion

[Figure: two histograms with the same mean but different dispersion]

79
Sample variance

• The population variance, denoted with σ², is a parameter defined as

  σ² = Σ_{i=1}^N (xi − µ)² / N,

  but typically it is impossible to observe the population units x1, …, xN (except in very rare cases)
• Let (x1, …, xn) be a sample of size n. From this we can compute the sample mean x̄
• The sample variance, denoted as s², is a sample statistic defined as

  s² = Σ_{i=1}^n (xi − x̄)² / (n − 1)

• Sample variance (and variability measures in general) can only be computed for numerical variables

80
In practice

Example: energy consumption in 5 days (EG, in kWh)

• Sample x = (x1, x2, x3, x4, x5) = (164, 141, 194, 180, 151)
• Sample mean x̄ = 166 kWh
• Squared deviations (xi − x̄)², for i = 1, …, 5:
  (164 − 166)² = (−2)² = 4
  (141 − 166)² = (−25)² = 625
  (194 − 166)² = (28)² = 784
  (180 − 166)² = (14)² = 196
  (151 − 166)² = (−15)² = 225
• Sample variance:

  s² = Σ_{i=1}^n (xi − x̄)² / (n − 1) = (4 + 625 + 784 + 196 + 225)/4 = 1834/4 = 458.5

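A sketch (ours) of the same computation, using the n − 1 denominator as in the slides:

```python
# Sample variance and standard deviation of the 5-day energy-consumption sample.
import math
import statistics

x = [164, 141, 194, 180, 151]
n = len(x)
x_bar = sum(x) / n

s2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
s = math.sqrt(s2)
print(s2, s)   # 458.5  ~21.41

# The standard library agrees (it also divides by n - 1):
print(statistics.variance(x), statistics.stdev(x))
```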
81
Properties of sample variance

• Property I. Alternative formula (computational lifesaver):

  s² = Σ_{i=1}^n (xi − x̄)² / (n − 1) = (Σ_{i=1}^n xi² − n·x̄²) / (n − 1)

• Property II. Define yi = xi + c, i = 1, …, n, and denote with s²x the sample variance of x. Then

  s²y = Σ_{i=1}^n (yi − ȳ)² / (n − 1) = Σ_{i=1}^n (xi − x̄)² / (n − 1) = s²x

• Property III. Define yi = a·xi, i = 1, …, n, and denote with s²x the sample variance of x. Then

  s²y = Σ_{i=1}^n (yi − ȳ)² / (n − 1) = a² · Σ_{i=1}^n (xi − x̄)² / (n − 1) = a²·s²x

• What is s²y if we define yi = a·xi + c, i = 1, …, n, and denote with s²x the sample variance of x?

82
Sample standard deviation

• The square root of s², denoted as s, is called the sample standard deviation:

  s = √s² = √( Σ_{i=1}^n (xi − x̄)² / (n − 1) )

• Why? → s is measured in the same units as the original data, whilst s² is in the squared units of the original data
• Example: energy consumption in 5 days (EG, in kWh)

  s² = 458.5 kWh² → s = √458.5 = 21.41 kWh

• Properties:
  • if yi = xi + c, i = 1, …, n, then sy = sx
  • if yi = c·xi, i = 1, …, n, then sy = |c|·sx

83
Interquartile range

• Recall the 25th (Q1 – 1st quartile), the 50th (median – 2nd quartile) and the 75th (Q3 – 3rd quartile) percentiles
• The difference between Q3 and Q1 is the so-called interquartile range, denoted with IQR
• Roughly speaking, the IQR is the length of the interval in which the central half of the sample lies: it contains 50% of the data

84
Example: Q1 , Q3 and IQR

• Number of pedestrians crossing I1 in 20 days:
• Sorted data x̃: 142, 151, 157, 158, 161, 164, 168, 168, 169, 180, 182, 194, 197, 200, 200, 201, 205, 209, 210, 211
• Q1 as the 25th percentile → p = 0.25 → np = 20 × 0.25 = 5. Since it is an integer,
  Q1 = (x̃(5) + x̃(6))/2 = (161 + 164)/2 = 162.5
• Q3 as the 75th percentile → p = 0.75 → np = 20 × 0.75 = 15. Since it is an integer,
  Q3 = (x̃(15) + x̃(16))/2 = (200 + 201)/2 = 200.5
• IQR = Q3 − Q1 = 200.5 − 162.5 = 38

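A small sketch (ours), applying the integer-np percentile rule directly:

```python
# Q1, Q3 and IQR of the pedestrian counts (np = 5 for Q1, np = 15 for Q3).
x = sorted([164, 142, 194, 180, 151, 200, 158, 168, 209, 169,
            200, 201, 205, 157, 161, 168, 210, 197, 211, 182])

q1 = (x[4] + x[5]) / 2      # 1-based positions 5 and 6
q3 = (x[14] + x[15]) / 2    # 1-based positions 15 and 16
iqr = q3 - q1
print(q1, q3, iqr)          # 162.5 200.5 38.0
```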
85
Visualizing percentiles: the boxplot

[Figure: annotated boxplot of sorted sample values – box from Q1 (first quartile) to Q3 (third quartile) with the median inside, lower and upper whiskers extending to the maximum whisker reach, and suspected outliers plotted beyond the whiskers]
86
Whiskers and outliers

• Whiskers – extend up to 1.5 × IQR away from Q1 and Q3:
  • max upper whisker reach = Q3 + 1.5 × IQR
  • max lower whisker reach = Q1 − 1.5 × IQR
  • we will discuss where the 1.5 comes from later
• (Potential) outlier – an observed sample value beyond the maximum reach of the whiskers. It is an observation that appears extreme relative to the rest of the data
• Just as the median is preferred to the sample mean in the presence of outliers/skewness, the IQR is preferred to the sample standard deviation in the same cases

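A plotting sketch (ours, assuming matplotlib is available) that draws such a boxplot for the pedestrian counts:

```python
# Boxplot of the pedestrian counts; whis=1.5 means the whiskers reach at most
# 1.5 x IQR beyond Q1 and Q3 (this is also matplotlib's default).
import matplotlib.pyplot as plt

pedestrians = [164, 142, 194, 180, 151, 200, 158, 168, 209, 169,
               200, 201, 205, 157, 161, 168, 210, 197, 211, 182]

plt.boxplot(pedestrians, whis=1.5)
plt.ylabel("Number of pedestrians crossing I1")
plt.show()
```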
87
“Normal” data

• A data set is said to be normal / approximately normal if a histogram describing it has the following properties:
  • It is highest at the middle interval (mode = sample mean = median)
  • It is bell-shaped when moving from the middle interval in either direction
  • It is symmetric with respect to its middle interval

[Figure: two histograms, one of normal data and one of approximately normal data]

88
Empirical rule

• If a data set is approximately normal, with sample mean (and median and mode) x̄ and sample standard deviation s, then

  (x̄ − s, x̄ + s) contains ≈ 68% of the data
  (x̄ − 2s, x̄ + 2s) contains ≈ 95% of the data
  (x̄ − 3s, x̄ + 3s) contains ≈ 99.7% of the data

The idea is that normal data are data sampled from a population which follows a normal distribution.
Why the normal distribution? Remember it is a special distribution which is the limiting distribution for sums/means of random variables.
If the population follows a normal distribution with standard deviation σ, then Q3 − Q1 ≈ 1.35σ. Since 1.35 × 1.5 ≈ 2, the whiskers reach roughly 2σ beyond the quartiles, so for normal data we expect the whiskers of the boxplot to contain well over 95% of the data.
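A simulation sketch (ours, assuming NumPy) that checks the empirical rule on a large sample drawn from a normal population:

```python
# Empirical rule on simulated standard-normal data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)

x_bar, s = x.mean(), x.std(ddof=1)
for k in (1, 2, 3):
    inside = np.mean((x > x_bar - k * s) & (x < x_bar + k * s))
    print(f"within {k}s: {inside:.3f}")   # roughly 0.683, 0.954, 0.997
```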
89
Sample covariance and correlation coefficients
Remember the scatter plot?

• Consider the data set of paired values (x1, y1), (x2, y2), …, (xn, yn)
• Example: daily energy consumption (in kWh) for the production of traffic lights (x) and daily production (in 1000 pieces) of traffic lights (y)

[Figure: scatter plot of daily production (in 1000 pieces) against daily energy consumption (in kWh)]

90
Type of association

• We say that there is a positive association between x and y when they tend to grow together
• We say that there is a negative association between x and y when x grows whilst y decreases, or vice versa
• The most common way to measure association between numerical variables is through the sample covariance and correlation. Other measures do exist, and in particular other measures should be employed for variables which are not numerical.

91
Sample covariance

• The population covariance, denoted as σxy, is the parameter

  σxy = Σ_{i=1}^N (xi − µx)(yi − µy) / N,

  but typically it is impossible to observe the population units (x1, y1), …, (xN, yN) (except in very rare cases)
• We typically have a sample (x1, y1), …, (xn, yn)
• The sample covariance, denoted as sxy, is the data-derived quantity (sample statistic)

  sxy = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / (n − 1) = (Σ_{i=1}^n xi·yi − n·x̄·ȳ) / (n − 1)

• Roughly, sxy is the mean of the products of the deviations from the means of x and y

92
Interpretation of sample covariance

• Consider the generic term (xi − x̄)(yi − ȳ) in sxy
• This quantity is
  • positive when xi and yi are both larger, or both smaller, than x̄ and ȳ respectively
  • negative when xi is larger than x̄ and yi smaller than ȳ, or the other way around

93
Interpretation of sample covariance (cont.)


[Figure, example 1: a scatter plot and the same plot divided into quadrants at (x̄, ȳ), with “+” marking the upper-right and lower-left quadrants and “−” marking the other two]

94
Interpretation of sample covariance (cont.)


[Figure, example 2: another scatter plot with the same quadrant decomposition at (x̄, ȳ)]

95
Interpretation of sample covariance (cont.)


[Figure, example 3: another scatter plot with the same quadrant decomposition at (x̄, ȳ)]

96
Sample correlation coefficient

• Problem: we can detect the type of association using sxy, but how can we measure its strength?
• The sample correlation coefficient, denoted with r, of the data pairs (xi, yi), i = 1, …, n, is a sample statistic defined by

  r = sxy / (sx·sy) = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / √( Σ_{i=1}^n (xi − x̄)² · Σ_{i=1}^n (yi − ȳ)² ),

  where sx and sy are the sample standard deviations of x and y, respectively
• The population correlation coefficient, denoted with ρ, is a parameter defined by

  ρ = σxy / (σx·σy),

  and it is an unknown measure

97
Properties of r

1. −1 ≤ r ≤ 1
2. r = 1 ↔ yi = a + b·xi with b > 0 (all the points lie on a line with positive slope)
3. r = −1 ↔ yi = a + b·xi with b < 0 (all the points lie on a line with negative slope)
4. r = 0 ↔ yi is not in any linear relation with xi, for all i = 1, …, n
5. Computational lifesaver formula:

   r = (Σ_{i=1}^n xi·yi − n·x̄·ȳ) / √( (Σ_{i=1}^n xi² − n·x̄²)·(Σ_{i=1}^n yi² − n·ȳ²) )

98
Example: energy consumptions and production in 5 days

• Sample: (x1, y1) = (754, 164), (x2, y2) = (814, 141), (x3, y3) = (749, 194), (x4, y4) = (787, 180), (x5, y5) = (759, 151)
• Sample means: x̄ = 772.6, ȳ = 166
• Sample variances: s²x = 752.3, s²y = 458.5
• Sample standard deviations: sx = 27.43, sy = 21.41
• Σ_{i=1}^5 xi·yi = 640005
• n·x̄·ȳ = 5·x̄·ȳ = 641258
• Then sxy = (640005 − 641258)/4 = −1253/4 = −313.25
• And r = sxy/(sx·sy) = −313.25/(27.43 × 21.41) = −0.53

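A sketch (ours) of the same computation, using the n − 1 denominator as in the slides:

```python
# Sample covariance and correlation for the 5-day example.
import math

x = [754, 814, 749, 787, 759]
y = [164, 141, 194, 180, 151]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
s_x = math.sqrt(sum((a - x_bar) ** 2 for a in x) / (n - 1))
s_y = math.sqrt(sum((b - y_bar) ** 2 for b in y) / (n - 1))

r = s_xy / (s_x * s_y)
print(s_xy, r)   # -313.25  ~ -0.53
```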
99
Table of contents

Describing data
Case I: one single variable
Case II: Paired variables

Summary statistics
Central tendency
Variability measures
Boxplot
Covariance and correlation

100
