Descriptive Statistics
Describing data
Data matrix: each row is an observation (case), each column is a variable.

  #    gender   sleep   bedtime   countries   dread
  1    male     5       12-2      13          medium
  2    female   7.15    10-12     7           low
  3    female   5.5     12-2      1           very high
  4    female   3       10-12     0           low
  ..   ..       ..      ..        ..          ..
  86   male     8       12-2      5           very low
24
Types of variables
All variables fall into two broad types: quantitative (numerical) and qualitative (categorical).
25
Types of variables (cont.)
Quantitative variables can be discrete or continuous; qualitative variables can be nominal or ordinal.
26
From raw data to effective representations
Once the data have been collected (in a data matrix) and the types of variables we are working with
are known, the data should be displayed and visualized
Such presentation must be done clearly and concisely: one has to quickly obtain a “feel” for
the essential characteristics of the data
Different graphical tools depending on the type of variables
Visualising data will often show interesting features of the data and can be very informative:
plot your data before analysing it!
27
Case I: one single variable
Frequency tables
28
Frequency tables (cont.)
1. Ordered data:
I1: 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5, 5
I2: 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6
2. Occurrences:
I1: 0 occurs 5 times, 1 occurs 3 times, 2 occurs 4 times, 3 occurs 2 times, 4 occurs 1 time, 5 occurs 5 times
I2: 0 occurs 1 time, 1 occurs 2 times, 2 occurs 4 times, 3 occurs 6 times, 4 occurs 4 times, 5 occurs 2 times, 6 occurs 1 time
29
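As a minimal illustration (the slides themselves use no code), here is a Python sketch that reproduces these counts with collections.Counter, together with the relative frequencies used a few slides below:

```python
from collections import Counter

# Number of accidents per day at the two intersections (data from the slides)
I1 = [0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5, 5]
I2 = [0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6]

def frequency_table(sample):
    """Rows of (value, absolute frequency, relative frequency), sorted by value."""
    n = len(sample)
    return [(v, f, f / n) for v, f in sorted(Counter(sample).items())]

for name, sample in (("I1", I1), ("I2", I2)):
    print(name)
    for value, freq, rel in frequency_table(sample):
        print(f"  {value}: {freq:2d}  ({rel:.2f})")
```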
Frequency tables (cont.)
[Figure: bar plots of the number of days (y-axis) against the number of accidents (x-axis) for Intersection 1 and Intersection 2, built up bar by bar.]
A set of data is said to be symmetric about the value x0 if the frequencies of the values x0 − c
and x0 + c are the same for all c.
That is, for every constant c, there are just as many data points that are c less than x0 as
there are c greater than x0
Data that are “close to” being symmetric are said to be approximately symmetric; this is easiest to
judge from a graphical representation
On the number of accidents/day:
in case of I2 we have symmetry around x0 = 3
in case of I1 we do not have symmetry
34
Relative frequency graphs
Intersection 1 (I1):
  value   abs. freq   rel. freq
    0         5       5/20 = 0.25
    1         3       3/20 = 0.15
    2         4       4/20 = 0.20
    3         2       2/20 = 0.10
    4         1       1/20 = 0.05
    5         5       5/20 = 0.25
          n = 20      sum = 1

Intersection 2 (I2):
  value   abs. freq   rel. freq
    0         1       1/20 = 0.05
    1         2       2/20 = 0.10
    2         4       4/20 = 0.20
    3         6       6/20 = 0.30
    4         4       4/20 = 0.20
    5         2       2/20 = 0.10
    6         1       1/20 = 0.05
          n = 20      sum = 1
35
Visualize the relative frequency table: the line graph
[Figure: line graphs of the relative frequencies (y-axis) against the number of accidents (x-axis) for Intersection 1 and Intersection 2, built up point by point.]
So far, the variable (number of accidents per day) takes values in a small set of integers: it is a
discrete numerical variable with few possible values.
Suppose now that for I1 you have information on the traffic light present for that intersection,
i.e., you know if that day the traffic light was working or not
Such information is provided as a label, for example “W” (working), “NW” (not working)
The sample is given below
W,NW,W,W,NW,W,NW,NW,W,W,NW,NW,W,NW,NW,NW,W,NW,NW,W
This is categorical data → bar chart (absolute/relative frequency) or pie chart (relative
frequency)
39
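One possible way to draw the two charts, sketched in Python with matplotlib (assumed to be available; the slides do not prescribe a tool):

```python
from collections import Counter
import matplotlib.pyplot as plt

# Traffic-light status at I1 over the 20 observed days (data from the slides)
status = ["W", "NW", "W", "W", "NW", "W", "NW", "NW", "W", "W",
          "NW", "NW", "W", "NW", "NW", "NW", "W", "NW", "NW", "W"]

counts = Counter(status)
labels = sorted(counts)                               # ["NW", "W"]
rel_freq = [counts[lab] / len(status) for lab in labels]

fig, (ax_bar, ax_pie) = plt.subplots(1, 2, figsize=(8, 3))
ax_bar.bar(labels, rel_freq)                          # bar chart of relative frequencies
ax_bar.set_ylabel("Relative frequency")
ax_pie.pie(rel_freq, labels=labels, autopct="%.2f")   # pie chart of the same frequencies
plt.show()
```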
Pie chart
[Frequency table for the traffic-light status (n = 20, relative frequencies summing to 1) and pie chart with slices for “Working” and “Not Working”.]
Relative frequencies are somehow standardised: they will be useful again when comparing two
samples/populations
In our example we observed the two intersections for the same number of days, so it is easy to
compare. That’s not always the case.
Frequency tables are very common but can only be constructed when there are a small number
of possible values: what happens when we measure continuous outcomes or variables that can
take many values?
41
Grouped data
For some quantitative data sets, the number of distinct values is too large to utilize a line
graph/bar plot
In such cases,
we divide the values into groupings called class intervals or bins, and then
we plot the number of data values falling in each interval
How to choose the number of classes (width of the bins)? → trade-off between
choosing too few classes (bigger bins) at a cost of losing information about the actual data
values in a class
choosing too many classes (smaller bins) at a cost of having frequencies of each class being too
small for a pattern to be discernible
42
Visualizing grouped data: the histogram
Step by step:
1. Arrange the data in increasing order
2. Choose class intervals (bins) so that all data points are covered
3. Construct a (relative) frequency table
4. Draw adjacent bars having heights determined by the frequencies in step 3
Why? → To draw attention to important features of the data:
How symmetric the data are
How spread out the data are
Whether there are intervals having high levels of data concentration (modality and skewness)
Whether there are gaps in the data
Whether some data values are far apart from others (outliers)
43
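A sketch of these four steps in Python, assuming numpy and matplotlib are available; the data below are hypothetical and only meant to illustrate the mechanics (plt.hist carries out steps 2-4 in a single call):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical continuous measurements (illustration only, not from the slides)
rng = np.random.default_rng(0)
data = rng.normal(loc=180, scale=20, size=60)

data_sorted = np.sort(data)                              # step 1: arrange in increasing order
bins = np.linspace(data_sorted[0], data_sorted[-1], 7)   # step 2: 6 equal-width classes covering all points
freq, edges = np.histogram(data_sorted, bins=bins)       # step 3: frequency table
for left, right, f in zip(edges[:-1], edges[1:], freq):
    print(f"[{left:6.1f}, {right:6.1f}): {f}")

plt.hist(data_sorted, bins=bins, edgecolor="black")      # step 4: adjacent bars
plt.xlabel("value")
plt.ylabel("Frequency")
plt.show()
```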
The problem of number of classes
[Figure: three histograms of the same data (values between roughly 140 and 220) drawn with different numbers of classes; the apparent shape changes with the bin width.]
The choice of classes (and width) shows different patterns in the data.
A first rule of thumb: classes of equal width are usually to be preferred. Methods to choose the
“optimal” number of classes do exist (but we do not discuss them).
44
In practice
Suppose that you know how many pedestrians crossed I1 in those 20 days:
164, 142, 194, 180, 151, 200, 158, 168, 209, 169,
200, 201, 205, 157, 161, 168, 210, 197, 211, 182
Step by step:
1. Sorted data: 142, 151, 157, 158, 161, 164, 168, 168, 169, 180, 182, 194, 197, 200, 200, 201,
205, 209, 210, 211
2. Class intervals:
142 → [140,150);   151, 157, 158 → [150,160);   161, 164, 168, 168, 169 → [160,170);   180, 182 → [180,190);   …
45
In practice (cont.)
3. Frequency table:
   class       freq   rel. freq
  [140,150)      1      0.05
  [150,160)      3      0.15
  [160,170)      5      0.25
  [170,180)      0      0.00
  [180,190)      2      0.10
  [190,200)      2      0.10
  [200,210)      5      0.25
  [210,220)      2      0.10
            n = 20     sum = 1
46
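A short sketch, assuming numpy is available, that reproduces this table from the raw pedestrian counts:

```python
import numpy as np

# Pedestrians crossing I1 in the 20 days (data from the slides)
pedestrians = [164, 142, 194, 180, 151, 200, 158, 168, 209, 169,
               200, 201, 205, 157, 161, 168, 210, 197, 211, 182]

edges = np.arange(140, 230, 10)                  # class boundaries 140, 150, ..., 220
freq, _ = np.histogram(pedestrians, bins=edges)  # absolute frequencies per class
rel = freq / len(pedestrians)                    # relative frequencies (sum to 1)

for left, right, f, r in zip(edges[:-1], edges[1:], freq, rel):
    print(f"[{left},{right}): {f:2d}  {r:.2f}")
```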
In practice (cont.)
[Figure: histograms of the pedestrian counts with the classes above, one with absolute frequencies and one with relative frequencies on the y-axis (n = 20, relative frequencies sum to 1).]
48
In practice (cont.)
[Figure: the same data grouped into wider classes: frequency and relative-frequency histograms (frequencies now reaching 7, relative frequencies 0.35).]
Sometimes a data set consists of pairs of values that have some relationship to each other →
(x, y)
If x is qualitative and y is quantitative → distinct histograms, one for each category of x.
(Overlaid) frequency polygons are also an option.
If both x and y are quantitative → scatter plot
50
Example 1: distinct histograms
51
Example 1: distinct histograms (cont.)
[Figure: two separate histograms of the pedestrian counts, one for each category of the qualitative variable (traffic light working / not working).]
52
Example 1: distinct frequency polygons
[Figure: the corresponding frequency polygons, drawn separately for the two categories over the class intervals [140,150) through [200,210).]
53
Example 1: overlaid frequency polygons
[Figure: overlaid frequency polygons for days with the traffic light working vs not working, drawn in the same plot.]
55
Example 2: scatter plot (cont.)
[Figure: scatter plot of daily production (in 1000 pieces) against energy consumption, built by adding one point at a time: (x1, y1) = (754, 164), (x2, y2) = (814, 141), (x3, y3) = (749, 194), … until all points are shown.]
Suppose that we have in our possession sample data from some underlying population
Examples: number of accidents/day at I1 and I2, conditions of traffic lights, number of
pedestrians crossing I1, . . .
Up to now we have shown how to describe and portray sample data in their entirety
Now we want to determine summary measures about the sample → enter the sample statistics
Sample statistics are numerical quantities computed from sample data
Why? → To provide a simple way to characterise the sample
Why? → Sample statistic as a point estimate of a specific population property, called
parameter, for a specific variable (statistical inference)
The type of summary statistic depends on the type of data and on what we are trying to
characterise
60
Types of statistics
Broadly: measures of central tendency (mean, median, mode), measures of variability (variance, standard deviation, IQR) and measures of association between paired variables (covariance, correlation).
61
Central tendency measures
Sample mean
The sample mean of a sample x_1, …, x_n is
x̄ = (x_1 + … + x_n)/n = (1/n) Σ_{i=1}^{n} x_i
62
Properties of sample mean
If y_i = x_i + c, i = 1, …, n, then ȳ = x̄ + c
If y_i = a x_i, i = 1, …, n, then ȳ = a x̄
63
Weighted sample mean
Is it possible to compute x̄ with (relative) frequency tables? Yes! Remember we could build
frequency tables from sequences of measurements.
What we need:
xi – value i in sample, i = 1, . . . , k (k distinct values)
fi – absolute frequency of xi
n – sample size (n = f_1 + … + f_k = Σ_{i=1}^{k} f_i)
64
Weighted sample mean (cont.)
Then
x̄ = (Σ_{i=1}^{k} f_i x_i) / n = Σ_{i=1}^{k} w_i x_i,
where w_i = f_i / n is the relative frequency of x_i.
65
Example: number of accidents/day at I1
x = (0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5, 5)
k = 6 different values: 0, 1, 2, 3, 4, 5
  x_i    f_i    w_i
   0      5     0.25
   1      3     0.15
   2      4     0.20
   3      2     0.10
   4      1     0.05
   5      5     0.25
 Total   20     1

x̄ = (Σ_{i=1}^{k} f_i x_i) / n = (0×5 + 1×3 + 2×4 + 3×2 + 4×1 + 5×5) / 20 = 46/20 = 2.3

or, using relative frequencies,

x̄ = 0×0.25 + 1×0.15 + 2×0.20 + 3×0.10 + 4×0.05 + 5×0.25 = 2.3
66
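A small Python sketch of the weighted-mean computation, just to illustrate the formula:

```python
# Frequency table for the number of accidents/day at I1 (from the slides)
values = [0, 1, 2, 3, 4, 5]
freqs  = [5, 3, 4, 2, 1, 5]

n = sum(freqs)                                             # 20
mean_from_freqs = sum(f * x for f, x in zip(freqs, values)) / n
mean_from_weights = sum((f / n) * x for f, x in zip(freqs, values))

print(mean_from_freqs, mean_from_weights)                  # both 2.3
```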
Deviation from the sample mean
The deviation of an observation x_i from the sample mean is x_i − x̄; the deviations always sum to zero, Σ_{i=1}^{n} (x_i − x̄) = 0.
67
In practice
x = (164, 141, 194, 180, 151) – energy consumption in 5 days (EG, in kWh)
x̄ = 166 kWh
Deviations:
x1 − x̄ = 164 − 166 = −2, x2 − x̄ = 141 − 166 = −25,
x3 − x̄ = 194 − 166 = 28, x4 − x̄ = 180 − 166 = 14,
x5 − x̄ = 151 − 166 = −15
Is the sum 0?
Σ_{i=1}^{5} (x_i − x̄) = −2 − 25 + 28 + 14 − 15 = 0
68
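The same check in a few lines of Python:

```python
x = [164, 141, 194, 180, 151]          # energy consumption in 5 days (kWh)
x_bar = sum(x) / len(x)                # 166.0
deviations = [xi - x_bar for xi in x]  # [-2, -25, 28, 14, -15]
print(deviations, sum(deviations))     # the deviations always sum to 0
```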
Sample median
We need an index that indicates the center of a sample but is not affected by extreme values
Such a measure is the sample median, denoted by m: the middle value when the data are ranked
from smallest to largest
69
Sample median (cont.)
Step by step:
Order the sample values from smallest to largest, for a sample of size n
If n is odd, m is the element of the ordered sample in position (n − 1)/2 + 1
If n is even, m is the average of the elements of the ordered sample in positions n/2 and n/2 + 1,
respectively
70
Sample median: examples
Example 1:
Energy consumption in 5 days (EG, in kWh): x = (164, 141, 194, 180, 151)
Sorted sample: x̃ = (141, 151, 164, 180, 194)
n = 5 → value in position 3 → m = x̃3 = 164
Example 2:
71
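A sketch of the odd/even rule in plain Python (it reproduces m = 164 for Example 1):

```python
def sample_median(sample):
    """Median via the slides' rule: middle value (n odd) or average of the two middle values (n even)."""
    ordered = sorted(sample)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n - 1) // 2]                  # position (n-1)/2 + 1, as a 0-based index
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2  # positions n/2 and n/2 + 1

print(sample_median([164, 141, 194, 180, 151]))       # 164
```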
Other properties: symmetry and skewness
If sample has a symmetric behaviour, mean and median are the same (e.g., number of
accidents/day at I2)
When mean is greater than the median, i.e., x̄ > m, the sample is right-skewed (bulk of data
on the left ↔ long “tail” on the right)
When mean is smaller than the median, i.e., x̄ < m, the sample is left-skewed (bulk of data on
the right ↔ long “tail” on the left)
72
Other properties: symmetry and skewness (cont.)
50
25
20
x x
40
x
20
m m m
15
Frequency
Frequency
Frequency
30
15
10
20
10
5
10
5
0
0
0 2 4 6 8 10 −2 0 2 4 6 8 10 12 0 2 4 6 8 10
73
Sample percentiles
The sample 100p percentile, denoted x̃_(100p), is a data value such that at least 100p% of the n data values (i.e., at least np of them) are less than or equal to it:
#(x_i ≤ x̃_(100p)) ≥ np
AND at least 100(1 − p)% of the n data values (at least n(1 − p) of them) are greater than or equal to it:
#(x_i ≥ x̃_(100p)) ≥ n(1 − p) = n − np
74
Sample percentiles (cont.)
Step by step:
Arrange the sample in increasing order
If np is not an integer, determine the smallest integer greater than np → the value in that
position is x̃(100p)
If np is an integer, then x̃_(100p) = (x̃_(np) + x̃_(np+1)) / 2
If two data values satisfy this condition, then x̃(100p) is the arithmetic average of these values
Special percentiles called quartiles: 25th (Q1 ), 50th (sample median) and 75th (Q3 )
75
Sample percentiles (cont.) - Examples
Let’s consider a classroom of 20 students and their scores on a recent math test in USA. The
scores are as follows:
x = (75, 82, 90, 65, 88, 72, 94, 78, 60, 85, 70, 92, 68, 89, 76, 80, 98, 84, 79, 87)
Calculate the 70th percentile of these scores. The percentile calculation involves sorting the scores
in ascending order and then finding the score below which 70% of the data falls.
- Step 1: Sort the scores in ascending order:
x̃= (60, 65, 68, 70, 72, 75, 76, 78, 79, 80, 82, 84, 85, 87, 88, 89, 90, 92, 94, 98)
- Step 2: Calculate the index for the 70th percentile:
np = 0.70 × 20 = 14. Since np is an integer, the 70th percentile is the average of x̃_14 and x̃_15:
70th percentile = (x̃_14 + x̃_15)/2 = (87 + 88)/2 = 87.5
And what is the 35th percentile of the following sample?
x= (15, 22, 18, 30, 25, 21, 28, 20, 24)
76
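A sketch implementing the percentile rule above in plain Python; note that library routines such as numpy.percentile use different interpolation conventions by default, so their results may differ slightly. It reproduces the 87.5 computed for the test scores and can be used to check the exercise:

```python
import math

def sample_percentile(sample, p):
    """100p-th sample percentile following the slides' rule (0 < p < 1)."""
    ordered = sorted(sample)
    n = len(ordered)
    np_ = n * p
    k = round(np_)
    if math.isclose(np_, k):                  # np is an integer: average positions np and np+1
        return (ordered[k - 1] + ordered[k]) / 2
    k = math.ceil(np_)                        # smallest integer greater than np
    return ordered[k - 1]                     # 0-based index of that position

scores = [75, 82, 90, 65, 88, 72, 94, 78, 60, 85,
          70, 92, 68, 89, 76, 80, 98, 84, 79, 87]
print(sample_percentile(scores, 0.70))        # 87.5
```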
Sample mode
[Figure: relative-frequency line graphs for Intersection 1 and Intersection 2; the sample mode is the value (or values) occurring with the highest frequency, i.e., the peak(s) of each graph.]
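In Python, statistics.multimode (Python 3.8+) returns the most frequent value(s); a quick sketch for the two accident samples:

```python
from statistics import multimode

I1 = [0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5, 5]
I2 = [0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6]

print(multimode(I1))  # [0, 5] -- two modes, each observed 5 times
print(multimode(I2))  # [3]    -- single mode, observed 6 times
```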
Not all centrality measures are appropriate for all types of variables:
Sample means can only be computed (in a sensible way) for numerical variables (not for
qualitative/categorical variables)
Sample medians can be computed for numerical and ordinal variables, but not for nominal ones
For nominal variables only the mode can be computed. On the other hand, the mode for
continuous variables is often derived for grouped data, as there are infinitely many values
continuous variables can take
78
Variability measures
Variability
Variability refers to the extent to which values differ from one another
Example: same mean, different dispersion
[Figure: two histograms with the same mean but different dispersion.]
79
Sample variance
The sample variance of a sample x_1, …, x_n with sample mean x̄ is
s² = Σ_{i=1}^{n} (x_i − x̄)² / (n − 1)
80
In practice
Example: energy consumption in 5 days (EG, in kWh), x = (164, 141, 194, 180, 151), x̄ = 166:
s² = [(−2)² + (−25)² + 28² + 14² + (−15)²] / 4 = 1834/4 = 458.5 kWh²
81
Properties of sample variance
Property II Define y_i = x_i + c, i = 1, …, n, and denote with s_x² the sample variance of x. Then
s_y² = Σ_{i=1}^{n} (y_i − ȳ)² / (n − 1) = Σ_{i=1}^{n} (x_i − x̄)² / (n − 1) = s_x²
Property III Define y_i = a x_i, i = 1, …, n, and denote with s_x² the sample variance of x. Then
s_y² = Σ_{i=1}^{n} (y_i − ȳ)² / (n − 1) = a² Σ_{i=1}^{n} (x_i − x̄)² / (n − 1) = a² s_x²
What is s2y if we define yi = axi + c, i = 1, . . . , n, and denote with s2x the sample variance of x?
82
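A sketch that computes the sample variance (n − 1 denominator, as used by statistics.variance) and checks Properties II and III numerically:

```python
from statistics import variance  # sample variance, denominator n - 1

x = [164, 141, 194, 180, 151]    # energy consumption in 5 days (kWh)
c, a = 10, 3

s2_x = variance(x)
print(s2_x)                                   # 458.5
print(variance([xi + c for xi in x]))         # Property II: unchanged, still 458.5
print(variance([a * xi for xi in x]))         # Property III: a^2 * s2_x = 9 * 458.5
```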
Sample standard deviation
Why? → s is measured in the same units as the original data, whilst s2 is in the squared units
of the original data
Example: energy consumption in 5 days (EG, in kWh)
s² = 458.5 kWh² → s = √458.5 = 21.41 kWh
Properties:
if y_i = x_i + c, i = 1, …, n, then s_y = s_x
if y_i = c x_i, i = 1, …, n, then s_y = |c| s_x
83
Interquartile range
Recall the 25th (Q1, 1st quartile), the 50th (median, 2nd quartile) and the 75th (Q3, 3rd
quartile) percentiles
The difference between Q3 and Q1 is the so called interquartile range, denoted with IQR
Roughly speaking, the IQR is the length of the interval in which the central half of the
sample lies: it contains 50% of the data
84
Example: Q1 , Q3 and IQR
85
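The worked example from this slide is not reproduced here; as a hypothetical illustration, Q1, Q3 and the IQR of the pedestrian counts could be obtained as follows (numpy's default percentile rule differs slightly from the one in these slides, so a hand computation may give slightly different values):

```python
import numpy as np

# Pedestrians crossing I1 in the 20 days (data from the slides)
pedestrians = [164, 142, 194, 180, 151, 200, 158, 168, 209, 169,
               200, 201, 205, 157, 161, 168, 210, 197, 211, 182]

q1, q3 = np.percentile(pedestrians, [25, 75])  # first and third quartiles
iqr = q3 - q1                                  # interquartile range
print(q1, q3, iqr)                             # e.g. 163.25 200.25 37.0 with numpy's default rule
```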
Visualizing percentiles: the boxplot
[Figure: a boxplot with annotations marking the maximum whisker reach and the upper whisker, Q3 (third quartile), the median, and Q1 (first quartile).]
86
Whiskers and outliers
The whiskers extend from the box to the most extreme observations lying within 1.5 × IQR of the quartiles (maximum whisker reach: Q1 − 1.5 × IQR and Q3 + 1.5 × IQR); observations beyond the whiskers are drawn individually and flagged as potential outliers.
87
“Normal” data
A data set is said to be normal / approximately normal if a histogram describing it has the
following properties:
It is highest at the middle interval (mode = sample mean = median)
Bell-shaped when moving from the middle interval in either direction
It is symmetric with respect to its middle interval
88
Empirical rule
If a data set is approximately normal, with sample mean (and median and mode) x̄ and
sample standard deviation s, then approximately 68% of the data lie within x̄ ± s, approximately
95% lie within x̄ ± 2s, and approximately 99.7% lie within x̄ ± 3s
The idea is that normal data is data that has been sampled from a population which follows a
normal distribution
Why the normal distribution? Remember it is a special distribution which is the limiting
distribution for sums/means of random variables
If the population follows a normal distribution with standard deviation σ we have that
Q3 − Q1 = 1.35 ∗ σ. Since 1.35 ∗ 1.5 ≈ 2 we expect the whiskers of the boxplots to contain 95% of
the data.
89
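A quick numerical check of the rule on simulated, approximately normal data (assuming numpy; the exact fractions will vary a little with the random seed):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=0.0, scale=1.0, size=10_000)   # approximately normal sample

x_bar, s = data.mean(), data.std(ddof=1)             # sample mean and std (n - 1 denominator)
for k in (1, 2, 3):
    frac = np.mean(np.abs(data - x_bar) <= k * s)
    print(f"within x_bar ± {k}s: {frac:.3f}")        # roughly 0.68, 0.95, 0.997
```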
Sample covariance and correlation coefficients
Remember the scatter plot?
[Figure: the scatter plot of daily production (in 1000 pieces) against energy consumption, shown again.]
90
Type of association
We say that between x and y there is positive association when they tend to grow together
We say that between x and y there is negative association when x grows whilst y decreases, or
vice versa
The most common ways to measure association between numerical variables are the sample
covariance and the sample correlation. Other measures do exist; in particular, different measures
should be used for variables that are not numerical.
91
Sample covariance
The sample covariance of the data pairs (x_i, y_i), i = 1, …, n, is
s_xy = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / (n − 1)
92
Interpretation of sample covariance
A positive s_xy indicates positive association, a negative s_xy indicates negative association, and s_xy close to zero indicates little or no (linear) association.
93
Interpretation of sample covariance (cont.)
[Figure: scatter plot of a first data set with the plane split into four quadrants at (x̄, ȳ); the products (x_i − x̄)(y_i − ȳ) are positive in the upper-right and lower-left quadrants and negative in the other two.]
94
Interpretation of sample covariance (cont.)
[Figure: the same quadrant construction at (x̄, ȳ) for a second data set.]
95
Interpretation of sample covariance (cont.)
[Figure: the same quadrant construction at (x̄, ȳ) for a third data set.]
96
Sample correlation coefficient
Problem: we can detect the type of association using sxy , but how can we measure its
strength?
The sample correlation coefficient, denoted with r, of the data pairs (xi , yi ), i = 1, . . . , n, is a
sample statistic defined by
r = s_xy / (s_x s_y) = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / √( Σ_{i=1}^{n} (x_i − x̄)² · Σ_{i=1}^{n} (y_i − ȳ)² )
97
Properties of r
1. −1 ≤ r ≤ 1
2. r = 1 ↔ y_i = a + b x_i with b > 0 (all the points lie on a line with positive slope)
3. r = −1 ↔ y_i = a + b x_i with b < 0 (all the points lie on a line with negative slope)
4. r = 0 ↔ there is no linear relation between y_i and x_i, i = 1, …, n
5. Computation lifesaver formula:
r = ( Σ_{i=1}^{n} x_i y_i − n x̄ ȳ ) / √( (Σ_{i=1}^{n} x_i² − n x̄²)(Σ_{i=1}^{n} y_i² − n ȳ²) )
98
Example: energy consumptions and production in 5 days
Sample: (x1 , y1 ) = (754, 164), (x2 , y2 ) = (814, 141), (x3 , y3 ) = (749, 194),
(x4 , y4 ) = (787, 180), (x5 , y5 ) = (759, 151)
Sample means: x̄ = 772.6, ȳ = 166
Sample variances: s2x = 752.3, s2y = 458.5
Sample std. deviations: sx = 27.43, sy = 21.41
Σ_{i=1}^{5} x_i y_i = 640005
n x̄ ȳ = 5 x̄ ȳ = 641258
Then s_xy = (640005 − 641258) / 4 = −1253/4 = −313.25
And r = s_xy / (s_x s_y) = −313.25 / (27.43 × 21.41) = −0.53
99
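A sketch reproducing these numbers with numpy (np.cov and np.corrcoef use the same n − 1 convention as the slides):

```python
import numpy as np

x = np.array([754, 814, 749, 787, 759])   # energy consumption
y = np.array([164, 141, 194, 180, 151])   # daily production (in 1000 pieces)

s_xy = np.cov(x, y)[0, 1]                 # sample covariance, denominator n - 1
r = np.corrcoef(x, y)[0, 1]               # sample correlation coefficient
print(s_xy, r)                            # approximately -313.25 and -0.53
```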
Table of content
Describing data
Case I: one single variable
Case II: Paired variables
Summary statistics
Central tendency
Variability measures
Boxplot
Covariance and correlation
100