Descriptive Statistics
Describing data
Data matrix: each row is an observation (case), each column is a variable.

  #    gender   sleep   bedtime   countries   dread
  1    male     5       12-2      13          medium
  2    female   7.15    10-12     7           low
  3    female   5.5     12-2      1           very high
  4    female   3       10-12     0           low
  ..   ..       ..      ..        ..          ..
  86   male     8       12-2      5           very low
24
Types of variables
All variables fall into two broad types: quantitative (numerical) and qualitative (categorical).
25
Types of variables (cont.)
Quantitative variables can be discrete or continuous; qualitative variables can be nominal or ordinal.
26
From raw data to effective representations
Once the data have been collected (in a data matrix) and the types of variables we are working with
are known, the data should be displayed and visualized
Such presentation must be done clearly and concisely: one has to quickly obtain a “feel” for
the essential characteristics of the data
Different graphical tools depending on the type of variables
Visualising data will often show interesting features of the data and can be very informative:
plot your data before analysing it!
27
Case I: one single variable
Frequency tables
28
Frequency tables (cont.)
1. Ordered data:
I1: 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5, 5
I2: 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6
2. Occurrences:
I1: 0 occurs 5 times, 1 occurs 3 times, 2 occurs 4 times, 3 occurs 2 times, 4 occurs 1 time, 5 occurs 5 times
I2: 0 occurs 1 time, 1 occurs 2 times, 2 occurs 4 times, 3 occurs 6 times, 4 occurs 4 times, 5 occurs 2 times, 6 occurs 1 time
29
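As a minimal illustration (the slides themselves use no code), here is a Python sketch that reproduces these counts with collections.Counter, together with the relative frequencies used a few slides below:

```python
from collections import Counter

# Number of accidents per day at the two intersections (data from the slides)
I1 = [0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5, 5]
I2 = [0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6]

def frequency_table(sample):
    """Rows of (value, absolute frequency, relative frequency), sorted by value."""
    n = len(sample)
    return [(v, f, f / n) for v, f in sorted(Counter(sample).items())]

for name, sample in (("I1", I1), ("I2", I2)):
    print(name)
    for value, freq, rel in frequency_table(sample):
        print(f"  {value}: {freq:2d}  ({rel:.2f})")
```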
Frequency tables (cont.)
[Figure: bar plots of the number of days (y-axis) against the number of accidents (x-axis) for Intersection 1 and Intersection 2, built up bar by bar.]
A set of data is said to be symmetric about the value x0 if the frequencies of the values x0 − c
and x0 + c are the same for all c.
That is, for every constant c, there are just as many data points that are c less than x0 as
there are c greater than x0
Data that are “close to” being symmetric are said to be approximately symmetric; this is easiest to
judge from a graphical representation
On the number of accidents/day:
in case of I2 we have symmetry around x0 = 3
in case of I1 we do not have symmetry
34
Relative frequency graphs
Intersection 1 (I1):
  value   abs. freq   rel. freq
    0         5       5/20 = 0.25
    1         3       3/20 = 0.15
    2         4       4/20 = 0.20
    3         2       2/20 = 0.10
    4         1       1/20 = 0.05
    5         5       5/20 = 0.25
          n = 20      sum = 1

Intersection 2 (I2):
  value   abs. freq   rel. freq
    0         1       1/20 = 0.05
    1         2       2/20 = 0.10
    2         4       4/20 = 0.20
    3         6       6/20 = 0.30
    4         4       4/20 = 0.20
    5         2       2/20 = 0.10
    6         1       1/20 = 0.05
          n = 20      sum = 1
35
Visualize the relative frequency table: the line graph
[Figure: line graphs of the relative frequencies (y-axis) against the number of accidents (x-axis) for Intersection 1 and Intersection 2, built up point by point.]
So far, the variable (number of accidents per day) takes values in a small set of integers: it is a
discrete numerical variable with few possible values.
Suppose now that for I1 you have information on the traffic light present for that intersection,
i.e., you know if that day the traffic light was working or not
Such information is provided as a label, for example “W” (working), “NW” (not working)
The sample is given below
W,NW,W,W,NW,W,NW,NW,W,W,NW,NW,W,NW,NW,NW,W,NW,NW,W
This is categorical data → bar chart (absolute/relative frequency) or pie chart (relative
frequency)
39
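One possible way to draw the two charts, sketched in Python with matplotlib (assumed to be available; the slides do not prescribe a tool):

```python
from collections import Counter
import matplotlib.pyplot as plt

# Traffic-light status at I1 over the 20 observed days (data from the slides)
status = ["W", "NW", "W", "W", "NW", "W", "NW", "NW", "W", "W",
          "NW", "NW", "W", "NW", "NW", "NW", "W", "NW", "NW", "W"]

counts = Counter(status)
labels = sorted(counts)                               # ["NW", "W"]
rel_freq = [counts[lab] / len(status) for lab in labels]

fig, (ax_bar, ax_pie) = plt.subplots(1, 2, figsize=(8, 3))
ax_bar.bar(labels, rel_freq)                          # bar chart of relative frequencies
ax_bar.set_ylabel("Relative frequency")
ax_pie.pie(rel_freq, labels=labels, autopct="%.2f")   # pie chart of the same frequencies
plt.show()
```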
Pie chart
[Frequency table for the traffic-light status (n = 20, relative frequencies summing to 1) and pie chart with slices for “Working” and “Not Working”.]
Relative frequencies are somehow standardised: they will be useful again when comparing two
samples/populations
In our example we observed the two intersections for the same number of days, so it is easy to
compare. That’s not always the case.
Frequency tables are very common but can only be constructed when there are a small number
of possible values: what happens when we measure continuous outcomes or variables that can
take many values?
41
Grouped data
For some quantitative data sets, the number of distinct values is too large to utilize a line
graph/bar plot
In such cases,
we divide the values into groupings called class intervals or bins, and then
we plot the number of data values falling in each interval
How to choose the number of classes (width of the bins)? → trade-off between
choosing too few classes (bigger bins) at a cost of losing information about the actual data
values in a class
choosing too many classes (smaller bins) at a cost of having frequencies of each class being too
small for a pattern to be discernible
42
Visualizing grouped data: the histogram
Step by step:
1. Arrange the data in increasing order
2. Choose class intervals (bins) so that all data points are covered
3. Construct a (relative) frequency table
4. Draw adjacent bars having heights determined by the frequencies in step 3
Why? → To draw attention to important features of the data:
How symmetric the data are
How spread out the data are
Whether there are intervals having high levels of data concentration (modality and skewness)
Whether there are gaps in the data
Whether some data values are far apart from others (outliers)
43
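A sketch of these four steps in Python, assuming numpy and matplotlib are available; the data below are hypothetical and only meant to illustrate the mechanics (plt.hist carries out steps 2-4 in a single call):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical continuous measurements (illustration only, not from the slides)
rng = np.random.default_rng(0)
data = rng.normal(loc=180, scale=20, size=60)

data_sorted = np.sort(data)                              # step 1: arrange in increasing order
bins = np.linspace(data_sorted[0], data_sorted[-1], 7)   # step 2: 6 equal-width classes covering all points
freq, edges = np.histogram(data_sorted, bins=bins)       # step 3: frequency table
for left, right, f in zip(edges[:-1], edges[1:], freq):
    print(f"[{left:6.1f}, {right:6.1f}): {f}")

plt.hist(data_sorted, bins=bins, edgecolor="black")      # step 4: adjacent bars
plt.xlabel("value")
plt.ylabel("Frequency")
plt.show()
```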
The problem of number of classes
[Figure: three histograms of the same data (values between roughly 140 and 220) drawn with different numbers of classes; the apparent shape changes with the bin width.]
The choice of classes (and width) shows different patterns in the data.
A first rule of thumb: classes of equal width are usually to be preferred. Methods to choose the
“optimal” number of classes do exist (but we do not discuss them).
44
In practice
Suppose that you know how many pedestrians crossed I1 in those 20 days:
164, 142, 194, 180, 151, 200, 158, 168, 209, 169,
200, 201, 205, 157, 161, 168, 210, 197, 211, 182
Step by step:
1. Sorted data: 142, 151, 157, 158, 161, 164, 168, 168, 169, 180, 182, 194, 197, 200, 200, 201,
205, 209, 210, 211
2. Class intervals:
142 → [140,150);   151, 157, 158 → [150,160);   161, 164, 168, 168, 169 → [160,170);   180, 182 → [180,190);   …
45
In practice (cont.)
3. Frequency table:
   class       freq   rel. freq
  [140,150)      1      0.05
  [150,160)      3      0.15
  [160,170)      5      0.25
  [170,180)      0      0.00
  [180,190)      2      0.10
  [190,200)      2      0.10
  [200,210)      5      0.25
  [210,220)      2      0.10
            n = 20     sum = 1
46
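A short sketch, assuming numpy is available, that reproduces this table from the raw pedestrian counts:

```python
import numpy as np

# Pedestrians crossing I1 in the 20 days (data from the slides)
pedestrians = [164, 142, 194, 180, 151, 200, 158, 168, 209, 169,
               200, 201, 205, 157, 161, 168, 210, 197, 211, 182]

edges = np.arange(140, 230, 10)                  # class boundaries 140, 150, ..., 220
freq, _ = np.histogram(pedestrians, bins=edges)  # absolute frequencies per class
rel = freq / len(pedestrians)                    # relative frequencies (sum to 1)

for left, right, f, r in zip(edges[:-1], edges[1:], freq, rel):
    print(f"[{left},{right}): {f:2d}  {r:.2f}")
```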
In practice (cont.)
[Figure: histograms of the pedestrian counts with the classes above, one with absolute frequencies and one with relative frequencies on the y-axis (n = 20, relative frequencies sum to 1).]
48
In practice (cont.)
[Figure: the same data grouped into wider classes: frequency and relative-frequency histograms (frequencies now reaching 7, relative frequencies 0.35).]
Sometimes a data set consists of pairs of values that have some relationship to each other →
(x, y)
If x is qualitative and y is quantitative → distinct histograms, one for each category of x.
(Overlaid) frequency polygons are also an option.
If both x and y are quantitative → scatter plot
50
Example 1: distinct histograms
51
Example 1: distinct histograms (cont.)
[Figure: two separate histograms of the pedestrian counts, one for each category of the qualitative variable (traffic light working / not working).]
52
Example 1: distinct frequency polygons
[Figure: the corresponding frequency polygons, drawn separately for the two categories over the class intervals [140,150) through [200,210).]
53
Example 1: overlaid frequency polygons
[Figure: overlaid frequency polygons for days with the traffic light working vs not working, drawn in the same plot.]
55
Example 2: scatter plot (cont.)
[Figure: scatter plot of daily production (in 1000 pieces) against energy consumption, built by adding one point at a time: (x1, y1) = (754, 164), (x2, y2) = (814, 141), (x3, y3) = (749, 194), … until all points are shown.]
Suppose that we have in our possession sample data from some underlying population
Examples: number of accidents/day at I1 and I2, conditions of traffic lights, number of
pedestrians crossing I1, . . .
Up to now we have shown how to describe and portray sample data in their entirety
Now we want to determine summary measures about the sample → enter the sample statistics
Sample statistics are numerical quantities computed from sample data
Why? → To provide a simple way to characterise the sample
Why? → Sample statistic as a point estimate of a specific population property, called
parameter, for a specific variable (statistical inference)
The type of summary statistic depends on the type of data and on what we are trying to
characterise
60
Types of statistics
Broadly: measures of central tendency (mean, median, mode), measures of variability (variance, standard deviation, IQR) and measures of association between paired variables (covariance, correlation).
61
Central tendency measures
Sample mean
The sample mean of a sample x_1, …, x_n is
x̄ = (x_1 + … + x_n)/n = (1/n) Σ_{i=1}^{n} x_i
62
Properties of sample mean
If y_i = x_i + c, i = 1, …, n, then ȳ = x̄ + c
If y_i = a x_i, i = 1, …, n, then ȳ = a x̄
63
Weighted sample mean
Is it possible to compute x̄ with (relative) frequency tables? Yes! Remember we could build
frequency tables from sequences of measurements.
What we need:
xi – value i in sample, i = 1, . . . , k (k distinct values)
fi – absolute frequency of xi
n – sample size (n = f_1 + … + f_k = Σ_{i=1}^{k} f_i)
64
Weighted sample mean (cont.)
Then
x̄ = (Σ_{i=1}^{k} f_i x_i) / n = Σ_{i=1}^{k} w_i x_i,
where w_i = f_i / n is the relative frequency of x_i.
65
Example: number of accidents/day at I1
x = (0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5, 5)
k = 6 different values: 0, 1, 2, 3, 4, 5
  x_i    f_i    w_i
   0      5     0.25
   1      3     0.15
   2      4     0.20
   3      2     0.10
   4      1     0.05
   5      5     0.25
 Total   20     1

x̄ = (Σ_{i=1}^{k} f_i x_i) / n = (0×5 + 1×3 + 2×4 + 3×2 + 4×1 + 5×5) / 20 = 46/20 = 2.3

or, using relative frequencies,

x̄ = 0×0.25 + 1×0.15 + 2×0.20 + 3×0.10 + 4×0.05 + 5×0.25 = 2.3
66
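A small Python sketch of the weighted-mean computation, just to illustrate the formula:

```python
# Frequency table for the number of accidents/day at I1 (from the slides)
values = [0, 1, 2, 3, 4, 5]
freqs  = [5, 3, 4, 2, 1, 5]

n = sum(freqs)                                             # 20
mean_from_freqs = sum(f * x for f, x in zip(freqs, values)) / n
mean_from_weights = sum((f / n) * x for f, x in zip(freqs, values))

print(mean_from_freqs, mean_from_weights)                  # both 2.3
```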
Deviation from the sample mean
The deviation of an observation x_i from the sample mean is x_i − x̄; the deviations always sum to zero, Σ_{i=1}^{n} (x_i − x̄) = 0.
67
In practice
x = (164, 141, 194, 180, 151) – energy consumption in 5 days (EG, in kWh)
x̄ = 166 kWh
Deviations:
x1 − x̄ = 164 − 166 = −2, x2 − x̄ = 141 − 166 = −25,
x3 − x̄ = 194 − 166 = 28, x4 − x̄ = 180 − 166 = 14,
x5 − x̄ = 151 − 166 = −15
Is the sum 0?
Σ_{i=1}^{5} (x_i − x̄) = −2 − 25 + 28 + 14 − 15 = 0
68
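The same check in a few lines of Python:

```python
x = [164, 141, 194, 180, 151]          # energy consumption in 5 days (kWh)
x_bar = sum(x) / len(x)                # 166.0
deviations = [xi - x_bar for xi in x]  # [-2, -25, 28, 14, -15]
print(deviations, sum(deviations))     # the deviations always sum to 0
```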
Sample median
We need an index that indicates the center of a sample but is not affected by extreme values
Such a measure is the sample median, denoted by m: the middle value when the data are ranked
from smallest to largest
69
Sample median (cont.)
Step by step:
Order the sample values from smallest to largest, for a sample of size n
If n is odd, m is the element of the ordered sample in position (n − 1)/2 + 1
If n is even, m is the average of the elements of the ordered sample in positions n/2 and n/2 + 1,
respectively
70
Sample median: examples
Example 1:
Energy consumption in 5 days (EG, in kWh): x = (164, 141, 194, 180, 151)
Sorted sample: x̃ = (141, 151, 164, 180, 194)
n = 5 → value in position 3 → m = x̃3 = 164
Example 2:
71
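A sketch of the odd/even rule in plain Python (it reproduces m = 164 for Example 1):

```python
def sample_median(sample):
    """Median via the slides' rule: middle value (n odd) or average of the two middle values (n even)."""
    ordered = sorted(sample)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n - 1) // 2]                  # position (n-1)/2 + 1, as a 0-based index
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2  # positions n/2 and n/2 + 1

print(sample_median([164, 141, 194, 180, 151]))       # 164
```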
Other properties: symmetry and skewness
If sample has a symmetric behaviour, mean and median are the same (e.g., number of
accidents/day at I2)
When mean is greater than the median, i.e., x̄ > m, the sample is right-skewed (bulk of data
on the left ↔ long “tail” on the right)
When mean is smaller than the median, i.e., x̄ < m, the sample is left-skewed (bulk of data on
the right ↔ long “tail” on the left)
72
Other properties: symmetry and skewness (cont.)
50
25
20
x x
40
x
20
m m m
15
Frequency
Frequency
Frequency
30
15
10
20
10
5
10
5
0
0
0 2 4 6 8 10 −2 0 2 4 6 8 10 12 0 2 4 6 8 10
73
Sample percentiles
The sample 100p percentile, denoted x̃_(100p), is a data value such that at least 100p% of the n data values (i.e., at least np of them) are less than or equal to it:
#(x_i ≤ x̃_(100p)) ≥ np
AND at least 100(1 − p)% of the n data values (at least n(1 − p) of them) are greater than or equal to it:
#(x_i ≥ x̃_(100p)) ≥ n(1 − p) = n − np
74
Sample percentiles (cont.)
Step by step:
Arrange the sample in increasing order
If np is not an integer, determine the smallest integer greater than np → the value in that
position is x̃(100p)
If np is an integer, then x̃_(100p) = (x̃_(np) + x̃_(np+1)) / 2
If two data values satisfy this condition, then x̃(100p) is the arithmetic average of these values
Special percentiles called quartiles: 25th (Q1 ), 50th (sample median) and 75th (Q3 )
75
Sample percentiles (cont.) - Examples
Let’s consider a classroom of 20 students and their scores on a recent math test in USA. The
scores are as follows:
x = (75, 82, 90, 65, 88, 72, 94, 78, 60, 85, 70, 92, 68, 89, 76, 80, 98, 84, 79, 87)
Calculate the 70th percentile of these scores. The percentile calculation involves sorting the scores
in ascending order and then finding the score below which 70% of the data falls.
- Step 1: Sort the scores in ascending order:
x̃= (60, 65, 68, 70, 72, 75, 76, 78, 79, 80, 82, 84, 85, 87, 88, 89, 90, 92, 94, 98)
- Step 2: Calculate the index for the 70th percentile:
np = 0.70 × 20 = 14. Since np is an integer, the 70th percentile is the average of x̃_14 and x̃_15:
70th percentile = (x̃_14 + x̃_15)/2 = (87 + 88)/2 = 87.5
And what is the 35th percentile of the following sample?
x= (15, 22, 18, 30, 25, 21, 28, 20, 24)
76
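A sketch implementing the percentile rule above in plain Python; note that library routines such as numpy.percentile use different interpolation conventions by default, so their results may differ slightly. It reproduces the 87.5 computed for the test scores and can be used to check the exercise:

```python
import math

def sample_percentile(sample, p):
    """100p-th sample percentile following the slides' rule (0 < p < 1)."""
    ordered = sorted(sample)
    n = len(ordered)
    np_ = n * p
    k = round(np_)
    if math.isclose(np_, k):                  # np is an integer: average positions np and np+1
        return (ordered[k - 1] + ordered[k]) / 2
    k = math.ceil(np_)                        # smallest integer greater than np
    return ordered[k - 1]                     # 0-based index of that position

scores = [75, 82, 90, 65, 88, 72, 94, 78, 60, 85,
          70, 92, 68, 89, 76, 80, 98, 84, 79, 87]
print(sample_percentile(scores, 0.70))        # 87.5
```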
Sample mode
[Figure: relative-frequency line graphs for Intersection 1 and Intersection 2; the sample mode is the value (or values) occurring with the highest frequency, i.e., the peak(s) of each graph.]
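In Python, statistics.multimode (Python 3.8+) returns the most frequent value(s); a quick sketch for the two accident samples:

```python
from statistics import multimode

I1 = [0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5, 5]
I2 = [0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6]

print(multimode(I1))  # [0, 5] -- two modes, each observed 5 times
print(multimode(I2))  # [3]    -- single mode, observed 6 times
```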
Not all centrality measures are appropriate for all types of variables:
Sample means can only be computed (in a sensible way) for numerical variables (not for
qualitative/categorical variables)
Sample medians can be computed for numerical and ordinal variables, but not for nominal ones
For nominal variables only the mode can be computed. On the other hand, the mode for
continuous variables is often derived for grouped data, as there are infinitely many values
continuous variables can take
78
Variability measures
Variability
Variability refers to the extent to which values differ from one another
Example: same mean, different dispersion
[Figure: two histograms with the same mean but different dispersion.]
79
Sample variance
The sample variance of a sample x_1, …, x_n with sample mean x̄ is
s² = Σ_{i=1}^{n} (x_i − x̄)² / (n − 1)
80
In practice
Example: energy consumption in 5 days (EG, in kWh), x = (164, 141, 194, 180, 151), x̄ = 166:
s² = [(−2)² + (−25)² + 28² + 14² + (−15)²] / 4 = 1834/4 = 458.5 kWh²
81
Properties of sample variance
Property II Define y_i = x_i + c, i = 1, …, n, and denote with s_x² the sample variance of x. Then
s_y² = Σ_{i=1}^{n} (y_i − ȳ)² / (n − 1) = Σ_{i=1}^{n} (x_i − x̄)² / (n − 1) = s_x²
Property III Define y_i = a x_i, i = 1, …, n, and denote with s_x² the sample variance of x. Then
s_y² = Σ_{i=1}^{n} (y_i − ȳ)² / (n − 1) = a² Σ_{i=1}^{n} (x_i − x̄)² / (n − 1) = a² s_x²
What is s2y if we define yi = axi + c, i = 1, . . . , n, and denote with s2x the sample variance of x?
82
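A sketch that computes the sample variance (n − 1 denominator, as used by statistics.variance) and checks Properties II and III numerically:

```python
from statistics import variance  # sample variance, denominator n - 1

x = [164, 141, 194, 180, 151]    # energy consumption in 5 days (kWh)
c, a = 10, 3

s2_x = variance(x)
print(s2_x)                                   # 458.5
print(variance([xi + c for xi in x]))         # Property II: unchanged, still 458.5
print(variance([a * xi for xi in x]))         # Property III: a^2 * s2_x = 9 * 458.5
```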
Sample standard deviation
Why? → s is measured in the same units as the original data, whilst s2 is in the squared units
of the original data
Example: energy consumption in 5 days (EG, in kWh)
s² = 458.5 kWh² → s = √458.5 = 21.41 kWh
Properties:
if y_i = x_i + c, i = 1, …, n, then s_y = s_x
if y_i = c x_i, i = 1, …, n, then s_y = |c| s_x
83
Interquartile range
Recall the 25th (Q1, 1st quartile), the 50th (median, 2nd quartile) and the 75th (Q3, 3rd
quartile) percentiles
The difference between Q3 and Q1 is the so called interquartile range, denoted with IQR
Roughly speaking, the IQR is the length of the interval in which the central half of the
sample lies: it contains 50% of the data
84
Example: Q1 , Q3 and IQR
85
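The worked example from this slide is not reproduced here; as a hypothetical illustration, Q1, Q3 and the IQR of the pedestrian counts could be obtained as follows (numpy's default percentile rule differs slightly from the one in these slides, so a hand computation may give slightly different values):

```python
import numpy as np

# Pedestrians crossing I1 in the 20 days (data from the slides)
pedestrians = [164, 142, 194, 180, 151, 200, 158, 168, 209, 169,
               200, 201, 205, 157, 161, 168, 210, 197, 211, 182]

q1, q3 = np.percentile(pedestrians, [25, 75])  # first and third quartiles
iqr = q3 - q1                                  # interquartile range
print(q1, q3, iqr)                             # e.g. 163.25 200.25 37.0 with numpy's default rule
```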
Visualizing percentiles: the boxplot
[Figure: a boxplot with annotations marking the maximum whisker reach and the upper whisker, Q3 (third quartile), the median, and Q1 (first quartile).]
86
Whiskers and outliers
The whiskers extend from the box to the most extreme observations lying within 1.5 × IQR of the quartiles (maximum whisker reach: Q1 − 1.5 × IQR and Q3 + 1.5 × IQR); observations beyond the whiskers are drawn individually and flagged as potential outliers.
87
“Normal” data
A data set is said to be normal / approximately normal if a histogram describing it has the
following properties:
It is highest at the middle interval (mode = sample mean = median)
Bell-shaped when moving from the middle interval in either direction
It is symmetric with respect to its middle interval
88
Empirical rule
If a data set is approximately normal, with sample mean (and median and mode) x̄ and
sample standard deviation s, then approximately 68% of the data lie within x̄ ± s, approximately
95% lie within x̄ ± 2s, and approximately 99.7% lie within x̄ ± 3s
The idea is that normal data is data that has been sampled from a population which follows a
normal distribution
Why the normal distribution? Remember it is a special distribution which is the limiting
distribution for sums/means of random variables
If the population follows a normal distribution with standard deviation σ we have that
Q3 − Q1 = 1.35 ∗ σ. Since 1.35 ∗ 1.5 ≈ 2 we expect the whiskers of the boxplots to contain 95% of
the data.
89
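A quick numerical check of the rule on simulated, approximately normal data (assuming numpy; the exact fractions will vary a little with the random seed):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=0.0, scale=1.0, size=10_000)   # approximately normal sample

x_bar, s = data.mean(), data.std(ddof=1)             # sample mean and std (n - 1 denominator)
for k in (1, 2, 3):
    frac = np.mean(np.abs(data - x_bar) <= k * s)
    print(f"within x_bar ± {k}s: {frac:.3f}")        # roughly 0.68, 0.95, 0.997
```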
Sample covariance and correlation coefficients
Remember the scatter plot?
[Figure: the scatter plot of daily production (in 1000 pieces) against energy consumption, shown again.]
90
Type of association
We say that between x and y there is positive association when they tend to grow together
We say that between x and y there is negative association when x grows whilst y decreases, or
vice versa
The most common ways to measure association between numerical variables are the sample
covariance and the sample correlation. Other measures do exist; in particular, different measures
should be used for variables that are not numerical.
91
Sample covariance
The sample covariance of the data pairs (x_i, y_i), i = 1, …, n, is
s_xy = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / (n − 1)
92
Interpretation of sample covariance
A positive s_xy indicates positive association, a negative s_xy indicates negative association, and s_xy close to zero indicates little or no (linear) association.
93
Interpretation of sample covariance (cont.)
[Figure: scatter plot of a first data set with the plane split into four quadrants at (x̄, ȳ); the products (x_i − x̄)(y_i − ȳ) are positive in the upper-right and lower-left quadrants and negative in the other two.]
94
Interpretation of sample covariance (cont.)
[Figure: the same quadrant construction at (x̄, ȳ) for a second data set.]
95
Interpretation of sample covariance (cont.)
[Figure: the same quadrant construction at (x̄, ȳ) for a third data set.]
96
Sample correlation coefficient
Problem: we can detect the type of association using sxy , but how can we measure its
strength?
The sample correlation coefficient, denoted with r, of the data pairs (xi , yi ), i = 1, . . . , n, is a
sample statistic defined by
r = s_xy / (s_x s_y) = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / √( Σ_{i=1}^{n} (x_i − x̄)² · Σ_{i=1}^{n} (y_i − ȳ)² )
97
Properties of r
1. −1 ≤ r ≤ 1
2. r = 1 ↔ y_i = a + b x_i with b > 0 (all the points lie on a line with positive slope)
3. r = −1 ↔ y_i = a + b x_i with b < 0 (all the points lie on a line with negative slope)
4. r = 0 ↔ there is no linear relation between y_i and x_i, i = 1, …, n
5. Computation lifesaver formula:
r = ( Σ_{i=1}^{n} x_i y_i − n x̄ ȳ ) / √( (Σ_{i=1}^{n} x_i² − n x̄²)(Σ_{i=1}^{n} y_i² − n ȳ²) )
98
Example: energy consumptions and production in 5 days
Sample: (x1 , y1 ) = (754, 164), (x2 , y2 ) = (814, 141), (x3 , y3 ) = (749, 194),
(x4 , y4 ) = (787, 180), (x5 , y5 ) = (759, 151)
Sample means: x̄ = 772.6, ȳ = 166
Sample variances: s2x = 752.3, s2y = 458.5
Sample std. deviations: sx = 27.43, sy = 21.41
Σ_{i=1}^{5} x_i y_i = 640005
n x̄ ȳ = 5 x̄ ȳ = 641258
Then s_xy = (640005 − 641258) / 4 = −1253/4 = −313.25
And r = s_xy / (s_x s_y) = −313.25 / (27.43 × 21.41) = −0.53
99
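A sketch reproducing these numbers with numpy (np.cov and np.corrcoef use the same n − 1 convention as the slides):

```python
import numpy as np

x = np.array([754, 814, 749, 787, 759])   # energy consumption
y = np.array([164, 141, 194, 180, 151])   # daily production (in 1000 pieces)

s_xy = np.cov(x, y)[0, 1]                 # sample covariance, denominator n - 1
r = np.corrcoef(x, y)[0, 1]               # sample correlation coefficient
print(s_xy, r)                            # approximately -313.25 and -0.53
```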
Table of content
Describing data
Case I: one single variable
Case II: Paired variables
Summary statistics
Central tendency
Variability measures
Boxplot
Covariance and correlation
100