T1. DescriptiveStatistics
Descriptive Statistics
January 8, 2024
1 Preliminaries
2 Frequency Distribution
5 z-scores
6 Box-Plots
Introduction
Statistics is the science of data. It involves collecting, classifying,
organizing, summarizing, analyzing, and interpreting data. It also
involves model building.
Population: A population is the collection or set of all objects or
measurements that are of interest to the collector.
Sample: A sample is a subset of data selected from a population.
Example: In a clinical study, the population is all the patients with
the same disease, whereas the sample is the subset of patients who
participate in the study.
Types of variables
Variable: A variable is a feature of interest measured on the members of the sample.
Types of variables:
Number of class intervals k: If N is small, the nearest integer to $\sqrt{N}$ (it must
be between 5 and 20). Otherwise, the nearest integer to $1 + 3.322 \log_{10}(N)$ (Sturges' rule).
Class interval width: $\frac{l_{k+1} - l_0}{k}$, rounded up.
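As a rough illustration of the two rules above, the following Python sketch picks a number of class intervals and a class width. The cutoff `small_n` for a "small" N is a hypothetical parameter (the slides do not fix one), the width is assumed to be rounded up to the next integer, and the values 10 and 58 are made-up extremes of a data set.

```python
import math

def number_of_classes(n, small_n=50):
    """Suggest a number of class intervals k for n observations.

    small_n is a hypothetical cutoff for "small" samples; it is an
    assumption of this sketch, not a value given in the text.
    """
    if n <= small_n:
        k = round(math.sqrt(n))
        return min(max(k, 5), 20)  # keep k between 5 and 20
    # Sturges' rule for larger samples
    return round(1 + 3.322 * math.log10(n))

def class_width(lowest, highest, k):
    """Range divided by k, rounded up to the next integer (assumed)."""
    return math.ceil((highest - lowest) / k)

k = number_of_classes(25)          # sqrt(25) = 5
width = class_width(10, 58, k)     # 48 / 5 = 9.6, rounded up to 10
print(k, width)
```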
Mean
Add up all the numbers and divide by how many numbers there are.
Ungrouped data: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.
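Grouped data: $\bar{x} = \frac{1}{n} \sum_{i=1}^{k} n_i x_i$, where $x_i$ is the value (or class mark) of class $i$, $n_i$ its frequency, $k$ the number of classes, and $n = \sum_{i=1}^{k} n_i$.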
It can only be computed for NUMERICAL data.
It is very sensitive to outliers.
It is not a good indicator when the data distribution is not symmetric.
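A minimal Python sketch of both mean formulas, using made-up numbers. For the grouped case, `marks` is assumed to hold the class marks (midpoints) and `freqs` the class frequencies. The last lines show the sensitivity to outliers.

```python
def mean_ungrouped(values):
    """Sum of the observations divided by their count."""
    return sum(values) / len(values)

def mean_grouped(marks, freqs):
    """Frequency-weighted mean of the class marks."""
    n = sum(freqs)
    return sum(f * x for x, f in zip(marks, freqs)) / n

data = [2, 4, 4, 5, 100]                      # made-up values; 100 is an outlier
print(mean_ungrouped(data))                   # 23.0
print(mean_ungrouped(data[:-1]))              # 3.75 without the outlier
print(mean_grouped([5, 15, 25], [2, 5, 3]))   # 16.0
```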
Median
It is the middle number of the ordered data set. It is found by putting
the numbers in order and taking the actual middle number if there is
one, or the average of the two middle numbers if not.
Ungrouped data: order the data and take the central value as Me.
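Grouped data: $Me = l_{i-1} + \frac{n/2 - N_{i-1}}{n_i} (l_i - l_{i-1})$, where class $i$ is the first class whose cumulative frequency $N_i$ reaches $n/2$.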
It can only be computed for NUMERICAL and ORDINAL data.
It is more robust to outliers than the mean.
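The grouped-data formula can be sketched in Python as follows. The function name, the representation of classes as a list of limits $l_0, \dots, l_k$ plus a list of frequencies, and the example numbers are all assumptions made for illustration.

```python
import statistics

def median_grouped(limits, freqs):
    """Median by linear interpolation inside the median class.

    limits: class limits l_0, ..., l_k (one more entry than freqs)
    freqs:  class frequencies n_1, ..., n_k
    """
    n = sum(freqs)
    cumulative = 0  # N_{i-1}: cumulative frequency before the current class
    for i, n_i in enumerate(freqs):
        if cumulative + n_i >= n / 2:  # first class reaching n/2
            lower, upper = limits[i], limits[i + 1]
            return lower + (n / 2 - cumulative) / n_i * (upper - lower)
        cumulative += n_i

# Ungrouped data: the standard library orders the data and picks the middle.
print(statistics.median([3, 1, 4, 2]))                        # 2.5

# Grouped data (made-up classes [0,10), [10,20), [20,30), [30,40)).
print(median_grouped([0, 10, 20, 30, 40], [3, 5, 8, 4]))      # 22.5
```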
Mode
It is the most frequently occurring value.
Ungrouped data: Mo is the most frequent category or number (or
several of them, if they share the same highest frequency).
Grouped data: $Mo = l_{i-1} + \frac{\delta_1}{\delta_1 + \delta_2} (l_i - l_{i-1})$, where $\delta_1 = n_i - n_{i-1}$ and
$\delta_2 = n_i - n_{i+1}$.
Theoretically it can be computed for ALL types of data, but it is only
representative for nominal, ordinal and discrete (with few different
values) data.
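A possible Python sketch of the mode for both cases. For grouped data it assumes equal-width classes, a single modal class whose frequency differs from at least one neighbour, and a frequency of 0 for a missing neighbour at either end. These conventions, the function name and the example numbers are assumptions of this sketch.

```python
import statistics

def mode_grouped(limits, freqs):
    """Mode by interpolation inside the modal class (equal widths assumed).

    limits: class limits l_0, ..., l_k (one more entry than freqs)
    freqs:  class frequencies n_1, ..., n_k
    """
    i = max(range(len(freqs)), key=freqs.__getitem__)  # modal class index
    n_i = freqs[i]
    n_prev = freqs[i - 1] if i > 0 else 0               # assumed 0 at the edge
    n_next = freqs[i + 1] if i + 1 < len(freqs) else 0  # assumed 0 at the edge
    delta1 = n_i - n_prev
    delta2 = n_i - n_next
    lower, upper = limits[i], limits[i + 1]
    return lower + delta1 / (delta1 + delta2) * (upper - lower)

# Ungrouped data: all values sharing the highest frequency are returned.
print(statistics.multimode(["a", "b", "b", "c", "c"]))   # ['b', 'c']

# Grouped data (made-up classes): modal class is [20, 30).
print(mode_grouped([0, 10, 20, 30, 40], [3, 5, 8, 4]))
```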
Skewness
Skewness denotes the degree of asymmetry in the data. It is the tendency of
a distribution to depart from symmetry.
Pearson’s 1st coefficient of skewness: $A_{P1} = \frac{\bar{x} - Mo}{S}$.
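As an illustration, Pearson's first coefficient can be computed in Python as below. The sketch assumes that S is the sample standard deviation (divisor n − 1) and that the data have a single mode; the data values are made up.

```python
import statistics

def pearson_first_skewness(values):
    """(mean - mode) / S, with S taken as the sample standard deviation."""
    mean = statistics.mean(values)
    mode = statistics.mode(values)    # assumes a single mode
    s = statistics.stdev(values)      # assumption: divisor n - 1
    return (mean - mode) / s

data = [1, 2, 2, 2, 3, 4, 7]          # mean 3, mode 2, S = 2
print(pearson_first_skewness(data))   # 0.5: mean above mode, right-skewed
```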
Kurtosis
z-scores
The z-score indicates how many standard deviations an individual
observation x is from the center of the data set, its mean.
The z-score of an observation x is calculated as $z = \frac{x - \bar{x}}{S}$.
It is quite useful for making fair comparisons of observations that were
measured on different scales or in different units. It is dimensionless.
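A short Python sketch of the z-score, again assuming that S is the sample standard deviation. The second call rescales the same made-up data (for example, metres to centimetres) to show that the z-scores do not depend on the unit.

```python
import statistics

def z_scores(values):
    """z = (x - mean) / S for every observation x."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)   # assumption: sample standard deviation
    return [(x - mean) / s for x in values]

data = [1, 2, 2, 2, 3, 4, 7]       # made-up values: mean 3, S = 2
print(z_scores(data))              # [-1.0, -0.5, -0.5, -0.5, 0.0, 0.5, 2.0]

# Changing the unit rescales x, the mean and S alike, so z is unchanged.
print(z_scores([100 * x for x in data]))
```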
Box-Plot
5 Draw an open circle (or an asterisk, *) to identify each observation that falls
between 1.5(IQR) and 3(IQR) from the edge to which it is closest; these are
called mild outliers.
6 Draw a solid circle to identify each observation that falls more than 3(IQR)
from the closest edge; these are called extreme outliers.
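The two outlier rules can be sketched in Python as follows. The box edges are taken to be the quartiles Q1 and Q3; the quartile convention (`method="inclusive"` of `statistics.quantiles`) is an assumption, since several conventions exist and they can give slightly different edges. The data are made up.

```python
import statistics

def classify_outliers(values):
    """Split observations beyond the box into mild and extreme outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    mild, extreme = [], []
    for x in values:
        # Distance from the closest box edge (0 if x lies inside the box).
        distance = max(q1 - x, x - q3, 0)
        if distance > 3 * iqr:
            extreme.append(x)      # solid circle
        elif distance > 1.5 * iqr:
            mild.append(x)         # open circle or asterisk
    return mild, extreme

data = [10, 12, 13, 14, 15, 16, 18, 30, 50]   # Q1 = 13, Q3 = 18, IQR = 5
print(classify_outliers(data))                # ([30], [50])
```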
Exercise 3. The following data identify the time in months from hire to
promotion to chief pharmacist for a random sample of 25 employees from
a certain group of employees in a large corporation of drugstores.
Recap