7CCMMS61 Statistics For Data Analysis: Francisco Javier Rubio Department of Mathematics
I would appreciate it if you could point out any typos you spot to me (javier.rubio alvarez@kcl.ac.uk).
Disclaimer: These notes should not be distributed or used for commercial purposes.
Week 1: Exploratory Data Analysis
• the extraction of knowledge from data. It employs techniques and theories drawn from many fields
within the broad areas of mathematics, statistics, and information technology.
• the use of scientific methods to obtain useful information from computer data, especially large
amounts of data.
• is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to
extract knowledge and insights from structured and unstructured data.
For more details on the definition of Data Science, see the following articles:
Definition 2. In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to
summarize their main characteristics, often with visual methods.
As early as 1961, John Tukey identified the importance of EDA, which he defined as:
“Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of
planning the gathering of data to make its analysis easier, more precise or more accurate, and all the
machinery and results of (mathematical) statistics which apply to analyzing data”
There are book-length references on EDA. In this course, we will focus on some common and useful
tools to produce basic descriptive summaries of a data set or the results from a model.
Definition 3. According to the Oxford Dictionary of Statistics, Statistical Inference is defined as:
“The process of drawing conclusions about the nature of some system on the basis of data subject
to random variation. There are several distinguishable and apparently irreconcilable approaches to the
process of inference; comfortingly, there are rarely any gross differences in the inferences that result.
Approaches include Bayesian inference and fiducial inference; the approach first met by a student of
Statistics is usually that based on the Neyman-Pearson lemma.”
The word “inference” refers to drawing conclusions on the basis of some evidence. Thus, Statistical
Inference refers to drawing conclusions on the basis of evidence obtained from the data.
1.2 Lecture 2: Exploratory Data Analysis II
• object or process
• unambiguously defined
• natural units (people, animals, plants), socioeconomic units (families, households, companies)
Definition 5. A variable is any characteristic, number, or quantity that can be measured or counted.
Definition 6. Values are simply the values of a variable that a statistical unit can take.
Notation 1. Variables will be denoted with upper case letters while values will be denoted with lower
case letters:
Variable    Values
X           $x_1, x_2, x_3, \ldots, x_n$
Y           $y_1, y_2, y_3, \ldots, y_n$
Definition 7. Population: In statistics this term is used for any finite or infinite collection of “units”,
which are often people but may be, for example, institutions, events, etc.
Definition 8. Sample: A selected subset of a population chosen by some process usually with the
objective of investigating particular properties of the parent population.
Definition 9. Outlier. “An observation that appears to deviate markedly from the other members of the
sample in which it occurs. In the set of systolic blood pressures, {125; 128; 130; 131; 198}, for example,
198 might be considered an outlier. More formally the term refers to an observation which appears to be
inconsistent with the rest of the data, relative to an assumed model. Such extreme observations may be
reflecting some abnormality in the measured characteristic of a subject, or they may result from an error
in the measurement or recording.”
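As a rough illustration, the blood-pressure values from the quotation can be screened in R. The sketch below uses Tukey's boxplot rule (points more than 1.5 times the interquartile range beyond the hinges). This rule is an assumption made here for illustration only; it is one common convention and not the definition given above.

```r
# Systolic blood pressures from the example in Definition 9
bp <- c(125, 128, 130, 131, 198)

# boxplot.stats() applies Tukey's rule (coef = 1.5 by default);
# the rule is an illustrative assumption, not the definition in the notes
flagged <- boxplot.stats(bp)$out
print(flagged)
```

Whether a flagged value is truly an outlier still depends on the assumed model and on subject knowledge.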
1.2.2 Scaling
Definition 10. The scale of a variable is the metric on which a variable is recorded on a set of units.
The scale of the variable measured drastically affects the type of analytical techniques that can be used
on the data, and what conclusions can be drawn from the data.
There are different scales (or types of data):
• nominal scale
• ordinal scale
• numerical scale
– discrete
– continuous
Definition 11. A categorical variable is a variable that can take on one of a limited, and usually fixed,
number of possible values, assigning each individual or other unit of observation to a particular group or
nominal category on the basis of some qualitative property. That is, a variable that gives the appropriate
label of an observation after allocation to one of several possible categories. For example, respiratory
status: terrible, poor, fair, good, excellent, or blood group: A, B, AB or O. Respiratory status is an
example of an ordered categorical variable or ordinal variable whereas blood type is an example of an
unordered categorical variable.
Categorical variables can be:
The values the categorical variable can assume are called levels.
Definition 12. Dichotomous or binary variables. A binary variable can only take two mutually exclu-
sive (disjoint) values. For example:
Definition 13. Nominal variables are variables that have two or more categories, but which do not have
an intrinsic order. Nominal scales assign numbers as labels to identify objects or classes of objects. A
nominal variable is an unordered categorical variable.
Definition 14. Ordinal variable: A measurement that allows a sample of individuals to be ranked with
respect to some characteristic but where differences at different points of the scale are not necessarily
equivalent. Ordinal data is a categorical, statistical data type where the variables have natural, ordered
categories and the distances between the categories are not known. For example, anxiety might be rated on
a scale ‘none’, ‘mild’, ‘moderate’ and ‘severe’, with the values 0,1,2,3, being used to label the categories.
Definition 15. A numerical variable is a variable where the measurement or value has a numerical
meaning. There are two types of numerical variables: discrete and continuous.
Figure 1.2.1: Scales.
Definition 16. Discrete variables: Variables having only integer values, for example, number of births,
number of pregnancies, number of teeth extracted, etc. Discrete variables are variables that can only
take certain values.
Definition 17. Continuous variable: A measurement not restricted to particular values except in so far
as this is constrained by the accuracy of the measuring instrument. Common examples include weight,
height, temperature, and blood pressure. For such a variable equal sized differences on different parts of
the scale are equivalent. Continuous variables are variables that can take any value (within a range).
Practically speaking, variables with many “countable” units (e.g. income) are treated as continuous
and sometimes called “quasi-continuous”.
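The scales above map naturally onto R data types. The following is a minimal sketch; all observations are hypothetical and the variable names are chosen here for illustration.

```r
# Nominal: unordered categories (blood group)
blood <- factor(c("A", "O", "AB", "B", "O"), levels = c("A", "B", "AB", "O"))

# Ordinal: ordered categories (anxiety rating)
anxiety <- factor(c("mild", "none", "severe", "moderate", "mild"),
                  levels = c("none", "mild", "moderate", "severe"),
                  ordered = TRUE)

# Numerical, discrete: counts stored as integers
births <- c(0L, 2L, 1L, 3L, 1L)

# Numerical, continuous: measurements stored as doubles
weight <- c(61.2, 70.5, 58.9, 82.0, 66.3)

levels(blood)            # the levels of the categorical variable
is.ordered(blood)        # FALSE: no intrinsic order
is.ordered(anxiety)      # TRUE: categories can be ranked
anxiety < "moderate"     # comparisons are only meaningful for ordered factors
as.integer(anxiety) - 1  # the labels 0, 1, 2, 3 used in Definition 14
```

Storing the scale explicitly matters because many R functions (for example summary() and plot()) behave differently for factors, ordered factors and numeric vectors.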
1.3 Lecture 3: Exploratory Data Analysis III
1.3.1 Binning
Definition 18. Binning: A term most frequently used in imaging studies to denote that several pixels
are grouped together to reduce the impact of read noise on the signal to noise ratio. That is, binning is a
partition of the values of a continuous variable into several classes (usually intervals).
Properties
• $x_j^u = x_{j+1}^l$, $j = 1, \ldots, k-1$.
Class size
$$\Delta x_j = x_j^u - x_j^l.$$
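Binning can be sketched in R with cut(). The observations and class limits below are hypothetical; the option right = TRUE produces classes of the form (lower, upper], so that the upper limit of one class is the lower limit of the next.

```r
# Hypothetical observations of a continuous variable (e.g. durations in hours)
x <- c(35, 120, 480, 510, 760, 990, 1200, 1850)

# Class limits: the upper limit of class j equals the lower limit of class j + 1
limits <- c(0, 100, 500, 1000, 2000)

# right = TRUE gives classes of the form (lower, upper], i.e. lower < X <= upper
classes <- cut(x, breaks = limits, right = TRUE)
table(classes)

# Class sizes: upper limit minus lower limit of each class
class_size <- diff(limits)
class_size
```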
Example 1.3.1. Income distribution. According to the salary and income tax statistics
1.3.2 Distribution
Notation:
• variable: X
• observed values: $x_i$ ($i = 1, \ldots, n$)
• distinct values: $x_j$ ($j = 1, \ldots, k$)
Example 1.3.2. Tossing a coin ten times:
• Number of observations: 10
• Observed values: H, T, H, T, T, H, T, H, H, T
Definition 19. The frequency is the number of times a value of a variable is observed.
$$h(X = x_j) = h(x_j) = h_j.$$
• properties: $0 \le h(x_j) \le n$, $\sum_{j=1}^{k} h(x_j) = n$.
Definition 21. Relative frequency
A method for summarising a data set is the construction of a frequency table or frequency distribution.
Definition 22. An empirical frequency distribution (EFD) of a variable is a listing of the values or
ranges of values of the variable together with the frequencies with which these values or ranges of values
occur.
The frequency distribution of a variable is determined by
• the values
The frequency distribution states how the statistical units are distributed with regard to the observed
values.
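A frequency table for the coin-tossing data of Example 1.3.2 can be sketched in R as follows. The relative frequency is taken here to be $f(x_j) = h(x_j)/n$; this is an assumption, since the body of Definition 21 is not shown above.

```r
# Observed values from Example 1.3.2 (ten coin tosses)
tosses <- c("H", "T", "H", "T", "T", "H", "T", "H", "H", "T")
n <- length(tosses)

h <- table(tosses)   # absolute frequencies h(x_j)
f <- h / n           # relative frequencies, assuming f(x_j) = h(x_j) / n

# Frequency table: distinct values together with their frequencies
freq_table <- data.frame(value = names(h),
                         h = as.vector(h),
                         f = as.vector(f))
print(freq_table)

# Properties: each frequency lies in [0, n] and the frequencies sum to n
all(h >= 0 & h <= n)
sum(h) == n
```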
In many cases, we are also interested in learning how these frequency values cumulate on a subset of
possible values. This leads to the definition of cumulative frequency:
Definition 23. Cumulative frequency is the sum of absolute or relative frequencies of all observed
values up to a particular value.
• relative cumulative frequency
$$F(x_j) = \frac{H(x_j)}{n} = \sum_{i=1}^{j} f(x_i).$$
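Cumulative frequencies are running sums, which cumsum() computes directly. A small sketch with a hypothetical discrete sample:

```r
# Hypothetical discrete sample (e.g. number of pregnancies)
x <- c(0, 1, 1, 2, 2, 2, 3, 3, 4, 1)
n <- length(x)

h <- table(x)        # absolute frequencies h(x_j), sorted by value
H <- cumsum(h)       # absolute cumulative frequencies H(x_j)
f <- h / n           # relative frequencies f(x_j)
F_rel <- cumsum(f)   # relative cumulative frequencies F(x_j)

# F(x_j) = H(x_j) / n = f(x_1) + ... + f(x_j)
cum_table <- data.frame(x = as.numeric(names(h)),
                        h = as.vector(h),
                        H = as.vector(H),
                        f = as.vector(f),
                        F = as.vector(F_rel))
print(cum_table)
```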
These definitions now allow us to construct the building blocks of the concept of “distribution”. In
particular, a useful concept in descriptive analyses is that of the Empirical Distribution Function (EDF)
or Empirical Cumulative Distribution Function (ECDF). This definition requires ordinal or numerical
variables.
Definition 24. Empirical distribution function : A probability distribution function estimated directly
from sample data without assuming an underlying algebraic form.
More specifically, the ECDF F is defined as:
$$F(x) = \begin{cases} 0 & \text{for } x < x_1, \\ \sum_{i=1}^{j} f(x_i) & \text{for } x_j \le x < x_{j+1}, \\ 1 & \text{for } x_k \le x. \end{cases}$$
$$\sum_{j=i+1}^{l-1} f(x_j) = \sum_{j=1}^{l-1} f(x_j) - \sum_{j=1}^{i} f(x_j) = F(x_{l-1}) - F(x_i).$$
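The step-function definition and the difference identity can be checked numerically. This sketch uses a hypothetical discrete sample; F_manual is a helper name introduced here, and the built-in ecdf() is used only for comparison.

```r
# Hypothetical discrete sample
x <- c(0, 1, 1, 2, 2, 2, 3, 3, 4, 1)

vals <- sort(unique(x))                 # distinct values x_1 < ... < x_k
f <- as.vector(table(x)) / length(x)    # relative frequencies f(x_j)

# Step-function ECDF from the definition: sum of f(x_i) over all x_i <= t
F_manual <- function(t) sapply(t, function(s) sum(f[vals <= s]))

# Built-in ECDF for comparison
F_builtin <- ecdf(x)

grid <- c(-1, 0, 0.5, 1, 2.7, 4, 10)
all.equal(F_manual(grid), F_builtin(grid))

# Relative frequency of x_{i+1}, ..., x_{l-1} as a difference of F:
# here i = 1 and l = 5, so the sum runs over x_2, x_3, x_4
i <- 1
l <- 5
lhs <- sum(f[(i + 1):(l - 1)])
rhs <- F_manual(vals[l - 1]) - F_manual(vals[i])
all.equal(lhs, rhs)
```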
Binned variables
Suppose now that we have binned variables, and that we want to summarise these values using the EDF.
That is, we have observed values of a continuous variable
• $x_1, x_2, \ldots, x_n$
1. Histogram
• area proportional to frequency
– x-axis: class limits $x_j^l$, $x_j^u$
– y-axis: frequency density $\hat{h}(x_j) = \frac{h(x_j)}{x_j^u - x_j^l}$ or $\hat{f}(x_j) = \frac{f(x_j)}{x_j^u - x_j^l}$
[Figure: histogram of a binned variable; x-axis: Punkte (points), from 0 to 50.]
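The frequency density can be computed by hand and compared with what hist() reports. The sketch below uses the variable mpg from the mtcars data set with hypothetical, unequal class limits chosen here for illustration.

```r
# Variable mpg from the mtcars data set, with hypothetical unequal class limits
x <- mtcars$mpg
limits <- c(10, 15, 20, 25, 35)

# plot = FALSE returns the histogram object without drawing it;
# classes are of the form (lower, upper] because right = TRUE is the default
hst <- hist(x, breaks = limits, plot = FALSE)

h <- hst$counts                 # absolute frequencies h(x_j)
f <- h / length(x)              # relative frequencies f(x_j)
width <- diff(limits)           # class sizes

f_hat <- f / width              # frequency density based on f
all.equal(f_hat, hst$density)   # agrees with the density computed by hist()
sum(f_hat * width)              # the total area equals 1

# Histogram on the density scale, so that area is proportional to frequency
hist(x, breaks = limits, freq = FALSE, xlab = "mpg", main = "Histogram of mpg")
```

With unequal class sizes, plotting raw counts as bar heights would exaggerate wide classes, which is why the density scale is used.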
2. Empirical distribution function
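$$F(x) = \begin{cases} 0 & \text{for } x \le x_1^l, \\ \sum_{i=1}^{j-1} f(x_i) + \frac{x - x_j^l}{x_j^u - x_j^l}\, f(x_j) & \text{for } x_j^l < x \le x_j^u, \\ 1 & \text{for } x_k^u < x. \end{cases}$$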
[Figure: empirical distribution function of the binned variable; y-axis: F(x), from 0 to 1.]
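For binned data the EDF is a piecewise-linear function through the points (class limit, cumulative relative frequency). A minimal sketch using approxfun(), with the class limits and frequencies taken from the table in Example 1.4.2 of these notes; the name F_binned is introduced here for illustration.

```r
# Class limits and absolute frequencies (values as in the table of Example 1.4.2)
limits <- c(0, 100, 500, 1000, 2000)
h <- c(1, 24, 45, 30)
f <- h / sum(h)

# Cumulative relative frequencies at the class limits: 0 at the first lower
# limit and 1 at the last upper limit
F_limits <- c(0, cumsum(f))

# Linear interpolation between class limits; 0 to the left and 1 to the right
F_binned <- approxfun(limits, F_limits, yleft = 0, yright = 1)

F_binned(c(-5, 100, 750, 2000, 3000))

# Plot of the interpolated EDF
plot(limits, F_limits, type = "l", xlab = "x", ylab = "F(x)", lwd = 2)
```

The interpolation assumes that observations are spread evenly within each class, which is the same assumption that underlies the formula above.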
Which one do you prefer, and why?
Another way of producing the EDF is to use the built-in R command ecdf(). See the following
example:
# loading the data (variable mpg from mtcars data set)
mydata <- mtcars$mpg
# EDF plot
r_ecdf <- ecdf(mydata)
plot(r_ecdf)
1.4 Lecture 4: Exploratory Data Analysis IV
In order to analyse, understand, and communicate the information contained in the sample x, we need
tools to summarise it. The process of summarising the data is known as data reduction or data summary.
There exist many quantities that are used as summaries. These quantities are functions of the sample,
and they are called “statistics”. In mathematical terms, a statistic is any function of the sample $T(x)$,
with $T : \mathbb{R}^n \to \mathbb{R}^m$, $1 \le m \le n$. The statistic summarises the data in that, instead of reporting the
entire sample, only the value of the statistic $T(x) = t$ is reported. An example of a statistic (also known
as “summary statistic”) is the sample mean (or average) $T(x) = \bar{x} = \frac{x_1 + \cdots + x_n}{n}$. For instance, if
the sample x consists of the ages of individuals in the UK population, a way of summarising this sample
is to report only the average age, in which case $T(x)$ is the sample mean.
In many cases, a statistical data analysis consists only of summarising a data set, using a choice of
different summary statistics. This kind of analysis is known as “Descriptive Statistics” or “Descriptive
Analysis”. In fact, a descriptive analysis is usually the first step in statistical data analysis as it helps the
statistician gain understanding about the features of the data. Other summaries that are used in practice
are the median (0.5 quantile) and other quantiles, the minimum of the sample, the maximum of
the sample, etc. In fact, the set of summary statistics given by the minimum of the sample, the first
quartile (0.25 quantile), the median, the third quartile (0.75 quantile), and the maximum is known as the
“Five Number Summary”. Visual tools are also used in applied statistics to understand other features
of the data. These include boxplots, violin plots, histograms, smooth density plots, scatter plots,
etc. These tools are not covered in this course, but you may want to have a look at them if you are
planning to pursue a career involving data analysis.
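To make the idea of a statistic concrete, the sketch below writes the sample mean as an R function of the sample and computes a five-number summary for the variable mpg from the mtcars data set. The name T_mean is introduced here for illustration.

```r
x <- mtcars$mpg   # variable mpg from the mtcars data set

# A statistic is a function of the sample; here T maps R^n to R (m = 1)
T_mean <- function(x) sum(x) / length(x)
T_mean(x)

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum.
# Note: fivenum() uses Tukey's hinges, which can differ slightly from the
# quartiles returned by quantile() or summary()
fivenum(x)
summary(x)
```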
Arithmetic mean
The arithmetic mean of a set of values is simply the average $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
In some cases, we may instead want to report the mean of the empirical distribution.
Definition 27. The arithmetic mean $\bar{x}$ of an empirical distribution function is the sum of all observed
values after being split evenly among all statistical units:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{n}\sum_{j=1}^{k} x_j\, h(x_j) = \sum_{j=1}^{k} x_j\, f(x_j)$$
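The three expressions for the mean can be compared numerically. This sketch uses the discrete variable cyl (number of cylinders) from the mtcars data set; the variable choice is ours.

```r
# Variable cyl (number of cylinders) from the mtcars data set
x <- mtcars$cyl
n <- length(x)

tab <- table(x)
xj <- as.numeric(names(tab))   # distinct values x_j
h <- as.vector(tab)            # absolute frequencies h(x_j)
f <- h / n                     # relative frequencies f(x_j)

mean_raw <- sum(x) / n         # (1/n) * sum over the observations x_i
mean_abs <- sum(xj * h) / n    # (1/n) * sum over distinct values of x_j * h(x_j)
mean_rel <- sum(xj * f)        # sum over distinct values of x_j * f(x_j)

c(mean_raw, mean_abs, mean_rel)
```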
For binned data we can only approximate the mean using the classes and number of observations in
each class:
• $x_j$: class mean
• Zero property
$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0, \qquad \sum_{j=1}^{k} (x_j - \bar{x})\, h(x_j) = 0$$
• Summation property
$$z_i = x_i + y_i \;\Rightarrow\; \bar{z} = \bar{x} + \bar{y}$$
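Both properties are easy to verify numerically. The values below are hypothetical; note that floating-point arithmetic gives zero only up to rounding error.

```r
# Hypothetical paired observations
x <- c(2.1, 3.4, 1.8, 5.0, 4.2)
y <- c(10, 12, 9, 15, 11)

# Zero property: deviations from the mean sum to zero (up to rounding error)
zero_raw <- sum(x - mean(x))
zero_raw

# Same property in terms of distinct values and absolute frequencies
d <- c(1, 2, 2, 3, 3, 3)
tab <- table(d)
zero_freq <- sum((as.numeric(names(tab)) - mean(d)) * as.vector(tab))
zero_freq

# Summation property: the mean of z = x + y is the sum of the means
z <- x + y
all.equal(mean(z), mean(x) + mean(y))
```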
Mode
Definition 28. Mode: The most frequently occurring value in a set of observations. Occasionally used
as a measure of location.
The mode $x_D$ is the most frequent value:
$$x_D = x_j \;\big|\; h_j = \max_{x_k} h_k \quad \text{or} \quad f_j = \max_{x_k} f(x_k).$$
This measure is useful for nominal, ordinal, discrete or classified data, but not for continuous data.
Example 1.4.2.
$x_j^l < X \le x_j^u$    $h(x_j)$    $f(x_j)$    $\hat{f}(x_j)$
0 – 100                  1           0.01        0.0001
100 – 500                24          0.24        0.0006
500 – 1000               45          0.45        0.0009
1000 – 2000              30          0.30        0.0003
sum                      100         1.00
• modal class: 500 – 1000 hours
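The mode and the modal class can be found in R with table() and which.max(). For the binned case this sketch assumes that the modal class is the one with the highest frequency density, which matters when classes have different sizes; the discrete example uses cyl from mtcars, a choice made here for illustration.

```r
# Mode of a discrete variable: the value with the highest frequency
# (which.max() returns only the first maximum if there are ties)
tab <- table(mtcars$cyl)
x_mode <- as.numeric(names(tab)[which.max(tab)])
x_mode

# Modal class for binned data, using the table of Example 1.4.2;
# classes have different sizes, so the frequency densities are compared
limits <- c(0, 100, 500, 1000, 2000)
h <- c(1, 24, 45, 30)
f_hat <- (h / sum(h)) / diff(limits)

j <- which.max(f_hat)
modal_class <- c(lower = limits[j], upper = limits[j + 1])
modal_class
```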
Definition 29. Median: The value in a set of ranked observations that divides the data into two parts
of equal size. When there is an odd number of observations the median is the middle value. When
there is an even number of observations the measure is calculated as the average of the two central
values. Provides a measure of location of a sample that is suitable for asymmetric distributions and is
also relatively insensitive to the presence of outliers.
In order to calculate the median, we need to distinguish the cases when the sample size is odd or even:
• if n is odd:
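$$x_{0.5} = x_{\left(\frac{n+1}{2}\right)},$$
• if n is even:
$$x_{0.5} = \frac{1}{2}\left\{ x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right\},$$
where $x_{(i)}$ is the $i$th entry of the ordered (sorted in increasing order) sample.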
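The two cases translate directly into a short R function. The sketch below (median_manual is a name introduced here) applies it to the blood-pressure values of Definition 9, where the mean is affected by the extreme value 198 while the median is not.

```r
# Median from the order statistics, following the two cases above
median_manual <- function(x) {
  xs <- sort(x)                        # ordered sample x_(1) <= ... <= x_(n)
  n <- length(xs)
  if (n %% 2 == 1) {
    xs[(n + 1) / 2]                    # n odd: the middle value
  } else {
    (xs[n / 2] + xs[n / 2 + 1]) / 2    # n even: average of the two central values
  }
}

# Blood pressures from Definition 9 (n = 5, odd)
bp <- c(125, 128, 130, 131, 198)
median_manual(bp)       # 130
mean(bp)                # 142.4, pulled upwards by the value 198

# Dropping the largest value gives an even sample size (n = 4)
median_manual(bp[-5])   # 129
```

The built-in median() follows the same rule and should agree with median_manual().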
The median of classified (binned) data is not usually reported; it is preferable to use the raw data.
However, if you only have access to the binned data, you may still want to calculate (or approximate)
the median. In such a case, the median of classified variables is calculated using the interpolation of the
EDF (see also Figures ):
$$F(x_{0.5}) = 0.5 \iff x_{0.5} = x_j^l + \frac{0.5 - F(x_j^l)}{f(x_j)} \cdot (x_j^u - x_j^l)$$
Exercise 3. Run the following R code, which illustrates the calculation of the median for classified data,
in a particular data set. Try to reflect on what each line of the code is doing.
# Median of classified data
# NOTE: x and y are not defined in this extract of the notes. As an assumption,
# they are set here to the class limits and the cumulative relative frequencies
# of Example 1.4.2.
x <- c(0, 100, 500, 1000, 2000)
y <- c(0, 0.01, 0.25, 0.70, 1.00)
# Plot the interpolated EDF and mark the level F(x) = 0.5
plot(x,y,type="l", xlab = "x", ylab = "F(x)", lwd = 2, cex.axis = 1.5)
abline(h=0.5,col="red",lwd=2)
# The median should be between class 3 and 4
# This defines x^u = x[4] and x^l = x[3]
points(c(x[3],x[3]),c(0,y[3]), col="gray", type = "l", lty = 2, lwd = 2)
points(c(0,x[3]),c(y[3],y[3]), col="gray", type = "l", lty = 2, lwd = 2)
points(c(x[4],x[4]),c(0,y[4]), col="gray", type = "l", lty = 2, lwd = 2)
points(c(0,x[4]),c(y[4],y[4]), col="gray", type = "l", lty = 2, lwd = 2)
# Note that F(x^l) = y[3] and f(x_j) = y[4] - y[3]
# The median then is
med05 <- x[3] + (0.5 - y[3] )*(x[4]-x[3])/(y[4]-y[3])
points(c(med05,med05),c(0,0.5), col="gray", type = "l", lty = 2, lwd = 2)
points(c(0,med05),c(0.5,0.5), col="gray", type = "l", lty = 2, lwd = 2)
Definition 30. Quantiles: Divisions of a probability distribution or frequency distribution into equal,
ordered subgroups, for example, quartiles or percentiles.
Thus, we can interpret quantiles as a generalisation of the median, where we are interested in
levels $p \in (0, 1)$ of the EDF other than 0.5. Special quantiles:
• Deciles $p = s/10$, $s = 1, \ldots, 9$
• Quartiles $p = q/4$, $q = 1, 2, 3$
• Quintiles $p = r/5$, $r = 1, \ldots, 4$
• 50%-quantile = Median
Calculating quantiles requires distinguishing the cases of binned and non-binned data, as follows.
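• Quantiles of non-binned variables
– if $n \cdot p$ is not an integer and $k$ is the smallest integer greater than $n \cdot p$, then
$$x_p = x_{(k)},$$
– if $n \cdot p$ is an integer and $k = n \cdot p$, then each value between $x_{(k)}$ and $x_{(k+1)}$ can be defined
as the quantile; the usual choice is
$$x_p = \frac{1}{2}\left\{ x_{(k)} + x_{(k+1)} \right\}.$$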
• Quantiles of binned variables
$$F(x_p) = p \iff x_p = x_j^l + \frac{p - F(x_j^l)}{f(x_j)} \cdot (x_j^u - x_j^l).$$
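The quantile rules can be sketched as two small R functions. For non-binned data the sketch assumes the usual convention that, when $n \cdot p$ is not an integer, the quantile is $x_{(k)}$ with $k$ the smallest integer above $n \cdot p$. The function names and the sample are hypothetical; the binned example reuses the table of Example 1.4.2.

```r
# Quantile of non-binned data: x_(k) with k the smallest integer above n * p
# when n * p is not an integer, and the average of x_(k) and x_(k+1) when it is
quantile_raw <- function(x, p) {
  xs <- sort(x)
  n <- length(xs)
  np <- n * p
  if (isTRUE(all.equal(np, round(np)))) {   # n * p is an integer (up to rounding)
    k <- round(np)
    (xs[k] + xs[k + 1]) / 2
  } else {
    xs[ceiling(np)]
  }
}

# Quantile of binned data by linear interpolation within the class containing p
quantile_binned <- function(p, limits, f) {
  F_limits <- c(0, cumsum(f))               # F at the class limits
  j <- which(F_limits[-1] >= p)[1]          # first class whose upper limit has F >= p
  limits[j] + (p - F_limits[j]) / f[j] * (limits[j + 1] - limits[j])
}

# Hypothetical sample with n = 8, so that n * p is an integer for the quartiles
x <- c(7, 1, 5, 3, 8, 2, 6, 4)
quantile_raw(x, 0.25)   # (x_(2) + x_(3)) / 2 = 2.5
quantile_raw(x, 0.30)   # n * p = 2.4, so k = 3 and the result is x_(3) = 3

# Binned data from the table of Example 1.4.2, p = 0.9
limits <- c(0, 100, 500, 1000, 2000)
f <- c(0.01, 0.24, 0.45, 0.30)
q90 <- quantile_binned(0.9, limits, f)
q90
```

R's built-in quantile() implements several definitions through its type argument; its default (type = 7) interpolates linearly between order statistics, so it may return values that differ from the rule above.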
What would be the quantiles p = 0.4 and p = 0.6 using the data in Exercise 3?
Further Reading 1. Read the following R Markdown document, which illustrates the robustness of the median.
[Robust Estimation of Location]