7CCMMS61 Statistics for Data Analysis

Francisco Javier Rubio

Department of Mathematics

Contents

1 Week 1: Exploratory Data Analysis

1.1 Lecture 1: Exploratory Data Analysis I
1.2 Lecture 2: Exploratory Data Analysis II
1.2.1 Basic concepts
1.2.2 Scaling
1.3 Lecture 3: Exploratory Data Analysis III
1.3.1 Binning
1.3.2 Distribution
1.4 Lecture 4: Exploratory Data Analysis IV
1.4.1 Summary Statistics

I would appreciate it if you could point out any typos you spot to me (javier.rubio alvarez@kcl.ac.uk).

Disclaimer: These notes should not be distributed or used for commercial purposes.
Week 1: Exploratory Data Analysis

1.1 Lecture 1: Exploratory Data Analysis I

There are several definitions of the concept of Data Science.

Definition 1. Data science is:

• the extraction of knowledge from data. It employs techniques and theories drawn from many fields
within the broad areas of mathematics, statistics, and information technology.

• the scientific analysis of large amounts of information held on computers.

• the use of scientific methods to obtain useful information from computer data, especially large
amounts of data.

• an interdisciplinary field that uses scientific methods, processes, algorithms and systems to
extract knowledge and insights from structured and unstructured data.

For more details on the definition of Data Science, see the following articles:

[“A Very Short History Of Data Science”]

[“Statistics: a data science for the 21st century”]

Definition 2. In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to
summarize their main characteristics, often with visual methods.
As early as 1961, John Tukey identified the importance of EDA, which he defined as:
“Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of
planning the gathering of data to make its analysis easier, more precise or more accurate, and all the
machinery and results of (mathematical) statistics which apply to analyzing data”

There are book-length references on EDA. In this course, we will focus on some common and useful
tools to produce basic descriptive summaries of a data set or the results from a model.

Definition 3. According to the Oxford Dictionary of Statistics, Statistical Inference is defined as:
“The process of drawing conclusions about the nature of some system on the basis of data subject
to random variation. There are several distinguishable and apparently irreconcilable approaches to the
process of inference; comfortingly, there are rarely any gross differences in the inferences that result.
Approaches include Bayesian inference and fiducial inference; the approach first met by a student of
Statistics is usually that based on the Neyman-Pearson lemma.”
The word “inference” refers to drawing conclusions on the basis of some evidence. Thus, Statistical
Inference refers to drawing conclusions on the basis of evidence obtained from the data.

1.2 Lecture 2: Exploratory Data Analysis II

1.2.1 Basic concepts

Definition 4. In statistics, a unit, or statistical unit, is one member of a set of entities being studied.

• object or process

• unambiguously defined

• unit of information for the statistical examination

• natural units (people, animals, plants), socioeconomic units (families, households, companies)

Definition 5. A variable is any characteristic, number, or quantity that can be measured or counted.

Definition 6. Values are simply the values of a variable that a statistical unit can take.

Notation 1. Variables will be denoted with upper case letters while values will be denoted with lower
case letters:

Variable    Values
X           $x_1, x_2, x_3, \dots, x_n$
Y           $y_1, y_2, y_3, \dots, y_n$

Definition 7. Population: In statistics this term is used for any finite or infinite collection of “units”,
which are often people but may be, for example, institutions, events, etc.

Definition 8. Sample: A selected subset of a population chosen by some process usually with the
objective of investigating particular properties of the parent population.

Definition 9. Outlier. “An observation that appears to deviate markedly from the other members of the
sample in which it occurs. In the set of systolic blood pressures, {125; 128; 130; 131; 198}, for example,
198 might be considered an outlier. More formally the term refers to an observation which appears to be
inconsistent with the rest of the data, relative to an assumed model. Such extreme observations may be
reflecting some abnormality in the measured characteristic of a subject, or they may result from an error
in the measurement or recording.”

1.2.2 Scaling
Definition 10. The scale of a variable is the metric on which a variable is recorded on a set of units.
The scale of the variable measured drastically affects the type of analytical techniques that can be used
on the data, and what conclusions can be drawn from the data.
There are different scales (or types of data):

• nominal scale

• ordinal scale

• numerical scale

– discrete
– continuous

Definition 11. A categorical variable is a variable that can take on one of a limited, and usually fixed,
number of possible values, assigning each individual or other unit of observation to a particular group or
nominal category on the basis of some qualitative property. That is, a variable that gives the appropriate
label of an observation after allocation to one of several possible categories. For example, respiratory
status: terrible, poor, fair, good, excellent, or blood group: A, B, AB or O. Respiratory status is an
example of an ordered categorical variable or ordinal variable whereas blood type is an example of an
unordered categorical variable.
Categorical variables can be:

• binary (dichotomous): only two levels.

• polytomous : many levels.

The values the categorical variable can assume are called levels.

Definition 12. Dichotomous or binary variables. A binary variable can only take two mutually exclu-
sive (disjoint) values. For example:

• a treatment is successful or not successful

• a household owns a car or not

• a bank classifies customers as credit worthy or not

• a coin flip returns heads or tails.

Definition 13. Nominal variables are variables that have two or more categories, but which do not have
an intrinsic order. Nominal scales assign numbers as labels to identify objects or classes of objects. A
nominal variable is an unordered categorical variable.

Definition 14. Ordinal variable: A measurement that allows a sample of individuals to be ranked with
respect to some characteristic but where differences at different points of the scale are not necessarily
equivalent. Ordinal data is a categorical, statistical data type where the variables have natural, ordered
categories and the distances between the categories are not known. For example, anxiety might be rated on
a scale ‘none’, ‘mild’, ‘moderate’ and ‘severe’, with the values 0,1,2,3, being used to label the categories.

Definition 15. A numerical variable is a variable where the measurement or value has a numerical
meaning. There are two types of numerical variables: discrete and continuous.

Figure 1.2.1: Scales.

Definition 16. Discrete variables: Variables having only integer values, for example, number of births,
number of pregnancies, number of teeth extracted, etc. Discrete variables are variables that can only
take certain values.

Definition 17. Continuous variable: A measurement not restricted to particular values except in so far
as this is constrained by the accuracy of the measuring instrument. Common examples include weight,
height, temperature, and blood pressure. For such a variable equal sized differences on different parts of
the scale are equivalent. Continuous variables are variables that can take any value (within a range).
Practically speaking, variables with many “countable” units (e.g. income) are treated as continuous
and sometimes called “quasi-continuous”.
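
As a rough illustration of how these scales can be represented in R, the following sketch encodes one variable of each type. The variable names and the example values are made up for this illustration and are not part of the original notes.

```r
# Hypothetical example values, chosen only to illustrate the scales
blood_group <- factor(c("A", "O", "AB", "B", "O"))              # nominal
resp_status <- factor(c("poor", "good", "fair", "excellent", "good"),
                      levels = c("terrible", "poor", "fair", "good", "excellent"),
                      ordered = TRUE)                           # ordinal
n_births  <- c(0L, 2L, 1L, 3L, 1L)                              # numerical, discrete
height_cm <- c(171.2, 165.8, 180.1, 158.4, 175.0)               # numerical, continuous

levels(blood_group)               # levels of the nominal variable (no order)
is.ordered(blood_group)           # FALSE: no intrinsic order
is.ordered(resp_status)           # TRUE: ordered categories
resp_status[1] < resp_status[2]   # comparisons make sense: "poor" < "good"
```

Storing an ordinal variable as an ordered factor keeps the ranking available to R without pretending that the distances between categories are known.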

1.3 Lecture 3: Exploratory Data Analysis III

1.3.1 Binning
Definition 18. Binning: A term most frequently used in imaging studies to denote that several pixels
are grouped together to reduce the impact of read noise on the signal to noise ratio. That is, binning is a
partition of the values of a continuous variable into several classes (usually intervals).

There are several aspects to consider when binning data.

Class limit. This is the value of a variable which limits a class downwards or upwards:

• lower class limit $x_j^l$, $j = 1, \dots, k$.

• upper class limit $x_j^u$, $j = 1, \dots, k$.

Properties

• $x_j^u = x_{j+1}^l$, $j = 1, \dots, k - 1$.

• $x_j^l < x \le x_j^u$ (or $x_j^l \le x < x_j^u$), $j = 1, \dots, k$.

Class size
\[ \Delta x_j = x_j^u - x_j^l . \]
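
The two conventions for the class limits can be tried out in R with `cut()`. The sketch below is only an illustration: the ages are made-up values, and the class limits are the ones used for the age example later in these notes.

```r
# Hypothetical ages, used only to illustrate binning
age <- c(3, 18, 20, 25, 30, 33, 37, 41, 46, 50)

# Class limits: lower limits 0, 20, 30, 37, 46 and upper limits 20, 30, 37, 46, 51
limits <- c(0, 20, 30, 37, 46, 51)

# right = TRUE gives classes of the form (lower, upper]
binned_right <- cut(age, breaks = limits, right = TRUE)

# right = FALSE gives classes of the form [lower, upper)
binned_left <- cut(age, breaks = limits, right = FALSE)

table(binned_right)
table(binned_left)

# Class sizes: upper class limit minus lower class limit
class_size <- diff(limits)
class_size
```

Note how an observation that sits exactly on a class limit (for example, age 20) moves to a different class depending on the rule chosen.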
Example 1.3.1. Income distribution. According to the salary and income tax statistics

• statistical unit: taxpayer

• variable: taxable income

Exercise 1. Identify the Class limit and Class size in the following table. Which rule would you prefer
to apply on the lower class limit and upper class limit?

Total income (€)        Taxpayers (1000)    Σ Income (bn. €)
1 – 4 000               1445.2              2611.3
4 000 – 8 000           1455.5              8889.2
8 000 – 12 000          1240.5              12310.9
12 000 – 16 000         1110.7              15492.7
16 000 – 25 000         2762.9              57218.5
25 000 – 30 000         1915.1              52755.4
30 000 – 50 000         6923.7              270182.7
50 000 – 75 000         3876.9              234493.1
75 000 – 100 000        1239.7              105452.9
100 000 – 250 000       791.6               108065.7
250 000 – 500 000       93.7                31433.8
500 000 – 1 Mill.       26.6                17893.3
1 Mill. – 2 Mill.       8.6                 11769.9
2 Mill. – 5 Mill.       3.7                 10950.8
5 Mill. and more        1.4                 16791.6

1.3.2 Distribution
Notation:

• variable: X

• total number of observations: n

• observed values: $x_i$ ($i = 1, \dots, n$)

• distinct values: $x_j$ ($j = 1, \dots, k$)

Example 1.3.2. Tossing a coin ten times:

• Variable: “visible side of the coin”

• Number of observations: 10

• Distinct values: “heads (H)”, “tails (T)”

• Observed values: H, T, H, T, T, H, T, H, H, T
Definition 19. The frequency is the number of times a value of a variable is observed.

Question: For which type of data is this a good description?

Two types of frequencies are typically reported:
Definition 20. Absolute frequency

• number of statistical units with a certain characteristic value $x_j$ ($j = 1, \dots, k$):
\[ h(X = x_j) = h(x_j) = h_j . \]

• properties: $0 \le h(x_j) \le n$, $\sum_{j=1}^{k} h(x_j) = n$.

Definition 21. Relative frequency

• proportion of statistical units with a certain characteristic value $x_j$ ($j = 1, \dots, k$):
\[ f(x_j) = \frac{h(x_j)}{n} . \]

• properties: $0 \le f(x_j) \le 1$, $\sum_{j=1}^{k} f(x_j) = 1$.
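
As a small check of these two definitions, the frequencies for the coin example (Example 1.3.2) can be computed in R. This is a minimal sketch; the object names `tosses`, `h` and `f` are chosen here for illustration.

```r
# Observed values from the coin example (Example 1.3.2)
tosses <- c("H", "T", "H", "T", "T", "H", "T", "H", "H", "T")
n <- length(tosses)

h <- table(tosses)   # absolute frequencies h(x_j)
f <- h / n           # relative frequencies f(x_j); prop.table(h) gives the same

h
f
sum(h) == n          # the absolute frequencies add up to n
```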

A method for summarising a data set is the construction of a frequency table or frequency distribution.
Definition 22. An empirical frequency distribution (EFD) of a variable is a listing of the values or
ranges of values of the variable together with the frequencies with which these values or ranges of values
occur.
The frequency distribution of a variable is determined by

• the values

• and the absolute or relative frequencies

The frequency distribution states how the statistical units are distributed with regard to the observed
values.

In many cases, we are also interested in learning how these frequency values cumulate on a subset of
possible values. This leads to the definition of cumulative frequency:
Definition 23. Cumulative frequency is the sum of absolute or relative frequencies of all observed
values up to a particular value.

• absolute cumulative frequency
\[ H(x_j) = \sum_{i=1}^{j} h(x_i), \quad j = 1, \dots, k. \]

• relative cumulative frequency
\[ F(x_j) = \frac{H(x_j)}{n} = \sum_{i=1}^{j} f(x_i). \]
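
The cumulative frequencies are running sums of the frequencies, which in R can be sketched with `cumsum()`. The sample below is hypothetical and only serves as an illustration.

```r
x <- c(2, 3, 1, 3, 2, 3, 4)   # hypothetical observations

h <- table(x)                 # absolute frequencies, ordered by value
H <- cumsum(h)                # absolute cumulative frequencies H(x_j)
Fcum <- H / length(x)         # relative cumulative frequencies F(x_j)

rbind(h, H, Fcum)             # one column per distinct value
```

The last entry of `H` equals the sample size and the last entry of `Fcum` equals 1, in line with the properties of the frequencies.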

These definitions now allow us to construct the building blocks of the concept of “distribution”. In
particular, a useful concept in descriptive analyses is that of the Empirical Distribution Function (EDF)
or Empirical Cumulative Distribution Function (ECDF). This definition requires ordinal or numerical
variables.
Definition 24. Empirical distribution function : A probability distribution function estimated directly
from sample data without assuming an underlying algebraic form.
More specifically, the ECDF $F$ is defined as:
\[
F(x) =
\begin{cases}
0 & \text{for } x < x_1, \\
\sum_{i=1}^{j} f(x_i) & \text{for } x_j \le x < x_{j+1}, \\
1 & \text{for } x_k \le x.
\end{cases}
\]

Calculations with the distribution function

\[ f(x_j) = F(x_j) - F(x_{j-1}) \quad \text{for } j = 1, \dots, k, \text{ with } F(x_0) = 0 \]

\[ \sum_{j=i+1}^{l-1} f(x_j) = \sum_{j=1}^{l-1} f(x_j) - \sum_{j=1}^{i} f(x_j) = F(x_{l-1}) - F(x_i). \]
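
These relations can be verified numerically with the built-in `ecdf()` function. The following is a sketch on a hypothetical sample; the names `Fhat`, `vals` and `Fvals` are introduced here for illustration only.

```r
x <- c(2, 3, 1, 3, 2, 3, 4)   # hypothetical observations
Fhat <- ecdf(x)               # ECDF as a step function

vals  <- sort(unique(x))      # distinct values x_1 < ... < x_k
Fvals <- Fhat(vals)           # F(x_1), ..., F(x_k)

# f(x_j) = F(x_j) - F(x_{j-1}), with F(x_0) = 0
f <- diff(c(0, Fvals))
f

# Between two distinct values the ECDF stays constant
Fhat(2.5) == Fhat(2)

# Share of observations with 1 < x <= 3, that is F(3) - F(1)
Fhat(3) - Fhat(1)
```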

Binned variables

Suppose now that we have binned variables, and that we want to summarise these values using the EDF.
That is, we have observed values of a continuous variable

• x1 , x2 , . . . , xn

• binned into k classes

Frequency table for binned data

Classes                    absolute class frequency                relative class frequency

$x_j^l < X \le x_j^u$      $h(x_j) = h(x_j^l < X \le x_j^u)$       $f(x_j) = f(x_j^l < X \le x_j^u)$
$x_1^l$ – $x_1^u$          $h(x_1)$                                $f(x_1)$
⋮                          ⋮                                       ⋮
$x_j^l$ – $x_j^u$          $h(x_j)$                                $f(x_j)$
⋮                          ⋮                                       ⋮
$x_k^l$ – $x_k^u$          $h(x_k)$                                $f(x_k)$
Sum                        $n$                                     1

Graphical representation of binned data

1. Histogram

• area proportional to frequency
– x-axis: class limits $x_j^l$, $x_j^u$
– y-axis: frequency density $\hat{h}(x_j) = \frac{h(x_j)}{x_j^u - x_j^l}$ or $\hat{f}(x_j) = \frac{f(x_j)}{x_j^u - x_j^l}$

• class frequency = area of the rectangle over the respective class.

• total area of the histogram = 1 (when the relative frequency density $\hat{f}$ is used)
\[ \sum_{j=1}^{k} \hat{f}(x_j)(x_j^u - x_j^l) = \sum_{j=1}^{k} f(x_j) = 1 \]

Example 1.3.3. Binned representation of the age of users of an online platform.

$x_j^l \le X < x_j^u$    $h(x_j)$    $f(x_j)$    $\hat{f}(x_j)$

0 – 20                   57          0.208       0.010
20 – 30                  93          0.339       0.034
30 – 37                  92          0.336       0.048
37 – 46                  29          0.106       0.012
46 – 51                  3           0.011       0.002
Sum                      274         1.000

[Figure: histogram of the age data; horizontal axis from 0 to 50, bars labelled with the class counts 57, 93, 92, 29.]

Figure 1.3.1: age data: histogram
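
The frequency densities in Example 1.3.3 can be reproduced in R from the class limits and the absolute frequencies. The sketch below also draws the bars by hand with `rect()`, so that each rectangle has an area equal to the relative class frequency; the object names are chosen here for illustration.

```r
# Class limits and absolute frequencies from Example 1.3.3
limits <- c(0, 20, 30, 37, 46, 51)
h <- c(57, 93, 92, 29, 3)

f      <- h / sum(h)     # relative class frequencies
width  <- diff(limits)   # class sizes
f_dens <- f / width      # frequency densities (bar heights)

round(cbind(h, f, f_dens), 3)

# Total area of the bars
sum(f_dens * width)

# Draw the bars: each rectangle spans one class and has height f_dens
plot(range(limits), c(0, max(f_dens)), type = "n",
     xlab = "age", ylab = "frequency density")
rect(limits[-length(limits)], 0, limits[-1], f_dens, col = "gray")
```

With unequal class sizes, plotting the raw counts as bar heights would exaggerate the wide classes, which is why the density is used on the y-axis.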

2. Empirical distribution function
\[
F(x) =
\begin{cases}
0 & \text{for } x \le x_1^l, \\
\sum_{i=1}^{j-1} f(x_i) + \frac{x - x_j^l}{x_j^u - x_j^l} \, f(x_j) & \text{for } x_j^l < x \le x_j^u, \\
1 & \text{for } x_k^u < x.
\end{cases}
\]

Graphical representation: piecewise linear curve (frequency polygon)

• $x_j^l$: lower class limit
• $x_j^u$: upper class limit
Example 1.3.4. Examination of the durability (in hours) of 100 light bulbs
statistical unit: light bulb
variable: durability
numerical, continuous

$x_j^l < X \le x_j^u$    $h(x_j)$    $f(x_j)$    $H(x_j)$    $F(x_j)$

0 – 100                  1           0.01        1           0.01
100 – 500                24          0.24        25          0.25
500 – 1000               45          0.45        70          0.70
1000 – 2000              30          0.30        100         1.00
Sum                      100         1.0

Table 1.1: Distribution function of the durability of light bulbs

[Figure: piecewise linear plot of F(x) against x; x-axis ticks at 100, 500, 1000, 2000 and y-axis ticks from 0.2 to 1.0.]

Assumptions: uniform distribution of the observations within the class

⇒ linear linking of the points in the graphical representation
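
Under this assumption, the EDF of binned data can be evaluated at any point by linear interpolation between the class limits. The following sketch uses `approxfun()` with the light bulb data; the function name `edf_binned` is a hypothetical helper introduced for this illustration.

```r
# Class limits and relative class frequencies from Example 1.3.4
limits <- c(0, 100, 500, 1000, 2000)
f <- c(0.01, 0.24, 0.45, 0.30)

# EDF values at the class limits: 0 at the first lower limit, then cumulative sums
F_limits <- c(0, cumsum(f))

# Linear interpolation between class limits; 0 to the left and 1 to the right
edf_binned <- approxfun(limits, F_limits, yleft = 0, yright = 1)

edf_binned(750)    # 0.25 + (750 - 500) / (1000 - 500) * 0.45 = 0.475
edf_binned(2500)   # beyond the last upper class limit, so 1
```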
Exercise 2. Run the following R code to produce the two types of EDFs:
# Defining the x and y values
x <- c(0,100,500,1000,2000)
y <- c(0,0.01,0.25,0.7,1)

# Plot using steps

plot(x,y,type="s", xlab = "x", ylab = "F(x)", lwd = 2)

# Plot using linear interpolation

plot(x,y,type="l", xlab = "x", ylab = "F(x)", lwd = 2)

Which one do you prefer, and why?
Another way of producing an EDF is to use the built-in R command ecdf(). See the following
example:
# loading the data (variable mpg from mtcars data set)
mydata <- mtcars$mpg

# read the description of the data

help(mtcars)

# EDF plot
r_ecdf <- ecdf(mydata)
plot(r_ecdf)

# EDF evaluation (at 20 in this case)

r_ecdf(20)

Think of other ways of summarising this data set.

1.4 Lecture 4: Exploratory Data Analysis IV
In order to analyse, understand, and communicate the information contained in the sample x, we need
tools to summarise it. The process of summarising the data is known as data reduction or data summary.
There exist many quantities that are used as summaries. These quantities are functions of the sample,
and they are called “statistics”. In mathematical terms, a statistic is any function of the sample $T(x)$,
with $T : \mathbb{R}^n \to \mathbb{R}^m$, $1 \le m \le n$. The statistic summarises the data in that, instead of reporting the
entire sample, only the value of the statistic $T(x) = t$ is reported. An example of a statistic (also known
as “summary statistic”) is the sample mean (or average) $T(x) = \bar{x} = \frac{x_1 + \cdots + x_n}{n}$. For instance, if
the sample $x$ consists of the ages of individuals from the UK population, a way of summarising this sample
is to report only the average age in the sample, in which case $T(x)$ is the sample mean.
In many cases, a statistical data analysis consists only of summarising a data set, using a choice of
different summary statistics. This kind of analysis is known as “Descriptive Statistics” or “Descriptive
Analysis”. In fact, a descriptive analysis is usually the first step in statistical data analysis as it helps the
statistician gain understanding about the features of the data. Other summaries that are used in practice
are: the median (0.5 quantile) as well as other quantiles, the minimum of the sample, the maximum of
the sample, etc. In fact, the set of summary statistics given by the minimum of the sample, the first
quartile (0.25 quantile), the median, the third quartile (0.75 quantile), and the maximum is known as the
“Five Number Summary”. Visual tools are also used in applied statistics to understand other features
of the data. These include boxplots, violin plots, histograms, smooth density plots, scatter plots,
etc. These tools are not covered in this course, but you may want to have a look at them if you are
planning to pursue a career involving data analysis.
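
As an illustration, the summaries mentioned above can be obtained in R with built-in functions, here for the mpg variable of the mtcars data set used earlier. This is only a sketch: R offers several conventions for sample quantiles, so `quantile()` and `fivenum()` may give slightly different quartiles from each other and from a calculation by hand.

```r
mydata <- mtcars$mpg     # same built-in data set as in the ecdf() example

mean(mydata)             # sample mean
median(mydata)           # 0.5 quantile
quantile(mydata, probs = c(0.25, 0.5, 0.75))   # quartiles (R's default rule)
fivenum(mydata)          # Tukey's five-number summary
summary(mydata)          # minimum, quartiles, mean and maximum
```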

1.4.1 Summary Statistics

Definition 25. In descriptive statistics, summary statistics are used to summarize a set of observations,
in order to communicate the largest amount of information as simply as possible.

The main types of summary statistics are:

• Measures of location or central tendency.

• Measures of dispersion or spread.

• Measures of shape, such as skewness and kurtosis.

• Measures of dependence, such as correlation.

An important property of summary statistics is that of Robustness.

Definition 26. A statistic is called robust if it is insensitive to outliers.
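
A quick way to see what robustness means is to compare the mean and the median with and without a suspected outlier. The sketch below uses the systolic blood pressures from the outlier example in Definition 9; the object names are chosen here for illustration.

```r
# Systolic blood pressures from the outlier example (Definition 9)
bp <- c(125, 128, 130, 131, 198)
bp_clean <- bp[bp != 198]       # same sample without the suspected outlier

mean(bp);   mean(bp_clean)      # 142.4 versus 128.5
median(bp); median(bp_clean)    # 130 versus 129
```

Removing a single observation shifts the mean by almost 14 units, while the median moves by only 1 unit.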

Arithmetic mean

The arithmetic mean of a set of values is simply the average $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.
In some cases, we may instead want to report the mean of the empirical distribution.

Definition 27. The Arithmetic mean x̄ of an empirical distribution function is the sum of all observed
values after being split up evenly to all statistical units:
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{1}{n} \sum_{j=1}^{k} x_j h(x_j) = \sum_{j=1}^{k} x_j f(x_j) \]

For binned data we can only approximate the mean using the classes and the number of observations in
each class:

• $x_j$: class mean

• $n_j$: the number of observations in class $j$

\[ \bar{x} = \frac{1}{n} \sum_{j=1}^{k} x_j n_j, \qquad n = \sum_{j=1}^{k} n_j \]

Example 1.4.1. Monthly household net income (up to 25 000 Euro)

MHNI (Euro)        Class mean $x_j$    Share of HH $f(x_j)$    $F(x_j)$
1 – 800            400                 0.044                   0.044
800 – 1 400        1100                0.166                   0.210
1 400 – 3 000      2200                0.471                   0.681
3 000 – 5 000      4000                0.243                   0.924
5 000 – 25 000     15000               0.076                   1.000

\[
\bar{x} = 400 \cdot 0.044 + 1100 \cdot 0.166 + 2200 \cdot 0.471 + 4000 \cdot 0.243 + 15000 \cdot 0.076
= 17.6 + 182.6 + 1036.2 + 972 + 1140 = 3348.4 \text{ EUR}
\]
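
The calculation in Example 1.4.1 is a weighted sum of the class means, which can be checked in R as follows. This is a minimal sketch with object names chosen for illustration.

```r
# Class means and relative frequencies from Example 1.4.1
class_mean <- c(400, 1100, 2200, 4000, 15000)
f <- c(0.044, 0.166, 0.471, 0.243, 0.076)

contrib <- class_mean * f   # contribution of each class to the mean
contrib
sum(contrib)                # approximate mean in EUR
```

The last class contributes 1140 EUR although it contains only 7.6% of the households, which shows how strongly the class mean chosen for a wide upper class affects the result.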

The arithmetic mean has some useful properties:

• Zero property
\[ \sum_{i=1}^{n} (x_i - \bar{x}) = 0, \qquad \sum_{j=1}^{k} (x_j - \bar{x}) h(x_j) = 0 \]

• Summation property
\[ z_i = x_i + y_i \;\Rightarrow\; \bar{z} = \bar{x} + \bar{y} \]
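
Both properties are easy to check numerically. The vectors below are hypothetical values used only for this illustration.

```r
x <- c(4, 8, 15, 16, 23, 42)   # hypothetical values
y <- c(1, 1, 2, 3, 5, 8)       # hypothetical values

# Zero property: deviations from the mean sum to zero
# (in general only up to floating-point rounding)
sum(x - mean(x))

# Summation property: the mean of a sum is the sum of the means
z <- x + y
all.equal(mean(z), mean(x) + mean(y))
```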

Mode
Definition 28. Mode: The most frequently occurring value in a set of observations. Occasionally used
as a measure of location.
The mode $x_D$ is the most frequent value:
\[ x_D = x_j \quad \text{such that} \quad h(x_j) = \max_{k} h(x_k) \ \text{ or, equivalently, } \ f(x_j) = \max_{k} f(x_k). \]

This measure is useful for nominal, ordinal, discrete or classified data, but not for continuous data.
Example 1.4.2.

$x_j^l < X \le x_j^u$    $h(x_j)$    $f(x_j)$    $\hat{f}(x_j)$
0 – 100                  1           0.01        0.0001
100 – 500                24          0.24        0.0006
500 – 1000               45          0.45        0.0009
1000 – 2000              30          0.30        0.0003
Sum                      100         1.00

• modal class: 500 – 1000 hours

• mode: 750 hours
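
The modal class in Example 1.4.2 can be located in R as sketched below. The sketch assumes that the modal class is the one with the highest frequency density (which matters when class sizes differ) and that the class midpoint is reported as the mode; both choices are consistent with the example but are assumptions of this illustration.

```r
# Light bulb classes from Example 1.4.2
limits <- c(0, 100, 500, 1000, 2000)
h <- c(1, 24, 45, 30)

f_dens <- (h / sum(h)) / diff(limits)   # frequency densities
j <- which.max(f_dens)                  # index of the modal class

c(lower = limits[j], upper = limits[j + 1])   # modal class: 500 to 1000
(limits[j] + limits[j + 1]) / 2               # class midpoint: 750
```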

Definition 29. Median: The value in a set of ranked observations that divides the data into two parts
of equal size. When there is an odd number of observations the median is the middle value. When
there is an even number of observations the measure is calculated as the average of the two central
values. Provides a measure of location of a sample that is suitable for asymmetric distributions and is
also relatively insensitive to the presence of outliers.

In order to calculate the median, we need to distinguish the cases when the sample size is odd or even:

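• if $n$ is odd:
\[ x_{0.5} = x_{\left(\frac{n+1}{2}\right)}, \]

• if $n$ is even:
\[ x_{0.5} = \frac{1}{2} \left\{ x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right\}, \]
where $x_{(i)}$ is the $i$th entry of the ordered (sorted in increasing order) sample.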
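
The two cases can be written out directly in R. The function `median_by_hand` below is a hypothetical helper written for this illustration; in practice the built-in `median()` does the same job.

```r
# Median via the ordered sample; a hypothetical helper, not a built-in function
median_by_hand <- function(x) {
  xs <- sort(x)                 # ordered sample x_(1) <= ... <= x_(n)
  n <- length(xs)
  if (n %% 2 == 1) {
    xs[(n + 1) / 2]             # odd n: the middle value
  } else {
    (xs[n / 2] + xs[n / 2 + 1]) / 2   # even n: average of the two central values
  }
}

median_by_hand(c(125, 128, 130, 131, 198))   # odd n: 130
median_by_hand(c(125, 128, 130, 131))        # even n: 129
```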
The median of classified (binned) data is not usually reported; it is preferable to use the raw data.
However, if you only have access to the binned data, you may still want to calculate (or approximate)
the median. In that case, the median of classified variables is calculated by linear interpolation of the
EDF (see also Figures ), where $j$ denotes the class in which the EDF reaches 0.5:

\[ F(x_{0.5}) = 0.5 \iff x_{0.5} = x_j^l + \frac{0.5 - F(x_j^l)}{f(x_j)} \cdot (x_j^u - x_j^l) \]

The following figure illustrates this calculation.

Exercise 3. Run the following R code, which illustrates the calculation of the median for classified data,
in a particular data set. Try to reflect on what each line of the code is doing.
# Median of classified data
# x (class limits) and y (EDF values at the class limits) as defined in Exercise 2
x <- c(0,100,500,1000,2000)
y <- c(0,0.01,0.25,0.7,1)
plot(x,y,type="l", xlab = "x", ylab = "F(x)", lwd = 2, cex.axis = 1.5)
abline(h=0.5,col="red",lwd=2)
# The EDF crosses 0.5 between x[3] and x[4], so the median lies in that class
# This defines x^l = x[3] and x^u = x[4]
points(c(x[3],x[3]),c(0,y[3]), col="gray", type = "l", lty = 2, lwd = 2)
points(c(0,x[3]),c(y[3],y[3]), col="gray", type = "l", lty = 2, lwd = 2)
points(c(x[4],x[4]),c(0,y[4]), col="gray", type = "l", lty = 2, lwd = 2)
points(c(0,x[4]),c(y[4],y[4]), col="gray", type = "l", lty = 2, lwd = 2)
# Note that F(x^l) = y[3] and f(x_j) = y[4] - y[3]
# The median then is
med05 <- x[3] + (0.5 - y[3] )*(x[4]-x[3])/(y[4]-y[3])
points(c(med05,med05),c(0,0.5), col="gray", type = "l", lty = 2, lwd = 2)
points(c(0,med05),c(0.5,0.5), col="gray", type = "l", lty = 2, lwd = 2)

Definition 30. Quantiles: Divisions of a probability distribution or frequency distribution into equal,
ordered subgroups, for example, quartiles or percentiles.

Thus, we can interpret quantiles as a generalisation of the median, where we are interested in finding
the values at which the EDF equals a level $p \in (0, 1)$ other than 0.5. Special quantiles:

• Deciles p = s/10, s = 1, . . . , 9

• Quartiles p = q/4, q = 1, 2, 3

• Quintiles p = r/5, r = 1, . . . , 4

• 50%-quantile = Median

Calculating quantiles requires distinguishing the cases of binned and non-binned data, as follows.

• Quantiles of non-binned data

– if $n \cdot p$ is not an integer and $k$ is the smallest integer greater than $n \cdot p$, then the quantile is
\[ x_p = x_{(k)}, \]

– if $n \cdot p$ is an integer and $k = n \cdot p$, then each value between $x_{(k)}$ and $x_{(k+1)}$ can be defined
as quantile; by convention,
\[ x_p = \frac{1}{2} \left\{ x_{(k)} + x_{(k+1)} \right\}. \]

• Quantiles of binned variables
\[ F(x_p) = p \iff x_p = x_j^l + \frac{p - F(x_j^l)}{f(x_j)} \cdot (x_j^u - x_j^l). \]
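
The interpolation formula for binned variables can be wrapped in a small R function. The function `quantile_binned` below is a hypothetical helper written for this illustration (it is not a built-in), applied to the light bulb classes; it assumes $0 < p < 1$ and non-empty classes.

```r
# Hypothetical helper: quantile of binned data by linear interpolation of the EDF
quantile_binned <- function(p, limits, f) {
  F_limits <- c(0, cumsum(f))         # EDF at the class limits
  j <- which(F_limits[-1] >= p)[1]    # first class whose upper limit reaches p
  limits[j] + (p - F_limits[j]) / f[j] * (limits[j + 1] - limits[j])
}

limits <- c(0, 100, 500, 1000, 2000)  # light bulb classes
f <- c(0.01, 0.24, 0.45, 0.30)        # relative class frequencies

quantile_binned(0.5, limits, f)       # median: 500 + (0.5 - 0.25) / 0.45 * 500
quantile_binned(0.9, limits, f)       # 0.9 quantile: 1000 + (0.9 - 0.7) / 0.3 * 1000
```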

What would be the quantiles p = 0.4 and p = 0.6 using the data in Exercise 3?

Further Reading 1. Read the following R Markdown document, which illustrates the robustness of the median.
[Robust Estimation of Location]

No ratings yet
This Study Resource Was: MC Qu. 9-54 The Construction Manager For ABC..
3 pages
unit 4 ai
No ratings yet
unit 4 ai
29 pages
Farhan Dhuha Alharis (220201089)
No ratings yet
Farhan Dhuha Alharis (220201089)
28 pages
Artificial Intelligence-Based Lead Propensity Prediction
No ratings yet
Artificial Intelligence-Based Lead Propensity Prediction
10 pages
H2 MYE Revision Package Hypothesis Testing Solutions
No ratings yet
H2 MYE Revision Package Hypothesis Testing Solutions
9 pages
Grlweap LRFD
No ratings yet
Grlweap LRFD
128 pages
Bayesian Answers
No ratings yet
Bayesian Answers
13 pages
Data Processing and Statistical Treatment: Mark C. Maravillas, Maem
No ratings yet
Data Processing and Statistical Treatment: Mark C. Maravillas, Maem
37 pages
QC PDF
No ratings yet
QC PDF
18 pages
Statistics for Lawyers and Law for Statistics
No ratings yet
Statistics for Lawyers and Law for Statistics
26 pages
Dilla University: Page 1 of 6
100% (2)
Dilla University: Page 1 of 6
6 pages
Assignment Question
No ratings yet
Assignment Question
6 pages

7CCMMS61 Statistics for Data Analysis

Francisco Javier Rubio

1 Week 1: Exploratory Data Analysis

1.1 Lecture 1: Exploratory Data Analysis I

Definition 1. Data science is:

• the scientific analysis of large amounts of information held on computers.

[“A Very Short History Of Data Science”]

[“Statistics: a data science for the 21st century”]

1.2.1 Basic concepts

• unit of information for the statistical examination

• binary (dichotomous): only two levels.

• polytomous: more than two levels.

• a treatment is successful or not successful

• a household owns a car or not

• a bank classifies customers as creditworthy or not

• a coin flip returns heads or tails.

There are several aspects to consider when binning data.

• lower class limit $x_j^l$, $j = 1, \dots, k$.

• upper class limit $x_j^u$, $j = 1, \dots, k$.

• $x_j^l < x \le x_j^u$ (or $x_j^l \le x < x_j^u$), $j = 1, \dots, k$.
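The class-limit convention above can be sketched in Python. This is a minimal illustration, assuming right-closed classes $x_j^l < x \le x_j^u$ that share their limits; the function name `assign_class` and the numeric limits are hypothetical and not taken from the lecture.

```python
from bisect import bisect_left
from collections import Counter

def assign_class(x, limits):
    """Return the 1-based class index j with limits[j-1] < x <= limits[j].

    `limits` is the sorted list of class limits (k classes need k + 1 limits).
    Returns None if x falls outside all classes.
    """
    if x <= limits[0] or x > limits[-1]:
        return None
    # bisect_left finds the first limit that is >= x, i.e. the upper limit
    # of the right-closed class containing x.
    return bisect_left(limits, x)

# Hypothetical class limits and observations, for illustration only.
limits = [0, 100, 500, 1000, 2000]
observations = [50, 100, 101, 750, 1500, 2000]

classes = [assign_class(x, limits) for x in observations]
class_counts = Counter(classes)
print(classes)        # class index of each observation
print(class_counts)   # absolute class frequencies
```

Switching to left-closed classes ($x_j^l \le x < x_j^u$) would mean using `bisect_right` and adjusting the boundary checks accordingly.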

• statistical unit: taxpayer

• variable: taxable income

• total number of observations: n

• Variable: “visible side of the coin”

• Distinct values: “heads (H)”, “tails (T)”

Question: for which type of data is this a good description?

• number of statistical units with a certain characteristic value xj (j = 1, . . . , k)

• proportion of statistical units with a certain characteristic value xj (j = 1, . . . , k)

• and the absolute or relative frequencies

• absolute cumulative frequency

Calculations with the distribution function

$f(x_j) = F(x_j) - F(x_{j-1})$ for $j = 1, \dots, k$, with $F(x_0) = 0$.
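The relation between relative frequencies and the distribution function can be checked numerically. The following Python sketch uses a small hypothetical sample; the variable names `h`, `f`, `H` and `F` are chosen here to stand for the absolute, relative, absolute cumulative and relative cumulative frequencies.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical observations of a discrete variable, for illustration only.
observations = [1, 3, 2, 3, 3, 1, 2, 3]
n = len(observations)

counts = Counter(observations)
values = sorted(counts)                  # distinct values x_1 < ... < x_k

h = [counts[x] for x in values]          # absolute frequencies
f = [hj / n for hj in h]                 # relative frequencies
H = list(accumulate(h))                  # absolute cumulative frequencies
F = [Hj / n for Hj in H]                 # distribution function at each x_j

# Recover f from F using f(x_j) = F(x_j) - F(x_{j-1}), with F(x_0) = 0.
recovered = [F[j] - (F[j - 1] if j > 0 else 0.0) for j in range(len(F))]

for x, hj, fj, Hj, Fj in zip(values, h, f, H, F):
    print(x, hj, fj, Hj, Fj)
```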

• binned into k classes

Frequency table for binned data

Classes | absolute class frequency | relative class frequency

Graphical representation of binned data

• class frequency = area of the rectangle over the respective class.
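The area principle means that the height of each rectangle is the class frequency divided by the class width. A minimal Python sketch follows, assuming relative frequencies are used, so that the total area is 1; the class limits and counts are hypothetical.

```python
# Hypothetical class limits and absolute class frequencies.
limits = [0, 20, 30, 50, 100]
counts = [10, 30, 40, 20]

n = sum(counts)
widths = [upper - lower for lower, upper in zip(limits, limits[1:])]
rel_freq = [c / n for c in counts]

# Height of each rectangle: relative class frequency per unit of class width.
heights = [fj / w for fj, w in zip(rel_freq, widths)]

# The area of each rectangle reproduces the relative class frequency.
areas = [hgt * w for hgt, w in zip(heights, widths)]

for lower, upper, hgt, area in zip(limits, limits[1:], heights, areas):
    print(f"[{lower}, {upper}): height={hgt:.4f}, area={area:.2f}")
```

With unequal class widths, plotting the raw frequencies as heights would overstate the wide classes, which is why the heights are normalised by width.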

Example 1.3.3. Binned representation of the age of users of an online platform.

$x_j^l \le X < x_j^u$ | $h(x_j)$ | $f(x_j)$ | $\hat{f}(x_j)$

Figure 1.3.1: age data: histogram

Graphical representation: piecewise linear curve (frequency polygon)

$x_j^l < X \le x_j^u$ | $h(x_j)$ | $f(x_j)$ | $H(x_j)$ | $F(x_j)$

Table 1.1: Distribution function of the durability of light bulbs

[Figure: plot against x, with axis ticks at 100, 500, 1000 and 2000]

Assumptions: uniform distribution of the observations within the class
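Under the uniformity assumption, the distribution function of binned data increases linearly within each class. The Python sketch below illustrates this interpolation; the function name `binned_cdf`, the lower limit 0 and the relative class frequencies are hypothetical.

```python
def binned_cdf(x, limits, rel_freq):
    """Distribution function for binned data, linear within each class.

    Assumes observations are uniformly spread inside every class, so that
    F(x) = F(lower limit) + f_j * (x - lower) / (upper - lower) in class j.
    """
    if x <= limits[0]:
        return 0.0
    if x >= limits[-1]:
        return 1.0
    cumulative = 0.0  # value of F at the lower limit of the current class
    for lower, upper, fj in zip(limits, limits[1:], rel_freq):
        if x <= upper:
            return cumulative + fj * (x - lower) / (upper - lower)
        cumulative += fj
    return 1.0

# Hypothetical class limits and relative class frequencies.
limits = [0, 100, 500, 1000, 2000]
rel_freq = [0.1, 0.4, 0.3, 0.2]

print(binned_cdf(300, limits, rel_freq))
print(binned_cdf(750, limits, rel_freq))
```

At the class limits the function returns the cumulative relative frequencies, so it agrees with the tabulated distribution function there.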

# Plot using steps

# Plot using linear interpolation

# read the description of the data

# EDF evaluation (at 20 in this case)
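The comments above outline computing and evaluating an empirical distribution function (EDF). A minimal Python sketch of the evaluation step is given here; the sample is hypothetical, since the data set itself is not shown, and the helper name `edf` is an assumption.

```python
from bisect import bisect_right

def edf(sample):
    """Return the empirical distribution function of `sample` as a callable.

    F(x) is the proportion of observations that are less than or equal to x,
    which makes F a right-continuous step function.
    """
    ordered = sorted(sample)
    n = len(ordered)

    def F(x):
        # bisect_right counts how many sorted observations are <= x.
        return bisect_right(ordered, x) / n

    return F

# Hypothetical sample, for illustration only.
sample = [12, 25, 18, 20, 31, 20, 15, 40]
F = edf(sample)

print(F(20))  # EDF evaluation at 20
```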

Think of other ways of summarising this data set.

1.4.1 Summary Statistics

The main types of summary statistics are:

• Measures of location or central tendency.

• Measures of dispersion or spread.

• Measures of shape, such as skewness and kurtosis.

• Measures of dependence, such as correlation.

An important property of summary statistics is robustness.

Definition 26. A statistic is called robust if it is insensitive to outliers.
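A small numerical comparison illustrates the definition. In this Python sketch, one observation of a hypothetical sample is replaced by an outlier: the median is unchanged, whereas the arithmetic mean shifts considerably.

```python
from statistics import mean, median

# Hypothetical sample, and the same sample with its largest value
# replaced by an outlier.
clean = [2, 3, 3, 4, 5]
contaminated = clean[:-1] + [500]

print("mean:  ", mean(clean), "->", mean(contaminated))
print("median:", median(clean), "->", median(contaminated))
```

In this sense the median behaves as a robust measure of location, while the arithmetic mean does not.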

• $n_j$: the number of observations in class $j$
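With the class counts $n_j$, the arithmetic mean of binned data can be computed as a weighted average of class means. The Python sketch below assumes each class is summarised by its class mean; all numbers are hypothetical.

```python
# Hypothetical class means and class counts n_j.
class_means = [10.0, 20.0, 40.0]
class_counts = [2, 5, 3]

n = sum(class_counts)

# Weighted by counts: (1 / n) * sum of n_j * (class mean j).
mean_from_counts = sum(nj * m for nj, m in zip(class_counts, class_means)) / n

# Equivalent form weighted by relative class frequencies n_j / n.
shares = [nj / n for nj in class_counts]
mean_from_shares = sum(fj * m for fj, m in zip(shares, class_means))

print(mean_from_counts, mean_from_shares)
```

Both forms give the same value; the second corresponds to multiplying each class mean by the share of observations in that class.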

Example 1.4.1. Monthly household net income (up to 25 000 Euro)

Monthly household net income (MHNI) | Class mean | Share of households (HH)

x̄ = 400 · 0.044 + 1100 · 0.166 + 2200 · 0.471 +

The arithmetic mean has some useful properties:

• mode: 750 hours

The following figure illustrates this calculation.

• Quantiles of non-binned data

– if n · p is not an integer and k is the nearest integer to n · p, then the quantile
