2nd Unit
Descriptive Statistics
Descriptive statistics applies the concepts, measures, and terms that are used
to describe the basic features of the samples in a study. These procedures are
essential for providing summaries about the samples as an approximation of the
population. Together with simple graphics, they form the basis of every
quantitative analysis of data. In order to describe the sample data and to be able
to infer any conclusion, we should go through several steps:
Data Preparation
One of the first tasks when analyzing data is to collect and prepare the data in a
format appropriate for analysis of the samples. The most common steps for data
preparation involve the following operations.
1. Obtaining the data: Data can be read directly from a file or they might be obtained
by scraping the web.
2. Parsing the data: The right parsing procedure depends on what format the
data are in: plain text, fixed columns, CSV, XML, HTML, etc.
3. Cleaning the data: Survey responses and other data files are almost always
incomplete. Sometimes there are multiple codes for things such as "not asked",
"did not know", and "declined to answer". And there are almost always errors. A
simple strategy is to remove or ignore incomplete records.
4. Building data structures: Once you read the data, it is necessary to store them
in a data structure that lends itself to the analysis we are interested in. If the
data fit into memory, building a data structure is usually the way to go. If
not, a database is usually built, which is an out-of-memory data structure.
Most databases provide a mapping from keys to values, so they can serve as
dictionaries.
Let us consider a public database called the "Adult" dataset, hosted on the UCI
Machine Learning Repository. It contains approximately 32,000 observations
concerning different financial parameters related to the US population: age, sex,
marital (marital status of the individual), country, income (Boolean variable: whether
the person makes more than $50,000 per annum), education (the highest level of
education achieved by the individual), occupation, capital gain, etc.
We will show that we can explore the data by asking questions like: “Are men
more likely to become high-income professionals than women, i.e., to receive an
income of over $50,000 per annum?”
data = []
for line in file:
    data1 = line.split(', ')
    if len(data1) == 15:
        data.append([chr_int(data1[0]), data1[1],
                     chr_int(data1[2]), data1[3],
                     chr_int(data1[4]), data1[5],
                     data1[6], data1[7], data1[8],
                     data1[9], chr_int(data1[10]),
                     chr_int(data1[11]), chr_int(data1[12]),
                     data1[13], data1[14]])
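The helper chr_int used above is not defined in this excerpt; a minimal sketch consistent with its use (turning numeric text fields into integers) might be:

def chr_int(a):
    # hypothetical helper: convert a numeric string field to int, else 0
    a = a.strip()
    return int(a) if a.isdigit() else 0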
The command shape gives exactly the number of data samples (in rows, in this
case) and features (in columns):
In [4]: df.shape
Thus, we can see that our dataset contains 32,561 data records with 15 features
each. Let us count the number of items per country:
In [5]:
counts = df.groupby('country').size()
print(counts.head())
Out[5]: country
? 583
Cambodia 19
Vietnam 67
Yugoslavia 16
The first row shows the number of samples with unknown country, followed
by the number of samples corresponding to the first countries in the dataset.
Let us split people according to their gender into two groups: men and women.
In [6]:
ml = df[(df.sex == 'Male')]
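The complementary subsets used in the listings below (women, and the high-income subsets of each gender) are built the same way; a sketch, assuming the income label '>50K\n' that appears in the later listings:

fm = df[(df.sex == 'Female')]
ml1 = ml[(ml.income == '>50K\n')]   # high-income men
fm1 = fm[(fm.income == '>50K\n')]   # high-income women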
The data that come from performing a particular measurement on all the
subjects in a sample represent our observations for a single characteristic like
country, age, education, etc. These measurements and categories represent a
sample distribution of the variable, which in turn approximately represents the
population distribution of the variable. One of the main goals of exploratory
data analysis is to visualize and summarize the sample distribution, thereby
allowing us to make tentative assumptions about the population distribution.
Summarizing the Data
In [8]:
df1 = df[(df.income == '>50K\n')]
print('The rate of people with high income is:',
      int(len(df1) / float(len(df)) * 100), '%.')
print('The rate of men with high income is:',
      int(len(ml1) / float(len(ml)) * 100), '%.')
print('The rate of women with high income is:',
      int(len(fm1) / float(len(fm)) * 100), '%.')
Mean
One of the first measurements we use to have a look at the data is to obtain
sample statistics, such as the sample mean [1]. Given a sample of
n values, $x_i$, $i = 1, \dots, n$, the mean, $\mu$, is the sum of the values divided by the
number of values; in other words:

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i. \quad (3.1)$$
The terms mean and average are often used interchangeably. In fact, the
main distinction between them is that the mean of a sample is the summary
statistic computed by Eq. (3.1), while an average is not strictly defined and could
be one of many summary statistics that can be chosen to describe the central
tendency of a sample.
In our case, we can consider what the average age of men and women samples
in our dataset would be in terms of their mean:
In [9]:
print('The average age of men is:', ml['age'].mean())
print('The average age of women is:', fm['age'].mean())
print('The average age of high-income men is:', ml1['age'].mean())
print('The average age of high-income women is:', fm1['age'].mean())
Out[9]: The average age of men is: 39.4335474989
The average age of women is: 36.8582304336
The average age of high-income men is: 44.6257880516
The average age of high-income women is: 42.1255301103
This difference in the sample means can be considered initial evidence that
there are differences between men and women with high income!
Comment: Later, we will work with both concepts: the population mean and
the sample mean. We should not confuse them! The first is the mean of the whole
population; the second is the mean of samples taken from the population.
Sample Variance
The mean is not usually a sufficient descriptor of the data. We can go further by
knowing two numbers: mean and variance. The variance $\sigma^2$ describes the spread
of the data and it is defined as follows:

$$\sigma^2 = \frac{1}{n}\sum_i (x_i - \mu)^2. \quad (3.2)$$

The term $(x_i - \mu)$ is called the deviation from the mean, so the variance is the mean
squared deviation. The square root of the variance, $\sigma$, is called the standard
deviation. We consider the standard deviation, because the variance is hard to
interpret (e.g., if the units are grams, the variance is in grams squared).
Let us compute the mean, the variance, and the standard deviation of the age of
men and women in our dataset:
In [10]:
ml_mu = ml['age'].mean()
fm_mu = fm['age'].mean()
ml_var = ml['age'].var()
fm_var = fm['age'].var()
ml_std = ml['age'].std()
fm_std = fm['age'].std()
print('Statistics of age for men: mu:', ml_mu, 'var:', ml_var, 'std:', ml_std)
print('Statistics of age for women: mu:', fm_mu, 'var:', fm_var, 'std:', fm_std)
Out[10]: Statistics of age for men: mu: 39.4335474989 var: 178.773751745 std: 13.3706301925
Statistics of age for women: mu: 36.8582304336 var: 196.383706395 std: 14.0136970994
We can see that the mean age of women is somewhat lower than that of men,
but with higher variance and standard deviation.
Sample Median
The mean of the samples is a good descriptor, but it has an important drawback:
what will happen if in the sample set there is an error with a value very different
from the rest? For example, considering hours worked per week, it would
normally be in a range between 20 and 80; but what would happen if by mistake
there was a value of 1000? An item of data that is significantly different from the
rest of the data is called an outlier. In this case, the mean, $\mu$, will be drastically
changed towards the outlier. One solution to this drawback is offered by the
statistical median, $\mu_{1/2}$, which is an order statistic giving the middle value of a
sample. In this case, all the values are ordered by their magnitude and the
median is defined as the value that is in the middle of the ordered list. Hence, it is
a value that is much more robust in the face of outliers.
Let us see the median age of working men and women in our dataset, and the
median age of high-income men and women:
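A minimal sketch of that computation, using the subsets defined above:

print('Median age of men and women:',
      ml['age'].median(), fm['age'].median())
print('Median age of high-income men and women:',
      ml1['age'].median(), fm1['age'].median())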
Fig. 3.1 Histogram of the age of working men (left) and women (right)
Data Distributions
Summarizing data by just looking at their mean, median, and variance can be
dangerous: very different data can be described by the same statistics. The best thing to
do is to validate the data by inspecting them. We can have a look at the data
distribution, which describes how often each value appears (i.e., what is its
frequency).
The most common representation of a distribution is a histogram, which is a graph
that shows the frequency of each value. Let us show the age of working men and
women separately.
In [12]:
ml_age = ml['age']
ml_age.hist(normed=0, histtype='stepfilled', bins=20)

In [13]:
fm_age = fm['age']
fm_age.hist(normed=0, histtype='stepfilled', bins=10)
The output can be seen in Fig. 3.1. If we want to compare the histograms, we
can plot them overlapping in the same graphic as follows:
Fig. 3.2 Histogram of the age of working men (in ochre) and women (in violet) (left). Histogram of the
age of working men (in ochre), women (in blue), and their intersection (in violet) after sample
normalization (right)
In [14]:
import seaborn as sns
fm_age.hist(normed=0, histtype='stepfilled', alpha=.5, bins=20)
ml_age.hist(normed=0, histtype='stepfilled', alpha=.5,
            color=sns.desaturate("indianred", .75), bins=10)
The output can be seen in Fig. 3.2 (left). Note that we are visualizing the absolute
values of the number of people in our dataset according to their age (the abscissa
of the histogram). As a side effect, we can see that there are many more men in
these conditions than women.
We can normalize the frequencies of the histogram by dividing by n, the number
of samples. The normalized histogram is called the Probability Mass Function
(PMF).
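The book's exact listing for this step is not shown here; one way to obtain the normalized comparison is to rescale each histogram by its sample size using the weights argument, a sketch:

import numpy as np
# each bar now shows the fraction of that group, not the absolute count
ml_age.hist(histtype='stepfilled', alpha=.5, bins=20,
            weights=np.ones(len(ml_age)) / len(ml_age))
fm_age.hist(histtype='stepfilled', alpha=.5, bins=10,
            weights=np.ones(len(fm_age)) / len(fm_age))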
This outputs Fig. 3.2 (right), where we can observe a comparable range of
individuals (men and women).
The Cumulative Distribution Function (CDF), or just distribution function,
describes the probability that a real-valued random variable X with a given proba-
bility distribution will be found to have a value less than or equal to x . Let us show
the CDF of age distribution for both men and women.
In [16]:
ml_age.hist(normed=1, histtype='step', cumulative=True,
            linewidth=3.5, bins=20)
fm_age.hist(normed=1, histtype='step', cumulative=True,
            linewidth=3.5, bins=20,
            color=sns.desaturate("indianred", .75))
The output can be seen in Fig. 3.3, which illustrates the CDF of the age distributions
for both men and women.
Outlier Treatment
As mentioned before, outliers are data samples with a value that is far from the
central tendency. Different rules can be defined to detect outliers; for example, we
can flag samples that are far from the median, or samples whose values exceed the
mean by two or three standard deviations.
For example, in our case, we are interested in the age statistics of men versus
women with high incomes, and we can see that in our dataset the minimum age is
17 years and the maximum is 90 years. We can consider that some of these samples
are due to errors or are not representative. Applying domain knowledge, we
focus on ages from the median (37, in our case) up to 72 and down to 22 years old,
and we consider the rest as outliers.
In [17]:
# drop high-income samples whose age lies outside [median - 15, median + 35]
df2 = df.drop(df.index[
    (df.income == '>50K\n') &
    ((df['age'] > df['age'].median() + 35) |
     (df['age'] < df['age'].median() - 15))])
ml1_age = ml1['age']
fm1_age = fm1['age']
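The series ml2_age and fm2_age used below are the same age series with the outliers dropped; a sketch under the same age bounds:

ml2_age = ml1_age.drop(ml1_age.index[
    (ml1_age > df['age'].median() + 35) |
    (ml1_age < df['age'].median() - 15)])
fm2_age = fm1_age.drop(fm1_age.index[
    (fm1_age > df['age'].median() + 35) |
    (fm1_age < df['age'].median() - 15)])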
We can check how the mean and the median changed once the data were cleaned:
In [18]:
mu2ml = ml2_age.mean()
std2ml = ml2_age.std()
md2ml = ml2_age.median()
mu2fm = fm2_age.mean()
std2fm = fm2_age.std()
md2fm = fm2_age.median()
Fig. 3.4 The cleaned data (in red) and the removed outliers (in blue)
Figure 3.4 shows the outliers in blue and the rest of the data in red. Visually,
we can confirm that we removed mainly outliers from the dataset.
Next we can see that by removing the outliers, the difference between the
populations (men and women) actually decreased. In our case, there were more
outliers in men than women. If the difference in the mean values before removing
the outliers is 2.5, after removing them it slightly decreased to 2.44:
In [20]:
print('The mean difference with outliers is: %4.2f.'
      % (ml_age.mean() - fm_age.mean()))
print('The mean difference without outliers is: %4.2f.'
      % (ml2_age.mean() - fm2_age.mean()))
The results are shown in Fig. 3.5. One can see that the differences between
male and female values are slightly negative before age 42 and positive after it.
Hence, women tend to be promoted (receive more than 50 K) earlier than men.
Fig. 3.5 Differences in high-income earner men versus women as a function of age
For univariate data, the skewness is a statistic that measures the asymmetry of
the set of n data samples, $x_i$:

$$g_1 = \frac{1}{n}\,\frac{\sum_i (x_i - \mu)^3}{\sigma^3}, \quad (3.3)$$

where $\mu$ is the mean, $\sigma$ is the standard deviation, and n is the number of data points.
Negative skewness indicates that the distribution "skews left" (it extends
further to the left than to the right). One can easily see that the skewness of a
normal distribution is zero, and any symmetric data must have a skewness of
zero. Note that skewness can be affected by outliers! A simpler alternative is to
look at the relationship between the mean $\mu$ and the median $\mu_{1/2}$.
In [22]:
def skewness(x):
    res = 0
    m = x.mean()
    s = x.std()
    for i in x:
        res += (i - m) * (i - m) * (i - m)
    res /= (len(x) * s * s * s)
    return res
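The Out[23] values below come from the book's own listing, which is not shown in this excerpt. As a reference, the standard Pearson median skewness coefficient is 3(μ − μ_{1/2})/σ, which could be coded as in the sketch below (function name assumed; the printed values may follow a different convention):

def pearson(x):
    # Pearson's median skewness: 3 * (mean - median) / std
    return 3 * (x.mean() - x.median()) / x.std()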
Out[23]: Pearson's coefficient of the male population = 9.55830402221
Pearson's coefficient of the female population = 26.4067269073
Continuous Distribution
Fig. 3.6 Exponential CDF (left) and PDF (right) with λ = 3.00
The CDF of a continuous random variable X is defined as $F_X(x)$, which satisfies
$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt$ for all $x$, where $f_X$ is the probability density
function (PDF). There are many continuous distributions; here, we will consider the
most common ones: the exponential and the normal distributions.
The normal CDF has no closed-form expression and its most common
representation is the PDF:

$$PDF(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$

The parameter $\sigma$ defines the shape of the distribution. An example of the PDF
of a normal distribution with $\mu = 6$ and $\sigma = 2$ is given in Fig. 3.7.
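A sketch of how such a figure can be produced with scipy.stats.norm (this is not the book's listing; the grid range is chosen for illustration):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.linspace(0, 12, 200)
plt.plot(x, norm.pdf(x, loc=6, scale=2))  # mu = 6, sigma = 2
plt.show()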
Kernel Density
Fig. 3.8 Summed kernel functions around a random set of points (left) and the kernel density
estimate with the optimal bandwidth (right) for our dataset. Random data shown in blue, kernel
shown in black and summed function shown in red
The idea of kernel density estimation is to place a kernel function on each data
point and sum them, obtaining a continuous function that, when normalized, would
approximate the density of the distribution:
In [24]:
x1 = np.random.normal(-1, 0.5, 15)
x2 = np.random.normal(6, 1, 10)
y = np.r_[x1, x2]  # r_ translates slice objects to concatenation along the first axis
x = np.linspace(min(y), max(y), 100)
Figure 3.8 (left) shows the result of the construction of the continuous
function from the kernel summarization.
In fact, the SciPy library implements a Gaussian kernel density estimation that
automatically chooses the appropriate bandwidth parameter for the kernel. Thus,
the final construction of the density estimate will be obtained by:
In [25]:
from scipy.stats import kde
density = kde.gaussian_kde(y)
xgrid = np.linspace(x.min(), x.max(), 200)
plt.hist(y, bins=28, normed=True)
plt.plot(xgrid, density(xgrid), 'r-')
Figure 3.8 (right) shows the result of the kernel density estimate for our example.
Estimation
An important aspect when working with statistical data is being able to use
estimates to approximate the values of unknown parameters of the dataset. In this
section, we will review different kinds of estimators (estimated mean, variance,
standard score, etc.).
In what follows, we will deal with point estimators, that is, single numerical
estimates of parameters of a population.
Mean
Let us assume that we know that our data are coming from a normal distribution
and the random samples drawn are as follows:

{0.33, −1.76, 2.34, 0.56, 0.89}.

The question is: can we guess the mean $\mu$ of the distribution? One approximation
is given by the sample mean, $\bar{x}$. This process is called estimation and the statistic (e.g.,
the sample mean) is called an estimator. In our case, the sample mean is 0.472, and
it seems a logical choice to represent the mean of the distribution. It is not so
evident if we add a sample with a value of −465. In this case, the sample mean will be
−77.11, which does not look like the mean of the distribution. The reason is that
the last value seems to be an outlier compared to the rest of the
sample. In order to avoid this effect, we can try first to remove outliers and then to
estimate the mean; or we can use the sample median as an estimator of the
mean of the distribution. If there are no outliers, the sample mean $\bar{x}$ minimizes
the following mean squared error:

$$MSE = \frac{1}{n}\sum (\bar{x} - \mu)^2,$$

where n is the number of times we estimate the mean.
Let us compute the MSE of a set of random data:
In [26]:
NTs = 200
mu = 0.0
var = 1.0
err = 0.0
NPs = 1000
for i in range(NTs):
    x = np.random.normal(mu, var, NPs)
    err += (x.mean() - mu) ** 2
print('MSE:', err / NTs)
Variance
If we ask ourselves what is the variance, $\sigma^2$, of the distribution of X, analogously
we can use the sample variance as an estimator. Let us denote by $\bar{\sigma}^2$ the sample
variance estimator:

$$\bar{\sigma}^2 = \frac{1}{n}\sum_i (x_i - \bar{x})^2.$$

For large samples, this estimator works well, but for a small number of
samples it is biased. In those cases, a better estimator is given by:

$$\bar{\sigma}^2 = \frac{1}{n-1}\sum_i (x_i - \bar{x})^2.$$
Standard Score
In many real problems, when we want to compare data, or estimate their
correlations or some other kind of relations, we must avoid data that come in
different units. For example, weight can come in kilograms or grams. Even data
that come in the same units can still belong to different distributions. We need to
normalize them to standard scores. Given a dataset as a series of values, $x_i$, we
convert the data to standard scores by subtracting the mean and dividing them by
the standard deviation:

$$z_i = \frac{x_i - \mu}{\sigma}.$$
Note that this measure is dimensionless and its distribution has a mean of 0
and variance of 1. It inherits the “shape” of the dataset: if X is normally
distributed, so is Z ; if X is skewed, so is Z .
Variables of data can express relations. For example, countries that tend to invest
in research also tend to invest more in education and health. This kind of
relationship is captured by the covariance.
Fig. 3.9 Positive correlation between economic growth and stock market returns worldwide (left).
Negative correlation between the world oil production and gasoline prices worldwide (right)
Covariance
When two variables share the same tendency, we speak about covariance. Let us
consider two series, $\{x_i\}$ and $\{y_i\}$. Let us center the data with respect to their
mean: $dx_i = x_i - \mu_X$ and $dy_i = y_i - \mu_Y$. It is easy to show that when $\{x_i\}$
and $\{y_i\}$ vary together, their deviations tend to have the same sign. The covariance
is defined as the mean of the following products:

$$Cov(X, Y) = \frac{1}{n}\sum_{i=1}^{n} dx_i\, dy_i,$$

where n is the length of both sets. Still, the covariance itself is hard to interpret.
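A direct translation of this definition into code (helper name assumed):

import numpy as np

def covariance(x, y):
    # mean of the products of the centered series
    return np.mean((x - np.mean(x)) * (y - np.mean(y)))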
For this reason, it is common to normalize the covariance by the standard
deviations of both sets, which yields Pearson's correlation coefficient,
$\rho = Cov(X, Y)/(\sigma_X \sigma_Y)$, a dimensionless value in [−1, 1]. Still,
having $\rho = 0$ does not necessarily mean that the variables are not correlated!
Pearson's correlation captures correlations of first order, but not nonlinear
correlations. Moreover, it does not work well in the presence of outliers.
In such cases, the Pearson correlation may fail to capture the true dependence
between the sets. However, the Spearman's rank coefficient, capturing the
correlation between the ranks, gives a final value of 0.80, confirming the
correlation between the sets. As an exercise, you can compute the Pearson's and
the Spearman's rank correlations for the different Anscombe configurations given in
Fig. 3.10. Observe whether linear and nonlinear correlations can be captured by the
Pearson's and the Spearman's rank correlations.
Statistical Inference
Introduction
There is more than one way to address the problem of statistical inference. In fact,
there are two main approaches to statistical inference: the frequentist and the
Bayesian approaches. Their differences are subtle but fundamental:
• In the case of the frequentist approach, the main assumption is that there is a
population, which can be represented by several parameters, from which we
can obtain numerous random samples. Population parameters are fixed but
they are not accessible to the observer. The only way to derive information
about these parameters is to take a sample of the population, to compute the
parameters of the sample, and to use statistical inference techniques to make
probable propositions regarding population parameters.
• The Bayesian approach is based on a consideration that data are fixed, not the
result of a repeatable sampling process, but parameters describing data can be
described probabilistically. To this end, Bayesian inference methods focus on
producing parameter distributions that represent all the knowledge we can
extract from the sample and from prior information about the problem.
Sampling Distribution of Point Estimates
Let us suppose that we are interested in describing the daily number of traffic
accidents in the streets of Barcelona in 2013. If we have access to the
population, the computation of this parameter is a simple operation: the total
number of accidents divided by 365.
In [1]:
data = pd.read_csv("files/ch04/ACCIDENTS_GU_BCN_2013.csv")
data['Date'] = data[u'Dia de mes'].apply(lambda x: str(x)) + '-' + \
               data[u'Mes de any'].apply(lambda x: str(x))
data['Date'] = pd.to_datetime(data['Date'])
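The daily accident counts used in the rest of this section (the accidents series and the mean reported in Out[1]) would then be obtained along these lines (a sketch; the exact grouping is assumed):

accidents = data.groupby(['Date']).size()  # number of accidents per day
print('Mean:', accidents.mean())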
Out[1]: Mean: 25.9095

But now, for illustrative purposes, let us suppose that we only have access to a
limited part of the data (the sample): the number of accidents during some days
of 2013. Can we still give an approximation of the population mean?
The most intuitive way to go about providing such a mean is simply to take the
sample mean. The sample mean is a point estimate of the population mean. If we
can only choose one value to estimate the population mean, then this is our best
guess.
The problem we face is that estimates generally vary from one sample to
another, and this sampling variation suggests our estimate may be close, but it
will not be exactly equal to our parameter of interest. How can we measure this
variability?
In our example, because we have access to the population, we can empirically
build the sampling distribution of the sample mean for a given number of
observations. Then, we can use the sampling distribution to compute a measure of
the variability. In Fig. 4.1, we can see the empirical sample distribution of the mean
for s = 10,000 samples of n = 200 observations from our dataset. This empirical
distribution has been built in the following way:

1. Draw s (a large number of) independent samples $\{x^1, \dots, x^s\}$ from the
population, where each element $x^j$ is composed of $\{x_i^j\}_{i=1,\dots,n}$.
2. Evaluate the sample mean $\hat{\mu}_j = \frac{1}{n}\sum_{i=1}^{n} x_i^j$ of each sample.
3. Estimate the sampling distribution of $\hat{\mu}$ by the empirical distribution of
the sample replications.

Fig. 4.1 Empirical distribution of the sample mean. In red, the mean value of this distribution
In [2]:
# population
df = accidents.to_frame()
N_test = 10000
elements = 200
# mean array of samples
means = [0] * N_test
# sample generation
for i in range(N_test):
    rows = np.random.choice(df.index.values, elements)
    sampled_df = df.loc[rows]
    means[i] = sampled_df.mean()
In general, given a point estimate from a sample of size n, we define its sampling
distribution as the distribution of the point estimate based on samples of size n
from its population. This definition is valid for point estimates of other
population parameters, such as the population median or population standard
deviation, but we will focus on the analysis of the sample mean.
The sampling distribution of an estimate plays an important role in
understanding the real meaning of propositions concerning point estimates. It is
very useful to think of a particular point estimate as being drawn from such a
distribution.
The Traditional Approach
In real problems, we do not have access to the real population and so estimation
of the sampling distribution of the estimate from the empirical distribution of the
sample replications is not an option. But this problem can be solved by making
use of some theoretical results from traditional statistics.
It can be mathematically shown that given n independent observations
$\{x_i\}_{i=1,\dots,n}$ of a population with a standard deviation $\sigma_x$, the standard
deviation of the sample mean $\sigma_{\bar{x}}$, or standard error, can be approximated
by this formula:

$$SE = \frac{\sigma_x}{\sqrt{n}}.$$
The demonstration of this result is based on the Central Limit Theorem: an old
theorem with a history that starts in 1810 when Laplace released his first paper
on it. This formula uses the standard deviation of the population $\sigma_x$, which is
not known, but it can be shown that if it is substituted by its empirical estimate
$\hat{\sigma}_x$, the estimation is sufficiently good if n > 30 and the population
distribution is not skewed. This allows us to estimate the standard error of the
sample mean even if we do not have access to the population.
So, how can we give a measure of the variability of the sample mean? The
answer is simple: by giving the empirical standard error of the mean distribution.
In [3]:
rows = np.random.choice(df.index.values, 200)
sampled_df = df.loc[rows]
est_sigma_mean = sampled_df.std() / math.sqrt(200)
Out[3]: Direct estimation of SE from one sample of 200 elements: 0.6536
Estimation of the SE by simulating 10000 samples of 200 elements: 0.6362
Unlike the case of the sample mean, there is no simple formula for the standard
error of other interesting sample estimates, such as the median.
The Computationally Intensive Approach
Let us consider from now on that our full dataset is a sample from a hypothetical
population (this is the most common situation when analyzing real data!).
A modern alternative to the traditional approach to statistical inference is the
bootstrapping method [2]. In the bootstrap, we draw n observations with
replacement from the original data to create a bootstrap sample or resample. Then,
we can calculate the mean for this resample. By repeating this process a large
number of times, we can build a good approximation of the mean sampling
distribution (see Fig. 4.2).
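Following the description above, a minimal sketch of the resampling loop (function name assumed):

import numpy as np

def bootstrap_means(sample, n_resamples=10000):
    # each resample draws len(sample) observations with replacement
    return [np.random.choice(sample, len(sample), replace=True).mean()
            for _ in range(n_resamples)]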
Fig. 4.2 Mean sampling distribution by bootstrapping. In red, the mean value of this distribution
Confidence Intervals
A point estimate Θ, such as the sample mean, provides a single plausible value
for a parameter. However, as we have seen, a point estimate is rarely perfect;
usually there is some error in the estimate. That is why we have suggested using the
standard error as a measure of its variability.
Instead of that, a next logical step would be to provide a plausible range of
values for the parameter. A plausible range of values for the sample parameter is
called a confidence interval.
We will base the definition of confidence interval on two ideas:
1. Our point estimate is the most plausible value of the parameter, so it makes
sense to build the confidence interval around the point estimate.
2. The plausibility of a range of values can be defined from the sampling
distribution of the estimate.
For the case of the mean, the Central Limit Theorem states that its sampling
distribution is normal:
Theorem 4.1 Given a population with a finite mean μ and a finite non-zero
variance σ², the sampling distribution of the mean approaches a normal
distribution with a mean of μ and a variance of σ²/n as n, the sample size,
increases.
In this case, and in order to define an interval, we can make use of a well-
known result from probability that applies to normal distributions: roughly 95% of
the time our estimate will be within 1.96 standard errors of the true mean of the
distribution. If the interval spreads out 1.96 standard errors from a normally
distributed point estimate, intuitively we can say that we are roughly 95%
confident that we have captured the true parameter.
CI = [Θ − 1.96 × SE , Θ + 1.96 × SE ]
This is how we would compute a 95% confidence interval of the sample mean
using bootstrapping (a code sketch follows the steps below):
1. Repeat the following steps a large number, s, of times: draw n observations with
replacement from the original data to create a resample, and calculate the
statistic of interest (e.g., the mean) of the resample.
2. Calculate the mean of your s values of the sample statistic. This process
gives you a "bootstrapped" estimate of the sample statistic.
3. Calculate the standard deviation of your s values of the sample statistic.
This process gives you a "bootstrapped" estimate of the SE of the sample
statistic.
4. Obtain the 2.5th and 97.5th percentiles of your s values of the sample statistic.
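Putting these steps together with the bootstrap_means sketch from the previous section (variable names assumed):

boot = bootstrap_means(np.array(accidents), 10000)  # step 1
boot_estimate = np.mean(boot)                       # step 2
se_boot = np.std(boot)                              # step 3
ci_boot = np.percentile(boot, [2.5, 97.5])          # step 4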
In 95% of the cases, when I compute the 95% confidence interval from a sample,
the true mean of the population will fall within the interval defined by these
bounds: ±1.96 × SE.
We cannot say either that our specific sample contains the true parameter or
that the interval has a 95% chance of containing the true parameter. That
interpretation would not be correct under the assumptions of traditional
statistics.
Hypothesis Testing
To give a formal answer to a question such as whether the observed difference
between the 2010 and 2013 accident rates is real, we can use hypothesis testing,
which starts from two competing hypotheses:
• H0: The mean number of daily traffic accidents is the same in 2010 and 2013
(there is only one population, one true mean, and 2010 and 2013 are just
different samples from the same population).
• HA: The mean number of daily traffic accidents in 2010 and 2013 is different
(2010 and 2013 are two samples from two different populations).
Fig. 4.3 This graph shows 100 sample means (green points) and their corresponding confidence
intervals, computed from 100 different samples of 100 elements from our dataset. It can be
observed that a few of them (those in red) do not contain the mean of the population (black
horizontal line)
We call H0 the null hypothesis and it represents a skeptical point of view: the
effect we have observed is due to chance (due to the specific sample bias). HA is
the alternative hypothesis and it represents the other point of view: the effect is
real.
The general rule of frequentist hypothesis testing is that we will not discard H0
(and hence we will not consider HA) unless the observed effect is implausible
under H0.
This estimate suggests that in 2013 the mean rate of traffic accidents in
Barcelona was higher than it was in 2010. But is this effect statistically significant?
Based on our sample, the 95% confidence interval for the mean rate of traffic
accidents in Barcelona during 2013 can be calculated as follows:
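A sketch applying the CI formula above to the 2013 daily counts (variable names assumed, following the earlier listings):

import math
# 95% CI of the mean daily accident rate, via the normal approximation
m = len(accidents)
se = accidents.std() / math.sqrt(m)
ci = [accidents.mean() - 1.96 * se, accidents.mean() + 1.96 * se]
print('95% CI:', ci)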