Statistics 101
Sample space:
Is the set of all possible values that a variable can take, represented by the letter S.
Measurement scales:
- nominal scales
o non-ordered scales that only state properties of the values, ex. nationality
- ordinal scales
o ordered scales that establish a hierarchy among values; the order is transitive: if A>B and B>C, then A>C
- interval scales
o have an arbitrary zero; intervals between individuals can be quantified numerically, ex. temperature
- ratio scales
o allow proportions to be established; they have a fixed zero that signifies absence of the quantity, ex. tree height
Frequency table of a variable X: one row per value xᵢ, with the absolute frequency fᵢ, the relative frequency fᵢ/n, the cumulative relative frequency ∑ⱼ≤ᵢ fⱼ/n and the cumulative absolute frequency ∑ⱼ≤ᵢ fⱼ.
In the case of a continuous variable, it's necessary to group the possible values into intervals or classes of values: the classes are intervals whose limits must be defined. To do so, a reasonable (subjective) choice of the number of non-overlapping intervals must be made.
Bar charts:
Used to visualize the distribution of discrete variables; it's the graphical representation of a frequency table. It consists of vertical bars whose heights represent fi or fi/n.
Special case, bar chart of a qualitative variable:
The classes are called categories, and the bar chart is called an organ pipe chart; the height of the bars represents fi or fi/n.
Histograms:
Allow one to visualize the distribution of continuous variables; graphical representation of a frequency table.
They consist of vertical bars, one for each class or range of values. The area of each bar is the relative frequency of the corresponding class.
- Base of the bar: width of the class
- Height of the bar: relative frequency / width of the class = density
- Area of the bar: width of the class × density = relative frequency
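A minimal Python sketch of this area rule, with made-up class limits and counts (not course data):

```python
# Sketch: computing histogram densities so that bar areas equal relative frequencies.
# The class limits and counts below are illustrative, not from the course.
limits = [0, 5, 10, 20, 40]          # class boundaries (unequal widths)
counts = [4, 10, 4, 2]               # absolute frequencies f_i
n = sum(counts)

for lo, hi, f in zip(limits[:-1], limits[1:], counts):
    width = hi - lo
    rel_freq = f / n                 # relative frequency f_i / n
    density = rel_freq / width       # height of the bar
    area = density * width           # equals the relative frequency again
    print(f"[{lo},{hi}): density={density:.4f}, area={area:.4f}")
```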
Dot plots:
Suitable for continuous variables with a small sample size, each dot on the line is a value.
Pie charts:
Suitable for qualitative variables, we represent fi/n of each category on a disk. The relative
frequencies are proportional to the angle of the disk’s slice.
[Example pie chart: relative frequencies of the four quarters (1st to 4th quarter)]
Quantiles:
The pth quantile Q(p) of a dataset is a value such that a proportion p of the data have values below
this value.
Special cases:
- p = 0.25 : lower quartile Q1
- p = 0.5 : median M
- p = 0.75 : upper quartile Q3
note: quantiles can be located between the values of two data points
Linear interpolation from raw data, discrete variables:
Method:
- arrange the data in ascending order
- find the rank: p(n-1)+1
- use the rank to interpolate the value of the quantile.
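A minimal Python sketch of this method; the rank formula p(n−1)+1 matches numpy's default "linear" quantile definition, and the data below are illustrative:

```python
# Sketch of the linear-interpolation quantile method described above.
def quantile(data, p):
    xs = sorted(data)                # 1. arrange the data in ascending order
    n = len(xs)
    rank = p * (n - 1) + 1           # 2. rank on the 1..n scale
    k = int(rank)                    # integer part: lower neighbour
    g = rank - k                     # fractional part
    if k >= n:                       # p = 1 edge case
        return xs[-1]
    return xs[k - 1] + g * (xs[k] - xs[k - 1])  # 3. interpolate

data = [3, 7, 8, 5, 12, 14, 21, 13, 18]
print(quantile(data, 0.25), quantile(data, 0.5), quantile(data, 0.75))  # 7 12 14
```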
Variance: the average of the squared deviations from the mean; for a sample, the sum of squared deviations is divided by n−1 (sample variance s²).
Standard deviation: s = √s², a measure of spread in the same unit as the data.
Boxplots:
Graphically summarize information about the center and spread of the data
1. Box:
a. Lower base: lower quartile Q1
b. Upper base: upper quartile Q3
2. Lines:
a. In the box: the median M
b. Whiskers: from the box to the adjusted lower and upper bounds
3. Points: outliers, observations beyond the bounds
Constructing a boxplot:
1. Median M
2. Lower quartile Q1
3. Upper quartile Q3
4. Lower bound LB=Q1-1.5(Q3-Q1)
a. Adjusted to the nearest observed value towards the median.
5. Upper bound UB=Q3+1.5(Q3-Q1)
a. Adjusted to the nearest observed value towards the median.
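A short Python sketch of these five steps with illustrative data; note that statistics.quantiles uses a slightly different quartile convention than the interpolation rule above, so quartile values may differ marginally:

```python
# Sketch of the boxplot construction steps above (illustrative data).
import statistics

data = sorted([2, 4, 5, 5, 6, 7, 8, 9, 10, 25])
q1, m, q3 = statistics.quantiles(data, n=4)   # quartiles (library convention)
iqr = q3 - q1
lb, ub = q1 - 1.5 * iqr, q3 + 1.5 * iqr       # theoretical fences

# Adjust the fences to the nearest observed values towards the median
whisker_low = min(x for x in data if x >= lb)
whisker_high = max(x for x in data if x <= ub)
outliers = [x for x in data if x < lb or x > ub]
print(q1, m, q3, whisker_low, whisker_high, outliers)  # ... [25]
```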
BASIC PROBABILITY CALCULATIONS – MODULE 3
Formalization:
The probability of an event in the context of an experiment is a measure of likelihood of that event.
Ex: the probability of getting heads (event) when a coin is tossed (experiment)
Definitions:
- Experiment: an action that results in one outcome from a set of possible outcomes, ex rolling
a die.
- Sample space S: the set of all possible outcomes (results) of an experiment
o N: number of possible outcomes
o e1, e2, …, eN: list of all possible outcomes
o S = { e1, e2, …, eN }
- Event: a set of one or more outcomes
Probability:
- 0≤P(A)≤1
- Impossible event: P(A)=0
- Certain event: P(A)=1
Counting probabilities:
- Equally likely cases: all possible outcomes of an experiment are equally likely, i.e. have the same probability of occurring. If an event A contains n favorable cases out of N possible cases:
o P(A) = n/N = number of favorable cases / number of possible cases
Example: how many different orders of finish are there in a race with 10 participants?
10!
To count the number of ordered lists of n positions, each filled with one of m possible values (repetition allowed): mⁿ
Example: how many possibilities are there to create a 7-digit phone number?
10⁷
Arrangements:
Aₙʳ = n! / (n−r)!
The order in which elements are selected matters: an ordered list without repetition of r elements out of n.
Example: how many outcomes are there for the first 3 horses (in order) in a race with 10 horses? A₁₀³ = 10!/7! = 720
Combinations:
Cₙʳ = n! / (r!(n−r)!)
The order in which elements are selected doesn't matter: an unordered list without repetition (a set) of r elements out of n.
Example: how many possible results are there in a lottery with 6 numbers drawn out of 49? C₄₉⁶ = 49!/(6!·43!) = 13,983,816
It's a combination because order doesn't count: for example, the draw 12, 1, 4, 5, 46, 31 is the same result as 5, 1, 4, 12, 31, 46.
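All four counting results above can be checked with Python's standard library:

```python
# Sketch: the counting rules above, using the standard library (Python >= 3.8).
import math

print(math.factorial(10))      # orders of finish among 10 runners: 10!
print(10 ** 7)                 # 7-digit phone numbers: m**n with m=10, n=7
print(math.perm(10, 3))        # arrangements A(10, 3) = 10!/7! = 720
print(math.comb(49, 6))        # lottery combinations C(49, 6) = 13_983_816
```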
- De Morgan’s laws:
o (A ∪ B)ᶜ = Aᶜ ∩ Bᶜ
o (A ∩ B)ᶜ = Aᶜ ∪ Bᶜ
- Distributive law:
o A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
Conditional probabilities:
The conditional probability of an event A given that B has occurred: P(A|B) = P(A ∩ B) / P(B)
Special case, the law of total probability:
P(B) = P(B|A) P(A) + P(B|Aᶜ) P(Aᶜ)
Bayes' theorem:
P(A|B) = P(B|A) P(A) / P(B)
Independence:
P ( A|B )=P( A)
Or
P ( A ∩B )=P ( A ) P (B)
Remark: two disjoint events are not independent unless P(A)=0 or P(B)=0
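A numerical sketch of the total probability formula and Bayes' theorem, with made-up probabilities:

```python
# Sketch: total probability and Bayes' theorem with illustrative numbers.
p_a = 0.3                                  # P(A)
p_b_given_a = 0.8                          # P(B|A)
p_b_given_not_a = 0.1                      # P(B|A^c)

# Law of total probability: P(B) = P(B|A)P(A) + P(B|A^c)P(A^c)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A|B) = P(B|A)P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(p_b, p_a_given_b)                    # 0.31  0.774...
```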
Contingency tables:
Cross-tabulate the joint frequencies of two qualitative variables; the row and column totals give the marginal frequencies, from which joint, marginal and conditional probabilities can be computed.
Tree diagrams:
is a graphical representation of the sample space when the outcomes themselves consist of a
sequence of outcomes from sub-experiments.
Independent sub-experiments: when the sub-experiments are independent, the probability of an outcome of the overall experiment is the product of the probabilities along the corresponding path of the tree.
We distinguish the random variable RV (representing the phenomenon) from the realization that is
observed after the experiment. The random variable is represented by an upper case letter and the
realization by a lower case letter.
Example: Let X be the annual demand for electronic components of a company. We might be
interested in P(X ≤ x) for any possible realization x of X.
We consider numerical RVs: random variables that can only take numbers as possible outcomes:
- Discrete RV:
o The set of possible outcomes S is countable, its elements can be counted even if they
are not necessarily finite in number.
- Continuous RV:
o The set of possible outcomes S is uncountable, its elements cannot be counted.
The probability function of the RV R is a function that associates a probability pᵣ = P(R = r) with each possible realization r.
Properties:
- pᵣ ≥ 0 for every r
- ∑ᵣ pᵣ = 1
Expectation:
Is denoted E(R).
The expectation of a RV is its average calculated using the probabilities from its probability function: E(R) = ∑ᵣ r · pᵣ
It's a statistic indicating the center of the distribution (central tendency) of a RV; it summarizes where the realizations of the random variable are, on average.
Properties of Expectation:
- E(aR + b) = a E(R) + b for constants a, b
- E(R + S) = E(R) + E(S)
Variance: Var(R) = E[(R − E(R))²] measures dispersion. The standard deviation sd(R) = √Var(R) is also a measure of dispersion that, unlike the variance, is in the same unit of measurement as the variable itself.
Properties:
- Var(aR + b) = a² Var(R), hence sd(aR + b) = |a| sd(R)
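A short Python sketch computing E(R), Var(R) and sd(R) from a probability function, using a fair die as example:

```python
# Sketch: expectation, variance and standard deviation of a discrete RV
# from its probability function (a fair die as illustration).
import math

pmf = {r: 1/6 for r in range(1, 7)}        # realizations r and probabilities p_r

mean = sum(r * p for r, p in pmf.items())              # E(R) = sum of r * p_r
var = sum((r - mean) ** 2 * p for r, p in pmf.items()) # Var(R) = E[(R - E(R))^2]
sd = math.sqrt(var)                                    # same unit as R
print(mean, var, sd)                                   # 3.5  2.916...  1.707...
```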
DISCRETE PROBABILITY DISTRIBUTIONS – MODULE 4
Discrete distributions:
We use distribution models that correspond to situations frequently encountered and whose
properties are well known.
Discrete Uniform Distribution:
Assigns equal probability to all possible outcomes in a finite set of possible results, for example rolling a die.
Bernoulli Distribution:
Is the probability distribution of an experiment with two possible outcomes, such as tossing a coin.
The Bernoulli Distribution is defined on the sample space S = {0, 1}. We decide which outcomes 0 and 1 refer to; in general 1 is a success and 0 a failure.
Binomial Distribution:
Is the distribution of the number of successes in a sequence of n independent Bernoulli trials, for
example the number of heads when a coin is tossed n times.
Characteristics:
- P(X = k) = Cₙᵏ πᵏ (1−π)ⁿ⁻ᵏ for k = 0, 1, …, n, where π is the success probability
- E(X) = nπ, Var(X) = nπ(1−π)
Pascal's Triangle:
Mathematical and graphical construction for calculating Cₙʳ.
When n is large, binomial probabilities can be read from statistical tables (or computed numerically).
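When tables are not at hand, the probabilities can be computed directly; a minimal sketch:

```python
# Sketch: computing binomial probabilities directly instead of reading tables.
import math

def binom_pmf(k, n, p):
    # P(X = k) = C(n, k) * p**k * (1 - p)**(n - k)
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 20, 0.5                        # e.g. 20 coin tosses
print(binom_pmf(10, n, p))            # P(exactly 10 heads) ~ 0.176
print(sum(binom_pmf(k, n, p) for k in range(0, 11)))  # P(X <= 10)
```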
Geometric Distribution:
Describes the probabilities of the number of i.i.d. (independent and identically distributed) Bernoulli trials needed until the first success occurs, that success included. For example, the number of attempts until we roll the first 6.
Hypergeometric distribution:
Let X be the random variable counting the number of red balls obtained in n draws without replacement from an urn of N balls containing r red balls.
Probability mass function:
P(X = k) = [C(r, k) · C(N−r, n−k)] / C(N, n)
If the urn contains r successes and we make n draws, then we cannot obtain more than r or more than n successes, thus the maximum number of possible successes is
min{n, r}
Similarly, if n is sufficiently large we will have drawn all N−r failures, and we are therefore guaranteed at least one success whenever n > N−r. The minimum number of possible successes is then
max{0, n−(N−r)}
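A Python sketch of the hypergeometric probability mass function with the support bounds just derived (illustrative urn values):

```python
# Sketch: hypergeometric probabilities with the support bounds derived above.
import math

def hypergeom_pmf(k, N, r, n):
    # P(X = k) = C(r, k) * C(N - r, n - k) / C(N, n)
    return math.comb(r, k) * math.comb(N - r, n - k) / math.comb(N, n)

N, r, n = 50, 10, 5                   # urn of 50 balls, 10 red, 5 draws
k_min = max(0, n - (N - r))           # minimum possible number of successes
k_max = min(n, r)                     # maximum possible number of successes
probs = {k: hypergeom_pmf(k, N, r, n) for k in range(k_min, k_max + 1)}
print(sum(probs.values()))            # sanity check: the support sums to 1
```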
Poisson Distribution:
Models the probabilities of the number of events occurring randomly in a given unit of continuous measurement, such as time or distance. It could be used, for example, to model the number of machine breakdowns per unit time.
Characteristics:
- P(R = k) = e⁻ᵏ... more precisely P(R = k) = λᵏ e⁻λ / k! for k = 0, 1, 2, …
- E(R) = Var(R) = λ
Let R be the number of events in an interval I, distributed as Pois(λ); then the number of events in an interval aI is distributed as Pois(aλ).
The main problem when moving from discrete to continuous lies in understanding the concept of
continuous distribution and using the probability density function instead of the probability mass
function.
A continuous RV can take on values in a continuum, typically within an interval (a, b), for example a waiting time or a physical measurement.
Mathematically it’s impossible to assign a probability to each possible outcome: there are too
many possibilities and we wouldn’t obtain a sum of probabilities equal to 1.
So the solution is to define the probability of an interval of values, P(x₁ < X < x₂), which is calculable, together with a probability density f(x).
From relative frequency to probability density:
Let X be a continuous RV with probability density function f. Then for any possible realization x and a small interval width δ, we have approximately:
P(x < X < x + δ) ≈ f(x) · δ
The probability that X falls in a small interval of width δ around x is approximately f(x) multiplied by δ; in practice we therefore define the distribution of X by specifying its probability density function.
Probability of an interval:
P(x₁ < X < x₂) = area under f between x₁ and x₂ (the integral of f from x₁ to x₂)
Properties:
- f(x) ≥ 0 for all x
- the total area under f equals 1
Uniform distribution:
Constant density over an interval (a, b): f(x) = 1/(b−a) for a < x < b.
Exponential Distribution:
Models waiting times between random events: f(x) = λe^(−λx) for x ≥ 0, with mean E(X) = 1/λ.
Memoryless Property:
P(T > t + s | T > s) = P(T > t): having already waited s does not change the distribution of the remaining waiting time.
- The number of customers R arriving at the counter during a given time period t follows a discrete Poisson Distribution with intensity (parameter) λ.
- The waiting time T between the arrivals of two successive customers follows an exponential distribution with mean 1/λ (rate λ), as sketched below.
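A simulation sketch of this duality: generating exponential inter-arrival times and counting arrivals in a window should reproduce a Poisson mean of λ·t (λ and t below are arbitrary):

```python
# Sketch: simulating the Poisson/exponential duality described above.
import random

lam, t, n_sims = 2.0, 1.0, 100_000    # rate lambda, window length, replications
counts = []
for _ in range(n_sims):
    elapsed, k = 0.0, 0
    while True:
        elapsed += random.expovariate(lam)   # exponential inter-arrival times
        if elapsed > t:
            break
        k += 1
    counts.append(k)

# The mean count should be close to lambda * t (= 2.0 here), as for Pois(lambda*t)
print(sum(counts) / n_sims)
```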
Normal Distribution:
The most widely used distribution in statistics and probability. Its universality is probably due to its probability density function, the Gaussian curve, being symmetric, and to the Central Limit Theorem.
Characteristics:
- Bell-shaped density
- Symmetric around mean µ
- Extends from - ∞ to + ∞
- Probability decreases exponentially
- Rate of decrease depends on the standard deviation σ
Properties of a Normal Variable:
- a linear transformation aX + b of a normal variable is again normal
- in particular, the standardization Z = (X − µ)/σ follows the standard normal N(0, 1)
The Quantile:
The quantile of order p of X ~ N(µ, σ²) is obtained from the standard normal quantile zₚ: Q(p) = µ + zₚσ.
Symmetric intervals:
Special case: for p = 0.95, the smallest interval around the mean containing 95% of the probability is [µ − 1.96σ; µ + 1.96σ], in other words µ ± 1.96σ.
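A quick check of the 1.96 factor and the 95% interval with Python's statistics.NormalDist (the µ and σ below are arbitrary):

```python
# Sketch: normal quantiles and the 95% symmetric interval (illustrative mu, sigma).
from statistics import NormalDist

mu, sigma = 100, 15
X = NormalDist(mu, sigma)
z = NormalDist().inv_cdf(0.975)           # ~1.96, the 97.5% standard normal quantile
lo, hi = mu - z * sigma, mu + z * sigma   # [mu - 1.96 sigma, mu + 1.96 sigma]
print(lo, hi, X.cdf(hi) - X.cdf(lo))      # the interval holds ~95% of the probability
```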
A binomial variable can be approximated by a Normal Random Variable when the number of experiments n is large; the following conditions are generally required: nπ ≥ 5 and n(1−π) ≥ 5.
Continuity correction:
When approximating the discrete binomial by the continuous normal, each integer value k is represented by the interval (k − 0.5, k + 0.5), e.g. P(X ≤ k) ≈ P(Y ≤ k + 0.5). Recall that for Continuous Random Variables strict and non-strict inequalities give the same probability: P(Y < y) = P(Y ≤ y).
Central Limit Theorem (CLT):
States that the sum of i.i.d. RVs tends to a Normal Random Variable. The assumptions for its validity are not restrictive, so it establishes that many natural accumulation phenomena can be modeled by a normal distribution.
So, for i.i.d. X₁, …, Xₙ with mean µ and standard deviation σ, we define the standardized sum Zₙ = (∑ᵢ Xᵢ − nµ)/(σ√n), whose distribution tends to N(0, 1) as n grows.
The strength of CLT is its generality, it applies to all distributions of Xi and again its conditions of
validity are not restrictive.
The CLT shows that the normal distribution is the distribution of natural accumulations, either sum
or mean.
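A simulation sketch of the CLT: sample means of a strongly skewed Exp(1) distribution become approximately normal (the sizes below are arbitrary):

```python
# Sketch: the CLT in action, starting from a very non-normal distribution.
import random
import statistics

n, n_sims = 50, 20_000
means = [statistics.mean(random.expovariate(1.0) for _ in range(n))
         for _ in range(n_sims)]

# Exp(1) is skewed, but the sample means are roughly N(1, 1/n):
# mean ~ 1, standard deviation ~ 1/sqrt(50) ~ 0.141
print(statistics.mean(means), statistics.stdev(means))
```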
Chi-Square Distribution:
Structure: let Z₁, …, Zₙ be a sequence of n random variables which are 1. i.i.d. and 2. following the centered and reduced Normal Distribution N(0, 1) (E = 0, Var = 1).
Then X = Z₁² + … + Zₙ² follows the Chi-Square Distribution with n degrees of freedom.
Reminder: the Chi-Square Distribution is additive: the sum of two independent Chi-Square RVs with n and m degrees of freedom follows a Chi-Square Distribution with n + m degrees of freedom.
Used to construct confidence intervals and statistical tests, such as mean comparisons.
Fisher’s Distribution:
Used for comparing the variances of RVs where the observations come from two different populations; also particularly useful for the analysis of variance (ANOVA) and regression analysis.
Definitions:
- Statistical inference: the use of data from a sample to estimate or test hypotheses about the characteristics of a population
- Population: also called parent population, the set of all elements of interest in a study
- Sample: a subset of the population
- Selecting n units at random from the population: each unit has an equal chance of being
selected
- Sampling is done without replacement: each unit can be selected only once
Representativeness and Selection Bias:
A sample is representative when it reflects the composition of the population; selection bias arises when the sampling mechanism makes some units more likely to be chosen than others.
Point Estimation:
Empirical estimators provide characteristics of the sample, an alternative approach would be to set a
probabilistic model on the observations and estimate its parameters from the sample data.
The most common methods for constructing these estimators are the method of moments and the maximum likelihood method.
Maximum Likelihood:
The goal is to find the parameter values that maximize the probability of observing what has been observed.
The MLE is the value of the parameter θ that maximizes the likelihood L(θ) or the log-likelihood l(θ). This makes the model assign a high probability to observing what has actually been observed, ensuring that the model is tuned to reality.
This value is denoted θ̂ and is called the Maximum Likelihood Estimator (MLE).
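A minimal sketch of maximum likelihood for a Poisson rate: the grid maximizer of the log-likelihood agrees with the closed-form MLE, which for the Poisson is the sample mean (the data below are made up):

```python
# Sketch: maximum likelihood for a Poisson rate by maximizing the log-likelihood
# on a grid; the closed-form Poisson MLE is the sample mean, so both should agree.
import math

data = [2, 3, 1, 4, 2, 5, 3, 2]                  # made-up Poisson counts

def log_lik(lam):
    # l(lambda) = sum over observations of log P(X = x_i)
    return sum(x * math.log(lam) - lam - math.log(math.factorial(x))
               for x in data)

grid = [k / 1000 for k in range(1, 10001)]       # candidate lambda in (0, 10]
mle = max(grid, key=log_lik)
print(mle, sum(data) / len(data))                # both ~ 2.75
```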
- to determine whether the use of an estimator is justified or not, we study the sampling
distribution and the resulting estimator’s distribution
- sampling distribution = probability distribution for the estimator:
o since the selection of samples follows a random process, the estimator itself is a RV
and therefore has its own probability distribution
The standard error of an estimator is very useful in statistics for assessing the precision of the
estimator:
In practice, we estimate the standard error itself. So, the estimator provides an estimate of the
parameter of interest θ and we also have an estimate of the standard error of this estimation.
Interpretation: for a 95% CI, if we were to take many samples and calculate a CI for each, then in 95% of the cases the interval would contain the true mean.
When the population is not normal, the interval is approximate, and its quality depends on the distribution of the population as well as on the sample size; in practice a sample size of n ≥ 30 is sufficient for this approximate interval to be satisfactory.
Often in practice σ2 is not known, in this case, the sample variance s2 is used as an estimator for the
population’s true variance.
When this is the case, confidence intervals are constructed not from the normal distribution but from Student's t distribution with n − 1 degrees of freedom, as sketched below.
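A sketch of a 95% t-based confidence interval, assuming scipy is available for the t quantile; the data are illustrative:

```python
# Sketch: 95% confidence interval for the mean with unknown sigma,
# using Student's t with n - 1 degrees of freedom (requires scipy).
import statistics
from scipy.stats import t

data = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9]
n = len(data)
mean = statistics.mean(data)
s = statistics.stdev(data)                 # sample standard deviation (n - 1)
t_crit = t.ppf(0.975, df=n - 1)            # two-sided 95% critical value
half = t_crit * s / n ** 0.5               # margin of error
print(mean - half, mean + half)
```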
Confidence Interval for the True Proportion π:
Based on the sample proportion p̂: p̂ ± z · √(p̂(1−p̂)/n), where z is the standard normal quantile for the chosen confidence level.
Minimum Sample Size:
To reach a desired margin of error E, solve the margin formula for n; for a mean, n ≥ (zσ/E)².
HYPOTHESIS TESTING – MODULE 9
Hypotheses Formulation:
The null hypothesis H0 is the statement assumed true by default; the alternative hypothesis H1 is the claim we seek evidence for.
Conclusion of a test:
- Either H0 is true or H1, never both: ideally the test should lead to
o Failure to reject H0 when H0 is true
o Rejection of H0 when H1 is true
- Since hypothesis tests are based on information from a random sample, there is possibility
of errors:
o Type I error: false positive, the test rejects H0 while H0 is true
Probability of committing this error is called Significance Level and is
denoted by α, commonly α= 1%, 5%, 10%
o Type II error: false negative, the test fails to reject H0 while H1 is true
Probability of committing this one is called Type II Error Rate and is denoted
by β
- The power of a test is denoted by γ and is the probability of correctly rejecting H0 when H1 is true
o γ=1−β, maximizing the power of a test is equivalent to minimizing Type II error rate
Test conclusion: Critical Value Method