Seminar Week 4 - With Solutions - Fullpage
Week 4
Understanding Statistical Uncertainty
Charanjit Kaur
Week 4: Samples and Sampling Distributions
Learning Outcomes:
• Revisiting the Normal Distribution
• Understanding the process and purpose of random sampling
• Identifying possible biases from different methods of sampling
• Understanding the law of large numbers & the central limit theorem (CLT)
• Using sampling distribution of a statistic to express uncertainties
Probability Distribution: Normal Distribution
• The most common distribution in statistics is the normal distribution
• It is a symmetric (bell-shaped) distribution
• The normal distribution has two parameters: the mean and the standard deviation (Stdev)
Normal Distribution
• Notation: $X \sim N(\text{Mean}, \text{Stdev})$
$X \sim N(\mu, \sigma)$
• Skewness = 0
• Mean = Median = Mode
Calculating Normal Distribution Probabilities using Excel
=NORM.INV(probability, mean, standard deviation)
This calculates the X value such that the probability of getting a number LOWER than
X is equal to the entered probability.
→ Percentile at the desired probability
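The same percentile lookup can be sketched outside Excel. The example below is a minimal sketch using Python's standard library; the function name `norm_inv` and the exam-mark numbers are assumptions made for illustration, not part of the seminar material.

```python
from statistics import NormalDist

def norm_inv(probability: float, mean: float, std_dev: float) -> float:
    """Return x such that P(X < x) = probability for X ~ N(mean, std_dev).

    Mirrors the argument order of Excel's NORM.INV.
    """
    return NormalDist(mu=mean, sigma=std_dev).inv_cdf(probability)

# Hypothetical example: marks with mean 70 and standard deviation 10.
x_90 = norm_inv(0.90, 70, 10)  # 90th percentile of the marks
print(round(x_90, 2))          # 82.82
```

Under these assumed numbers, 90% of marks fall below roughly 82.82.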
Population:
All members of a group about which you want to draw a conclusion.
Parameter: A measurable characteristic of a population
• population mean (𝜇)
• population standard deviation (𝜎)
Characteristics of Random Sample
Representative
• sample is randomly chosen from the whole population.
• characteristics of people who respond to a survey do not differ significantly from those
in the same sample who do not respond.
• members of the sample exhibit features consistent with the general population.
• all members of the population are equally likely to be chosen for the sample.
• sampling is not done based on voluntary participation.
Representative Sample
Representative sample is determined by:
1) Data collection process (sampling design)
2) Survey design → wording design of the questions/form.
3) Sample size → a sufficiently large sample means the sample statistic gets closer to the population
parameter
Biased sample:
• Non-representative statistics
• Invalid inference → invalid conclusions. It could end with catastrophic outcomes if used in business
decisions
Potential biases:
• Selection bias – members of the population have unequal chances of being chosen
• Non-response bias – the data collection process leads to systematic non-response from certain
groups
Identifying potential sampling bias: Examples
A marketing study aimed at analysing metro train users’ satisfaction is being
conducted for the Melbourne metropolitan zone.
• Sample 1: Data is collected by verbal surveys across all train stations across the
metropolitan zone. The surveyor conducts the survey by randomly choosing
passengers across all operational times over a one-week period.
→Minimal selection bias
Key takeaways:
• Large samples → estimate gets closer and closer to truth
• Each new sample gives a different path of estimates → statistical variability
• When the sample size is small, the estimate is more variable
• ALL of this depends on the sample being truly random!
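These takeaways can be checked with a small simulation. The sketch below assumes a normal population with mean 50 and standard deviation 10 (illustrative values, not from the seminar) and tracks the running sample mean as observations accumulate; each seed plays the role of a new sample and so traces a different path.

```python
import random

def running_means(sample):
    """Return the mean of the first k observations for k = 1..len(sample)."""
    out, total = [], 0.0
    for k, x in enumerate(sample, start=1):
        total += x
        out.append(total / k)
    return out

def simulate_path(n, true_mean=50.0, std_dev=10.0, seed=None):
    """Draw one random sample of size n and return its running-mean path."""
    rng = random.Random(seed)
    sample = [rng.gauss(true_mean, std_dev) for _ in range(n)]
    return running_means(sample)

# Three different samples: the early estimates differ noticeably,
# while the estimates at n = 10,000 all sit close to the true mean of 50.
for seed in (1, 2, 3):
    path = simulate_path(10_000, seed=seed)
    print(seed, [round(path[k - 1], 2) for k in (10, 100, 10_000)])
```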
Sampling Distribution
Up to this point, you might think: just collect more high-quality data!
But this is not always possible:
• Limited access to individuals/experiments
• Limited monetary resources to do so
• Some business problems → small data problem
The sampling distribution is a statistical tool that helps quantify the
uncertainty of the estimate for a given data set.
The basics of Sampling Distribution
• Sample statistic is only an estimate of the truth
• Since samples vary, any sample statistic is not exact and has
variation/error around it.
• The smaller the error, the greater the accuracy.
• We need to take into account such variability in the statistic if we want to
analyse the statistic.
• Assume we take data samples repeatedly and compute the sample mean as
the statistic for each sample. Then we would have the sampling
distribution of the sample mean, which portrays its variability.
Sampling Distribution of the Sample Mean
$\bar{x} \sim N\left(\text{Mean}, \frac{\text{Std deviation}}{\sqrt{\text{sample size}}}\right)$
• By the central limit theorem, this holds approximately for a sufficiently large sample size, regardless of the shape of the population distribution
• Magic? NO! Just the beauty of statistics.
https://ebsmonash.shinyapps.io/cltdemo/
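For readers without access to the demo, the repeated-sampling idea can be sketched in a few lines. This is a minimal sketch under stated assumptions: the population is taken to be Exponential(1), which is strongly right-skewed with mean 1 and standard deviation 1, and the sample sizes and number of repeats are arbitrary choices.

```python
import random
from statistics import mean, stdev

def sample_means(n, repeats, rng):
    """Draw `repeats` samples of size n from an Exponential(1) population
    and return the mean of each sample."""
    return [mean([rng.expovariate(1.0) for _ in range(n)])
            for _ in range(repeats)]

rng = random.Random(42)
for n in (5, 30, 200):
    means = sample_means(n, 2000, rng)
    # Columns: n, centre of the sample means, their spread, and 1/sqrt(n),
    # which is the population std dev (1) divided by sqrt(sample size).
    print(n, round(mean(means), 3), round(stdev(means), 3), round(1 / n ** 0.5, 3))
```

The centre of the simulated sample means should stay near the population mean of 1, and their spread should track the population standard deviation divided by the square root of the sample size, even though the population itself is far from normal.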
The following are based on all possible samples of size n.
Sampling Distribution of the Sample Mean
Key takeaways:
• The sample mean $\bar{x}$ is centred around the true mean
• Its uncertainty is measured by the standard error
$SE(\bar{x}) = \frac{s}{\sqrt{n}}$
• The standard error is smaller than the standard deviation whenever n > 1
• Large sample size → $\bar{x}$ is a more precise estimate of the true population mean
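As a worked illustration of the standard error, the sketch below computes the sample standard deviation and the standard error for a small data set. The prices are hypothetical values invented for the example (in thousands of dollars), and `standard_error` is an illustrative helper name.

```python
from math import sqrt
from statistics import stdev

def standard_error(sample):
    """Standard error of the sample mean: sample std dev / sqrt(n)."""
    return stdev(sample) / sqrt(len(sample))

# Hypothetical house prices in $000 (not the seminar data set).
prices = [1180, 1250, 1195, 1310, 1225, 1160, 1275, 1205]

s = stdev(prices)            # spread of individual prices
se = standard_error(prices)  # uncertainty of the sample mean
print(round(s, 2), round(se, 2))
```

Because the sample has more than one observation, the standard error printed second is smaller than the standard deviation printed first.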
Estimation of the true population mean
There are two types of estimates:
1) Point Estimate
A single value that estimates a population parameter
2) Interval Estimate
A range of values within which the population parameter probably lies. This range is
known as a confidence interval estimate
Confidence interval = plausible range of the unknown population mean given some level of
probability
Confidence Interval: Basic Format
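$\bar{X} - Z \frac{s}{\sqrt{n}} < \mu < \bar{X} + Z \frac{s}{\sqrt{n}}$ OR $\bar{x} \pm Z \frac{s}{\sqrt{n}}$
• Point estimate: the population mean is estimated by $\bar{X}$
• $Z$: a value that embodies the desired level of confidence
• Standard error $\frac{s}{\sqrt{n}}$ (standard deviation / $\sqrt{n}$): a measure of the error associated with the point estimate
• Margin of error: $Z \frac{s}{\sqrt{n}}$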
Width of the Confidence Interval
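[Figure: area $1 - \alpha$ between the lower bound $\bar{x} - Z \frac{s}{\sqrt{n}}$ and the upper bound $\bar{x} + Z \frac{s}{\sqrt{n}}$]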
Note:
• (1-𝛼) is referred to as the level of confidence
• 𝛼 is referred to as the level of significance. It is the total probability left in the two tails outside the confidence interval
E.g. for a 95% confidence interval, 𝛼 = 1 − 0.95 = 0.05
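The link between the level of confidence, 𝛼, and the interval bounds can be sketched as follows. This is a minimal sketch assuming a Z-based interval with the critical value taken at probability 1 − 𝛼/2; the function name and the data are illustrative only.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def z_confidence_interval(sample, confidence=0.95):
    """Return (lower, upper) for the mean using x_bar ± Z * s / sqrt(n)."""
    alpha = 1 - confidence                   # level of significance
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 when confidence = 0.95
    x_bar = mean(sample)                     # point estimate
    margin = z * stdev(sample) / sqrt(len(sample))  # margin of error
    return x_bar - margin, x_bar + margin

# Hypothetical data, used only to show the mechanics.
data = [2, 4, 4, 4, 5, 5, 7, 9]
lower, upper = z_confidence_interval(data, confidence=0.95)
print(round(lower, 3), round(upper, 3))
```

The interval is symmetric around the point estimate, with half-width equal to the margin of error.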
Factors that affect the width of a Confidence Interval Estimate
If the standard deviation (𝜎) ↑, the spread of the distribution is larger:
standard error ($\frac{\text{std deviation}}{\sqrt{n}}$) ↑, width ↑, estimate is less precise
If the sample size (n) ↑, standard error ($\frac{\text{std deviation}}{\sqrt{n}}$) ↓, width ↓, estimate is more precise
The bigger the sample, the more information we have to increase the precision of the interval estimate of the
population mean, so the interval is narrower.
If the level of confidence (1-α) ↑, critical value ↑, width ↑, the estimate is less precise
The more confident we want to be, the more values we need to include in our confidence interval, so the
interval is wider.
$\bar{x} \pm Z \frac{s}{\sqrt{n}}$
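The three effects on the width can be checked numerically. The sketch below assumes a Z-based interval, so the full width is twice the margin of error; the baseline values (standard deviation 10, n = 100, 95% confidence) are arbitrary illustrative choices.

```python
from math import sqrt
from statistics import NormalDist

def ci_width(std_dev, n, confidence):
    """Full width of the interval x_bar ± Z * std_dev / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return 2 * z * std_dev / sqrt(n)

base = ci_width(10, 100, 0.95)
print("baseline          ", round(base, 3))
print("larger std dev    ", round(ci_width(20, 100, 0.95), 3))  # wider
print("larger sample     ", round(ci_width(10, 400, 0.95), 3))  # narrower
print("higher confidence ", round(ci_width(10, 100, 0.99), 3))  # wider
```

Note that quadrupling the sample size only halves the width, because the sample size enters through its square root.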
Confidence Interval in Repeated Sampling Context
(Source: Lind, Marchal and Wathen, Statistical Techniques in Business and Economics, 2021, 18th edition)
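The repeated-sampling interpretation can be illustrated by simulation: draw many samples from a population whose mean is known, build an interval from each, and count how often the interval contains the true mean. The sketch below assumes a normal population with mean 100 and standard deviation 15 and a Z-based interval; all of these values are illustrative, not taken from the textbook figure.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def coverage(true_mean, std_dev, n, confidence, repeats, seed=None):
    """Fraction of simulated confidence intervals that contain true_mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(true_mean, std_dev) for _ in range(n)]
        x_bar = mean(sample)
        margin = z * stdev(sample) / sqrt(n)
        if x_bar - margin <= true_mean <= x_bar + margin:
            hits += 1
    return hits / repeats

rate = coverage(true_mean=100, std_dev=15, n=50, confidence=0.95,
                repeats=2000, seed=1)
print(rate)  # expected to be close to 0.95, not exactly 0.95
```

Any single interval either contains the true mean or it does not; the 95% describes the long-run share of intervals that do.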
Confidence Interval – Average house prices
Calculate the 95% confidence interval for the population mean house price
$SE(\bar{x}) = \frac{s}{\sqrt{n}}$ (already given here)
[1170.195, 1260.729]
We are 95% confident that the true average house price in this Melbourne
suburb is between $1,170,195 and $1,260,729.
Note: Remember that price is recorded in thousands of dollars.
Confidence Interval – Average house prices
Houses in the school zone are more expensive, on average, than
houses outside the school zone.
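• If the sample size $n$ is large, $p \sim N(\pi, SE(p))$ where $SE(p) = \sqrt{\frac{p(1-p)}{n}}$
• Lower & upper bounds of a $(1-\alpha)$ confidence interval for the population proportion $\pi$, based on the sample proportion $p$:
$100(1-\alpha)\%$ C.I. $= p \pm z_{1-\alpha/2} \, SE(p)$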
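The proportion interval can be sketched in the same way as the interval for a mean. The survey counts below (230 of 400 respondents) are hypothetical numbers chosen for illustration, and `proportion_ci` is an illustrative helper name.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    """Return (lower, upper) using p ± z * sqrt(p(1-p)/n); assumes large n."""
    p = successes / n                  # sample proportion
    se = sqrt(p * (1 - p) / n)         # standard error of p
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return p - z * se, p + z * se

# Hypothetical survey: 230 of 400 respondents answered "yes".
lower, upper = proportion_ci(230, 400)
print(round(lower, 3), round(upper, 3))
```

With these assumed counts the whole interval lies above 0.5, which is the kind of evidence needed to support a "more than half" claim at the 95% level.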
Confidence Interval – Marketing Survey
More than half of our potential market have tried our frozen food product
• Here, $t_{1-\alpha/2,\,df}$ is calculated by “=T.INV(1-alpha/2,df)”
Samples and Sampling Distributions
❖ Representative Sample
▪ Sample design
▪ Survey design
▪ Sample size
❖ Sampling distribution of the sample statistic
▪ General concepts
▪ Relationship with population distribution and sample size
▪ Confidence interval calculation and interpretations