Seminar Week 4 - With Solutions - Fullpage

Uploaded by

Anika Jain
Copyright
© © All Rights Reserved

ETF1100 Business Statistics

Week 4
Understanding Statistical Uncertainty
Charanjit Kaur
Week 4: Samples and Sampling Distributions
Learning Outcomes:
• Revisiting the Normal Distribution
• Understanding the process and purpose of random sampling
• Identifying possible biases from different methods of sampling
• Understanding the law of large numbers & the central limit theorem (CLT)
• Using sampling distribution of a statistic to express uncertainties
Probability Distribution: Normal Distribution
• The most common distribution in statistics is the normal distribution
• It is a symmetric (bell-shaped) distribution
• The normal distribution is fully described by two parameters: the mean and the standard deviation (Stdev)

Normal Distribution
• Notation: $X \sim N(\text{Mean}, \text{Stdev})$, i.e. $X \sim N(\mu, \sigma)$
• Skewness = 0
• Mean = Median = Mode
Calculating Normal Distribution Probabilities using Excel
=NORM.INV(probability, mean, standard deviation)
This calculates the “X value” such that the probability of getting a number LOWER than
this value is equal to the entered probability.
→ Percentile at the desired probability

Calculate the 10th percentile of this price distribution

$P(\text{Price} < \text{Price}^*) = 10\%$
=NORM.INV(probability, mean, standard deviation)
$\text{Price}^*$ =NORM.INV(0.1,1215,419)
=678.03
10% of houses are priced lower than $678.03k, i.e. about $678,030 (prices are recorded in $000s)
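For readers working outside Excel, the same percentile can be checked with a short Python sketch. It assumes Python's standard-library `statistics.NormalDist` as a stand-in for `NORM.INV`; the variable names are illustrative only.

```python
from statistics import NormalDist

# House price distribution from the slide: Price ($000s) ~ N(mean=1215, stdev=419)
price = NormalDist(mu=1215, sigma=419)

# Counterpart of Excel's =NORM.INV(0.1, 1215, 419): the 10th percentile
price_star = price.inv_cdf(0.10)
print(round(price_star, 2))

# Sanity check: the probability of a price below price_star should be 10%
print(round(price.cdf(price_star), 4))
```

The second print simply reverses the calculation, which is a quick way to confirm that the percentile and the probability belong together.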
STANDARD Normal Distribution
• A special case: STANDARD normal distribution, $Z = \frac{X - \mu}{\sigma}$

• Mean = 0 and Stdev = 1


• Used in statistics to assess statistical uncertainty (this week… and next week)

Standard Normal Distribution


• Notation: 𝑍 ~ 𝑁(0,1)
• Skewness = 0
• Mean = Median = Mode
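As a small illustration of standardisation, the sketch below converts the 10th-percentile house price from the earlier example into a Z value and checks that both distributions give the same probability. It is a minimal sketch using Python's `statistics.NormalDist`; the variable names are assumptions made for this example.

```python
from statistics import NormalDist

mean, stdev = 1215, 419          # house price parameters from the slides ($000s)
price = NormalDist(mean, stdev)  # X ~ N(1215, 419)
z_dist = NormalDist(0, 1)        # Z ~ N(0, 1)

x = 678.03                       # the 10th percentile from the house price example
z = (x - mean) / stdev           # standardise: Z = (X - mu) / sigma

p_from_x = price.cdf(x)          # P(X < 678.03)
p_from_z = z_dist.cdf(z)         # P(Z < z), which should be the same probability
print(round(z, 4), round(p_from_x, 4), round(p_from_z, 4))
```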
Normal Distribution & Standard normal distribution

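$X \sim N(\text{Mean}, \text{Stdev})$ and $Z \sim N(0, 1)$

House Price Distribution
$\text{Price (000s)} \sim N(\text{Mean} = 1215, \text{Stdev} = 419)$

[Figure: house price distribution, Price ($000s), centred at 1215]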

Is the normal distribution always a good approximation for numerical data?


• It depends on the nature of the observed data distribution
• Not appropriate for skewed distributions
• It is most often used in the context of statistical analysis/hypothesis testing → STANDARD
NORMAL distribution
Purpose of Random Sampling
Let’s first go back to “statistics”

Statistics: The study of the collection, organisation, analysis, and interpretation of data.

How is statistics used in decision making?
→ drawing general conclusions about the population from a sample set of data
Basic Concepts of Statistics: Population and Sample
Sample:
A subset of the population selected for analysis.
• Often chosen randomly
• Preferably representative of the population
Statistic (Estimate): Computable summaries of the sample
• Sample mean ($\bar{x}$) and
• Sample standard deviation (𝑠)

Population:
All members of a group about which you want to draw a conclusion.
Parameter: A measurable characteristic of a population
• population mean (𝜇)
• population standard deviation (𝜎)
Characteristics of Random Sample
Representative
• sample is randomly chosen from the whole population.
• characteristics of people who respond to a survey do not differ significantly from those
in the same sample who do not respond.
• members of the sample exhibit features consistent with the general population.
• all members of the population are equally likely to be chosen for the sample.
• sampling is not done based on voluntary participation.
Representative Sample
Representative sample is determined by:
1) Data collection process (sampling design)
2) Survey design → wording design of the questions/form.
3) Sample size → a sufficiently large sample means the sample statistic gets closer to the population
parameter
Biased sample:
• Non-representative statistics
• Invalid inference → invalid conclusions. It could end with catastrophic outcomes if used in business
decisions

Potential biases:
• Selection bias – individuals in the population have unequal chances of being chosen
• Non-response bias – the data collection process leads to systematic non-response from certain
groups
Identifying potential sampling bias: Examples
A marketing study aimed at analysing metro train users’ satisfaction is being
conducted for the Melbourne metropolitan zone.
• Sample 1: Data is collected by verbal surveys at all train stations across the
metropolitan zone. The surveyor conducts the survey by randomly choosing
passengers across all operational times over a one-week period.
→ Minimal selection bias

• Sample 2: Data is collected by a mandatory survey of all train passengers
arriving at or exiting Caulfield station between 8-9 am on Friday morning.
→ Selection bias, both by geographical location and respondent demographics
Identifying potential sampling bias: Example
Political party X is conducting a poll of voters’ opinions and voting tendencies for the
upcoming election.
Data Sample: obtained by phone survey, with the calls made to randomly chosen
registered landline numbers between 9am and 5pm, Monday to Friday.

Is this data sample a random sample?


Statistics is UNCERTAIN

Statistical analysis depends on random sampling of data

→Different sets of data can generate a slightly different estimate

Statistics is about quantifying the uncertainty of the sample estimate

If data is to influence decisions, decision-makers need to be able to
understand the extent of this uncertainty
Expectation
$\bar{x}$ is an estimate of $E(X) = \mu$

• Expectation of the random variable X
• This represents the “theoretical” population mean
• In real-world problems, this is our “unknown”
• In simulation, this can be calculated from probability theory
Expectation
$E(X) = \mu$
• E.g. A simple trial of rolling a fair six-sided die and recording the face value
• Six possible values: {1,2,3,4,5,6}
• Each outcome has equal probability of 1/6
• Random variable $X$ is the face value of each roll
• What is the expectation of $X$? From probability theory:
$E(X) = \frac{1}{6} \times 1 + \frac{1}{6} \times 2 + \frac{1}{6} \times 3 + \frac{1}{6} \times 4 + \frac{1}{6} \times 5 + \frac{1}{6} \times 6 = 3.5$

We can estimate this expectation by conducting experiments and collecting data.
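The die example can be reproduced as a short Python sketch: first the exact expectation from probability theory, then an estimate from simulated rolls. The seed and the number of rolls are arbitrary choices for illustration, not values from the seminar.

```python
import random
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]

# Exact expectation from probability theory: sum of (probability x value)
expected = sum(Fraction(1, 6) * face for face in faces)
print(float(expected))

# Estimate the same quantity from simulated data (seed chosen arbitrarily)
rng = random.Random(1100)
rolls = [rng.choice(faces) for _ in range(10_000)]
estimate = sum(rolls) / len(rolls)
print(round(estimate, 3))
```

The simulated estimate will not equal the theoretical value exactly; the gap between the two is the statistical variability discussed in this seminar.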


Does sample size matter?

Key takeaways:
• Large samples → the estimate gets closer and closer to the truth
• A new sample gives a different path of estimates → statistical variability
• When the sample size is small, the estimate is more variable
• ALL of this relies on a truly random sample!
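A rough way to see these takeaways is to repeat the die experiment many times at different sample sizes and compare how much the sample means vary. The sketch below is illustrative only: the sample sizes, the number of repeats and the seed are assumptions, and `sample_means` is a hypothetical helper name.

```python
import random
from statistics import mean, pstdev

def sample_means(n, repeats, rng):
    """Mean of n fair-die rolls, recomputed for `repeats` independent samples."""
    return [mean([rng.randint(1, 6) for _ in range(n)]) for _ in range(repeats)]

rng = random.Random(4)
spread = {}
for n in (10, 100, 1000):
    means = sample_means(n, repeats=200, rng=rng)
    spread[n] = pstdev(means)  # variability of the estimate across samples
    print(n, round(mean(means), 3), round(spread[n], 3))
```

Each row reports the sample size, the average of the 200 sample means, and their spread; the spread should shrink as the sample size grows.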
Sampling Distribution

Up to this point, you might think... just collect more high-quality data!
But this is not always possible:
• Limited access to individuals/experiments
• Limited monetary resources to do so
• Some business problems → small data problem
The sampling distribution is a statistical tool that helps quantify the
uncertainty of the estimate for a given data set.
The basics of Sampling Distribution
• A sample statistic is only an estimate of the truth
• Since samples vary, a sample statistic is not exact and has
variation/error around it.
• The smaller the error, the greater the accuracy.
• We need to take this variability into account if we want to
analyse the statistic.
• Suppose we take data samples repeatedly and compute the sample mean of
each sample. The distribution of these sample means is the sampling
distribution of the sample mean, which portrays its variability.
Sampling Distribution of the Sample Mean

Statistical theory gives us a result (Central Limit Theorem):


• If the sample size $n$ is large,
$\bar{x} \sim N\left(\text{Mean}, \frac{\text{Std deviation}}{\sqrt{\text{sample size}}}\right)$
where the standard error is $SE(\bar{x}) = \frac{s}{\sqrt{n}}$
• This is true regardless of the shape of the population distribution
• Magic? NO! Just the beauty of statistics.

https://ebsmonash.shinyapps.io/cltdemo/
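In the same spirit as the interactive demo, the Central Limit Theorem can be checked with a small simulation. This sketch assumes a deliberately skewed population (exponential with mean 1 and standard deviation 1); the population, sample size, number of repeats and seed are all choices made for this illustration.

```python
import math
import random
from statistics import mean, stdev

rng = random.Random(2024)

# Assumed skewed population: exponential with mean 1 (so sigma = 1 as well)
n, repeats = 50, 2000
sample_means = [mean([rng.expovariate(1.0) for _ in range(n)]) for _ in range(repeats)]

centre = mean(sample_means)   # should sit near the population mean, 1
spread = stdev(sample_means)  # should sit near sigma / sqrt(n)
theory = 1 / math.sqrt(n)

# If the sample means are roughly normal, about 95% lie within 1.96 standard errors
within = sum(abs(m - 1) <= 1.96 * theory for m in sample_means) / repeats
print(round(centre, 3), round(spread, 3), round(theory, 3), round(within, 3))
```

Even though individual observations are strongly skewed, the sample means cluster around the population mean with a spread close to the standard error formula.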
The following are based on all possible samples of size n.

Sampling Distribution of the Sample Mean

Key takeaways:
• The sample mean $\bar{x}$ is centred around the true mean
• Its uncertainty is measured by the standard error, $SE(\bar{x}) = \frac{s}{\sqrt{n}}$
• The standard error is always smaller than the standard deviation (for any sample size $n > 1$)
• Large sample size → $\bar{x}$ is a more precise estimate of the true population mean
Estimation of the true population mean
There are two types of estimates:
1) Point Estimate
A single value that estimates a population parameter

2) Interval Estimate
A range of values within which the population parameter probably lies. This range is
known as a confidence interval estimate

Point estimates do not indicate uncertainty (sampling error).


Better approach: give a range of values within which the unknown population parameter is
thought likely to lie. We refer to this range of plausible values as a confidence interval

Confidence interval = plausible range of the unknown population mean given some level of
probability
Confidence Interval: Basic Format
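$\bar{X} - Z\frac{s}{\sqrt{n}} < \mu < \bar{X} + Z\frac{s}{\sqrt{n}}$ OR $\bar{x} \pm Z\frac{s}{\sqrt{n}}$

i.e. $\bar{X} \pm$ Margin of error, where the margin of error is $Z$ times the standard error

• Point Estimate: estimate $\mu$ by $\bar{X}$
• $Z$: a value that embodies the desired level of confidence
• Standard error $\left(\frac{\text{Standard deviation}}{\sqrt{n}}\right)$: a measure of the error associated with the point estimate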
Width of the Confidence Interval

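[Figure: sampling distribution with central area $1 - \alpha$]

• Lower Confidence Limit / Lower boundary: $\bar{x} - Z\frac{s}{\sqrt{n}}$
• Upper Confidence Limit / Upper boundary: $\bar{x} + Z\frac{s}{\sqrt{n}}$
• Width of the confidence interval: the distance between the lower and upper limits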

The width of a confidence interval indicates the precision of the estimate.

Note:
• $(1-\alpha)$ is referred to as the level of confidence
• $\alpha$ is referred to as the level of significance. It is the total probability left in the two tails outside the confidence interval
E.g. for a 95% confidence interval, $\alpha = 1 - 0.95 = 0.05$
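To connect the level of confidence to the critical value used in the interval, the sketch below computes $z_{1-\alpha/2}$ for a few common confidence levels. It uses Python's `statistics.NormalDist` in place of Excel's `NORM.S.INV`; the dictionary name `critical` is an assumption for this example.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal, Z ~ N(0, 1)

critical = {}
for confidence in (0.90, 0.95, 0.99):
    alpha = 1 - confidence
    # Same idea as Excel's =NORM.S.INV(1 - alpha/2)
    critical[confidence] = z.inv_cdf(1 - alpha / 2)
    print(f"{confidence:.0%} confidence: alpha = {alpha:.2f}, z = {critical[confidence]:.4f}")
```

A higher level of confidence leaves less probability in the tails, so the critical value moves further from zero.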
Factors that affect the width of a Confidence Interval Estimate
If the standard deviation ($\sigma$) ↑, the spread of the distribution is larger:
standard error $\left(\frac{\text{std deviation}}{\sqrt{n}}\right)$ ↑, width ↑, estimate is less precise

If the sample size ($n$) ↑, standard error $\left(\frac{\text{std deviation}}{\sqrt{n}}\right)$ ↓, width ↓, estimate is more precise
The bigger the sample, the more information we have to increase the precision of the interval estimate of the
population mean, and the narrower the interval.

If the level of confidence $(1-\alpha)$ ↑, critical value $Z$ ↑, width ↑, the estimate is less precise
The more confident we want to be, the more values we need to include in our confidence interval, and the wider the
interval.

$\bar{x} \pm Z\frac{s}{\sqrt{n}}$
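The three effects on the width can be checked numerically. In this sketch, `ci_width` is a hypothetical helper; the standard deviation of 419 is borrowed from the house price example, while the sample sizes, the alternative standard deviation and the 99% level are made-up values for comparison.

```python
import math
from statistics import NormalDist

def ci_width(s, n, confidence=0.95):
    """Width of the interval x_bar +/- z * s / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return 2 * z * s / math.sqrt(n)

base = ci_width(s=419, n=100)                             # baseline (n is hypothetical)
larger_s = ci_width(s=600, n=100)                         # larger standard deviation
larger_n = ci_width(s=419, n=400)                         # four times the sample size
more_confident = ci_width(s=419, n=100, confidence=0.99)  # higher level of confidence

print(round(base, 2), round(larger_s, 2), round(larger_n, 2), round(more_confident, 2))
```

Because the standard error depends on $\sqrt{n}$, quadrupling the sample size only halves the width.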
Confidence Interval in Repeated Sampling Context

We select a sample of $n$ observations repeatedly, and for each sample we
construct a 95% confidence interval for the population mean.

We could expect 95% of the intervals to contain the population mean,
while 5% of the intervals would not contain the population mean.

(Source: Lind, Marchal and Wathen, Statistical Techniques in Business and Economics, 2021, 18th edition)
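The repeated-sampling interpretation can be illustrated with a simulation. This is a sketch under stated assumptions: the population is taken to be normal with the mean and standard deviation of the house price example, and the sample size, number of repetitions and seed are arbitrary.

```python
import math
import random
from statistics import NormalDist, mean, stdev

rng = random.Random(95)
mu, sigma = 1215, 419   # population values borrowed from the house price example
n, repeats = 100, 2000  # hypothetical sample size and number of repeated samples
z = NormalDist().inv_cdf(0.975)

covered = 0
for _ in range(repeats):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    x_bar, s = mean(sample), stdev(sample)
    margin = z * s / math.sqrt(n)
    # Count the interval if it contains the true population mean
    if x_bar - margin < mu < x_bar + margin:
        covered += 1

coverage = covered / repeats
print(round(coverage, 3))
```

The printed share should be close to, though not exactly, 0.95. Any single interval either contains the population mean or does not; the 95% describes the procedure over many samples.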
Confidence Interval – Average house prices
Calculate the 95% confidence interval of the population mean house prices
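[Figure: summary statistics output, annotated: the standard error $SE(\bar{x}) = \frac{s}{\sqrt{n}}$ is already given here, along with the standard deviation estimate $s$ and the sample size $n$]

95% C.I.: $\bar{x} \pm z_{1-\frac{\alpha}{2}} \frac{s}{\sqrt{n}}$
$z_{1-\frac{\alpha}{2}}$ =NORM.S.INV(1 - 0.05/2)
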

[1170.195, 1260.729]
We are 95% confident that the true average house price in this Melbourne
suburb is between $1,170,195 and $1,260,729.
Note: Remember that price is recorded in thousands of dollars.
Confidence Interval – Average house prices
Houses in the school zone are more expensive, on average, than
houses outside the school zone.

Let’s use the concepts of confidence interval to validate/invalidate this claim


Calculation of the 95% confidence interval
                 School Zone     Outside School Zone
alpha            0.05            0.05
xbar             1258.7863       1046.0448
SE(xbar)         26.0798         44.1962
z_(1-alpha/2)    1.959963985     1.959963985
Lower bound      1207.670856     959.4218151
Upper bound      1309.901663     1132.667737

• TRUE
• FALSE
The lower and upper bounds for houses within the school zone are both higher than those for houses outside it.
Furthermore, the two intervals do not overlap. Hence there is a clear difference in average prices.
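The bounds in the table can be reproduced from the reported sample means and standard errors. The sketch below uses a hypothetical helper, `mean_ci`, and Python's `statistics.NormalDist` for the critical value; because the table's inputs are rounded, the results agree with the slide only to about three decimal places.

```python
from statistics import NormalDist

def mean_ci(x_bar, se, alpha=0.05):
    """Return (lower, upper) for x_bar +/- z_(1-alpha/2) * SE."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return x_bar - z * se, x_bar + z * se

# Figures from the slide's table (prices in $000s)
school = mean_ci(1258.7863, 26.0798)
outside = mean_ci(1046.0448, 44.1962)

# The intervals do not overlap if one lower bound exceeds the other upper bound
no_overlap = school[0] > outside[1] or outside[0] > school[1]
print(school, outside, no_overlap)
```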
Sampling Distribution of Proportion
Statistical theory ALSO gives us a result (Central Limit Theorem):
Here, 𝝅 =unknown population proportion
• Proportions estimated by $p = \frac{X}{n} = \frac{\text{number of items with the characteristic of interest}}{\text{sample size}}$

• If the sample size $n$ is large, $p \sim N(\pi, SE(p))$ where $SE(p) = \sqrt{\frac{p(1-p)}{n}}$

• Lower & upper bounds of a $(1-\alpha)$ confidence interval for the population proportion:
$100(1-\alpha)\%$ C.I. $= p \pm z_{1-\frac{\alpha}{2}} \, SE(p)$
Confidence Interval – Marketing Survey
More than half of our potential market have tried our frozen food product

Let’s use the concepts of confidence interval to validate/invalidate this claim

Contingency table from Week 3 Tutorial


Count of Person ID (shown as a proportion of the total), by Gender
Have tried     Female     Male       Grand Total
No             0.21262    0.20911    0.42173
Yes            0.25234    0.32593    0.57827   ← $p$
Grand Total    0.46495    0.53505    1.00000
N = 856
95% confidence interval


alpha            0.05
p_hat            0.57827
SE(p_hat)        0.016878955
z_(1-alpha/2)    1.959963985
Lower bound      0.545188884
Upper bound      0.611353172

We can be 95% confident that the population proportion of the potential
market who have tried our product is between 54.52% and 61.14%.
This range is well above half.
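The same interval can be recomputed from the sample proportion and sample size alone. This is a minimal sketch using the figures reported on the slide; `claim_supported` is an illustrative name for the check that the whole interval lies above one half.

```python
import math
from statistics import NormalDist

p_hat, n, alpha = 0.57827, 856, 0.05  # values from the slide

se = math.sqrt(p_hat * (1 - p_hat) / n)  # SE(p) = sqrt(p(1-p)/n)
z = NormalDist().inv_cdf(1 - alpha / 2)  # counterpart of =NORM.S.INV(1-alpha/2)
lower, upper = p_hat - z * se, p_hat + z * se
print(round(se, 6), round(lower, 4), round(upper, 4))

# The claim "more than half" is supported if even the lower bound exceeds 0.5
claim_supported = lower > 0.5
print(claim_supported)
```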
Sampling Distribution – SPECIAL CASE
SPECIAL CASE: what happens when sample size 𝒏 is small?
• If you are confident that your data comes from a population
distribution that is normally distributed:
$\frac{\bar{x} - \mu}{s/\sqrt{n}} \sim \text{Student-}t(df = n - 1)$
• Confidence intervals can be constructed by
95% C.I.: $\bar{x} \pm t_{1-\frac{\alpha}{2},\,df} \frac{s}{\sqrt{n}}$

• Here, $t_{1-\frac{\alpha}{2},\,df}$ is calculated by “=T.INV(1-alpha/2,df)”
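For a small-sample illustration without Excel, the sketch below approximates the t critical value numerically and builds the interval. Everything here is an assumption for demonstration: the ten prices are made-up data, and `t_pdf`, `t_cdf` and `t_inv` are hypothetical helpers that integrate the Student-t density with Simpson's rule and invert it by bisection. In practice a library routine such as `scipy.stats.t.ppf` or Excel's `T.INV` would normally be used instead.

```python
import math
from statistics import mean, stdev

def t_pdf(x, df):
    """Density of the Student-t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0, using Simpson's rule on [0, x] (steps must be even)."""
    h = x / steps
    total = t_pdf(0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_inv(p, df):
    """Quantile for 0.5 < p < 1 by bisection; intended for small df only."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical small sample of house prices ($000s), not taken from the seminar data
sample = [980, 1120, 1340, 1510, 890, 1275, 1430, 1050, 1190, 1365]
n = len(sample)
x_bar, s = mean(sample), stdev(sample)

t_crit = t_inv(0.975, n - 1)       # counterpart of =T.INV(1-0.05/2, n-1)
margin = t_crit * s / math.sqrt(n)
lower, upper = x_bar - margin, x_bar + margin
print(round(t_crit, 4), round(lower, 1), round(upper, 1))
```

The t critical value is larger than the corresponding z value of about 1.96, so the small-sample interval is wider than a z-based interval would be.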
Samples and Sampling Distributions

❖ Representative Sample
▪ Sample design
▪ Survey design
▪ Sample size
❖ Sampling distribution of the sample statistic
▪ General concepts
▪ Relationship with population distribution and sample size
▪ Confidence interval calculation and interpretations
