
Sampling and Simulation

INFO5501: Fundamentals of Data Analytics

Ref: Fundamentals of Data Science by Stephen Elston, at: https://github.com/StephenElston/CSCI-E-83


Introduction
• Sampling is a fundamental process in the collection and analysis of data
• Sampling is important because we almost never have data on a whole
population
• Sampling needs to be randomized to preclude biases
• As sample size increases, the standard error of a statistic computed from
the sample decreases by the law of large numbers
• Key points to keep in mind:
• Understanding sampling is essential to ensure data is representative of the entire
population
• Use inferences from the sample to say something about the population
• The sample must be randomly drawn from the population
• Sampling from distributions is the building block of simulation
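
To make these points concrete, here is a minimal sketch in Python (using NumPy; the Gamma population and the sample size are assumptions for illustration) of drawing a randomized sample and comparing its mean to the population mean:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: 100,000 values from a skewed distribution
population = rng.gamma(shape=2.0, scale=5.0, size=100_000)

# A randomized sample: every member of the population has an equal
# chance of selection, which is what precludes bias
sample = rng.choice(population, size=500, replace=False)

print(f"Population mean: {population.mean():.3f}")  # parameter μ
print(f"Sample mean:     {sample.mean():.3f}")      # estimate x̄
```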
Sampling Examples

Use Case                                   Sample                                Population
A/B testing                                The users shown in group A or B       All users
World Cup soccer                           32 teams which qualify in a season    All national teams
Average height of data science students    Students in a data science class      All students in the data science major
Tolerance of a manufactured part           Samples taken from production lines   All parts manufactured
Numbers of a species in a habitat          Counts from sampled habitats          All habitats

• It is not only impractical but, in some cases, impossible to collect data from the entire population
Importance of Random Sampling
• All statistical methods rely on the use of randomized unbiased samples
• Failure to randomize samples violates many key assumptions of statistical
models
• An understanding of proper use of sampling methods is essential to statistical
inference
• Most commonly used machine learning algorithms assume that training data
are unbiased and independent and identically distributed (iid)
• These conditions are only met if training data sample is randomized

• Otherwise, the training data will be biased and not represent the underlying process
distribution
Sampling Distributions
Sampling of a population is done from an unknown population distribution.

• Any statistic we compute for the generating process is based on a sample

• The statistic, s(·), is an approximation of a population parameter


• For example, the mean of the population is μ
• But the sample estimate is the sample mean, x̄
• If we continue to take random samples from the population and compute estimates of a statistic,
we generate a sampling distribution.
• The hypothetical concept of the sampling distribution is a foundation of frequentist statistics
• Example: if we continue generating samples, we compute a sample mean x̄ᵢ for the i-th sample
• Frequentist statistics is built on the idea of randomly resampling the population
distribution and recomputing a statistic
• In the frequentist world, statistical inferences are performed on the sampling distribution
• The sampling process must not bias the estimates of the statistic
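
A minimal sketch of a sampling distribution (the Exponential population, sample size, and number of resamples are assumptions for illustration): repeatedly draw random samples, compute the mean x̄ᵢ of each, and inspect the distribution of those estimates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the unknown population distribution
population = rng.exponential(scale=2.0, size=100_000)

# Repeatedly draw random samples and compute the estimate x̄_i for each
sample_means = np.array([
    rng.choice(population, size=100, replace=False).mean()
    for _ in range(5_000)
])

# The collection of x̄_i values is the sampling distribution of the mean
print(f"Population mean μ:                   {population.mean():.3f}")
print(f"Mean of the sampling distribution:   {sample_means.mean():.3f}")
print(f"Spread of the sampling distribution: {sample_means.std():.3f}")
```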
Sampling and the Law of Large Numbers

The law of large numbers is a theorem that states that statistics of independent
random samples converge to the population values as more samples are used.
• Example: for a population distribution with parameters (μ, σ), the sample mean of n draws is:

x̄ₙ = (1/n) (x₁ + x₂ + ⋯ + xₙ)

Then, by the law of large numbers:

x̄ₙ → μ as n → ∞

This result is reassuring: the larger the sample, the more closely the statistic
converges to the population parameter.
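
A minimal sketch of this convergence (the Normal population and its parameters μ and σ are assumptions for illustration): the running sample mean x̄ₙ approaches μ as n grows:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 10.0, 3.0  # assumed population parameters (μ, σ)

x = rng.normal(mu, sigma, size=100_000)

# Running sample mean x̄_n for n = 1, 2, ..., N
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)

for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"n = {n:>6}: x̄_n = {running_mean[n - 1]:.4f}  (μ = {mu})")
```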
The law of large numbers is foundational to statistics

• We rely on the law of large numbers whenever we work with samples


• We assume that larger samples are more representative of the population we are
sampling
• It is the foundation of sampling theory, plus modern computational methods: simulation,
bootstrap resampling, and Monte Carlo methods
• If the real world did not follow this theorem, then much of statistics (to say nothing of
science and technology as a whole) would have to be rethought.

The law of large numbers has a long history


• Jacob Bernoulli's first proof, for the Binomial distribution, was published posthumously in
1713
• Law of large numbers is sometimes referred to as Bernoulli’s theorem
• A more general proof was published by Poisson in 1837.
Central Limit Theorem (CLT)
CLT is a sort of guarantee:
• The sampling distribution of mean estimates approaches a Normal distribution, regardless of
the population the sample was drawn from
• The standard deviation s of the sampling distribution of x̄ shrinks as n → ∞
• It depends only on the population's mean and variance, and on the sample size
• CLT is the basis for hypothesis testing.

Example: a mixture of Normal distributions
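
A minimal sketch of the CLT for a mixture of Normals (the mixture components and sample size are assumptions for illustration): the population is bimodal, yet the sample means behave approximately like a Normal distribution:

```python
import numpy as np

rng = np.random.default_rng(2)

def draw_mixture(size):
    """Draw from a 50/50 mixture of N(-3, 1) and N(4, 2); clearly non-Normal."""
    use_first = rng.random(size) < 0.5
    return np.where(use_first,
                    rng.normal(-3.0, 1.0, size),
                    rng.normal(4.0, 2.0, size))

# Compute the mean of many samples drawn from the bimodal mixture
sample_means = np.array([draw_mixture(200).mean() for _ in range(10_000)])

# If the CLT holds, roughly 95% of the means fall within ±2 standard
# deviations of their center, as they would for a Normal distribution
center, spread = sample_means.mean(), sample_means.std()
inside = np.mean(np.abs(sample_means - center) < 2 * spread)
print(f"Fraction within ±2 SD: {inside:.3f}")  # ≈ 0.95
```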


Standard Error and Convergence for a Normal Distribution

As we sampled from a Normal distribution, the mean and standard deviation of
the sample converged to the population values
• What can we say about the expected error of the mean estimate as the number
of samples increases?
• This measure is known as the standard error of the sample mean.
• As a corollary of the law of large numbers, the standard error is defined as:

SE = s / √n

• The standard error decreases as the square root of n
• Example: if you wish to halve the error, you will need to sample four times as many values.
• For the mean estimate x̄, define the uncertainty in terms of confidence intervals.
• For example, the 95% confidence interval is x̄ ± 1.96 · SE
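
A minimal sketch of these two points (the standard Normal population and the sample sizes are assumptions for illustration): each quadrupling of n roughly halves the standard error, and the 95% confidence interval shrinks accordingly:

```python
import numpy as np

rng = np.random.default_rng(3)

# Quadrupling n should halve the standard error (SE = s / √n)
for n in (100, 400, 1_600):
    sample = rng.normal(0.0, 1.0, size=n)   # standard Normal population
    x_bar = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(n)
    lo, hi = x_bar - 1.96 * se, x_bar + 1.96 * se   # 95% confidence interval
    print(f"n = {n:>5}: x̄ = {x_bar:+.4f}, SE = {se:.4f}, "
          f"95% CI = ({lo:+.4f}, {hi:+.4f})")
```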
Convergence and Standard Errors for a Normal Distribution

[Figure: mean estimates for realizations of a standard Normal distribution, shown with standard error bars]
Sampling Strategies

There are a great number of possible sampling methods.


• Some of the most commonly used methods:
• Bernoulli sampling, a foundation of random sampling
• Stratified sampling, when groups with different characteristics must be sampled
• Cluster sampling, to reduce cost of sampling
• Systematic sampling and convenience sampling, a slippery slope
Bernoulli Sampling
• Bernoulli sampling is a widely used foundational random sampling strategy
• Bernoulli sampling has the following properties:
• A single random sample of the population is created
• A particular value in the population is selected based on the outcome of a
Bernoulli trial with fixed probability of success, p

• Example: a company sells a product by weight


• The packaging process must be monitored so that few packages are underweight
• It is impractical to empty and weigh the contents of every package
• Instead, Bernoulli sample packages from the production line and weigh their contents
• Statistical inferences are made from the sample
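
A minimal sketch of Bernoulli sampling for this example (the package weights and the selection probability p are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical package weights from a production line, in grams
weights = rng.normal(loc=500.0, scale=5.0, size=10_000)

# Bernoulli sampling: each package is selected independently with
# fixed probability of success p (one Bernoulli trial per package)
p = 0.02
selected = rng.random(weights.size) < p
sample = weights[selected]

print(f"Sampled {sample.size} of {weights.size} packages")
print(f"Estimated mean weight: {sample.mean():.2f} g")
print(f"Estimated fraction underweight (< 495 g): {(sample < 495).mean():.3f}")
```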
Sampling Grouped Data
Grouped data are quite common in applications

• A few examples include:


• Polling opinion by county and income group, where income groups and
counties have significant differences in population
• Testing a drug which may have different effectiveness by sex and ethnic group
• Spectral characteristics of stars by type
Stratified Sampling
What is a sampling strategy for grouped or stratified data?
• Stratified sampling strategies are used when data are organized in strata
• Simple idea: independently sample an equal number of cases from each stratum
• The simplest version of stratified sampling creates an equal-size Bernoulli
sample from each stratum
• In many cases, nested samples are required
• For example, a top-level sample can be grouped by zip code, a geographic stratum

• Within each zip code, people are then sampled by income-bracket strata

• Equal-sized Bernoulli samples are collected at the lowest level
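
A minimal sketch of equal-size stratified sampling (the strata and the population are assumptions for illustration), using pandas to draw the same number of cases from each stratum:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)

# Hypothetical population with strata of very different sizes
population = pd.DataFrame({
    "income_bracket": rng.choice(["low", "middle", "high"],
                                 size=10_000, p=[0.5, 0.4, 0.1]),
    "opinion_score": rng.normal(50.0, 10.0, size=10_000),
})

# Stratified sampling: an equal-size random sample from each stratum
stratified = population.groupby("income_bracket").sample(n=100, random_state=42)

print(stratified["income_bracket"].value_counts())  # 100 cases per stratum
```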


Cluster Sampling
When sampling is expensive, a strategy is required to reduce the cost

• Examples of expensive-to-collect data:


• Surveys of customers at a chain of stores

• Door-to-door surveys of homeowners

• Sampling wildlife populations in a dispersed habitat

• Population can be divided into randomly selected clusters:


- Define the clusters for the population
- Randomly select the required number of clusters
- Sample from selected clusters
- Optionally, stratify the sample within each cluster
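
A minimal sketch of these four steps (the stores-as-clusters population is an assumption for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)

# Hypothetical survey population: customers spread across 50 stores
customers = pd.DataFrame({
    "store_id": rng.integers(0, 50, size=20_000),
    "spend": rng.gamma(2.0, 30.0, size=20_000),
})

# 1. Define the clusters (stores); 2. randomly select the required number
chosen_stores = rng.choice(customers["store_id"].unique(), size=10, replace=False)

# 3. Sample only within the selected clusters (far cheaper than all 50 stores)
in_clusters = customers[customers["store_id"].isin(chosen_stores)]
sample = in_clusters.sample(n=500, random_state=42)

print(f"Surveyed {len(sample)} customers from {len(chosen_stores)} stores")
```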
Systematic Sampling

Convenience and systematic sampling are a slippery slope toward biased
inferences

• Systematic methods lack randomization

• Example: selecting every k-th case of the population
• Convenience sampling selects the cases that are easiest to obtain
• A commonly cited example is known as database sampling

• Example: taking the first N rows resulting from a database query
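
A minimal sketch of systematic sampling (the ordered population is an assumption for illustration), which also shows where the bias risk comes from:

```python
import numpy as np

rng = np.random.default_rng(7)
population = np.arange(10_000, dtype=float)  # ordered records, e.g. a database table

# Systematic sampling: take every k-th case after a (possibly random) start.
# There is no randomization beyond the start point, so any periodic
# structure in the ordering aligned with k will bias the sample.
k = 20
start = int(rng.integers(0, k))
systematic_sample = population[start::k]

print(f"Systematic sample size: {systematic_sample.size}")  # ≈ 10_000 / k
```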


Introduction to Simulation

Simulation enables data scientists to study the behavior of stochastic processes with
complex probability distributions

• Most real-world processes have complex behavior, resulting in complex distributions of


output values
• Simulation is a practical approach to understanding these complex processes
• Two main purposes of simulation can be summarized as:
• Testing models: If data simulated from the model do not resemble the original data, something is
likely wrong

• Understand processes with complex probability distributions: In these cases, simulation provides a
powerful and flexible computational technique to understand behavior
• As cheap computational power has become ubiquitous, simulation has become
a widely used technique
• Simulations compute a large number of cases, or realizations
• The computing cost of each realization must be low in any practical simulation

• Realizations are drawn from complex probability distributions of the process model
• In many cases, realizations are computed using conditional probability
distributions
• The final or posterior distribution of the process is composed of these realizations
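
A minimal sketch of simulation by realization (the Poisson-arrivals and Normal-spend model is an assumption for illustration): each realization is drawn cheaply from conditional distributions, and the collected realizations approximate a complex output distribution:

```python
import numpy as np

rng = np.random.default_rng(8)
n_realizations = 100_000  # many cheap realizations

# Each realization is drawn from conditional distributions:
# arrivals ~ Poisson(λ), then total spend | arrivals ~ Normal.
arrivals = rng.poisson(lam=50, size=n_realizations)
total_spend = rng.normal(loc=12.0 * arrivals, scale=3.0 * np.sqrt(arrivals))

# The realizations approximate the process's complex output distribution
print(f"Mean total spend: {total_spend.mean():.2f}")
print(f"5th / 95th percentiles: {np.percentile(total_spend, [5, 95]).round(2)}")
```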
Representation as a Directed Acyclic Graphical Model

When creating a simulation with multiple conditionally dependent variables, it is useful to draw a directed graph:
a directed acyclic graphical model, or DAG

• The graph is a communication device showing which variables are independent and which are conditionally
dependent on others, with the node shapes representing the node types

• Probability distributions of the variables are shown as ellipses


• Distributions have parameters which must be estimated

• Decision variables are deterministic and are shown as rectangles


• Decisions are determined by variables

• Setting decision variables can be performed either manually or automatically

• Utility nodes, profit in this case, are shown as diamonds


• Nodes represent a utility function given the dependencies in the graph

• Utility calculations are deterministic given the input values

• Directed edges show the dependency structure of the distributions


Sandwich Shop Simulation

The sandwich shop simulation can be represented by a DAG

[Figure: DAG for the sandwich shop simulation]
Interpreting the DAG

• The DAG is a shorthand description of the simulation model


• Leaves of the DAG are independent distributions
• Parameters must be known or estimated

• Can be useful to vary the parameters


• Child distributions are conditional on their parents
• Parameters must be known or estimated

• Resulting distribution can be quite complex


• Decision variables deterministically change the model parameters
• Utility node uses a fixed deterministic formula to compute the value for each
realization of the simulation.
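
A minimal sketch of a simulation consistent with the DAG just described; the demand distribution, prices, and profit formula are assumptions for illustration, not the course's actual sandwich shop model. Varying the decision variable n_prepared, manually or in a loop, shows how the DAG supports decision making:

```python
import numpy as np

rng = np.random.default_rng(9)
n_realizations = 100_000

# Decision variable (rectangle in the DAG): sandwiches prepared each day
n_prepared = 120

# Independent leaf distribution (ellipse): daily customer demand
demand = rng.poisson(lam=100, size=n_realizations)

# Conditional child variable: sales depend on demand and on the decision
sales = np.minimum(demand, n_prepared)

# Utility node (diamond): deterministic profit formula for each realization
price, unit_cost = 7.0, 2.5
profit = price * sales - unit_cost * n_prepared

print(f"Expected daily profit: {profit.mean():.2f}")
print(f"5th percentile (a bad day): {np.percentile(profit, 5):.2f}")
```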
Summary
Sampling is a fundamental process in the collection and analysis of data

• Sampling is important because we almost never have data on a whole population


• Sampling must be randomized to preclude biases
• As sample size increases the standard error of a statistic computed from the
sample decreases by the law of large numbers
• Key points to keep in mind:
• Understanding sampling is essential to ensure data is representative of the entire population

• Use inferences from the sample to say something about the population

• The sample must be randomly drawn from the population


• Sampling from distributions is the building block of simulation
