
Sampling and Simulation

INFO5501: Fundamentals of Data Analytics

Ref: Fundamentals of Data Science by Stephen Elston, at: https://github.com/StephenElston/CSCI-E-83


Introduction
• Sampling is a fundamental process in the collection and analysis of data
• Sampling is important because we almost never have data on a whole
population
• Sampling needs to be randomized to preclude biases
• As sample size increases, the standard error of a statistic computed from
the sample decreases by the law of large numbers
• Key points to keep in mind:
• Understanding sampling is essential to ensure data is representative of the entire
population
• Use inferences from the sample to say something about the population
• The sample must be randomly drawn from the population
• Sampling from distributions is the building block of simulation
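
To make these points concrete, here is a minimal sketch in Python (using NumPy; the Gamma population and the sample size are assumptions for illustration) of drawing a randomized sample and comparing its mean to the population mean:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: 100,000 values from a skewed distribution
population = rng.gamma(shape=2.0, scale=5.0, size=100_000)

# A randomized sample: every member of the population has an equal
# chance of selection, which is what precludes bias
sample = rng.choice(population, size=500, replace=False)

print(f"Population mean: {population.mean():.3f}")  # parameter μ
print(f"Sample mean:     {sample.mean():.3f}")      # estimate x̄
```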
Sampling Examples

Use Case                                   Sample                                Population
A/B testing                                The users shown in group A or B       All users
World Cup soccer                           32 teams which qualify in a season    All national teams
Average height of data science students    Students in a data science class      All students in the data science major
Tolerance of a manufactured part           Samples taken from production lines   All parts manufactured
Numbers of a species in a habitat          Counts from sampled habitats          All habitats

• It is not only impractical but, in some cases, impossible to collect data from the entire population
Importance of Random Sampling
• All statistical methods rely on the use of randomized unbiased samples
• Failure to randomize samples violates many key assumptions of statistical
models
• An understanding of proper use of sampling methods is essential to statistical
inference
• Most commonly used machine learning algorithms assume that training data
are unbiased and independent and identically distributed (iid)
• These conditions are only met if training data sample is randomized

• Otherwise, the training data will be biased and not represent the underlying process
distribution
Sampling Distributions
Sampling of a population is done from an unknown population distribution.

• Any statistic we compute for the generating process is based on a sample

• The statistic, s(·), is an approximation of a population parameter


• For example, the mean of the population is μ
• But the sample estimate is the sample mean, x̄
• If we continue to take random samples from the population and compute estimates of a statistic,
we generate a sampling distribution.
• The hypothetical concept of the sampling distribution is a foundation of frequentist statistics
• Example: if we continue generating samples, we compute a sample mean x̄ᵢ for the i-th sample
• Frequentist statistics is built on the idea of randomly resampling the population
distribution and recomputing a statistic
• In the frequentist world, statistical inferences are performed on the sampling distribution
• The sampling process must not bias the estimates of the statistic
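
A minimal sketch of a sampling distribution (the Exponential population, sample size, and number of resamples are assumptions for illustration): repeatedly draw random samples, compute the mean x̄ᵢ of each, and inspect the distribution of those estimates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the unknown population distribution
population = rng.exponential(scale=2.0, size=100_000)

# Repeatedly draw random samples and compute the estimate x̄_i for each
sample_means = np.array([
    rng.choice(population, size=100, replace=False).mean()
    for _ in range(5_000)
])

# The collection of x̄_i values is the sampling distribution of the mean
print(f"Population mean μ:                   {population.mean():.3f}")
print(f"Mean of the sampling distribution:   {sample_means.mean():.3f}")
print(f"Spread of the sampling distribution: {sample_means.std():.3f}")
```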
Sampling and the Law of Large Numbers

The law of large numbers is a theorem that states that statistics of independent
random samples converge to the population values as more samples are used.
• Example: for a population distribution with parameters (μ, σ), the sample mean of n draws is:

x̄ₙ = (1/n) (x₁ + x₂ + ⋯ + xₙ)

Then, by the law of large numbers:

x̄ₙ → μ as n → ∞

This result is reassuring: the larger the sample, the more closely the statistic
converges to the population parameter.
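
A minimal sketch of this convergence (the Normal population and its parameters μ and σ are assumptions for illustration): the running sample mean x̄ₙ approaches μ as n grows:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 10.0, 3.0  # assumed population parameters (μ, σ)

x = rng.normal(mu, sigma, size=100_000)

# Running sample mean x̄_n for n = 1, 2, ..., N
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)

for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"n = {n:>6}: x̄_n = {running_mean[n - 1]:.4f}  (μ = {mu})")
```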
The law of large numbers is foundational to statistics

• We rely on the law of large numbers whenever we work with samples


• We assume that larger samples are more representative of the population we are
sampling
• It is the foundation of sampling theory, plus modern computational methods: simulation,
bootstrap resampling, and Monte Carlo methods
• If the real world did not follow this theorem, then much of statistics (to say nothing of
science and technology as a whole) would have to be rethought.

The law of large numbers has a long history


• Jacob Bernoulli's first proof, for the Binomial distribution, was published posthumously in
1713
• Law of large numbers is sometimes referred to as Bernoulli’s theorem
• A more general proof was published by Poisson in 1837.
Central Limit Theorem (CLT)
CLT is a sort of guarantee:
• The sampling distribution of mean estimates approaches a Normal distribution, regardless of
the population the sample was drawn from
• The standard deviation s of the sampling distribution of x̄ shrinks as n → ∞
• It depends only on the population's mean and variance, and on the sample size
• CLT is the basis for hypothesis testing.

Example: a mixture of Normal distributions
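
A minimal sketch of the CLT for a mixture of Normals (the mixture components and sample size are assumptions for illustration): the population is bimodal, yet the sample means behave approximately like a Normal distribution:

```python
import numpy as np

rng = np.random.default_rng(2)

def draw_mixture(size):
    """Draw from a 50/50 mixture of N(-3, 1) and N(4, 2); clearly non-Normal."""
    use_first = rng.random(size) < 0.5
    return np.where(use_first,
                    rng.normal(-3.0, 1.0, size),
                    rng.normal(4.0, 2.0, size))

# Compute the mean of many samples drawn from the bimodal mixture
sample_means = np.array([draw_mixture(200).mean() for _ in range(10_000)])

# If the CLT holds, roughly 95% of the means fall within ±2 standard
# deviations of their center, as they would for a Normal distribution
center, spread = sample_means.mean(), sample_means.std()
inside = np.mean(np.abs(sample_means - center) < 2 * spread)
print(f"Fraction within ±2 SD: {inside:.3f}")  # ≈ 0.95
```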


Standard Error and Convergence for a Normal Distribution

As we sampled from a Normal distribution, the mean and standard deviation of
the sample converged to the population values
• What can we say about the expected error of the mean estimate as the number
of samples increases?
• This measure is known as the standard error of the sample mean.
• As a corollary of the law of large numbers, the standard error is defined as:

SE = s / √n

• The standard error decreases as the square root of n
• Example: if you wish to halve the error, you will need to sample four times as many values.
• For the mean estimate x̄, define the uncertainty in terms of confidence intervals.
• For example, the 95% confidence interval is x̄ ± 1.96 · SE
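
A minimal sketch of these two points (the standard Normal population and the sample sizes are assumptions for illustration): each quadrupling of n roughly halves the standard error, and the 95% confidence interval shrinks accordingly:

```python
import numpy as np

rng = np.random.default_rng(3)

# Quadrupling n should halve the standard error (SE = s / √n)
for n in (100, 400, 1_600):
    sample = rng.normal(0.0, 1.0, size=n)   # standard Normal population
    x_bar = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(n)
    lo, hi = x_bar - 1.96 * se, x_bar + 1.96 * se   # 95% confidence interval
    print(f"n = {n:>5}: x̄ = {x_bar:+.4f}, SE = {se:.4f}, "
          f"95% CI = ({lo:+.4f}, {hi:+.4f})")
```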
Convergence and Standard Errors for a Normal Distribution

[Figure: mean estimates for realizations of a standard Normal distribution, shown with standard error bars]
Sampling Strategies

There are a great number of possible sampling methods.


• Some of the most commonly used methods:
• Bernoulli sampling, a foundation of random sampling
• Stratified sampling, when groups with different characteristics must be sampled
• Cluster sampling, to reduce cost of sampling
• Systematic sampling and convenience sampling, a slippery slope
Bernoulli Sampling
• Bernoulli sampling is a widely used foundational random sampling strategy
• Bernoulli sampling has the following properties:
• A single random sample of the population is created
• A particular value in the population is selected based on the outcome of a
Bernoulli trial with fixed probability of success, p

• Example: a company sells a product by weight


• The packaging process must be monitored so that few packages are underweight
• It is impractical to empty and weigh the contents of every package
• Instead, Bernoulli sample packages from the production line and weigh their contents
• Statistical inferences are made from the sample
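
A minimal sketch of Bernoulli sampling for this example (the package weights and the selection probability p are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical package weights from a production line, in grams
weights = rng.normal(loc=500.0, scale=5.0, size=10_000)

# Bernoulli sampling: each package is selected independently with
# fixed probability of success p (one Bernoulli trial per package)
p = 0.02
selected = rng.random(weights.size) < p
sample = weights[selected]

print(f"Sampled {sample.size} of {weights.size} packages")
print(f"Estimated mean weight: {sample.mean():.2f} g")
print(f"Estimated fraction underweight (< 495 g): {(sample < 495).mean():.3f}")
```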
Sampling Grouped Data
Grouped data are quite common in applications

• A few examples include:


• Polling opinion by county and income group, where income groups and
counties have significant differences in population
• Testing a drug which may have different effectiveness by sex and ethnic group
• Spectral characteristics of stars by type
Stratified Sampling
What is a sampling strategy for grouped or stratified data?
• Stratified sampling strategies are used when data are organized in strata
• Simple idea: independently sample an equal number of cases from each stratum
• The simplest version of stratified sampling creates an equal-size Bernoulli
sample from each stratum
• In many cases, nested samples are required
• For example, a top-level sample can be grouped by zip code, a geographic stratum

• Within each zip code, people are then sampled by income-bracket strata

• Equal-sized Bernoulli samples are collected at the lowest level
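
A minimal sketch of equal-size stratified sampling (the strata and the population are assumptions for illustration), using pandas to draw the same number of cases from each stratum:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)

# Hypothetical population with strata of very different sizes
population = pd.DataFrame({
    "income_bracket": rng.choice(["low", "middle", "high"],
                                 size=10_000, p=[0.5, 0.4, 0.1]),
    "opinion_score": rng.normal(50.0, 10.0, size=10_000),
})

# Stratified sampling: an equal-size random sample from each stratum
stratified = population.groupby("income_bracket").sample(n=100, random_state=42)

print(stratified["income_bracket"].value_counts())  # 100 cases per stratum
```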


Cluster Sampling
When sampling is expensive, a strategy is required to reduce the cost

• Examples of expensive-to-collect data:


• Surveys of customers at a chain of stores

• Door-to-door surveys of homeowners

• Sampling wildlife populations in a dispersed habitat

• Population can be divided into randomly selected clusters:


- Define the clusters for the population
- Randomly select the required number of clusters
- Sample from selected clusters
- Optionally, stratify the sample within each cluster
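
A minimal sketch of these four steps (the stores-as-clusters population is an assumption for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)

# Hypothetical survey population: customers spread across 50 stores
customers = pd.DataFrame({
    "store_id": rng.integers(0, 50, size=20_000),
    "spend": rng.gamma(2.0, 30.0, size=20_000),
})

# 1. Define the clusters (stores); 2. randomly select the required number
chosen_stores = rng.choice(customers["store_id"].unique(), size=10, replace=False)

# 3. Sample only within the selected clusters (far cheaper than all 50 stores)
in_clusters = customers[customers["store_id"].isin(chosen_stores)]
sample = in_clusters.sample(n=500, random_state=42)

print(f"Surveyed {len(sample)} customers from {len(chosen_stores)} stores")
```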
Systematic Sampling

Convenience and systematic sampling are a slippery slope toward biased
inferences

• Systematic methods lack randomization

• Example: selecting every k-th case of the population
• Convenience sampling selects the cases that are easiest to obtain
• A commonly cited example is known as database sampling

• Example: taking the first N rows resulting from a database query
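
A minimal sketch of systematic sampling (the ordered population is an assumption for illustration), which also shows where the bias risk comes from:

```python
import numpy as np

rng = np.random.default_rng(7)
population = np.arange(10_000, dtype=float)  # ordered records, e.g. a database table

# Systematic sampling: take every k-th case after a (possibly random) start.
# There is no randomization beyond the start point, so any periodic
# structure in the ordering aligned with k will bias the sample.
k = 20
start = int(rng.integers(0, k))
systematic_sample = population[start::k]

print(f"Systematic sample size: {systematic_sample.size}")  # ≈ 10_000 / k
```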


Introduction to Simulation

Simulation enables data scientists to study the behavior of stochastic processes with
complex probability distributions

• Most real-world processes have complex behavior, resulting in complex distributions of


output values
• Simulation is a practical approach to understanding these complex processes
• Two main purposes of simulation can be summarized as:
• Testing models: If data simulated from the model do not resemble the original data, something is
likely wrong

• Understand processes with complex probability distributions: In these cases, simulation provides a
powerful and flexible computational technique to understand behavior
• As cheap computational power has become ubiquitous, simulation has become
a widely used technique
• Simulations compute a large number of cases, or realizations
• The computing cost of each realization must be low in any practical simulation

• Realizations are drawn from complex probability distributions of the process model
• In many cases, realizations are computed using conditional probability
distributions
• The final or posterior distribution of the process is composed of these realizations
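
A minimal sketch of simulation by realization (the Poisson-arrivals and Normal-spend model is an assumption for illustration): each realization is drawn cheaply from conditional distributions, and the collected realizations approximate a complex output distribution:

```python
import numpy as np

rng = np.random.default_rng(8)
n_realizations = 100_000  # many cheap realizations

# Each realization is drawn from conditional distributions:
# arrivals ~ Poisson(λ), then total spend | arrivals ~ Normal.
arrivals = rng.poisson(lam=50, size=n_realizations)
total_spend = rng.normal(loc=12.0 * arrivals, scale=3.0 * np.sqrt(arrivals))

# The realizations approximate the process's complex output distribution
print(f"Mean total spend: {total_spend.mean():.2f}")
print(f"5th / 95th percentiles: {np.percentile(total_spend, [5, 95]).round(2)}")
```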
Representation as a Directed Acyclic Graphical Model

When creating a simulation with multiple conditionally dependent variables, it is useful to draw a directed graph:
a directed acyclic graphical model, or DAG

• The graph is a communication device showing which variables are independent and which are conditionally
dependent on others, with the node shapes representing the node types

• Probability distributions of the variables are shown as ellipses


• Distributions have parameters which must be estimated

• Decision variables are deterministic and are shown as rectangles


• Decisions are determined by variables

• Setting decision variables can be performed either manually or automatically

• Utility nodes, profit in this case, are shown as diamonds


• Nodes represent a utility function given the dependencies in the graph

• Utility calculations are deterministic given the input values

• Directed edges show the dependency structure of the distributions


Sandwich Shop Simulation

The sandwich shop simulation can be represented by a DAG

[Figure: DAG for the sandwich shop simulation]
Interpreting the DAG

• The DAG is a shorthand description of the simulation model


• Leaves of the DAG are independent distributions
• Parameters must be known or estimated

• Can be useful to vary the parameters


• Child distributions are conditional on their parents
• Parameters must be known or estimated

• Resulting distribution can be quite complex


• Decision variables deterministically change the model parameters
• Utility node uses a fixed deterministic formula to compute the value for each
realization of the simulation.
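
A minimal sketch of a simulation consistent with the DAG just described; the demand distribution, prices, and profit formula are assumptions for illustration, not the course's actual sandwich shop model. Varying the decision variable n_prepared, manually or in a loop, shows how the DAG supports decision making:

```python
import numpy as np

rng = np.random.default_rng(9)
n_realizations = 100_000

# Decision variable (rectangle in the DAG): sandwiches prepared each day
n_prepared = 120

# Independent leaf distribution (ellipse): daily customer demand
demand = rng.poisson(lam=100, size=n_realizations)

# Conditional child variable: sales depend on demand and on the decision
sales = np.minimum(demand, n_prepared)

# Utility node (diamond): deterministic profit formula for each realization
price, unit_cost = 7.0, 2.5
profit = price * sales - unit_cost * n_prepared

print(f"Expected daily profit: {profit.mean():.2f}")
print(f"5th percentile (a bad day): {np.percentile(profit, 5):.2f}")
```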
Summary
Sampling is a fundamental process in the collection and analysis of data

• Sampling is important because we almost never have data on a whole population


• Sampling must be randomized to preclude biases
• As sample size increases the standard error of a statistic computed from the
sample decreases by the law of large numbers
• Key points to keep in mind:
• Understanding sampling is essential to ensure data is representative of the entire population

• Use inferences from the sample to say something about the population

• The sample must be randomly drawn from the population


• Sampling from distributions is the building block of simulation
