STAB22 Lecture's Notes

The document provides an overview of describing and summarizing categorical and quantitative data. It discusses topics like frequency tables, histograms, measures of center and spread. Conditional distributions and Simpson's paradox are also covered. Different ways to visually represent data distributions are described.

Uploaded by

isabella branton
Copyright
© © All Rights Reserved

STAB22 Lecture Notes

Lecture 1: Describing Categorical Data


What is Statistics?
❖ Way of reasoning, along with a collection of tools and methods
designed to help us understand the world using data
o Statistics can be used to
▪ Describe and present a particular situation
▪ Test competing claims/theories
▪ Make future predictions
Data
❖ Individual Cases (Individuals)
o Individuals are the objects described by a set of data
o Cases can be people, animals, things etc.
❖ A variable is any characteristic of an individual case
o Variables can take different values on different cases
❖ Data is usually organized in tables, where
o Rows represent cases
o Columns represent variables
❖ Know your data
o When planning a statistical study or exploring data from someone
else’s work, ask yourself ‘Who, What, Where, When, Why’
Types of Variables
❖ Categorical Variables
o Place an individual into one of several groups or categories
▪ Ex: Gender, college major etc.
o Two types of categorical variables
▪ Nominal- unordered categories
• Major: Mathematics, Statistics etc.
• Criminal offence convictions: murder, robbery,
assault
▪ Ordinal- ordered categories
• Patient condition: excellent, good, fair, poor
❖ Quantitative Variables
o Take numerical values for which arithmetic operations are
defined
▪ Ex: Height, GPA

Describing Categorical Data


❖ Categorical Data: individual cases fall into one of several groups or
categories
o Ex: Student grades range from A-F
❖ Several methods for describing categorical data
o Tables
▪ Frequency Table
• List ALL categories and their frequencies
o Ex: student course grades
• Describes distribution of a categorical variable
o Ex: value & how often it occurs

▪ Relative Frequency Table
• List ALL categories and their relative
frequencies
o Ex: Proportion or percentage of cases in each category

▪ Contingency Tables
• Two-way table lists frequencies for each combination
of the two categorical variables
o Ex: Student grade by year

• Can also report relative frequencies as % of total
number of cases (divide each cell count by total
number)
• Margins of contingency table give distribution of
each categorical variable separately
o Also known as marginal distribution

o Charts
▪ Pie Chart
• Categories represented by slices, where size of each
slice is proportional to relative frequency
▪ Bar Chart

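• Categories represented by bars, where the height of
each bar is proportional to frequency (or relative frequency)

▪ Side-by-side Bar Chart
• Bars for different groups are placed next to each other so
their distributions can be compared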
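The tables described above can be sketched in a few lines of Python. This is only an illustration: the grades and years below are invented, and the variable names are assumptions, not part of the course material.

```python
from collections import Counter

# Hypothetical course grades and years, invented purely for illustration.
grades = ["A", "B", "B", "C", "A", "B", "D", "C", "B", "A"]
years = [2019, 2019, 2020, 2020, 2019, 2020, 2019, 2020, 2019, 2020]

# Frequency table: each category and how often it occurs.
freq = Counter(grades)

# Relative frequency table: each count divided by the total number of cases.
total = sum(freq.values())
rel_freq = {grade: count / total for grade, count in freq.items()}

# Contingency table: counts for each (year, grade) combination.
contingency = Counter(zip(years, grades))

# Marginal distribution of year: sum the table over the grade categories.
year_margin = Counter()
for (year, _grade), count in contingency.items():
    year_margin[year] += count

for grade in sorted(freq):
    print(grade, freq[grade], f"{rel_freq[grade]:.0%}")
```

The relative frequencies add up to 1, and the margins of the contingency table recover the frequency table of each variable on its own.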
Describing Two Categorical Variables
❖ Consider two categorical variables for each individual case
o Ex: Student course grades (variable 1) & year the course was
taken (variable 2)
❖ Want to describe joint behavior of both categorical variables
Conditional Distributions
❖ Often need distribution of one variable for a particular value of the
other
o Ex: What percentage of males get accepted/rejected from
university?
❖ Fix value of one variable and look at distribution of the other for that
value only
o Ex: Conditional distribution of decision, conditional on gender =
male
❖ Can condition on either variable (i.e. rows or columns) of contingency
tables
o Ex: conditional distributions of decision for gender =
male/female

❖ If conditional distributions of one variable are the same for every value
of the other, we say the two variables are independent
o Ex: If conditional distribution of student course grades does not
change with the year then the grade is independent of the year
❖ Compare conditional distributions visually using side-by-side plot
(double bar-graph)
Simpson’s Paradox
❖ An association seen in the overall (aggregated) data can weaken, disappear or
even reverse when we look at the data separately within groups
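A small numerical sketch may make the paradox concrete. The admission counts below are hypothetical, chosen only so that the reversal shows up; they do not come from the lecture.

```python
# Hypothetical admissions counts: (admitted, applicants) by department and gender.
data = {
    "Dept A": {"male": (80, 100), "female": (18, 20)},
    "Dept B": {"male": (4, 20), "female": (30, 100)},
}

def rate(admitted, applicants):
    """Acceptance rate as a proportion."""
    return admitted / applicants

# Conditional acceptance rates within each department.
within = {
    dept: {gender: rate(*counts) for gender, counts in groups.items()}
    for dept, groups in data.items()
}

# Overall acceptance rates after pooling the two departments.
overall = {}
for gender in ("male", "female"):
    admitted = sum(data[dept][gender][0] for dept in data)
    applicants = sum(data[dept][gender][1] for dept in data)
    overall[gender] = rate(admitted, applicants)

print(within)   # females have the higher rate inside every department
print(overall)  # yet males have the higher rate overall
```

In this made-up data the reversal arises because most women applied to the department with the lower acceptance rate.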
Lecture 2: Displaying and Summarizing Quantitative Data
Describing Quantitative Data
❖ Variables measure numerical quantities
o Arithmetic operations make sense
▪ Ex: Breakfast cereal data (calories per serving)
❖ Described using
o Plots: histograms, boxplot
o Numerical summaries: mean, median, range etc.
❖ For any quantitative variable we want to describe the overall pattern
of its distribution
o Shape- peaks (modes), symmetry, outliers
o Centre- mean, median
o Spread- range, standard deviation
Histogram
❖ Create artificial categories called bins or classes that group the values of
the quantitative variable
o Ex: calorie classes; 40-60, 60-80 etc.
▪ Classes span ALL data and DON’T overlap
❖ Pretend variable is categorical and create bar plot
Shape of Distribution
❖ Modality
o Check for modes/peaks
▪ Also known as the most frequently occurring values

❖ Symmetry
o Distribution is called symmetric if, when we draw a vertical line
down its center the two sides are similar shape and size

❖ Skewness
o Unimodal distributions with one tail longer than the other are
called skewed

❖ Outliers
o Any data that lie far off the main body of the distribution are
called outliers

Centre of Distribution
❖ Mean
o Sum of all values of quantitative variable divided by the number
of values
▪ All the actual values added up divided by how many values
there actually are
o Generally, for n sample values x1, x2, ..., xn, the sample mean is given
by
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$
o Mean is the center of gravity
▪ Good representative value BUT sensitive to outliers
❖ Median
o Midpoint of all values, after they are ordered from the smallest to
the largest
▪ For an even number of values the median is the mean of the two middle
values
• Add them up and divide by two
o Median is robust to extreme values
▪ Prefer median when data has outliers
❖ Mean, Median & Shape
o For symmetric distributions

▪ MEAN = MEDIAN
o For skewed distributions
▪ MEAN ≠ MEDIAN (the mean is pulled toward the longer tail)
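As a quick check of how outliers affect the two measures of center, the sketch below uses Python's standard `statistics` module. The extreme value 100 is a hypothetical addition made only for illustration.

```python
from statistics import mean, median

values = [4, 6, 8, 8, 10, 15]     # small illustrative data set
with_outlier = values + [100]     # hypothetical extreme value added

print(mean(values), median(values))              # 8.5 and 8.0
print(mean(with_outlier), median(with_outlier))  # the mean jumps, the median stays at 8
```

The mean moves a long way toward the extreme value while the median does not change, which is why the median is preferred when outliers are present.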
Spread of Distribution
❖ Spread describes how much data vary about their center
(dispersion/variability)
o Spread is an important aspect of a distribution
Range
❖ Range= Maximum – Minimum value
o Ex: Data (4, 6, 8, 8, 10, 15)
▪ Range = 15 – 4 = 11
❖ Simple to calculate but not always helpful
o Data set 1: 4, 4, 4, 4, 4, 10
o Data set 2: 4, 5, 6, 7, 8, 9
▪ Similar ranges (6 and 5) but not the same spread
• Almost no variability in the first set apart from one
outlying value (10); the second set is evenly spread
❖ Range is very sensitive to outliers
o Depends only on extreme values
o Better alternative is interquartile range
Quartiles
❖ 3 values that divide distribution into 4 parts, each containing ¼ of data

❖ To find Q1 & Q3, split the ordered data into two halves & find the median of
each half
o When the number of data values is odd, include the median value in each half
Interquartile Range
❖ IQR = Q3 – Q1
o Distance between Q1 and Q3
o Resistant (not sensitive) to outliers
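The quartile rule in these notes (split the ordered data in half, include the median in both halves when the count is odd) can be sketched as follows. The function name `quartiles` is an assumption for illustration, and statistical software may use slightly different conventions and give slightly different values.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, median, Q3) using the split-in-halves rule from the notes."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    if n % 2 == 0:
        lower, upper = xs[:half], xs[half:]
    else:
        # Odd count: the median value is included in both halves.
        lower, upper = xs[:half + 1], xs[half:]
    return median(lower), median(xs), median(upper)

q1, q2, q3 = quartiles([4, 6, 8, 8, 10, 15])
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

For the data 4, 6, 8, 8, 10, 15 the halves are (4, 6, 8) and (8, 10, 15), so Q1 = 6, Q3 = 10 and IQR = 4.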

Variance
❖ Measures average squared deviation of individual data from their mean
❖ For values x1, x2, ..., xn the variance is given by
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
o Units of variance are squared
▪ Easy fix: take square root of s^2 to correct units
Standard Deviation
❖ Both variance (s^2) and standard deviation (s, the square root of s^2) measure
spread around mean
o They are sensitive to outliers; they work best for symmetric
distribution
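The sketch below computes the sample variance by hand (dividing by n - 1, the usual sample convention) and compares it with Python's `statistics` module. The two data sets are the ones from the Range section; the helper name `sample_variance` is an assumption.

```python
from math import sqrt
from statistics import mean, stdev, variance

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    xbar = mean(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)

set1 = [4, 4, 4, 4, 4, 10]
set2 = [4, 5, 6, 7, 8, 9]

for data in (set1, set2):
    s2 = sample_variance(data)
    # The hand computation should agree with the library functions.
    print(s2, variance(data), sqrt(s2), stdev(data))
```

The single outlying value 10 gives the first set the larger standard deviation, which illustrates how sensitive these measures are to outliers.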

Five Number Summary
❖ Set of five measures giving quick summary of distribution
o Measures consist of: Minimum, Q1, Median, Q3 and Maximum
❖ Gives an idea for both center and spread
Boxplot
❖ Visual display of 5-number summary

❖ Boxplot helps identify outliers using ‘fences’

❖ Draw lines from box to farthest value within the fences


❖ Plot values outside of fences individually

o These are suspected outliers that require examination
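The notes do not state how the fences are computed, so the sketch below assumes the common 1.5 × IQR rule (lower fence = Q1 - 1.5 IQR, upper fence = Q3 + 1.5 IQR). The data values and function name are hypothetical.

```python
from statistics import median

def five_number_summary(data):
    """Minimum, Q1, median, Q3, maximum (quartiles as medians of the halves)."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half] if n % 2 == 0 else xs[:half + 1]
    upper = xs[half:]
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

data = [4, 6, 8, 8, 10, 15, 40]   # hypothetical values; 40 is deliberately extreme
minimum, q1, med, q3, maximum = five_number_summary(data)
iqr = q3 - q1

# Assumed convention: fences sit 1.5 IQRs beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = [x for x in data if lower_fence <= x <= upper_fence]
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Whiskers run from the box to the farthest values still inside the fences.
low_whisker, high_whisker = min(inside), max(inside)
print(lower_fence, upper_fence, low_whisker, high_whisker, outliers)
```

Here only the value 40 falls outside the fences, so it would be plotted individually as a suspected outlier.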

Choosing a Summary
❖ Mean and standard deviation work well for symmetric distributions
without outliers
❖ Five-number summary is better for describing skewed distributions
with outliers
❖ In addition to numerical summaries always try to include the plot of the
distribution

Lecture 3: Understanding/Comparing Distributions & SD as a Ruler/Model


Comparing Distributions
• Want to compare distributions of the same quantitative variable for
different groups
o Ex: Cereal carbohydrates VS display shelf group
• Histograms are not always helpful
o Difficult to compare for many groups

• Better alternative: side-by-side boxplots
o Easy to compare
▪ All boxes have same left axis
o Gives lots of information
▪ Centre, spread, shape and outliers
o Easy to read even for many groups

Percentiles
• Can look at what percentile a student stands at within each distribution
• pth percentile: value such that
o p% of the data are below that value
o (100 - p)% of the data are above that value
o Need to know all data values to find percentile

Standardized Values (z-scores)

• Assume we don’t know all other scores in exam but only their mean &
standard deviation
• If x is a value from a distribution with mean $\bar{x}$ and standard deviation
s, the standardized value of x is
$z = \frac{x - \bar{x}}{s}$
• A z-score says how many standard deviations the original observation
falls away from the mean
• Z-scores measure the distance of a value from the mean in standard
deviations
o Ex: A z-score of 2 would indicate a value that is 2 SD’s above the
mean
Standardization

• Standardizing into z-scores does not change the shape of the
distribution of a variable
• Standardizing into z-scores changes the center by making the mean 0
• Standardizing into z-scores changes the spread by making the SD 1
Shifting Data
• Adding same number ‘a’ to each value in data adds ‘a’ to measures of
center & percentiles but does not change the spread
o Does not change the standard deviation or IQR (spread)
• When we shift the data by adding or subtracting a constant to each
value, all measures of position (center, percentiles, minimum, maximum) will
increase or decrease by that constant
Rescaling Data
• Multiplying each value in data by same number ‘b’ multiplies measures
of center & percentiles by ‘b’ and measures of spread by |b|
o Multiplying or dividing each value by a constant changes the
measurement units of the data
▪ This affects all measures
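The effects of shifting, rescaling and standardizing can be verified numerically. The data set and the constants `a` and `b` below are hypothetical choices for illustration.

```python
from statistics import mean, median, stdev

data = [4, 6, 8, 8, 10, 15]   # small illustrative data set
a, b = 10, 2                  # hypothetical shift and scale constants

shifted = [x + a for x in data]   # shifting: add a to every value
scaled = [x * b for x in data]    # rescaling: multiply every value by b

# Standardizing: subtract the mean, then divide by the standard deviation.
xbar, s = mean(data), stdev(data)
z = [(x - xbar) / s for x in data]

print(mean(shifted) - mean(data), stdev(shifted) - stdev(data))  # center moves by a, spread unchanged
print(mean(scaled) / mean(data), stdev(scaled) / stdev(data))    # center and spread both multiplied by b
print(round(mean(z), 10), round(stdev(z), 10))                   # z-scores: mean about 0, SD about 1
```

Standardizing is a shift (by the mean) followed by a rescale (by 1/s), which is why the z-scores end up with mean 0 and standard deviation 1.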
Distribution Models
• Z-scores measure distances from mean in terms of units of standard
deviations
o Describe how big/small a value is but doesn’t tell what
proportion falls below/above value
• To find proportions from mean & SD need theoretical model of
distribution
o Distribution models described by density curve
Density Curve
• A model for the frequency distribution of data using areas under the
curve to represent relative frequencies
• Consider histogram of exam score data
o Distribution could be approximately described by some density
curve

• The area under a density curve between any two numbers will represent
the proportion of the data that lie between those same numbers

• Any density curve must satisfy the following conditions
o Always positive or zero
o Total area under the curve above the x-axis equal to 1.00

Normal Density Curve
• Most important density is normal density curve
o Used to describe bell-shaped distributions
o Defined by 2 parameters
▪ Mean: μ defines the center
▪ SD: σ defines the spread
o All normal curves have the same bell shape but different μ, σ

Proportions from Histogram
• Proportion of data in interval [a,b] is equal to total relative frequency of
bars between interval

Proportions from Density


• Proportion of data in interval [a,b] is equal to area under density curve
falling within interval

Standard Normal
• Standard Normal: normal distribution with mean μ=0 and standard deviation σ=1
o Is only applicable if the distribution is unimodal and symmetric to begin
with
• Can find any proportion for standard Normal using Table Z or software

• Can find any proportion for any normal by converting to z-scores
and using the standard Normal
• Nearly normal condition: the shape of the data’s distribution is
unimodal and fairly symmetric (Check by making histogram)
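Where Table Z is not at hand, software can supply the same proportions. The sketch below uses `NormalDist` from Python's standard `statistics` module (Python 3.8 or later). The exam-score model with mean 70 and SD 10 is a hypothetical example.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Proportion of a standard Normal distribution lying below z = 1.
below_1 = std_normal.cdf(1.0)

# Hypothetical exam scores modeled as Normal with mean 70 and SD 10.
mu, sigma = 70, 10
a, b = 60, 85

# Convert the interval endpoints to z-scores, then use the standard Normal.
z_a, z_b = (a - mu) / sigma, (b - mu) / sigma
prop = std_normal.cdf(z_b) - std_normal.cdf(z_a)

# The same proportion computed directly from the Normal(70, 10) model.
model = NormalDist(mu, sigma)
direct = model.cdf(b) - model.cdf(a)

print(round(below_1, 4), round(prop, 4), round(direct, 4))
```

Both routes give the same area under the density curve, which is the point of converting to z-scores.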

Checking for Normality
• Normal density is theoretical model; when should we use it to describe
real data?
• Can check histogram
o Look for bell-shape (uni-modal & symmetric)
Normal Probability Plot
• Plot data values against their theoretical Normal z-score
• If points lie close to straight line the data is well described by normal

Lecture 4: Scatterplots, Association and Correlation
Scatterplot
• Variables measured along horizontal (x-) and vertical (y-) axis; each
dot represents combination of corresponding individual's values

Roles of Variables
• Usually there is a variable of interest (response/dependent variable) and
a variable whose effect on the response we want to examine
(explanatory/independent)
o Ex: We want to study whether BP increases with Age; how
would we classify the variables?
▪ Response variable: Blood pressure
▪ Explanatory variable: Age

Types of Relationships
• Overall pattern of scatterplots describes form, direction and strength of
relationship
o Form
▪ Linear

▪ Non-linear

o Direction
▪ Positive relationship

▪ Negative relationship

o Strength
▪ Strong relationship

▪ Weak relationship

Outliers
• Scatterplots also help identify outliers
o Points that deviate markedly from the overall pattern

Correlation
• Correlation coefficient (r)
o Numerical measure of linear relationship between 2 variables
▪ r is always a number between –1 and 1
▪ r describes strength & direction of linear relationship
ONLY
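The notes do not show a formula for r here. The sketch below assumes the standard definition, the sum of products of z-scores divided by n - 1. The age and blood pressure values are hypothetical.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation as the sum of z-score products divided by n - 1."""
    n = len(xs)
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    products = (((x - xbar) / sx) * ((y - ybar) / sy) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

age = [30, 40, 50, 60, 70]        # hypothetical explanatory values
bp = [118, 125, 130, 138, 145]    # hypothetical response values

r = correlation(age, bp)
print(round(r, 3))
```

Points lying exactly on an upward-sloping line give r = 1 and points on a downward-sloping line give r = -1, so r stays between -1 and 1.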

Correlation is not Causation


• If two variables are correlated this does not mean that x causes y
• Lurking variable
o A hidden variable that has an important effect on the relationship
among variables in a study but is not one of the variables studied
▪ Ex: X= crime rates, Y = ice cream sales; highly correlated
in a study BUT a third factor Z (temperature, summertime etc.)
may be driving both
Linear Model
• Equation representing straight line that passes through the data

o For a given x we can predict y using
$\hat{y} = b_0 + b_1 x$
• y = true value at x
• ŷ (y-hat) = predicted/fitted value at x
• y − ŷ = residual
• b0 = y-intercept
• b1 = slope

Linear Regression
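• Best fitting line minimizes sum of squared residuals; its slope and
intercept are given by
$b_1 = r \frac{s_y}{s_x}, \qquad b_0 = \bar{y} - b_1 \bar{x}$
ALWAYS CALCULATE B1 FIRST
o Where
▪ r = correlation between x and y
▪ s_x, s_y = standard deviations of x, y
▪ $\bar{x}$, $\bar{y}$ = means of x, y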
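As a sketch of the calculation order (slope first, then intercept), the function below builds the line from r, the standard deviations and the means. The function name `fit_line` and the data are assumptions for illustration.

```python
from statistics import mean, stdev

def fit_line(xs, ys):
    """Return (b0, b1, r) for the least-squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    # Correlation between x and y.
    r = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)
    b1 = r * sy / sx          # slope first
    b0 = ybar - b1 * xbar     # then the intercept, which needs b1
    return b0, b1, r

age = [30, 40, 50, 60, 70]        # hypothetical explanatory values
bp = [118, 125, 130, 138, 145]    # hypothetical response values

b0, b1, r = fit_line(age, bp)
predicted_at_55 = b0 + b1 * 55    # predicted value y-hat at x = 55
print(round(b0, 2), round(b1, 2), round(predicted_at_55, 2))
```

The intercept formula forces the fitted line through the point of means, so the prediction at the mean of x is the mean of y.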
Regression and Correlation
• Correlation coefficient (r) between two variables equals the
slope of the linear model fitted to standardized values (z-scores)
Residual Standard Deviation
• If linear regression assumptions are satisfied we can measure prediction
accuracy using residual standard deviation
$s_e = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}}$
o s_e measures the typical distance between true and predicted values


Coefficient of Determination (R^2)
• Proportion of y-variation accounted for by linear model

• Equal to the squared coefficient of correlation (r^2)

• Proportion of y-variation left in residuals = 1 - R^2
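The link between residuals, R^2 and the residual standard deviation can be checked on a small example. The data are hypothetical, and the sketch assumes the usual n - 2 divisor for the residual standard deviation.

```python
from math import sqrt
from statistics import mean

x = [30, 40, 50, 60, 70]          # hypothetical explanatory values
y = [118, 125, 130, 138, 145]     # hypothetical response values
n = len(x)

xbar, ybar = mean(x), mean(y)
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = sxy / sxx                    # algebraically the same as r * s_y / s_x
b0 = ybar - b1 * xbar
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

sse = sum(e ** 2 for e in residuals)   # variation left in the residuals
r = sxy / sqrt(sxx * syy)
r_squared = 1 - sse / syy              # proportion of y-variation explained
s_e = sqrt(sse / (n - 2))              # residual standard deviation

print(round(r_squared, 4), round(r ** 2, 4), round(s_e, 3))
```

The two routes to R^2 agree, and the residuals of a least-squares line always sum to zero.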
Lecture 5: Linear Regression & Regression Wisdom
Regression Diagnostics
• Can fit linear model to any set of data
o Ex: X, Y don’t need to be linearly related
• Want to check whether linear model offers good description of data
o Can use scatterplot; but residual plot often provides a better
picture
• Anscombe’s quartet
o Comprises four datasets that have similar statistical properties –
even the slopes of regression lines are similar
▪ r1-r3: 0.816
Residual Plot
• Plot residuals (y-yhat) against x
• Residual plot should be evenly scattered around 0 with no particular
pattern
o Mean of residuals is always 0

• Uneven dispersion
o If residual spread changes with x – linear model is not evenly
accurate throughout X

Residual Standard Deviation
• If linear regression assumptions are satisfied, we can measure
prediction accuracy using residual SD
o A.k.a error SD
• Se measures average distance between true and predicted values

Coefficient of Determination
• How useful is linear regression model in describing the y-value?
o Compare y-data's variation to residual variation from linear
model
• Proportion of y-variation accounted for by linear model
• Equal to squared coefficient of correlation (r^2)
• Proportion of y-variation left in residuals
o = 1 - R^2
X-variable Predictions
• For predicting x based on y-variable should not just use y = b0 + b1 *x
model in reverse
o Instead fit new model: x = b0 + b1 *y
▪ Better because it minimizes the sum of squared errors in x, (x − x-hat)^2
Diagnostics
• Before using linear model, always check assumptions
o Linearity assumption
▪ Straight line form
o Equal spread assumption
▪ Data evenly spread around straight line
• Residuals should be evenly scattered around 0, with no particular
pattern
Subset Regressions
• Sometimes assumptions fail because data combine groups with
different behaviors
o Ex: Height vs Weight of 18-year-olds
• Fit separate regressions for male/female subsets
Extrapolation
• The farther away the new x-value is from the mean of x the less
trustworthy the predicted value should be
• Extrapolations are dubious because they require the assumption that
nothing about the relationship between x and y changes even at an
extreme value of x
Re-expressing Data
• Sometimes can improve linear regression assumptions by modeling
function of the data instead of their original values
• Typical choice of functions are as follows

o Apply function to y-variable first and perhaps to x-variable if
necessary
Lurking Variables and Causation
• No matter how strong the association or how large the R2 value there
is no way to conclude from a regression alone that one variable
causes the other
o There is always the possibility that some third variable is driving
both of the variables you have observed
• With observational data there is no way to be sure that a lurking
variable is not the cause of any apparent association

Outliers, High leverage and Influential points
• Should always check scatter/residual plot for outliers and influential
points
• Influential point is observation that significantly affects fitted line
o Removing influential point from regression would change fitted
line considerably
o Influential points typically have an x-value that is far from the
observed range of x-values
o We say a point is influential if omitting it from the analysis
changes the model enough to make a meaningful difference

• High leverage point
o A data point can be unusual if its x-value is far from the mean of
x-values
▪ Such points are said to have high leverage
o Data points whose x-values are far from the mean of x are said to
exert leverage on a linear model
▪ High-leverage points pull the line close to them so they
have a large effect on the line
• With high enough leverage their residuals can be
deceptively small
o A point with high leverage has the potential to change the
regression line but doesn’t always use that potential

Lecture 7: Sample Surveys, Experiments and Observational Studies


Sample Survey
• A study designed to ask questions of a sample drawn from a population
in the hope of learning something about the entire population
o Census: Ask EVERYONE

o Sample: Ask SOME people
• Population: entire group of individuals that we want information about
• Sample: Smaller group of individuals selected from population
o A subset of a population examined in hope of learning about the
population
Parameter
• Population parameter: numerical measure describing population; fixed
but unknown value which we want to find out about
o We rarely expect to know the true value of a population
parameter but we do hope to estimate it from sample

Sample Statistic
• Numerical measure describing sample; used to estimate corresponding
population parameter
• When the statistics we compute from the sample accurately reflect the
corresponding parameter then the sample is said to be representative
Sampling Variability
• Results can vary depending on which sample gets selected

• Sample-to-sample differences in the statistic (and so in how far it falls
from the parameter) are called sampling variability

Sample Size
• Sampling variability can be controlled by the sample size
o Number of individuals in the sample
o Larger sample – lower variability
• It is the sample size and not the size of the population that really
matters
o Only exception: If the population is small enough that the
sample size is more than 10% of the whole population then
population size CAN matter
Bias
• Sampling methods that tend to over or underemphasize characteristics
of the population
• Any systematic difference between sample statistics & population
parameter called bias

Randomization
• Protects us from the influences of all features of our population by
making sure on average the sample looks like the population

• To avoid bias, select sample at random
o Every individual in population has same chance of being
included in sample
• Randomization helps ensure that selected samples are representative of
population
o In the long run every individual appears equally often
o Able to represent population in all ways
Simple Random Sampling (SRS)
• A simple random sample of size n consists of n individuals from the
population chosen in a way that every set of n individuals has an equal
chance of being selected
Sampling Frame
• In order to perform random sampling we need a record of the entire
population
• The sampling frame is the list of individuals in the population from
which a sample is drawn
o Might not necessarily be the same as population
• Try to use the sampling frame that is as close as possible to population
of interest
Stratified Sampling
• Sometimes population is divided into groups of similar individuals
called strata
• To ensure more representative sample we use stratified random
sampling
o Select separate SRS’s within each stratum (with sample sizes
relative to stratum size) and combine them to get stratified sample

Cluster Sampling
• SRS can sometimes be impractical
• If population is divided into clusters, by geographic or other boundaries, it
is easier to perform an SRS of clusters
• Select a number of clusters at random and perform a census within
each of them
o Variation: instead of a complete census within each selected cluster,
randomly select only a portion of its individuals
Systematic Sample
• Individuals in population are often ordered
• If order is not associated with response we can use systematic
sampling: sampling every nth individual, starting from a randomly chosen
individual

• A sampling design in which entire groups, or clusters, are chosen at
random is called a cluster sample.
• A sampling design in which the population is divided into several
subpopulations (strata), with random samples drawn from each, is called a
stratified random sample.
• A sample drawn by selecting individuals systematically from a
sampling frame is called a systematic sample.
• Sampling schemes that combine several sampling methods are called
multistage samples.
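The four designs above can be sketched with Python's standard `random` module. Everything about the population here is hypothetical: the sampling frame of 100 labels, the two strata, the clusters of 10 and the proportional allocation are assumptions made only for illustration.

```python
import random

rng = random.Random(22)   # fixed seed so the sketch is repeatable
population = [f"student_{i:03d}" for i in range(100)]   # hypothetical sampling frame

# Simple random sample: every set of 10 individuals is equally likely.
srs = rng.sample(population, 10)

# Stratified sample: a separate SRS inside each (hypothetical) stratum.
strata = {"year1": population[:60], "year2": population[60:]}
stratified = []
for members in strata.values():
    k = round(10 * len(members) / len(population))   # proportional allocation (an assumption)
    stratified.extend(rng.sample(members, k))

# Cluster sample: choose whole clusters at random, then take everyone in them.
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
chosen_clusters = rng.sample(clusters, 2)
cluster_sample = [person for cluster in chosen_clusters for person in cluster]

# Systematic sample: random starting point, then every 10th individual.
step = 10
start = rng.randrange(step)
systematic = population[start::step]

print(len(srs), len(stratified), len(cluster_sample), len(systematic))
```

Only the stratified design guarantees that each stratum is represented, here with 6 individuals from the larger stratum and 4 from the smaller.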

• Convenience sample: A convenience sample consists of individuals who are
conveniently available.
• Cluster sample: Cluster sampling is a sampling design in which entire groups,
or clusters, are chosen at random.
• Simple random sample: A simple random sample of sample size n is a sample in
which each element of the population has an equal chance of being selected.
• Systematic sample: A systematic sample is drawn by selecting individuals
systematically from a sampling frame.
• Stratified random sample: Stratified random sampling is a sampling design in
which the population is divided into several subpopulations, or strata, and
random samples are then drawn from each stratum.
• Voluntary response sample: In voluntary response sampling, a large group of
individuals is invited to respond, and all who do respond are counted.

What Can Go Wrong


• Non-response bias
o Those who answer are not representative of the entire population
▪ Ex: sampling volunteers
• Response bias
o Favoring certain answer in way question is asked
▪ Make sure wording is neutral; use a pilot (trial) study
• Convenience sampling
o Sampling badly but conveniently
▪ Surveying your friends/family
• Undercoverage
o Not being able to sample certain parts of a population
• Voluntary response bias
o The sample is not representative even though every individual
was given the chance to respond
Collecting Data
• Observational Studies
o Observe individuals & measure variable of interest, without any
control on their response
• Experiments
o Impose different treatments on individuals in order to measure
and compare their effect
Observational Studies
• Don’t ask the individuals in the sample questions; just observe them
• Typically cheaper than a sample survey; however, we cannot assign
questions/choices
• Retrospective study: pick individual and extract historical data on them
o Pro: doesn’t take time, can span longer periods
o Con: cannot control/correct past data recording
• Prospective study: pick individuals and collect data as events happen
over time
o Pro: have control of observation process
o Con: more time consuming
• Observational studies also useful for finding trends & possible variable
relationships
• Observational studies do not demonstrate a causal relationship
o Ex: It is not necessarily true that exercise reduces insomnia
• In order to demonstrate a causal relationship we need to perform an
experiment
Experiments
• How do we establish cause & effect?
o Randomly select some subjects and instruct them to exercise, and instruct
the remaining subjects not to exercise
▪ Assess and compare results from both groups
• How does this help?
o Choosing two groups at random means they start out relatively
equal in terms of any characteristic that might matter
o If the groups end up differing in the response, this is evidence that the
treatment caused the difference
• Placebo
o Fake treatment designed to look like a real one, used when just
knowledge of receiving any treatment can affect response
▪ For comparing results, often use current standard treatment
as a baseline
▪ Subjects getting placebo/standard treatment called control
group
• Experiments require random assignment, a factor and at least one
response variable

Experiment Terminology
• Experimental units/subjects
o Individual participating in experiment
• Factor
o Explanatory variable whose level can be manipulated by
experimenter
▪ Levels: Specific values chosen for factor
▪ Treatment: Specific combination of manipulated levels of
one or more factors
• Response: variable whose values are compared across treatments
• Statistically Significant: a factor effect so large that it would rarely
occur by chance
Principles of Experimental Design
• Control
o Control sources of variation besides factors, by making
conditions as similar as possible for all treatment groups
• Randomize
o Helps equalize effects of unknown/uncontrollable sources of
variation
o Assigning experimental units to treatment at random allows us to
use statistical methods to draw conclusions
• Replicate
o Get several measurements of response for each treatment
o Two kinds of replication show up in comparative experiments
▪ We should apply each treatment to a few subjects
▪ Entire experiment is repeated on a different population of
experimental units
• Blocking
o For variables we can identify but cannot control and which affect the
response, divide subjects into groups (blocks) with the same variable value
and randomize within each block
▪ Removes much of the variability due to the difference
among the blocks
• Group similar individuals together and then randomize
within these blocks
Experimental Designs
• Completely Randomized Design
o All experimental units are allocated at random among all
treatments
• Randomized Block Design
o Random assignment of units to treatment is carried out separately
within each block
Blinding
• Knowledge of assigned treatment can often influence the assessment of
the response
o Two classes of individual can affect experiment
▪ Those who influence results
▪ Those who evaluate results
• Blinding avoids bias from knowing treatment
o Single-blind: every individual in one of the two classes doesn’t
know treatments
o Double-blind: every individual in both of the two classes doesn’t
know treatments
Lecture 8: Probability
Probability
• Many things in life are random (uncertain)
• Probability describes how likely ‘something’ is to happen
o Probabilities take values between 0 and 1
▪ A value of 1 means event is sure to happen
▪ A value of 0 means event is not going to happen
Probability Terminology
• Random phenomenon: process whose result is not known beforehand
o Ex: Rolling a standard die
• Trial: particular realization of random phenomenon

o Ex: particular roll of a standard die
• Outcome: basic result of a trial that cannot be broken down to simpler
results
o Ex: Rolling a 1 is an outcome
• Sample space: collection of all possible outcomes
o Sample space is usually denoted by S
o Ex: For rolling a standard die S = {1, 2, 3, 4, 5, 6}
Events
• A collection of one or more outcomes
o Usually denoted by capital letters (A, B etc.)
o Ex: A = rolling an even number
▪ Event can contain just single outcome
• Probabilities are assigned to each event
• Example: Students course grade is random phenomenon with sample
space
Empirical Probability
• Observe sequence of trials (e.g. coin tosses) and count the number of times
an event occurs
o Relative frequency of event: # of times event occurs / total number of trials
▪ In short term; unpredictable
▪ In long term; predictable
• Limiting relative frequency value called empirical probability of event
denoted by P (X)
Law of Large Numbers (LLN)
• LLN guarantees relative frequency will eventually settle on a specific
value
o LLN holds if trials are independent
▪ Outcomes of one trial do not influence that of another
• LLN does not apply in short run
o Ex: Assume you flip a coin, and the first 10 outcomes are tails, is
the next flip more likely to be heads? No.
• LLN only applies in the long run
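A short simulation can show the relative frequency settling down in the long run. The seed, the number of flips and the function name are arbitrary choices for illustration.

```python
import random

def running_proportion_of_heads(n_flips, rng):
    """Simulate fair coin flips and record the proportion of heads after each flip."""
    heads = 0
    proportions = []
    for i in range(1, n_flips + 1):
        heads += rng.random() < 0.5   # True counts as 1 head
        proportions.append(heads / i)
    return proportions

rng = random.Random(0)   # fixed seed so the run is repeatable
props = running_proportion_of_heads(100_000, rng)

# Early proportions can wander; late ones should sit close to 0.5.
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, props[n - 1])
```

Each flip is generated independently of the earlier ones, which is the condition under which the LLN holds; a run of tails does not make heads more likely on the next flip.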

Other Ways To Assign Probability
• Theoretical Probability
o Sometimes P(X) can be deduced from mathematical model
▪ For equally likely outcomes (rolling a fair die) the
probability can be shown as follows
• P (X) = # of outcomes in X / total # of outcomes
• Personal/ Subjective Probability
o Assign P(X) based on personal beliefs
▪ Typically, experts' views
▪ Most general case; can always be used
Venn Diagrams
• Method for representing events graphically
o Sample space (S): rectangle
o Outcomes: points in S
o Events: areas in rectangle
• Examples: Roll of a die
o Event X = {5, 6} = Roll greater than or equal to 5

Composite Events
• Given events X and Y form new events as:
o X or Y (union)
▪ Either event X or Y or both occur

o X and Y (intersection)
▪ Both events X and Y occur simultaneously

o Not X (complement, Xc)
▪ Event X does not occur

Disjoint Events
• Events are called disjoint (mutually exclusive) if they have no
outcomes in common
Probability Rules
• No matter how probabilities are assigned they must satisfy the
following three rules
o Probability of any event is between 0 and 1
o Probability of all outcomes is 1
▪ P (S) =1, where S = sample space
o For two disjoint events X & Y
▪ P (X or Y) = P (X) + P (Y)
Complement Rule
• Probability of Xc (the complement of X) can be found from that of X using
the complement rule
o P (Xc) = 1- P(X)
• This rule is useful when the probability of one of X or Xc is easier to
find than the other
General Addition Rule
• For any two events X and Y (which may overlap)
o P (X or Y) = P (X) + P (Y) - P (X and Y)
• If events are disjoint (no overlap)
o P (X and Y) equals 0
▪ P (X or Y) = P (X) + P (Y)

• Either rule gives the same answer for disjoint events, since the overlap
term P (X and Y) equals 0
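The complement rule and the general addition rule can both be verified by counting outcomes. A minimal sketch, assuming one roll of a fair die; the events X (even roll) and Y (roll of at least 5) are hypothetical choices made only for illustration.

```python
from fractions import Fraction

S = set(range(1, 7))   # sample space: one roll of a fair die
X = {2, 4, 6}          # hypothetical event: roll is even
Y = {5, 6}             # hypothetical event: roll is at least 5

def prob(event):
    """Probability by counting equally likely outcomes."""
    return Fraction(len(event), len(S))

direct = prob(X | Y)                        # P(X or Y) counted directly
by_rule = prob(X) + prob(Y) - prob(X & Y)   # general addition rule
print(direct, by_rule)                      # 2/3 2/3
print(prob(S - X), 1 - prob(X))             # complement rule: 1/2 1/2
```

Subtracting P (X and Y) removes the outcome 6, which would otherwise be counted twice.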
Conditional Probability
• Probability of event X given that event Y has occurred
o Denoted by P (X|Y), read "X given Y": the probability of X when the
possible outcomes are restricted to the ones in Y
• Given by the formula
o P (X|Y) = P (X and Y) / P (Y)
▪ Probability of both of them occurring divided by the
probability of the ‘given’ occurring alone
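The conditional probability formula can be sketched the same way. This example assumes one roll of a fair die, with hypothetical events X (roll is a 6) and Y (roll is even); cond_prob is an illustrative helper name.

```python
from fractions import Fraction

S = set(range(1, 7))   # sample space: one roll of a fair die
X = {6}                # hypothetical event: roll is a 6
Y = {2, 4, 6}          # hypothetical event: roll is even

def prob(event):
    return Fraction(len(event), len(S))

def cond_prob(a, b):
    """P(a | b) = P(a and b) / P(b); only defined when P(b) > 0."""
    return prob(a & b) / prob(b)

print(prob(X))          # 1/6
print(cond_prob(X, Y))  # 1/3
```

Knowing the roll is even restricts the outcomes to three, so the chance of a 6 rises from 1/6 to 1/3.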

Independence
• Events X and Y are independent if the occurrence of one does not
change the probability of the other
o X, Y are independent if P (X|Y) = P (X) or, equivalently, P (Y|X) = P (Y)
▪ EVENTS ARE INDEPENDENT ONLY IF THE
PROBABILITY OF ONE DOES NOT CHANGE FROM
THE OTHER
• Independent is NOT the same as mutually exclusive
o If X, Y are mutually exclusive (and each has non-zero probability) then they are dependent
o If X, Y are not mutually exclusive they can either be independent
or not depending on their probabilities
Multiplication Rule
• For any events X and Y
o P (X and Y) = P(X) x P (Y|X) or P(Y) x P (X|Y)
• If events are independent then
o P (X and Y) = P (X) x P(Y)
▪ PROBABILITY THEY BOTH OCCUR
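The difference between independent and mutually exclusive events can be checked with the multiplication rule. A minimal sketch, assuming two flips of a fair coin; the events X, Y and Z are hypothetical examples.

```python
from fractions import Fraction
from itertools import product

S = set(product("HT", repeat=2))        # two flips of a fair coin
X = {s for s in S if s[0] == "H"}       # first flip is heads
Y = {s for s in S if s[1] == "H"}       # second flip is heads
Z = {("T", "T")}                        # both tails: disjoint from X

def prob(event):
    return Fraction(len(event), len(S))

def independent(a, b):
    """Independent exactly when P(a and b) = P(a) x P(b)."""
    return prob(a & b) == prob(a) * prob(b)

print(independent(X, Y))  # True
print(independent(X, Z))  # False
```

X and Z cannot happen together, so knowing one occurred changes the probability of the other to 0; that is why disjoint events with non-zero probability are dependent.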

Lecture 9: Random Variables
Random Variables
• Random variable X assigns a single number to every outcome of an
experiment
o Ex: Consider flipping 2 fair coins
▪ Random variable X = # of heads in 2 flips

• Random variables are used to describe events and their probability


o Example: Rolling 2 fair dice
▪ Random variable X = sum of 2 rolls
▪ List sample points of event (X=5)
▪ Find the probability P(X=5)
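The two-dice example can be worked by listing all 36 equally likely outcomes. A minimal sketch, assuming fair dice; variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))   # 36 equally likely outcomes
event = [r for r in rolls if sum(r) == 5]      # sample points of the event (X = 5)
p = Fraction(len(event), len(rolls))

print(event)  # [(1, 4), (2, 3), (3, 2), (4, 1)]
print(p)      # 1/9, i.e. 4/36
```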
Types of Random Variables
• Discrete
o Can only take specific values
▪ Ex: 1, 2, 3, etc.

o Typically, the result of counting something
▪ Ex: the number of children in a family
• Continuous
o Can take any value within an interval
▪ Ex: all values in the interval [0, 1]
o The result of some type of measurement
▪ Ex: temperature
Probability Model
• A table, graph or formula that describes the following:
o The values of a random variable
o The probability associated with each value
• Probability of X taking some specific value x is denoted by P (X = x)
or P(x)
o Ex: the sum of two dice rolls- P(X=5) = 4/36
• For any probability model
o 0 ≤ P (X = x) ≤ 1 for every value x
o The probabilities of all values add up to 1

Expected Values
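• Probability model of discrete random variable can be described
numerically as follows
o Mean (expected value): E (X) = μ = Σ x × P (X = x)
o Variance: Var (X) = σ² = Σ (x - μ)² × P (X = x)
o Standard deviation: SD (X) = σ = √Var (X)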
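The mean and variance of a discrete random variable can be computed directly from its probability model. A minimal sketch using the earlier example of X = # of heads in 2 fair coin flips; the function names mean and variance are illustrative.

```python
from fractions import Fraction

# Probability model of X = # of heads in 2 fair coin flips
model = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def mean(model):
    """E(X): sum of each value times its probability."""
    return sum(x * p for x, p in model.items())

def variance(model):
    """Var(X): sum of squared deviations from the mean, weighted by probability."""
    mu = mean(model)
    return sum((x - mu) ** 2 * p for x, p in model.items())

print(mean(model), variance(model))  # 1 1/2
```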

Linear Transformations
• Often need to change the values of random variables

Example
• Let X = # of heads in 2 fair coin flips
o Gamble offers $4 for every head but must pay $5 to play
• Let Y = net gain from the gamble
o If X = 0, Y= -5 + 4 x 0 = -5
o If X = 1, Y= -5 + 4 x 1= -1
o If X = 2, Y = -5 + 4 x 2 = +3
• Generally, Y = -5 + 4X
Centre & Spread of Linear Transformations
• Don’t need to calculate center & spread of linear transformations of
random variables from scratch
• If Y = aX + b for any constants a and b then
o E (Y) = a × E (X) + b
o Var (Y) = a² × Var (X)
o SD (Y) = |a| × SD (X)
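The gamble example can be used to check the linear transformation rules: compute the centre and spread of Y = -5 + 4X from scratch and compare with the shortcut. A minimal sketch, assuming X = # of heads in 2 fair coin flips; names are illustrative.

```python
from fractions import Fraction

x_model = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
a, b = 4, -5                                             # Y = 4X - 5
y_model = {a * x + b: p for x, p in x_model.items()}     # values -5, -1, +3

def mean(model):
    return sum(v * p for v, p in model.items())

def variance(model):
    mu = mean(model)
    return sum((v - mu) ** 2 * p for v, p in model.items())

print(mean(y_model), a * mean(x_model) + b)              # -1 -1
print(variance(y_model), a ** 2 * variance(x_model))     # 8 8
```

On average the gamble loses $1 per play, and the added constant b shifts the mean without affecting the variance.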

Combining Random Variables


• Often, we need to combine random variables
o Ex: Let X = # of heads in two coin flips, Let Y = # of heads in
another coin flip
• Define Z = X + Y (# of heads in all 3 flips)
o Find the probability model of Z by considering all possible
combinations of x + y
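The probability model of Z = X + Y can be built by looping over every combination of x and y. A minimal sketch, assuming fair coins and independent flips (so the joint probability is the product of the two probabilities); names are illustrative.

```python
from collections import defaultdict
from fractions import Fraction

x_model = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}  # heads in 2 flips
y_model = {0: Fraction(1, 2), 1: Fraction(1, 2)}                     # heads in 1 more flip

z_model = defaultdict(Fraction)   # missing keys start at 0
for x, px in x_model.items():
    for y, py in y_model.items():
        # Independence assumption: P(X = x and Y = y) = P(X = x) x P(Y = y)
        z_model[x + y] += px * py

for z in sorted(z_model):
    print(z, z_model[z])
```

The result matches the model for the number of heads in 3 fair flips: probabilities 1/8, 3/8, 3/8, 1/8 for 0, 1, 2, 3 heads.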
Centre & Spread of Combination Random Variables
• For any 2 random variables
o E (X ± Y) = E (X) ± E (Y)
• For two independent random variables
o Var (X ± Y) = Var (X) + Var (Y) (variances add even for a difference)
o Note: RV’s are independent if value of one does not affect the
probability of the other

Combinations of Normal Random Variables
• For normal and independent RV’s in particular, means and standard
deviations are all we need to know to calculate probabilities
• If X1 is Normal, X2 is Normal and they are independent then X1 +/-
X2 is Normal
o Use this result to calculate probabilities of combination using
Normal distribution
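A probability for a sum of independent Normal random variables can be sketched as follows. The means and standard deviations below are hypothetical numbers chosen for illustration (two task times in minutes), not values from the lecture.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical independent Normal task times (minutes)
mu1, sd1 = 30, 4
mu2, sd2 = 20, 3

# Means add; variances (not standard deviations) add for independent RVs.
total = NormalDist(mu1 + mu2, sqrt(sd1 ** 2 + sd2 ** 2))   # mean 50, SD 5

p_over_60 = 1 - total.cdf(60)   # P(X1 + X2 > 60)
print(round(p_over_60, 4))      # 0.0228
```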
Bernoulli Trial
• Bernoulli Trial: trial with only two outcomes
o Ex: True/False, Yes/No, Heads/Tails
▪ Usually labeled success (1) and failure (0)
o P (success) = p, P (failure) = 1-p
▪ Ex: P (1) = ½ & P (0) = 1- ½ =½
• Trials form basis of many common probability models
Binomial Model
• Several Bernoulli trials but only interested in total number of successes
• Binomial Setting
o Fixed number (n) of Bernoulli trials
o Same probability of success for each trial
o Bernoulli trials are independent
• Binomial Random Variable
o X = # of successes in a Binomial setting
Binomial Distribution
• If X follows binomial distribution with n trials and probability of
success p
o X takes values from 0 to n
o Probabilities are given by formula
▪ P (X = k) = C(n, k) × p^k × (1-p)^(n-k), for k = 0, 1, ..., n,
where C(n, k) = n! / [k! (n-k)!]
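The binomial probabilities can be computed directly from the standard binomial formula. A minimal sketch; binom_pmf is an illustrative name, and n = 4, p = 0.5 are example values.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p): C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Example: probability of exactly 2 heads in 4 fair coin flips
print(binom_pmf(2, 4, 0.5))  # 0.375
```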

Mean and Variance of Binomial
• If X is a Binomial random variable, then:
o Mean: E (X) = np
o Variance: Var (X) = np (1-p), so SD (X) = √[ np (1-p) ]
o Mean increases with number of trials (n) and the probability of
success (p)
• Variance increases with # of trials (n)
o Variance is biggest when p =½
o Variance becomes small when p is close to 0 or 1
Normal Approximation of Binomial
• The normal distribution yields a good approximation of the binomial
for large n values
• Use approximation when np ≥ 10 AND n (1-p) ≥ 10
• Approximate binomial with a normal with the same mean and
standard deviation
o Mean np and standard deviation √[ np (1-p) ]
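The quality of the Normal approximation can be checked by comparing it with the exact binomial probability. A minimal sketch with hypothetical values n = 100 and p = 0.5 (so np and n(1-p) are both 50); no continuity correction is applied.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 100, 0.5   # hypothetical values; np = n(1 - p) = 50, so the condition holds

# Exact binomial probability P(X <= 45)
exact = sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(46))

# Normal with the same mean np and standard deviation sqrt(np(1 - p))
approx = NormalDist(n * p, sqrt(n * p * (1 - p))).cdf(45)

print(round(exact, 4), round(approx, 4))
```

The two values are close but not identical; the gap shrinks as n grows.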

Lecture 10: Sampling Distribution Models


Sampling Distributions
• A sample statistic takes different values in different samples, with different probabilities
o Ex: the sample mean (x-bar) follows a sampling distribution

Mean and Variance of Sampling Distribution
• What is sampling distribution’s mean?
o E (x-bar) = μ
• What is sampling distribution’s variance?
o Var (x-bar) = σ²/n, so SD (x-bar) = σ/√n

Central Limit Theorem


• For a population with mean μ and variance σ², if samples are independent and
random and sample size (n) is large enough (n ≥ 30)
o If sampling without replacement, n must be less than 10% of
population
• Sampling distribution of x-bar is approximately Normal with mean μ and
variance σ²/n
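The Central Limit Theorem can be illustrated by simulation. This is a minimal sketch, assuming a skewed population (exponential with mean 1 and variance 1), samples of size n = 30 and an arbitrary seed; all of these are illustrative choices.

```python
import random
from statistics import fmean, pvariance

rng = random.Random(1)
n, reps = 30, 5000   # sample size and number of repeated samples

# Each entry is the mean of one random sample of size n from the population.
sample_means = [
    fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(reps)
]

print(fmean(sample_means))       # close to the population mean, 1
print(pvariance(sample_means))   # close to variance / n = 1/30
```

Even though the population is strongly skewed, a histogram of sample_means would look roughly bell-shaped and centred at 1.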

Sample Proportion
• Bernoulli population: Each population subject belongs to one of two
categories
o Example: male/female, employed/unemployed
• Population parameter (p)
o Ex: proportion of male students
• Random sample of size (n)
o X = # of sample subjects in category of interest
o X follows binomial with parameters n, p
• Sample proportion
o Sample estimate of population proportion p: p-hat = X / n

Sampling Distribution of Sample Proportion


• For population with proportion of success p, if samples are random and
independent and sample size (n) is large enough so that
o np is greater or equal to 10
o n (1-p) is greater or equal to 10
• If sampling without replacement, n must be less than 10% of
population
o Sampling distribution of p-hat is approximately Normal with
mean p and variance [ p(1-p)/n ]

Lecture 11: Confidence Intervals for Proportions


Confidence Interval
• Sampling distribution model of p-hat is centered at p with standard
deviation
o SD (p-hat) = √[ p (1-p) / n ]
o Don’t know p, so we can’t find the true standard deviation
o Approximate with standard error
▪ SE (p-hat) = √[ p-hat (1 - p-hat) / n ]

• By the 68-95-99.7 rule we know the following
o 68% of samples have p-hat within 1 SE of p
o 95% of samples have p-hat within 2 SE’s of p
o 99.7% of samples have p-hat within 3 SE’s of p
• Consider 95% level: for 95% of samples, p is no more than 2
SE’s away from p-hat
o We call the interval p-hat ± 2 SE the 95% confidence interval
• The confidence interval uses sample statistic to estimate parameter
o Since samples vary the statistics used and thus the CI’s also vary
o CI’s sometimes capture the true parameter while other times they
don’t
Margin of Error
• With 95% CI we are 95% confident that the interval contains the true p
• The extent of the interval on either side of p-hat is called the margin
of error (ME)
o In general confidence intervals have the form of the following
▪ estimate ± ME
▪ The more confident we want to be the larger our margin of
error needs to be

Assumptions and Conditions
• Independence Assumption
o Randomization Condition: data must be sampled at random or
generated from properly randomized experiment
o 10% Condition: sample size (n) must be < 10% of the population
size
• Sample size assumption: n is large enough to use CLT (normal
approximation)
o Success/Failure Condition: must expect at least 10 “successes”
and 10 “failures” in sample

Proportion Confidence Interval
• If conditions are met, build confidence interval for proportion at C% as
follows
o p-hat ± z* × √[ p-hat (1 - p-hat) / n ]
o Critical value z* from normal distribution, depending on
particular confidence level C
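A one-proportion confidence interval can be sketched as below. The function name proportion_ci and the sample numbers (120 successes out of 400) are hypothetical; the conditions above are assumed to have been checked separately.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    """Return (low, high) for p-hat +/- z* x SE(p-hat)."""
    p_hat = successes / n
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # critical value z*
    se = sqrt(p_hat * (1 - p_hat) / n)                        # standard error
    me = z_star * se                                          # margin of error
    return p_hat - me, p_hat + me

low, high = proportion_ci(successes=120, n=400)
print(round(low, 3), round(high, 3))  # 0.255 0.345
```

Raising the confidence level increases z* and therefore widens the interval.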
Lecture 12: Testing Hypotheses for Proportions
• Hypothesis: statement about the value of a population parameter for the
purpose of testing
o Ex: Proportion of customers returning to specific restaurant
within a month is 20%
• Hypothesis Testing: procedure for testing whether a hypothesis is valid
or not, based on the sample evidence
• Hypothesis testing steps
o Every hypothesis test has the following steps
▪ Set up 2 hypotheses (null & alternative)
• Null hypothesis: specifies population parameter of
interest and proposes baseline value for it. Presumed
true unless disproven
• Alternative hypothesis: contains value of parameter
different from null hypothesis; typically what you
want to show
▪ Specify appropriate model (find test statistic)
▪ Mechanics (calculate p-value)
▪ Arrive at conclusion (reject the null hypothesis or not)
Analogy with Trial
• Think about logic of jury trials: we start by presuming someone is
innocent

o Retain innocence unless evidence makes it unlikely beyond
reasonable doubt
• Similarly, in hypothesis testing the null hypothesis is presumed true unless
proven otherwise
Test Statistic
• How should data behave if the null hypothesis were true? (use sampling
distributions)
• If H0: p = p0 (plus same conditions as for confidence intervals)
o p-hat is approx. Normal with mean p0 and standard deviation √[ p0 (1 - p0) / n ]
o z = (p-hat - p0) / √[ p0 (1 - p0) / n ] is standard normal and is called the test
statistic

P-value
• How can we quantify evidence against the null hypothesis?
• Look at the probability of getting equally or more extreme data (test
statistic) if the null hypothesis is true

• P-value tells us how surprising the data observed is under the null
hypothesis
o Calculated as probability from sampling distribution of test
statistic

• For two-sided test, p-value is twice the probability of whichever tail the
observed statistic falls in
Conclusion
• Reject the null hypothesis if p-value is small
• We choose a cut-off point (α) ahead of time
o α is called the significance level
o Typically a value of 5%; can use even smaller values if we need
to be sure when rejecting null hypothesis
• If p-value < α: reject the null hypothesis in favour of the alternative
• If p-value ≥ α: fail to reject the null hypothesis (this does not prove the null is true)
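The test statistic, p-value and conclusion steps can be sketched for the restaurant example with H0: p = 0.20. The sample (50 returning customers out of 200) is hypothetical, and one_proportion_z_test is an illustrative name; a two-sided alternative is assumed.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z_test(successes, n, p0):
    """Two-sided one-proportion z-test; returns (z, p_value)."""
    p_hat = successes / n
    sd0 = sqrt(p0 * (1 - p0) / n)                     # SD of p-hat if H0 is true
    z = (p_hat - p0) / sd0                            # test statistic
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))      # twice the tail probability
    return z, p_value

alpha = 0.05
z, p_value = one_proportion_z_test(successes=50, n=200, p0=0.20)
print(round(z, 3), round(p_value, 3))
print("reject H0" if p_value < alpha else "fail to reject H0")
```

Note that the standard deviation uses p0, not p-hat, because the calculation assumes the null hypothesis is true.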
Lecture 13: More About Tests
One-Proportion Z-test

Hypothesis Test Errors
• What does significance level (α) represent?
o A test can make two types of mistakes
▪ Type 1: null hypothesis is true, but we reject it
▪ Type 2: null hypothesis is false, but we fail to reject it

• Probabilities of making 2 types of errors


o α = P (Type 1 error | null hypothesis is true)
▪ Null hypothesis true (p = p of null)
o β = P (Type 2 error | alternative hypothesis is true)
▪ Alternative hypothesis true (p > p of null hypothesis)

Significance level (α)
• For any test, there is a trade-off between α and β
o For fixed n, decreasing α increases β and vice versa
• Say we reject the null hypothesis if P-value < 5%
o Then P (Type 1 error | null hypothesis is true) = 5%
• Significance level (α) is the limit we set for the probability of type 1
error
o We only control type 1 error because this type of error is
considered more important to avoid
▪ If we reject the null hypothesis we call this result
statistically significant at level α
Lecture 13: Inference About Means
• Inference for mean of quantitative variables
o Example: Average time spent watching TV
▪ Population parameter: (μ and σ)
▪ Sample statistic: (x-bar and s)
Sampling Distribution of Sample Mean

• From CLT we know that the sampling distribution of x-bar is
approximately normal with mean μ and standard deviation σ/√n
o z = (x-bar - μ) / (σ/√n) is approximately standard normal
• However, we don’t know the population σ
o We can always estimate it with the sample standard deviation s, BUT the
sampling distribution of (x-bar - μ) / (s/√n) is NOT approximately normal
▪ Follows t-distribution
T-statistic

• We call t = (x-bar - μ) / (s/√n) the t-statistic
o Similar to z-score but we use s instead of σ
• Sampling distribution of t-statistic depends on the sample size (n)
o Degrees of freedom (df) parameter controls the spread of the distribution
and is given by n - 1
• Conditions for t-distribution: independent & random sample from
unimodal and symmetric population

• Characteristics of t-distribution
o Symmetric around 0 & bell-shaped
o More spread (variance) than standard normal (z)
o As degree of freedom increases, the distribution gets closer to
standard normal

• T-distribution always more spread out than standard normal to account
for uncertainty of using estimate (s) instead of (σ)
One-sample t-interval for (μ)
• Confidence interval for (μ)
o x-bar ± t* × s/√n, where t* is the critical value from the
t-distribution with n - 1 degrees of freedom
• Conditions: Random & independent sample of size <10% of population
size
• Normal Population Assumption: we can never be certain that the data
are from a population that follows a normal model, but we can check
the condition
o Nearly Normal Condition: the data comes from a distribution that
is unimodal and symmetric
▪ Checking this condition is done by making a histogram or a
normal probability plot
▪ The smaller the sample size (n <15) the more closely the
data should follow a normal model
▪ For moderate sample sizes (between 15 and 40) the ‘t’
works well if the data is unimodal and symmetric

▪ For sample sizes larger than 40 the t methods are still safe to
use unless the data are extremely skewed
T-distribution Critical Values
• Need critical values of (t*) from t-distribution
• Table T gives critical t values for 5 common levels and various degrees
of freedom
o Columns correspond to confidence level
o Rows correspond to df
o Last row gives standard normal critical values
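A one-sample t-interval can be sketched with a critical value read from a t table. The travel-time data below are hypothetical, and t* = 2.262 is the standard table value for 95% confidence with df = 9; t_interval is an illustrative name.

```python
from math import sqrt
from statistics import mean, stdev

def t_interval(data, t_star):
    """Return (low, high) for x-bar +/- t* x s / sqrt(n); t* comes from a t table."""
    n = len(data)
    x_bar, s = mean(data), stdev(data)    # stdev uses the n - 1 divisor
    me = t_star * s / sqrt(n)             # margin of error
    return x_bar - me, x_bar + me

travel_times = [12, 15, 17, 14, 20, 18, 16, 19, 13, 16]   # hypothetical minutes, n = 10
low, high = t_interval(travel_times, t_star=2.262)        # df = 10 - 1 = 9, 95% level
print(round(low, 2), round(high, 2))  # 14.15 17.85
```

The critical value is passed in rather than computed because the Python standard library has no t-distribution.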

Interpreting Confidence Intervals


• Say: “X% of intervals found in this way cover the true value”
o Ex: I am 90% confident the true mean of travel time is between
14.4 and 19.6 minutes

Relationship between Intervals & Tests
• Confidence intervals and hypothesis tests are built from the same
calculations
o Complementary ways of looking at the same question
o The confidence interval contains all the null hypothesis values we
can’t reject with the data
• A level C confidence interval contains all the possible null hypothesis
values that would NOT be rejected by a two-sided test
o Alpha level 1-C
▪ 95% confidence interval matches a 0.05 level test
• Confidence intervals are naturally two-sided so they match exactly with
two-sided hypothesis tests
o When the hypothesis is one-sided the corresponding alpha level
is (1-C)/2
Determining the Sample Size
• To find the sample size needed for a particular confidence level with a
particular ME (solve equation for n):
o ME = t* × s/√n, so n = (t* × s / ME)²
• The problem with this equation is that we don’t know most of the
values
o We can use (s) from a small pilot study
o We can use z* in place of the necessary t-value
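The sample size calculation can be sketched by solving the margin of error equation for n, with z* standing in for the t-value. The pilot-study standard deviation s = 5 and target ME = 1 are hypothetical numbers; sample_size_for_mean is an illustrative name.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_mean(s, me, confidence=0.95):
    """Smallest n with z* x s / sqrt(n) <= me, using z* in place of t*."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z_star * s / me) ** 2)   # always round up

print(sample_size_for_mean(s=5, me=1))  # 97
```

Halving the margin of error roughly quadruples the required sample size, because n depends on 1/ME².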
