STAB22 Lecture Notes
▪ Relative Frequency
• List ALL categorical data and their relative
frequencies
o Ex: Proportion/percentage # of each category
▪ Contingency Tables
• Two-way table lists frequencies for each combination
of the two categorical variables
o Ex: Student grade by year
o Charts
▪ Pie Chart
• Categories represented by slices, where size of each
slice is proportional to relative frequency
▪ Bar Chart
• Categories represented by bars, where the height of
each bar is relative frequency
▪ Side-by-side
Describing Two Categorical Variables
❖ Consider two categorical variables for each individual case
o Ex: Student course grades (variable 1) & year the course was
taken (variable 2)
❖ Want to describe joint behavior of both categorical variables
Conditional Distributions
❖ Often need distribution of one variable for a particular value of the
other
o Ex: What percentage of males get accepted/rejected from
university?
❖ Fix value of one variable and look at distribution of the other for that
value only
o Ex: Conditional distribution of decision, conditional on gender =
male
❖ Can condition on either variable (i.e. rows or columns) of contingency
tables
o Ex: conditional distributions of decision for gender =
male/female
❖ If the conditional distributions of one variable are the same for every value
of the other, we say the two variables are independent
o Ex: If conditional distribution of student course grades does not
change with the year then the grade is independent of the year
❖ Compare conditional distributions visually using side-by-side plot
(double bar-graph)
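A minimal sketch of conditioning on one row of a contingency table. The admissions counts below are made up for illustration, and `conditional_distribution` is a hypothetical helper name, not something from the course.

```python
# Hypothetical admissions counts (made up for illustration):
# rows = gender, columns = decision.
table = {
    "male":   {"accepted": 30, "rejected": 70},
    "female": {"accepted": 45, "rejected": 55},
}

def conditional_distribution(table, row):
    """Distribution of the column variable, conditional on one row value."""
    counts = table[row]
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}

male = conditional_distribution(table, "male")
female = conditional_distribution(table, "female")
print("decision | gender = male:  ", male)
print("decision | gender = female:", female)

# The two variables are independent only if every row gives the same
# conditional distribution (up to rounding).
independent = all(abs(male[k] - female[k]) < 1e-9 for k in male)
print("independent:", independent)
```

Here the acceptance proportion differs between the two rows, so in this made-up table decision is not independent of gender.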
Simpson’s Paradox
❖ An association seen in the overall (aggregated) data can weaken, disappear or
even reverse when the data are broken down by a third (lurking) variable
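A small numerical sketch of the paradox, using illustrative counts (not course data): treatment A has the higher success rate within each group, yet B looks better when the groups are pooled, because A was used mostly on the harder cases.

```python
# Illustrative counts only: (successes, total) for each treatment within each group.
groups = {
    "mild cases":   {"A": (81, 87),   "B": (234, 270)},
    "severe cases": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, total):
    return successes / total

# Success rate within each group: A is higher in both.
for name, g in groups.items():
    print(name, "A:", round(rate(*g["A"]), 3), "B:", round(rate(*g["B"]), 3))

# Success rate after pooling the groups: B is higher overall.
overall = {}
for treatment in ("A", "B"):
    successes = sum(groups[g][treatment][0] for g in groups)
    total = sum(groups[g][treatment][1] for g in groups)
    overall[treatment] = rate(successes, total)
print("overall", {t: round(r, 3) for t, r in overall.items()})
```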
Lecture 2: Displaying and Summarizing Quantitative Data
Describing Quantitative Data
❖ Variables measure numerical quantities
o Arithmetic operations make sense
▪ Ex: Breakfast cereal data (calories per serving)
❖ Described using
o Plots: histograms, boxplot
o Numerical summaries: mean, median, range etc.
❖ For any quantitative variable we want to describe the overall pattern
of its distribution
o Shape- peaks (modes), symmetry, outliers
o Centre- mean, median
o Spread- range, standard deviation
Histogram
❖ Create artificial categories called bins or classes that group ranges of values
of the quantitative variable
o Ex: calorie classes; 40-60, 60-80 etc.
▪ Classes span ALL data and DON’T overlap
❖ Pretend variable is categorical and create bar plot
Shape of Distribution
❖ Modality
o Check for modes/peaks
▪ Also known as the most frequently occurring values
❖ Symmetry
o Distribution is called symmetric if, when we draw a vertical line
down its center the two sides are similar shape and size
❖ Skewness
o Unimodal distributions with one tail longer than the other are
called skewed
❖ Outliers
o Any data that lie far off the main body of the distribution are
called outliers
Centre of Distribution
❖ Mean
o Sum of all values of quantitative variable divided by the number
of values
▪ All the actual values added up divided by how many values
there actually are
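o Generally, for n sample values x1, x2, ..., xn, the sample mean is given
by
▪ $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$
o For symmetric distributions
▪ MEAN = MEDIAN
o For skewed distributions
▪ MEAN ≠ MEDIAN (the mean is pulled toward the longer tail)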
Spread of Distribution
❖ Spread describes how much data vary about their center
(dispersion/variability)
o Spread is an important aspect of a distribution
Range
❖ Range= Maximum – Minimum value
o Ex: Data (4, 6, 8, 8, 10, 15)
▪ Range = 15 – 4 = 11
❖ Simple to calculate but not always helpful
o Data set 1: 4, 4, 4, 4, 4, 10
o Data set 2: 4, 5, 6, 7, 8, 9
▪ Similar range (6 vs 5) but not the same spread
• No variability in the first set apart from a single outlier (10),
while the second set is evenly spread
❖ Range is very sensitive to outliers
o Depends only on extreme values
o Better alternative is interquartile range
Quartiles
❖ 3 values that divide distribution into 4 parts, each containing ¼ of data
❖ To find Q1 & Q3, split the sorted data in two halves & find the median of each
half
o When the number of data values is odd, include the median value in each half
Interquartile Range
❖ IQR = Q3 – Q1
o Distance between Q1 and Q3
o Resistant (not sensitive) to outliers
Variance
❖ Measures average squared deviation of individual data from their mean
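❖ For values x1, x2, ..., xn the sample variance is given by
o $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
o The standard deviation s is the square root of the variance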
Five Number Summary
❖ Set of five measures giving quick summary of distribution
o Measures consist of: Minimum, Q1, Median, Q3 and Maximum
❖ Gives an idea for both center and spread
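The range, quartiles and IQR described above can be sketched as follows, assuming the split-halves rule for quartiles (other textbooks and software use slightly different quartile rules). `five_number_summary` is a hypothetical helper name.

```python
from statistics import median

def five_number_summary(data):
    """Min, Q1, median, Q3, max using the split-halves rule for quartiles."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    if n % 2 == 0:
        lower, upper = xs[:half], xs[half:]
    else:
        # Odd number of values: the median is included in both halves.
        lower, upper = xs[:half + 1], xs[half:]
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

data = [4, 6, 8, 8, 10, 15]          # the data set from the Range example
lo, q1, med, q3, hi = five_number_summary(data)
data_range = hi - lo                 # Maximum - Minimum
iqr = q3 - q1                        # Q3 - Q1
print("five-number summary:", (lo, q1, med, q3, hi))
print("range:", data_range, "IQR:", iqr)
```

For this data set the range is 11 and the IQR is 4. The IQR ignores the extremes, which is why it is resistant to outliers.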
Boxplot
❖ Visual display of 5-number summary
o Points plotted individually beyond the whiskers are suspected outliers that
require examination
Choosing a Summary
❖ Mean and standard deviation work well for symmetric distributions
without outliers
❖ Five-number summary is better for describing skewed distributions
with outliers
❖ In addition to numerical summaries always try to include the plot of the
distribution
• Better alternative: side-by-side boxplots
o Easy to compare
▪ All boxes have same left axis
o Gives lots of information
▪ Centre, spread, shape and outliers
o Easy to read even for many groups
Percentiles
• Can look at which percentile a student falls in within each distribution
• pth percentile: value such that
o p% of the data are below that value
o (100 − p)% of the data are above that value
o Need to know all data values to find percentile
• Assume we don’t know all other scores in exam but only their means &
standard deviations
• If x is a value from a distribution with mean $\bar{x}$ and standard deviation
s, the standardized value (z-score) of x is
o $z = \frac{x - \bar{x}}{s}$
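A minimal sketch of standardizing, assuming the usual definition z = (x − mean) / SD. The exam means and standard deviations are hypothetical.

```python
from statistics import mean, stdev

def z_score(x, centre, spread):
    """Standardized value: how many standard deviations x lies from the centre."""
    return (x - centre) / spread

# Hypothetical exams: 80 on a test with mean 70, SD 5,
# versus 85 on a test with mean 75, SD 10.
z_math = z_score(80, centre=70, spread=5)
z_english = z_score(85, centre=75, spread=10)
print(z_math, z_english)

# Standardizing a whole data set gives values with mean 0 and SD 1.
data = [4, 6, 8, 8, 10, 15]
x_bar, s = mean(data), stdev(data)
z_values = [z_score(x, x_bar, s) for x in data]
print(round(mean(z_values), 6), round(stdev(z_values), 6))
```

The 80 is the more impressive score here: it sits 2 standard deviations above its mean, while the 85 sits only 1 above.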
o Multiplying or dividing each value by a constant changes the
measurement units of the data
▪ This affects all measures of centre and spread
Distribution Models
• Z-scores measure distances from mean in terms of units of standard
deviations
o Describe how big/small a value is but doesn’t tell what
proportion falls below/above value
• To find proportions from mean & SD need theoretical model of
distribution
o Distribution models described by density curve
Density Curve
• A model for the frequency distribution of data using areas under the
curve to represent relative frequencies
• Consider histogram of exam score data
o Distribution could be approximately described by some density
curve
• The area under a density curve between any two numbers will represent
the proportion of the data that lie between those same numbers
• Any density curve must satisfy the following conditions
o Always positive or zero
o Total area under the curve above the horizontal axis equal to 1.00
Normal Density Curve
• Most important density is normal density curve
o Used to describe bell-shaped distributions
o Defined by 2 parameters
▪ Mean: μ defines the center
▪ SD: σ defines the spread
o All normal curves have the same bell shape but different μ, σ
Proportions from Histogram
• Proportion of data in interval [a,b] is equal to total relative frequency of
bars within the interval
Standard Normal
• Standard Normal: normal distribution with mean μ=0 and standard deviation σ=1
o Is only applicable if the distribution is unimodal and symmetric to begin
with
• Can find any proportion for standard Normal using Table Z or software
• Can find any proportion for any normal by converting to z-scores
and using the standard normal
• Nearly normal condition; the shape of the data’s distribution is
unimodal and fairly symmetric (Check by making histogram)
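A minimal sketch of the "software" route to Normal proportions, using Python's standard-library `statistics.NormalDist` in place of Table Z. The exam model (mean 70, SD 10) is hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Hypothetical model: exam scores follow a Normal with mean 70 and SD 10.
mu, sigma = 70, 10
x = 85

# Convert to a z-score, then look up the proportion below it.
z = (x - mu) / sigma
below = std_normal.cdf(z)

# Proportion within one standard deviation of the mean.
within_one_sd = std_normal.cdf(1) - std_normal.cdf(-1)

print("z =", z)
print("proportion below 85:", round(below, 4))
print("proportion within 1 SD:", round(within_one_sd, 4))
```

Working on the original scale with `NormalDist(70, 10).cdf(85)` gives the same proportion, which is why converting to z-scores loses nothing.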
Checking for Normality
• Normal density is theoretical model; when should we use it to describe
real data?
• Can check histogram
o Look for bell-shape (uni-modal & symmetric)
Normal Probability Plot
• Plot data values against their theoretical Normal z-score
• If points lie close to straight line the data is well described by normal
Lecture 4 Scatter plots, Association and Correlation
Scatterplot
• Variables measured along horizontal (x-) and vertical (y-) axis; each
dot represents combination of corresponding individual's values
Roles of Variables
• Usually there is a variable of interest (response/dependent variable) and
a variable whose effect on the response we want to examine
(explanatory/independent)
o Ex: We want to study whether BP increases with Age; how
would we classify the variables?
▪ Response variable: Blood pressure
▪ Explanatory variable: Age
Types of Relationships
• Overall pattern of scatterplots describes form, direction and strength of
relationship
o Form
▪ Linear
▪ Non-linear
o Direction
▪ Positive relationship
▪ Negative relationship
o Strength
▪ Strong relationship
▪ Weak relationship
Outliers
• Scatterplots also help identify outliers
o Extreme deviations from the mean and overall pattern
Correlation
• Correlation coefficient (r)
o Numerical measure of linear relationship between 2 variables
▪ r is always a number between –1 and 1
▪ r describes strength & direction of linear relationship
ONLY
o For a given X I can predict Y
• y = true value at x
• ŷ (y-hat) = predicted/fitted value at x
• y − ŷ = residual
• b0 = y-intercept
• b1 = slope
Linear Regression
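• Best fitting line $\hat{y} = b_0 + b_1 x$ minimizes the sum of squared
residuals $\sum (y - \hat{y})^2$; its coefficients are given by
o $b_1 = r\,\frac{s_y}{s_x}$
o $b_0 = \bar{y} - b_1\bar{x}$ (ALWAYS CALCULATE b1 FIRST)
o Where
▪ r = correlation between x and y
▪ sx, sy = standard deviations of x, y
▪ $\bar{x}$, $\bar{y}$ = means of x, y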
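A minimal sketch of fitting the least-squares line from summary statistics, assuming the standard formulas slope = r·sy/sx and intercept = ȳ − slope·x̄. The data are made up, and `correlation` and `fit_line` are hypothetical helper names.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson r: sum of products of z-scores divided by n - 1."""
    mx, my, sx, sy = mean(xs), mean(ys), stdev(xs), stdev(ys)
    n = len(xs)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / (n - 1)

def fit_line(xs, ys):
    """Return (b0, b1, r) for the least-squares line y-hat = b0 + b1 * x."""
    r = correlation(xs, ys)
    b1 = r * stdev(ys) / stdev(xs)     # slope first
    b0 = mean(ys) - b1 * mean(xs)      # then intercept
    return b0, b1, r

# Made-up data for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1, r = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print("b0 =", round(b0, 3), "b1 =", round(b1, 3), "r =", round(r, 3))
print("sum of residuals:", round(sum(residuals), 9))
```

The residuals of a least-squares line always sum to zero, up to rounding.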
Regression and Correlation
• Correlation coefficient (r) between two variables equals the
slope of the linear model of standardized values (z-scores)
Residual Standard Deviation
• If linear regression assumptions are satisfied we can measure prediction
accuracy using residual standard deviation
• Proportion of y-variation left in residuals = 1 − R²
Lecture 5: Linear Regression & Regression Wisdom
Regression Diagnostics
• Can fit linear model to any set of data
o Ex: X, Y don’t need to be linearly related
• Want to check whether linear model offers good description of data
o Can use scatterplot; but residual plot often provides a better
picture
• Anscombe’s quartet
o Comprises four datasets that have similar statistical properties –
even the slopes of regression lines are similar
▪ r1-r3: 0.816
Residual Plot
• Plot residuals (y-yhat) against x
• Residual plot should be evenly scattered around 0 with no particular
pattern
o Mean of residual is always 0
• Uneven dispersion
o If residual spread changes with x – linear model is not evenly
accurate throughout X
Residual Standard Deviation
• If linear regression assumptions are satisfied, we can measure
prediction accuracy using residual SD
o A.k.a error SD
• Se measures average distance between true and predicted values
Coefficient of Determination
• How useful is linear regression model in describing the y-value?
o Compare y-data's variation to residual variation from linear
model
• Proportion of y-variation accounted for by linear model
• Equal to squared coefficient of correlation: R² = r²
• Proportion of y-variation left in residuals
o = 1 − R²
X-variable Predictions
• For predicting x based on y-variable should not just use y = b0 + b1 *x
model in reverse
o Instead fit new model: x = b0 + b1 *y
▪ Better because it minimizes the sum of (x-hat – x)²
Diagnostics
• Before using linear model, always check assumptions
o Linearity assumption
▪ Straight line form
o Equal spread assumption
▪ Data evenly spread around straight line
• Residuals should be evenly scattered around 0, with no particular
pattern
Subset Regressions
• Sometimes assumptions fail because data combine groups with
different behaviors
o Ex: Height vs Weight of 18-year-olds
• Fit separate regressions for male/female subsets
Extrapolation
• The farther away the new x-value is from the mean of x the less
trustworthy the predicted value should be
• Extrapolations are dubious because they require the assumption that
nothing about the relationship between x and y changes even at an
extreme value of x
Re-expressing Data
• Sometimes can improve linear regression assumptions by modeling
function of the data instead of their original values
• Typical choice of functions are as follows
o Apply function to y-variable first and perhaps to x-variable if
necessary
Lurking Variables and Causation
• No matter how strong the association or how large the R2 value there
is no way to conclude from a regression alone that one variable
causes the other
o There is always the possibility that some third variable is driving
both of the variables you have observed
• With observational data there is no way to be sure that a lurking
variable is not the cause of any apparent association
Outliers, High leverage and Influential points
• Should always check scatter/residual plot for outliers and influential
points
• Influential point is observation that significantly affects fitted line
o Removing influential point from regression would change fitted
line considerably
o Influential points typically have an x-value that is far from the
observed range of x-values
o We say a point is influential if omitting it from the analysis
changes the model enough to make a meaningful difference
• High leverage point
o A data point can be unusual if its x-value is far from the mean of
x-values
▪ Such points are said to have high leverage
o Data points whose x-values are far from the mean of x are said to
exert leverage on a linear model
▪ High-leverage points pull the line close to them so they
have a large effect on the line
• With high enough leverage their residuals can be
deceptively small
o A point with high leverage has the potential to change the
regression line but doesn’t always use that potential
o Sample: Ask SOME people
• Population: entire group of individuals that we want information about
• Sample: Smaller group of individuals selected from population
o A subset of a population examined in hope of learning about the
population
Parameter
• Population parameter: numerical measure describing population; fixed
but unknown value which we want to find out about
o We rarely expect to know the true value of a population
parameter but we do hope to estimate it from sample
Sample Statistic
• Numerical measure describing sample; used to estimate corresponding
population parameter
• When the statistics we compute from the sample accurately reflect the
corresponding parameter then the sample is said to be representative
Sampling Variability
• Results can vary depending on which sample gets selected
Sample Size
• Sampling variability can be controlled by the sample size
o Number of individuals in the sample
o Larger sample – lower variability
• It is the sample size and not the size of the population that really
matters
o Only exception: If the population is small enough and the
sample size is more than 10% of the whole population then
population size CAN matter
Bias
• Sampling methods that tend to over or underemphasize characteristics
of the population
• Any systematic difference between sample statistics & population
parameter called bias
Randomization
• Protects us from the influences of all features of our population by
making sure on average the sample looks like the population
• To avoid bias, select sample at random
o Every individual in population has same chance of being
included in sample
• Randomization makes the selected sample representative of the
population on average
o In the long run every individual appears equally often
o Able to represent population in all ways
Simple Random Sampling (SRS)
• A simple random sample of size n consists of n individuals from the
population chosen in a way that every possible set of n individuals has an
equal chance of being selected
Sampling Frame
• In order to perform random sampling we need a record of the entire
population
• The sampling frame is the list of individuals in the population from
which a sample is drawn
o Might not necessarily be the same as population
• Try to use the sampling frame that is as close as possible to population
of interest
Stratified Sampling
• Sometimes population is divided into groups of similar individuals
called strata
• To ensure more representative sample we use stratified random
sampling
o Select separate SRS’s within each stratum (separate sample sizes
relative to stratum) and combine them to get stratified sample
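A minimal sketch of an SRS and a proportionally allocated stratified sample, using the standard-library `random` module. The strata, ID labels, seed and 10% sampling fraction are all hypothetical.

```python
import random

def simple_random_sample(population, n, rng):
    """SRS: every set of n individuals is equally likely to be chosen."""
    return rng.sample(population, n)

def stratified_sample(strata, fraction, rng):
    """Take an SRS within each stratum, sized in proportion to the stratum."""
    sample = []
    for members in strata.values():
        k = round(fraction * len(members))
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(22)   # fixed seed so the sketch is reproducible

# Hypothetical sampling frame: 60 first-year and 40 upper-year students.
strata = {
    "first_year": [f"F{i}" for i in range(60)],
    "upper_year": [f"U{i}" for i in range(40)],
}
population = strata["first_year"] + strata["upper_year"]

srs = simple_random_sample(population, 10, rng)
strat = stratified_sample(strata, 0.10, rng)
print("SRS:       ", srs)
print("Stratified:", strat)
```

The stratified sample always contains 6 first-year and 4 upper-year students, whereas the composition of an SRS varies from sample to sample.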
Cluster Sampling
• SRS can sometimes be impractical
• If population is divided into clusters, by geographic or other boundaries, it
is easier to perform SRS of clusters
• Select a number of clusters at random and perform a census within
each of them
o Instead of a complete census we randomly select only a portion
of the individuals in each cluster
Systematic Sample
• Individuals in population are often ordered
• If order is not associated with response we can use systematic
sampling: sampling every nth individual, starting from a randomly chosen
individual
• A sample drawn by selecting individuals systematically from a
sampling frame is called a systematic sample.
• A sampling scheme that combines several sampling methods is called a
multistage sample.
samples are then drawn
from each stratum.
• Voluntary response sample: in voluntary response sampling, a large group
of individuals is invited to respond, and all who do respond are counted.
• Typically, cheaper than sample survey however we cannot assign
questions/choices
• Retrospective study: pick individual and extract historical data on them
o Pro: doesn’t take time, can span longer periods
o Con: cannot control/correct past data recording
• Prospective study: pick individuals and collect data as events happen
over time
o Pro: have control of observation process
o Con: more time consuming
• Observational studies also useful for finding trends & possible variable
relationships
• Observational studies do not demonstrate a causal relationship
o Ex: It is not necessarily true that exercise reduces insomnia
• In order to demonstrate causal relationship we need to perform an
experiment
Experiments
• How do we establish cause & effect?
o Randomly select some subjects and instruct them to exercise and
remaining subject not to exercise
▪ Assess and compare results from both groups
• How does this help?
o Choosing two groups at random means they start out relatively
equal in terms of any characteristic that might matter
o If the groups end up with different responses, the difference can be
attributed to the treatment rather than to pre-existing differences
• Placebo
o Fake treatment designed to look like a real one, used when just
knowledge of receiving any treatment can affect response
▪ For comparing results, often use current standard treatment
as a baseline
▪ Subjects getting placebo/standard treatment called control
group
• Experiments require random assignment, a factor and at least one
response variable
Experiment Terminology
• Experimental units/subjects
o Individual participating in experiment
• Factor
o Explanatory variable whose level can be manipulated by
experimenter
▪ Levels: Specific values chosen for factor
▪ Treatment: Specific combination of manipulated levels of
one or more factors
• Response: variable whose values are compared across treatments
• Statistically Significant: a factor effect so large that it would rarely
occur by chance
Principles of Experimental Design
• Control
o Control sources of variation besides factors, by making
conditions as similar as possible for all treatment groups
• Randomize
o Helps equalize effects of unknown/uncontrollable sources of
variation
o Assigning experimental units to treatment at random allows us to
use statistical methods to draw conclusions
• Replicate
o Get several measurements of response for each treatment
o Two kinds of replication show up in comparative experiments
▪ We should apply each treatment to a few subjects
▪ Entire experiment is repeated on a different population of
experimental units
• Blocking
o For a variable we can identify but cannot control and which affects the
response, divide subjects into groups (blocks) with the same value of that
variable and randomize within each block
▪ Removes much of the variability due to the difference
among the blocks
• Group similar individuals together and then randomize
within these blocks
Experimental Designs
• Completely Randomized Design
o All experimental units are allocated at random among all
treatments
• Randomized Block Design
o Random assignment of units to treatment is carried out separately
within each block
Blinding
• Knowledge of assigned treatment can often influence the assessment of
the response
o Two classes of individual can affect experiment
▪ Those who influence results
▪ Those who evaluate results
• Blinding avoids bias from knowing treatment
o Single-blind: every individual in one of the two classes doesn’t
know treatments
o Double-blind: every individual in both of the two classes doesn’t
know treatments
Lecture 8: Probability
Probability
• Many things in life are random (uncertain)
• Probability describes how likely ‘something’ is to happen
o Probabilities take values between 0 and 1
▪ A value of 1 means event is sure to happen
▪ A value of 0 means event is not going to happen
Probability Terminology
• Random phenomenon: process whose result is not known beforehand
o Ex: Rolling a standard die
• Trial: particular realization of random phenomenon
o Ex: particular roll of a standard die
• Outcome: basic result of a trial that cannot be broken down to simpler
results
o Ex: Rolling a 1 is an outcome
• Sample space: collection of all possible outcomes
o Sample space is usually denoted by S
o Ex: For rolling a standard die S = {1, 2, 3, 4, 5, 6}
Events
• A collection of one or more outcomes
o Usually denoted by capital letters (A, B etc.)
o Ex: A = rolling an even number
▪ Event can contain just single outcome
• Probabilities are assigned to each event
• Example: Students course grade is random phenomenon with sample
space
Empirical Probability
• Observe a sequence of trials (e.g. coin tosses) and count the number of times
an event occurs
o Relative frequency of event = # of times event occurs / total number of trials
▪ In short term; unpredictable
▪ In long term; predictable
• Limiting relative frequency value is called the empirical probability of the
event, denoted by P (X)
Law of Large Numbers (LLN)
• LLN guarantees relative frequency will eventually settle on a specific
value
o LLN holds if trials are independent
▪ Outcomes of one trial do not influence that of another
• LLN does not apply in short run
o Ex: Assume you flip a coin, and the first 10 outcomes are tails, is
the next flip more likely to be heads? No.
• LLN only applies in the long run
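A small simulation sketch of the LLN for a fair coin: the running relative frequency of heads is erratic early on and settles near 0.5 over many flips. The seed and number of flips are arbitrary choices.

```python
import random

def running_proportion_of_heads(n_flips, rng):
    """Relative frequency of heads after each of n_flips independent fair flips."""
    heads = 0
    proportions = []
    for i in range(1, n_flips + 1):
        heads += rng.random() < 0.5      # True counts as 1
        proportions.append(heads / i)
    return proportions

rng = random.Random(0)
props = running_proportion_of_heads(100_000, rng)
for n in (10, 100, 1_000, 100_000):
    print(f"after {n:>7} flips: {props[n - 1]:.4f}")
```

Each flip is generated independently of the earlier ones, so a run of tails does nothing to make heads more likely on the next flip. The long-run proportion settles only because early streaks get swamped by the sheer number of later flips.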
Other Ways To Assign Probability
• Theoretical Probability
o Sometimes P(X) can be deduced from mathematical model
▪ For equally likely outcomes (rolling a fair die) the
probability can be shown as follows
• P (X) = # of outcomes in X / total # of outcomes
• Personal/ Subjective Probability
o Assign P(X) based on personal beliefs
▪ Typically, experts' views
▪ Most general case; can always be used
Venn Diagrams
• Method for representing events graphically
o Sample space (S): rectangle
o Outcomes: points in S
o Events: areas in rectangle
• Examples: Roll of a die
o Event X = {5, 6} = Roll greater than or equal to 5
Composite Events
• Given events X and Y form new events as:
o X or Y (union)
▪ Either event X or Y or both occur
o X and Y (intersection)
▪ Both events X and Y occur simultaneously
Disjoint Events
• Events are called disjoint (mutually exclusive) if they have no
outcomes in common
Probability Rules
• No matter how probabilities are assigned they must satisfy the
following three rules
o Probability of any event is between 0 and 1
o Probability of all outcomes is 1
▪ P (S) =1, where S = sample space
o For two disjoint events X & Y
▪ P (X or Y) = P (X) + P (Y)
Complement Rule
• Probability of Xc (the complement of X: X does not occur) can be found from
that of X using the complement rule
o P (Xc) = 1- P(X)
• This rule is useful when P(Xc) is easier to find than P(X) directly
General Addition Rule
• For any two events X and Y (which may overlap)
o P (X or Y) = P (X) + P (Y) - P (X and Y)
• If events are disjoint (no overlap)
o P (X and Y) equals 0
▪ P (X or Y) = P (X) + P (Y)
• Regardless they should equal to 0
Conditional Probability
• Probability of event X given that event Y has occurred
o Denoted by P (X|Y), read as "X given Y": probability of X if all of the
possible outcomes are restricted only to the ones in Y
• Given by the formula
o P (X|Y) = P (X and Y) / P (Y)
▪ Probability of both of them occurring divided by the
probability of the ‘given’ occurring alone
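A minimal sketch that checks the complement, general addition and conditional probability rules by listing outcomes for one fair die. The events X and Y are chosen just for illustration, and `prob` is a hypothetical helper that assumes equally likely outcomes.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}            # sample space for one fair die

def prob(event):
    """Equally likely outcomes: # outcomes in event / # outcomes in S."""
    return Fraction(len(event & S), len(S))

X = {2, 4, 6}                     # roll is even
Y = {5, 6}                        # roll is at least 5

p_complement = 1 - prob(X)                   # complement rule
p_or = prob(X) + prob(Y) - prob(X & Y)       # general addition rule
p_x_given_y = prob(X & Y) / prob(Y)          # conditional probability

print("P(Xc)     =", p_complement)
print("P(X or Y) =", p_or)
print("P(X | Y)  =", p_x_given_y)
```

Counting the outcomes in the union {2, 4, 5, 6} directly gives the same P(X or Y), which confirms that subtracting P(X and Y) removes the double-counted outcome 6.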
Independence
• Events X and Y are independent if the occurrence of one does not
change the probability of the other
o X, Y independent means P (X|Y) = P (X) or, equivalently, P (Y|X) = P (Y)
▪ EVENTS ARE INDEPENDENT ONLY IF THE
PROBABILITY OF ONE DOES NOT CHANGE FROM
THE OTHER
• Independent is NOT the same as mutually exclusive
o If X, Y are mutually exclusive (and each has nonzero probability) then they
are dependent
o If X, Y are not mutually exclusive they can either be independent
or not depending on their probabilities
Multiplication Rule
• For any events X, Y
o P (X and Y) = P(X) x P (Y|X) or P(Y) x P (X|Y)
• If events are independent then
o P (X and Y) = P (X) x P(Y)
▪ PROBABILITY THEY BOTH OCCUR
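A minimal sketch that tests independence through the multiplication rule, using two fair dice. The events are chosen for illustration, and `independent` is a hypothetical helper name.

```python
from fractions import Fraction
from itertools import product

S = set(product(range(1, 7), repeat=2))     # two fair dice: 36 equally likely outcomes

def prob(event):
    return Fraction(len(event), len(S))

def independent(E, F):
    """Independent exactly when P(E and F) = P(E) x P(F)."""
    return prob(E & F) == prob(E) * prob(F)

A = {o for o in S if o[0] == 6}             # first die shows 6
B = {o for o in S if o[1] == 6}             # second die shows 6
C = {o for o in S if o[0] + o[1] == 12}     # sum is 12
D = {o for o in S if o[0] == 1}             # first die shows 1 (disjoint from A)

print("A, B independent:", independent(A, B))
print("A, C independent:", independent(A, C))
print("A, D independent:", independent(A, D))
```

A and D are mutually exclusive, so P(A and D) = 0 while P(A) x P(D) is positive: knowing one occurred rules out the other, which makes them dependent.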
Lecture 9: Random Variables
Random Variables
• Random variable X assigns a single number to every outcome of an
experiment
o Ex: Consider flipping 2 fair coins
▪ Random variable X = # of heads in 2 flips
o Typically, the result of counting something
▪ Ex: the number of children in a family
• Continuous
o Can take any value within an interval
▪ Ex: all values in the interval [0, 1]
o The result of some type of measurement
▪ Ex: temperature
Probability Model
• A table, graph or formula that describes the following:
o The values of a random variable
o The probability associated with each value
• Probability of X taking some specific value x is denoted by P (X = x)
or P(x)
o Ex: the sum of two dice rolls- P(X=5) = 4/36
• For any probability
Expected Values
• Probability model of discrete random variable can be described
numerically as follows
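A minimal sketch of the numerical summaries of a discrete probability model, assuming the standard definitions E(X) = Σ x·P(x) and Var(X) = Σ (x − μ)²·P(x). The model is X = number of heads in 2 fair coin flips; the function names are hypothetical.

```python
from fractions import Fraction
from math import sqrt

# Probability model for X = number of heads in 2 fair coin flips.
model = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def expected_value(model):
    """E(X): each value weighted by its probability."""
    return sum(x * p for x, p in model.items())

def variance(model):
    """Var(X): expected squared deviation from the mean."""
    mu = expected_value(model)
    return sum((x - mu) ** 2 * p for x, p in model.items())

mu = expected_value(model)
var = variance(model)
sd = sqrt(var)
print("E(X) =", mu, " Var(X) =", var, " SD(X) =", round(sd, 4))
```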
Linear Transformations
• Often need to change the values of random variables
Example
• Let X = # of heads in 2 fair coin flips
o Gamble offers $4 for every head but must pay $5 to play
• Let Y = net gain from the gamble
o If X = 0, Y= -5 + 4 x 0 = -5
o If X = 1, Y= -5 + 4 x 1= -1
o If X = 2, Y = -5 + 4 x 2 = +3
• Generally, Y = -5 + 4X
Centre & Spread of Linear Transformations
• Don’t need to calculate center & spread of linear transformations of
random variables from scratch
• If Y = aX + b for any constants a and b then
o $E(Y) = a\,E(X) + b$
o $Var(Y) = a^2\,Var(X)$
o $SD(Y) = |a|\,SD(X)$
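A minimal sketch that checks the shortcut rules E(aX + b) = a·E(X) + b and Var(aX + b) = a²·Var(X) against a from-scratch calculation, using the gamble Y = −5 + 4X with X = number of heads in 2 fair flips. The helper names are hypothetical.

```python
from fractions import Fraction

def expected_value(model):
    return sum(x * p for x, p in model.items())

def variance(model):
    mu = expected_value(model)
    return sum((x - mu) ** 2 * p for x, p in model.items())

# X = number of heads in 2 fair coin flips; Y = -5 + 4X is the net gain.
model_x = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
a, b = 4, -5
model_y = {a * x + b: p for x, p in model_x.items()}   # same probabilities, new values

# From scratch, using the model of Y directly.
mean_y, var_y = expected_value(model_y), variance(model_y)

# Shortcut, using only the mean and variance of X.
shortcut_mean = a * expected_value(model_x) + b
shortcut_var = a ** 2 * variance(model_x)

print("E(Y):  ", mean_y, "=", shortcut_mean)
print("Var(Y):", var_y, "=", shortcut_var)
```

The expected net gain is −1, so on average the gamble loses a dollar per play. The shift b moves the centre but has no effect on the spread.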
Combinations of Normal Random Variables
• For normal and independent RV’s in particular, means and standard
deviations are all we need to know to calculate probabilities
• If X1 is Normal, X2 is Normal and they are independent then X1 +/-
X2 is Normal
o Use this result to calculate probabilities of combination using
Normal distribution
Bernoulli Trial
• Bernoulli Trial: trial with only two outcomes
o Ex: True/False, Yes/No, Heads/Tails
▪ Usually labeled success (1) and failure (0)
o P (success) = p, P (failure) = 1-p
▪ Ex: P (1) = ½ & P (0) = 1- ½ =½
• Trials form basis of many common probability models
Binomial Model
• Several Bernoulli trials but only interested in total number of successes
• Binomial Setting
o Fixed number (n) of Bernoulli trials
o Same probability of success for each trial
o Bernoulli trials are independent
• Binomial Random Variable
o X = # of successes in a Binomial setting
Binomial Distribution
• If X follows binomial distribution with n trials and probability of
success p
o X takes values from 0 to n
o Probabilities are given by the formula
▪ $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$, for k = 0, 1, ..., n
Mean and Variance of Binomial
• If X is a Binomial random variable, then:
o Mean: $\mu = np$
o Variance: $\sigma^2 = np(1-p)$
o Standard deviation: $\sigma = \sqrt{np(1-p)}$
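A minimal sketch of the binomial probabilities, assuming the standard formula P(X = k) = C(n, k)·p^k·(1 − p)^(n − k), and checking that the mean and variance computed from them match np and np(1 − p). The choice n = 10, p = 0.5 is arbitrary.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.5                     # e.g. number of heads in 10 fair coin flips
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Mean and variance computed directly from the probability model.
mean_from_pmf = sum(k * pk for k, pk in enumerate(pmf))
var_from_pmf = sum((k - mean_from_pmf) ** 2 * pk for k, pk in enumerate(pmf))

print("P(X = 5) =", binomial_pmf(5, n, p))
print("mean:", mean_from_pmf, "vs np =", n * p)
print("variance:", var_from_pmf, "vs np(1-p) =", n * p * (1 - p))
```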
Mean and Variance of Sampling Distribution
• What is sampling distribution’s mean?
Sample Proportion
• Bernoulli population: Each population subject belongs to one of two
categories
o Example: male/female, employed/unemployed
• Population parameter (p)
o Ex: proportion of male students
• Random sample of size (n)
o X = # of sample subjects in category of interest
o X follows binomial with parameters n, p
• Sample proportion
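o Sample estimate of population proportion p
▪ $\hat{p} = X/n$
o If we don’t know p then we can’t find the true standard deviation of
$\hat{p}$, $\sqrt{p(1-p)/n}$
o Approximate it with the standard error $SE(\hat{p}) = \sqrt{\hat{p}(1-\hat{p})/n}$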
Assumptions and Conditions
• Independence Assumption
o Randomization Condition: data must be sampled at random or
generated from properly randomized experiment
o 10% Condition: sample size (n) must be < 10% of the population
size
• Sample size assumption: n is large enough to use CLT (normal
approximation)
o Success/Failure Condition: must expect at least 10 “successes”
and 10 “failures” in sample
Proportion Confidence Interval
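• If conditions are met, build confidence interval for proportion at C% as
follows
o $\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$, where z* is the
standard Normal critical value for confidence level C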
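A minimal sketch of a one-proportion z-interval, assuming the standard form p̂ ± z*·SE(p̂) with SE(p̂) = √(p̂(1 − p̂)/n). The counts are made up, `proportion_ci` is a hypothetical helper, and the Success/Failure check uses the observed counts.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    """Confidence interval for a population proportion (Normal approximation)."""
    if successes < 10 or n - successes < 10:
        raise ValueError("Success/Failure condition not met")
    p_hat = successes / n
    # Critical value z*: leaves (1 - confidence)/2 in each tail.
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = sqrt(p_hat * (1 - p_hat) / n)
    margin = z_star * standard_error
    return p_hat - margin, p_hat + margin

# Made-up sample: 120 of 200 sampled students are in the category of interest.
low, high = proportion_ci(120, 200)
print(f"95% CI for p: ({low:.4f}, {high:.4f})")
```

Raising the confidence level increases z* and widens the interval; a larger sample shrinks the standard error and narrows it.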
o Retain innocence unless evidence makes it unlikely beyond
reasonable doubt
• Similarly in hypothesis testing, the null hypothesis is presumed true unless
the data provide strong evidence against it
Test Statistic
• How should data behave if the null hypothesis were true? (use sampling
distributions)
• If H0: p = p0 (plus same conditions as for confidence intervals)
o $\hat{p}$ is approx. Normal with mean p0 and standard deviation $\sqrt{p_0(1-p_0)/n}$
o Test statistic: $z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$
P-value
• How can we quantify evidence against the null hypothesis?
• Look at the probability of getting equally or more extreme data (test
statistic) if the null hypothesis is true
• P-value tells us how surprising the data observed is under the null
hypothesis
o Calculated as probability from sampling distribution of test
statistic
• For two-sided test, p-value is twice the probability of whichever tail the
observed statistic falls in
Conclusion
• Reject the null hypothesis if p-value is small
• We choose a cut-off point (α) ahead of time
o α is called the significance level
o Typically a value of 5%, can use even smaller values if we need
to be sure when rejecting null hypothesis
• If p-value < α – reject the null in favour of the alternative
• If p-value ≥ α – fail to reject the null (this does not prove the null is true)
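A minimal sketch of a one-proportion z-test, assuming the usual test statistic z = (p̂ − p0)/√(p0(1 − p0)/n) and a Normal sampling distribution under the null. The counts are made up, and `one_proportion_z_test` is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z_test(successes, n, p0, alternative="two-sided"):
    """Return (z, p-value) for H0: p = p0."""
    p_hat = successes / n
    sd_under_null = sqrt(p0 * (1 - p0) / n)   # uses p0, not p_hat
    z = (p_hat - p0) / sd_under_null
    std_normal = NormalDist()
    if alternative == "greater":
        p_value = 1 - std_normal.cdf(z)
    elif alternative == "less":
        p_value = std_normal.cdf(z)
    else:
        # Two-sided: twice the tail area beyond the observed statistic.
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    return z, p_value

# Made-up data: 60 successes in 100 trials, testing H0: p = 0.5.
alpha = 0.05
z, p_value = one_proportion_z_test(60, 100, 0.5)
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"z = {z:.3f}, p-value = {p_value:.4f} -> {decision}")
```

With these made-up counts the two-sided p-value falls just below 0.05, so the null is rejected at the 5% level but would not be at the 1% level.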
Lecture 13: More About Tests
One-Proportion Z-test
Hypothesis Test Errors
• What does significance level (α) represent?
o A test can make two types of mistakes
▪ Type 1: null hypothesis is true, but we reject it
▪ Type 2: null hypothesis is false, but we fail to reject it
o α = P (Type 1 error | null hypothesis is true)
o β = P (Type 2 error | alternative hypothesis is true)
▪ Alternative hypothesis true (p > p of null hypothesis)
• From CLT we know that the sampling distribution of the sample mean
($\bar{x}$) is approximately normal with mean μ and standard deviation σ/√n
• Characteristics of t-distribution
o Symmetric around 0 & bell-shaped
o More spread (variance) than standard normal (z)
o As degree of freedom increases, the distribution gets closer to
standard normal
• T-distribution always more spread out than standard normal to account
for uncertainty of using estimate (s) instead of σ
One-sample t-interval for μ
• Confidence interval for μ
o $\bar{x} \pm t^*_{n-1}\,\frac{s}{\sqrt{n}}$, with n − 1 degrees of freedom
▪ For sample sizes larger than 40 the methods are still safe to
use unless the data are extremely skewed
T-distribution Critical Values
• Need critical values of (t*) from t-distribution
• Table T gives critical t values for 5 common levels and various degrees
of freedom
o Columns correspond to confidence level
o Rows correspond to df
o Last row gives standard normal critical values
Relationship between Intervals & Tests
• Confidence intervals and hypothesis tests are built from the same
calculations
o Complementary ways of looking at the same question
o The confidence interval contains all the null hypothesis values we
can’t reject with the data
• A level C confidence interval contains all the possible null hypothesis
values that would NOT be rejected by a two-sided test
o Alpha level 1-C
▪ 95% confidence interval matches a 0.05 level test
• Confidence intervals are naturally two-sided so they match exactly with
two-sided hypothesis tests
o When the hypothesis is one-sided the corresponding alpha level
is (1-C)/2
Determining the Sample Size
• To find the sample size needed for a particular confidence level with a
particular ME (solve equation for n):
o $ME = t^*_{n-1}\,\frac{s}{\sqrt{n}}$, so $n = \left(\frac{t^*_{n-1}\,s}{ME}\right)^2$
• The problem with this equation is that we don’t know most of the
values
o We can use (s) from a small pilot study
o We can use z* in place of the necessary t-value
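A minimal sketch of the sample-size calculation, assuming the margin of error for a mean is ME = z*·s/√n, with z* standing in for the unknown t-value and s taken from a pilot study. The values s = 15 and ME = 3 are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_mean(s, margin_of_error, confidence=0.95):
    """Smallest n with z* * s / sqrt(n) <= margin_of_error."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Solve ME = z* * s / sqrt(n) for n, then round up to a whole individual.
    return ceil((z_star * s / margin_of_error) ** 2)

# Hypothetical pilot study: s = 15; we want ME = 3 at 95% confidence.
n = sample_size_for_mean(s=15, margin_of_error=3)
print("required sample size:", n)
```

Halving the margin of error roughly quadruples the required sample size, because n grows with 1/ME².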