POL244 Midterm Lectures
Winter 2024
Wednesdays 1-3pm @ MN1170
Part I
Data I: on Research Design, Experiments,
Interviews and Questionnaires
Week 1 Jan. 10 Introduction and course details
Week 2 Jan. 17 Data I: Research design, experiments, interviews and questionnaires
Week 3 Jan. 24 Data II: Sampling, size and distributions
Announcements
Time slots:
T0101 3-4 pm (@IB377)
T0102 4-5 pm EST (@IB377)
They are conducted by our TA, Mujahed, and begin today after our class
Tutorials
Zoom Meeting
https://utoronto.zoom.us/j/87382638114
At the end of every set of slides posted on Quercus, you will be able to
find the main points and a glossary of terms related to the topic and
concepts we address that week
Research Design
In last week’s introduction we spoke about variables and causation.
In an experiment, the researcher controls the values of X: the experiment’s participants are
randomly assigned to one of two possible values of X
Experiments
Random assignment* of groups ensures that the comparison between them is
as ‘pure as possible, and that some other cause of the DV (say, a factor Z) will
not pollute [or, affect] that comparison.’ (Kellstedt & Whitten)
Randomness ensures that these groups are identical, save for the different
values of X (rather than any of them having particular characteristics that might
skew the testing).
4. Control for effects that may be instead caused by other, ‘confounding’ variables (Z),
rendering the correlation between X and Y spurious
Random assignment of values of X (treatment vs. control) removes any chance that X is
correlated with other variables, like Z (potential confounding variables).
Note that this does not mean there are no other potential causes of Y, but that, thanks
to randomness, the two groups of the experimental setting are equally affected by
them (therefore controlled, allowing us to check for variation based on X).
Experiments
With all four criteria satisfied, we can speak of internal validity – ‘a research
design yielding high levels of confidence in the conclusions about causality
among the cases that are specifically analyzed’ (Kellstedt & Whitten)
External validity, on the other hand, refers to whether a study’s conclusions apply
equally to cases beyond those tested (i.e., whether such conclusions are generalizable);
it requires, among other things, a truly random sample of the population being studied.
Validity
Different research methods fare differently with respect to internal and external
validity. There is no trade-off between the two, and all approaches can be
valuable, but in general it is preferable to have a good measure of both
(that is, to lie within what a few scholars have termed the ‘cone of validity’)
Definition: ‘a research design in which the researcher does not have control over
values of the independent variable, which occur naturally.’ (Kellstedt & Whitten)
Still, a degree of variability in the IV across cases, and variation in the DV must
be present
Observational studies: types (pure)
Cross-sectional: (quantitative) looks at variation in different units at a single point in
time; ‘examines a cross-section of social reality, focusing on variation between
individual spatial units (e.g., citizens, countries, etc.) and explaining the variation in
the DV across them.’ (Kellstedt & Whitten)
While not aspiring to explain causes and effects in terms of general laws and
principles, (like quantitative research), qualitative research also employs
observational studies.
Cross-sectional qualitative research can also resemble a quantitative study’s cross-
sectional structure (e.g., interviews with inventory of issues to be discussed)
In qualitative studies, longitudinal design also examines cases in different times
but without manipulating the IV like in experiments. It involves panel and cohort
studies that study groups in different occasions, or groups sharing the same
experience over time; case studies often include this type of research.
Observational studies: components
There exists an abundance of observed data that can be used in exploring
political phenomena. Many are unstructured and unorganized (e.g., answers to
open questions in interviews, content in books, etc.).
Researchers go through them, deriving categories that form the basis of codes, which
allow the information to be ordered and made available for systematic study.
Observational designs examine data sets --e.g., the values of countries’ GDP in
2021 (the spatial unit being the countries and the time unit 2021), or values of
the single spatial unit Canadian PM’s approval rating across time (the time unit
being the month)
4. Control for effects that may be instead caused by other, ‘confounding’ variables
(Z), rendering the correlation between X and Y spurious
Multiple regression analyses can help researchers uncover if controlling for other
variables reveals an X and Y causal relationship. But one has to try and identify
all possible confounding variables, in order to statistically control for them.
Here, (again) theory and examination of prior studies on the topic can help.
Concepts and their measurements
Concepts are abstract terms that represent and organize characteristics of
objects, phenomena and ideas in the political world
Measures are observable, empirical evidence
Examples:
Political legitimacy (concept) → operationalization → Frequency of anti-gov’t protest (variable)
Validity
When the measure of a concept is represented accurately, it is deemed to be
valid.
In contrast, an invalid measure measures something other than what is
intended.
*E.g., Statistics Canada, UN, IMF, OECD, World Bank, MAR (Minorities at Risk),
COW (Correlates of War), ICB (International Crisis Behavior)
Primary data collection and qualitative research
When there is little knowledge on a topic, qualitative approaches help provide
in-depth, rich explanations and new findings. Researchers draw information
from human subject research through:
Quantitative
• Generalizable, can explain more towards theory-testing and systematic amassment
of knowledge
• Can ignore real-world settings
• Potentially sidesteps human subjects’ perceptions for sake of ‘findings’
• May convey artificial sense of accuracy and precision
• Assumes an objective reality, independent of observation
• Creates power hierarchies (researcher vs. subject)
Data collection: interviews and questionnaires
Overall, survey research follows this sequence of steps:
1. Selecting a population of interest
2. Drawing a sample from this population
3. Devising a number of questions to measure concepts of interest
4. Survey made available to the research subjects
5. Data are collected, cleaned and tabulated
6. Data are analyzed via descriptive and inferential statistics
Before
• Planning (identifying a population to study and sampling)
• Creating an interview framework (set and sequence of questions to ask)
• Knowing the interview schedule
Interviews
Conducting in-person interviews: guidelines (cont.)
During
• Interview in the form of conversation
• Introducing the interviewer and the research to respondents
(who it is, who is conducting this research and what for, how the respondent has
been selected, explain voluntary nature and confidentiality clause, allow
interviewee to ask any questions)
• Establishing rapport
• Using probing cautiously (only if interviewee needs help understanding, or
further clarifications, or if interviewer needs more data, details)
• Avoiding prompting (suggest a possible answer to an open question)
• Recording information (after having obtained consent) during the interview
Interviews
Conducting in-person interviews: guidelines (cont.)
After
• Recording information after the interview (as soon as possible)
• Creating a transcript of the recording or experience towards analysis
• Converting (where appropriate) into datasets and identifying any errors
• Begin analysis (where appropriate)
Interviews
In-person interviews (structured) – checklist
Drawbacks
• Lack of telephone, or land line
• Duration (shorter, short attention span)
• Hearing impairments
• Impersonal (lacks rapport, interviewer cannot see respondent’s reactions)
• Target may be missed (who is responding?)
Interviews
Alternatives to in-person interviews:
• Closed (C) (or, close-ended) - fixed number of concrete answers to select from
Forced choice (limited choice of answers that best reflects
respondents’ position)
Scale (asks respondents to rate their position on a statement –
e.g., ‘strongly agree’, ‘moderately agree’, etc.)
Feeling thermometer (respondents indicate their ‘warmth’ – e.g., ‘how
do you feel about x?’)
Questionnaires: types and formats of questions
Example (modified from Besco)
• Open (O) question: “What was the primary reason for applying to UTM?”
.............................. . . . . ......
• Closed (C) question: “What was the primary reason for applying to UTM?”
a. School’s ranking and reputation
b. Quality of program of study
c. Lower fees compared to U.S. universities
d. Proximity to home
e. Family member, friend or alumnus/a recommendation
f. Other: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Questionnaires: design
How to design questions (O, C) (Bryman and Bell)
• Order questions bearing in mind the possible effect an earlier one might have
on a latter one (e.g., ‘Do you know how many volts it takes to kill a human
being?’ followed by ‘Do you agree with the death penalty?’)
• Ask general questions ahead of specific ones to set the tone
• Pose important questions early to capture respondents’ attention before it
wanes
• Postpone asking uncomfortable questions (that might affect the respondent)
for later in the questionnaire
• Group set of questions according to themes (e.g., questions 1-5 on personal
data, rather than all over the place)
• Consider using existing questionnaires (saves time, allows for comparison)
Questionnaires: design
How to design (C) questions (cont.)
• Allow for replies that the researcher may have not thought about
• Responders can expand on a topic and offer additional insights and focus
• Can provide pointers (what topic is important) for closed-format questionnaires
• Standardized
• Easier, faster to complete
• Fixed set of clear (hopefully) answers, render research clearer to respondents
• Help avoid intra- and inter-interviewer lack of consistency
• Low cost
• Lower response and processing time
• Fewer unanswered questions
• Better response to open questions
At the same time…
• Lower response rates
• Limited to those with online access
• Confidentiality and anonymity issues
• Multiple replies
Secondary data
Data not directly collected by researchers themselves but by others.
Can include survey datasets and official statistics (closed-ended measures).
Most quantitative Political Science research is based on them
Experiment (lab, field, quasi-): a research design in which the researcher both
controls and randomly assigns values of the independent variable to the
participants.
Observational study: a research design in which the researcher does not have
control over values of the independent variable, which occur naturally.
Internal Validity: when there is sufficient evidence that a causal relationship exists
between two or more variables
External Validity: the results of study can be generalized beyond the specific
research in which they were generated
Questions: structured (prepared in advance, same for all respondents), open (not
prepared in advance), semi-structured (mixture of rigid and open questions)
Interviews: respondents are asked questions whose answers are recorded for analysis.
Two types: one-person, focus group
Questionnaires: respondents read questions, record own answers, shorter, more rigid.
Still can be open-ended, or closed (forced choice, scale, feeling thermometer)
POL244H
Research Methods for Political Science II
Problem:
Question is too wordy.
Should not be more than 20 words.
Should be able to ask the question comfortably in a single breath.
Questionnaires: design
Examples (Besco)
2. “The NDP will not form the Official Opposition after the next election; the Bloc
Quebecois will. Do you agree or disagree?”
Problem:
A double-barreled (combined) question.
Could agree with one part of the question and disagree with the other.
Questionnaires: design
Examples (Besco)
3. “How often have you read about politics in the paper during the last week?”
Problem:
Assumes respondent has read a newspaper at least once during previous week.
Questionnaires: design
Examples (Besco)
4. “Would you favor or oppose extending the USMCA to include other countries?”
Problem:
Assumes respondents are competent to answer.
May not know what acronym stands for (US-Mexico-Canada Agreement), what it is,
and/or what countries are currently included, etc.
Questionnaires: design
Examples (Besco)
5. “Do you agree or disagree with the supposition that continued constitutional
uncertainty will be detrimental and deleterious to the Quebec’s possibilities for
sustained economic growth?”
Problem:
Question wording is unnecessarily confusing
Questionnaires: design
Examples (Besco)
6. “Do you agree that Canada has an obligation to see that its impoverished citizens
are given a humane standard of living?”
Problem:
Leading because it uses emotionally-laden language to encourage agreement with
the statement (e.g., “impoverished”, “humane”)
POL244H
Research Methods for Political Science II
Winter 2024
Wednesdays 1-3pm @ MN1170
Part I
Data II: Sampling & Descriptive statistics
Week 1 Jan. 10 Introduction and course details
Week 2 Jan. 17 Data I: Research design, experiments, interviews and questionnaires
Week 3 Jan. 24 Data II: Sampling, size and distributions
Announcements
*E.g., Statistics Canada, UN, IMF, OECD, World Bank, MAR (Minorities at Risk),
COW (Correlates of War), ICB (International Crisis Behavior)
Data collection: interviews and questionnaires
Overall, survey research follows this sequence of steps:
1. Selecting a population of interest
2. Drawing a sample from this population
3. Devising a number of questions to measure concepts of interest
4. Survey made available to the research subjects
5. Data are collected, cleaned and tabulated
6. Data are analyzed via descriptive and inferential statistics
• Low cost
• Lower response and processing time
• Fewer unanswered questions
• Better response to open questions
At the same time…
• Lower response rates
• Limited to those with online access
• Confidentiality and anonymity issues
• Multiple replies
Secondary data
Data not directly collected by researchers themselves but by others.
Can include survey datasets and official statistics (closed-ended measures).
Most quantitative Political Science research is based on them
When it is not possible to include the whole population, research focuses on a sample – the
set of observations that a dataset does contain
(remember, datasets contain variables, and each variable represents a particular
characteristic related to a study’s observations)
Sampling: the process of selecting a number of cases from a larger population for study.
Scores of a sample are measured in numeric terms, called sample statistics (numerical
characteristics of a sample).
Sample statistics are used to estimate a population’s parameters.
*Or, in one’s imagination, like Borges’ 1946 “On Exactitude in Science”
Populations and samples
Sampling error: when there is a difference b/w the characteristics of a sample
(statistic) and those of a population (parameter) from which it was selected
Some error in a sample is inevitable, as long as it is not systematic (e.g., a sample
that is non-representative, biased – thus non-random)
Central Limit Theorem: the sum of random variables is itself a random variable
and follows a normal distribution (a distribution with a symmetrical bell shape)
Important for statistical analysis (we will discuss how in future class)
Populations and samples
Quantitative research
Uses large-N studies to identify patterns and generalize (external validity) from
the sample to the population
Process: data are collected, sample statistics are calculated, and used to
estimate the population parameters
Qualitative research
Focuses on small-n studies to uncover rich details and reach some conclusions
that lead to better understanding of the population
• Low cost
• Less time-consuming
1. Accuracy of sampling frame – the list of all units or elements in the target
population
For example, for the 2023-24 population of university students in Canada, the sampling
frame is the list of all registered students in every Canadian university.
Sampling frame needs to include all cases, from which a sample can then be
drawn, otherwise, it may not be representative of the population.
Classic case: the 1936 Roosevelt vs. Landon election misprediction, based on a mail
survey drawn from non-representative lists.
Sample: representativeness – sample selection techniques
2. Sample selection method (two types):
i. Simple (random) – every element within the population has equal chance of
being included in the sample
[For example, draw lots from a bowl; use a table of random numbers
to select a sample from a population; or (better) use a random number generator]
iii. Stratified – population divided into mutually exclusive groups (strata), from
which random or systematic samples are selected [e.g., U of T students by campus]
iv. Proportional stratified (sample strata, proportional to their pop. sizes)
[In this case, one creates a stratified sampling frame, determines strata size
proportional to pop. strata sizes, then selects random sample. Here, it is important
to know pop. proportions]
v. Disproportional stratified
[Same as above, except sample proportions different from population ones-
e.g., to get equal representation of an underpopulated province; used to compare
groups. NB. To reconstruct pop. proportions and make inferences, weights – i.e.,
compensatory mathematical corrections - must be assigned within the dataset]
Sample: representativeness – sample selection techniques
iii. Purposive (or, judgment) – selection of specific cases that provide maximum
information needed for study while ensuring some diversity [used for focus
groups]
Sample: representativeness – sample selection techniques
Non-probability (non-random) sampling (Hiberts et al.):
iv. Snowball (or, network / chain referral)- identification of initial cases that can
refer new ones so that the sample branches (or, snowballs) out [used for
hidden populations with no apparent sample, like drug-users, or political
dissidents in hiding]
More broadly, we speak of a measurement metric (of a variable): the type of values
the variable takes on.
A variable consists of (a) a label or name, and (b) the values we have measured for it.
Types of variables
In order of precision, the types (or, levels of measurement) of variables include:
1. Nominal (or, Categorical): This type of variable is composed of categories that bear
no relationship to one another except that they are different (Bryman et al.)
2. Ordinal: With this type of variable, its categories can be rank ordered – i.e., they
can indicate if observations have more or less of a particular attribute (‘greater
than’, ‘less than’, etc.).
3. Interval / Ratio: This type of variable is the most precise of all types. While
Nominal / Categorical data only indicate difference, and Ordinal ones indicate
order (but not distance), Interval / Ratio ones provide both. Moreover, there exist
units of measurement, and the distances or intervals between categories are
separated by a standard unit.
Representations of data
Datasets can contain a lot of information that provide quick, broad overviews.
This can be represented in a variety of ways via graphs and tables.
For continuous variables ( Interval / Ratio), the visualizations used are Histograms,
Box plots and scatter plots
Representations of data
Tables are useful for displaying summary statistics (Measures of Central Tendency,
and of Dispersion)
For example, we have a set of grades in a class. Before we investigate if they are
correlated (and more) – e.g., with time of study per week - we can learn useful
information about that variable from the dataset: for example,
What is the variation, or, spread of the data in the dataset? (e.g., from 57 to 89)
• Distribution – set of all possible values and frequencies associated with these values
• Central tendency – which is the most typical value?
• Dispersion – how much do the values spread out?
Distribution
Frequency distribution: indicates the number of cases in each category of the
variable
Can be visualized through pie chart (nominal variable), bar chart (nominal or
ordinal variable), line chart (nominal or ordinal variable) or histogram
(continuous variable)
Central Tendency Measures
Mean: the average value of an observation – a common approach used on a daily
basis (e.g., comparing university grades, prices, salaries, commute times, GDP, etc.)
x̄ (pronounced ‘x bar’) = Σ xᵢ / n (summing from i = 1 to n)
Σ: sum (total) from 1 to n
xᵢ = observations – the value of each individual case in a variable (e.g., x1, x2, x3 … xn)
n: sample size
Central Tendency Measures
Example: set of grades in a seminar class of 17 students:
{74, 70, 57, 60, 78, 67, 75, 67, 83, 71, 72, 89, 75, 73, 78, 81, 63}
For n = 17:
x̄ = 1233 / 17 ≈ 72.5
If in a dataset one does not have all the values, the n is the number of observations available.
For example, if there are 100 observations but only 82 are reported, in mean calculations the entries to
be considered are 82 (not 100).
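For illustration, a minimal Python sketch (not from the slides) of the mean calculation above; the list with None entries is a hypothetical variable with missing values:

grades = [74, 70, 57, 60, 78, 67, 75, 67, 83, 71, 72, 89, 75, 73, 78, 81, 63]
print(sum(grades) / len(grades))              # about 72.5, with n = 17

with_missing = [74, None, 57, 60, None, 67]   # hypothetical variable with gaps
observed = [x for x in with_missing if x is not None]
print(sum(observed) / len(observed))          # n is 4 (the observations available), not 6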
Central Tendency Measures
But mean values can be biased – they can be made larger or smaller based on a few
outlier observations
Example:
Observation Annual income in $
1 30,000
2 33,000
3 36,000
4 51,000
5 42,000
6 620,000*
n=6 Mean= 135,300 $
To locate the median, order cases from smallest to largest and identify the middle
observation
NB. Unlike the mean / average (135,300$), the extreme value (observation #6) does
not affect the median
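A minimal Python sketch (using the annual income figures above) of how the outlier pulls the mean but not the median:

import statistics

incomes = [30_000, 33_000, 36_000, 51_000, 42_000, 620_000]
print(statistics.mean(incomes))    # about 135,333: pulled upward by observation #6
print(statistics.median(incomes))  # 39,000: the middle of the ordered cases (here, the
                                   # average of the two middle observations)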
Central Tendency Measures
When is it appropriate to use one or the other? Two considerations
• Level of measurement
Mean assumes values are ordered and have consistent distance between them
Median only assumes that values can be ordered (so, it is good for ordinal
variables)
• Level of how skewed the data are (i.e., if one has extreme observations) that can
yield a biased picture
E.g., the salary of all actors in a film: this is an interval /Ratio type of variable; as there
may be extremes, from stars to extras, best to use median (to avoid bias)
Central Tendency Measures
To obtain an idea of where a distribution(s) of values peaks (esp. for Categorical
variables) the mode is used. It represents the most common / frequently occurring
value for a variable. It can be found by counting the number of cases in each category,
and determining which category is most frequent
Example (Stats Can): points scored by a player during a 10-game hockey tournament
{7, 5, 0, 7, 8, 5, 5, 4, 1, 5 }
Points scored Number of games
0 1
1 1
4 1
5 4
7 2
8 1
For nominal / categorical variables, the measure often used is the variation ratio –
the percentage of cases that are not the mode.
Smaller value → less variation (i.e., the mode represents the distribution well);
Larger one → more variation (mode doesn’t represent it well)
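A minimal Python sketch (using the hockey example above) of the mode and the variation ratio:

from collections import Counter

points = [7, 5, 0, 7, 8, 5, 5, 4, 1, 5]
mode_value, mode_freq = Counter(points).most_common(1)[0]
variation_ratio = 1 - mode_freq / len(points)
print(mode_value)        # 5: the most frequently occurring value
print(variation_ratio)   # 0.6: 60% of cases are not the mode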
Also a problem when there are outliers (e.g., in our annual income example, range
is from 30,000$ to 620,000$).
Interquartile range (IQR): the broader picture of dispersion around the median-or,
range between the 25% and 75% percentile of cases (that way, not influenced by
outliers).
Quartiles: points that divide the data into four equal parts, based on number of
observations (not on the possible values of a variable). Similarly, deciles divide into ten, etc.
Q1 Q2 Q3
This indicates whether the middle part of the data in a dataset are close together or
not.
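A minimal Python sketch (reusing the seminar grades from earlier) of quartiles and the IQR; statistics.quantiles with n=4 returns the three quartile cut points:

import statistics

grades = [74, 70, 57, 60, 78, 67, 75, 67, 83, 71, 72, 89, 75, 73, 78, 81, 63]
q1, q2, q3 = statistics.quantiles(grades, n=4)   # points dividing the data into four parts
print(q1, q2, q3)                                # q2 is the median
print(q3 - q1)                                   # the IQR: spread of the middle 50% of cases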
Measures of Dispersion
Standard deviation (s) of a sample
For Interval / Ratio data, this is the best measure that indicates how far, on average,
an observation is from the mean (or, the average amount that each observation
differs from the mean).
Its value depends on how tightly the scores are ‘clustered’ around the mean
(more clustered → smaller s; wider dispersion, larger s)
s = √( Σ (xᵢ − x̄)² / (n − 1) )
x̄ = mean
xᵢ = observations
(xᵢ − x̄) tells us how far an observation is from the mean
(we square the deviations, and later take the square root, to eliminate negatives)
Measures of Dispersion
Standard deviation (s)
Example: a set of five grades {70, 75, 78, 82, 85}
x̄ = (70 + 75 + 78 + 82 + 85) / 5 = 78
s = √( Σ (xᵢ − x̄)² / n )
  = √( [ (70−78)² + (75−78)² + (78−78)² + (82−78)² + (85−78)² ] / 5 ) ≈ 5.2
NB. The standard deviation formula is sometimes seen with n in the denominator instead of n−1.
When we have data for the entire group of interest (as treated in this example), we use the
population standard deviation and divide by n. When we are estimating from a sample of the data
(we do not have all the data), we divide by n−1 (losing 1 degree of freedom).
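A minimal Python sketch of the two versions of the formula; the standard library offers both divisors:

import statistics

grades = [70, 75, 78, 82, 85]
print(statistics.pstdev(grades))   # divides by n:     about 5.25 (the 5.2 above)
print(statistics.stdev(grades))    # divides by n - 1: about 5.87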
Standard deviation
± 1 s:
78 + 5.2 = 83.2
78 − 5.2 = 72.8
± 2 s:
78 + 10.4 = 88.4
78 − 10.4 = 67.6
(so 67.6, 72.8, 78, 83.2 and 88.4 mark −2s, −1s, the mean, +1s and +2s)
Useful for knowing how far an individual case is from the mean
Standard deviation will be helpful when we discuss probability
Measures of Dispersion
Variance also indicates the spread of the data around the mean. It is the square of the
standard deviation.
variance = s² = Σ (xᵢ − x̄)² / (n − 1)
Variance can be expressed in squared units or as a percentage. For data, the metric of
standard deviation is used
Distributions with larger standard deviations have more variance away from the
mean, broadening and flattening the distribution’s curve
Measures of Dispersion
Spread can be very different even for distributions with identical measures of center
A z-score shows how far away an observation is from the mean in standardized units.
It allows for standardized comparisons between groups, can be positive or negative.
Positive z-score indicates above the mean; negative indicates below
zᵢ = (xᵢ − x̄) / s
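A minimal Python sketch of the z-score, using the grade mean (78) and standard deviation (5.2) from the earlier example:

def z_score(x, mean, s):
    # distance of an observation from the mean, in standard-deviation units
    return (x - mean) / s

print(z_score(83.2, 78, 5.2))   # +1.0: one standard deviation above the mean
print(z_score(67.6, 78, 5.2))   # -2.0: two standard deviations below the mean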
Level of measurement    Central tendency    Dispersion            Use
…                       …                   …                     …
Interval / Ratio        Mean (median)       Standard deviation    Most reliable and precise information
Sampling: populations are studied to find their characteristics. When not possible to
include all population, research focuses on a sample – the set of observations that a
dataset does contain. Sampling is the process of selecting a number of cases from a larger
population for study. Representativeness of a sample is crucial, and depends on (a)
accuracy of sampling frame, (b) sample selection method, and (c) sample size
The appropriate method for summarizing data depends upon the level of measurement
Part I, Data II: Glossary
Central Limit Theorem: the sum of random variables follows a normal distribution
(a symmetrical bell shape)
Part I, Data II: Glossary
Sample representativeness: depends on accuracy, sample selection method, sample size
Mean: the average value of an observation (sum of values in a sample divided by total
number of observations)
Median: value in the middle of a dataset
Mode: most common, frequently occurring value for a variable
Interquartile range (IQR): the range between the 25% and 75% percentile of cases;
indicates whether the middle part of the data in a dataset are close together or not.
Standard deviation: how much variation there is within a group of values. It measures the
deviation (difference) from the group’s mean (average)
Variance: also indicates the spread of the data around the mean. It is the square of the
standard deviation
Z-score: shows how far away an observation is from the mean in standardized units
POL244H
Research Methods for Political Science II
Winter 2024
Wednesdays 1-3pm @ MN1170
Part I
Analysis I: inferential statistics, uni- and bivariate analysis
Week 1 Jan. 10 Introduction and course details
Week 2 Jan. 17 Data I: Research design, experiments, interviews and questionnaires
Week 3 Jan. 24 Data II: Sampling, size and distributions
Announcements
Inferential Statistics
Assignments
Assignment 1 (5%)
Due: tomorrow, February 1
by 11:59pm EST (Quercus)
Dear Humanities and Social Sciences Students at UTM,
Ready to have your voice heard? We need your input!
Have you used digital tools in your courses? Have you created games? Used Omeka or ArcGIS or
StoryMaps? Analyzed big data sets like historical newspapers? Produced podcasts? Scraped data from social
media? Or experimented with other emerging digital methods for humanities & social science research?
We invite you to join us for an engaging Town Hall discussion centered on your experiences with digital
tools and methods in your classes and research. Your insights are crucial in helping us understand how
faculty at UTM can better support you.
Two one-hour sessions are scheduled on Thurs, Feb 8, and Mon, Feb 12, from 4-5 PM on Zoom. Each
session will be an open forum for sharing experiences, discussing challenges, and exchanging ideas. Your
active participation will directly shape the resources and support provided to humanities and social
sciences students. Let's collaborate to enrich your research journey together!
Thurs Feb 8, 4:00–5:00 pm https://utoronto.zoom.us/j/83705592719
Mon Feb 12, 4:00–5:00 pm https://utoronto.zoom.us/j/83827679021
Want to share your ideas now? Take our quick and fun poll! Join us at one of our Town Halls where we will
reveal the results, and together we can turn your opinions into action. Click this link for our Pre-Event Poll!
BONUS! Attend one of our Town Halls and you could win $50 in UTM Gift Dollars! The lucky winner will
see the funds added to their TCard shortly after the random drawing. Don’t miss out on the chance to
participate and potentially boost your wallet! Spread the word and see you there!
Elspeth Brown, Director of CDHI and co-chair of the UTM Digital Scholarship Working Group
Paula Hannaford, Acting Chief Librarian, UTM and co-chair of the UTM Digital Scholarship Working Group
Sampling
First steps of survey research
We have already discussed the concepts of population and sample data, and described
a sample.
As it is very difficult to obtain data on entire populations, we rely on samples.
If randomly selected, a sample can help us generalize about the whole population via
statistical inference (that is why we discussed random and non-random sampling at
length).
Through this process, from what we know to be true about a randomly selected,
representative sample, we can probabilistically infer what is likely to be true about the
population, or, project from observed cases (sample) to the whole population
(K&W; Besco). This is about being able to generalize.
Examples from our daily lives: a winning lottery ticket, a successful penalty kick in
soccer, a snowy day, tails on a coin flip, etc.
As we will see, probability provides the link between a sample and a population, via
exploring how common a finding is.
It addresses the questions of whether a statistic in a sample (e.g., mean, standard
deviation) is the same as in the whole population, and of how similar or dissimilar the
sample and the population are (Besco).
Characteristics
What is the probability that on our first draw we will pick a blue marble?
Example
2 Coin flips
Possible outcomes?
Outcome         Frequency   Probability, P
Heads, Heads    1           0.25 (1/4)
Heads, Tails    2           0.50 (2/4)
Tails, Tails    1           0.25 (1/4)
What is the most common outcome? Heads & Tails (or Tails & Heads), 50%
What is the probability of 3 Heads? 0%
If the coin is fair, then the more times we repeat this two-coin flip, the closer we will
approximate a 50-50 chance of a Heads and Tails outcome.
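A minimal Python sketch enumerating the two-coin-flip outcomes above (order ignored, so Heads-Tails and Tails-Heads are pooled):

from collections import Counter
from itertools import product

outcomes = Counter("".join(sorted(flips)) for flips in product("HT", repeat=2))
total = sum(outcomes.values())
for combo, freq in outcomes.items():
    print(combo, freq, freq / total)   # HH 1 0.25, HT 2 0.5, TT 1 0.25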
Probability: how many are there?
Joint (or, Unconditional) Probability, P(A ∩ B) or P( A & B) -- the chance of two events
(a) and (b) occurring together.
It is also called the Probability of the intersection of (a) and (b);
It is termed ‘unconditional’ because it does not depend on order or sequence
Probability: how many are there?
Further,
Conditional Probability P(A|B) -- the chance that event (a) occurs, given that
event (b) has taken place. In this case, sequence matters and the occurrence of
(b) may alter the probability of (a) happening
E.g., probability of rain, given a cold front emerging west of Toronto; probability
of COVID-19 infection, given one’s exposure; probability of entrance into an Ivy
League U.S. university given a particular socio-economic level
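A minimal numeric sketch of conditional probability, using hypothetical counts (not from the slides) for the rain and cold-front example:

# out of 200 days, 60 had a cold front; 45 had both a cold front and rain
p_front = 60 / 200
p_front_and_rain = 45 / 200
p_rain_given_front = p_front_and_rain / p_front   # P(A|B) = P(A and B) / P(B)
print(p_rain_given_front)                         # 0.75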
When we speak of an 80% probability of snow, we mean that in a long series of days
with similar conditions, snow falls on 80% of the days.
Put differently, and for our purposes, with a random sample, the probability that an
observation has a particular outcome is the proportion of times that outcome would
occur in a long sequence of like observations (Agresti)
Distributions of observations
Normal distribution
Resembles a bell shape, has a single central peak, and is unimodal and symmetrical (here,
mode, median and mean are the same). A normal distribution (denoted N) is
characterized by its mean, μ, and standard deviation, σ. The same goes for normally
distributed variables.
In other words, the normal distribution has a predictable area under the curve
within specific distances from the mean (K&W).
This probability equals 68% within one standard deviation, 95% within two
standard deviations and 99% within three standard deviations.
The more spread out the distribution, the larger the standard deviation, σ
Probability distributions
In other words, if a probability distribution is a normal bell-shaped one, about 68% of
that probability falls between μ − σ and μ + σ, about 95% between μ − 2σ and μ + 2σ, and
99% between μ − 3σ and μ + 3σ. This is called the empirical rule (or, the 68-95-99 rule)
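A minimal Python sketch of the empirical rule, using scipy’s normal distribution to compute the area within 1, 2 and 3 standard deviations of the mean:

from scipy.stats import norm

for k in (1, 2, 3):
    print(k, round(norm.cdf(k) - norm.cdf(-k), 4))
# 1 0.6827, 2 0.9545, 3 0.9973  (the 68-95-99 rule)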
To illustrate more, let us go back to the topic of random variables (they are often
called ‘random’ to highlight the random variation behind the outcome varying from
observation to observation; this can be summarized by probabilities - Agresti)
Variables can be discrete (0, 1, 2, 3…) - i.e., separate values-or, continuous (if possible
outcomes are an infinite continuum - e.g., all real numbers between 0 and 1)
Example (Agresti)
Survey question: What is the ideal number of children for a family?
Discrete, as it takes numbers 0, 1, 2, 3…
For a randomly chosen person, probability distribution of ideal number of children for
a family (y) is shown via a table and a histogram
y P(y)
0 0.01
1 0.03
2 0.60
3 0.23
4 0.12
5 0.01
Total 1.00
Probability distributions
More importantly, these parameter values are the values these measures would
assume in the long run if the random sample took observations on the variable y
having that probability distribution (Agresti)
E.g., in the ‘ideal number of children’ example over the long run we expect y=0 to
occur 1% of the time, y=1 to occur 3% of the time, y=2 to occur 60% of the time, y=3
to occur 23% of the time, etc. In 100 observations, we expect: One 0, three 1’s, sixty
2’s, twenty-three 3’s, twelve 4’s and one 5.
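A minimal Python sketch computing the long-run (expected) value of y from the probability distribution in the table above:

dist = {0: 0.01, 1: 0.03, 2: 0.60, 3: 0.23, 4: 0.12, 5: 0.01}
expected_y = sum(y * p for y, p in dist.items())
print(expected_y)   # 2.45: the mean number of children over the long run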
Statistical inference
In other words, if repeated random samples are drawn from a population, the sampling
distribution of the sample estimate will approach normality (Halperin & Heath)
Central limit theorem
Example (Kellstedt and Whitten)
A distribution of actual scores in a sample (what we call a frequency distribution)
represents the frequency of each value of a particular variable.
If we rolled a die 600 times, and repeated this an infinite number of times (i.e., took a
sample an infinite number of times), the mean of the sample means would be exactly 3.5*
and their standard deviation 0.07:
μ = 3.5, σ = 0.07
*Another way of looking at this, is that with enough samples from a population, the means
will be arranged into a distribution around the true population mean and will approximate
a normal distribution. The larger the sample (not the population) the more accurate it is.
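A small simulation sketch (not from the slides) of this idea: repeated samples of 600 die rolls yield sample means that cluster around 3.5 with a spread of about 0.07:

import random
import statistics

sample_means = [statistics.mean(random.randint(1, 6) for _ in range(600))
                for _ in range(2000)]
print(round(statistics.mean(sample_means), 2))    # about 3.5, the population mean
print(round(statistics.stdev(sample_means), 3))   # about 0.07, as stated above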
Central limit theorem
As we take more samples, especially large ones, our graph of the sample means
will look more like a normal distribution.
According to the CLT, the average of our sample means will be the population
mean.
Put differently, if we add up the means from all our samples, and we calculate
the average, that average will be our actual population mean.
Similarly, if we calculate the average of all the standard deviations in our sample,
we will find the actual standard deviation for our population (Kotz et al.)
Sampling distribution and standard error
This hypothetical distribution of sample means is called a sampling
distribution
The mean of sampling distribution would be equal to the true population
mean, and,
Sampling distribution would be normally shaped
The standard deviation of the sampling distribution, σȲ, is
σȲ = sY / √n
𝑛: sample size
When it increases (i.e., means are more spread out), it becomes more likely
that any given mean is an inaccurate representation of the true population
mean (error between sample and population)
Therefore, we can state (with 95% confidence) that the population mean for our rolls
of die is within 3.33 (3.47-0.14) and 3.61 (3.47+0.14)
Inference and polling
A large, representative sample will look like the population (Central Limit theorem)
A poll with a margin of error of (say) ± 2 % follows the same logic of 95%
confidence interval.
This means, that if we conducted 100 different polls on samples from the same
population, we would expect the answers from 95 of these polls to be within 2%
points in one or other direction of the true value in the population
In polls, the important sample statistic is a % not a mean
A margin of error (confidence interval) indicates how many percentage points a poll’s results
will differ from the real population value. In the above example, our statistic will be
within 2 percentage points of the real population value 95% of the time.
Inference and polls
The standard error is associated with levels of accuracy. It indicates how much
dispersion to expect from sample to sample, or from poll to poll (dispersion of
the sample means)
Standard error for a percentage = √( p (1 − p) / n )
A poll (n=1000) in one U.S. state during presidential election produces the
following results:
Support for Democrats will now range from 48-56% (± 4% margin of error)
For Republicans, from 43-51% (± 4% margin of error)
More confident, but... within these ranges, polls can get the result wrong
(e.g., in the 2012 U.S. presidential election)
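A minimal sketch of the standard error for a poll percentage; the 52% share and n = 1000 are assumptions drawn from the example above, and the resulting margin is close to (slightly smaller than) the ± 4% used in the slides:

import math

p, n = 0.52, 1000                      # assumed sample share and poll size
se = math.sqrt(p * (1 - p) / n)        # standard error for a percentage
margin = 2 * se                        # roughly the 95% margin of error
print(round(se, 3), round(margin, 3))  # 0.016 and about 0.032 (about 3 points)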
Inference and polls
E.g., instead of being ‘pretty sure’ that Jefferson was third or fourth U.S. president,
you can be ‘absolutely positive’ that he was among the first five (Wheelan)
Smaller (and more biased) samples produce larger standard errors and larger
confidence interval (‘margin of sample error’)
To examine whether X possibly causes Y, we must first investigate if the two are
related through a logic of inference from a sample to the whole population
Non-directional (two-tailed)
Expectation that what is investigated will be different (one variable), or, related
(two variables)
Directional (one-tailed)
Expectation that what is investigated will be different in a given way
(more / less than a given value), or related in a given way (positive / negative)
*This is the measure of the strength of the evidence that must be present in
one’s sample before they reject the null hypothesis and conclude that the
effect is statistically significant. The researchers themselves determine the
significance level before conducting their research.
Hypothesis testing
Given a data sample, one compares the potential relationship between X and Y in
that dataset, with what one would expect to find if X and Y were not related in
the underlying population (K&W)
In other words, the more different the empirically observed relationship is from
what would be expected if there were not a relationship, the more the
confidence that X and Y are indeed related in the population.
More broadly, hypothesis testing indicates the probability of seeing what one
does in a sample if the null hypothesis is true
Hypothesis testing and p-value
Let us say we want to test a hypothesis that some interesting phenomenon in Political
Science is occurring.
No amount of evidence can ever prove a hypothesis is correct 100% of the time.
Instead, one first assumes that the phenomenon does not actually happen (which, in
technical terms, is called the null hypothesis H0) , and attempt to reject this idea. (Balkus)
In simple terms, the p-value is an indicator of whether two variables we are exploring are related.
The probability we would see the relationship we are finding because of random
chance; probability that we see the observed relationship we are finding between X
and Y in a sample data if there were no relationship between them in the unobserved
population
It conveys the level of confidence with which one can reject the H0
Hypothesis testing
Example: Global Warming in the Arctic Circle-is the North Pole melting?
H0: μ = 0 °C
HA: μ > 0 °C
(This is a one-sided test. If we instead asked whether the temperature is simply different
from 0 °C, higher or lower, it would be a two-sided one, with p-values at both ends of the distribution)
n = 9
x̄ = 1.2
SD = 3
μ0 = 0
df = 8 (sample size n − 1)
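A minimal scipy sketch of the corresponding one-sided t-test, using only the summary numbers above:

import math
from scipy import stats

n, xbar, sd, mu0 = 9, 1.2, 3.0, 0.0
t = (xbar - mu0) / (sd / math.sqrt(n))        # the t-statistic
p_one_sided = stats.t.sf(t, df=n - 1)         # upper-tail p-value for HA: mu > 0
print(round(t, 2), round(p_one_sided, 3))     # 1.2 and about 0.13: H0 is not rejected at 0.05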
Hypothesis testing and p-value
But that does not mean that global warming is not taking place.
We need a much larger sample at a more rigorous significance level (lower p-value)
Parenthesis: statistical significance and error
Useful mnemonic: when one undergoes a medical test, null hypothesis is that
they do not have a disease, x. If laboratory results confirm the disease and one is
not ill, then this is a false positive. If the test results are clear, and one is, in fact,
ill, then false negative.
(back to) Hypothesis testing
Sampling distribution of sample means ȳ if H0: μ = μ0. A very small p-value indicates that the probability
of obtaining values so extreme from the null hypothesis mean (μ0) is minuscule and not random. When HA
is two-tailed, the H0 rejection regions are located on both ends of the curve (and the area under each = α / 2)
A p-value < 0.05 is considered the benchmark for results that are not a matter of
chance, but statistically significant. Overall, the standard one sets as the benchmark
for significance is symbolized by α (significance level, alpha) – i.e., how extreme the data must
be before we can reject the null hypothesis.
p-value and the 0.05 threshold
In interwar Britain, Muriel Bristol claimed to tell the difference between milk
poured into tea, and tea poured into milk.
To test this claim, in his 1935 book ‘The Design of Experiments’, Ronald Fisher,
a British statistician, proposed a lady tasting tea test: she would be presented
with 8 cups of tea (4 with milk poured into tea, and 4 with tea poured into milk). There are 70
possible ways of selecting which 4 cups are which, and only 1 of them identifies all 8 cups correctly.
If the woman was successful, it would be an extremely improbable result (1.4%
chance) indicating there was something other than random selection of the
correct answer.
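A minimal Python sketch of the combinatorics behind the 70 combinations and the 1.4% figure:

from math import comb

total = comb(8, 4)        # ways to choose which 4 of the 8 cups had milk poured first
print(total, 1 / total)   # 70 and about 0.014 (the 1.4% chance above)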
p-value and the 0.05 threshold
Source: Balkus
Hypothesis testing
At the same time, there is some debate about what is the threshold for ‘rare’ (and
many do not think that a p-value smaller than 0.05 is necessarily consequential, or
that, say one of 0.07 is not)
p-value and levels of statistical significance
Research uses samples. If randomly selected, a sample can help us generalize about
the whole population through statistical inference. From what we know to be true
about a randomly selected, representative sample, we can use probability theory to
infer what is likely to be true about the population.
A normal distribution has a predictable area under the curve within specific
distances from the mean: about 68% of that probability falls within ± 1 standard
deviation, about 95% within ± 2, and 99% within ± 3 (the empirical rule,
or, the 68-95-99 rule)
The distribution of sample means follows a normal distribution, so the peak of the
normal distribution equals the population mean. This can help determine how far
our sample mean is from a hypothesized population value and its associated
probability (it can indicate how accurate a representation of our sample is within a
confidence interval, so that we can generalize from what we have).
Part I, Analysis I: main points
Hypothesis testing indicates the probability of seeing what one does in a sample if
the null hypothesis is true
P-value is the probability of that data being collected simply by chance assuming
the null hypothesis, H0 – that the phenomenon does not occur. It conveys the level
of confidence with which one can reject the H0
P-values (very low ones) are important as they provide evidence of presence of
relationship between two variables
Part I, Analysis I: Glossary
Conditional Probability: the chance that event (a) occurs, given that event (b)
has taken place
Margin of error: in polling (confidence interval) indicates how many % points a poll’s
results will differ from the real population value.
Part I, Analysis I: Glossary
Hypothesis: an expectation about what is happening in the (unobserved) population-
perhaps a relationship between two variables, X and Y. Researchers seek evidence to
support or reject them
Type I error (false positive): rejection of a null hypothesis that is actually true. The
lower the value of level of significance, the less likely this error
Type II error (false negative): no rejection of a null hypothesis that is false and should
have been rejected. The lower the level of significance value, the more likely this error
t-statistic: indicates how far our observed sample value is from a hypothesized
population value
POL244H
Research Methods for Political Science II
Winter 2024
Wednesdays 1-3pm @ MN1170
Part I
Analysis II: measures of association; regression analysis
Week 1 Jan. 10 Introduction and course details
Week 2 Jan. 17 Data I: Research design, experiments, interviews and questionnaires
Week 3 Jan. 24 Data II: Sampling, size and distributions
Announcements
Measures of Association
Regression analysis I
Today’s tutorials
Assignment 2 (10%)
Due: end of next week February 18
by 11:59pm EST (Quercus)
On the path to proving Causality
P-values (very low ones) are important as they provide evidence of
presence of relationship between two variables
Chi-squared
Lambda
Gamma
(Pearson’s) r (correlation coefficient)
Direction of association
Positive: high values associated with high values (and low with low)
E.g., ethnic polarization and likelihood of civil war
Negative: high values associated with low values (and vice versa)
E.g., levels of poverty and voting turnout
Association between two variables
Chi-squared
It helps test whether there is a relationship
between two variables, X and Y (but not its strength or direction)
Such tests compare an empirical result with a hypothetical result that would occur if
the data were random (H0 = X, Y not related)
Chi-squared: χ² = Σ (O − E)² / E
O: observed number of cases (also expressed as fo, or, observed frequency)
E: expected number of cases (also expressed as fe, or, expected frequency)
Association between two variables
Example (Singh)
Is location related to political party affiliation in the United States?
Always helps to look at the table for any apparent association
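A minimal scipy sketch of a chi-squared test on a cross-tabulation; the counts below are hypothetical, since the slides’ location / party table is not reproduced here:

from scipy.stats import chi2_contingency

observed = [[40, 30],     # e.g., urban respondents: Democrat, Republican
            [20, 45]]     # e.g., rural respondents: Democrat, Republican
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 2), round(p_value, 4), dof)   # chi-squared, its p-value, degrees of freedom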
Association between two variables
Ordinal: clear ordering of the categories in this type (e.g., education, income,
satisfaction ratings)
Interval / Ratio: also called continuous, or, numerical: ordering and equal spacing
between values; for interval variables there is no natural zero (e.g., for temperature, zero is
an actual temperature, not an absence), while ratio variables have a natural zero (e.g., age)
Measures of Association: nominal variables
Lambda
This is a test that indicates strength of association
(as Categorical / Nominal variables are non-directional)
It ranges from 0.00 to 1.00 and helps improve one’s predictions of one variable
if one knows about the other.
Measures of Association: nominal variables
Lambda: λ = (ε1 − ε2) / ε1
(where ε1 are the prediction errors made without knowledge of the IV, and ε2 the errors made using it)
Gamma
This test indicates both strength and direction of association
Example (Singh)
Support for deregulation and support for free markets
Each variable has five ordinal categories
Measures of Association: ordinal variables
Discordant pairs: an observation rates higher on one variable but lower on the
other, as compared to its counterpart
E.g., while respondent C strongly approves of both deregulation and free
markets, respondent D strongly disapproves of deregulation while being in favor of
free markets
From Rademaker
Measures of Association: ordinal variables
Gamma: γ = (C − D) / (C + D)
where C is the number of concordant pairs and D the number of discordant pairs
In this example, γ = (C − D) / (C + D) = 0.54 ⇒ γ > 0 (Modified from Singh)
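A minimal sketch of the gamma calculation; the pair counts are hypothetical, chosen only to reproduce a value of 0.54:

C, D = 770, 230              # hypothetical numbers of concordant and discordant pairs
gamma = (C - D) / (C + D)
print(gamma)                 # 0.54: a positive association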
Correlation coefficient: rX,Y = Σ (Xᵢ − X̄)(Yᵢ − Ȳ) / [ (n − 1) SX SY ]
It ranges from -1.0 (strong negative association) to +1.0 (strong positive association)
If r = 0, this does not mean the two variables are not related at all; only that there is
no linear association between them
Measures of Association: continuous variables
Example: exploring the association of age and income (Singh)
H0: no association between age and income
HA: increase in age → increase in income
With regression analysis, we will see that what is important is the slope of a line
Both straight lines have the same correlation coefficient, r = 1, but note that their slopes are different
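A minimal scipy sketch of Pearson’s r for this kind of question, with made-up age and income values:

from scipy.stats import pearsonr

age = [23, 31, 40, 48, 55, 62]
income = [28_000, 41_000, 52_000, 60_000, 66_000, 64_000]
r, p_value = pearsonr(age, income)
print(round(r, 2), round(p_value, 3))   # r close to +1 indicates a strong positive association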
Measures of association: tests
                        IV type: Categorical                   IV type: Continuous
DV type: Categorical    Tabular analysis                       Probit/Logit
DV type: Continuous     Difference of means (e.g., t-test);    Correlation coefficient;
                        Regression                             Regression model
Let’s recap...
𝑏: slope
This is the change in Y associated with a one-unit increase in X.
Once we know these two parameters, we can draw that line across any
range of X values
Estimation of linear relationships
𝑌𝑖 = 𝑎 + 𝑏𝑋𝑖
As we look to minimize the vertical distances between the fitted line and
each point in the scatterplot,
we select the line that minimizes the total (or, sum) of squared residuals
(squared, to ensure we do not have negatives) (‘line of best fit’)
Equation: Σ eᵢ² = Σ (Yᵢ − Ŷᵢ)²
Regression coefficient
From the original equation, the OLS regression coefficients a and b are
obtained (although, as the intercept a is a constant, the term
regression coefficient is used for b)
a = Ȳ − bX̄ (in estimated form, â = ȳ − β̂x̄)
Example
We are examining the relationship between X and Y, and obtain several observations
(vertical axis: the DV; horizontal axis: the IV)
Estimation of linear relationships
Since X̄ = 3 and Ȳ = 4,
from the OLS regression equations ( a = Ȳ − bX̄ and b = Σ (Xᵢ − X̄)(Yᵢ − Ȳ) / Σ (Xᵢ − X̄)² )
we obtain
a = −0.2 (the intercept, where the line crosses the Y-axis; or, constant, since it is
the starting point for a calculation), and,
b = 1.4 (the slope of the line, or, regression coefficient)
Visually, the intercept a is the value of Y when X = 0
(it ‘anchors’ the regression line)
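A minimal numpy sketch of the OLS formulas above; the data points are made up (they are not the slides’ example), so the resulting intercept and slope differ from −0.2 and 1.4:

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([1.0, 2.5, 4.0, 5.5, 7.0])
b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)   # slope
a = Y.mean() - b * X.mean()                                                 # intercept
print(a, b)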
On b, regression coefficient
(From Singh)
Put differently, the Y1 plot has a smaller sum of squared residuals ( Σ eᵢ² = Σ (yᵢ − ŷᵢ)² )
To better describe the variation about the regression line (the ‘noise’),
the Root Mean Square Error (Se) is used (also called the standard error of the
regression model)
Se = √( Σ eᵢ² / (n − k − 1) )
n = number of observations
k = number of independent variables (always equal to 1 in bivariate regression, so
the equation can also be written as Se = √( Σ eᵢ² / (n − 2) ))
That’s a lot of equations! I have included them for those interested, but don’t
worry-all of these are reported by the statistical software output
Goodness of fit measures: Root MSE
More broadly, Root MSE is the average vertical distance (or, deviation) of a data
point from the fitted regression line and indicates how concentrated the data is
around the line of best fit.
It is always expressed in the metric of the DV and is not bounded (e.g., between
−1 and 1), so it is more difficult to compare across models. Still, more broadly,
the higher the value of Se, the worse the fit of the regression line
Se = 10.45 Se = 52.23
Goodness of fit measures: R-squared
R² = 1 − [ Σ (eᵢ − ē)² / (n − 1) ] / [ Σ (yᵢ − ȳ)² / (n − 1) ]    Or, R² = 1 − var(e) / var(y)

Also expressed as R² = 1 − Residual Sum of Squares (RSS)* / Total Sum of Squares (TSS)
* RSS is also called the Sum of Squared Errors (SSE)
Goodness of fit measures: R-squared
𝑅2 ranges from 0 to 1
If 𝑅2 =1 ⇒ perfect relationship between X and Y (accounting for all
variation)
If 𝑅2 = 0 ⇒ no (linear) relationship between X and Y
But it also depends on the data; e.g., survey data are more ‘noisy’
If more variables are added, it increases.
𝑅2 = 0.90 𝑅2 = 0.36
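A minimal numpy sketch computing both goodness-of-fit measures for a bivariate regression on made-up data:

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([1.2, 2.3, 4.1, 5.4, 6.9])
b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
a = Y.mean() - b * X.mean()
residuals = Y - (a + b * X)
n, k = len(Y), 1
root_mse = np.sqrt(np.sum(residuals ** 2) / (n - k - 1))
r_squared = 1 - np.sum(residuals ** 2) / np.sum((Y - Y.mean()) ** 2)
print(round(root_mse, 3), round(r_squared, 3))   # smaller Root MSE and R-squared near 1 indicate a good fit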
(Figure: just as we use x̄ to make guesses about μ, we use b to make inferences about β.
A clearly non-zero estimated slope means H0 is rejected; a zero slope, indicating no
correlation (at least, not linear), means H0 is not rejected.)
Inference: hypothesis-testing and the t-statistic
t = (b − β0) / Sb = b / Sb
Sb, the standard error of b: Sb = se / √( Σ (xᵢ − x̄)² ), and se = √( Σ eᵢ² / (n − k − 1) )
Example
Are Divorce rate and Unemployment rate in the U.S. related? (Singh)
H0: β = 0
HA: β ≠ 0
Sample n=192
(Scatterplot: Divorce rate on the vertical axis, Unemployment rate on the horizontal axis)
Inference: hypothesis-testing and the t-statistic
Estimation of the equation: Divorceᵢ = a + b × Unemploymentᵢ
(remember, a: intercept, b: slope/regression coefficient)
Inference: hypothesis-testing and the t-statistic
Regression results
Coefficient on Unemployment : 0.365
Standard error of the coefficient: 0.051
Using t = (b − β0) / Sb = b / Sb (coefficient / standard error), we have
t = 0.365 / 0.051 ≈ 7.2
With this value for t we can determine whether this t-statistic indicates statistical
significance
Inference: hypothesis-testing and the t-statistic
NB. This result should not be interpreted as ‘there is a less than 5% chance
that the true impact of Unemployment on the expected Divorce rate is zero’
Inference: hypothesis-testing and the t-statistic
Also, note that t-statistics tend to get bigger with a bigger sample
t = (b − β0) / Sb = b / Sb
where Sb, the standard error of b: Sb = se / √( Σ (xᵢ − x̄)² ), and se = √( Σ eᵢ² / (n − k − 1) )
In simple terms, a larger size sample increases significance (all else equal)
Inference: p-values
Screenshots (Aug. 2021, and Jan. 2022) from Five Thirty Eight
Confidence intervals are about the likely location of the population mean – the
interval within which, at a stated level of confidence, the population mean lies
Inference: confidence intervals
Regression results
Coefficient on Unemployment : 0.365
Standard error of the coefficient: 0.051
Inference: confidence intervals
b − tα/2 · sb   to   b + tα/2 · sb
This means that one can be 95% confident that β falls in the interval between
0.265 and 0.465
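A minimal scipy sketch reproducing the t-statistic and the 95% confidence interval from the reported coefficient and standard error (small differences from the reported bounds are rounding):

from scipy import stats

b, se_b, n, k = 0.365, 0.051, 192, 1             # the regression results reported above
t = b / se_b                                     # about 7.2
t_crit = stats.t.ppf(0.975, df=n - k - 1)        # about 1.97 for df = 190
print(round(t, 1), round(b - t_crit * se_b, 3), round(b + t_crit * se_b, 3))
# roughly 7.2, 0.264 and 0.466: essentially the interval reported above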
Linear regression and related assumptions
Measures of goodness of fit include Root MSE (or, standard error of the regression
model –the higher, the worse the fit) and 𝑅2 (from 0 to 1, the closer to 1 the perfect X
& Y relationship)
Part I, Analysis II: main points
Regressions can also help us learn about an unobserved population, by using sample
data and employing familiar inference tools (e.g., H0 null hypothesis testing, p-values,
confidence intervals)
In regression analysis, instead of about μ, we make inferences about β
Null hypothesis, H0 : X does not cause Y (β = 0)
When Se is smaller, Sb is also smaller, indicating that b is a more precise estimate of β
Confidence intervals tell us that we can be (usually) 95% confident (at p = 0.05) that β
falls in the interval between two values
Multiple Regression is further used to identify confounding variables (variables that are
correlated both with X and Y). Multiple regression is simply the addition of one (or
more) independent variable(s), Z, to the regression equation.
Part I, Analysis II: Glossary
Gamma: This test shows both strength and direction of association. It ranges
from -1.0 to +1.0, and determines whether an observation rating high on one
variable means that observation will rate high on another.
Correlation coefficient, r (or, Pearson’s r): This measure provides both the
strength and direction of a linear relationship between two continuous
variables. It ranges from -1.0 (strong negative association) to +1.0 (strong
positive association). If r = 0, this only means there is no linear association between
them.
Part I, Analysis II: Glossary
Residual e: the deviation of individual Y values from the regression line. (Some
use the notation u for the residual instead.)
R² (R squared): A better indicator of how accurately the regression line describes the relationship between X and Y; it is
the proportion of variance in Y explained by X. It ranges from 0 to 1. If 𝑅2 =1 ⇒ perfect
relationship between X and Y (accounting for all variation). If 𝑅2 =0 ⇒ no (linear) relationship
between X and Y.
T-statistic: The t-statistic measures how much the estimated value differs from its expected
value, considering the standard error. It's key in deciding if we accept or reject the null
hypothesis in a t-test. Usually, researchers use a 5% level to examine if their results are
statistically significant. Larger samples make the t-statistic bigger and the results more
significant. Generally, in a two-tailed t-test with a 5% significance level and a big enough
sample, the critical t-value is around ±1.96.
Confidence intervals: the likely location of the population mean – the interval within
which, at a stated level of confidence, the population mean lies
Part I, Analysis II: Glossary
Winter 2024
Wednesdays 1-3pm @ MN1170
Part I
Analysis III: Regression (concl.)
Week 1 Jan. 10 Introduction and course details
Week 2 Jan. 17 Data I: Research design, experiments, interviews and questionnaires
Week 3 Jan. 24 Data II: Sampling, size and distributions
Announcements
Regression analysis II
Big data, Machine Learning (next class)
Network analysis (next class)
Assignments
Assignment 2 (10%)
Due: February 18 by 11:59pm EST (Quercus)
Exceptionally, due to Reading Week, no late penalty until end of Feb. 20 (by 11:59pm EST)
NB. The six academic papers to select from have been posted on Quercus
Assignments
Posted on Quercus
Reading Week: there will be office hours
Prof. K
Wednesday, Feb. 21 via zoom
Expanded from 3:30-5:30pm EST
Zoom link: https://utoronto.zoom.us/j/83233816131
Mujahed
Thursday, Feb. 22 from 9-10am, and from 3-4pm
Zoom link: https://utoronto.zoom.us/j/87382638114
Meeting ID: 873 8263 8114
Passcode: POL244
Mid-term test
Spurious correlations
Crude oil imports by the U.S. and U.S. chicken consumption (annual, lbs)
r = 0.899 (a correlation of 89.9%)
(from Vigen @ spurious correlations)
Confounding variables
Example (Singh)
Watching Fox News (X) and supporting the Republican Party (Y)
Does X cause Y? The association is spurious because X and Y are also correlated
with numerous other variables (e.g., political ideology, urbanity,
demographics). The result is an overestimation of the effect of X on Y
                            Fox News viewing
                            High      Low
Voting        Democrat        31       68
preference    Republican      69       32
Total (%)                    100      100
n                            475      525
(From Singh)
Confounding variables
There is a single intercept, and slope estimates for each IV (X, Z): these are called Partial Coefficients
If X increases by 1 unit, and Z is the same (or, ‘while accounting for the
impact of Z’, or, ‘all else equal’), Y changes by the slope (or, coefficient) b1
More broadly, Multiple Regression for k number of IVs
Yi = α + β1X1i + β2X2i + … + βkXki + εi
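A minimal sketch of this equation with two IVs on simulated data (assuming the statsmodels package), where each estimated slope is a partial coefficient:
    import numpy as np
    import statsmodels.api as sm
    rng = np.random.default_rng(0)
    n = 200
    z = rng.normal(size=n)                     # a second IV / potential confounder Z
    x = 0.8 * z + rng.normal(size=n)           # X is correlated with Z
    y = 2.0 + 1.5 * x + 1.0 * z + rng.normal(size=n)
    X = sm.add_constant(np.column_stack([x, z]))   # intercept + X + Z
    model = sm.OLS(y, X).fit()
    print(model.params)      # intercept, b1 (effect of X holding Z constant), b2
    print(model.conf_int())  # 95% confidence intervals for each coefficient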
Multiple Regression
In that way, our fourth and final criterion for causality can be met
Multiple Regression: geometry
In multiple regression, the fitted equation (e.g., Ŷi = a + b1X1i + b2X2i for two IVs)
denotes a plane (the ‘response plane’)
There is still a single intercept, a,
but now there are two slopes, b1 and b2
Remedy
• Increasing the sample size (collecting more data)
• Combining collinear variables (if conceptually similar) by adding their values (see the sketch after this note)
• Accepting it as a fact of the model (the resulting ‘wider’ sampling distributions are not biased)
NB. Removing IVs suspected of collinearity should be avoided, because this can lead to
omitted variable bias; it is less problematic to have collinearity and larger standard
errors than to miss a potentially important causal variable
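A minimal sketch, on made-up data, of the second remedy: checking how strongly two IVs are correlated and, if they are conceptually similar, adding their values into one combined variable:
    import numpy as np
    rng = np.random.default_rng(1)
    x1 = rng.normal(size=100)                        # hypothetical IV
    x2 = x1 + rng.normal(scale=0.1, size=100)        # nearly collinear with x1
    r = np.corrcoef(x1, x2)[0, 1]
    print(round(r, 2))                               # very high correlation
    if abs(r) > 0.9:
        combined = x1 + x2                           # single combined index replaces x1 and x2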
Multiple Regression: terminology
• Explanatory variables
• Covariates
• Predictor variables
• ‘Right-hand side’ variables
What does this tell us? A $1,000 increase in income (one X unit) is associated with a
2.18-point increase in the expected prestige of an occupation. Also, at zero income the
predicted prestige (the intercept) is negative
Multiple Regression: statistical inference
Regression results
Coefficient on Income : 2.18
Standard error of the coefficient: 0.36
Using t = (b − β0) / sb = b / sb (coefficient / standard error), we have
t = 5.95
This value exceeds the critical t, therefore this estimate is significant at
the 5% level (two-sided)
Confidence interval: 1.08 to 3.29
p-value < 0.001. Hence, we can be very confident in rejecting the null hypothesis
that income is unassociated with prestige
R-squared (goodness of fit): 0.73. Thus, income (X) explains 73% of the
variance in prestige (Y)
Multiple Regression: statistical inference
Predictions
But could there be another factor that affects the status of an occupation
besides income?
Regression results
Coefficient on Income: 0.65 (<2.18)
Standard error of the coefficient: 0.41
𝑡 = 1.59, which does not exceed the critical t, therefore this estimate is not
significant
Confidence interval: -0.24 to 1.54
p-value: 0.139
Multiple Regression: goodness of fit
We can also use the Adjusted R-squared, which accounts for number of
observations and variables in the model (degrees of freedom)
R²adj = R² − (k / (n − k − 1)) ∗ (1 − R²)
      = 0.9 − (2 / (15 − 2 − 1)) ∗ (1 − 0.9) ≈ 0.883
(where k is the number of IVs and n the number of observations)
This indicator shows how reliable the fit is and how much of it is driven simply by
adding IVs to the model:
it increases only when a new predictor genuinely strengthens the model
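A quick check of the calculation above:
    r2, k, n = 0.9, 2, 15
    r2_adj = r2 - (k / (n - k - 1)) * (1 - r2)
    print(round(r2_adj, 3))   # -> 0.883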
Multiple Regression: statistical inference
Result
The effect of Income on Occupational Prestige was overestimated due to an
omitted confounding variable (Education).
Income does not equate with Prestige and our causal inference was
incorrect.
Once we control for a confounding variable, we can obtain a less biased
estimate
Multiple Regression: interpretation
Example (K&W)
The estimate β̂(Growth) differs between model A and model C because model C controls for the effects of quarterly economic growth
In model C,
β̂Std(Growth) = 0.58 × (5.5 / 6.0) = 0.53
This means that for every 1 standard deviation increase in Growth, we estimate an
increase of 0.53 standard deviation in the vote percentage for the incumbent
candidate, while controlling for the effects of quarterly econ. growth
β̂Std(quart. econ. growth) = 0.63 × (2.9 / 6.0) ≈ 0.31
Similarly, for every 1 standard deviation increase in quart. econ. growth, we estimate
a 0.31 standard deviation increase in the vote percentage for the incumbent
candidate, while controlling for the effects of overall Growth
Therefore, overall Growth has a greater impact on incumbent voting % than ‘good
news’ about consecutive econ. growth in the year leading up to an election
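A quick sketch of the standardization above (coefficient × s_X / s_Y); note that the slides' 0.31 comes from unrounded inputs, so the rounded inputs used here give 0.30:
    b_growth, sd_growth = 0.58, 5.5        # coefficient and std. dev. of Growth
    b_quart, sd_quart = 0.63, 2.9          # coefficient and std. dev. of quart. econ. growth
    sd_vote = 6.0                          # std. dev. of the DV (incumbent vote %)
    print(f"{b_growth * sd_growth / sd_vote:.2f}")   # -> 0.53
    print(f"{b_quart * sd_quart / sd_vote:.2f}")     # -> 0.30 (0.31 with unrounded inputs)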
Multiple Regression: Regression model
Models are simplified versions of the world that help researchers explore
the potential causes of phenomena by mathematically sorting out what
variables may have an impact
Multiple Regression: Regression model
Note that adding variables will often change the other coefficients
(slopes)
Multiple Regression: Regression model
But, at times many variables are not continuous (instead, nominal, ordinal)
Example (Albright)
A linear model cannot accurately depict the cumulative effects of alcohol on the
likeability of the singer (beer nr 3 does not produce the same effects as beer nr 7).
Non-linear models like BNL and BNP accurately accommodate different rates of change
(effect) at the opposite ends of the IV. Essentially, these models take the linear one
and filter it through a function based on a probability distribution to reflect the
non-linear relationship within the 0-1 range
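A minimal sketch of the idea, using the logistic (sigmoid) function as the filtering distribution and made-up intercept and slope values:
    import numpy as np
    def logistic(v):
        return 1.0 / (1.0 + np.exp(-v))    # sigmoidal s-shaped function, output in (0, 1)
    a, b = -4.0, 0.8                       # hypothetical intercept and slope
    beers = np.arange(0, 11)               # IV: number of beers consumed
    prob = logistic(a + b * beers)         # predicted probability of the outcome
    print(np.round(prob, 2))               # changes slowly at the ends, quickly in the middle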
Regression: Polytomous Categorical Dependent Variables
Data can often be collected across time through repeated, regular temporal
observations on a single unit of analysis (Shin).
This analysis allows us to investigate the ‘history’ of a variable and explain
patterns over time (including the systematic or random nature of residuals)
Estimation of long-term behavior (‘trend’) involves bivariate OLS Regression
analysis with time (in regular intervals) as IV. It also conveys information
visually on trends (e.g., the impact of events, policies, etc.)
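A minimal sketch, on a made-up yearly series, of estimating the trend by regressing the variable on time:
    import numpy as np
    years = np.arange(2000, 2020)                      # regular yearly intervals (IV)
    rng = np.random.default_rng(2)
    y = 50 + 1.2 * (years - 2000) + rng.normal(scale=2, size=years.size)   # hypothetical series
    slope, intercept = np.polyfit(years, y, deg=1)     # bivariate OLS of y on time
    print(round(slope, 2))                             # estimated long-term trend per year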
Big data: the capacity to search, aggregate, and cross-reference large datasets
(Boyd and Crawford)
Digital trace data (big data and metadata): a characteristic of the digital era
2017: 500,000,000 tweets/day (Source: Omnicore)
2017: 1,500,000,000 daily active FB users (Source: Zephoria)
2017: 260,000,000,000 emails sent/day
2019: 319,600,000,000 emails sent/day (Source: Radicati Group)
Pandemic has only accelerated this trend
NB. One zettabyte is 1 trillion gigabytes (or all of Shakespeare’s works 178 trillion times)
2021 - total volume of data created: 64.2 zettabytes
2025 - total volume of data created (forecast): 180 zettabytes
In an era of 24/7 connectivity, humans produce, emit and provide data (some
argue we are data) daily that can be collected and analyzed.
Big data: promise
These predictions are made with some level of accuracy, which techniques
like regression seek to improve. A model that identifies sub-optimal
predictions and adjusts towards greater accuracy is said to ‘learn’ - hence,
‘machine learning’
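A minimal sketch of this logic on simulated data (assuming the scikit-learn package): the model is fit on one portion of the data and its predictive accuracy is checked on held-out observations:
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    rng = np.random.default_rng(3)
    X = rng.normal(size=(500, 3))                                   # three hypothetical predictors
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LogisticRegression().fit(X_train, y_train)              # 'learn' from the training part
    print(round(model.score(X_test, y_test), 2))                    # out-of-sample predictive accuracy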
Machine Learning: discovery
Examples:
Jungherr and Theocharis vs. Grimmer on methodological merit
Blair and Sambanis vs. Beger, Morgan and Ward on whether a
theory-based model is better than Machine Learning to predict civil
war onset.
Advanced Machine Learning: Artificial Neural Networks
The learning part of creating models has led to the next step of artificial neural
networks, a broader family of learning methods based on learning data
representations (instead of task-specific algorithms). These networks store and
evaluate how significant each of the inputs is to the output of a model (NB. both
inputs and outputs are binary: 1s and 0s). At the same time, these types of
models include an intermediate, not observable ‘hidden layer’ that stores
information regarding the input’s importance, and it makes associations between
the importance of combinations of inputs (Johnson, Rowe). In that sense, it mimics
the human brain’s architecture and function by quickly making decisions
Models with more than one intermediate, unobservable ‘hidden layer’ are engaged in
‘deep learning’. As with artificial neural networks, each connection has its weight,
but in deep neural networks the most important features for classification can be
obtained automatically (via the ‘activation function’), similar to neurons in the
brain
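A toy sketch (with arbitrary, made-up weights) of a network with one hidden layer: inputs are multiplied by connection weights and passed through a sigmoid activation function:
    import numpy as np
    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))
    x = np.array([1, 0, 1])                    # binary inputs
    W_hidden = np.array([[0.5, -0.3, 0.8],     # weights from the 3 inputs to 2 hidden units
                         [0.2, 0.7, -0.5]])
    w_out = np.array([1.1, -0.9])              # weights from the hidden units to the output
    hidden = sigmoid(W_hidden @ x)             # hidden-layer activations (not directly observed)
    output = sigmoid(w_out @ hidden)           # probability-like output between 0 and 1
    print(round(float(output), 2))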
Application example
From Assael, Y., Sommerschield, T., Shillingford, B. et al. (2022) Restoring and
Attributing Ancient Texts Using Deep Neural Networks. Nature 603, 280–283.
Probabilistic prediction of missing ancient text via deep neural networks. In the example displayed above, this inscription
[Inscriptiones Graecae vol. 1, edition 3, document 4, face B (IG I3 4B)] records a decree concerning the Acropolis of Athens and
dates to 485/4 BC. (Marsyas, Epigraphic Museum, WikiMedia CC BY 2.5). The marble-looking parts are the surviving parts of the
inscription, with the rest predicted by a self-learning deep neural network algorithm, Ithaca. Alone, archaeologists were able to
correctly predict 25% of the text; Ithaca achieved 62%; a collaboration between the two yielded 72% successful prediction.
See more details at https://www.nature.com/articles/d41586-022-00702-6
Part I, Analysis III: main points
If X increases by 1 unit, and Z is the same (or, ‘while accounting for the impact of
Z’, or, ‘all else equal’), Y changes by the slope (or, coefficient) b1
b1 represents the effect of X on Y while holding constant the effects of Z and
b2 represents the effect of Z on Y while holding constant the effects of X
Part I, Analysis III: main points
To find out which IV(s) cause Y, one creates models, each containing a different
combination of the independent variables measured.
To compare Regression coefficients measured in different units, they are standardized
Another way to compare models is the AIC, BIC and related metrics.
IVs that are not continuous (instead, nominal or ordinal), whether dichotomous or
polytomous (if the categories represented are both exhaustive and mutually
exclusive), can be represented by ‘dummy’ variables that take a value of either one
(presence of a characteristic) or zero (absence).
Time Series is an analysis that investigates a variable and explains patterns over
time.
Part I, Analysis III: Glossary
Response plane: in multiple regression, the surface formed by b1 and b2 which (like a
line does in two-variable regression) represents the predicted values
Residuals: Residuals are the vertical distance between each observation and the
regression surface.
Part I, Analysis III: Glossary
Multicollinearity: when two or more of the IVs in a Regression model are highly
correlated with one another.
Outliers (or, influencers): extreme value observations relative to other ones - often
with unusual IV values (or, leverage) and large residual values - that strongly influence
the parameter estimates in a Regression model.
Models: Models are simplified versions of the world that help researchers explore the
potential causes of phenomena by mathematically sorting out what variables may
have an impact.
Adjusted R-squared: similar to R-squared [0-1]; indicates how reliable the fit is after accounting for the number of observations and IVs in the model
AIC, AICc, BIC and Mallows Cp: metrics used for model evaluation and selection. The
lower these metrics, the better the model
Dummy variables (or ‘indicators’): dichotomous variables that take a value of either
one (presence of a characteristic) or zero (absence). Can also be used for polytomous
independent variables if categories they represent are mutually exclusive, exhaustive
Part I, Analysis III: Glossary
Linear Probability Model (LPM): special case of an OLS model where the DV is a dummy
one, and the DV estimates are interpreted as predicted probabilities.
Binomial Logit (BLM) and Binomial Probit (BPM): account both for non-linearity and a
dichotomous DV. These models are able to capture non-linear relations via their
cumulative distribution functions which have a sigmoidal s-shape. For goodness of fit,
Pseudo-R-squared and the Percent Correctly Predicted are used.
In OLS, the DV is both continuous and observed (hence, residuals can be measured against
the regression line). The BNL models an unobserved probability (we only observe 1s and 0s).
To estimate the parameters of this model, the method of Maximum Likelihood Estimation (MLE)
is used
Time Series: data can often be collected across time through repeated, regular temporal
observations on a single unit of analysis. This analysis allows us to investigate the ‘history’
of a variable and explain patterns over time.
Part I, Analysis III: Glossary
Big data: the capacity to search, aggregate, and cross-reference large datasets.
Machine learning (ML): ML is a ‘class of flexible algorithmic and statistical techniques
for prediction and dimension reduction’. More plainly, it is the automation of
statistical learning techniques by computers to identify patterns in data and make
predictions.
Have a restful, productive and healthy Winter reading week!