POL244 Midterm Lectures

This document provides an overview of the POL244H Research Methods for Political Science II course being offered in Winter 2024. It will cover topics related to research design, data collection, analysis, and indigenous methods over 12 weeks. The course involves both weekly lectures and optional tutorials to provide additional support. Students will complete one assignment related to questionnaires. Experimental and observational research designs are discussed as two main approaches used in political science to explore causality. The benefits and limitations of each are outlined. Key concepts include internal and external validity, and different types of observational studies such as cross-sectional, time series, and hybrid approaches.

POL244H

Research Methods for Political Science II

Winter 2024
Wednesdays 1-3pm @ MN1170

Part I
Data I: on Research Design, Experiments,
Interviews and Questionnaires
Week 1 Jan. 10 Introduction and course details
Week 2 Jan. 17 Data I: Research design, experiments, interviews and questionnaires
Week 3 Jan. 24 Data II: Sampling, size and distributions

Week 4 Jan. 31 Analysis I: Univariate, bivariate analysis


Week 5 Feb. 7 Analysis II: Regression Statistics
Week 6 Feb. 14 Analysis III: Regression (cont.); big data, machine learning, Networks
Reading week NO CLASS

Week 7 Feb. 28 Mid-term test


Single and small-n cases
Week 8 Mar. 6 Data I: Ethnography
Week 9 Mar. 12 Data II: Interviews, archives, texts/documents

Week 10 Mar. 19 Analysis I: Process tracing, content analysis

Week 11 Mar. 26 Analysis II: Comparative study

Week 12 Apr. 3 Indigenous methods, Research ethics, conclusions


Today’s schedule

Announcements

Research Design, Experiments and Observation Studies


(Concepts and Measurements)
Surveys (Interviews and Questionnaires)
Tutorials

Department has condensed the three groups into two

Time slots:
T0101 3-4 pm (@IB377)
T0102 4-5 pm EST (@IB377)

They are conducted by our TA, Mujahed, and begin today after our class
Tutorials

Exceptionally, today’s tutorials will be conducted via remote platform

Zoom Meeting
https://utoronto.zoom.us/j/87382638114

Meeting ID: 873 8263 8114


Passcode: POL244
Assignments

Assignment 1 (5%) is related to questionnaires


Due: January 28, by 11:59pm EST (Quercus)
Details will be presented in today’s tutorials
Main Points and Glossary

At the end of every set of slides posted on Quercus, you will be able to
find the main points and a glossary of terms related to the topic and
concepts we address that week
Research Design
In last week’s introduction we spoke about variables and causation.

How do we know if an X causes Y?

Political Scientists use a number of research strategies to explore causality,


including experiments and observational studies (and case studies-which we
will look at later in the course)
Experiments
Definition: a research design in which the researcher both controls and
randomly assigns values of the independent variable to the participants

Example: in the Medical Sciences, to test a new medicine, two sets of


participants are randomly assigned to each group:

Treatment group: receives new medicine


Control group: does not receive it (instead, a placebo)

This means that the experiment’s participants are randomly assigned to one of
two possible values of X
Experiments
Random assignment* of groups ensures that the comparison between them is
as ‘pure as possible, and that some other cause of the DV (say, a factor Z) will
not pollute [or, affect] that comparison.’ (Kellstedt & Whitten)

Randomness ensures that these groups are identical, save for the different
values of X (rather than any of them having particular characteristics that might
skew the testing).

*Not to be confused with random sampling (study subjects selected at random


from a general population to participate in a study) which we will address next
week
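To make the distinction concrete, here is a minimal Python sketch (not from the course materials; the participant labels are hypothetical) contrasting random sampling with random assignment:

```python
import random

# Hypothetical pool of potential study subjects (illustrative names only)
participants = [f"participant_{i}" for i in range(1, 21)]

# Random SAMPLING: selecting study subjects at random from a larger population
sample = random.sample(participants, k=10)

# Random ASSIGNMENT: splitting the selected subjects at random into
# treatment and control groups (the two possible values of X)
random.shuffle(sample)
treatment_group = sample[: len(sample) // 2]
control_group = sample[len(sample) // 2 :]

print("Treatment:", treatment_group)
print("Control:", control_group)
```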
Experiments: causal checklist
How do experimental set ups fare against a causality checklist?

1. Existence of credible causal mechanism between X and Y


Requires study of theory to hypothesize plausible causal connections

2. Elimination of the possibility that the causal arrow is reversed (Y→ X)


Random assignment of values of X removes any chance it is correlated with any
variable, like Y
Experiments: causal checklist
How do experimental set ups fare against a causality checklist? (cont.)

3. Covariation between X and Y


Can be established through a statistical evaluation of the relationship between them

4. Control for effects that may be instead caused by other, ‘confounding’ variables (Z),
rendering the correlation between X and Y spurious
Random assignment of values of X (✓ treatment, ✗ control) removes any chance it is
correlated with any variables, like Z (potential confounding variables).
Note that this does not mean there are no other potential causes of Y, but that, thanks
to randomness, the two groups of the experimental setting are equally affected by
them (therefore controlled, allowing us to check for variation based on X).
Experiments
With all four criteria satisfied, we can speak of internal validity – ‘a research
design yielding high levels of confidence in the conclusions about causality
among the cases that are specifically analyzed’ (Kellstedt & Whitten)

External validity, on the other hand, refers to whether a truly random sample of the
population is being tested – that is, whether a study's conclusions apply equally to
others (i.e., whether such conclusions are generalizable)
Validity
Different research methods fare differently with respect to internal and external
validity. There is no trade-off between the two, and all approaches can be
valuable, but in general it is preferable to have a good measure of both
(be within what a few scholars have termed the ‘cone of validity’)

[Figure: the 'cone of validity' – internal vs. external validity of different research approaches. Modified from Bhattacherjee]


Experiments: types
Laboratory experiment: under controlled, artificial conditions (Natural/Medical Sciences)

Field experiment: (designed) study occurs in a natural setting (Social Sciences)


As such, X cannot be controlled or assigned randomly by the researcher;
instead, X values arise naturally, and this can resemble a random assignment

Quasi-experiment: (self-emerging) experiment-like conditions produced naturally (e.g.,


population of an isolated island with no TV can be used to explore hypotheses on TV and
violent behavior). The main difference with a field experiment is that the population is not
randomly assigned.
Drawbacks of experimental research design
1. Not every X (IV) is controllable
In the Social Sciences, it is often not possible to meaningfully assign random
values to groups of participants (e.g., to test for the impact of a revolutionary
past on a state's foreign policy, we cannot assign different revolutionary
histories to the populations of the two groups we are testing; these already
exist, or they don't)
2. Potentially low external validity of:
• the sample of the population tested
(often not representative, as many experiments employ a sample of
convenience – that, in turn, makes it difficult to generalize results, unless
the experiment is replicated and produces the same results with different
samples), and of,
• the X variable
(natural and experimental environments are very different, which can affect
whether/how subjects interact with it)
Drawbacks of experimental research design
Challenges to external validity (details)

• Representativeness of the participants of a study and generalizability


• History effects: can one generalize into the future?
• The effects of the experimental setting
• Pre-testing effects: pre-screening experimental subjects may introduce bias
• Reactive effects: experimental subjects recognize they are in an experiment,
and this may introduce bias
Drawbacks of experimental research design
3. Ethical dilemmas - e.g., Milgram 1963, 1974; Zimbardo 1971; Facebook ‘Massive-Scale
Emotional Contagion through Social Networks’ 2012 (in addition to low external validity)

Milgram https://vimeo.com/93599024 Zimbardo https://www.youtube.com/watch?v=DsWJPNhLCUU

4. Over-emphasis of X’s causal prominence


(that X may be found to have an effect does not preclude other causes)
Observational studies
As it is difficult to conduct controlled experiments in the political realm, in
seeking to emulate them, more common research designs in Political Science
involve observation of reality as it exists.

Definition: ‘a research design in which the researcher does not have control over
values of the independent variable, which occur naturally.’ (Kellstedt & Whitten)
Still, a degree of variability in the IV across cases, and variation in the DV must
be present
Observational studies: types (pure)
Cross-sectional: (quantitative) looks at variation in different units at a single time
unit; ‘examines a cross-section of social reality, focusing on variation between
individual spatial units (e.g., citizens, countries, etc.) and explaining the variation in
the DV across them.’ (Kellstedt & Whitten)

[Figure: GDP of select countries, Q1 2021. Source: OECD]
Observational studies: types (pure)
Time-series: focuses on variation of a single unit over multiple time units;
‘examines variation within one spatial unit over time.’ (Kellstedt & Whitten)

[Figure: GDP of Canada, 2000-20, in bn CAD. Source: StatCan]
Observational studies: types (pure)
Time-series: focuses on variation of a single unit over multiple time units;
‘examines variation within one spatial unit over time.’ (Kellstedt & Whitten)

[Figure: Canadian PM Justin Trudeau approval rating, 2014-21. Source: Angus Reid]
Observational studies: types (pure)
Note that there can also be hybrid designs, combining the two pure types

While not aspiring to explain causes and effects in terms of general laws and
principles (like quantitative research does), qualitative research also employs
observational studies.
Cross-sectional qualitative research can also resemble a quantitative study’s cross-
sectional structure (e.g., interviews with inventory of issues to be discussed)
In qualitative studies, longitudinal design also examines cases in different times
but without manipulating the IV like in experiments. It involves panel and cohort
studies that study groups in different occasions, or groups sharing the same
experience over time; case studies often include this type of research.
Observational studies: components
There exists an abundance of observed data that can be used in exploring
political phenomena. Many are unstructured and unorganized (e.g., answers to
open questions in interviews, content in books, etc.).
Researchers go through them, deriving categories and codes that
allow the information to be ordered and made available for systematic study.

Observational designs examine data sets --e.g., the values of countries’ GDP in
2021 (the spatial unit being the countries and the time unit 2021), or values of
the single spatial unit Canadian PM’s approval rating across time (the time unit
being the month)

These units are called the data set’s dimensions


Observational studies: causal checklist
How do observational studies fare against a causality checklist?

1. Existence of credible causal mechanism between X and Y


Like all other research designs, requires study of theory to hypothesize plausible
causal connections

2. Elimination of the possibility that the causal arrow is reversed (Y→ X)


As X and Y are observed, it is often more difficult to establish which causes
which (e.g., in Democratic Peace theory, does democracy cause peace, or does
peace cause democracy?)
Observational studies: causal checklist
How do observational studies fare against a causality checklist? (cont.)

3. Covariation between X and Y


Again, this can be established through a statistical evaluation of the relationship
between them. Even if no covariation is found, it is still possible to uncover it once
the researcher controls for other variables, Z

4. Control for effects that may be instead caused by other, ‘confounding’ variables
(Z), rendering the correlation between X and Y spurious
Multiple regression analyses can help researchers uncover if controlling for other
variables reveals an X and Y causal relationship. But one has to try and identify
all possible confounding variables, in order to statistically control for them.
Here, (again) theory and examination of prior studies on the topic can help.
Concepts and their measurements
Concepts are abstract terms that represent and organize characteristics of
objects, phenomena and ideas in the political world
Measures are observable, empirical evidence

Qualitative research moves from measure to theme (identification of recurring


patterns relevant to the topic of study) to concept. This is an inductive approach

Quantitative research starts with a concept, translates (or operationalizes) it


into a variable (concrete representation of the concept that varies) and derives
from it an indicator (tool to assign individual real-world cases to the different
values of the variable). This is a deductive approach; the first part of our course
focuses on it.
Concepts and their measurements
To explore causality between variables (mostly through observational studies), Social
Science examines their statistical association.

To do so, variables need to be operationalized from a theoretical, more abstract


conceptual level to an empirical, quantified one so they can be measured and
statistically analyzed. Hence, we need data that correspond to the concepts we are
trying to study.

Examples:
Political legitimacy (concept) → operationalization → Frequency of anti-gov’t protest (variable)
Democratic consolidation (concept) → operationalization → Number of Civil Society organizations (variable)
Concepts and their measurements
Measurement is important to

• Identify differences / variations


• Provide a consistent benchmark for detecting them, and,
• Offer a basis for estimating the strength of a relationship between variables
Concepts and their measurements
In the Social Sciences, measurement is not always as straightforward as in the
Natural/Medical Sciences.
Some concepts (e.g., GDP) are easier to capture than others (e.g., leadership
style).

Concepts in question need to be clear, easily quantifiable, and measurable in a


valid and reliable way.
Reliability
When the measure of a concept is repeatable or consistent, it is deemed to
be reliable.
Applying the same measurement rules to the same case or observation will produce
identical results (as opposed to inconsistent ones).

Validity
When the measure of a concept is represented accurately, it is deemed to be
valid.
In contrast, an invalid measure measures something other than what is
intended.

NB. Both reliability and validity are needed to evaluate causality.


Information
To explain phenomena and construct theories and hypotheses, Political
Scientists need to collect information, or, data.
Data collection: primary and secondary data
This can be done by qualitative and quantitative research via surveys
(interviews and questionnaires) and related sampling techniques

Primary data: new (‘raw’) information emanating from qualitative and


quantitative research

Secondary data: analysis of existing information (secondary datasets*)


previously collected by other researchers

*E.g., Statistics Canada, UN, IMF, OECD, World Bank, MAR (Minorities at Risk),
COW (Correlates of War), ICB (International Crisis Behavior)
Primary data collection and qualitative research
When there is little knowledge on a topic, qualitative approaches help provide
in-depth, rich explanations and new findings. Researchers draw information
from human subject research through:

Interviews-respondents are asked questions that are recorded for analysis


Focus groups-group of respondents (smaller set) is engaged in discussion over
limited time period with information that is recorded for analysis
Observation- (often ethnographic) research subjects are observed by
researcher(s) embedded in real-life settings with the emerging information
collected over longer periods of time and recorded for analysis

These approaches involve conducting background research on the topic-and


subjects, selecting what population to study, collecting (in a less rigid, semi-
structured fashion, i.e., with a plan but also flexibility to ask or expand),
recording, transcribing and analyzing data
Primary data collection and quantitative research
This type of research uses information to make descriptive and explanatory
claims about large groups of individuals (towards generalization)

Primary information is collected by researchers with interviews and broader


survey research through the design and administration of a questionnaire

(The other main information-seeking approach that non-experimental research
engages in is the collection and analysis of secondary (existing) data)
Primary data collection: advantages and disadvantages of approaches
Qualitative
• Provides rich, detailed information
• Cannot generalize from small data sets
• Observation can take a long time
• Subjects may be conscious and change behavior if observed

Quantitative
• Generalizable, can explain more towards theory-testing and systematic amassment
of knowledge
• Can ignore real-world settings
• Potentially sidesteps human subjects’ perceptions for sake of ‘findings’
• May convey artificial sense of accuracy and precision
• Assumes an objective reality, independent of observation
• Creates power hierarchies (researcher vs. subject)
Data collection: interviews and questionnaires
Overall, survey research follows this sequence of steps:
1. Selecting a population of interest
2. Drawing a sample from this population
3. Devising a number of questions to measure concepts of interest
4. Survey made available to the research subjects
5. Data are collected, cleaned and tabulated
6. Data are analyzed via descriptive and inferential statistics

Our focus here is on interviews and questionnaires – as part of survey research


Forms of research
• Structured (prepared question and answer format, same for all
respondents)

• Open (questions asked not prepared in advance-rarely used)

• Semi-structured (mixture of more rigid, formal questioning and not


pre-determined questions)
Interviews
Interviews typically less structured (than surveys)

• One-person (researcher faces respondent, asks series of questions, records


responses)

• Focus group (may include also multiple interviewers)


Interviews
Conducting in-person interviews: guidelines

Before
• Planning (identifying a population to study and sampling)
• Creating an interview framework (set and sequence of questions to ask)
• Knowing the interview schedule
Interviews
Conducting in-person interviews: guidelines (cont.)

During
• Interview in the form of conversation
• Introducing the interviewer and the research to respondents
(who the interviewer is, who the research is by and what it is for, how the
respondent has been selected, explain the voluntary nature and confidentiality
clause, allow the interviewee to ask any questions)
• Establishing rapport
• Using probing cautiously (only if interviewee needs help understanding, or
further clarifications, or if interviewer needs more data, details)
• Avoiding prompting (suggest a possible answer to an open question)
• Recording information (after having obtained consent) during the interview
Interviews
Conducting in-person interviews: guidelines (cont.)

After
• Recording information after the interview (as soon as possible)
• Creating a transcript of the recording or experience towards analysis
• Converting (where appropriate) into datasets and identifying any errors
• Begin analysis (where appropriate)
Interviews
In-person interviews (structured) – checklist

• Clarity regarding introduction, research, selection process, instructions,


questions themselves and their recording
• Relevance
• Pre-tested value of questions
• Avoidance of ambiguity, loaded, long, general, leading, technical questions
• Ability of respondents to understand and respond adequately
• Balance of questions
• Reasonable expectations (respondent’s memory)
Interviews
Alternatives to in-person interviews: telephone interviewing
Advantages
• Lower cost
• Easier to supervise
• Reduced potential face-to-face (characteristics) bias

Drawbacks
• Lack of telephone, or land line
• Duration (shorter, short attention span)
• Hearing impairments
• Impersonal (lacks rapport, interviewer cannot see respondent’s reactions)
• Target may be missed (who is responding?)
Interviews
Alternatives to in-person interviews:

Computer-assisted interviewing (web-based, or, by email)

Web surveys: click on link, complete online


(with filter questions, skipping automatically to the next appropriate one)

Email surveys: attached questionnaire received, completed and then returned


by email
Interviews
Also (qualitative research)

Alternatives to in-person interviews:

Researcher-driven diary – researchers ask participants to keep a diary and


record their impressions, observations, etc. Can be either structured, or ‘free
text’. Can provide reliable, more in-depth, more sensitive information.
But…
more expensive, prone to respondent fatigue, errors, omissions or editing
Questionnaires
Questionnaires– essentially, structured interviews without an interviewer
Development and deployment of a questionnaire to collect original data is an
important part of survey research; often part of quantitative, large-N research

Paper-pencil, electronic; (e)mail (for individuals), or group administered


(not to be confused with a focus group, which is semi-structured and facilitated)

Differences b/w questionnaires and interviews


• Respondents must read the questions and record their own answers
• Fewer open questions (closed easier to answer)
• User-friendly designs
• Shorter (avoids respondent fatigue)
Questionnaires
Advantages and drawbacks b/w questionnaire and structured interview
• Less expensive
• Easier to administer
• Absence of interviewer effects
But…
• Unable to further explain questions
• Risk of missing information (incomplete questionnaire)
• Unable to probe, ask many, complex questions
• Questionnaire can be read as whole, biasing respondents
• Problematic with some populations (e.g., language)
• Identity of respondents
Questionnaires: types and formats of questions
Types
• Personal, factual about oneself: about respondents’ age, occupation, etc.
• Factual about others: less reliable unless researcher interested in
respondents’ perceptions
• Factual about event: useful when questioning about a witnessed event
• Factual about general knowledge: what a respondent knows in general
• Factual about belief: “agree”, “disagree”, etc.
• Attitude: evaluations of events, ideas, etc.
• Behavior: shopping, news consuming choices, voting preferences, etc.
(Attitude and behavior questions are the most common and important part)
Questionnaires: types and formats of questions
Formats
• Open (O) (or, open-ended) - respondents have freedom to respond as they
wish, and,

• Closed (C) (or, close-ended) - fixed number of concrete answers to select from
– Forced choice (limited choice of answers that best reflects respondents’ position)
– Scale (asks respondents to rate their position on a statement - e.g., ‘strongly agree’, ‘moderately agree’, etc.)
– Feeling thermometer (respondents indicate their ‘warmth’ – e.g., ‘how do you feel about x?’)
Questionnaires: types and formats of questions
Example (modified from Besco)
• Open (O) question: “What was the primary reason for applying to UTM?”
. . . . . . . . . . . . . . . . . . . .
• Closed (C) question: “What was the primary reason for applying to UTM?”
a. School’s ranking and reputation
b. Quality of program of study
c. Lower fees compared to U.S. universities
d. Proximity to home
e. Family member, friend or alumnus/a recommendation
f. Other: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Questionnaires: design
How to design questions (O, C) (Bryman and Bell)

• Focus on what the research question is, structure questions around it


• Decide what needs to be known
• How would you answer it
• Ensure you do not include ambiguous terms
• Make sure respondents have knowledge required to answer
Questionnaires: design
How to design (C) questions

• Refrain from terms that are too technical


Questionnaires: design
How to design (C) questions (cont.)

• Provide a balanced range of answers (e.g., ‘strongly agree’, ‘agree’, ‘neutral’,


‘disagree’, ‘strongly disagree’-rather than ‘strongly agree’, ‘somewhat agree’,
‘agree’, ‘neutral’, ‘disagree’)
• Establish that answers are mutually exclusive (e.g., ‘10-30%’, ‘40-70%’, ‘80-
100%’ rather than ‘10-30%’, ‘20-40%’, etc.)
• Have reasonable expectations on respondents’ ability to retrieve information
(e.g., avoid ‘Do you agree or disagree with the preamble of UNSC Res. 1325
(2000) on Women, Peace and Security?’)
• Consider whether a ‘Do not know’ option is suitable (can offer more choice,
but also escape clause for respondents)
Questionnaires: design
How to design (C) questions (cont.)

• Order questions bearing in mind the possible effect an earlier one might have
on a later one (e.g., ‘Do you know how many volts it takes to kill a human
being?’ followed by ‘Do you agree with the death penalty?’)
• Ask general questions ahead of specific ones to set the tone
• Pose important questions early to capture respondents’ attention before it
wanes
• Postpone asking uncomfortable questions (that might affect the respondent)
for later in the questionnaire
• Group set of questions according to themes (e.g., questions 1-5 on personal
data, rather than all over the place)
• Consider using existing questionnaires (saves time, allows for comparison)
Questionnaires: design
How to design (C) questions (cont.)

Types of questions to avoid:


• Long, very general, unspecific (“how satisfied are you with your job?”)
• Composite (e.g., “how satisfied are you with your salary and work
environment?”) that can obfuscate answer
• Combined, asking about more than one item (e.g., “Whom did you vote for in
the Sept. 2021 elections?”→ “Did you vote?” then “What party?”)
• Leading (e.g., “Given how much humanitarian assistance the government has
given to save lives, do you agree with its aid policies?”)
• Negative (e.g., “Should the government not proceed with this trade
agreement?” Yes, it should not? Or, No, it should not?)
Questionnaires: open and closed questions
Open (O) questions: advantages and drawbacks (Bryman and Bell; Berdahl et al.)

• Allow for replies that the researcher may have not thought about
• Respondents can expand on a topic and offer additional insights and focus
• Can provide pointers (what topic is important) for closed-format questionnaires

• Time-consuming both for the researcher (transcribing) and the respondent (who may be
dissuaded from replying)
• Answers have to be coded
• Accuracy issues
• Lack of consistency in asking (intra- and inter-interviewer reliability)
Questionnaires: open and closed questions
Closed (C) questions: advantages and drawbacks (Bryman and Bell; Berdahl et al.)

• Standardized
• Easier, faster to complete
• Fixed set of clear (hopefully) answers, render research clearer to respondents
• Help avoid intra- and inter-interviewer lack of consistency

• Lack of clarity in phrasing question and answers and resulting misunderstandings


(e.g., what does ‘frequently’ mean?)
• Fixed set of answers may not include all options
• Lack of spontaneity and rapport between researcher and respondents
• Data entry accuracy issues
Questionnaires: presentation
Guidelines for presentation of questionnaires

Clarity of layout (usually, vertical) and instructions


Keeping questions and possible answers (C) on same (paper or virtual) page
Questionnaires: online
Advantages and drawbacks of online vs. mailed questionnaires (Cobanoglu, Kent & Lee)

• Low cost
• Lower response and processing time
• Fewer unanswered questions
• Better response to open questions
At the same time…
• Lower response rates
• Limited to those with online access
• Confidentiality and anonymity issues
• Multiple replies
Secondary data
Data not directly collected by researchers themselves but by others.
Can include survey datasets and official statistics (closed-ended measures).
Most quantitative Political Science research based on them

• Less costly in time and research funds


• Readily available
• Quality (high) and standards (rigorous)
• Amenable to broader study, new interpretations, cross-cultural analysis

• Lack of familiarity with database


• Complex and obfuscating data
• Quality (low) and lack of variables
Part I, Data I: main points

To know if an X causes Y, Political Scientists use research strategies like experiments


and observational studies (and case studies).

Experiment (lab, field, quasi-): a research design in which the researcher both
controls and randomly assigns values of the independent variable to the
participants.
Observational study: a research design in which the researcher does not have
control over values of the independent variable, which occur naturally.

To explain phenomena and construct theories and hypotheses, Political Scientists


need to collect information, or, data.
This can be done by qualitative and quantitative research via surveys (interviews
and questionnaires) and related sampling techniques
Part I, Data I: Glossary

Internal Validity: when there is sufficient evidence that a causal relationship exists
between two or more variables

External Validity: the results of study can be generalized beyond the specific
research in which they were generated

Measurement Validity: degree to which a measure of a concept actually measures


what it is supposed to measure

Reliability: degree to which a measure of a concept is stable or consistent


Part I, Data I: Glossary
Primary data: new (‘raw’) information emanating from qualitative and quantitative
research

Secondary data: analysis of existing information (secondary datasets) previously collected


by other researchers

Questions: structured (prepared in advance, same for all respondents), open (not
prepared in advance), semi-structured (mixture of rigid and open questions)

Interviews: respondents are asked questions that are recorded for analysis.
Two types – one-person, focus group

Questionnaires: respondents read questions, record own answers, shorter, more rigid.
Still can be open-ended, or closed (forced choice, scale, feeling thermometer)
POL244H
Research Methods for Political Science II

On Questionnaires: Examples of Problematic Questions


Questionnaires: design
Examples (Besco)

1. “Should survivor benefits be based on any relationship of economic dependency


where people are living together, such as elderly siblings living together or two
friends living together, or should survivor benefits only be available to those in
married or common-law relationships and parent-child relationships?”

Problem:
Question is too wordy.
Should not be more than 20 words.
Should be able to ask the question comfortably in a single breath.
Questionnaires: design
Examples (Besco)

2. “The NDP will not form the Official Opposition after the next election; the Bloc
Quebecois will. Do you agree or disagree?”

Problem:
A double-barreled (combined) question.
Could agree with one part of the question and disagree with the other.
Questionnaires: design
Examples (Besco)

3. “How often have you read about politics in the paper during the last week?”

Problem:
Assumes respondent has read a newspaper at least once during previous week.
Questionnaires: design
Examples (Besco)

4. “Would you favor or oppose extending the USMCA to include other countries?”

Problem:
Assumes respondents are competent to answer.
May not know what acronym stands for (US-Mexico-Canada Agreement), what it is,
and/or what countries are currently included, etc.
Questionnaires: design
Examples (Besco)

5. “Do you agree or disagree with the supposition that continued constitutional
uncertainty will be detrimental and deleterious to Quebec’s possibilities for
sustained economic growth?”

Problem:
Question wording is unnecessarily confusing
Questionnaires: design
Examples (Besco)

6. “Do you agree that Canada has an obligation to see that its impoverished citizens
are given a humane standard of living?”

Problem:
Leading because it uses emotionally-laden language to encourage agreement with
the statement (e.g., “impoverished”, “humane”)
POL244H
Research Methods for Political Science II

Winter 2024
Wednesdays 1-3pm @ MN1170

Part I
Data II: Sampling & Descriptive statistics
Week 1 Jan. 10 Introduction and course details
Week 2 Jan. 17 Data I: Research design, experiments, interviews and questionnaires
Week 3 Jan. 24 Data II: Sampling, size and distributions

Week 4 Jan. 31 Analysis I: Univariate, bivariate analysis


Week 5 Feb. 7 Analysis II: Regression Statistics
Week 6 Feb. 14 Analysis III: Regression (cont.); big data, machine learning, Networks
Reading week NO CLASS

Week 7 Feb. 28 Mid-term test


Single and small-n cases
Week 8 Mar. 6 Data I: Ethnography
Week 9 Mar. 12 Data II: Interviews, archives, texts/documents

Week 10 Mar. 19 Analysis I: Process tracing, content analysis

Week 11 Mar. 26 Analysis II: Comparative study

Week 12 Apr. 3 Indigenous methods, Research ethics, conclusions


Today’s schedule

Announcements

Sampling, Size and Distributions


Variables
Descriptive Statistics
Tutorials

Today’s tutorials will be conducted via remote platform

Join Zoom Meeting


https://utoronto.zoom.us/j/87382638114

Meeting ID: 873 8263 8114


Passcode: POL244
Assignments

Assignment 1 (5%) is related to questionnaires


Due: February 1 (extended from January 28),
by 11:59pm EST (Quercus)
Data collection: primary and secondary data
This can be done by qualitative and quantitative research via surveys
(interviews and questionnaires) and related sampling techniques

Primary data: new (‘raw’) information emanating from qualitative and


quantitative research

Secondary data: analysis of existing information (secondary datasets*)


previously collected by other researchers

*E.g., Statistics Canada, UN, IMF, OECD, World Bank, MAR (Minorities at Risk),
COW (Correlates of War), ICB (International Crisis Behavior)
Data collection: interviews and questionnaires
Overall, survey research follows this sequence of steps:
1. Selecting a population of interest
2. Drawing a sample from this population
3. Devising a number of questions to measure concepts of interest
4. Survey made available to the research subjects
5. Data are collected, cleaned and tabulated
6. Data are analyzed via descriptive and inferential statistics

Our focus here is on interviews and questionnaires – as part of survey research


Questionnaires: online
Advantages and drawbacks of online vs. mailed questionnaires (Cobanoglu, Kent & Lee)

• Low cost
• Lower response and processing time
• Fewer unanswered questions
• Better response to open questions
At the same time…
• Lower response rates
• Limited to those with online access
• Confidentiality and anonymity issues
• Multiple replies
Secondary data
Data not directly collected by researchers themselves but by others.
Can include survey datasets and official statistics (closed-ended measures).
Most quantitative Political Science research based on them

• Less costly in time and research funds


• Readily available
• Quality (high) and standards (rigorous)
• Amenable to broader study, new interpretations, cross-cultural analysis

• Lack of familiarity with database


• Complex and obfuscating data
• Quality (low) and lack of variables
So, where were we?

To explain phenomena and construct theories and hypotheses,


Political Scientists need to collect information, or, data.
Sampling
First steps of survey research

1. Selecting a population of interest


2. Drawing a sample from this population
3. Devising a number of questions to measure concepts of interest
4. Survey made available to the research subjects
5. Data are collected, cleaned and tabulated
6. Data are analyzed via descriptive and inferential statistics
Populations of study
Is what we observe an isolated phenomenon? Can we generalize from our
knowledge?
Observations: individual pieces that represent our unit of analysis (what we
want to study)
All the possible observations that a dataset could contain are called the population
(or, universe); this is what we want to generalize about
It can consist of people (e.g., Canadian MPs) or things (e.g., policies, voted
laws). Some populations are small (e.g., U.S. Senators), while others are large (e.g., the
number of Canadian citizens)
Populations are studied to find their characteristics (which MP or senator voted
what, how do Canadians feel about a policy, etc.).
A characteristic measured for each individual member of a population is called
a population parameter (a numerical characteristic of a population)
Populations and samples
When size is small or resources plenty (e.g., national census collected from all elements
in a state’s population) the whole of the population may be included in a study.*

When it is not possible to include the whole population, research focuses on a sample – the set of
observations that a dataset does contain
(remember, datasets contain variables, and each variable represents a particular
characteristic related to a study’s observations)

Sampling - process of selecting a number of cases from a larger population for study.
Scores of a sample measured in numeric terms, called sample statistic (a numerical
characteristic of a sample).
Sample statistics used to estimate a population’s parameters.
*Or, in one’s imagination, like Borges’ 1946 “On Exactitude in Science”
Populations and samples
Sampling error: when there is a difference b/w the characteristics of a sample
(statistic) and those of a population (parameter) from which it was selected

It is inevitable to have some error in a sample, as long as it is not systematic (e.g., the sample
being non-representative, biased – thus non-random)
[Figure: samples of an ‘ice cream vs. no ice cream’ population showing no, little, non-negligible, and significant sampling error. From Bryman et al.]
Populations and samples
Random samples will vary randomly around the true population parameter and
the average of all possible sample statistics will be equal to the true population
parameter (Hiberts et al.)

Central Limit Theorem: the sum (or mean) of a large number of independent random variables
is itself a random variable and approximately follows a normal distribution (a distribution
with a symmetrical bell shape)

Important for statistical analysis (we will discuss how in future class)
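As an illustration (not part of the slides), a short Python simulation sketch of the idea: means of many random samples drawn from a non-normal population cluster symmetrically around the true population mean.

```python
import random
import statistics

# A (non-normal) population of 10,000 uniformly distributed values
population = [random.uniform(0, 100) for _ in range(10_000)]
true_mean = statistics.mean(population)

# 1,000 sample means, each from a random sample of 50 observations
sample_means = [
    statistics.mean(random.sample(population, k=50))
    for _ in range(1_000)
]

print("Population mean:", round(true_mean, 2))
print("Average of sample means:", round(statistics.mean(sample_means), 2))
print("Spread of sample means (s):", round(statistics.stdev(sample_means), 2))
```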
Populations and samples
Quantitative research
Uses large-N studies to identify patterns and generalize (external validity) from
the sample to the population
Process: data are collected, sample statistics are calculated, and used to
estimate the population parameters

Qualitative research
Focuses on small-n studies to uncover rich details and reach some conclusions
that lead to better understanding of the population

Both require proper sampling


Populations and samples
Sampling

• Low cost
• Less time-consuming

• Possibility it is not representative of the population

Hence, representativeness of the sample is critical


Sample: representativeness – sampling frame
Quantitative research seeks to generalize, hence it must have representative
sample (accurately representing population)

Representativeness depends on three factors:

1. Accuracy of sampling frame


2. Sample selection method
3. Sample size
Sample: representativeness – sampling frame

1. Accuracy of sampling frame – the list of all units or elements in the target
population

For example, for the 2023-24 population of university students in Canada, the sampling
frame is the list of all registered students in every Canadian university.

Sampling frame needs to include all cases, from which a sample can then be
drawn, otherwise, it may not be representative of the population.

Classic case: the 1936 Roosevelt vs. Landon election misprediction, based on a mail
survey drawn from non-representative lists.
Sample: representativeness – sample selection techniques
2. Sample selection method (two types):

a. Probability (random) sampling – the random selection of a sample, based on


probability theory, enabling use of statistics to test how likely it is
representative

b. Non-probability (non-random) sampling – not based on probability or


statistics (qualitative and some quantitative research)
Sample: representativeness – sample selection techniques
Probability (random) sampling (Hiberts et al.):

i. Simple (random) – every element within the population has equal chance of
being included in the sample
[For example, draw lots from a bowl; use a table of random numbers
to select a sample from a population; or (better) a random number generator]

ii. Systematic – used when sampling frame is a list


Here, one determines the size of sample interval (k), selects random
starting point b/w 1 and k, then selects every kth element for inclusion in
the sample. However, there is a danger of periodicity (cyclical pattern of
data that can produce biased sample)
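A minimal Python sketch (illustrative only; the ‘voter’ frame is hypothetical) of simple random and systematic sampling:

```python
import random

# Hypothetical sampling frame: a list of 1,000 registered voters
frame = [f"voter_{i}" for i in range(1, 1001)]

# i. Simple random sampling: every element has an equal chance of inclusion
simple_sample = random.sample(frame, k=100)

# ii. Systematic sampling: interval k = N / n, random start between 0 and k-1
k = len(frame) // 100             # sampling interval (here, 10)
start = random.randint(0, k - 1)  # random starting point
systematic_sample = frame[start::k]

print(len(simple_sample), len(systematic_sample))
```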
Sample: representativeness – sample selection techniques
Probability (random) sampling (Hiberts et al.):

iii. Stratified – population divided into mutually exclusive groups (strata), from
which random or systematic samples are selected [e.g., U of T students by campus]
iv. Proportional stratified (sample strata, proportional to their pop. sizes)
[In this case, one creates a stratified sampling frame, determines strata size
proportional to pop. strata sizes, then selects random sample. Here, it is important
to know pop. proportions]
v. Disproportional stratified
[Same as above, except sample proportions different from population ones-
e.g., to get equal representation of an underpopulated province; used to compare
groups. NB. To reconstruct pop. proportions and make inferences, weights – i.e.,
compensatory mathematical corrections - must be assigned within the dataset]
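A small Python sketch of proportional stratified sampling (the campus population sizes below are made-up placeholders, not real figures):

```python
import random

# Hypothetical strata: U of T students by campus, with made-up sizes
strata = {"UTM": 16_000, "UTSC": 14_000, "STG": 70_000}
total = sum(strata.values())
n = 500  # desired total sample size

# Each stratum's share of the sample mirrors its share of the population
for campus, size in strata.items():
    stratum_frame = [f"{campus}_student_{i}" for i in range(size)]
    stratum_n = round(n * size / total)
    stratum_sample = random.sample(stratum_frame, k=stratum_n)
    print(campus, stratum_n)
```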
Sample: representativeness – sample selection techniques

vi. Cluster sampling – random selection of elements’ clusters (instead of


elements). This is used when the target population is spread out (to reduce cost), or
when there is no access to all elements of a sampling frame; as clusters may be of
unequal size, probabilities proportional to size (PPS) may be used for
sampling

One-stage sampling [random selection of clusters, then include all


cluster elements] and
Two-stage cluster sampling [random selection both of clusters and
elements within them]
Sample: representativeness – sample selection techniques
Non-probability (non-random) sampling (Hiberts et al.):

i. Convenience (or, accidental) - sampling based on availability [e.g.,
undergraduate students in a university study; good for pilot studies; used in
experimental research]; prone to bias
ii. Self-selection – sampling based on responding volunteers; self-selection bias

iii. Purposive (or, judgment) – selection of specific cases that provide maximum
information needed for study while ensuring some diversity [used for focus
groups]
Sample: representativeness – sample selection techniques
Non-probability (non-random) sampling (Hiberts et al.):

iv. Snowball (or, network / chain referral)- identification of initial cases that can
refer new ones so that the sample branches (or, snowballs) out [used for
hidden populations with no apparent sample, like drug-users, or political
dissidents in hiding]

[Figure: snowball / network sampling, from Volz and Heckathorn]
Sample: representativeness – sample selection techniques
Non-probability (non-random) sampling (Hiberts et al.):

v. Quota – identification of targeted strata and setting of a quota of sample to


be met [e.g., ‘need 50 students from UTM, 50 from STG, 50 from UTSC’;
a combination of convenience or purposive sampling with stratification]
Sample: representativeness – sample size
3. Sample size – the number of cases included in the full sample

Size appropriateness depends on these factors

1. Homogeneity – how similar a population is (the more, the smaller sample


needed)
2. Complexity of the study - number of variables studied
3. Degree of accuracy aspired to (confidence levels)
4. Sampling method used
5. Statistical test aspired to (and significance levels)
Sample bias
Types of sample bias in survey research:

• Coverage – some groups systematically excluded (e.g., homeless, no internet access)


• Non-response – respondents do not participate
Studies of samples are characterized by the response rate (% of people in the
sample who participate in the study).
For acceptable sample representation of the population, a >60% response rate is needed
(although the sample can still be biased if there is a missed systematic difference in the
40% who do not participate); the higher the better (>70% very good, >85% excellent)
• Sample selection – researcher intentionally selects or avoids parts of sample
• Sample attrition – diminishing original number of individuals interviewed over time
Sample bias
Virtual sampling problems

• Multiple email addresses per individual


• One computer, many users
• Sampling frames expensive
• Internet users demographically biased sample
Back to our original premise...
To explain phenomena and construct theories and hypotheses, Political
Scientists need to collect information, or, data.
(Concepts are operationalized via variables, which are then measured to
provide data for analysis)

Primary, secondary data and datasets

Datasets: rows vs. columns,


with columns containing variables, and
rows containing observations
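A tiny illustrative Python sketch (with hypothetical placeholder values) of this rows-vs-columns structure:

```python
# Columns are variables; rows are observations
# (country names and values are placeholders, not real data)
columns = ["country", "gdp_growth", "democracy_score"]
rows = [
    ["Country A", 2.1, 8],
    ["Country B", 0.4, 5],
    ["Country C", 3.7, 9],
]

# Each row is one observation; each position within a row is one variable
for row in rows:
    print(dict(zip(columns, row)))
```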
Measurement metrics
Once a sample is collected, before testing variables for causality, important to get a
sense of one’s data – specifically about the values for each variable.

This is done by examining


• Level – or, type - of measurement of each variable
• Descriptive statistics (those that describe the characteristics of a sample of a
population)

More broadly, we speak of a measurement metric (of a variable): the type of values
the variable takes on.
It consists of (a) label or name, and (b) the values we have measured of it.
Types of variables
In order of precision, the types (or, levels of measurement) of variables include:

1. Nominal (or, Categorical): This type of variable is composed of categories that bear
no relationship to one another except that they are different (Bryman et al.)

2. Ordinal: With this type of variable, its categories can be rank ordered – i.e., they
can indicate if observations have more or less of a particular attribute (‘greater
than’, ‘less than’, etc.).

3. Interval / Ratio: This type of variable is the most precise of all types. While
Nominal / Categorical data only indicate difference, and Ordinal ones indicate
order (but not distance), Interval / Ratio ones provide both. Moreover, there exist
units of measurement, and the distances or intervals between categories are
separated by a standard unit.
Representations of data
Datasets can contain a lot of information that provide quick, broad overviews.
This can be represented in a variety of ways via graphs and tables.

Graphs are pictorial representations


For discrete variables (Nominal / Categorical, Ordinal), where categories are
separate and distinct, the visualizations used are Pie, Bar and Line charts

For continuous variables ( Interval / Ratio), the visualizations used are Histograms,
Box plots and scatter plots
Representations of data
Tables are useful for displaying summary statistics (Measures of Central Tendency,
and of Dispersion)

Tables are the most common way to display results in the Social Sciences


Common items in tables include coefficients, standard errors, coefficient intervals,
etc.
As a general rule, tables should contain all the information necessary to make them
clear without any reference to the surrounding text
Descriptive statistics
So, we have our dataset.
Prior to examining causality between variables, a broad view of a dataset can reveal
useful information that can describe a single variable.

For example, we have a set of grades in a class. Before we investigate if they are
correlated (and more) – e.g., with time of study per week - we can learn useful
information about that variable from the dataset: for example,

What is the variation, or, spread of the data in the dataset? (e.g., from 57 to 89)

Where, in that dataset, is a single case situated? (is 89 a typical grade or an
outlier?)
Descriptive statistics
More formally, descriptive statistics are useful summaries of the variation for
individual continuous variables that describe the characteristics of a sample, or, a
population.
Include rank statistics (median) and statistical moments (mean, standard deviation)

Broadly, characteristics of variables grouped by way of:

• Distribution – set of all possible values and frequencies associated with these values
• Central tendency – which is the most typical value?
• Dispersion – how much do the values spread out?
Distribution
Frequency distribution: indicates the number of cases in each category of the
variable

Displays frequency with which each possible value occurs

Can be visualized through pie chart (nominal variable), bar chart (nominal or
ordinal variable), line chart (nominal or ordinal variable) or histogram
(continuous variable)
Central Tendency Measures
Mean: the average value of an observation – a common approach used on a daily
basis (e.g., comparing university grades, prices, salaries, commute times, GDP, etc.)

How do we calculate the mean?


Add up all the values in a sample for a variable and divide by the total number of
observations it has data values for

$\bar{X}$ (pronounced "x bar") $= \dfrac{\sum_{i=1}^{n} x_i}{n}$

$\Sigma$: sum (total) from 1 to n
$x_i$: observations - the value of each individual case in a variable, e.g., $x_1, x_2, x_3, \dots, x_n$
$n$: sample size
Central Tendency Measures
Example: set of grades in a seminar class of 17 students:
{74, 70, 57, 60, 78, 67, 75, 67, 83, 71, 72, 89, 75, 73, 78, 81, 63}

For n: 17

$\bar{X} = \dfrac{74 + 70 + 57 + 60 + 78 + 67 + 75 + 67 + 83 + 71 + 72 + 89 + 75 + 73 + 78 + 81 + 63}{17} \approx 72.53$

If in a dataset one does not have all the values, the n is the number of observations available.
For example, if there are 100 observations but only 82 are reported, in mean calculations the entries to
be considered are 82 (not 100).
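The same calculation in a short Python sketch (standard library only), including the case where some observations are missing:

```python
import statistics

grades = [74, 70, 57, 60, 78, 67, 75, 67, 83, 71, 72, 89, 75, 73, 78, 81, 63]

# Mean = sum of values divided by the number of observations with data
mean = sum(grades) / len(grades)
print(round(mean, 2))                     # 72.53
print(round(statistics.mean(grades), 2))  # same result via the stdlib

# With missing entries (None), n is the number of observations actually reported
with_missing = grades + [None, None]
reported = [g for g in with_missing if g is not None]
print(round(sum(reported) / len(reported), 2))  # still 72.53
```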
Central Tendency Measures
But mean values can be biased – they can be made larger or smaller based on a few
outlier observations
Example:
Observation Annual income in $
1 30,000
2 33,000
3 36,000
4 51,000
5 42,000
6 620,000*
n = 6 Mean ≈ 135,333 $

* Outlier: a case that differs significantly from the others


Central Tendency Measures
For a descriptor that is not biased by extremely large or small values, the median also
needs to be examined.
The median is the value in the middle of your dataset; a value where 50% of
observations in a dataset are found above it, and 50% below it

To locate the median, order cases from smallest to largest and identify the middle
observation

In our seminar grades dataset


{ 57, 60, 63, 67, 67, 70, 71, 72, 73, 74, 75, 75, 78, 78, 81, 83, 89 }
the median is: 73
Central Tendency Measures
In our annual income example:

The dataset is {30,000 33,000 36,000 42,000 51,000 620,000}

And the median is: 39,000$

NB. Unlike the mean / average (≈135,333$), the extreme value (observation #6) does
not affect the median
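A brief Python sketch (standard library only) reproducing these median calculations and showing how the outlier pulls the mean but not the median:

```python
import statistics

incomes = [30_000, 33_000, 36_000, 42_000, 51_000, 620_000]

# Median: order the cases and take the middle value (or the average of the
# two middle values when n is even) - the outlier barely matters
print(statistics.median(incomes))  # 39000.0
print(statistics.mean(incomes))    # ~135333.33, pulled up by the outlier

grades = [57, 60, 63, 67, 67, 70, 71, 72, 73, 74, 75, 75, 78, 78, 81, 83, 89]
print(statistics.median(grades))   # 73
```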
Central Tendency Measures
When is it appropriate to use one or the other? Two considerations

• Level of measurement
Mean assumes values are ordered and have consistent distance between them
Median only assumes that values can be ordered (so, it is good for ordinal
variables)
• Level of how skewed the data are (i.e., if one has extreme observations) that can
yield a biased picture

E.g., the salary of all actors in a film: this is an Interval / Ratio type of variable; as there
may be extremes, from stars to extras, it is best to use the median (to avoid bias)
Central Tendency Measures
To obtain an idea of where a distribution(s) of values peaks (esp. for Categorical
variables) the mode is used. It represents the most common / frequently occurring
value for a variable. It can be found by counting the number of cases in each category,
and determining which category is most frequent
Example (Stats Can): points scored by a player during a 10-game hockey tournament
{7, 5, 0, 7, 8, 5, 5, 4, 1, 5 }
Points scored Number of games
0 1
1 1
4 1
5 4
7 2
8 1

Frequency table for points scored. 5 is the most frequent value (scored in 4 games)
and is this dataset’s mode.
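A short Python sketch reproducing the frequency table and the mode for the hockey example:

```python
from collections import Counter
import statistics

points = [7, 5, 0, 7, 8, 5, 5, 4, 1, 5]

# Frequency table: how many games each point total occurred in
frequencies = Counter(points)
for value, count in sorted(frequencies.items()):
    print(value, count)

# Mode: the most frequently occurring value
print("Mode:", statistics.mode(points))  # 5
```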
Measures of Dispersion
Dispersion refers to the spread of the data.
It can help one understand how typical the measure of the central tendency is
(Pamphillis)

For nominal / categorical variables, the measure often used is the variation ratio –
the percentage of cases that are not the mode.
Smaller value → less variation (i.e., the mode represents the distribution well);
Larger one → more variation (mode doesn’t represent it well)

It is a simple statistic (as well), not often used by Social Scientists.


Measures of Dispersion
For ordinal variables, the measure often used is the range: that is the difference
between the largest and smallest observed values for an ordinal variable
(maximum value – minimum value)

It is an important consideration as one seeks to generalize (e.g., more difficult if a


range is small).

Also a problem when there are outliers (e.g., in our annual income example, range
is from 30,000$ to 620,000$).

Hence, better to use the…


Measures of Dispersion

Interquartile range (IQR): the broader picture of dispersion around the median-or,
range between the 25% and 75% percentile of cases (that way, not influenced by
outliers).
Quartiles: points that divide the data into four equal parts, based on number of
observations (not on the possible values of a variable). Similarly, deciles divide into ten, etc.
Q1, Q2, Q3: the three cut points dividing the ordered data into four parts of 25% each

Interquartile range: IQR = Q3 – Q1

[Figure: interquartile range on a normal distribution (a distribution with a symmetrical bell shape)]

This indicates whether the middle part of the data in a dataset are close together or
not.
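A short Python sketch computing quartiles and the IQR for the seminar grades (note that exact quartile cut points can differ slightly across software conventions):

```python
import statistics

grades = [57, 60, 63, 67, 67, 70, 71, 72, 73, 74, 75, 75, 78, 78, 81, 83, 89]

# Quartiles: three cut points dividing the ordered data into four equal parts
q1, q2, q3 = statistics.quantiles(grades, n=4)

# Interquartile range: spread of the middle 50% of the data,
# not influenced by extreme values at either end
iqr = q3 - q1
print(q1, q2, q3, iqr)  # roughly 67, 73, 78 -> IQR = 11
```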
Measures of Dispersion
Standard deviation (s) of a sample
For Interval / Ratio data, this is the best measure that indicates how far, on average,
an observation is from the mean (or, the average amount that each observation
differs from the mean).
Its value depends on how tightly the scores are ‘clustered’ around the mean
(more clustered → smaller s; wider dispersion, larger s)

$s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n-1}}$

$\bar{x}$ = mean
$x_i$ = observations
(squaring and taking the root eliminates negatives)
$x_i - \bar{x}$ tells us how far an observation is from the mean
Measures of Dispersion
Standard deviation (s)
Example: a set of five grades {70, 75, 78, 82, 85}

$\bar{x} = \dfrac{70 + 75 + 78 + 82 + 85}{5} = 78$

$s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n}} = \sqrt{\dfrac{(70-78)^2 + (75-78)^2 + (78-78)^2 + (82-78)^2 + (85-78)^2}{5}} \approx 5.2$

NB. Standard deviation formula sometimes seen as over n, instead of n-1. When we
have the actual mean (as in this case) we use the population standard deviation and
divide by n. When we have an estimate of the mean based on averaging the data
(when we do not have all the data), then we divide by n-1 (1 degree of freedom).
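A short Python sketch of both versions of the calculation (dividing by n, as in the example above, and by n − 1):

```python
import math
import statistics

grades = [70, 75, 78, 82, 85]
mean = sum(grades) / len(grades)  # 78.0

squared_deviations = [(x - mean) ** 2 for x in grades]

# Population standard deviation: divide by n (as in the slide's example)
s_pop = math.sqrt(sum(squared_deviations) / len(grades))

# Sample standard deviation: divide by n - 1 (one degree of freedom)
s_sample = math.sqrt(sum(squared_deviations) / (len(grades) - 1))

print(round(s_pop, 2), round(s_sample, 2))  # ~5.25 (the slide rounds to 5.2) and ~5.87
print(round(statistics.pstdev(grades), 2),  # same results from the
      round(statistics.stdev(grades), 2))   # standard library
```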
Standard deviation

+/- 1 s: 78 - 5.2 = 72.8 and 78 + 5.2 = 83.2
+/- 2 s: 78 - 10.4 = 67.6 and 78 + 10.4 = 88.4

[Figure: number line marking 67.6, 72.8, 78, 83.2, 88.4]

Useful for knowing how far an individual case is from the mean
Standard deviation will be helpful when we discuss probability
Measures of Dispersion
Variance also indicates the spread of the data around the mean. It is the square of the
standard deviation.
$\text{variance} = s^2 = \dfrac{\sum (x_i - \bar{x})^2}{n-1}$

Variance can be expressed in squared units or as a percentage. For data, the metric of
standard deviation is used

Distributions with larger standard deviations have more variance away from the
mean, broadening and flattening the distribution’s curve
Measures of Dispersion
Spread can be very different even for distributions with identical measures of center

Tall and narrow distribution curve shape = small standard deviation


Short and wide distribution curve shape = large standard deviation
z-score

Finally, on the subject of how far a value is from the mean…

When you see a z-score, it shows us how far away an observation is from the mean in
standardized units.
It allows for standardized comparisons between groups, can be positive or negative.
Positive z-score indicates above the mean; negative indicates below

Standardized score (or, z-score) for an individual observation

zᵢ = (xᵢ − x̄) / s

Numerator: how far an observation is from the mean
Denominator: converts that distance into standard deviations
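As an illustration (Python assumed; reusing the grades example above, with mean 78 and s ≈ 5.25), a z-score is just the deviation from the mean expressed in standard deviations:

mean, s = 78, 5.25   # values from the grades example above

def z_score(x, mean, s):
    """Standardized distance of observation x from the mean, in units of s."""
    return (x - mean) / s

print(round(z_score(85, mean, s), 2))   # ≈ +1.33, above the mean
print(round(z_score(70, mean, s), 2))   # ≈ -1.52, below the mean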
Central Tendency and Dispersion Measures
Both Central tendency and Dispersion measures connected (as we see in the formula
for mean and standard deviation) and useful.
Example: in examining thirty-year span annual growth rates in provincial economies
(Goldberg and Levi), used both mean and standard deviation values to analyze their
data for a more accurate view of the topic.
Mean scores indicated provinces differed significantly in annual growth rates.
Standard deviation scores (how far each score lies from the mean) revealed more
variability between provinces, with AB and SK having larger s scores than ON and QC.
PROVINCE (selection) MEAN GROWTH RATE STANDARD DEVIATION OF GROWTH

ALBERTA 5.25 6.39

SASKATCHEWAN 3.37 8.24

ONTARIO 4.09 3.30

QUEBEC 3.67 2.65

… … …

CANADA (total) 4.06 2.55


Summary
Central tendency and Dispersion measures

Level of measurement | Central tendency | Dispersion | Use
Nominal / Categorical | Mode | Variation Ratio | Only for Nominal data; not very informative
Ordinal | Median (mode) | Range; IQR when the distribution contains extreme values | Can also be employed for Interval/Ratio variables
Interval / Ratio | Mean (median) | Standard deviation | Most reliable and precise information

In red, key characteristics of any distribution

From Berdahl et al.


Part I, Data II: main points

Sampling: populations are studied to find their characteristics. When not possible to
include all population, research focuses on a sample – the set of observations that a
dataset does contain. Sampling is the process of selecting a number of cases from a larger
population for study. Representativeness of a sample is crucial, and depends on (a)
accuracy of sampling frame, (b) sample selection method, and (c) sample size

Tables, graphs and figures visualize data.


Descriptive statistics: they describe the characteristics of a sample or a population
Characteristics of variables are grouped by way of distribution (set of all typical values and
frequencies associated with these values), central tendency (which value is the most
typical), and dispersion (how much do the values spread out)

The appropriate method for summarizing data depends upon the level of measurement
Part I, Data II: Glossary

Population (or, universe): all possible observations a dataset could contain

Parameter: characteristics measured for each individual member of a population

Sample: set of observations that a dataset contains (a portion of a population); can be


random, or non-random

Generalization: attempting to extend the results of a sample to a population

Sampling error: a difference between the characteristics of a sample and those of a


population from which it was selected
Sample biases include coverage, non-response, sample selection and sample attrition

Central Limit Theorem: the sum of random variables follows a normal distribution
(a symmetrical bell shape)
Part I, Data II: Glossary
Sample representativeness: depends on accuracy, sample selection method, sample size

Types of variables: nominal (or, categorical); ordinal; interval/ratio

Descriptive statistics: useful summaries of the variation for individual continuous


variables that describe the characteristics of a sample or a population; the exploration of
observed data. Contrasted to...
Inferential statistics: the use of observed data to predict what is true of areas beyond the
data

Quartiles: separation of the data into four equal-sized groups


Part I, Data II: Glossary
Frequency distribution: number of cases in each category of the variable

Mean: the average value of an observation (sum of values in a sample divided by total
number of observations)
Median: value in the middle of a dataset
Mode: most common, frequently occurring value for a variable

Interquartile range (IQR): the range between the 25% and 75% percentile of cases;
indicates whether the middle part of the data in a dataset are close together or not.
Standard deviation: how much variation there is within a group of values. It measures the
deviation (difference) from the group’s mean (average)
Variance: also indicates the spread of the data around the mean. It is the square of the
standard deviation
Z-score: shows how far away an observation is from the mean in standardized units
POL244H
Research Methods for Political Science II

Winter 2024
Wednesdays 1-3pm @ MN1170

Part I
Analysis I: inferential statistics, uni- and bivariate analysis
Week 1 Jan. 10 Introduction and course details
Week 2 Jan. 17 Data I: Research design, experiments, interviews and questionnaires
Week 3 Jan. 24 Data II: Sampling, size and distributions

Week 4 Jan. 31 Analysis I: Inferential statistics, univariate, bivariate analysis


Week 5 Feb. 7 Analysis II: Regression Statistics
Week 6 Feb. 14 Analysis III: Regression (cont.); big data, machine learning, Networks
Reading week NO CLASS

Week 7 Feb. 28 Mid-term test


Single and small-n cases
Week 8 Mar. 6 Data I: Ethnography
Week 9 Mar. 12 Data II: Interviews, archives, texts/documents

Week 10 Mar. 19 Analysis I: Process tracing, content analysis

Week 11 Mar. 26 Analysis II: Comparative study

Week 12 Apr. 3 Indigenous methods, Research ethics, conclusions


Today’s schedule

Announcements
Inferential Statistics
Assignments

Assignment 1 (5%)
Due: tomorrow, February 1
by 11:59pm EST (Quercus)
Dear Humanities and Social Sciences Students at UTM,
Ready to have your voice heard? We need your input!
Have you used digital tools in your courses? Have you created games? Used Omeka or ArcGIS or
StoryMaps? Analyzed big data sets like historical newspapers? Produced podcasts? Scraped data from social
media? Or experimented with other emerging digital methods for humanities & social science research?
We invite you to join us for an engaging Town Hall discussion centered on your experiences with digital
tools and methods in your classes and research. Your insights are crucial in helping us understand how
faculty at UTM can better support you.
Two one-hour sessions are scheduled on Thurs, Feb 8, and Mon, Feb 12, from 4-5 PM on Zoom. Each
session will be an open forum for sharing experiences, discussing challenges, and exchanging ideas. Your
active participation will directly shape the resources and support provided to humanities and social
sciences students. Let's collaborate to enrich your research journey together!
Thurs Feb 8, 4:00–5:00 pm https://utoronto.zoom.us/j/83705592719
Mon Feb 12, 4:00–5:00 pm https://utoronto.zoom.us/j/83827679021
Want to share your ideas now? Take our quick and fun poll! Join us at one of our Town Halls where we will
reveal the results, and together we can turn your opinions into action. Click this link for our Pre-Event Poll!
BONUS! Attend one of our Town Halls and you could win $50 in UTM Gift Dollars! The lucky winner will
see the funds added to their TCard shortly after the random drawing. Don’t miss out on the chance to
participate and potentially boost your wallet! Spread the word and see you there!
Elspeth Brown, Director of CDHI and co-chair of the UTM Digital Scholarship Working Group
Paula Hannaford, Acting Chief Librarian, UTM and co-chair of the UTM Digital Scholarship Working Group
Sampling
First steps of survey research

1. Selecting a population of interest


2. Drawing a sample from this population
3. Devising a number of questions to measure concepts of interest
4. Survey made available to the research subjects
5. Data are collected, cleaned and tabulated
6. Data are analyzed via descriptive and inferential statistics
From samples to populations: statistical inference

We have already discussed the concepts of population and sample data, and described
a sample.
As it is very difficult to obtain data on entire populations, we rely on samples.
If randomly selected, a sample can help us generalize about the whole population via
statistical inference (that is why we discussed random and non-random sampling at
length).
Through this process, from what we know to be true about a randomly selected,
representative sample, we can probabilistically infer what is likely to be true about the
population, or, project from observed cases (sample) to the whole population
(K&W; Besco). This is about being able to generalize.

Statistical inference is based on probability theory, to which we now turn.


Probability

Probability is the study of events and outcomes involving an element of uncertainty.


It is about the chance that an event occurs, or, the chance that an event did occur.

Examples from our daily lives: a winning lottery ticket, a successful penalty kick in
soccer, a snowy day, tails in a flipping coin, etc.

As we will see, probability provides the link between a sample and a population, via
exploring how common a finding is.
It addresses the questions whether a statistic in a sample (e.g., mean, standard
deviation) is the same as the whole population, and how similar or dissimilar the
sample and the population are (Besco).

First, a look at probability


Probability

How do we determine/calculate probability?


Number of ways an event can occur relative to the number of possible outcomes
Probability, P = f / N

f = number of ways an outcome can occur
N = total number of possible outcomes

For example, in flipping a coin, what is the P(Heads) = f (h) / N?


f (h) : number of results of a Heads outcome
N: total number of coin flip outcomes
Probability

Characteristics

Ranges from 0.00 (lowest) to 1.00 (highest)

Probability of 0 implies absolutely no occurrence


Probability of 1 implies occurrence with 100% certainty

Often expressed as a percentage.


Probability

Example (modified from Pamphilis)

A bowl with 3 blue and 7 red marbles.


Close eyes and pick one

What is the probability that on our first draw we will pick a blue marble?

P(blue)= f / N = 3 (possible blue) / 10 (possible marbles)

P(blue)= 3 / 10 = 0.30 (or, 30%)


Probability

The example of a frequency table also is one associated with probability

Participants' ages | Age group (grouped) | Frequency (f) | Cumulative frequency | Percentage | Cumulative percentage
– | 34 and younger | 0 | 0 | 0 | 0
36 | 35-44 | 1 | 1 | 10 (0.10) | 10
48, 54 | 45-54 | 2 | 3 | 20 (0.20) | 30
57, 63 | 55-64 | 2 | 5 | 20 (0.20) | 50
66, 67 | 65-74 | 2 | 7 | 20 (0.20) | 70
76, 80 | 75-84 | 2 | 9 | 20 (0.20) | 90
92 | 85 and older | 1 | 10 | 10 (0.10) | 100

N (total number of observations) = 10. The frequency column shows how many observations fall into each outcome; the percentage column shows the proportion of observations in each category (akin to Probability, P = f/N)
Probability

Example
2 Coin flips
Possible outcomes?
Outcome Frequency Probability, P
Heads, Heads 1 0.25 (1/4)
Heads, Tails 2 0.50 (2/4)

Tails, Tails 1 0.25 (1/4)


Total 4 1.00

What is the most common outcome? Heads & Tails (or Tails & Heads), 50%
What is the probability of 3 Heads? 0%
If the coin is fair, then the more times we repeat this two-coin flip, the closer we will
approximate a 50-50 chance of a Heads and Tails outcome.
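A minimal simulation sketch (not part of the slides; Python assumed) illustrates this long-run logic: repeating the two-coin flip many times pushes the empirical frequencies toward the theoretical 0.25 / 0.50 / 0.25.

import random
from collections import Counter

random.seed(244)                 # arbitrary seed, so the run is reproducible
trials = 100_000

outcomes = Counter()
for _ in range(trials):
    heads = sum(random.choice((0, 1)) for _ in range(2))   # number of heads in two flips
    outcomes[heads] += 1

for heads, label in [(2, "Heads, Heads"), (1, "one Heads, one Tails"), (0, "Tails, Tails")]:
    print(label, round(outcomes[heads] / trials, 3))        # approximately 0.25, 0.50, 0.25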
Probability: how many are there?

Probability of an event (a) occurring, P(a) -- also called Marginal Probability

How about the probability of outcomes in relation to others?

Joint (or, Unconditional) Probability, P(A ∩ B) or P( A & B) -- the chance of two events
(a) and (b) occurring together.
It is also called the Probability of the intersection of (a) and (b);
It is termed ‘unconditional’ because it does not depend on order or sequence
Probability: how many are there?

Further,

Conditional Probability P(A|B) -- the chance that event (a) occurs, given that
event (b) has taken place. In this case, sequence matters and the occurrence of
(b) may alter the probability of (a) happening

E.g., probability of rain, given a cold front emerging west of Toronto; probability
of COVID-19 infection, given one’s exposure; probability of entrance into an Ivy
League U.S. university given a particular socio-economic level

P(Ai | B) = P(Ai & B) / P(B) = [ P(B | Ai) × P(Ai) ] / P(B)   (Bayes' Rule)
(after Thomas Bayes, 1702-1761, an 18th c. statistician and Presbyterian clergyman)
Probability: what is it good for?

Probability has real usefulness in Statistical Sciences, as it allows for


understanding the likelihood to see what we see in a sample, based on certain beliefs
about the population-i.e., how common our findings are.

When we speak of an 80% probability of snow, we mean that in a long series of days
with similar conditions, snow falls on 80% of the days.

This long-run approach is important, because another definition of probability -- for


a particular possible outcome for a random phenomenon -- is the proportion of times
that the outcome would occur in a very long sequence of observations.

Put differently, and for our purposes, with a random sample, the probability that an
observation has a particular outcome is the proportion of times that outcome would
occur in a long sequence of like observations (Agresti)
Distributions of observations

Normal distribution
Resembles a bell shape, with a single central peak; it is unimodal and symmetrical (here, mode, median and mean are the same). A normal distribution (denoted N) is characterized by its mean, μ, and standard deviation, σ*. The same goes for normally distributed variables.

Kellstedt and Whitten


*A note on notation: in general, for the whole population, Greek letters are used; whereas for a sample, Latin.
Probability distributions
The probability within any particular number of standard deviations of μ is the
same for all normal distributions.

In other words, the normal distribution has a predictable area under the curve
within specific distances from the mean (K&W).

This probability equals 68% within one standard deviation, 95% within two
standard deviations and 99% within three standard deviations.
The more spread out the distribution, the larger the σ
Probability distributions
In other words, if a probability distribution is a normal bell-shaped one, about 68% of that probability falls between μ − σ and μ + σ, about 95% between μ − 2σ and μ + 2σ, and 99% between μ − 3σ and μ + 3σ. This is called the empirical rule (or, the 68-95-99 rule)

A particular normal distribution called the standard normal distribution, used in inferential statistics, has μ = 0 and σ = 1
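The empirical rule can be checked with a quick simulation; the sketch below (Python with numpy assumed, not part of the slides) draws a large sample from a standard normal distribution and counts how much of it lies within 1, 2 and 3 standard deviations of the mean.

import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=0.0, scale=1.0, size=1_000_000)   # standard normal: mu = 0, sigma = 1

for k in (1, 2, 3):
    share = np.mean(np.abs(y) <= k)                  # proportion within k standard deviations
    print(k, round(share, 3))                        # roughly 0.683, 0.954, 0.997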
Probability distributions

To illustrate more, let us go back to the topic of random variables (they are often
called ‘random’ to highlight the random variation behind the outcome varying from
observation to observation; this can be summarized by probabilities - Agresti)

Variables can be discrete (0, 1, 2, 3…) - i.e., separate values-or, continuous (if possible
outcomes are an infinite continuum - e.g., all real numbers between 0 and 1)

A probability distribution lists the possible outcomes and their probabilities


Why is this useful to know? So that its moments (e.g., mean) can be calculated, and
so that we can see whether a given observation is likely, unlikely, or very unlikely
(thus, considered an outlier)
Probability distributions

Example (Agresti)
Survey question: What is the ideal number of children for a family?
Discrete, as it takes numbers 0, 1, 2, 3…
For a randomly chosen person, probability distribution of ideal number of children for
a family (y) is shown via a table and a histogram

y P(y)
0 0.01
1 0.03
2 0.60
3 0.23
4 0.12
5 0.01
Total 1.00
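The mean (expected value) of such a discrete distribution is the probability-weighted sum of its outcomes; here is a minimal Python sketch for the table above (not part of the slides).

# Ideal number of children (y) and its probability P(y), from the table above
distribution = {0: 0.01, 1: 0.03, 2: 0.60, 3: 0.23, 4: 0.12, 5: 0.01}

expected_value = sum(y * p for y, p in distribution.items())
print(round(expected_value, 2))    # 0*0.01 + 1*0.03 + 2*0.60 + 3*0.23 + 4*0.12 + 5*0.01 = 2.45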
Probability distributions

A probability distribution has parameters, much like a population distribution


(as we saw, mean for center, and standard deviation for variability).

More importantly, these parameter values are the values these measures would
assume in the long run if the random sample took observations on the variable y
having that probability distribution (Agresti)

E.g., in the ‘ideal number of children’ example over the long run we expect y=0 to
occur 1% of the time, y=1 to occur 3% of the time, y=2 to occur 60% of the time, y=3
to occur 23% of the time, etc. In 100 observations, we expect: One 0, three 1’s, sixty
2’s, twenty-three 3’s, twelve 4’s and one 5.
Statistical inference

We can make inferences about a population based on observed data

Important for statistical inference:

As the number of observations in a probability (random) sample increases,


the sample mean approaches the true population mean
(this is also called the Law of large numbers)

(remember, Mean: the average value of an observation )


Statistical inference

This brings us back to (possibly the most important theorem in Statistics)...


Central Limit theorem: as the size of a random sample increases, sample means drawn
from any population approach a normal distribution

Remember its special properties:


Mode, median, mean are the same
Predictable area under the curve
within specific distances of the mean

In other words, if repeated random samples are drawn from a population, the sampling
distribution of the sample estimate will approach normality (Halperin & Heath)
Central limit theorem
Example (Kellstedt and Whitten)
A distribution of actual scores in a sample (what we call a frequency distribution)
represents the frequency of each value of a particular variable.

Example: roll a six-sided die 600 times

Frequency distribution (more or less equal chance


of each of the six numbers resulting). In K &W
Central limit theorem
Adding up all the scores → Mean: Ȳ = 3.47 (slightly lower than the expected value, 3.5, which we would get with exactly 100 rolls of each of the six sides)
Standard deviation: s = 1.71

If we rolled the die 600 times an infinite number of times (i.e., took such a sample an infinite number of times), the mean of those sample means would be exactly 3.5* and their standard deviation 0.07

μ = 3.5, σ = 0.07

*Another way of looking at this, is that with enough samples from a population, the means
will be arranged into a distribution around the true population mean and will approximate
a normal distribution. The larger the sample (not the population) the more accurate it is.
Central limit theorem
As we take more samples, especially large ones, our graph of the sample means
will look more like a normal distribution.

According to the CLT, the average of our sample means will be the population
mean.

Put differently, if we add up the means from all our samples, and we calculate
the average, that average will be our actual population mean.

Similarly, if we calculate the average of all the standard deviations in our sample,
we will find the actual standard deviation for our population (Kotz et al.)
Sampling distribution and standard error
This hypothetical distribution of sample means is called a sampling
distribution
The mean of sampling distribution would be equal to the true population
mean, and,
Sampling distribution would be normally shaped
Standard deviation of the sampling distribution, σ_Ȳ, is

σ_Ȳ = s_Y / √n
𝑛: sample size

The standard deviation of this distribution is called the standard error.
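A short simulation sketch (Python with numpy assumed, not part of the slides) makes the sampling distribution and its standard error tangible for the die example: draw many samples of 600 rolls and store each sample mean.

import numpy as np

rng = np.random.default_rng(42)
n, repetitions = 600, 10_000

# Each repetition: roll a fair six-sided die n times and record the sample mean
sample_means = [rng.integers(1, 7, size=n).mean() for _ in range(repetitions)]

print(round(np.mean(sample_means), 2))   # ≈ 3.5, the true population mean
print(round(np.std(sample_means), 3))    # ≈ 0.07, the standard error (s / √n)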


Sampling distribution and standard error
Standard error: level of accuracy of the mean of any given sample from that
population compared to the true population mean.

When it increases (i.e., means are more spread out), it becomes more likely
that any given mean is an inaccurate representation of the true population
mean (error between sample and population)

Therefore, the standard error is an important metric that helps us decide


whether the sample we have drawn is one of the samples that is close to the
true population value or not (H&H)
In other words, whether we can safely generalize from our sample or not
Central limit theorem
Back to our example:
Difference between true mean ( = 3.5) and mean from our sample (Yത = 3.47)
suggests that the two are somewhat different.

Knowledge that sampling distribution is shaped normally, and,


Use of the empirical rule
allows for the estimation of likely location of the population mean within a
confidence interval
Central limit theorem
Usually, the 95% confidence level used
Hence, in our dice example,
the sample mean, Ȳ ± 2σ_Ȳ (because 95% of the data fall within 2σ_Ȳ) = 3.47 ± 0.14

Therefore, we can state (with 95% confidence) that the population mean for our rolls
of die is within 3.33 (3.47-0.14) and 3.61 (3.47+0.14)
Inference and polling

Inference is also used in polling, or a survey based on a sample of the population

A large, representative sample will look like the population (Central Limit theorem)
A poll with a margin of error of (say) ± 2 % follows the same logic of 95%
confidence interval.
This means, that if we conducted 100 different polls on samples from the same
population, we would expect the answers from 95 of these polls to be within 2%
points in one or other direction of the true value in the population
In polls, the important sample statistic is a % not a mean
A margin of error (confidence interval) indicates by how many percentage points a poll’s results may differ from the real population value. In the above example, our statistic will be within 2 percentage points of the real population value (a range 4 points wide) 95% of the time.
Inference and polls

The standard error is associated with levels of accuracy. It indicates how much
dispersion to expect from sample to sample, or from poll to poll (dispersion of
the sample means)

Formula for the standard error of a percentage / proportion
(it is a bit different from SE = s / √n )

Standard error for a percentage = √[ p (1 − p) / n ]

p = proportion of respondents of an opinion


(1 – p) = proportion of respondents of a different opinion
n = number of respondents in the sample
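A minimal sketch of this formula (Python assumed, not part of the slides), using numbers that anticipate the polling example on the next slides (p = 0.52, n = 1000):

p, n = 0.52, 1000

se = (p * (1 - p) / n) ** 0.5
print(round(se, 3))          # ≈ 0.016, i.e. about 1.6 points -- close to the ±2% margin used below
print(round(2 * se, 3))      # ≈ 0.032, i.e. two standard errors -- close to the ±4% used below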
Inference and polls

Example of polling (Wheelan)

A poll (n=1000) in one U.S. state during presidential election produces the
following results:

Democratic candidate 52%


Republican Candidate 47%
Independent candidate 1%
Margin of error 2%
Inference and polls

68% of the time, we expect the sample proportion to be


within one standard error of the true final result

Hence, with a 68% confidence,


Democrat support will range from 50-54% (± 2% margin of error)
Republican support, from 45-49% (± 2% margin of error)

68% is not a very good percentage of confidence.


To increase it, the margin of error has to be broadened
Inference and polls

According to the Central Limit theorem,


95% of sample proportions will fall
within two standard errors of the true population value.

To be 95% confident in our prediction,


two standard errors ⇒ a ± 4% margin of error

Support for Democrats will now range from 48-56% (± 4% margin of error)
For Republicans, from 43-51% (± 4% margin of error)

More confident, but... within these ranges, polls can get the result wrong
(e.g., in the 2012 U.S. presidential election)
Inference and polls

Hence, there is a trade-off between precision and confidence in prediction of the


result (H&H)

E.g., instead of being ‘pretty sure’ that Jefferson was third or fourth U.S. president,
you can be ‘absolutely positive’ that he was among the first five (Wheelan)

Smaller (and more biased) samples produce larger standard errors and larger
confidence interval (‘margin of sample error’)

In Canadian context, the nature of elections, electoral volatility and diverse


methodologies can contribute to polling errors (Coletto et al.)
Hypotheses

To examine whether X possibly causes Y, we must first investigate if the two are
related through a logic of inference from a sample to the whole population

Before that we need to re-examine the concept of hypothesis

Hypothesis: an expectation about what is happening in the (unobserved) population-


perhaps a relationship between two variables, X and Y
Researchers seek evidence to support or reject them

Hypotheses can imply a relationship, a difference, and/or direction (more / less)


Hypotheses

Non-directional (two-tailed)
Expectation that what is investigated will be different (one variable), or, related
(two variables)

Directional (one-tailed)
Expectation that what is investigated will be different in a given way
(more / less than a given value), or related in a given way (positive / negative)

Usually, when we have a theory that provides expectations


Hypothesis testing

We have a covered bowl with 95 red marbles and 5 blue marbles.


If we reach in, pick one marble and it is blue, evidence suggests
(but does not guarantee) that the bowl does not contain 95% red
marbles

More broadly, if we hypothesize that a true population parameter


equals 0, and evidence suggests it is not 0, a t-statistic tells us how
certain we are
It tells how far our observed sample value is from a hypothesized
population value
We can use the t-distribution and the mean to learn about the
population mean
Hypothesis testing

H0 (null hypothesis) represents a claim to be tested


HA (alternative-or, research-hypothesis) is an alternative claim to examine

With sufficient evidence, we can either reject H0 in favor of a HA at a particular


significance level* (usually 10%, 5%, 1%), or fail to reject it

*This is the measure of the strength of the evidence that must be present in
one’s sample before they reject the null hypothesis and conclude that the
effect is statistically significant. The researchers themselves determine the
significance level before conducting their research.
Hypothesis testing

Given a data sample, one compares the potential relationship between X and Y in
that dataset, with what one would expect to find if X and Y were not related in
the underlying population (K&W)

In other words, the more different the empirically observed relationship is from
what would be expected if there were not a relationship, the more the
confidence that X and Y are indeed related in the population.

More broadly, hypothesis testing indicates the probability of seeing what one
does in a sample if the null hypothesis is true
Hypothesis testing and p-value

Let us say we want to test a hypothesis that some interesting phenomenon in Political
Science is occurring.
No amount of evidence can ever prove a hypothesis is correct 100% of the time.
Instead, one first assumes that the phenomenon does not actually happen (which, in
technical terms, is called the null hypothesis H0) , and attempt to reject this idea. (Balkus)

Once data is gathered, a p-value is calculated.


P-value is the probability of that data being collected simply by chance assuming the null
hypothesis H0– that the phenomenon does not occur.
A low p-value suggests that the null hypothesis H0 is highly unlikely, and this supports the
hypothesis that the phenomenon does exist.
Hypothesis testing and p-value

In simple terms, p-value is an indicator if two variables we are exploring are related.
The probability we would see the relationship we are finding because of random
chance; probability that we see the observed relationship we are finding between X
and Y in a sample data if there were no relationship between them in the unobserved
population

Ranges between 0 and 1

It conveys the level of confidence with which one can reject the H0
Hypothesis testing

Example: Global Warming in the Arctic Circle

H0 : mean temperature at the North Pole in the summer is 0 °C
HA : mean temperature at the North Pole in the summer is higher than 0 °C
We test at the 10% level of significance

Take a random sample of 9 temperature measurements


Mean temperature = 1.2 oC
Standard deviation = 3
Can we reject the H0 ?
Hypothesis testing

Example: Global Warming in the Arctic Circle-is the North Pole melting?

H0 ⇒  = 0 oC
HA ⇒  > 0 oC
(This is a one-sided test. If we wanted to ask whether the temperature is lower than
0 oC it would be a two-sided one, with p-values at both ends of the distribution)
n=9
ഥ = 1.2
X
SD = 3
0 = 0
Df = 8 (sample size, n - 1)
Hypothesis testing and p-value

We get a result of t = 1.2


And compare it to a critical t value at 10% and 8 df
Critical t = 1.397
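For those curious, here is a minimal Python sketch of this test using the summary statistics from the slide (scipy is assumed to be available):

from scipy import stats

n, xbar, s, mu0 = 9, 1.2, 3.0, 0.0

t = (xbar - mu0) / (s / n ** 0.5)            # observed t-statistic = 1.2
t_critical = stats.t.ppf(0.90, df=n - 1)     # one-tailed critical value, 10% level, 8 df ≈ 1.397

print(t, round(t_critical, 3))
print("reject H0" if t > t_critical else "fail to reject H0")   # fail to reject H0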
Hypothesis testing and p-value
Compare? Critical t value?

The T distribution (also called Student’s T Distribution) is a type of distribution almost identical to the normal distribution curve; it is used when we have small samples

The resulting t value can be compared to critical values of t


(from existing table of critical values for t), for a specific
degree of freedom, at a specific level of statistical significance

Critical t is the statistical threshold that corresponds to a


particular significance value and degrees of freedom.
If the t value is smaller than the critical t one, it is not inside
this small area, α, where the null hypothesis can be rejected.
Hypothesis testing and p-value

Therefore, based on these measurements, we cannot reject H0


The p-value tells us how much area under the curve lies beyond our observed t-statistic.

But that does not mean that global warming is not taking place.
We need a much larger sample at a more rigorous significance level (lower p-value)
Parenthesis: statistical significance and error

Type I error (false positive)


Rejection of a null hypothesis that is actually true
(that is, there is no relationship in the population, results are due to chance)
The lower the value of level of significance, the less likely this error

Type II error (false negative)


No rejection of a null hypothesis that is false and should have been rejected
(here there is an existing relationship in the population, yet the results are
attributed erroneously to chance)
The lower the value of level of significance, the more likely this error

Useful mnemonic: when one undergoes a medical test, null hypothesis is that
they do not have a disease, x. If laboratory results confirm the disease and one is
not ill, then this is a false positive. If the test results are clear, and one is, in fact,
ill, then false negative.
(back to) Hypothesis testing

Sampling distribution of sample means ȳ if H0 : μ = μ0. A very small p-value indicates that the probability of obtaining values so far from the null hypothesis mean (μ0) is minuscule and not random. When HA is two-tailed, the H0 rejection regions are located on both ends of the curve (and the area under each = α / 2)

A p-value < 0.05 is considered the benchmark for results that are not a matter of chance, but statistically significant. Overall, the standard one sets as the benchmark for significance is symbolized by α (significance level, alpha) – i.e., how extreme the data must be before we can reject the null hypothesis.
p-value and the 0.05 threshold

Originates from …tea tasting

In interwar Britain, Muriel Bristol claimed to tell the difference between milk
poured into tea, and tea poured into milk.
To test this claim, in his 1935 book ‘The Design of Experiments’, Ronald Fisher, a British statistician, proposed a lady tasting tea test: she should be presented with 8 cups of tea (4 with milk poured into tea, and 4 with tea poured into milk). There are 70 possible ways of choosing which 4 cups are of each type, and only 1 of them identifies all 8 cups correctly.
If the woman was successful, it would be an extremely improbable result (a 1.4% chance), indicating there was something other than random selection of the correct answer.
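The combinatorics behind the 70 and the 1.4% can be verified in a couple of lines (Python assumed, not part of the slides):

from math import comb

combinations = comb(8, 4)                 # ways to choose which 4 of the 8 cups are 'milk first' = 70
p_correct_by_chance = 1 / combinations

print(combinations, round(p_correct_by_chance, 3))   # 70, 0.014 (about a 1.4% chance)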
p-value and the 0.05 threshold

Fisher’s Lady Tea Tasting experiment

Source: Balkus
Hypothesis testing

A p-value < 0.05 is generally used* in Scientific research.

*And abused-e.g., via p-hacking – or, data dredging.


P-hacking refers to misreporting of true effect sizes in published studies, and takes
place when researchers try out various statistical analyses and selectively report those
that demonstrate statistical significance (Head et al.)
A recent practice asks researchers to declare in advance what statistical procedure
they are undertaking and be more transparent about their research

At the same time, there is some debate about what is the threshold for ‘rare’ (and
many do not think that a p-value smaller than 0.05 is necessarily consequential, or
that, say one of 0.07 is not)
p-value and levels of statistical significance

p-value and assessing evidence against null hypothesis (Weiss)

‘Generally’, because ...Different standards of what is a statistically significant p-value


Social Science: significance if p-value < 0.05 (i.e., there are at most 5 chances in 100,
or 1 in 20 that the sample shows a relationship not also found in the population)
Natural Sciences: aim for even more rigor (p-value < 0.01) (less than 1 in 100)
Medical Sciences: maximum rigor (p-value <0.001 and much less) (less than 1 in a 1000)

Overall, the lower the p-value, the greater the confidence that there is a statistically significant relationship between X and Y
P-values
Limitations
1. Irreversibility: if p = 0.001, this does not mean there is a 0.999 chance of a systematic
relationship between X and Y.
2. Correlation ≠ causality: p-values inform only about the former
(whether X and Y are related; a necessary but only first step in a causal investigation)
3. Presence ≠ strength: lower p-values only constitute evidence of the presence of a
relationship between two variables, not the strength of it
4. Randomness-dependent: the less random the sample, the less confidence in the p-value
5. Chance: some tests may be statistically significant simply by chance!
6. Confidence ≠ quality: p-values are not related to the quality of the measurement procedure
for variables
7. Impact: true effects are often smaller than reported estimates
On the path to proving Causality
P-values (very low ones) are important as they provide evidence of
presence of relationship between two variables

To be able to meaningfully speak of a relationship between two


variables with respect to strength and, often, direction, we have
measures of association

In many analyses you will see these measures:


Chi-squared, Lambda, Gamma, and the correlation coefficient
(Pearson’s) r
Part I, Analysis I: main points

Research uses samples. If randomly selected, a sample can help us generalize about
the whole population through statistical inference. From what we know to be true
about a randomly selected, representative sample, we can use probability theory to
infer what is likely to be true about the population.

A normal distribution has a predictable area under the curve within specific
distances from the mean: about 68% of that probability falls within ± 1 standard
deviation, about 95% within ± 2 standard deviations and 99% within ± 3 (the empirical rule,
or the 68-95-99 rule)

The distribution of sample means follows a normal distribution, so the peak of the
normal distribution equals the population mean. This can help determine how far
our sample mean is from a hypothesized population value and its associated
probability (it can indicate how accurate a representation of our sample is within a
confidence interval, so that we can generalize from what we have).
Part I, Analysis I: main points

Hypothesis testing indicates the probability of seeing what one does in a sample if
the null hypothesis is true

With sufficient evidence, we can either reject H0 in favor of a HA at a particular


significance level (strength of the evidence needed), or fail to reject it

P-value is the probability of that data being collected simply by chance assuming
the null hypothesis, H0 – that the phenomenon does not occur. It conveys the level
of confidence with which one can reject the H0

P-values (very low ones) are important as they provide evidence of presence of
relationship between two variables
Part I, Analysis I: Glossary

Probability (or, Marginal Probability): the chance (ranging between 0 and 1)


that an event occurs, or, the chance that an event did occur. It provides the link
between a sample and a population, via exploring how common a finding is.

Conditional Probability: the chance that event (a) occurs, given that event (b)
has taken place

Law of large numbers: as the number of observations in a probability (random)


sample increases, the sample mean approaches the true population mean
Part I, Analysis I: Glossary

Standard error: the standard deviation of a sampling distribution. When high


(i.e., means are more spread out), it becomes more likely that any given mean
is an inaccurate representation of the true population mean (error between
sample and population)

p-value: an indicator (between 0 and 1) about whether two variables are


related; this is the probability one would see the relationship one is finding
between the variables because of random chance. If very low (e.g., < 0.05), it
suggests that the relationship between the variables is statistically significant.

Margin of error: in polling (confidence interval) indicates how many % points a poll’s
results will differ from the real population value.
Part I, Analysis I: Glossary
Hypothesis: an expectation about what is happening in the (unobserved) population-
perhaps a relationship between two variables, X and Y. Researchers seek evidence to
support or reject them

Null hypothesis, H0 : represents a claim to be tested

Alternative (or, research) hypothesis, HA : an alternative claim to examine

Type I error (false positive): rejection of a null hypothesis that is actually true. The
lower the value of level of significance, the less likely this error

Type II error (false negative): no rejection of a null hypothesis that is false and should
have been rejected. The lower the level of significance value, the more likely this error

t-statistic: indicates how far our observed sample value is from a hypothesized
population value
POL244H
Research Methods for Political Science II

Winter 2024
Wednesdays 1-3pm @ MN1170

Part I
Analysis II: measures of association; regression analysis
Week 1 Jan. 10 Introduction and course details
Week 2 Jan. 17 Data I: Research design, experiments, interviews and questionnaires
Week 3 Jan. 24 Data II: Sampling, size and distributions

Week 4 Jan. 31 Analysis I: Inferential statistics, univariate, bivariate analysis


Week 5 Feb. 7 Analysis II: Measures of Association, Regression Statistics
Week 6 Feb. 14 Analysis III: Regression (cont.); big data, machine learning, Networks
Reading week NO CLASS

Week 7 Feb. 28 Mid-term test


Single and small-n cases
Week 8 Mar. 6 Data I: Ethnography
Week 9 Mar. 12 Data II: Interviews, archives, texts/documents

Week 10 Mar. 19 Analysis I: Process tracing, content analysis

Week 11 Mar. 26 Analysis II: Comparative study

Week 12 Apr. 3 Indigenous methods, Research ethics, conclusions


Today’s schedule

Announcements
Measures of Association
Regression analysis I
Today’s tutorials

On regression and models


Exercise
Introduction of Assignment #2
Assignments

Assignment 2 (10%)
Due: end of next week February 18
by 11:59pm EST (Quercus)
On the path to proving Causality
P-values (very low ones) are important as they provide evidence of
presence of relationship between two variables

To be able to meaningfully speak of a relationship between two


variables with respect to strength and, often, direction, we have
measures of association

In many analyses you will see these measures:


Chi-squared, Lambda, Gamma, and the correlation coefficient
(Pearson’s) r
Association between two variables

Association is a non-random relationship between two variables


E.g., random relationship: height and taste in modern art
E.g., non-random relationship: height and basketball ability

Measures of association between two variables are used as precursors to


regression analysis
Measures of Association

Most common measures of association?


A Greek alphabet soup (and a pinch of r)

Chi-squared
Lambda
Gamma
(Pearson’s) r (correlation coefficient)

These are useful indicators of the existence of a relationship between two


variables with respect to strength and, often, direction
Association between two variables

Direction of association

Positive: high values associated with high values (and low with low)
E.g., ethnic polarization and likelihood of civil war
Negative: high values associated with low values (and vice versa)
E.g., levels of poverty and voting turnout
Association between two variables

Chi-squared
It helps test whether there is relationship
between two variables, X and Y (but not strength, or direction)

Such tests compare an empirical result with a hypothetical result that would occur if
the data were random (H0 = X, Y not related)

Chi-squared, χ² = Σ [ (O − E)² / E ]
O: observed number of cases (also expressed as fo, or, observed frequency)
E: expected number of cases (also expressed as fe or, expected frequency)
Association between two variables

Example (Singh)
Is location related to political party affiliation in the United States?
Always helps to look at the table for any apparent association
Association between two variables

Performing a χ² test results in χ² = 9.554 and a p-value = 0.008
(lower than 0.05, i.e., significant at the 5% level of statistical significance)
The calculated value of χ² can be compared to the values in the table aligned with the specific degrees of freedom we have. For the p-value, one looks at the area under the χ² distribution to the right of the test statistic; df = (table columns − 1) × (table rows − 1) = 2
This shows the probability that the deviations (between what we expected to
see and what we actually saw) are due to chance alone and our hypothesis or
model can be supported.
In our example, the probability of observing this pattern if the two variables
(political affiliation, location) were truly randomly related is 0.008

Put differently, if there was no relationship, the probability of observing this


would be a very low 0.008
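For reference, a chi-squared test of independence can be run in a few lines; the sketch below (Python with scipy assumed) uses a made-up 2x3 table of party affiliation by location, not the actual data from the Singh example.

from scipy.stats import chi2_contingency

# Hypothetical counts:  Urban  Suburban  Rural
table = [
    [60, 45, 25],   # Democrat
    [30, 40, 50],   # Republican
]

chi2, p, dof, expected = chi2_contingency(table)
print(round(chi2, 2), round(p, 4), dof)   # dof = (rows - 1) x (columns - 1) = 2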
Types of variables

Chi-squared can be used for variables measured at any level

Categorical / Nominal: variables are qualitatively different and cannot be ordered


(e.g., location, profession, hair color, etc.)

Ordinal: clear ordering of the categories in this type (e.g., education, income,
satisfaction ratings)

Interval / Ratio: also called continuous, or numerical: ordering and equal spacing between values; interval variables have no natural zero (e.g., temperature, where zero is an actual temperature rather than an absence), while ratio variables have a natural zero (e.g., age)
Measures of Association: nominal variables

Lambda
This is a test that indicates strength of association
(as Categorical / Nominal variables are non-directional)

It ranges from 0.00 to 1.00 and helps improve one’s predictions of one variable
if one knows about the other.
Measures of Association: nominal variables

Lambda,  =
𝝐1−𝝐2
𝝐2

𝝐1: the number of errors in prediction without knowing X


𝝐2: the number of errors in prediction when X is known

For example, if  = 0.345

Hence, an IV can help predict a DV, reducing erroneous prediction by 34.5%


Measures of Association: ordinal variables

Gamma
This test indicates both strength and direction of association

It ranges from -1.0 to +1.0, and determines


whether an observation rating high on one variable
means that observation will rate high on another
Measures of Association: ordinal variables

Example (Singh)
Support for deregulation and support for free markets
Each variable has five ordinal categories
Measures of Association: ordinal variables

Concordant pairs: an observation rates higher on both variables, (or lower on


both) as compared to its counterpart
E.g., respondent A strongly approves both deregulation and free markets;
respondent B strongly disapproves both deregulation and free markets

Discordant pairs: an observation rates higher on one variable but lower on the
other, as compared to its counterpart
E.g., while respondent C strongly approves both deregulation and free
markets, respondent D strongly disapproves deregulation while is in favor of
free markets

From Rademaker
Measures of Association: ordinal variables

Gamma, γ = (C − D) / (C + D)

Concordant pairs: positive association


Discordant pairs: negative association

In this example, γ = (C − D) / (C + D) = 0.54 ⇒ γ > 0 (modified from Singh)

Therefore, more concordant pairs (positive association), and,


moderate support for Deregulation and Free Markets
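As an illustration of how concordant and discordant pairs are counted, here is a minimal Python sketch on made-up paired ordinal ratings (1 = strongly disapprove ... 5 = strongly approve); these are not the Singh data.

from itertools import combinations

x = [1, 2, 2, 3, 4, 4, 5, 5]   # support for deregulation (hypothetical)
y = [1, 1, 3, 3, 4, 5, 4, 5]   # support for free markets (hypothetical)

concordant = discordant = 0
for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
    product = (x1 - x2) * (y1 - y2)
    if product > 0:
        concordant += 1        # the pair ranks in the same order on both variables
    elif product < 0:
        discordant += 1        # the pair ranks in opposite orders; ties are ignored

gamma = (concordant - discordant) / (concordant + discordant)
print(round(gamma, 2))         # positive value -> mostly concordant pairs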
Measures of Association: continuous variables

Correlation coefficient, r (or, Pearson’s r)


This measure provides both the strength and direction
of a linear relationship between two variables

Correlation coefficient, r(X, Y) = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / [ (n − 1) S_X S_Y ]

It ranges from -1.0 (strong negative association) to +1.0 (strong positive association)
If r = 0, this does not mean the two variables are not related at all; only that there is
no linear association between them
Measures of Association: continuous variables
Example: exploring the association of age and income (Singh)
H0: no difference between age and income
HA : increase in age → increase in income

Always helps to look at the scatter plot. Calculating 𝑟 = 0.92 ⇒ 𝑟 > 0


Hence, strong positive relationship between the two variables.
Knowing one’s age helps predict their income
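Computing r takes one line once the data are in arrays; the sketch below (Python with numpy assumed) uses made-up age and income values, since the actual Singh data are not reproduced on the slide.

import numpy as np

age    = np.array([23, 30, 35, 41, 48, 52, 60])   # hypothetical values
income = np.array([28, 39, 45, 52, 61, 70, 82])   # hypothetical values, in thousands of dollars

r = np.corrcoef(age, income)[0, 1]
print(round(r, 2))    # close to +1 -> strong positive linear association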
Relationship between variables

Correlation: how close observations fall to a line

With regression analysis, we will see that what is important is the slope of a line

Both straight lines have the same correlation coefficient, r = 1, but note that their slopes are different
Measures of association: tests

Decision flowchart for Measures of Association (Sirkin)

Is the table 2x2?
Yes → φ (phi) test*
No → Are both variables ordinal?
    No → λ test, or Cramer’s V (related to the χ² test)
    Yes → γ test

*Similar to r (Correlation Coefficient), with a range from -1 to 1, where 0 indicates no relationship
Hypothesis testing

Test appropriateness for type of variable

DV Categorical, IV Categorical: Tabular analysis
DV Categorical, IV Continuous: Probit/Logit
DV Continuous, IV Categorical: Difference of means (e.g., t-test); Regression
DV Continuous, IV Continuous: Correlation coefficient; Regression model
Let’s recap...

A hypothesis is a statement positing / predicting a relationship


between two observable, quantifiable and measurable representations
of concepts (variables) that can be empirically tested

The goal is to find if a causal relationship exists between variables of


different kind (nominal, ordinal, interval/ratio)

There is a list of four criteria to be met for causation to be established


that researchers must be mindful of

This is done by collecting and examining related data


Recap

In the previous classes we have been looking into quantitative research


This line of research seeks to propose and test theories, aiming to
generalize from them
It examines (mostly) randomly selected samples from a population and
analyzes the data via descriptive and inferential statistics, based on
laws and theorems of probability
E.g., the Central Limit theorem that suggests that as the number of
samples increases, the distribution of sample means follows a normal
distribution with specific properties
The use of statistical tests allows for analyses to explore if and how
variables may be associated or correlated.
And that brings us to…
Bivariate regression analysis

As simple association is not enough, regression analysis is a powerful


tool that helps us quantify the relationship between a particular
variable and the outcome we study while controlling for other factors
(helping us meet criteria from the causation list)
Bivariate regression analysis

Regression is a major tool in Political Science research, especially its


linear form. That is because 70% of relations between continuous variables
in Political Science are linear.

For two variables (bivariate), unless we have strong visual evidence in


our scatterplot to the contrary, we start by postulating a linear one
between them
Estimation of linear relationships

Perfect linearity can be expressed as Y𝑖 = 𝑎 + 𝑏𝑋𝑖

𝑎 ∶ intercept (or, the ‘constant’)


This is where the straight line crosses the y-axis (at X=0)

𝑏: slope
This is the change in Y associated with a one-unit increase in X.

Once we know these two parameters, we can draw that line across any
range of X values
Estimation of linear relationships

Imperfect linearity can be expressed as: 𝑌𝑖 = 𝑎 + 𝑏𝑋𝑖 + 𝑒𝑖

Residual 𝑒: the deviation of individual Y values from the regression line


Some use the notation u for the residual
Estimation of linear relationships

Basic Regression formula


General equation for a line calculating the predicted value of Y:

Ŷᵢ = a + bXᵢ

The ‘hat’ operator on Y (called ‘y-hat’) indicates this is a predicted value

The line serves as a ‘statistical model of reality’:


it is the best prediction on the DV score for any given IV score
Ordinary Least Squares (OLS)

As we look to minimize the vertical distances between the fitted line and
each point in the scatterplot,
we select the line that minimizes the total (or, sum) of squared residuals
(squared, to ensure we do not have negatives) (‘line of best fit’)

Equation: Σ eᵢ² = Σ (Yᵢ − Ŷᵢ)²
Regression coefficient

From the original equation, the OLS regression coefficients a and b are obtained (although, as the intercept a is a constant, the term regression coefficient is used for b)

How to find them algebraically?

a = Ȳ − bX̄   ⇒   â = ȳ − β̂x̄

b = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)²   ⇒   β̂ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
Estimation of linear relationships

Example
We are examining the relationship between X and Y, and obtain several observations
[Scatterplot: vertical axis = DV, horizontal axis = IV]
Estimation of linear relationships

Since X̄ = 3 and Ȳ = 4, from the OLS regression equations ( a = Ȳ − bX̄ and b = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)² ) we obtain
a = −0.2 (intercept, where the line crosses the Y-axis; or, constant, since it is the starting point for a calculation), and,
b = 1.4 (the slope of the line, or, regression coefficient)

Based on the general equation Y = a + bXᵢ

Ŷᵢ = −0.2 + 1.4 Xᵢ
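The same numbers can be reproduced with a few lines of Python (numpy assumed); the (X, Y) observations below are hypothetical values chosen so that X̄ = 3 and Ȳ = 4, as in the example.

import numpy as np

X = np.array([1, 2, 3, 4, 5])
Y = np.array([1.5, 2.0, 4.5, 5.0, 7.0])

x_bar, y_bar = X.mean(), Y.mean()                                  # 3 and 4

b = np.sum((X - x_bar) * (Y - y_bar)) / np.sum((X - x_bar) ** 2)   # slope = 1.4
a = y_bar - b * x_bar                                              # intercept = -0.2

print(round(a, 2), round(b, 2))
Y_hat = a + b * X          # predicted values on the line of best fit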
Estimation of linear relationships

Visually,
Intercept,  is the value of y when X=0
(‘anchors’ the regression line)

Slope, b, important, for predictions


If the IV changes by 1 unit, how much
does the DV change?
In this example, when X increases by 1
unit, then Y increases by 1.4 units
E.g., if X = 2, expected value of
𝑌෡𝑖 = −0.2 + 1.4 ∗ 2 = 2.6
Slope

On b, regression coefficient

If b > 0 , then the relationship between the


IV and DV is positive, and the line-of-best fit
will be upward sloping

If b < 0 , then the relationship between the


IV and DV is negative (or, inverse), and the
line-of-best fit will be downward sloping
Goodness of fit measures
For two different DV, Y1 and Y2 , which line fits the data best?

(From Singh)

As always, important to look at the data. Examine scatterplots and regression


lines. Slope and intercept are very similar, but the one on the left appears to fit better than the one on the right, given the ‘noise’ (residuals).
One can miss this if one only runs the regression.
Goodness of fit measures: Root MSE

Put differently, the Y1 plot has a smaller sum of squared residuals ( Σeᵢ² = Σ(yᵢ − ŷᵢ)² )
To better describe the variation about the regression line (the ‘noise’), the Root Mean Square Error (Se) is used (also called the standard error of the regression model)

Se = √[ Σeᵢ² / (n − k − 1) ]

n = number of observations
k = number of independent variables (always equal to 1 in bivariate regression, so the equation can also be written as Se = √[ Σeᵢ² / (n − 2) ] )
That’s a lot of equations! I have included them for those interested, but don’t
worry-all of these are reported by the statistical software output
Goodness of fit measures: Root MSE

More broadly, Root MSE is the average vertical distance (or, deviation) of a data
point from the fitted regression line and indicates how concentrated the data is
around the line of best fit.
It is always expressed in the metric of the DV and not bounded (e.g., from
-1 to 1) so, it is more difficult to compare. Still, more broadly,
the higher the value of Se , the worse the fit of the regression line

Se = 10.45 Se = 52.23
Goodness of fit measures: R-squared

A more preferable indicator of how accurately the regression line


describes X and Y, is R2, the proportion of variance in Y explained by X

R² = 1 − [ Σ(eᵢ − ē)² / (n − 1) ] / [ Σ(yᵢ − ȳ)² / (n − 1) ]     or,     R² = 1 − var(e) / var(y)

Also expressed as, R² = 1 − Residual Sum of Squares (RSS) / Total Sum of Squares (TSS)
* RSS is also called Sum of Squared Errors (SSE)
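Continuing the earlier OLS sketch (Python with numpy assumed, hypothetical data), the residuals give both the Root MSE and R² directly:

import numpy as np

X = np.array([1, 2, 3, 4, 5])
Y = np.array([1.5, 2.0, 4.5, 5.0, 7.0])

b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
a = Y.mean() - b * X.mean()

residuals = Y - (a + b * X)                   # e_i = Y_i - Y_hat_i
rss = np.sum(residuals ** 2)                  # Residual Sum of Squares
tss = np.sum((Y - Y.mean()) ** 2)             # Total Sum of Squares

root_mse = np.sqrt(rss / (len(X) - 1 - 1))    # n - k - 1, with k = 1 in bivariate regression
r_squared = 1 - rss / tss

print(round(root_mse, 2), round(r_squared, 2))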
Goodness of fit measures: R-squared

𝑅2 ranges from 0 to 1
If 𝑅2 =1 ⇒ perfect relationship between X and Y (accounting for all
variation)
If 𝑅2 = 0 ⇒ no (linear) relationship between X and Y

But it also depends on the data; e.g., survey data are more ‘noisy’.
If more variables are added, it increases.

Hence, the slope remains very important


(how much Y increases for each unit of X)
Goodness of fit measures: R-squared

Can also run regressions for Y1 and Y2 and compare 𝑅2 .


In the latest pair of scatterplots we have looked at:

𝑅2 = 0.90 𝑅2 = 0.36

(the closer R² is to 1, the better the prediction)


Goodness of fit
Examples
(Back to) Inference

In general, with most cases of Political Science research it is not possible to observe the full population, which has a mean, μ

Hence, we take a sample and estimate its mean, x̄

We use x̄ to make a guess about μ

Regression also involves estimation -- about a line


Inference

To learn about an unobserved population, one can conduct regressions


using sample data and employing familiar inference tools
(e.g., H0 null hypothesis testing, p-values, confidence intervals)

To make inferences about the true population parameters


Yᵢ = α + βXᵢ + εᵢ
(remember, we use Greek letters in notation for the whole population, and Latin ones for a sample) we use the sample estimates
yᵢ = a + bxᵢ + eᵢ
Inference: hypothesis-testing and the t-statistic

A familiar process and tool: the T-test

For a test of the population mean, the t-statistic formula is:


t = (x̄ − μ0) / (s / √n)

x̄ : sample mean
μ0 : value of the population mean proposed in the null hypothesis, H0
s : sample standard deviation
n : number of observations
Denominator, s / √n : standard error
Inference: hypothesis-testing and the t-statistic

In regression analysis, rather than about , we make inferences about β


Null hypothesis, H0 : X does not cause Y (β = 0)

[Figure: two scatterplots. In the first, the fitted line has a non-zero slope and H0 is rejected; in the second, the fitted line has a zero slope, indicating no correlation (at least, not linear), and H0 is not rejected]
Inference: hypothesis-testing and the t-statistic

t = (b − β0) / S_b = b / S_b   (since β0 = 0 under H0)

S_b, standard error of b:  s_b = s_e / √[ Σ(xᵢ − x̄)² ]   and   s_e = √[ Σeᵢ² / (n − k − 1) ]

NB. Some textbooks use the notation β* instead of β0

S_e measures how tightly the points are arranged around the fitted line; it is smaller when the points cluster closely around the line. When S_e is smaller, S_b is also smaller, indicating that b is a more precise estimate of β.
Inference: hypothesis-testing and the t-statistic

Example
Are Divorce rate and Unemployment rate in the U.S. related? (Singh)
H0: β = 0
H A: β  0
Sample n=192
[Scatterplot: Divorce rate (vertical axis) vs. Unemployment rate (horizontal axis)]
Inference: hypothesis-testing and the t-statistic

Estimation of the equation: Divorce-hatᵢ = a + b × unemploymentᵢ
(remember, a: intercept, b: slope/regression coefficient)
Inference: hypothesis-testing and the t-statistic

What does this tell us?

A one-point increase in the Unemployment rate (IV)


is associated with 0.36-point increase in the expected Divorce rate (DV).

Also, with zero unemployment, a divorce rate of 2.08% is expected


(Divorce-hatᵢ = a + b × unemploymentᵢ ⇒ Divorce-hat = 2.08 + 0 = 2.08)
Inference: hypothesis-testing and the t-statistic

How statistically significant is b?


In other words, how certainly can one state that β is different from 0?

Regression results
Coefficient on Unemployment : 0.365
Standard error of the coefficient: 0.051

Using t = (b − β0) / S_b = b / S_b (coefficient / standard error), we have

t = 0.365 / 0.051 ≈ 7.2

With this value for t we can determine whether this t-statistic indicates statistical
significance
Inference: hypothesis-testing and the t-statistic

Two-tailed test (or, non-directional--as we have not added


a direction in our hypothesis, e.g., that the higher the
unemployment, the lower the divorce rate, etc.)
For df (n-k-1): 190 (k: nr of parameters estimated, i.e., 1 for a
bivariate regression with one IV and one DV)
p-value: 0.05
Critical value of 𝑡= 1.97 (between 1.96 and 1.98 for these df)

In our example, 𝑡= 7.2


(remember, if a t value is larger than the critical t , we can reject the null hypothesis H0 )
7.2 > 1.97
Hence, the result is statistically significant
We can reject the null hypothesis, H0 that Unemployment
rate is not related to Divorce rate
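A minimal sketch of this test from the reported coefficient and standard error (Python with scipy assumed, not part of the slides):

from scipy import stats

b, se_b, df = 0.365, 0.051, 190

t = b / se_b                                       # ≈ 7.2
t_critical = stats.t.ppf(1 - 0.05 / 2, df=df)      # two-tailed critical value at the 5% level ≈ 1.97

print(round(t, 1), round(t_critical, 2))
print("reject H0" if abs(t) > t_critical else "fail to reject H0")   # reject H0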
Inference: hypothesis-testing and the t-statistic

What this means is that


‘If the true impact of Unemployment on the expected Divorce rate were zero,
there is a less than 5% chance that we would have observed a t-statistic as
large in magnitude as we did’ (Singh)

In more formal terms, this can be expressed as


Pr(|t| ≥ 1.97 | β = 0) < 0.05

NB. This result should not be interpreted as ‘there is a less than 5% chance
that the true impact of Unemployment on the expected Divorce rate is zero’
Inference: hypothesis-testing and the t-statistic

For a directional hypothesis, we would express this as


(e.g., if we expect a positive relationship) H0: β ≤ 0 and HA: β > 0
or (if we expect a negative relationship) H0: β ≥ 0 and HA: β < 0

and perform a one-tailed hypothesis test using the same formula


The difference is that we would use the t-table differently
(i.e., we would look for the one-tailed values)
Inference: t-statistic

Also, note that t-statistics tend to get bigger with a bigger sample

t = (b − β0) / Sb = b / Sb

Where Sb, standard error of b: sb = se / √Σ(xi − x̄)² and se = √[Σei² / (n − k − 1)]

With a larger n, Σ(xi − x̄)² increases and sb decreases, resulting in a larger t-statistic

In simple terms, a larger sample increases significance (all else equal)
Inference: p-values

Remember that p-values provide information


on how consistent our sample result is with X and Y being truly unrelated in the population
In general, a p-value indicates how likely it is that we would find what we
did in our sample, if there was no relationship between X and Y in the
population

In regression, the null hypothesis, H0 , is that β = 0


p-value is the likelihood of getting slope b by chance if the real slope in
the population is 0
Hence, a very low p-value suggests that the null hypothesis, H0 is not true
Inference: p-values

P-value and significance level

A p-value less than or equal to 0.05 indicates that the estimate is


significant at the 5% level

NB. At the same time, a parameter estimate that is statistically


significant does not necessarily mean that it is also substantively
significant
Inference: confidence intervals

Screenshots (Aug. 2021, and Jan. 2022) from Five Thirty Eight

Confidence intervals are about the likely location of the population mean –
the interval within which we can be a given percentage confident that the population mean lies
Inference: confidence intervals

Formula for confidence intervals: b − tα * sb to b + tα * sb

Caution! Confusing notation


Here, t is the critical t-statistic from t-table values
(not the observed one) associated with the
significance level we select, and,
α is the significance level (equal to 1 − the confidence level)

Also, remember that for a two-tailed test


you need to use the two-tailed corresponding critical value
Inference: confidence intervals

Back to the Divorce and Unemployment rate example

What is the confidence interval for the coefficient on Unemployment


rate?

Regression results
Coefficient on Unemployment : 0.365
Standard error of the coefficient: 0.051
Inference: confidence intervals

For a 95% confidence interval, α = 1 – 0.95 = 0.05

Plotting the numbers into the formula


(this is also reported by statistical software, so don’t worry)

b − tα/2 * sb to b + tα/2 * sb

we get 0.365 – (1.97 * 0.051) to 0.365 + (1.97 * 0.051)

= 0.265 to 0.465

This means that one can be 95% confident that β falls in the interval between
0.265 and 0.465
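
As a quick check, the same interval can be computed in Python with scipy, plugging in the coefficient and standard error reported above:

from scipy import stats

b, s_b, df = 0.365, 0.051, 190             # coefficient, its standard error, n - k - 1
alpha = 0.05                               # for a 95% confidence interval

t_crit = stats.t.ppf(1 - alpha / 2, df)    # two-tailed critical value (about 1.97)
ci_low, ci_high = b - t_crit * s_b, b + t_crit * s_b
print(t_crit, ci_low, ci_high)             # roughly 0.26 to 0.47, matching the interval above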
Linear regression and related assumptions

For regression models that estimate α and β to be meaningful, certain


assumptions must be met

In general, an estimator should be


• Unbiased - if its expected value equals the true population parameter
• Consistent - if as n gets larger, it approaches the true population parameter
• Efficient - if it has the smallest sampling variance of all other estimators
Assumptions: specifics

Unbiasedness, consistency and efficiency can be addressed by seven


specific assumptions (Lewis-Beck & Lewis-Beck)

1. Linearity in the parameters
2. Expected value of the error term = 0
3. X values measured without error
4. Errors are independent from X
(Assumptions 1-4 provide unbiasedness* and consistency, i.e., greater accuracy as sample size increases)
*Yet, estimates will still be biased if the sample is unrepresentative

5. Errors have constant variance
6. Errors independent from each other
(Assumptions 5-6 ensure OLS estimates are ‘best’, i.e., more efficient than ones from different methods)

7. Errors normally distributed
(Assumption 7 allows for statistical inference, even in smaller samples)
Regression analysis

The road to causation

• Plausible theoretical relation between X, Y


• Y must not temporally precede X
• X and Y must be associated
• Ensure relation is not spurious due to
other ‘confounding’ variables Z
(a variable that is correlated both with X and Y, and one which, if
omitted from a regression, will result in biased estimates of β, and
perhaps also of α)
Confounding variables

How can a confounding variable be identified?

• Good knowledge of existing literature (that is why literature reviews


help!) and solid theoretical foundations for research and hypotheses to
identify such a possible variable (and gathering of data to measure
them)
• Controlling for it statistically
• When suspecting more than one possible confounding variable,
multiple regression towards accurately identifying a causal relationship
Multiple Regression: algebra

Multiple regression is simply the addition of one (or more) independent


variable(s), Z, to that equation. For two IVs:
Yi = α + β1Xi + β2Zi + εi
(a single intercept; slope estimates for each IV, X and Z)

If X increases by 1 unit, and Z is the same, Y changes by the slope (or,
coefficient) b1
More broadly, Multiple Regression for k number of IVs
Yi = α + β1X1i + β2X2i + … + βkXki + εi
Part I, Analysis II: main points

To meaningfully speak of a non-random relationship between two variables with


respect to strength and, often, direction, there are measures of association,
including lambda, gamma and r (correlation coefficient).
In Political Science, 70% of relations between continuous variables are linear. But simple
association is not enough, so regression analysis in its linear form is used to quantify
the relationship between a variable and the outcome studied while controlling for
other factors.
For two variables (bivariate), we postulate a linear relationship between them.
Perfect linearity can be expressed as Yi = a + bXi
a: intercept (or, the ‘constant’) is where the straight line crosses the y-axis (at X=0)
b: slope. This is the change in Y associated with a one-unit increase in X.
Once we know these two parameters, we can draw that line across any range of X
values (and make predictions about Y)
Part I, Analysis II: main points

Imperfect linearity can be expressed as: 𝑌𝑖 = 𝑎 + 𝑏𝑋𝑖 + 𝑒𝑖


As we look to minimize the vertical distances between the fitted line and each point in
the scatterplot, we select the line that minimizes the total of squared residuals: this is
the ‘line of best fit’ (and why this regression is called OLS - Ordinary Least Squares)

The slope of a diagram is important.


If the regression coefficient b > 0 , then the relationship between the IV and DV is
positive, and the line-of-best fit will be upward sloping.
If b < 0 , then the relationship between the IV and DV is negative (or, inverse), and the
line-of-best fit will be downward sloping.

Measures of goodness of fit include Root MSE (or, standard error of the regression
model – the higher, the worse the fit) and R2 (from 0 to 1: the closer to 1, the more
fully X accounts for the variation in Y)
Part I, Analysis II: main points

Regressions can also help us learn about an unobserved population, by using sample
data and employing familiar inference tools (e.g., H0 null hypothesis testing, p-values,
confidence intervals)
In regression analysis, instead of about μ, we make inferences about β
Null hypothesis, H0 : X does not cause Y (β = 0)
When Se is smaller, Sb is also smaller, indicating that b is a more precise estimate of β
Confidence intervals tell us we can be (usually 95)% (at p: 0.05) confident that β
falls in the interval between two values

Multiple Regression is further used to control for confounding variables (variables that are
correlated both with X and Y). Multiple regression is simply the addition of one (or
more) independent variable(s), Z, to the regression equation.
Part I, Analysis II: Glossary

Lambda: This is a test that indicates strength of association (as Categorical /


Nominal variables are non-directional). It ranges from 0.00 to 1.00 and helps
improve one’s predictions of one variable if one knows about the other.

Gamma: This test shows both strength and direction of association. It ranges
from -1.0 to +1.0, and determines whether an observation rating high on one
variable means that observation will rate high on another.

Correlation coefficient, r (or, Pearson’s r): This measure provides both the
strength and direction of a linear relationship between two continuous
variables. It ranges from -1.0 (strong negative association) to +1.0 (strong
positive association). If r = 0, this only means there is no linear association between
them.
Part I, Analysis II: Glossary

Regression: Regression is a major tool in Political Science research, especially


its linear form. For two variables (bivariate), unless we have strong visual
evidence in our scatterplot to the contrary, we start by postulating a linear
relationship between them.

Residual e: the deviation of individual Y values from the regression line. Some
texts use the notation u for the residual instead.

Slope/regression coefficient, b: If b > 0 , then the


relationship between the IV and DV is positive, and the line-of-best fit will be
upward sloping. If b < 0 , then the relationship between the IV and DV is
negative (or, inverse), and the line-of-best fit will be downward sloping.
Part I, Analysis II: Glossary
Root MSE: the average vertical distance (or, deviation) of a data point from the fitted
regression line. It is always expressed in the metric of the DV and not bounded (e.g., from -1
to 1) so, it is more difficult to compare. Still, broadly, the higher the value of Se , the worse
the fit of the regression line.

𝑅2 (R squared): A better indicator of how accurately the regression line describes X and Y-it is
the proportion of variance in Y explained by X. It ranges from 0 to 1. If 𝑅2 =1 ⇒ perfect
relationship between X and Y (accounting for all variation). If 𝑅2 =0 ⇒ no (linear) relationship
between X and Y.
T-statistic: The t-statistic measures how much the estimated value differs from its expected
value, considering the standard error. It's key in deciding whether we reject or fail to reject the
null hypothesis in a t-test. Usually, researchers use a 5% level to examine if their results are
statistically significant. Larger samples make the t-statistic bigger and the results more
significant. Generally, in a two-tailed t-test with a 5% significance level and a big enough
sample, the critical t-value is around ±1.96.
Confidence intervals: Confidence intervals are about the likely location of the population
mean – the interval within which we can be a given percentage confident that the population mean lies
Part I, Analysis II: Glossary

Linear regression assumptions: for regression models that estimate α and β to be


meaningful, an OLS estimator should be unbiased, consistent and efficient.

Regression assumptions: Unbiasedness, consistency and efficiency can be addressed


by seven specific assumptions: (i) linearity in the parameters (ii) expected value of the
error term = 0 (iii) X values measured without error (iv) errors are independent from X
(v) errors have constant variance (vi) errors independent from each other and (vii)
errors normally distributed.
POL244H
Research Methods for Political Science II

Winter 2024
Wednesdays 1-3pm @ MN1170

Part I
Analysis III: Regression (concl.)
Week 1 Jan. 10 Introduction and course details
Week 2 Jan. 17 Data I: Research design, experiments, interviews and questionnaires
Week 3 Jan. 24 Data II: Sampling, size and distributions

Week 4 Jan. 31 Analysis I: Inferential statistics, univariate, bivariate analysis


Week 5 Feb. 7 Analysis II: Measures of Association, Regression Statistics
Week 6 Feb. 14 Analysis III: Regression (cont.)
Reading week NO CLASS

Week 7 Feb. 28 Mid-term test


Big data, machine learning, Network Analysis
Week 8 Mar. 6 Single and small-n cases; Data I: Ethnography
Week 9 Mar. 12 Data II: Interviews, archives, texts/documents

Week 10 Mar. 19 Analysis I: Process tracing, content analysis

Week 11 Mar. 26 Analysis II: Comparative study

Week 12 Apr. 3 Indigenous methods, Research ethics, conclusions


Today’s schedule

Announcements
Regression analysis II
Big data, Machine Learning (next class)
Network analysis (next class)
Assignments

Assignment 2 (10%)
Due: February 18 by 11:59pm EST (Quercus)
Exceptionally, due to Reading Week, no late penalty until end of Feb. 20 (by 11:59pm EST)

NB. The six academic papers to select from have been posted on Quercus
Assignments

Extra office hrs on Assignment #2 by Mujahed: tomorrow (Thursday, Feb.


15), 3:30-4:30 pm EST.

Zoom link: https://utoronto.zoom.us/j/87382638114


Meeting ID: 873 8263 8114
Passcode: POL244
Assignments

Multiple Regression – guide on writing about a Regression table


based on example of article by Collier and Hoeffler (2004)

Posted on Quercus
Reading Week: there will be office hours

Prof. K
Wednesday, Feb. 21 via zoom
Expanded from 3:30-5:30pm EST
Zoom link: https://utoronto.zoom.us/j/83233816131

Mujahed
Thursday, Feb. 22 from 9-10am, and from 3-4pm
Zoom link: https://utoronto.zoom.us/j/87382638114
Meeting ID: 873 8263 8114
Passcode: POL244
Mid-term test

In class, February 28th from 1:10-2:10pm


Material: weeks 2-6 (including Halperin and Heath ch. 17)
(but, excluding Big Data, Machine Learning and Network Analysis -slides
and their two readings*- which will be discussed after the mid-term)
Focus: 75% on slides, 25% on tutorials and textbook

Then, mini lecture on next topic (2:20-3pm)


Tutorials (3-4, 4-5pm)
* 1. Jungherr, A. & Theocharis, Y. 2017. The Empiricist’s Challenge:
Asking Meaningful Questions in Political Science in the Age of Big
Data. Journal of Information Technology & Politics 14, 2: 97-109.
2. Borgatti, S. P., Everett, M. G. and Johnson, J. C. 2018. Analyzing
Social Networks 2nd ed. (London: SAGE), 1-12.
Regression analysis

The road to causation

• Plausible theoretical relation between X, Y


• Y must not temporally precede X
• X and Y must be associated
• Ensure relation is not spurious due to
other ‘confounding’ variables Z
(a variable that is correlated both with X and Y, and one which, if
omitted from a regression, will result in biased estimates of β, and
perhaps also of α)
Regression analysis

Spurious correlations

Crude oil imports by the U.S. and U.S. Chicken consumption (annual/lb)
r: 0.899; correlation 89.9%
(from Vigen @ spurious correlations)
Confounding variables

A variable that is correlated both with X and Y


A variable which, if omitted from a regression, will result in biased
estimates of β (and perhaps also of α)
Confounding variables

Difference between a confounding variable and an intervening


(mediating) variable

Intervening variable A mediates between X and Y;
confounding variable Z is correlated both with X and Y
Confounding variables

How can a confounding variable be identified?

• Good knowledge of existing literature (that is why literature reviews


help!) and solid theoretical foundations for research and hypotheses to
identify such a possible variable (and gathering of data to measure
them)
• Controlling for it statistically
Confounding variables

Example (Singh)
Watching Fox News (X) and supporting the Republican Party (Y)
Does X cause Y? The association is spurious because X and Y are also correlated
with numerous other variables (e.g., political ideology, urbanity,
demographics, etc.). The result is an overestimation of the effect of X on Y

If variable Z (Conservative Political Ideology) is


omitted, the errors are always above the true line for
Fox viewers, because these viewers are also
conservative. There is a positive correlation
between X and the errors, and the result is a biased
estimate of the impact of exposure to Fox
News on voting Republican
Confounding variables

Fox News and Republican voting preference example: illustration of


how a multivariate analysis can help avoid spurious correlations
                         Fox News viewership
                         High        Low
Voting      Democrat      31          68
preference  Republican    69          32
            Total        100         100
            n            475         525

(cell entries are column percentages)

Cross tabulation of survey results


Confounding variables

If the sample is split to statistically control for political ideology…

[Three separate cross-tabulations, split by ideology: Liberals, Moderates, Conservatives]

…it reveals a negligible relationship between voting preference and Fox News viewership
Confounding variables

Visualization: there is little true relationship between X and Y; the apparent (spurious)


relationship emerges only if the Political Ideology confounding variable is omitted (black regression line)

[Scatterplot: Republican vote (Y) against Exposure to Fox News, in hrs (X);
survey datapoints by ideology: Conservative, Moderate, Liberal]

(From Singh)
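
A simulated sketch of this logic in Python (statsmodels formula API); the variable names and data-generating numbers are invented, not Singh's data. When ideology (Z) drives both Fox viewing (X) and Republican voting (Y), the bivariate slope on viewing is inflated, and it shrinks toward zero once ideology is controlled for:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: ideology (Z) drives both viewing hours (X) and vote propensity (Y)
rng = np.random.default_rng(1)
ideology = rng.normal(0, 1, 1000)                         # confounder Z
fox_hours = 2 + 1.5 * ideology + rng.normal(0, 1, 1000)   # X depends on Z
rep_vote = 0.8 * ideology + rng.normal(0, 1, 1000)        # Y depends on Z, not on X

df = pd.DataFrame({"rep_vote": rep_vote, "fox_hours": fox_hours, "ideology": ideology})

naive = smf.ols("rep_vote ~ fox_hours", data=df).fit()                   # omits Z: biased
controlled = smf.ols("rep_vote ~ fox_hours + ideology", data=df).fit()   # controls for Z

print(naive.params["fox_hours"])        # spuriously positive
print(controlled.params["fox_hours"])   # close to zero once Z is controlled for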
Confounding variables

How can a confounding variable be identified?

• Good knowledge of existing literature (that is why literature reviews


help!) and solid theoretical foundations for research and hypotheses to
identify such a possible variable (and gathering of data to measure
them)

• Controlling for it statistically


• When suspecting more than one possible confounding variable,
multiple regression towards accurately identifying a causal relationship
Multiple Regression

If you recall, the equation for bivariate regression is:

Yi = α + βXi + εi (true population)

yi = a + bxi + ei (sample estimate to make inferences about the above)

NB. Sometimes, the notation u for residuals, or errors is used instead of e

If X increases by 1 unit, Y changes by the slope (or, coefficient) b


Multiple Regression

Multiple regression is simply the addition of one (or more) independent


variable(s), Z, to that equation. For two IVs:

Yi = α + β1Xi + β2Zi + εi
(a single intercept; slope estimates for each IV, X and Z: they are called Partial Coefficients)

If X increases by 1 unit, and Z is the same (or, ‘while accounting for the
impact of Z’, or, ‘all else equal’), Y changes by the slope (or, coefficient) b1
More broadly, Multiple Regression for k number of IVs
Yi = α + β1X1i + β2X2i + … + βkXki + εi
Multiple Regression

β1 represents the effect of X on Y while holding constant the effects of Z and


β2 represents the effect of Z on Y while holding constant the effects of X

In that way, our fourth and final criterion for causality can be met
Multiple Regression: geometry

Whereas in two-variable regression


the equation 𝑌𝑖 = 𝑎 + 𝑏𝑋𝑖 represents a line,

in multiple regression,
Yi = a + b1Xi + b2Zi denotes a plane (the ‘response plane’)
There is still a single intercept, a,
but now there are two slopes, b1 and b2

Residuals are the vertical distance between


each observation and the regression surface

The R-squared formula is the same


Multiple Regression

The same assumptions for regression analysis stand in multiple regression


– plus an additional one: ‘no perfect multicollinearity’

That means if variables X and Z are highly or perfectly collinear (r ≈ 1.0),


(i.e., perfectly related to each other), it is impossible to estimate unique
regression coefficients and distinguish the effects of X on Y from the
effects of Z on Y.
Multiple Regression: collinearity
Detection
• When high R-squared value, but statistically insignificant parameters
• Coefficient estimates that vary significantly (esp. standard errors, which increase)
when other IVs are added or removed
• High variable correlations among IVs, and counterintuitive coefficient estimates

Remedy
• Increasing sample size, collection of more data
• Combining collinear variables (if conceptually similar) by adding the values of them
• Accepting it as a fact of the model (the resulting ‘wider’ sampling distributions are not biased)

NB. Removing IVs suspected of collinearity should be avoided, because this can lead to
omitted variable bias; it is less problematic to have collinearity and larger standard
errors than to miss a potentially important causal variable
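
One further diagnostic, not named on the slide but widely used alongside the checks above, is the variance inflation factor (VIF); a sketch with hypothetical, deliberately collinear IVs:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical IVs, with x1 and x2 deliberately made nearly collinear
rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly a copy of x1
x3 = rng.normal(size=200)

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))
vifs = {col: variance_inflation_factor(X.values, i) for i, col in enumerate(X.columns)}
print(vifs)                                 # x1 and x2 show very large VIFs; x3 does not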
Multiple Regression: terminology

Also note that in multiple regression, IVs can also be called:


(Singh, Kuhnert et al.)

• Explanatory variables
• Covariates
• Predictor variables
• ‘Right-hand side’ variables

Also, do not confuse multiple regression with ‘multivariate’ regression – a term


that refers to more than one DV (and is not estimated by OLS Regression)
Multiple Regression: statistical inference

Same logic of statistical inference, hypothesis-testing, confidence


intervals and p-values in Multiple Regression

Example : Occupational prestige in Canada (Pineo and Porter; Singh)


Factors affecting whether an occupation is considered prestigious (Y)
Research question: what affects the status of an occupation?

DV: status (Y)


Possible IV: Income (X)
Multiple Regression: statistical inference

Scatterplot of 15 professions vs income

r= 0.85 ⇒ very strong, positive linear relationship


Multiple Regression: statistical inference

Estimation of bivariate Regression of Occupational Prestige (Y) on Income (X)


⇒ Ŷ = α + β * X

What does this tell us? A $1,000 (X unit) increase is associated with a 2.18-point increase in
the expected prestige of an occupation. Also, with zero income, prestige is negative
Multiple Regression: statistical inference

Regression results
Coefficient on Income : 2.18
Standard error of the coefficient: 0.36
Using t = (b − β0) / Sb = b / Sb (coefficient / standard error) ⇒
t = 5.95
This value exceeds the critical t, therefore this estimate is significant at
the 5% level (two-sided)
Confidence interval: 1.08 to 3.29
p-value < 0.001. Hence, we can be very certain that income is not
unassociated with prestige
R-squared (goodness of fit): 0.73. Thus, income (X) explains 73% of the
variance in prestige (Y)
Multiple Regression: statistical inference

Predictions

What would be the predicted Occupational Prestige of a profession


with an income of Cdn $ 35,000 / year?

Predicted Occupational Prestigei = α + β * incomei

= −6.72 + 2.19 * 35 ≈ 70 (with income measured in $1,000s)
Multiple Regression: statistical inference

But, could there be another factor that affects the status of an occupation
besides income?

Possible IV: Education (Z)

Looking at the Pineo and Porter dataset, we obtain that


Education (Z) and Income (X) correlated at (Correlation Coefficient) r=0.82, and,
Education (Z) and Occupational Prestige (Y) correlated at r=0.94

That means this variable Z is correlated with both X and Y. It is a confounding


variable and needs to be accounted for
Multiple Regression: statistical inference

Estimation of Multiple Regression of Occupational Prestige (Y)


on Income (X) and Education (Z) ⇒ Ŷi = a + b1 * Xi + b2 * Zi

Regression results
Coefficient on Income: 0.65 (<2.18)
Standard error of the coefficient: 0.41
𝑡 = 1.59, which does not exceed the critical t, therefore this estimate is not
significant
Confidence interval: -0.24 to 1.54
p-value: 0.139
Multiple Regression: goodness of fit

Goodness of fit, R-squared: 0.90.

R-squared will always increase when IVs are added, but theoretical


considerations and a solid understanding of background knowledge
related to the research - not addition ad infinitum (called ‘overfitting’) -
should guide this process.

Thus, Education (Z) and Income (X) together explain 90 % of the


variance in prestige (Y).
But, note that Multiple Regression cannot tell how much each does.
Multiple Regression: goodness of fit

In this section (d) the individual


influences of X and Z on the
variance explained in Y cannot
be distinguished.
All we can do is determine which
IV has the largest b coefficient, by
standardizing them

Venn diagram of total variance (correlation)


in X, Z, Y (from K&W, Singh)
Multiple Regression: goodness of fit

We can also use the Adjusted R-squared, which accounts for number of
observations and variables in the model (degrees of freedom)
R2adj = R2 – [k / (n − k − 1)] * (1 − R2)

= 0.9 – [2 / (15 − 2 − 1)] * (1 − 0.9) ≈ 0.883

This indicator points to how reliable the correlation is and how much is gained
by adding IVs to the model:
it increases only when a new predictor strengthens the model
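
A quick check of this arithmetic in Python (the helper function below is written for this example, not a library routine):

def adjusted_r2(r2: float, n: int, k: int) -> float:
    # Adjusted R-squared as given above: R2 - [k / (n - k - 1)] * (1 - R2)
    return r2 - (k / (n - k - 1)) * (1 - r2)

# Occupational prestige example: R2 = 0.90, n = 15 observations, k = 2 IVs
print(adjusted_r2(0.90, 15, 2))   # approximately 0.88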
Multiple Regression: statistical inference

Result
The effect of Income on Occupational Prestige was overestimated due to an
omitted confounding variable (Education).
Income does not equate with Prestige and our causal inference was
incorrect.
Once we control for a confounding variable, we can obtain a less biased
estimate
Multiple Regression: interpretation

Example (K&W)

Factors affecting Incumbent-Party Presidential Vote (Y)

Research question: what affects the reelection of an incumbent


president?

DV: Votes for incumbent (Y)


Possible IV: economic growth (X)
Multiple Regression: interpretation

Number of observations (elections): 35


Regression model A (Growth)
Estimated Regression coefficients:
α̂ (Y-intercept) = 51.6 (w/ standard error = 0.81)
β̂ = 0.65 (w/ standard error = 0.15)
Both statistically significant

For every X increase by 1 unit, Y changes by the slope (or, coefficient) b


This means that for every 1% increase in Growth (X), there is an increase of 0.65%
in the vote percentage for the incumbent candidate
Multiple Regression: interpretation

Could there be other causes for Y?

Regression model B (‘Good news’, or Z)


(Quarterly economic growth for election year)
Estimated Regression coefficients:
α̂ (Y-intercept) = 47.6 (w/ standard error: 1.87)
β̂ = 0.87 (w/ standard error = 0.32)
Both statistically significant
(NB. Se is not bounded, e.g., from -1 to 1, but broadly, the higher the value of Se, the
worse the fit of the regression line)

For every Z increase by 1 unit, Y changes by the slope (or, coefficient) b


This means that for every additional consecutive Quarter of good economic news (Z), there is an
increase of 0.87% in the vote percentage for the incumbent candidate
Multiple Regression: interpretation
Regression model C
Estimation both of X and Z effects
on incumbent vote
Estimated Regression coefficients:
β̂C Growth = 0.58 (w/ standard error = 0.15)
β̂C Quart = 0.63 (w/ standard error = 0.28)
Estimated Y-intercept: 48.47 (w/ standard error: 1.58)
All three statistically significant
But, maybe not substantively so. Stat. significance is a necessary, not a sufficient condition
β̂C Quart = 0.63 means that for every 1 unit (here, a Quarter of econ. growth), we estimate an
increase of 0.63% in the vote percentage for the incumbent candidate, while controlling
for the effects of Growth
Multiple Regression: interpretation

Also important, the difference between


estimated Regression coefficients in
models A and C
β̂A = 0.65 (w/ standard error = 0.15)
β̂C Growth = 0.58 (w/ standard error = 0.15)

β̂A ≠ β̂C Growth because model C controls for the effects of quarterly economic growth

Also, the R-squared value in model C indicates an increase in the amount of DV variance


that can be explained. (Remember, R2, the proportion of variance in Y explained by X, ranges
from 0 to 1, and the closer to 1 the better the prediction)
Multiple Regression: interpretation

NB. Coefficients are measured in the metric of each variable, so they are not standardized.


In this example, when comparing the effects of Growth rates and quarterly
economic growth, the result is apples to oranges
To allow for meaningful comparison,
Standardized coefficients have been proposed
β̂Std = β̂ * (sX / sY)
β̂Std : standardized coefficient
β̂ : unstandardized coefficient
sX , sY : standard deviations of X, Y (where sX = √var(X) , sY = √var(Y) )
Multiple Regression: interpretation

In model C,
β̂Std (Growth) = 0.58 * (5.5 / 6.0) ≈ 0.53
This means that for every 1 standard deviation increase in Growth, we estimate an
increase of 0.53 standard deviation in the vote percentage for the incumbent
candidate, while controlling for the effects of quarterly econ. growth
β̂Std (quart. econ. growth) = 0.63 * (2.9 / 6.0) ≈ 0.31
Similarly, for every 1 standard deviation increase in quart. econ. growth, we estimate
a 0.31 standard deviation increase in the vote percentage for the incumbent
candidate, while controlling for the effects of overall Growth
Therefore, overall Growth has a greater impact on incumbent voting % than ‘good
news’ about consecutive econ. growth in the year leading up to an election
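
The same standardization is easy to reproduce with a small helper function (written here for illustration), using the coefficients and standard deviations reported on the slide:

def standardized_coefficient(b: float, s_x: float, s_y: float) -> float:
    # b_std = b * (s_X / s_Y)
    return b * (s_x / s_y)

print(standardized_coefficient(0.58, 5.5, 6.0))   # Growth: about 0.53
print(standardized_coefficient(0.63, 2.9, 6.0))   # quart. econ. growth: about 0.30 (reported as 0.31 above)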
Multiple Regression: Regression model

What is a Regression model? (Besco, M. Islam, Redman)

Models are simplified versions of the world that help researchers explore
the potential causes of phenomena by mathematically sorting out what
variables may have an impact
Multiple Regression: Regression model

Important considerations (Besco, M. Islam)

• What variables to include


• A solid background knowledge, a well-thought-out theory, and a
plausible hypothesis; one has to begin with theory in search of testing,
not with data seeking a theory
• Simplicity is key (researchers start simple with a straight-forward
hypothesis that includes the main variables they are considering): what
are the key IV and DV?
Multiple Regression: Regression model

Important considerations (Besco, M. Islam)

• What possible variables could be used as control ones?


 Here, it is important to think about spurious causation
 Unexplained variation must be reduced
 Controlling for intervening/mediating variables should be
avoided; instead, one should seek confounding ones

Note that adding variables will often change the other coefficients
(slopes)
Multiple Regression: Regression model

How to determine what model is best for one’s (observational) data?


In Political Science, researchers seek the X cause of Y outcome, and there may be
competing theories with different IVs for Y (e.g., civil war cause-grievance, or greed?)
Further, the significance of a variable may not be immediately apparent
Background knowledge, literature review and empirical cases serve as the primary
guides (also explains all these PoliSci courses one has to take as they advance)
Then, the causality checklist helps, culminating in a Regression model with candidate
IVs that is tested to obtain an unbiased estimate of the causal impact of the IV identified
as causal.
One can also use AIC, AICc, BIC and Mallows Cp metrics for model evaluation and
selection. The lower these metrics, the better the model
Overall, the most parsimonious (simplest) theory is preferred.
When two theories are equally plausible, F-tests (similar to t-tests) are used to
determine if one model outperforms another in terms of explained variance
Multiple Regression: outliers

Outliers (or, influencers): extreme value observations relative to other


ones - often with unusual IV values (or, leverage) and large residual
values - that strongly influence the parameter estimates in a Regression
model
How can an influencer be detected? Looking at visualizations
(for observations far from the others)

(from Kellstedt and Whitten) (from Agresti)


Regression: Categorical Independent Variables

We have examined continuous variables (interval/ratio)

But, at times many variables are not continuous (instead, nominal, ordinal)

Let us briefly look at two- (and multi-) category variables


Regression: Dichotomous Categorical Dependent Variables

What happens when the Dependent Variable is Categorical and


dichotomous? Dummy variables (1 and 0).
Dummy variables (or ‘indicators’) are dichotomous variables that take a
value of either one (presence of a characteristic) or zero (absence).
Often encountered in Political Science (e.g., voting/not voting in an
election; employed status/unemployed status; favorable position on
immigration policies/non-favorable position on immigration policies; civil
war occurrence/absence)

Approached via different models:


Linear Probability, Binomial Logit and Binomial Probit
Regression: Dichotomous Categorical Dependent Variables

Linear Probability Model (LPM)


This is a special case of an OLS model where the DV is a dummy one, and
the DV estimates are interpreted as predicted probabilities.

But it has issues (e.g., out-of-bounds predictions, errors not uniformly


distributed), which arise from trying to impose a linear functional form on a
non-linear relation.
Regression: Dichotomous Categorical Dependent Variables

Binomial Logit (BLM) and Binomial Probit (BPM) Models


Instead of LPM, researchers frequently use other models that account both
for non-linearity and a dichotomous DV – like Binomial Logit and Binomial
Probit ones.
In essence, these models are able to capture non-linear relations via their
cumulative distribution functions which have a sigmoidal s-shape
They are expressed via Logit and Probit measures of odds ratios or log odds
(common measures of effect size for proportions).
Their goodness of fit statistics include the Pseudo-R-squared and the
Percent Correctly Predicted.
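
A minimal sketch in Python with statsmodels, fitting a Logit and a Probit to simulated data with a dichotomous DV (the data-generating values are invented for illustration):

import numpy as np
import statsmodels.api as sm

# Simulated dichotomous DV (e.g., voted = 1 / did not vote = 0) driven by one IV
rng = np.random.default_rng(3)
x = rng.normal(size=500)
p = 1 / (1 + np.exp(-(0.5 + 1.2 * x)))     # sigmoidal (logistic) probability
y = rng.binomial(1, p)

X = sm.add_constant(x)
logit_res = sm.Logit(y, X).fit()
probit_res = sm.Probit(y, X).fit()

print(logit_res.params)              # coefficients on the log-odds scale
print(np.exp(logit_res.params))      # odds ratios
print(logit_res.prsquared)           # McFadden's pseudo R-squared
print(probit_res.params)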
Regression: Dichotomous Categorical Dependent Variables

How do they work? (You do not need to know that)


BLM and BPM use link functions, Λ (lambda, the logistic CDF) and Φ (phi, the standard
normal CDF) respectively, that link the linear component Xiβ̂ to the predicted probability
that the dummy DV = 1, capturing the non-linear relationship between IV and DV via a
continuous (non-Bernoulli, i.e., not only two possible outcomes) distribution
Regression: Dichotomous Categorical Dependent Variables

Example (Albright)

Research question: does the consumption of alcohol affect one’s affinity


for Justin Bieber’s music?

DV=1 (love Bieber), DV=0 (can’t stand Bieber)


(…with data from a college fraternity)
Binomial Logit (BLM) and Binomial Probit (BPM) Models

A Linear model cannot accurately depict the cumulative effects of alcohol on the likeability
of the singer (beer nr 3 does not produce the same effects as beer nr 7).
Non-linear models like the Binomial Logit and Probit accurately accommodate different rates
of change (effect) at the opposite ends of the IV. Essentially, these models take the linear one
and filter it through a function based on a probability distribution to reflect the non-linear
relationship within the 0-1 range
Regression: Polytomous Categorical Dependent Variables

Finally, when the Dependent Variable is Categorical and polytomous:


Also encountered in Political Science (e.g., ratings of a political leader,
choice of vote in a parliamentary democracy with multiple parties, range of
regional alliances)
While beyond the scope of our class,
For ordinal DVs, OLS to be avoided; use of ordered logit and ordered probit
models (generalize from dichotomous models to account for additional
ordered categories) (Singh)
For multinomial (categorical) DVs, complexity increases and cannot use
OLS. Instead, multinomial logit and multinomial probit ones (also
generalizations of more simple models)
Time Series

Data can often be collected across time through repeated, regular temporal
observations on a single unit of analysis (Shin).
This analysis allows us to investigate the ‘history’ of a variable and explain
patterns over time (including the systematic or random nature of residuals)
Estimation of long-term behavior (‘trend’) involves bivariate OLS Regression
analysis with time (in regular intervals) as IV. It also conveys information
visually on trends (e.g., the impact of events, policies, etc.)

Example: Partisan change in Southern


U.S., 1952-84. From Stanley
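
A minimal sketch of such a trend estimation in Python (statsmodels), using a simulated annual series rather than Stanley's actual data:

import numpy as np
import statsmodels.api as sm

# Simulated annual series (e.g., a party identification percentage), 1952-1984
years = np.arange(1952, 1985)
rng = np.random.default_rng(4)
series = 30 + 0.6 * (years - years[0]) + rng.normal(0, 2, size=len(years))

X = sm.add_constant(years - years[0])   # time index (in regular intervals) as the IV
trend = sm.OLS(series, X).fit()
print(trend.params)                     # intercept and the estimated yearly trend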
To be examined in our next class, following the mid-term test

The era of big data: machine learning and network analysis


The era of big data

We live amidst a transformative technological revolution that has


seen digital technology, computational power as well as
huge volumes, more readily accessible and new forms of data
(e.g., from social media) emerge in the last 25 years
Big data

Big data: the capacity to search, aggregate, and cross-reference large datasets
(Boyd and Crawford)

Digital trace data (big data and metadata): a characteristic of the digital era
2017: 500,000,000 tweets/day (Source: Omnicore)
2017: 1,500,000,000 people daily active FB users (Source: Zephoria)
2017: 260,000,000,000 emails sent/day
2019: 319,600,000,000/day (Source: Radicati Group)
Pandemic has only accelerated this trend
NB. One zettabyte is 1 trillion gigabytes (or all of Shakespeare’s works 178 trillion times)
2021 - total volume of data created: 64.2 zettabytes
2025 - total volume of data created (forecast): 180 zettabytes
In an era of 24/7 connectivity, humans produce, emit and provide data (some
argue we are data) daily that can be collected and analyzed.
Big data: promise

1. Massive volume and storage of data


2. Real-time analysis (velocity)
3. Wide variety, richness of sources
4. All-encompassing scope
5. Highly detailed resolution
6. Relational (of complex dataset)
7. Flexible (extensible laterally)
8. Scalable (expandable)
Computational Social Science

Result: emergence of computational Social Science, including


Machine Learning, Agent-Based Modelling, Social Network Analysis,
Content Analysis, Geographical Information Systems (GIS),
Complexity Theory (Page)

Our course looks briefly at two:


Machine Learning and (next class) Network Analysis
Machine Learning

When building a model, the canonical advice is to avoid


• Forward Selection (adding variables to maximize explained variance)
• Backward Elimination (adding every possible variable, then removing
ones with least significant coefficient in steps),
• Stepwise Regression (adding and then dropping variables based on
significance), and,
• Automated Variable Selection
The latter practically means constructing a model and building a theory based
on its results – termed ‘post-hoc theorizing’. It is akin to throwing a dart
and then painting the target around it. However, it is useful for prediction
and exploratory research on what correlates with a DV
Machine Learning

Machine learning is a ‘class of flexible algorithmic and statistical


techniques for prediction and dimension reduction*’ (Grimmer, Roberts
and Stewart)
*Where input information relevant only to output is preserved
More plainly, it is the automation of statistical learning techniques by
computers to identify patterns in data and make predictions.

These predictions are made with some level of accuracy, which techniques
like regression seek to improve. A model that identifies sub-optimal
predictions and adjusts towards greater accuracy is said to ‘learn’ - hence,
‘machine learning’
Machine Learning: discovery

Can also be useful for large datasets


Organizes similar observations in a dataset together via clustering
(with techniques like unsupervised dimensionality reduction and
automated feature engineering - transforming inputs into new basis
functions, as in neural networks) towards recognizing behavioral
patterns that may be difficult to identify

Examples: Grimmer’s research on U.S. Congressional grant allocation


producing unexpected findings, generating more questions; Pape’s
analysis of Jan. 6, 2021 storming of Congress
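
As a rough illustration of clustering, here is a sketch using scikit-learn's KMeans on made-up two-dimensional data (the library choice and values are assumptions for illustration, not the tools used in the studies cited above):

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical dataset: 300 observations described by two features
rng = np.random.default_rng(5)
data = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[3, 3], scale=0.5, size=(100, 2)),
    rng.normal(loc=[0, 4], scale=0.5, size=(100, 2)),
])

# Group similar observations together without any pre-specified labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
print(kmeans.labels_[:10])        # cluster assignment for the first 10 observations
print(kmeans.cluster_centers_)    # the discovered group centres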

Implications for mode of scientific inquiry…


Machine Learning: data-driven research agendas

Instead of formulating hypotheses a priori, inductive approach of


learning new things about a phenomenon studied.
This is a challenge to the deductive approach, with a focus on sequential
and interactive ways to analyze data
Going against the grain, this has recently generated a lot of
theoretical and empirical debates in Political Science

Examples:
Jungherr and Theocharis vs. Grimmer on methodological merit
Blair and Sambanis vs. Beger, Morgan and Ward on whether a
theory-based model is better than Machine Learning to predict civil
war onset.
Advanced Machine Learning: Artificial Neural Networks

The learning part of creating models has led to the next step of artificial neural
networks, a broader family of learning methods based on learning data
representations (instead of task-specific algorithms). These networks store and
evaluate how significant each of the inputs is to the output of a model (NB. both
inputs and outputs are binary, 1s and 0s). At the same time, these types of
models include an intermediate, not observable ‘hidden layer’ that stores
information regarding the input’s importance, and it makes associations between
the importance of combinations of inputs (Johnson, Rowe). In that sense, it mimics
the human brain’s architecture and function by quickly making decisions

Outputs are passed to the next perceptron until an answer is provided.


Like with linear Regression, neural networks vary their parameters to improve accuracy.
(From Nielsen) They can be ‘trained’ by varying the weights w1, w2, w3, … , wn and the bias.
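
A toy sketch of a single perceptron's forward pass in Python (the inputs, weights and bias below are arbitrary illustrative values):

import numpy as np

def perceptron(inputs: np.ndarray, weights: np.ndarray, bias: float) -> int:
    # A single binary perceptron: output 1 if the weighted sum plus the bias is positive
    return int(np.dot(inputs, weights) + bias > 0)

# Three binary inputs; 'training' would adjust the weights and the bias
x = np.array([1, 0, 1])
w = np.array([0.6, -0.4, 0.9])
b = -1.0
print(perceptron(x, w, b))        # outputs 1 here, since 0.6 + 0.9 - 1.0 > 0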
Advanced Machine Learning: Deep Neural Networks

Models with more than one intermediate, not observable ‘hidden layers’ are
engaged in ‘deep learning’. While like with Artificial Neural Networks each
connection has its weight, in deep neural networks, the most important features
for classification can be obtained automatically (via the ‘activation function’) similar
to neurons in the brain

From Nielsen From Dickson


Advanced Machine Learning: Deep Neural Networks

Application example

From Assael, Y., Sommerschield, T., Shillingford, B. et al. (2022) Restoring and
Attributing Ancient Texts Using Deep Neural Networks. Nature 603, 280–283.

Probabilistic prediction of missing ancient text via deep neural networks. In the example displayed above, this inscription
[Inscriptiones Graecae vol. 1, edition 3, document 4, face B (IG I3 4B)] records a decree concerning the Acropolis of Athens and
dates to 485/4 BC. (Marsyas, Epigraphic Museum, WikiMedia CC BY 2.5). The marble-looking parts are the surviving parts of the
inscription, with the rest predicted by a self-learning deep neural network algorithm, Ithaca. Alone, archaeologists were able to
correctly predict 25% of the text; Ithaca achieved 62%; a collaboration between the two yielded 72% successful prediction.
See more details at https://www.nature.com/articles/d41586-022-00702-6
Part I, Analysis III: main points

Multiple Regression is used to more accurately identify a causal relationship when one


or more confounding variables (Z) are suspected of affecting both X and Y.
It is also used for statistical inference and hypothesis-testing, and follows the same logic as
simple Regression (confidence intervals, p-values)

Multiple regression is simply the addition of one (or more) independent


variable(s), Z, to that equation. For two IVs: Yi = α + β1Xi + β2Zi + εi

If X increases by 1 unit, and Z is the same (or, ‘while accounting for the impact of
Z’, or, ‘all else equal’), Y changes by the slope (or, coefficient) b1
b1 represents the effect of X on Y while holding constant the effects of Z and
b2 represents the effect of Z on Y while holding constant the effects of X
Part I, Analysis III: main points

To find out which IV(s) cause Y, one creates models, each containing a different
combination of the independent variables measured.
To compare Regression coefficients measured in different units, they are standardized
Another way to compare models is the AIC, BIC and related metrics.

IVs that are not continuous (instead, nominal or ordinal), whether dichotomous or
polytomous (if the categories represented are both exhaustive and mutually
exclusive), can be represented by ‘dummy’ variables that take a value of either one
(presence of a characteristic) or zero (absence).

When the DV is Categorical and dichotomous, Linear Probability, Binomial Logit


and Binomial Probit models are used

Time Series is an analysis that investigates a variable and explains patterns over
time.
Part I, Analysis III: Glossary

Confounding variables: A variable that is correlated both with X and Y. A variable


which, if omitted from a regression, will result in biased estimates of β (and perhaps
also of α).

Intervening variables: Unlike a confounding variable, which is correlated with both X


and Y, an intervening variable mediates between an IV and a DV. E.g., A would be an
intervening variable if it mediates between X and Y.
Different names of IVs: can also be called explanatory variables, covariates, predictor
variables or ‘right-hand side’ variables

Response plane: in multiple regression, the surface formed by b1 and b2 which (like a
line does in two-variable regression) represents the predicted values
Residuals: Residuals are the vertical distance between each observation and the
regression surface.
Part I, Analysis III: Glossary
Multicollinearity: when two or more of the IVs in a Regression model are highly
correlated with one another.
Outliers (or, influencers): extreme value observations relative to other ones - often
with unusual IV values (or, leverage) and large residual values - that strongly influence
the parameter estimates in a Regression model.
Models: Models are simplified versions of the world that help researchers explore the
potential causes of phenomena by mathematically sorting out what variables may
have an impact.
Adjusted R-squared: similar to R-squared [0-1], points to how reliable a correlation is
AIC, AICc, BIC and Mallows Cp: metrics used for model evaluation and selection. The
lower these metrics, the better the model
Dummy variables (or ‘indicators’): dichotomous variables that take a value of either
one (presence of a characteristic) or zero (absence). Can also be used for polytomous
independent variables if categories they represent are mutually exclusive, exhaustive
Part I, Analysis III: Glossary

Linear Probability Model (LPM): special case of an OLS model where the DV is a dummy
one, and the DV estimates are interpreted as predicted probabilities.
Binomial Logit (BLM) and Binomial Probit (BPM): account both for non-linearity and a
dichotomous DV. These models are able to capture non-linear relations via their
cumulative distribution functions which have a sigmoidal s-shape. For goodness of fit,
Pseudo-R-squared and the Percent Correctly Predicted are used.

In OLS, the DV is both continuous and observed (hence, residuals point to the line).
The BLM contains an unobserved probability (we only see 1s and 0s). To estimate the
parameters of this model, the method of Maximum Likelihood Estimation (MLE) is used

Time Series: data can often be collected across time through repeated, regular temporal
observations on a single unit of analysis. This analysis allows us to investigate the ‘history’
of a variable and explain patterns over time.
Part I, Analysis III: Glossary
Big data: the capacity to search, aggregate, and cross-reference large datasets.
Machine learning (ML): ML is a ‘class of flexible algorithmic and statistical techniques
for prediction and dimension reduction’. More plainly, it is the automation of
statistical learning techniques by computers to identify patterns in data and make
predictions.
Have a restful, productive and healthy Winter reading week!
