10 - Linear Models

The document outlines a course on Linear Models and Analysis, covering course logistics, upcoming deadlines, and key concepts such as hypothesis testing and vectorization in Python. It discusses the relationship between GDP per capita and life expectancy, including model interpretation and significance testing through permutation methods. The course also addresses overfitting and the importance of regularization in model selection.

Linear Models & Analysis

CIS 5450, Spring 2025

Ryan Marcus
Outline
●
Course logistics
●
Brief review
●
Linear models: interpretation
●
Linear models: analysis
Upcoming Deadlines
●
HW2 due Mar 2nd (12 days)

●
Midterm 1 on Mar 4th (14 days)
●
Sample midterm released this week
Office Hours
●
Average wait time: 3.5 minutes
●
P95 wait time: 18 minutes
Outline
●
Course logistics
●
Brief review
●
Linear models: interpretation
●
Linear models: analysis
Is Yawning Contagious?
●
Mythbusters episode: put random people into
two rooms, had someone yawn in one room, and
recorded how many people yawned in each.
●
Groups: CONTROL or TREAT
●
Effect: “no yawn” or “yawn”

Example adapted from Dr. Maria Tackett

Is Yawning Contagious?
●
25% of the control group yawned
●
29% of the treatment group yawned
●
Is it a real effect, or just random chance?

Example adapted from Dr. Maria Tackett

Is Yawning Contagious?
●
We’re going to be doing
hypothesis testing via
simulation
●
Intuitively:
“If we assume there is no
effect, and we did a
random trial, what is the
probability of seeing a
result like what we saw?”
Hypothesis Testing

In our real data, we saw a 0.04 difference, but in the null world, we
saw a 0.007 difference. Much smaller!

… but that was just one trial

Hypothesis Testing
Hypothesis Testing
●
How often was our simulated test statistic in the
null world equal or stronger than what we
observed?

Empirical P value
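A minimal sketch of the simulation behind the empirical P value, written as a plain loop. The group sizes and yawner counts (4 of 16 in control, 10 of 34 in treatment) are assumptions chosen to match the 25% and 29% figures; the slides do not state them. The function and variable names are hypothetical.

```python
import numpy as np

def empirical_p_value(control, treat, n_trials=2000, seed=0):
    """Share of null-world trials whose difference is >= the observed one."""
    rng = np.random.default_rng(seed)
    observed = treat.mean() - control.mean()
    pooled = np.concatenate([control, treat])
    hits = 0
    for _ in range(n_trials):
        # Null world: group labels do not matter, so reshuffle who is in which room.
        shuffled = rng.permutation(pooled)
        null_control = shuffled[:len(control)]
        null_treat = shuffled[len(control):]
        if null_treat.mean() - null_control.mean() >= observed:
            hits += 1
    return observed, hits / n_trials

# Assumed data: 1 = yawned, 0 = did not yawn.
control = np.array([1] * 4 + [0] * 12)
treat = np.array([1] * 10 + [0] * 24)

observed, p = empirical_p_value(control, treat)
print(f"observed difference = {observed:.3f}, empirical p = {p:.3f}")
```

The returned fraction answers the question on the slide: how often the null world produced a test statistic equal to or stronger than the observed one.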
Hypothesis Testing
●
P values can be confusing.
●
TRUE: Assuming the null hypothesis (treat & control are the same), there is a
p = 39.8% chance of observing an equal or stronger effect.
●
TRUE: Assuming the null, if you did this experiment 1000 times, you would
expect 398 runs to show an equal or stronger effect.
●
FALSE: “There is only a 39.8% chance the null hypothesis is true!”
●
FALSE: “There is a 100 – p = 60% chance that yawning is contagious!”
Vectorization
●
Python, as a programming language, often
trades speed for ease-of-use.
●
… but we’ve seen that doing things with Python
libraries can be fast!
●
DuckDB, Polars
●
These libraries do all the hard work for you, and give you a
nice interface
●
Another similar tool: NumPy (numeric python)
Vectorization: Yawning

Big improvement!
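A sketch of what a vectorized version of the yawning simulation might look like. It is not the code from the slide, which did not survive extraction. The counts (14 yawners among 50 people, 16 in control) are assumptions, and `null_differences` is a hypothetical helper.

```python
import numpy as np

def null_differences(pooled, n_control, n_trials, rng):
    """Simulate n_trials null-world differences at once, with no Python loop."""
    # One row per trial; permuted(axis=1) shuffles each row independently.
    worlds = rng.permuted(np.tile(pooled, (n_trials, 1)), axis=1)
    null_control = worlds[:, :n_control].mean(axis=1)
    null_treat = worlds[:, n_control:].mean(axis=1)
    return null_treat - null_control

rng = np.random.default_rng(0)
pooled = np.array([1] * 14 + [0] * 36)   # assumed: 14 yawners among 50 people
observed = 10 / 34 - 4 / 16              # assumed group counts

diffs = null_differences(pooled, n_control=16, n_trials=10_000, rng=rng)
p_value = np.mean(diffs >= observed)
print(f"empirical p = {p_value:.3f}")
```

All 10,000 trials live in one 10,000 x 50 array, so NumPy does the shuffling and averaging in compiled code.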
Vectorization: The Downside
●
Downside to vectorization: memory usage
●
Large arrays require space
●
Fix: batching!

Idea: use numpy to simulate a small number (10k) of trials, tally up those
results, then simulate another small batch
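The batching idea can be sketched as follows. Only one batch of simulated trials is held in memory at a time, and a running tally is kept across batches. The data counts are the same assumptions as before, and `batched_p_value` is a hypothetical name.

```python
import numpy as np

def batched_p_value(pooled, n_control, observed, n_trials,
                    batch_size=10_000, seed=0):
    """Empirical p value computed in batches to bound memory usage."""
    rng = np.random.default_rng(seed)
    hits = 0
    done = 0
    while done < n_trials:
        size = min(batch_size, n_trials - done)   # last batch may be smaller
        worlds = rng.permuted(np.tile(pooled, (size, 1)), axis=1)
        diffs = (worlds[:, n_control:].mean(axis=1)
                 - worlds[:, :n_control].mean(axis=1))
        hits += int(np.count_nonzero(diffs >= observed))   # tally, then discard
        done += size
    return hits / n_trials

pooled = np.array([1] * 14 + [0] * 36)   # assumed: 14 yawners among 50 people
observed = 10 / 34 - 4 / 16              # assumed group counts
p_value = batched_p_value(pooled, n_control=16, observed=observed,
                          n_trials=25_000)
print(f"empirical p = {p_value:.3f}")
```

Peak memory is set by `batch_size`, not by the total number of trials, so the trial count can grow without the arrays growing.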
Outline
●
Course logistics
●
Brief review
●
Linear models: interpretation
●
Linear models: analysis
Linear Models
How does GDP per capita
relate to life expectancy?
●
Before now: compare two groups (treat vs.
control)
●
… but what if my input variable isn’t binary?
[Scatter plot: GDP per capita vs. life expectancy; labeled points: Luxembourg, US, Equatorial Guinea]
Seems like a clear relationship – can we model it?

LifeExp = a * GDPpc + b
How to find a and b?
●
a, the slope: on average, when GDPpc increases by 1, life expectancy
increases by 0.00028 years
●
b, the intercept
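One common way to find a and b is an ordinary least-squares fit. The sketch below uses `np.polyfit` on synthetic data generated for illustration. The real GDP and life expectancy dataset is not part of these notes, so the numbers here are assumptions, not the course data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data (assumed ranges and noise level).
gdp_pc = rng.uniform(500, 60_000, size=200)
life_exp = 0.00028 * gdp_pc + 68 + rng.normal(0, 3, size=200)

# polyfit returns coefficients from the highest power down: slope, then intercept.
a, b = np.polyfit(gdp_pc, life_exp, deg=1)
predicted = a * gdp_pc + b

print(f"LifeExp = {a:.8f} * GDPpc + {b:.4f}")
```

The slope `a` is read exactly as on the slide: the average change in life expectancy when GDPpc increases by 1.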
Linear Models
LifeExp = a * GDPpc + b
●
How do we know these
are the “right”
parameters?
●
The squared error loss
function is convex, so any
minimum we find is the global one
●
… and that minimum is unique, assuming your input
features are linearly independent!
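For the one-feature model, the minimizer can be written down directly. This is the standard least-squares result, with x standing for GDPpc and y for LifeExp (notation introduced here, not taken from the slides):

```latex
L(a, b) = \frac{1}{n} \sum_{i=1}^{n} \left( a x_i + b - y_i \right)^2

\frac{\partial L}{\partial a} = 0, \quad \frac{\partial L}{\partial b} = 0
\;\Longrightarrow\;
a = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b = \bar{y} - a \bar{x}
```

The denominator is zero only when every x_i is identical, which is the one-feature case of the inputs not being linearly independent (the feature column is then a multiple of the constant column).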
Use numpy to apply our prediction to test input data

Plot the model’s predictions

Not good! Not really a linear relationship, so our conclusion is suspect
[Plot labels: Luxembourg, Equatorial Guinea]

Transform the misbehaving dimension
LifeEx = 0.00027818 * GDPpc + 67.9537

LifeEx = 4.62409602 * log(GDPpc) + 31.81983

Just as before, apply model
to test input data
LifeEx = 4.62409602 * log(GDPpc) + 31.81983

What happens if GDPpc goes up by 1? Hard to say.

… but what if GDPpc doubles?

NewLifeEx = 4.624 * log(2 * GDPpc) + 31.81983

NewLifeEx = 4.624 * [log(2) + log(GDPpc)] + 31.81983
NewLifeEx = [4.624 log(2)] + [4.624 log(GDPpc) + 31.81983]
NewLifeEx = [4.624 log(2)] + LifeEx
NewLifeEx = 3.21 + LifeEx
On average, if you double GDPpc, life expectancy goes up by 3.21 years
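A quick numerical check of the doubling argument, using the coefficients of the log model and assuming `log` is the natural logarithm (which is what makes 4.624 * log(2) come out to 3.21). The starting GDPpc values are arbitrary examples.

```python
import numpy as np

A, B = 4.62409602, 31.81983   # coefficients of the fitted log model

def life_ex(gdp_pc):
    """Predicted life expectancy under the log model."""
    return A * np.log(gdp_pc) + B

# The gain from doubling is the same no matter where you start.
for gdp_pc in (1_000, 10_000, 50_000):
    gain = life_ex(2 * gdp_pc) - life_ex(gdp_pc)
    print(f"GDPpc {gdp_pc:>6} -> {2 * gdp_pc:>6}: +{gain:.2f} years")
```

Each line prints a gain of 3.21 years, because the starting value cancels out and only A * log(2) remains.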
Who wore it best?
●
Linear model: on average, when GDPpc increases by 1, life expectancy
increases by 0.00028 years
●
Log model: on average, if you double GDPpc, life expectancy goes
up by 3.21 years
Variance
●
Variance of the data: average squared distance to mean
●
“Variance” of the model (MSE): average squared distance to prediction

Explained Variance
R2 = 1 – (model variance / total variance)
●
Model variance: variance after modeling (green)
●
Total variance: variance before modeling (red)
●
Close to 1 → small model variance compared to big total, good!
●
Close to 0 → nearly identical model and data variance, bad!
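A small sketch of the explained-variance computation, taking R2 to be 1 minus the ratio of model variance (MSE) to total variance. The five data points are made up to show the two extremes, and `r_squared` is a hypothetical helper name.

```python
import numpy as np

def r_squared(y, y_pred):
    """Explained variance: 1 - (model variance / total variance)."""
    total_variance = np.mean((y - y.mean()) ** 2)   # avg squared distance to mean
    model_variance = np.mean((y - y_pred) ** 2)     # avg squared distance to prediction
    return 1 - model_variance / total_variance

y = np.array([60.0, 65.0, 70.0, 75.0, 80.0])   # made-up life expectancies

perfect = y.copy()                       # predicts every point exactly
mean_only = np.full_like(y, y.mean())    # always predicts the mean

print(r_squared(y, perfect))     # 1.0
print(r_squared(y, mean_only))   # 0.0
```

A model that does no better than always predicting the mean scores 0, and a model with no error scores 1.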
Train models and get
predicted values

Compute R2
→ 39% of the variance is explained by
the linear relationship between GDPpc
and life expectancy

→ 67% of the variance is explained by
the linear relationship between the log
of GDPpc and life expectancy

✅ Better fit: the log model
●
Linear model: on average, when GDPpc increases by 1, life expectancy
increases by 0.00028 years
●
Log model: on average, if you double GDPpc, life expectancy goes
up by 3.21 years
Overfitting
●
Choosing a model based on fit only works when
the model is regularized (constrained)

✅ 😭
Overfitting

R2 = 0.396085
R2 = 0.589182

R2 = 0.521163
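To see why fit alone is a risky selection criterion, the sketch below fits polynomials of increasing degree to synthetic data whose true relationship is a straight line. The data, the degrees, and the train/test split are all assumptions made for illustration; they are not the models behind the R2 values on the slide.

```python
import numpy as np

def r_squared(y, y_pred):
    return 1 - np.mean((y - y_pred) ** 2) / np.mean((y - y.mean()) ** 2)

def truth(x):
    return 2 * x + 1   # the true relationship is a straight line

rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 20)
x_test = rng.uniform(-1, 1, size=200)
y_train = truth(x_train) + rng.normal(0, 0.5, size=x_train.size)
y_test = truth(x_test) + rng.normal(0, 0.5, size=x_test.size)

train_r2, test_r2 = {}, {}
for degree in (1, 3, 7):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_r2[degree] = r_squared(y_train, np.polyval(coeffs, x_train))
    test_r2[degree] = r_squared(y_test, np.polyval(coeffs, x_test))
    print(f"degree {degree}: train R2 = {train_r2[degree]:.3f}, "
          f"test R2 = {test_r2[degree]:.3f}")
```

Training R2 cannot go down as the degree rises, since each higher-degree model contains the lower-degree ones. The held-out R2 has no such guarantee, which is why an unconstrained model can win on fit without being the better model.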
Outline
●
Course logistics
●
Brief review
●
Linear models: interpretation
●
Linear models: analysis
Significance?
●
How do we know our results are significant?
●
Hypothesis testing! Test stat = coefficient

LifeEx = 0.00027818 * GDPpc + 67.9537

Permutation Testing
●
Null hypothesis: the association between features and
outputs is arbitrary.
●
What is the distribution of the slope under this
hypothesis?
●
Make a copy of x and shuffle it
●
Plot 1 trial from the null world
●
Do our simulation in a loop
●
Measure the coefficient under the null hypothesis
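A minimal sketch of the permutation test for the slope: shuffle a copy of x, refit, record the slope, and repeat. The data is synthetic (an assumption, since the course dataset is not included here), and the helper names are hypothetical. The comparison uses absolute values, which is one reasonable reading of "equal or stronger".

```python
import numpy as np

def slope(x, y):
    """Test statistic: the fitted coefficient of a one-feature linear model."""
    return np.polyfit(x, y, deg=1)[0]

def permutation_slopes(x, y, n_trials=1000, seed=0):
    """Slopes measured in the null world, where the x/y pairing is arbitrary."""
    rng = np.random.default_rng(seed)
    null_slopes = np.empty(n_trials)
    for i in range(n_trials):
        x_shuffled = rng.permutation(x)   # shuffled copy; x itself is untouched
        null_slopes[i] = slope(x_shuffled, y)
    return null_slopes

# Synthetic stand-in data (assumed).
rng = np.random.default_rng(1)
x = rng.uniform(500, 60_000, size=150)
y = 0.00028 * x + 68 + rng.normal(0, 3, size=150)

observed = slope(x, y)
null_slopes = permutation_slopes(x, y)
p_value = np.mean(np.abs(null_slopes) >= abs(observed))
print(f"observed slope = {observed:.6f}, empirical p = {p_value:.3f}")
```

Shuffling x breaks the pairing with y while keeping both marginal distributions, so the null slopes cluster around zero.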
Time is a Dimension
●
We’ve been looking at just 2015 data, but we can
look at more years.
●
Prior conclusions: if you go from a country with GDPpc of X to a
country with GDPpc of X’...
[Plots: model fit on all data; model fit on only 2015 data]

Permutation Testing
●
How can we test for significance with time?
●
Possible null hypothesis: input/output pairing is arbitrary
[Plot: average GDPpc and life expectancy over time; the average is increasing year over year]
What’s wrong with assuming the pairing is arbitrary?

Aside: for folks who have learned about hypothesis testing before, this is
multicollinearity of GDPpc and LifeEx on year
… how would you fix it? See if your answer is more or less complex than
our fix...
Permutation Testing
●
How do we fix it?
●
Null hypothesis: input/output pairing is arbitrary within each year
●
Preserves increased averages over time in the null world
●
Group by year, apply a random permutation
●
Put it in a loop
●
Collect test stats from null world
●
We don’t see anything close to this strong
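The within-year permutation can be sketched with plain NumPy: for each year, shuffle the inputs only among rows from that year. The panel below is synthetic (5 years, 40 countries, rising averages) and exists only to exercise the code. All names and numbers are assumptions.

```python
import numpy as np

def permute_within_groups(values, groups, rng):
    """Shuffle values only among rows that share the same group label."""
    shuffled = values.copy()
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        shuffled[idx] = rng.permutation(values[idx])
    return shuffled

def slope(x, y):
    return np.polyfit(x, y, deg=1)[0]

rng = np.random.default_rng(0)

# Synthetic panel: both averages drift upward year over year.
years = np.repeat(np.arange(2011, 2016), 40)
drift = years - 2011
log_gdp = rng.normal(9, 1, size=years.size) + 0.05 * drift
life_ex = 4.6 * log_gdp + 32 + 0.3 * drift + rng.normal(0, 3, size=years.size)

observed = slope(log_gdp, life_ex)
null_slopes = np.array([
    slope(permute_within_groups(log_gdp, years, rng), life_ex)
    for _ in range(500)
])
p_value = np.mean(null_slopes >= observed)
print(f"observed slope = {observed:.3f}, empirical p = {p_value:.3f}")
```

Because each year keeps its own set of values, the year-over-year increase in the averages survives in the null world. Only the pairing within a year is broken.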
Interpretation
All time

5.181 * log(2) = 3.59

Only 2015

4.624 * log(2) = 3.21

Grouped Samples
Steepness of slopes within
groups can be viewed as the
efficiency of converting GDPpc
into LifeEx

How would you design tests for:
●
Is there a significant
difference between two
countries?
●
Is the distribution of slopes
significantly different than
random?
Next Time
●
What if I have more than one input / feature
variable?
