10 - Linear Models

The document outlines a course on Linear Models and Analysis, covering course logistics, upcoming deadlines, and key concepts such as hypothesis testing and vectorization in Python. It discusses the relationship between GDP per capita and life expectancy, including model interpretation and significance testing through permutation methods. The course also addresses overfitting and the importance of regularization in model selection.


Linear Models & Analysis

CIS 5450, Spring 2025

Ryan Marcus
Outline

Course logistics

Brief review

Linear models: interpretation

Linear models: analysis
Upcoming Deadlines

HW2 due Mar 2nd (12 days)


Midterm 1 on Mar 4th (14 days)

Sample midterm released this week
Office Hours

Average wait time: 3.5 minutes

P95 wait time: 18 minutes
Outline

Course logistics

Brief review

Linear models: interpretation

Linear models: analysis
Is Yawning Contagious?

Mythbusters episode: put random people into two rooms, had someone yawn in one room, and recorded how many people yawned.

Group: CONTROL or TREAT

Effect: “no yawn” or “yawn”

Example adapted from Dr. Maria Tackett


Is Yawning Contagious?

25% of the control group yawned

29% of the treatment group yawned

Is it a real effect, or just random chance?



Is Yawning Contagious?

We’re going to be doing
hypothesis testing via
simulation

Intuitively:
“If we assume there is no
effect, and we did a
random trial, what is the
probability of seeing a
result like what we saw?”
Hypothesis Testing

In our real data, we saw a 0.04 difference, but in the null world, we saw a 0.007 difference. Much smaller!

… but that was just one trial


Hypothesis Testing
Hypothesis Testing

How often was our simulated test statistic in the null world equal to or stronger than what we observed?

Empirical P value
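
The simulation idea above can be sketched in plain Python. This is a minimal sketch, not the course's code. The slides give only the percentages, so the group sizes and yawn counts below (16 control with 4 yawners, 34 treatment with 10 yawners) are assumptions chosen to match the 25% and 29% figures, and the function name `empirical_p_value` is hypothetical. Because of those assumptions, the resulting p value need not match the number shown on the slides.

```python
import random

# Assumed counts (not stated in the slides), chosen to match 25% and 29%.
N_CONTROL, YAWN_CONTROL = 16, 4
N_TREAT, YAWN_TREAT = 34, 10

# Observed test statistic: difference in yawn rates (treat minus control).
observed = YAWN_TREAT / N_TREAT - YAWN_CONTROL / N_CONTROL


def empirical_p_value(n_trials=10_000, seed=0):
    """Fraction of null-world trials at least as strong as the observed effect."""
    rng = random.Random(seed)
    # 1 = yawned, 0 = did not yawn, pooled across both groups.
    n_yawn = YAWN_CONTROL + YAWN_TREAT
    outcomes = [1] * n_yawn + [0] * (N_CONTROL + N_TREAT - n_yawn)
    hits = 0
    for _ in range(n_trials):
        rng.shuffle(outcomes)  # null world: group labels are arbitrary
        treat = outcomes[:N_TREAT]
        control = outcomes[N_TREAT:]
        diff = sum(treat) / N_TREAT - sum(control) / N_CONTROL
        if diff >= observed - 1e-12:  # small tolerance for float comparison
            hits += 1
    return hits / n_trials


p = empirical_p_value()
print(f"observed difference: {observed:.3f}, empirical p: {p:.3f}")
```

This sketch uses a one-sided comparison (treatment rate minus control rate); a two-sided version would compare absolute differences instead.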
Hypothesis Testing

P values can be confusing.

Assuming the null hypothesis (treat & control are the same), there is a p = 39.8% chance of observing an equal or stronger effect.

Assuming the null, if you did this experiment 1000 times, you would expect 398 runs to show an equal or stronger effect.

FALSE: “There is only a 39.8% chance the null hypothesis is true!”

FALSE: “There is a 100 – p = 60% chance that yawning is contagious!”
Vectorization

Python, as a programming language, often
trades speed for ease-of-use.

… but we’ve seen that doing things with Python libraries can be fast!

DuckDB, Polars

These libraries do all the hard work for you, and give you a
nice interface

Another similar tool: NumPy (numeric python)
Vectorization: Yawning

Big improvement!
Vectorization: The Downside

Downside to vectorization: memory usage

Large arrays require space

Fix: batching!

Idea: use numpy to simulate a small number (10k) of trials, tally up those results, then simulate another small batch
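
The batching idea can be sketched with NumPy as follows. This is an illustrative sketch under assumptions, not the lecture's implementation: the function name `batched_p_value` is hypothetical, and the yawn counts (14 yawners out of 50 people, 34 of them in the treatment group) are assumed values consistent with the 25% and 29% rates. Each batch builds one row per simulated trial, shuffles every row independently with `Generator.permuted`, and only the running tally is kept between batches.

```python
import numpy as np


def batched_p_value(outcomes, n_treat, observed, n_trials=100_000,
                    batch_size=10_000, seed=0):
    """Empirical p value, simulated in batches to bound memory usage."""
    rng = np.random.default_rng(seed)
    outcomes = np.asarray(outcomes)
    hits = 0
    done = 0
    while done < n_trials:
        b = min(batch_size, n_trials - done)
        # One row per trial; permuted(axis=1) shuffles each row independently.
        trials = rng.permuted(np.tile(outcomes, (b, 1)), axis=1)
        treat_rate = trials[:, :n_treat].mean(axis=1)
        control_rate = trials[:, n_treat:].mean(axis=1)
        diffs = treat_rate - control_rate
        hits += int(np.count_nonzero(diffs >= observed - 1e-12))
        done += b
    return hits / n_trials


# Assumed data (not from the slides): 1 = yawned, 0 = did not yawn.
outcomes = np.array([1] * 14 + [0] * 36)
observed = 10 / 34 - 4 / 16
p = batched_p_value(outcomes, n_treat=34, observed=observed)
print(f"empirical p: {p:.3f}")
```

With a batch size of 10k and 50 people per trial, each batch holds only 500k integers at a time, regardless of how many total trials are requested.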
Outline

Course logistics

Brief review

Linear models: interpretation

Linear models: analysis
Linear Models
How does GDP per capita
relate to life expectancy?

Before now: compare two groups (treat vs.
control)

… but what if my input variable isn’t binary?
Seems like a clear relationship – can we model it?

[Figure: scatter plot of the data; labeled points: Luxembourg, US, Equatorial Guinea]
LifeExp = a * GDPpc + b

How to find a and b?

a, the slope

b, the intercept

On average, when GDPpc increases by 1, life expectancy increases by 0.00028 years
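
One common way to find a and b is a least-squares fit, sketched below with `np.polyfit`. The slides do not show which routine the course uses, so treat this as one possible approach. The data here is synthetic and generated only for illustration (it is not the course's GDP dataset), and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the real GDP / life expectancy data).
gdp_pc = rng.uniform(500, 60_000, size=200)
life_exp = 0.0003 * gdp_pc + 68 + rng.normal(0, 2, size=200)

# A degree-1 polyfit returns [slope, intercept], minimizing squared error.
a, b = np.polyfit(gdp_pc, life_exp, deg=1)

# Vectorized prediction: applies the line to every input at once.
predicted = a * gdp_pc + b
print(f"slope a = {a:.6f}, intercept b = {b:.2f}")
```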
Linear Models
LifeExp = a * GDPpc + b

How do we know these are the “right” parameters?

The squared error loss function is convex, so any minimum we find is a global minimum (and it is unique, assuming your input features are linearly independent!)
Use numpy to apply our prediction to test input data

Plot the model’s predictions

Not good!

Not really a linear relationship, so our conclusion is suspect

[Figure: scatter plot of the data; labeled points: Luxembourg, US, Equatorial Guinea]

Transform the misbehaving dimension
LifeEx = 0.00027818 * GDPpc + 67.9537

LifeEx = 4.62409602 * log(GDPpc) + 31.81983


Just as before, apply model
to test input data
LifeEx = 4.62409602 * log(GDPpc) + 31.81983

What happens if GDPpc goes up by 1? Hard to say.


… but what if GDPpc doubles?

NewLifeEx = 4.624 * log(2 * GDPpc) + 31.81983


NewLifeEx = 4.624 * [log(2) + log(GDPpc)] + 31.81983
NewLifeEx = [4.624 log(2)] + [4.624 log(GDPpc) + 31.81983]
NewLifeEx = [4.624 log(2)] + LifeEx
NewLifeEx = 3.21 + LifeEx
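
The derivation above can be checked numerically. The sketch below assumes `log` means the natural logarithm (which is what makes 4.624 * log(2) come out to about 3.21) and uses the coefficients from the slide; the function name `life_ex` is hypothetical.

```python
import math

# Coefficients from the slide's log model.
A, B = 4.62409602, 31.81983


def life_ex(gdp_pc):
    """Predicted life expectancy; assumes natural log."""
    return A * math.log(gdp_pc) + B


# The gain from doubling should not depend on the starting GDPpc.
for gdp_pc in (1_000, 10_000, 50_000):
    gain = life_ex(2 * gdp_pc) - life_ex(gdp_pc)
    print(f"GDPpc {gdp_pc} -> {2 * gdp_pc}: +{gain:.2f} years")

doubling_gain = A * math.log(2)
```

The same gain appears at every starting point, which is why the log model supports a statement about doubling rather than about a one-unit increase.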
On average, if you double GDPpc, life expectancy goes up by 3.21 years
Who wore it best?

On average, when GDPpc increases by 1, life expectancy increases by 0.00028 years

On average, if you double GDPpc, life expectancy goes up by 3.21 years
Variance

Variance of the data: average squared distance to mean

“Variance” of the model (MSE): average squared distance to prediction


Explained Variance

R2 = 1 − (model variance / total variance)

Variance before modeling (red): total variance

Variance after modeling (green): model variance

Close to 1 → small model variance compared to big total, good!

Close to 0 → nearly identical model and data variance, bad!
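
The explained-variance computation can be sketched in a few lines of NumPy, assuming the usual definition in which R2 is one minus the ratio of model variance (MSE) to total variance. The function name `r_squared` and the toy arrays are hypothetical, chosen to show the two extremes described above.

```python
import numpy as np


def r_squared(y, y_pred):
    """1 - (model variance / total variance)."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    total_variance = np.mean((y - y.mean()) ** 2)  # distance to the mean
    model_variance = np.mean((y - y_pred) ** 2)    # distance to prediction (MSE)
    return 1.0 - model_variance / total_variance


y = np.array([1.0, 2.0, 3.0, 4.0])

# Perfect predictions: model variance is 0, so R2 is 1.
perfect = r_squared(y, y)

# Always predicting the mean: model variance equals total variance, so R2 is 0.
baseline = r_squared(y, np.full_like(y, y.mean()))
print(perfect, baseline)
```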
Train models and get predicted values

Compute R2

→ 39% of the variance is explained by the linear relationship between GDPpc and life expectancy

→ 67% of the variance is explained by the linear relationship between the log of GDPpc and life expectancy

On average, when GDPpc increases by 1, life expectancy increases by 0.00028 years

On average, if you double GDPpc, life expectancy goes up by 3.21 years
Overfitting

Choosing a model based on fit only works when
the model is regularized (constrained)

✅ 😭
Overfitting

R2 = 0.396085
R2 = 0.589182

R2 = 0.521163
Outline

Course logistics

Brief review

Linear models: interpretation

Linear models: analysis
Significance?

How do we know our results are significant?

Hypothesis testing! Test stat = coefficient

LifeEx = 0.00027818 * GDPpc + 67.9537


Permutation Testing

Null hypothesis: the association between features and outputs is arbitrary.

What is the distribution of the slope under this hypothesis?

Make a copy of x and shuffle it

Plot 1 trial from the null world
Do our simulation in a loop

Measure the coefficient under the null hypothesis
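
A permutation test on the slope might look like the sketch below. It is a minimal illustration under assumptions: the data is synthetic (a stand-in for log(GDPpc) and life expectancy, not the course dataset), the helper names `slope` and `permutation_slopes` are hypothetical, and the p value here is two-sided, which the slides do not specify.

```python
import numpy as np


def slope(x, y):
    """Least-squares slope of y on x."""
    return np.polyfit(x, y, deg=1)[0]


def permutation_slopes(x, y, n_trials=1_000, seed=0):
    """Slopes measured in the null world, where x/y pairing is arbitrary."""
    rng = np.random.default_rng(seed)
    slopes = np.empty(n_trials)
    for i in range(n_trials):
        x_shuffled = rng.permutation(x)  # shuffled copy; x itself is unchanged
        slopes[i] = slope(x_shuffled, y)
    return slopes


# Synthetic stand-in data (NOT the real dataset).
rng = np.random.default_rng(1)
x = rng.uniform(6, 11, size=150)
y = 4.6 * x + 32 + rng.normal(0, 4, size=150)

observed = slope(x, y)
null_slopes = permutation_slopes(x, y)

# Two-sided empirical p value: how often is the null slope at least as extreme?
p = np.mean(np.abs(null_slopes) >= abs(observed))
print(f"observed slope: {observed:.3f}, empirical p: {p:.3f}")
```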
Time is a Dimension

We’ve been looking at just 2015 data, but we can look at more years.

Prior conclusions: if you go from a country with GDPpc of X to a country with GDPpc of X’...

[Figures: model on all data; model on only 2015 data]
Permutation Testing

Average GDPpc and life expectancy over time

Average is increasing year over year

How can we test for significance with time?

Possible null hypothesis: input/output pairing is arbitrary

What’s wrong with assuming the pairing is arbitrary?
Permutation Testing

How can we test for significance with time?

Possible null hypothesis: input/output pairing is arbitrary

Aside: for folks who have learned about hypothesis testing before, this is multicollinearity of GDPpc and LifeEx on year … how would you fix it? See if your answer is more or less complex than our fix...
Permutation Testing

How do we fix it?

Null hypothesis:
input/output pairing is
arbitrary within each
year

Preserves increased averages
over time in the null world
Group by year, apply a random permutation
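
Shuffling within each year can be sketched with NumPy as below. This is one possible implementation, not necessarily the one used in lecture (which may rely on a dataframe group-by). The helper names `shuffle_within_groups` and `stratified_null_slopes` are hypothetical, and the data is synthetic, built so that both variables drift upward with the year.

```python
import numpy as np


def shuffle_within_groups(x, groups, rng):
    """Return a copy of x, permuted separately inside each group."""
    shuffled = np.array(x, copy=True)
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        shuffled[idx] = rng.permutation(shuffled[idx])
    return shuffled


def stratified_null_slopes(x, y, groups, n_trials=500, seed=0):
    """Slopes from a null world where pairing is arbitrary within each group."""
    rng = np.random.default_rng(seed)
    return np.array([
        np.polyfit(shuffle_within_groups(x, groups, rng), y, deg=1)[0]
        for _ in range(n_trials)
    ])


# Synthetic stand-in data: x and y both trend upward over the years.
rng = np.random.default_rng(2)
years = np.repeat(np.arange(2000, 2016), 40)
trend = (years - 2000) * 0.05
noise_x = rng.normal(0, 1, size=years.size)
x = 8 + trend + noise_x
y = 60 + 4 * trend + 3 * noise_x + rng.normal(0, 2, size=years.size)

observed = np.polyfit(x, y, deg=1)[0]
null_slopes = stratified_null_slopes(x, y, years)
p = np.mean(np.abs(null_slopes) >= abs(observed))
print(f"observed slope: {observed:.3f}, empirical p: {p:.3f}")
```

Because values never leave their year, the year-over-year increase in the averages survives in every null-world trial, so the test isolates the within-year association.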
Permutation Testing

Put it in a loop

Collect test stats from null world

We don’t see anything close to this strong
Interpretation
All time

5.181 * log(2) = 3.59

Only 2015

4.624 * log(2) = 3.21


Grouped Samples

Steepness of slopes within groups can be viewed as the efficiency of converting GDPpc into LifeEx

How would you design tests for:

Is there a significant difference between two countries?

Is the distribution of slopes significantly different than random?
Next Time

What if I have more than one input / feature
variable?
