
Regression 3: Logistic Regression

Marco Baroni

Practical Statistics in R
Outline

Logistic regression

Logistic regression in R
Outline

Logistic regression
Introduction
The model
Looking at and comparing fitted models

Logistic regression in R
Modeling discrete response variables

- In a very large number of problems in cognitive science and related fields:
  - the response variable is categorical, often binary (yes/no; acceptable/not acceptable; phenomenon takes place/does not take place)
  - the potential explanatory factors (independent variables) are categorical, numerical or both
Examples: binomial responses

- Is linguistic construction X rated as “acceptable” in the following condition(s)?
- Does sentence S, which has features Y, W and Z, display phenomenon X? (linguistic corpus data!)
- Is it common for subjects to decide to purchase the good X given these conditions?
- Did the subject make more errors in this condition?
- How many people answer YES to question X in the survey?
- Do old women like X more than young men?
- Did the subject feel pain in this condition?
- How often was reaction X triggered by these conditions?
- Do children with characteristics X, Y and Z tend to have autism?
Examples: multinomial responses
- Discrete response variable with a natural ordering of the levels:
  - Ratings on a 6-point scale (depending on the number of points on the scale, you might also get away with a standard linear regression)
  - Subjects answer YES, MAYBE, NO
  - Subject reaction is coded as FRIENDLY, NEUTRAL, ANGRY
  - The cochlear data: the experiment is set up so that possible errors are de facto on a 7-point scale
- Discrete response variable without a natural ordering:
  - Subject decides to buy one of 4 different products
  - We have brain scans of subjects seeing 5 different objects, and we want to predict the seen object from features of the scan
  - We model the chances of developing 4 different (and mutually exclusive) psychological syndromes in terms of a number of behavioural indicators
Binomial and multinomial logistic regression models

- Problems with binary (yes/no, success/failure, happens/does not happen) dependent variables are handled by (binomial) logistic regression
- Problems with more than two discrete outcomes are handled by:
  - ordinal logistic regression, if the outcomes have a natural ordering
  - multinomial logistic regression otherwise
- The output of ordinal and especially multinomial logistic regression tends to be hard to interpret; whenever possible, try to reduce the problem to a binary choice
  - E.g., if the output is yes/maybe/no, treat “maybe” as “yes” and/or as “no”
- Here, I focus entirely on the binomial case
Don’t be afraid of logistic regression!

- Logistic regression seems less popular than linear regression
- This might be due in part to historical reasons:
  - the formal theory of generalized linear models is relatively recent: it was developed in the early nineteen-seventies
  - the iterative maximum likelihood methods used for fitting logistic regression models require more computational power than solving the least squares equations
- Results of logistic regression are not as straightforward to understand and interpret as linear regression results
- Finally, there might also be a bit of prejudice against discrete data as less “scientifically credible” than hard-science-like continuous measurements
Don’t be afraid of logistic regression!

- Still, if it is natural to cast your problem in terms of a discrete variable, you should go ahead and use logistic regression
- Logistic regression might be trickier to work with than linear regression, but it’s still much better than pretending that the variable is continuous or artificially re-casting the problem in terms of a continuous response
The Machine Learning angle

- Classification of a set of observations into 2 or more discrete categories is a central task in Machine Learning
- The classic supervised learning setting:
  - Data points are represented by a set of features, i.e., discrete or continuous explanatory variables
  - The “training” data also have a label indicating the class of the data point, i.e., a discrete binomial or multinomial dependent variable
  - A model (e.g., in the form of weights assigned to the features) is fitted on the training data
  - The trained model is then used to predict the class of unseen data points (where we know the values of the features, but we do not have the label)
The Machine Learning angle

- Same setting as logistic regression, except that the emphasis is placed on predicting the class of unseen data, rather than on the significance of the effect of the features/independent variables (which are often too many – hundreds or thousands – to be analyzed individually) in discriminating the classes
- Indeed, logistic regression is also a standard technique in Machine Learning, where it is sometimes known as Maximum Entropy
Outline

Logistic regression
Introduction
The model
Looking at and comparing fitted models

Logistic regression in R
Classic multiple regression

- The by now familiar model:

  y = β0 + β1 × x1 + β2 × x2 + ... + βn × xn + ε

- Why will this not work if the response variable is binary (0/1)?
- Why will it not work if we try to model proportions instead of responses (e.g., the proportion of YES-responses in condition C)?
Modeling log odds ratios
- Following up on the “proportion of YES-responses” idea, let’s say that we want to model the probability of one of the two responses (which can be seen as the population proportion of the relevant response for a certain choice of the values of the independent variables)
- Probability ranges from 0 to 1, but we can look at the logarithm of the odds ratio instead:

  logit(p) = log(p / (1 − p))

- This is the logarithm of the ratio of the probability of a 1-response to the probability of a 0-response
- It is arbitrary what counts as a 1-response and what counts as a 0-response, although this might hinge on the ease of interpretation of the model (e.g., treating YES as the 1-response will probably lead to more intuitive results than treating NO as the 1-response)
- Log odds ratios are not the most intuitive measure (at least for me), but they range continuously from −∞ to +∞ (see the sketch below)
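A minimal sketch of the logit transform and its inverse in R, using the base functions qlogis() and plogis() (equivalent to hand-coding log(p/(1-p)) and exp(x)/(1+exp(x))); the probability values are made up for illustration:

> p <- c(0.01, 0.25, 0.5, 0.75, 0.99)  # example probabilities
> qlogis(p)                            # logit: unbounded, symmetric around 0
> plogis(qlogis(p))                    # inverse logit maps back to the original probabilities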
From probabilities to log odds ratios

[Figure: logit(p) (y-axis, about −5 to 5) plotted as a function of p (x-axis, 0 to 1)]
The logistic regression model

- Predicting log odds ratios:

  logit(p) = β0 + β1 × x1 + β2 × x2 + ... + βn × xn

- Back to probabilities:

  p = e^logit(p) / (1 + e^logit(p))

- Thus:

  p = e^(β0 + β1 × x1 + β2 × x2 + ... + βn × xn) / (1 + e^(β0 + β1 × x1 + β2 × x2 + ... + βn × xn))
From log odds ratios to probabilities

[Figure: p (y-axis, 0 to 1) plotted as a function of logit(p) (x-axis, −10 to 10): the S-shaped logistic curve]
Probabilities and responses

[Figure: the same S-shaped curve of p against logit(p), with the observed responses plotted as points along the top (1-responses) and bottom (0-responses) of the plot]
A subtle point: no error term

- NB:

  logit(p) = β0 + β1 × x1 + β2 × x2 + ... + βn × xn

- The outcome here is not the observation, but (a function of) p, the expected value of the observation, i.e., the probability of a 1-response given the current values of the independent variables
- This probability has the classic “coin tossing” Bernoulli distribution, and thus the variance is not a free parameter to be estimated from the data, but a model-determined quantity given by p(1 − p)
- Notice that the errors, computed as observation − p, are not independently normally distributed: their magnitude must be near 0 or near 1 for high and low values of p, and near .5 for values of p in the middle (see the simulation below)
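A small simulation of this Bernoulli error structure in R (the probabilities are made up): for a given p, the variance of the 0/1 observations is p(1 − p), and the residuals observation − p can take only the two values 1 − p and −p:

> p <- c(0.05, 0.5, 0.95)                                       # three expected probabilities
> y <- sapply(p, function(pp) rbinom(10000, size=1, prob=pp))   # simulated 0/1 responses
> apply(y, 2, var)                                              # close to p * (1 - p)
> unique(y[, 2] - p[2])                                         # residuals are only 1 - p and -p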
The generalized linear model

- Logistic regression is an instance of a “generalized linear model”
- Somewhat brutally, in a generalized linear model:
  - a weighted linear combination of the explanatory variables models a function of the expected value of the dependent variable (the “link” function)
  - the actual data points are modeled in terms of a distribution function that has the expected value as a parameter
- This is a general framework that uses the same fitting techniques to estimate models for different kinds of data
Linear regression as a generalized linear model

- Linear prediction of a function of the mean:

  g(E(y)) = Xβ

- The “link” function is the identity:

  g(E(y)) = E(y)

- Given the mean, observations are normally distributed, with variance estimated from the data
- This corresponds to the error term with mean 0 in the linear regression model
Logistic regression as a generalized linear model

- Linear prediction of a function of the mean:

  g(E(y)) = Xβ

- The “link” function is the logit:

  g(E(y)) = log(E(y) / (1 − E(y)))

- Given E(y), i.e., p, observations have a Bernoulli distribution with variance p(1 − p)
Estimation of logistic regression models

- Minimizing the sum of squared errors is not a good way to fit a logistic regression model
- The least squares method is based on the assumption that errors are normally distributed and independent of the expected (fitted) values
- As we just discussed, in logistic regression the errors depend on the expected (p) values (large variance near .5, variance approaching 0 as p approaches 1 or 0), and for each p they can take only two values (1 − p if the response was 1, −p otherwise)
Estimation of logistic regression models

- The β terms are estimated instead by maximum likelihood, i.e., by searching for the set of βs that makes the observed responses maximally likely (i.e., a set of βs that will in general assign a high p to 1-responses and a low p to 0-responses)
- There is no closed-form solution to this problem, and the optimal vector of βs is found with iterative “trial and error” techniques
- The two approaches are related: least-squares fitting finds the maximum likelihood estimate for linear regression, and, vice versa, maximum likelihood fitting of logistic regression is done by a form of iteratively (re)weighted least squares fitting (see the sketch below)
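glm() takes care of all of this for you; purely to demystify the procedure, here is a minimal sketch of the iteratively reweighted least squares idea on simulated data (not something you would do in practice):

> set.seed(1)
> x <- rnorm(200)
> y <- rbinom(200, 1, plogis(-1 + 2 * x))   # simulated binary responses
> X <- cbind(1, x)                          # design matrix with an intercept column
> beta <- c(0, 0)                           # starting values
> for (i in 1:10) {                         # a handful of iterations is enough
+   p <- plogis(X %*% beta)                 # current fitted probabilities
+   W <- as.vector(p * (1 - p))             # Bernoulli variances used as weights
+   z <- X %*% beta + (y - p) / W           # "working" response
+   beta <- solve(t(X) %*% (W * X), t(X) %*% (W * z))   # weighted least squares step
+ }
> cbind(irls = beta, glm = coef(glm(y ~ x, family = "binomial")))   # nearly identical estimates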
Outline

Logistic regression
Introduction
The model
Looking at and comparing fitted models

Logistic regression in R
Interpreting the βs

- Again, as a rough-and-ready criterion, if a β is more than 2 standard errors away from 0, we can say that the corresponding explanatory variable has an effect that is significantly different from 0 (at α = 0.05)
- However, p is not a linear function of Xβ, and the same β will correspond to a more drastic impact on p towards the center of the p range than near the extremes (recall the S shape of the p curve)
- As a rule of thumb (the “divide by 4” rule), β/4 is an upper bound on the difference in p brought about by a unit difference in the corresponding explanatory variable (see the example below)
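A quick numerical check of the divide-by-4 rule with a made-up coefficient (β = 1.2): the change in p per unit change in the predictor is largest around p = .5 and never exceeds β/4:

> beta <- 1.2                     # hypothetical coefficient
> beta / 4                        # upper bound on the change in p: 0.3
> plogis(0 + beta) - plogis(0)    # change around p = .5: about 0.27
> plogis(3 + beta) - plogis(3)    # change near the extremes: about 0.03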
Goodness of fit

- Again, measures such as R² based on residual errors are not very informative
- One intuitive measure of fit is the error rate, given by the proportion of data points for which the model assigns p > .5 to 0-responses or p < .5 to 1-responses
- This can be compared to the baseline in which the model always predicts 1 if the majority of data points are 1, or 0 if the majority of data points are 0 (the baseline error rate is given by the proportion of minority responses over the total)
- Some information is lost (a .9 and a .6 prediction are treated equally)
- Other measures of fit have been proposed in the literature, but there is no widely agreed upon standard
Binned goodness of fit

- Goodness of fit can be inspected visually by grouping the ps into equally wide bins (0–0.1, 0.1–0.2, ...) and plotting the average p predicted by the model for the points in each bin against the observed proportion of 1-responses for the data points in the bin (a rough sketch follows)
- We can also compute an R² or another goodness-of-fit measure on these binned data
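A rough sketch of the binning idea for a generic fitted model; my.glm and the 0/1 response vector y are placeholders (the languageR/Design functions used later do a more polished job):

> p.hat <- fitted(my.glm)                                  # predicted probabilities
> bins <- cut(p.hat, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
> plot(tapply(p.hat, bins, mean), tapply(y, bins, mean),   # mean predicted p vs. observed proportion, per bin
+      xlim = c(0, 1), ylim = c(0, 1),
+      xlab = "mean predicted p", ylab = "observed proportion of 1-responses")
> abline(0, 1)                                             # points on this line indicate perfect calibration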
Deviance

- Deviance is an important measure of fit of a model, used also to compare models
- Simplifying somewhat, the deviance of a model is −2 times the log likelihood of the data under the model
  - plus a constant that would be the same for all models for the same data, and so can be ignored, since we always look at differences in deviance
- The larger the deviance, the worse the fit
- As we add parameters, deviance decreases
Deviance

- The difference in deviance between a simpler and a more complex model approximately follows a χ² distribution, with the difference in the number of parameters as degrees of freedom
- This leads to the handy rule of thumb that the improvement is significant (at α = .05) if the deviance difference is larger than the parameter difference (play around with pchisq() in R to see that this is the case; a short example follows)
- A model can also be compared against the “null” model that always predicts the same p (given by the proportion of 1-responses in the data) and has only one parameter (the fixed predicted value)
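For instance, with made-up numbers, a drop in deviance of 10 for 3 extra parameters can be checked against the χ² distribution with 3 degrees of freedom:

> pchisq(10, df = 3, lower.tail = FALSE)   # p-value of about 0.019: a significant improvement
> qchisq(0.95, df = 3)                     # critical value at α = .05: about 7.81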
Outline

Logistic regression

Logistic regression in R
Preparing the data and fitting the model
Practice
Back to Graffeo et al.’s discount study
Fields in the discount.txt file

- subj: unique subject code
- sex: M or F
- age: NB: contains some NAs
- presentation: absdiff (amount of discount), result (price after discount), percent (percentage discount)
- product: pillow, (camping) table, helmet, (bed) net
- choice: Y (buys), N (does not buy) → the discrete response variable
Preparing the data

- Read the file into an R data frame, look at the summaries, etc.
- Note in the summary of age that R “understands” NAs (i.e., it is not treating age as a categorical variable)
- We can filter out the rows containing NAs as follows:

> e<-na.omit(d)

- Compare the summaries of d and e
- na.omit can also be passed as an option to the modeling functions, but I feel uneasy about that
- Attach the NA-free data frame (a possible sequence of commands is sketched below)
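One possible sequence of commands, assuming discount.txt is a tab-delimited file with a header row in the current working directory (adjust the read.delim() arguments if the format differs):

> d <- read.delim("discount.txt")   # read the file into a data frame
> summary(d)                        # note the NA count in the summary of age
> e <- na.omit(d)                   # drop the rows containing NAs
> summary(e)                        # compare with the summary of d
> attach(e)                         # make the columns directly accessible by name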
Logistic regression in R

> sex_age_pres_prod.glm<-glm(choice~sex+age+
presentation+product,family="binomial")

> summary(sex_age_pres_prod.glm)
Selected lines from the summary() output

- Estimated β coefficients, standard errors and z scores (β/std. error):

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)
sexM                -0.332060   0.140008  -2.372  0.01771 *
age                 -0.012872   0.006003  -2.144  0.03201 *
presentationpercent  1.230082   0.162560   7.567 3.82e-14 *
presentationresult   1.516053   0.172746   8.776  < 2e-16 *

- Note the automated creation of binary dummy variables: discounts presented as percentages and as resulting values are significantly more likely to lead to a purchase than discounts expressed as an absolute difference (the default level)
  - use relevel() to set another level of a categorical variable as the default (see the example below)
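For example (assuming presentation was read in as a factor; re-fit and re-inspect the model after releveling):

> e$presentation <- relevel(e$presentation, ref = "percent")   # make "percent" the reference level
> summary(glm(choice ~ sex + age + presentation + product, family = "binomial", data = e))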
Deviance

- For the “null” model and for the current model:

Null deviance:     1453.6 on 1175 degrees of freedom
Residual deviance: 1284.3 on 1168 degrees of freedom

- The difference in deviance (169.3) is much higher than the difference in parameters (7), suggesting that the current model is significantly better than the null model
Comparing models

- Let us add a sex by presentation interaction term:

> interaction.glm<-glm(choice~sex+age+presentation+
product+sex:presentation,family="binomial")

- Are the extra parameters justified?

> anova(sex_age_pres_prod.glm,interaction.glm,
test="Chisq")
...
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1      1168    1284.25
2      1166    1277.68  2     6.57      0.04

- Apparently, yes (although summary(interaction.glm) suggests just a marginal interaction between sex and the percentage dummy variable)
Error rate
- The model makes an error when it assigns p > .5 to an observation where choice is N, or p < .5 to an observation where choice is Y:

> sum((fitted(sex_age_pres_prod.glm)>.5 & choice=="N") |
(fitted(sex_age_pres_prod.glm)<.5 & choice=="Y")) /
length(choice)
[1] 0.2721088

- Compare this to the error rate of a baseline model that always guesses the majority choice:

> table(choice)
choice
  N   Y
363 813
> sum(choice=="N")/length(choice)
[1] 0.3086735

- The improvement in error rate is nothing to write home about...


Binned fit
- The function from the languageR package for plotting binned expected and observed proportions of 1-responses, as well as bootstrap validation, requires a logistic model fitted with lrm(), the logistic regression fitting function from the Design package:

> sex_age_pres_prod.glm<-
lrm(choice~sex+age+presentation+product,
x=TRUE,y=TRUE)

- The languageR version of the binned plot function (plot.logistic.fit.fnc) dies on our model, since it never predicts p < 0.1, so I hacked my own version, which you can find in the r-data-1 directory:

> source("hacked.plot.logistic.fit.fnc.R")
> hacked.plot.logistic.fit.fnc(sex_age_pres_prod.glm,e)

- (Incidentally: in cases like this where something goes wrong, you can peek inside a function simply by typing its name)
Bootstrap estimation

- Validation using the logistic model estimated by lrm() and 1,000 iterations:

> validate(sex_age_pres_prod.glm,B=1000)

- When fed a logistic model, validate() returns various measures of fit we have not discussed: see, e.g., Baayen’s book
- Independently of the interpretation of the measures, the size of the optimism indices gives a general idea of the amount of overfitting (not dramatic in this case)
Mixed model logistic regression

- You can use the lmer() function with the family="binomial" option
- E.g., introducing subjects as random effects:

> sex_age_pres_prod.lmer<-
lmer(choice~sex+age+presentation+
product+(1|subj),family="binomial")

- You can replicate most of the analyses illustrated above with this model
A warning
- Confusingly, the fitted() function applied to a glm object returns probabilities, whereas applied to an lmer object it returns values on the log odds scale
- Thus, to measure the error rate you’ll have to do something like:

> probs<-exp(fitted(sex_age_pres_prod.lmer)) /
(1+exp(fitted(sex_age_pres_prod.lmer)))
> sum((probs>.5 & choice=="N") |
(probs<.5 & choice=="Y")) /
length(choice)

- NB: Apparently, hacked.plot.logistic.fit.fnc dies when applied to an lmer object, on some versions of R (or lme4, or whatever)
- Surprisingly, the fit of the model with the random subject effect is worse than that of the model with fixed effects only
Outline

Logistic regression

Logistic regression in R
Preparing the data and fitting the model
Practice
Practice time

- Go back to Navarrete et al.’s picture naming data (cwcc.txt)
- Recall that the response can be a time (naming latency) in milliseconds, but also an error
- Are the errors randomly distributed, or can they be predicted from the same factors that determine latencies?
- We found a negative effect of repetition and a positive effect of position-within-category on naming latencies – are these factors also leading to fewer and more errors, respectively?
Practice time

- Construct a binary variable from the responses (error vs. any other response)
- Use sapply(), and make sure that R understands this is a categorical variable with as.factor()
- Add the resulting variable to your data frame, e.g., if you called the data frame d and the binary response variable temp, do:

d$errorresp<-temp

- This will make your life easier later on
- Analyze this new dependent variable using logistic regression (both with and without random effects)
- (One possible recipe for building the binary variable is sketched below)
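One possible recipe, assuming (hypothetically) that the response column is called resp and that error trials are marked with the code "err" (check the actual coding in cwcc.txt first):

> temp <- sapply(d$resp, function(r) if (r == "err") "error" else "response")   # hypothetical coding
> temp <- as.factor(temp)      # make sure R treats it as categorical
> d$errorresp <- temp          # add the new variable to the data frame
> summary(d$errorresp)         # sanity check: counts of errors vs. other responses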

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy