1_LogisticRegressionNotes1
Contents
The range of response in linear regression
Logistic regression
  Logit transformation
  Log odds
  Logistic transformation
  Interpretation of parameters
  Prediction: male or female
  Model performance
Example in R
  The data
  Descriptive statistics and visualization
  Make logistic model
  Parameter interpretation
  Calculate log odds and fitted probability
  Get error rate
  Validation of model: training and testing
The range of response in linear regression

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$$

Now, if the predictor variable is not constrained, that is, if X can take any value from −∞ to +∞,
then the response Y also becomes unconstrained: the range of Y is the entire real number line.
Logistic regression
Suppose we are performing a linear regression taking a person's weight as the response and height as the predictor.
Here the response is a continuous numeric variable. But if the response is a binary variable that can take just
two values (for example, male vs female, bad vs good, pass vs fail, win vs loss, etc.), the scenario becomes a
little special and we need a special method to handle it.
In logistic regression, the response is a categorical variable with two levels. We code them with 1 and 0.
When we model it mathematically, we model the probability of one of the levels (for example, the probability of
being female, the probability of a win, the probability of a pass, etc.). When we model the probability of being
female, male is our reference level. Similarly, when we model the probability of a win, loss is our reference.

Suppose we have a male/female response. Our reference is "female" and our response is Y. In logistic
regression, we model the probability of being male, that is $P(Y = \text{male})$. Since Y can take either "male" or
"female", $P(Y = \text{male}) + P(Y = \text{female}) = 1$ must be satisfied.
Now let us attempt to model our response as we did in simple linear regression. Let us assume we
have one predictor X. We propose the following model:

$$P(Y_i = \text{male}) = \beta_0 + \beta_1 X_i + \epsilon_i$$
Now there is one problem with the above model. We are modeling a "probability" here, and a probability
cannot be less than zero or greater than 1. But the right-hand side can take any real value
(given that X is not constrained). So our proposed model is fundamentally wrong.
Let us try an example with simulated data. We have the heights and genders of 70 persons. Taking height
as a predictor, we will try to predict the gender of a person. We will code male and female as 1 and 0,
respectively.
set.seed(1)
heightmale=rnorm(40,178,4) ## male heights mean=178
set.seed(2)
heightfemale=rnorm(30,166,3) ## female heights mean=166
gender=c(rep(1, 40), rep(0, 30)) ## coded male=1, female=0
height=c(heightmale,heightfemale)
mydata=as.data.frame(cbind(height, gender))
cbind(head(mydata), tail(mydata))
Figure 1 shows the regression line for the model we proposed. We see that the regression line picks up the
overall trend: if X = height goes up, the probability of being male goes up; if X goes down, the probability of
being male goes down. But the line is not restricted to lie between 0 and 1, and since we are plotting a
probability along the y axis, this is not desirable. We would rather have something like Figure 2:
increasing X should increase the probability but never push it above 1, and decreasing X should decrease the
probability but never push it below zero.
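As a minimal sketch (assuming base R graphics and the simulated mydata above), the straight-line fit behind Figure 1 and an S-shaped curve like Figure 2 could be produced roughly as follows; the logistic fit used for the curve is introduced formally in the next section.

## straight-line fit of the 0/1 response (the model behind Figure 1)
linMod=lm(gender~height, data=mydata)
plot(mydata$height, mydata$gender, xlab="height", ylab="P(Y=male)")
abline(linMod)  ## the fitted line is not confined to [0, 1]

## S-shaped curve as in Figure 2, from a logistic fit (see next section)
logMod=glm(gender~height, data=mydata, family=binomial)
hgrid=seq(min(mydata$height), max(mydata$height), length.out=200)
lines(hgrid, predict(logMod, newdata=data.frame(height=hgrid), type="response"), lty=2)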
Logit transformation
Now the question is, how do we get this S-shaped curve? The answer is: by a transformation. We transform
the response so that it is relaxed to the entire real number line, and then we can model it by
$\beta_0 + \beta_1 X + \epsilon$. For logistic regression, our response is a probability, which lies in the interval [0, 1].
So we want a function that maps this interval to the real line.

Such a function is the logit function, defined as follows:

$$\text{logit}(P) = \log\left(\frac{P}{1-P}\right)$$
[Figure 1: P(Y=male) = b0 + b1*height — the straight-line fit of P(Y=male) against height, with male and female points marked; the line runs outside the 0–1 range.]

[Figure 2: the desired S-shaped curve of P(Y=male) against height, bounded between 0 and 1.]
The logit function has a domain of (0, 1) and a range of the entire real line. Since $\log\left(\frac{P}{1-P}\right)$ can take any real
value, we can actually model it as $\beta_0 + \beta_1 X + \epsilon$.
Log odds

For a probability P, the quantity P/(1 − P) is called the odds, and $\log\left(\frac{P}{1-P}\right)$ is called the log odds. The
logit of a probability is therefore just its log odds.
Logistic transformation
Now, with the logit transformation, we have found a way to model a binary response. We can write the model for
our gender-height data as follows:

$$\log\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 X + \epsilon$$

where P represents the probability of being male. We can estimate the regression parameters by the maximum
likelihood method. Suppose the estimated parameters are $b_0$ and $b_1$. We have

$$\log\left(\frac{\hat{P}}{1-\hat{P}}\right) = b_0 + b_1 X$$
Now we have the estimated log odds. But we are interested in the actual probability values rather than the log
odds. The fitted probabilities can be written as:

$$\hat{P} = \frac{1}{1 + \exp\{-(b_0 + b_1 X)\}}$$

Now the term $b_0 + b_1 X$ can be any real number, but $\hat{P}$ is capped between 0 and 1. This transformation,
which maps the entire real number line into the interval between 0 and 1, is called the logistic transformation:

$$\text{logistic}(x) = \frac{1}{1 + \exp(-x)}$$
Note: it is easy to see that the logistic function is the inverse of the logit function, and vice versa.
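A quick numerical check of this in R (a minimal sketch; the function names logit and logistic are defined here just for illustration):

logit=function(p) log(p/(1-p))
logistic=function(x) 1/(1+exp(-x))
logistic(logit(0.73))  ## returns 0.73: logistic undoes logit
logit(logistic(-1.2))  ## returns -1.2: logit undoes logistic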
Interpretation of parameters
$$\log\left(\frac{\hat{P}}{1-\hat{P}}\right) = b_0 + b_1 X$$
b0 interpretation
b0 is the estimated log odds of being male when X = 0; equivalently, the fitted probability of being male at
X = 0 is logistic(b0) = 1/(1 + exp(−b0)).
b1 interpretation
$$\log\left(\frac{P_1}{1-P_1}\right) = b_0 + b_1 X_1$$

$$\log\left(\frac{P_2}{1-P_2}\right) = b_0 + b_1 (X_1 + 1)$$

From the above two,

$$\log\left(\frac{P_2}{1-P_2}\right) - \log\left(\frac{P_1}{1-P_1}\right) = \log\left(\frac{P_2/(1-P_2)}{P_1/(1-P_1)}\right) = b_1$$

$$\implies \frac{P_2/(1-P_2)}{P_1/(1-P_1)} = \exp(b_1)$$
The left-hand side above, the ratio of two odds, is called the odds ratio (OR). If b1 > 0 then OR > 1, and if b1 < 0
then OR < 1. So the interpretation of b1 is best understood by describing the two scenarios separately (a small
numeric check follows the list):
Interpretation
• if b1 > 0, then with a unit increase in X, the odds of being male increase by (exp(b1) − 1) × 100%
• if b1 < 0, then with a unit increase in X, the odds of being male decrease by (1 − exp(b1)) × 100%
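For example (a purely hypothetical slope, not taken from any model in these notes):

b1=0.35              ## hypothetical slope estimate
exp(b1)              ## odds ratio, about 1.42
(exp(b1)-1)*100      ## the odds increase by about 42% per unit increase in X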
Prediction: male or female

From the logistic model above, we get the probability of each individual being male. But how can we
predict the categories? Probably the simplest way is to predict everybody with a probability less than 0.5
as female and everybody with a probability greater than or equal to 0.5 as male. This particular value
(here 0.5), where we draw the margin, is called the threshold or cutoff.
Note: although 0.5 seems a natural and intuitive choice of cutoff, it does not always give us
the best result. We will see that later.
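As a tiny illustration of the rule (with hypothetical fitted probabilities):

p_hat=c(0.12, 0.48, 0.51, 0.93)          ## hypothetical fitted probabilities
ifelse(p_hat>=0.5, "male", "female")     ## "female" "female" "male" "male"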
Model performance
A simple metric to assess the performance of a logistic model is the error rate: the number of wrong
predictions divided by the total number of predictions made. Sometimes we report the accuracy rate instead
(1 − error rate). Later in the course, we will see that there are other metrics for assessing logistic model
performance.
Example in R
The data
We will work with cheese data for making a logistic model. The data is from an Australian study of cheddar
cheese. Samples of cheese were analyzed for their chemical composition and were subjected to taste tests.
The data has two continuous and two categorical variables:
• Continuous variables:
– AceticConc: Concentration of acetic acid in log scale
– H2SConc: Concentration of hydrogen sulphide in log scale
• Categorical variables:
– LacticConc: Lactic acid concentration, low or high
– TasteScore: A binary score from the taste test, coded with 0 (bad) and 1 (good). This is our response
variable.
cdata=read.csv("cheese.csv")
head(cdata)
Descriptive statistics and visualization

Below we produce some descriptive statistics for the data. Our response is TasteScore. For the two groups of
TasteScore, we compute the means of AceticConc and H2SConc. They look different, but from these statistics
alone we cannot make any inference unless we also know the variability of the data. We will do that shortly.
The predictor LacticConc is categorical, and our response is also categorical, so a pivot table seems the logical
summary. We produce the pivot table for these two variables and see that the proportion of good cheeses looks
different in the two groups of LacticConc.
library(plyr)
ddply(cdata,~TasteScore,
      summarise,MeanAct=mean(AceticConc),
      MeanHS=mean(H2SConc) )

## pivot table of TasteScore (rows) against LacticConc (columns)
table(cdata$TasteScore, cdata$LacticConc)
##
##      high low
##    0    3  11
##    1   13   3
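To see those proportions explicitly, prop.table() can be applied to the same cross-tabulation (a quick optional check):

## proportion of bad/good within each LacticConc group (column-wise proportions)
prop.table(table(cdata$TasteScore, cdata$LacticConc), margin=2)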
Now we will produce boxplots and density plots of the two numeric variables.
Figure 3 shows these plots. For better visualization, we produced a separate panel for each group of the target.
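A minimal sketch of how such panels could be produced with ggplot2 (showing only the hydrogen sulphide panels; the acetic acid panels follow the same pattern, and the exact theme and layout of Figure 3 are assumptions):

library(ggplot2)

## boxplot of hydrogen sulphide concentration by taste group
ggplot(cdata, aes(x=factor(TasteScore), y=H2SConc, fill=factor(TasteScore))) +
  geom_boxplot() +
  labs(x="TasteScore", y="Hydrogen sulphide concentration")

## density plot of the same variable, one curve per taste group
ggplot(cdata, aes(x=H2SConc, fill=factor(TasteScore))) +
  geom_density(alpha=0.5) +
  labs(x="Hydrogen sulphide concentration")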
[Figure 3: boxplots (top) and density plots (bottom) of acetic acid concentration and hydrogen sulphide concentration, each split by factor(TasteScore) = 0 or 1.]
For acetic acid concentration, the boxplot shows a pronounced difference between the distributions of the two
target groups, but the density plot shows that there is a lot of overlap between them. The shift between the two
groups' distributions is more pronounced for hydrogen sulphide concentration: the boxes do not overlap at all,
and the overlap of the densities is also smaller than for the other variable.
Make logistic model

Let us make a logistic model for TasteScore with all other variables as predictors.
logModel1=glm(TasteScore~AceticConc+H2SConc+LacticConc, data=cdata,
family=binomial)
summary(logModel1)
##
## Call:
## glm(formula = TasteScore ~ AceticConc + H2SConc + LacticConc,
## family = binomial, data = cdata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.03791 -0.30484 0.06527 0.39228 2.01177
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.4911 7.9704 -0.563 0.5731
## AceticConc -0.1619 1.6254 -0.100 0.9207
## H2SConc 1.1796 0.5287 2.231 0.0257 *
## LacticConclow -2.3729 1.2851 -1.846 0.0648 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 41.455 on 29 degrees of freedom
## Residual deviance: 17.750 on 26 degrees of freedom
## AIC: 25.75
##
## Number of Fisher Scoring iterations: 6
We used the glm function to fit the logistic model. The glm function (short for generalized linear model)
can be used for many types of regression with an appropriate family and link. The default link for the binomial
family is the logit link. So, the model above models the log odds of taste = good as a linear combination of
the predictors.
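If we want to be explicit about the link, the same model can be requested as follows (a sketch equivalent to logModel1 above, since the logit link is the binomial default; the name logModel1b is just for illustration):

logModel1b=glm(TasteScore~AceticConc+H2SConc+LacticConc, data=cdata,
               family=binomial(link="logit"))  ## identical fit to logModel1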
We see that acetic acid concentration does not have statistically significant power to predict the
taste of cheddar cheese: its p-value (0.92) is far above even the most lenient conventional significance
level of 0.1. So we drop this variable and fit a model with the other two predictors.
logModel2=glm(TasteScore~H2SConc+LacticConc, data=cdata,
family=binomial)
summary(logModel2)
##
## Call:
## glm(formula = TasteScore ~ H2SConc + LacticConc, family = binomial,
## data = cdata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0083 -0.2999 0.0658 0.3788 2.0157
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.2497 2.5031 -2.097 0.0360 *
## H2SConc 1.1538 0.4541 2.541 0.0111 *
## LacticConclow -2.3432 1.2472 -1.879 0.0603 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 41.455 on 29 degrees of freedom
## Residual deviance: 17.760 on 27 degrees of freedom
## AIC: 23.76
##
## Number of Fisher Scoring iterations: 6
Both variables now seem to have some predictive power. The H2SConc variable is statistically significant at
the 5% level (p-value less than 0.05) and LacticConclow is statistically significant at the 10% level.
Parameter interpretation
$b_{H2SConc} = 1.15$ and exp(1.15) = 3.16. So, with a unit increase in hydrogen sulphide concentration, the odds
of the cheese tasting good increase by 216%.
$b_{LacticConclow} = -2.34$. Now LacticConc is a categorical variable with two levels. Since we see the estimate
for the low level, the other level, high, was taken as the reference. exp(−2.34) = 0.096. So the odds of a cheese
being good with low lactic acid concentration are only 9.6% of the odds of a cheese being good with high lactic
acid concentration.
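These numbers can be pulled straight from the fitted model: exponentiating the coefficients of logModel2 gives the odds ratios (a quick check).

exp(coef(logModel2))  ## odds ratios: about 3.17 for H2SConc, about 0.096 for LacticConclow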
One more thing. By default, R took high as the reference for the LacticConc variable. When R converts a
character variable to a factor, it orders the levels alphabetically and uses the first level as the reference;
here "high" comes before "low", so "high" became the reference. It is, however, possible to change the reference.
In logModel3 below, we take low as the reference for LacticConc by releveling the factor before fitting.
cdata$LacticConc=relevel(factor(cdata$LacticConc), ref="low")  ## set "low" as reference level
logModel3=glm(TasteScore~H2SConc+LacticConc, data=cdata,
              family=binomial)
summary(logModel3)
##
## Call:
## glm(formula = TasteScore ~ H2SConc + LacticConc, family = binomial,
## data = cdata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0083 -0.2999 0.0658 0.3788 2.0157
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.5928 2.7537 -2.757 0.00583 **
## H2SConc 1.1538 0.4541 2.541 0.01107 *
## LacticConchigh 2.3432 1.2472 1.879 0.06029 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 41.455 on 29 degrees of freedom
## Residual deviance: 17.760 on 27 degrees of freedom
## AIC: 23.76
##
## Number of Fisher Scoring iterations: 6
We see that changing the reference changes the sign of the parameter. Now we have $b_{LacticConchigh} = 2.34$ and
exp(2.34) = 10.38. So, the odds of being good with high lactic acid concentration are 10.38 times the odds of
being good with low concentration. And of course, 1/0.09632764 = 10.38.
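The reciprocal relationship between the two parameterizations can be verified directly from the two fitted models (a quick sketch):

exp(coef(logModel3)["LacticConchigh"])       ## about 10.4
1/exp(coef(logModel2)["LacticConclow"])      ## the same value, from the other parameterization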
Calculate log odds and fitted probability

Let us work with the last model, where we took low as the reference for LacticConc.
Now we will calculate the log odds of being good. We will also calculate the probabilities of being good by
applying the logistic transformation to the log odds.
log_odds=predict(logModel3, newdata = cdata)
## check few of them
log_odds[1:5]
## 1 2 3 4 5
## -3.9757360 0.5688275 1.0245709 3.3990514 -3.2003954
## calculate probabilities by logistic transformation
prob_being_good=1/(1+exp(-log_odds))
## check first 5
prob_being_good[1:5]
## 1 2 3 4 5
## 0.01841983 0.63849259 0.73586199 0.96767488 0.03915084
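Alternatively, predict() can return the fitted probabilities directly with type = "response"; this should agree with the manual logistic transformation above (a quick check):

prob_direct=predict(logModel3, newdata = cdata, type = "response")
all.equal(prob_direct, prob_being_good)  ## TRUE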
Get error rate

As mentioned earlier, the error rate is one way to assess the performance of a logistic model. But first we need
to classify the predictions as "good" or "bad". Let us use 0.5 as our cutoff. Any value less than 0.5 will be
classified as bad (0), and the others will be classified as good (1).
cutoff=0.5
predicted_class=ifelse(prob_being_good<cutoff, 0, 1)
original_class=cdata$TasteScore
## make a confusion/contingency matrix
con_mat=table(original_class, predicted_class)
con_mat
##                predicted_class
## original_class  0  1
##              0 11  3
##              1  2 14
From the contingency table, we see that 5 out of 30 were predicted wrong. So the error rate here is 5/30 = 16.67%.
Note: as stated earlier, the error rate may change as we change the cutoff.
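The same number can be computed directly from the confusion matrix (a small sketch; the correct predictions sit on the diagonal):

error_rate=1-sum(diag(con_mat))/sum(con_mat)
error_rate  ## 0.1667, i.e. 5/30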
Validation of model: training and testing

To validate a model, we almost always hold out a portion of the data. We do not touch this portion while
building the model.
Let us save the last 10 observations of our data for validation purposes.
train=cdata[c(1:20),] # training data
test=cdata[c(21:30),] # test data
model=glm(TasteScore~H2SConc+LacticConc, data=train,
          family=binomial) # fit a logistic model on the training data
pred_log_odd_test=predict(model, newdata = test) ## log odds
pred_prob_test=1/(1+exp(-pred_log_odd_test)) ## fitted probabilities
pred_class_test=ifelse(pred_prob_test<0.5, 0, 1) ## classify with the 0.5 cutoff
original_class_test=test$TasteScore
table(original_class_test, pred_class_test) ## confusion matrix on the test data
##                    pred_class_test
## original_class_test 0 1
##                   0 5 2
##                   1 0 3
So, in the test data set, we got 2 misclassified out of 10, with an error rate of 20%.