
Biostatistics

Binary Logistic Regression

Prof. Getu Degu

March 2013
Logistic regression

► In many studies the outcome variable of interest is the presence or absence of some condition, such as whether or not the subject has a particular characteristic (e.g., a symptom of a certain disease).

► We cannot use ordinary multiple linear regression for such data, but instead we can use a similar approach known as multiple linear logistic regression, or just logistic regression.

Logistic regression: Uses and selection of independent variables

♣ The first use is the prediction (estimation) of the probability that an individual will have (develop) the characteristic.

♣ For example, logistic regression is often used in epidemiological studies where the result of the analysis is the probability of developing cancer after controlling for other associated risks.

♣ Logistic regression also provides knowledge of the relationships and strengths between an outcome variable (a dependent variable having only two categories) and explanatory (independent) variables that can be categorical or continuous.
Logistic regression
Example: smoking 10 packs a day puts you at a higher risk for developing cancer than working in an asbestos mine.

♣ Logistic regression can be applied to case-control and cross-sectional data.

The Model:
► The basic principle of logistic regression is much the same as for ordinary multiple regression.

► The main difference is that instead of developing a model that uses a combination of the values of a group of explanatory variables to predict the value of the dependent variable itself, we predict a transformation of the dependent variable.

► The dependent variable in logistic regression is usually dichotomous: it can take the value 1 with probability of success π, or the value 0 with probability of failure 1 − π.

This type of variable is called a binomial (or binary) variable.


Logistic regression
• Logistic regression extends ordinary least
squares (OLS) methods to model data with
binary (yes/no, success/failure) outcomes.

• Instead of directly estimating the value of the outcome, logistic regression allows you to estimate the probability of a success or failure.
Logistic regression
 Applications of logistic regression have also been extended to cases where the dependent variable has more than two categories, known as multinomial logistic regression.

 When the multiple classes of the dependent variable can be ranked, ordinal logistic regression is preferred to multinomial logistic regression.

 As mentioned previously, one of the goals of logistic regression is to correctly predict the category of outcome for individual cases using the most parsimonious (condensed) model.
Logistic regression: model creation
 To accomplish this goal, a model is created
that includes all predictor variables that are
useful in predicting the response variable.
 Several different options are available during
model creation.

 Variables can be entered into the model in the order specified by the researcher.

 Logistic regression can also test the fit of the model after each coefficient is added or deleted, called stepwise regression.
Logistic regression: Forward stepwise regression

► The first step in many analyses of multivariate data is to examine the simple relation between each potential explanatory variable and the outcome variable of interest, ignoring all the other variables.

► Forward stepwise regression analysis uses this analysis as its starting point. Steps in applying this method, sketched in code below, are:

a) Find the single variable that has the strongest association with the
dependent variable and enter it into the model (i.e., the variable
with the smallest p-value).
b) Find the variable among those not in the model that, when added
to the model so far obtained, explains the largest amount of the
remaining variability.
c) Repeat step (b) until the addition of an extra variable is not
statistically significant at some chosen level such as P=.05.

► N.B. You have to stop the process at some point, otherwise you will end up with all the variables in the model.
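A minimal sketch of this forward procedure, assuming a pandas DataFrame df with a 0/1 outcome column and a list of candidate predictor names (all names here are hypothetical). It selects on the Wald p-value of the newly added term; real analyses often use a likelihood-ratio criterion instead.

```python
# Hypothetical sketch: forward stepwise selection for a binary outcome.
import statsmodels.api as sm

def forward_select(df, outcome, candidates, alpha=0.05):
    selected = []
    candidates = list(candidates)
    while candidates:
        pvals = {}
        for var in candidates:
            X = sm.add_constant(df[selected + [var]])
            fit = sm.Logit(df[outcome], X).fit(disp=0)
            pvals[var] = fit.pvalues[var]          # p-value of the new term
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:                   # step (c): stop when the best
            break                                  # addition is not significant
        selected.append(best)                      # steps (a) and (b)
        candidates.remove(best)
    return selected
```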
Logistic regression: Backward stepwise regression

♣ Backward stepwise regression appears to be the preferred method for exploratory analyses: the analysis begins with a full or saturated model, and variables are eliminated from the model in an iterative process.

♣ The fit of the model is tested after the elimination of each variable to ensure that the model still adequately fits the data. When no more variables can be eliminated from the model, the analysis has been completed.
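A matching sketch of backward elimination, under the same hypothetical setup as the forward sketch above: start from the full model and repeatedly drop the least significant variable, refitting and re-testing after each elimination.

```python
# Hypothetical sketch: backward elimination for a binary outcome.
import statsmodels.api as sm

def backward_eliminate(df, outcome, variables, alpha=0.05):
    variables = list(variables)                # start from the full model
    while variables:
        X = sm.add_constant(df[variables])
        fit = sm.Logit(df[outcome], X).fit(disp=0)
        pvals = fit.pvalues.drop("const")      # ignore the intercept
        worst = pvals.idxmax()
        if pvals[worst] < alpha:               # every remaining term significant:
            break                              # the analysis is complete
        variables.remove(worst)
    return variables
```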

Logistic regression
♣ Logistic regression is a powerful statistical tool for estimating the magnitude
of the association between an exposure and a binary outcome after
adjusting simultaneously for a number of potential confounding
factors.

♣ If we have a binary variable and give the categories numerical values of 0 and 1, usually representing ‘No’ and ‘Yes’ respectively, then the mean of these values in a sample of individuals is the same as the proportion of individuals with the characteristic.

♣ We could expect, therefore, that the appropriate regression model would estimate the probability (proportion) that an individual will have the characteristic.

♣ We cannot use an ordinary linear regression, because this might predict proportions less than zero or greater than one, which would be meaningless.

♣ In practice, a statistically preferable method is to use a transformation of this proportion.
Logistic regression
♣ The transformation we use is called the logit transformation, written as
logit (p). Here p is the proportion of individuals with the characteristic.

♣ For example, if p is the probability of a subject having a myocardial infarction, then 1 − p is the probability that they do not have one.

♣ The ratio p / (1 − p) is called the odds, and thus

logit(p) = ln[p / (1 − p)]

is the log odds. The logit can take any value from minus infinity to plus infinity.

♣ We can fit regression models to the logit which are very similar to the
ordinary multiple regression models found for data from a normal
distribution.

♣ We assume that relationships are linear on the logistic scale:

ln[p / (1 − p)] = a + b1X1 + b2X2 + … + bnXn

where X1, …, Xn are the predictor variables and p is the proportion to be predicted. The calculation is computer intensive.
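To make the model concrete, here is a minimal fitting sketch with Python's statsmodels on simulated data; the variable names ("age", "income") and coefficient values are invented for illustration.

```python
# Sketch: fit ln(p/(1-p)) = a + b1*X1 + b2*X2 by maximum likelihood.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({"age": rng.integers(16, 66, n),
                   "income": rng.normal(500, 150, n)})
true_logit = -4 + 0.05 * df["age"] + 0.003 * df["income"]
df["y"] = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(int)

X = sm.add_constant(df[["age", "income"]])   # adds the intercept a
model = sm.Logit(df["y"], X).fit(disp=0)     # iterative ML fit
print(model.params)                          # a, b1, b2 on the logit scale
```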
Logistic regression
 The quantity to the left of the equal sign is called a logit.
It’s the log of the odds that an event occurs.

 The odds that an event occurs is the ratio of the number of people who experience the event to the number of people who do not.

 This is what you get when you divide the probability that the event occurs by the probability that the event does not occur: both probabilities have the same denominator, which cancels, leaving the number of events divided by the number of non-events.

 The coefficients in the logistic regression model tell you how much the logit changes based on the values of the predictor variables.
 Because the logistic regression equation predicts
the log odds, the coefficients represent the
difference between two log odds, a log odds ratio.
 The antilog of the coefficients is thus an odds ratio. Most
programs print these odds ratios. These are often called
adjusted odds ratios.

 It is informative to compare the adjusted odds ratio (AOR) with the crude odds ratio (COR).
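Continuing the hypothetical statsmodels fit sketched earlier, exponentiating the coefficients and their confidence limits gives the adjusted odds ratios that most programs print:

```python
# Sketch: adjusted odds ratios from a fitted Logit result named "model".
import numpy as np
import pandas as pd

aor = np.exp(model.params)       # antilog of each coefficient
ci = np.exp(model.conf_int())    # CI endpoints on the odds-ratio scale
print(pd.DataFrame({"AOR": aor, "2.5%": ci[0], "97.5%": ci[1]}))
```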

 The above equation can be rewritten to represent the probability of disease as:

P = 1 / (1 + e^−(a + b1X1 + b2X2 + … + bnXn))

or, equivalently,

P = e^(a + b1X1 + b2X2 + … + bnXn) / (1 + e^(a + b1X1 + b2X2 + … + bnXn))

If Z = a + b1X1 + b2X2 + … + bnXn, the above equation turns out to be:

P = e^Z / (1 + e^Z)
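A small numeric check of this formula, with invented values a = −2, b1 = 0.8 and a subject with X1 = 3:

```python
# Sketch: predicted probability P = e^Z / (1 + e^Z) for hypothetical values.
import math

a, b1, x1 = -2.0, 0.8, 3.0
z = a + b1 * x1                       # Z = a + b1*X1 = 0.4
p = math.exp(z) / (1 + math.exp(z))   # identical to 1 / (1 + exp(-Z))
print(round(p, 3))                    # 0.599
```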
Significance tests
► The process by which coefficients are tested for significance
for inclusion or elimination from the model involves several
different techniques.

I) Z-test
The significance of each variable can be assessed by treating

Z = b / se(b)

as a standard normal deviate, where b is the estimated coefficient and se(b) is its standard error.

► This z value is then squared, yielding a Wald statistic with a chi-square distribution. However, there are problems with the use of the Wald statistic. The likelihood-ratio test is more reliable for small sample sizes than the Wald test.
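A sketch of the computation from a fitted statsmodels Logit result (the result object model and the predictor name "age" come from the earlier hypothetical fit; summary() reports these z values directly):

```python
# Sketch: Wald test for a single coefficient.
from scipy import stats

b = model.params["age"]        # estimated coefficient
se = model.bse["age"]          # its standard error
z = b / se
wald = z ** 2                  # Wald statistic, chi-square with 1 df
p = stats.chi2.sf(wald, df=1)
print(z, wald, p)
```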

Significance tests
II) Likelihood-Ratio Test:

► Logistic regression uses maximum-likelihood estimation to compute the coefficients for the logistic regression equation.

N.B. Multiple regression uses the least-squares method to find the coefficients for the independent variables in the regression equation (it computes coefficients that minimize the residuals for all cases).

► Maximum-likelihood estimation is an iterative procedure that successively tries to get closer and closer to the correct answer.

► Before proceeding to the likelihood ratio test, we need to know about the deviance, which is analogous to the residual sum of squares in linear regression.
Deviance
 The deviance of a model is -2 times the log
likelihood (-2LL) associated with each model.

 As a model’s ability to predict outcomes improves, the deviance falls. Poorly fitting models have higher deviance.

 If a model perfectly predicts outcomes, the deviance will be zero. This is analogous to the situation in linear regression, where the residual sum of squares falls to 0 if the model predicts the values of the dependent variable perfectly.
 Based on the deviance, it is possible to construct a statistic analogous to r² for logistic regression, commonly referred to as the pseudo r².

 If G1² is the deviance of a model with variables, and G0² is the deviance of a null model, the pseudo r² of the model is:

r² = 1 − G1² / G0² = 1 − (ln L1 / ln L0)

 One might think of the pseudo r² as the proportion of deviance explained.

 Note that the deviance of a model is −2 times the log likelihood (i.e., −2LL) associated with each model.
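A sketch of this quantity using the simulated df from the earlier fitting example; statsmodels also reports McFadden's version directly as the prsquared attribute, so the two lines below should agree.

```python
# Sketch: pseudo r-squared = 1 - lnL1/lnL0 (deviance = -2*lnL, so the -2 cancels).
import numpy as np
import statsmodels.api as sm

full = sm.Logit(df["y"], sm.add_constant(df[["age", "income"]])).fit(disp=0)
null = sm.Logit(df["y"], np.ones(len(df))).fit(disp=0)  # intercept-only model
print(1 - full.llf / null.llf)   # pseudo r-squared from the two log likelihoods
print(full.prsquared)            # statsmodels' built-in McFadden value
```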
►The likelihood ratio test (LRT), which makes use of the deviance,
is analogous to the F-test from linear regression.

► In its most basic form, it can test the hypothesis that all the
coefficients in a model are all equal to 0:
H0: ß1 = ß2 = . . . = ßk = 0

► The test statistic has a chi-square distribution with k degrees of freedom.

► If we want to test whether a subset consisting of q coefficients in a model are all equal to zero, the test statistic is the same, except that for L0 we use the likelihood from the model without the coefficients, and L1 is the likelihood from the model with them.

► This chi-square has q degrees of freedom.
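A sketch of the test for two nested statsmodels fits, full and reduced (full as in the pseudo r² sketch above; reduced is a hypothetical fit without the q coefficients being tested):

```python
# Sketch: likelihood ratio test between nested Logit fits.
from scipy import stats

lr = 2 * (full.llf - reduced.llf)      # = deviance(reduced) - deviance(full)
q = full.df_model - reduced.df_model   # number of coefficients tested
p_value = stats.chi2.sf(lr, df=q)      # chi-square with q df
print(lr, q, p_value)
```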


Assumptions
► Logistic regression is popular in part because it enables the researcher to overcome many of the restrictive assumptions of OLS regression:

1. Logistic regression does not assume a linear relationship between the dependent and the independents.
 It is possible and permitted to add explicit interaction and power terms as variables on the right-hand side of the logistic equation, as in OLS regression.

2. The dependent variable need not be normally distributed.

3. The dependent variable need not be homoscedastic for each level of the independents; that is, there is no homogeneity-of-variance assumption.
However, other assumptions still apply:
1. Meaningful coding. Logistic coefficients will be difficult to interpret if not
coded meaningfully. The convention for binomial logistic regression is to
code the dependent class of greatest interest as 1 and the other class as 0.

2. Inclusion of all relevant variables in the regression model

3. Error terms are assumed to be independent (independent sampling). Violations of this assumption can have serious effects. Violations are apt to occur, for instance, in correlated samples and repeated-measures designs, such as before-after or matched-pairs studies, cluster sampling, or time-series data. That is, subjects cannot provide multiple observations at different time points.

4. Linearity: Logistic regression does not require linear relationships between the independents and the dependent, as OLS regression does, but it does assume a linear relationship between the continuous independents and the logit of the dependent.
5. No multicollinearity: To the extent that one
independent is a linear function of another
independent, the problem of multicollinearity will occur
in logistic regression, as it does in OLS regression. As
the independents increase in correlation with each
other, the standard errors of the logit (effect)
coefficients will become inflated.
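A common screening tool here is the variance inflation factor (VIF); a hedged sketch with statsmodels, assuming X_df is a DataFrame of the independent variables (values above roughly 10 are a conventional warning sign):

```python
# Sketch: variance inflation factors for the independent variables.
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(X_df)   # X_df: hypothetical predictor DataFrame
for i, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, i))
```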

6. No outliers: As in OLS regression, outliers can affect results significantly. The researcher should analyze standardized residuals for outliers and consider removing them or modeling them separately. Standardized residuals > 2.58 are outliers at the .01 level, which is the customary level (standardized residuals > 1.96 are outliers at the less-used .05 level).
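A sketch of that screening from a fitted statsmodels Logit result, using the 2.58 cutoff above:

```python
# Sketch: flag cases whose standardized (Pearson) residuals exceed 2.58.
import numpy as np

resid = model.resid_pearson                   # one residual per case
outliers = np.where(np.abs(resid) > 2.58)[0]
print(outliers)   # indices of cases to inspect, remove, or model separately
```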
7. Large samples: Unlike OLS regression,
logistic regression uses maximum likelihood
estimation (MLE) rather than ordinary least
squares (OLS) to derive parameters.

♦ MLE relies on large-sample asymptotic normality, which means that the reliability of estimates declines when there are few cases for each observed combination of independent variables.

♦ That is, in small samples one may get high standard errors. In the extreme, if there are too few cases in relation to the number of variables, it may be impossible to converge on a solution.

♦ Very high parameter estimates (logistic coefficients) may signal inadequate sample size.
Hosmer and Lemeshow Test
♣ The Hosmer-Lemeshow goodness-of-fit statistic is used to assess whether the necessary assumptions for the application of multiple logistic regression are fulfilled.

♣ The Hosmer and Lemeshow goodness-of-fit statistic is computed as the Pearson chi-square from the contingency table of observed frequencies and expected frequencies.

♣ A good fit as measured by Hosmer and Lemeshow's test will yield a large p-value (much larger than 0.05).

♣ In SPSS, the result of the Hosmer-Lemeshow goodness-of-fit test is easily obtained by clicking on the appropriate menu commands of logistic regression. That is,
Analyze → Regression → Binary logistic → Options → Hosmer-Lemeshow goodness-of-fit
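Outside SPSS the statistic can be computed by hand. A hand-rolled sketch (it is not a built-in statsmodels function): group cases into deciles of predicted risk and compare observed with expected events via a Pearson chi-square on g − 2 degrees of freedom.

```python
# Sketch: Hosmer-Lemeshow test from outcomes y and predicted probabilities p_hat.
import pandas as pd
from scipy import stats

def hosmer_lemeshow(y, p_hat, g=10):
    d = pd.DataFrame({"y": y, "p": p_hat})
    d["group"] = pd.qcut(d["p"], g, duplicates="drop")  # deciles of risk
    grp = d.groupby("group", observed=True)
    obs, exp, n = grp["y"].sum(), grp["p"].sum(), grp["y"].count()
    hl = (((obs - exp) ** 2) / (exp * (1 - exp / n))).sum()
    return hl, stats.chi2.sf(hl, df=len(obs) - 2)   # large p => adequate fit
```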
Summary
♣ A likelihood is a probability, specifically the probability that the values of
the dependent variable may be predicted from the values of the
independent variables. Like any probability, the likelihood varies from
0 to 1.

♣ The log likelihood ratio test (sometimes called the model chi-square test) of a model tests the difference between −2LL for the initial (null) model and −2LL for the full model. That is, the model chi-square is computed as −2LL for the null (initial) model minus −2LL for the researcher’s model.

♣ The initial chi-square is -2LL for the model which accepts the null
hypothesis that all the β coefficients are zero.

♣ The log likelihood ratio test tests the null hypothesis that all population logistic regression coefficients except the constant are zero. It is an overall model test which does not assure that every independent is significant.
♣ LRT measures the improvement in fit that the
explanatory variables make compared to the null
model.

♣ The method of analysis uses an iterative procedure whereby the answer is obtained by several repeated cycles of calculation using the maximum likelihood approach.

♣ Because of this extra complexity, logistic regression is only found in large statistical packages or those primarily intended for the analysis of epidemiological data.
EXAMPLES

The following tables are given to show the formats of selected presentations of results from logistic regression analysis (only some of the independent variables are taken).
Table X: Results of separately regressing fertility levels (high versus low) on each
explanatory variable relating to women's sexual behaviour and use of contraceptives,
North and South Gondar zones, northwest Ethiopia, 2007 (bivariate analyses)

Explanatory variable                    Fertility level   Odds Ratio   95% CI          P-value
                                        high      low     (crude)      lower   upper

Educational level                                                                      0.002
  no education (does not read/write)     884       989    1.00
  primary                                104       214    0.54         0.42    0.70
  secondary and above                     23       210    0.12         0.08    0.19

Monthly household income                                                               < 0.001
  ≤ 320 Eth Birr                         256       556    1.00
  321 – 500 Eth Birr                     309       487    1.38         1.12    1.69
  501 – 999 Eth Birr                     327       292    2.43         1.96    3.02
  ≥ 1000 Eth Birr                        119        78    3.31         2.40    4.57

Religion
  Orthodox Christian                     945      1312    1.00
  Muslim                                  64        95    0.94         0.67    1.30    0.69

Knowledge of the respondent regarding
the period of pregnancy
  Correct                                 92       274    1.00
  Wrong                                  919      1139    2.40         1.87    3.09    < 0.001

Do you approve wife beating by the
husband for various reasons?
  Yes                                    735       847    1.78         1.49    2.12    < 0.001
  No                                     276       566    1.00
Table Z: Results from the multivariate analysis – adjusted for demographic,
socio-economic and reproductive variables, North and South Gondar zones,
northwest Ethiopia, 2007

Explanatory variable                    Fertility level   Odds Ratio   95% CI          P-value
                                        high      low     (adjusted)   lower   upper

Educational level                                                                      0.002
  no education (does not read/write)     884       989    1.00
  primary                                104       214    0.92         0.67    1.27
  secondary and above                     23       210    0.37         0.21    0.64

Monthly household income                                                               < 0.001
  ≤ 320 Eth Birr                         256       556    1.00
  321 – 500 Eth Birr                     309       487    1.48         1.16    1.88
  501 – 999 Eth Birr                     327       292    3.39         2.60    4.43
  ≥ 1000 Eth Birr                        119        78    6.97         4.54    10.71

Knowledge of the respondent regarding
the period of pregnancy
  Correct                                 92       274    1.00
  Wrong                                  919      1139    1.42         1.04    1.93    0.027

♣ For variables having more than two categories, the overall significance (shown in red in the original slide) is given by their corresponding P-values.
♣ The assessment of whether the required assumptions for the application of multiple logistic regression were fulfilled showed that this parsimonious model adequately fits the data, P = 0.88 (by the Hosmer and Lemeshow test).
Exercise
• Fifty women aged 16 to 65 were randomly taken from
a certain village to assess the level of trachoma (all
stages) and its associated risk factors.

• Selected socio-demographic characteristics of the women together with their status (presence/absence) of trachoma were recorded.

• The dependent variable (i.e., trachoma) is coded as 1 for ‘yes’ and 0 for ‘no’.

• There were three independent variables (predictors) relating to the study subjects collected during the investigation (age, educational status and wash face).
Exercise
► Age was taken as a continuous variable.

► Educational status and wash face were taken as categorical variables.

Categorical variables

Educational status:
0 = Women who could not read/write
1 = Women with some primary school education
2 = Women with secondary and above education

Wash face:
1 = at most once a day (without soap)
2 = twice a day (without soap)
3 = at least once a day (with soap)
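Before fitting, the two categorical predictors would be dummy-coded with the first category of each as the reference; a small sketch (the column names are hypothetical):

```python
# Sketch: dummy-code the exercise's categorical predictors.
import pandas as pd

df = pd.DataFrame({"educ": [0, 1, 2, 0], "wash_face": [1, 3, 2, 1]})
dummies = pd.get_dummies(df, columns=["educ", "wash_face"],
                         drop_first=True, dtype=int)
print(dummies)   # educ_1, educ_2, wash_face_2, wash_face_3 enter the model
```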

By taking the above information into account, answer the following questions.
Exercise
A) Are the necessary assumptions for the application of multiple logistic
regression fulfilled? How? If so, use the forward LR method to analyze the
given data (This procedure should be preceded by the classical bivariate analyses).

B) Does 'washing face' at least once a day (with soap) have any significant
effect on the prevention of trachoma?

C) Estimate the probability that a woman with some primary school education (4th grade) who washes her face twice a day (without soap) will have trachoma.

D) Estimate the probability that a woman with a secondary school education who washes her face twice a day (without soap) will have trachoma.

E) Estimate the probability that a woman with a secondary school education who washes her face at least once a day (with soap) will have trachoma.

F) What do you understand from your answers in parts C, D and E?

G) If you were asked to include more independent variables and undertake a similar study, what additional variables (predictors) would you suggest be included in the proposed study? Why?

H) What recommendations do you forward based on your findings?
