Discrete Choice Models

The document discusses discrete choice models and the linear probability model (LPM) for regression analysis when the dependent variable is qualitative or binary. It explains that the LPM treats the probability of a characteristic or event occurring as a linear function of the independent variables. While simple to estimate using ordinary least squares regression, the LPM has some limitations, including non-normal errors, heteroscedasticity, and predicted probabilities that can fall outside the 0-1 range.

Discrete choice models

Econometrics

9/19/2023 1
 We know that, in a regression framework, the dependent variable can be continuous or discrete.

 So far, we have considered cases in which the LHS variable is always continuous.

 This lecture deals with regression models in which the dependent variable is qualitative.

 Numerous behavioural responses in economics are qualitative.

 Regression models essentially imply averaging y for given values of the x's.

 However, if the LHS variable assumes a value of either 0 or 1, then we move away from averaging techniques such as OLS or IV (instrumental variables) regression.

 Note, though, that if you average a qualitative (dummy) variable which takes the values 0 or 1 across a sample of observations, the result gives you the proportion of observations in the sample which have a particular characteristic.
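The proportion point above can be sketched in a couple of lines of Python, using hypothetical 0/1 data:

```python
# Averaging a dummy (0/1) variable across a sample gives the proportion of
# observations with the characteristic - under random sampling, an estimate
# of the probability of the characteristic in the population.
y = [1, 0, 0, 1, 1, 0, 1, 1]  # hypothetical binary outcomes (e.g. ill = 1)
proportion = sum(y) / len(y)
print(proportion)  # 0.625
```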
 What do we mean by a particular characteristic?

 You may predict, for instance, the probability of falling ill, being denied access to a loan, falling into poverty, etc.

 Therefore, having an illness, a rejected loan application and poverty are the characteristics we are modelling.

 In the context of random sampling, this proportion estimates the probability of encountering the characteristic in the population at large.

 In other words, averaging in this context does not tell us something about the average value a qualitative variable assumes; rather, it tells us something about the probability that the qualitative variable will equal 1.
Some Applications
 In some applications, you might be interested in investigating the factors affecting a qualitative event or a binary outcome, such as:

 disease incidence, probability of contraceptive use,

 probability of unemployment or labour market participation,

 probability of joining a university, probability of being credit constrained, probability of taking illegal drugs,

 and other qualitative events.

The Linear Probability Model (LPM)
In all of the above examples our dependent variable takes only two values: 1 if the event occurs and 0 otherwise.

The LHS is a binary variable. The LPM is the simplest binary choice model.

Therefore, the interest is to predict the probability of the event happening conditional on different covariates (i.e. P(y | x)).
 As the name implies, under the LPM the probability of the event occurring is assumed to be a linear function of a set of regressors.

 This model is estimated using OLS.

 Note: the covariates (i.e. the x's) can be binary and/or continuous.

 In STATA, simply use the command 'reg' or 'regress' to estimate the LPM.

Suppose we have the following multiple regression model:

y = β0 + β1x1 + ... + βkxk + u    (1)

where E(u|x) = 0 by definition.

Because y can take on only two values, βj cannot be interpreted as the change in y given a one-unit increase in xj, ceteris paribus.
 The dependent variable changes either from 1 to 0 or vice versa, or does not change at all.

 However, the estimated parameters have useful interpretations.

Using the expectations operator, we have

E(y|x) = β0 + β1x1 + ... + βkxk    (2)

When the dependent variable is binary, it is always true that

E(y|x) = P(y = 1|x), or p(x) => the probability of success!

Thus we can write (2) as

P(y = 1|x) = β0 + β1x1 + ... + βkxk    (3)
The probability is a linear function of the x's.

Equation (3) is an example of a binary response model, and P(y = 1|x) is referred to as the RESPONSE PROBABILITY.

NOTE: P(y = 0|x) = 1 − P(y = 1|x), which is also a linear function of the regressors.

The multiple linear regression model with a binary dependent variable is called the LINEAR PROBABILITY MODEL (LPM) because the response probability is linear in the parameters βj, ∀j.

In the LPM, βj measures the change in the probability of success when xj changes, ceteris paribus:

ΔP(y = 1|x) = βj Δxj    (4)

In other words, eq. (4) gives the marginal effect of the x's on y.
An alternative conceptual framework for the LPM.

Consider the following unconditional expectation of a binary variable y, defined as a probability:

E(y) = Pr(y = 1)

If we have regressors x, we can define the conditional expectation of y as

E(y|x) = Pr(y = 1|x)

A standard regression model is given by

y = F(x, β) + u

Taking expectations,

E(y|x) = F(x, β), as E(u) = 0.

Therefore, the standard regression function F(x, β) can be interpreted as the conditional expectation of y given x.

It is simple to see that if the LHS is binary, F(x, β) relates directly to the conditional probability of observing y = 1 (i.e. the probability of success).
We know that E(y|x) = Pr(y = 1|x) = F(x, β). Note that the way we specify F(x, β) is crucial for the characteristics of the binary choice model.

It is linear in the LPM but non-linear in the probit and logit case.

In the LPM, we specify the conditional probability as

Pr(y = 1|x) = F(x, β) = x'β

Introducing disturbances u, we can write the model as

y = x'β + u

For n observations,

yi = xi'β + ui

The fitted value

ŷi = xi'β̂

gives the estimated probability that the event will occur, or the characteristic will be observed, given the particular values of the x's.
Problems with the LPM

(1) Disturbances are non-normal (for each observation the error can take only one of two values).

If the event happens, ui = 1 − xi'β, with probability f(ui) = xi'β, where f(ui) is the density of the disturbances.

Alternatively, starting from yi = xi'β + ui with yi = 1:

1 = xi'β + ui
ui = 1 − xi'β

If the event does not happen, ui = −xi'β, with probability f(ui) = 1 − xi'β.

Alternatively, with yi = 0:

0 = xi'β + ui
ui = −xi'β

Hence OLS is not fully efficient due to the non-normality of the residuals.
(2) Disturbances are heteroscedastic, meaning they do not have constant variance across observations (i.e. across the x's). Since ui equals 1 − xi'β with probability xi'β and −xi'β with probability 1 − xi'β, its conditional variance is Var(ui|x) = xi'β(1 − xi'β).

Why does the above expression show heteroscedasticity?

 Both Pr(yi = 1) and Pr(yi = 0) depend on the value of xi.

 It follows that the error terms corresponding to different values of xi will have different variances.

 Heteroscedasticity leaves the estimator unbiased but makes it inefficient.
What is the solution?

 WLS (weighted least squares) is proposed as a solution, because OLS is inefficient due to the heteroscedasticity problem.

 (3) We can get predictions which are either less than zero or greater than one.

 The major criticism relates to the formulation: the conditional expectation is interpreted as the probability that the event will occur.

 In many cases, this quantity can lie outside the limits (0, 1) (Maddala, 1983), and still we are going to (wrongly) interpret it as a probability.
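A quick numerical sketch of problem (3), using hypothetical data and the closed-form simple-regression OLS formulas: the fitted "probabilities" from an LPM can escape the unit interval.

```python
# Hypothetical data: binary y regressed on a single x by OLS (the LPM).
x = [0, 1, 2, 3]
y = [0, 0, 1, 1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form OLS estimates for the simple regression y = b0 + b1*x + u.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
beta1 = sxy / sxx
beta0 = y_bar - beta1 * x_bar

fitted = [beta0 + beta1 * xi for xi in x]
print([round(f, 3) for f in fitted])  # [-0.1, 0.3, 0.7, 1.1]
# Below 0 at x = 0 and above 1 at x = 3: not valid probabilities.
```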
 (4) A probability cannot be linearly related to the independent variables for all their possible values.

 That is, the impact of a regressor on the dependent variable when the regressor increases from 0 to 1 is not necessarily the same as when it increases from 1 to 2.

Example
 Suppose we are interested in the impact of the number of kids a woman has on her labour market participation. The impact on the participation probability when the woman moves from having no (0) children to having 1 should not be equal to the impact of moving from having 1 child to 2. In practice, subsequent children have a smaller impact than the first child.

 Still, the LPM is commonly used in economics and works well when the regressors take values close to their means.
 STATA 9 example of an estimated LPM using the 'Benefits' data from http://www.econ.kuleuven.ac.be/GME .

Article
 McCall, B. P. (1995) The Impact of Unemployment Insurance Benefit Levels on Recipiency, Journal of Business and Economic Statistics, 13, 189-198. Also Verbeek Ch. 7, sec. 7.1.6.

 Linear probability model, using least squares (predicting the probability of applying for unemployment benefits among blue-collar workers).
Source | SS df MS Number of obs = 4877
-------------+------------------------------ F( 19, 4857) = 18.33
Model | 70.5531915 19 3.71332587 Prob > F = 0.0000
Residual | 983.900366 4857 .20257368 R-squared = 0.0669
-------------+------------------------------ Adj R-squared = 0.0633
Total | 1054.45356 4876 .216253806 Root MSE = .45008

------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
rr | .6288587 .3842068 1.64 0.102 -.1243605 1.382078
rr2 | -1.019059 .480955 -2.12 0.034 -1.961949 -.0761697
age | .0157489 .0047841 3.29 0.001 .0063698 .025128
age2ten | -.0014595 .0006016 -2.43 0.015 -.0026389 -.0002801
tenure | .0056531 .0012152 4.65 0.000 .0032708 .0080355
slack | .1281283 .0142249 9.01 0.000 .100241 .1560156
abol | -.0065206 .0248281 -0.26 0.793 -.0551948 .0421537
seasonal | .0578745 .0357985 1.62 0.106 -.0123067 .1280557
head | -.043749 .016643 -2.63 0.009 -.0763769 -.0111211
married | .0485952 .0161348 3.01 0.003 .0169637 .0802267
dkids | -.0305088 .0174321 -1.75 0.080 -.0646837 .003666
dykids | .0429115 .0197563 2.17 0.030 .0041803 .0816428
smsa | -.035195 .0140138 -2.51 0.012 -.0626684 -.0077217
nwhite | .0165889 .0187109 0.89 0.375 -.0200928 .0532707
yrdispl | -.0133149 .0030686 -4.34 0.000 -.0193307 -.007299
school12 | -.0140365 .0168433 -0.83 0.405 -.0470571 .018984
male | -.0363176 .0178142 -2.04 0.042 -.0712415 -.0013936
statemb | .0012394 .0002039 6.08 0.000 .0008396 .0016393
stateur | .0181479 .0030843 5.88 0.000 .0121012 .0241945
_cons | -.076869 .122056 -0.63 0.529 -.316154 .162416
------------------------------------------------------------------------------
Interpretation?
 When we estimate the LPM using OLS, no corrections for heteroskedasticity are made and no attempt is made to keep the implied probabilities between 0 and 1.

 However, as we shall see shortly, the signs of the LPM coefficients and the statistical significance of the regressors can be comparable with logit and probit model results.

Limited Dependent Variable (LDV) models
 LDV: a dependent variable whose range of values is substantively restricted.

 Two other standard discrete models address the shortcomings we highlighted for the LPM (linear probability model).

 These two models are the logit and probit models. One can estimate a logit or a probit model for an equation with a binary dependent variable.

 The only difference between these two models is their distributional assumption about the error term.
 The logit model assumes that the error terms follow a logistic distribution, while the probit model assumes that they follow a normal distribution.

What about corner solution responses?
 In practice, the optimising behaviour of economic agents (e.g. individuals, households, etc.) leads to a corner solution response for some nontrivial fraction of the population.

Example:
 Observing zero expenditures for some commodities in household expenditure surveys.

 If this is the case, a special class of model (the Tobit model) is the appropriate estimating framework. The Tobit model was explicitly developed to handle corner solution dependent variables.
 Poisson and negative binomial models are count data models which handle atypical dependent variables, for example when the dependent variable is in the form of counts (e.g. the number of visits to a hospital in a given year).

 Other important topics following the above models are the issues of data censoring and sample selection bias (e.g. the Heckit and other techniques are used to handle it), both in the case of binary and multinomial (polychotomous) dependent variables.

The Logit model

 This is one of the more sophisticated models that overcomes the shortcomings of the LPM.

 It has been extensively used in disciplines such as biometrics, biology, epidemiology and the social sciences.
Our primary interest is to estimate the response probability

P(y = 1|x) = P(y = 1|x1, x2, ..., xk)    (5)

Under the LPM, the response probability is linear in a set of parameters βj. To avoid that limitation, consider a class of binary response models of the form

P(y = 1|x) = G(β0 + β1x1 + ... + βkxk) = G(β0 + xβ)    (6)

where G is a function taking on values strictly between zero and one: 0 < G(z) < 1 for all real numbers z.

Note that

z = β0 + β1x1 + ... + βkxk = β0 + xβ;

hence

G(z) = G(β0 + β1x1 + ... + βkxk) = G(β0 + xβ).

The logit is one of the non-linear functions suggested for G to ensure that the probabilities lie between 0 and 1.
Thus G is the logistic function:

G(z) = exp(z) / [1 + exp(z)] = Λ(z)    (7)

G(z) is the standard logistic distribution function, which results in the logit model.

The logistic function was introduced in the 19th century (by Verhulst, 1804-1849) for the description of population growth (Cramer, 2003).

yi = β0 + β1x1 + ... + βkxk + εi = β0 + xβ + εi    (8)
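A minimal sketch of the G function in eq. (7): the logistic cdf maps any real z into the open interval (0, 1), with G(0) = 0.5.

```python
import math

def logistic_cdf(z):
    """Standard logistic distribution function: G(z) = exp(z) / (1 + exp(z))."""
    return math.exp(z) / (1.0 + math.exp(z))

# G(z) is strictly between 0 and 1 for any real z.
for z in (-5.0, 0.0, 5.0):
    assert 0.0 < logistic_cdf(z) < 1.0
print(logistic_cdf(0.0))  # 0.5
```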
The Likelihood function

If pi is the probability that yi = 1, then (1 − pi) is the probability that yi = 0.

To construct the likelihood function, we note that the contribution of the ith observation can be written as

pi^yi (1 − pi)^(1−yi)

In the case of random sampling, where all observations are sampled independently, the likelihood function is simply the product of the individual contributions:

L = Π(i=1 to n) pi^yi (1 − pi)^(1−yi)
  = p1^y1 (1 − p1)^(1−y1) * p2^y2 (1 − p2)^(1−y2) * ... * pn^yn (1 − pn)^(1−yn)    (9)

The technique of maximum likelihood entails choosing those values of the parameters of eq. (8) which maximise the likelihood function as given in eq. (9).
In practice, we maximise the logarithm of the likelihood function:

ln L = Σ [yi ln pi + (1 − yi) ln(1 − pi)]    (10)

Article
 McCall, B. P. (1995) The Impact of Unemployment Insurance Benefit Levels on Recipiency, Journal of Business and Economic Statistics, 13, 189-198.
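Eq. (10) can be computed directly; a sketch with a hypothetical sample and fitted probabilities pi:

```python
import math

def bernoulli_log_likelihood(y, p):
    """ln L = sum_i [ y_i ln p_i + (1 - y_i) ln(1 - p_i) ], eq. (10)."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
               for yi, pi in zip(y, p))

y = [1, 0, 1, 1]          # hypothetical binary outcomes
p = [0.5, 0.5, 0.5, 0.5]  # hypothetical fitted success probabilities
ll = bernoulli_log_likelihood(y, p)
print(ll)  # 4 * ln(0.5), approximately -2.7726
```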
Logit Model using MLE technique
Iteration 0: log likelihood = -3043.028
Iteration 1: log likelihood = -2875.8198
Iteration 2: log likelihood = -2873.2003
Iteration 3: log likelihood = -2873.1965

Logistic regression Number of obs = 4877


LR chi2(19) = 339.66
Prob > chi2 = 0.0000
Log likelihood = -2873.1965 Pseudo R2 = 0.0558

------------------------------------------------------------------------------
y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
rr | 3.06808 1.868225 1.64 0.101 -.5935732 6.729733
rr2 | -4.890618 2.333521 -2.10 0.036 -9.464236 -.3170007
age | .0676968 .0239095 2.83 0.005 .020835 .1145586
age2ten | -.0059681 .0030383 -1.96 0.050 -.0119231 -.000013
tenure | .0312492 .0066443 4.70 0.000 .0182267 .0442717
slack | .624822 .0706385 8.85 0.000 .4863731 .7632709
abol | -.0361753 .1178082 -0.31 0.759 -.2670751 .1947245
seasonal | .270874 .1711711 1.58 0.114 -.0646152 .6063633
head | -.2106822 .081226 -2.59 0.009 -.3698822 -.0514821
married | .2422656 .0794099 3.05 0.002 .0866251 .3979061
dkids | -.1579269 .0862177 -1.83 0.067 -.3269105 .0110566
dykids | .2058941 .0974924 2.11 0.035 .0148126 .3969756
smsa | -.1703537 .0697808 -2.44 0.015 -.3071216 -.0335858
nwhite | .0740701 .0929562 0.80 0.426 -.1081208 .256261
yrdispl | -.0637001 .0149972 -4.25 0.000 -.0930941 -.0343062
school12 | -.0652576 .0824126 -0.79 0.428 -.2267834 .0962681
male | -.179829 .087535 -2.05 0.040 -.3513944 -.0082636
statemb | .006027 .001009 5.97 0.000 .0040494 .0080046
stateur | .0956198 .0159116 6.01 0.000 .0644336 .126806
_cons | -2.800499 .6041675 -4.64 0.000 -3.984645 -1.616352
------------------------------------------------------------------------------

Computing predicted probabilities from the above logit estimates:

G(z) = exp(z) / [1 + exp(z)] = Λ(z)

Dividing both the numerator and the denominator by exp(z), we get

Pri = G(z) = 1 / [1/exp(z) + 1] = 1 / [1 + exp(−z)]

 What is the distinction between the logit model and the logistic model?

 In STATA, the logistic command reports the estimated betas as odds ratios, while the logit command reports the raw coefficients.

 How do you handle hypothesis tests (such as t-tests) for logit model estimates?

 No special problem is posed by logit models; we conduct the tests as in the linear case.
The Probit model

 As G(.) has to be between 0 and 1, it seems natural to choose G to be some distribution function.

 Here the G function is the standard normal cumulative distribution function (cdf), expressed as an integral:

G(z) = Φ(z) = ∫(−∞ to z) φ(v) dv,    (11)

where

φ(v) = (2π)^(−1/2) exp(−v²/2) = [1/√(2π)] exp(−v²/2).

Eq. (11) can be rewritten as

G(z) = Φ(z) = ∫(−∞ to z) [1/√(2π)] exp(−v²/2) dv    (12)

which again ensures that the response probability will be between 0 and 1, and it leads to the probit model.

Note that the probit is simply the normal counterpart of the logit.
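The integral in eq. (12) has no closed form, but the standard normal cdf can be evaluated through the error function; a small sketch:

```python
import math

def normal_cdf(z):
    """Standard normal cdf: Phi(z) = (1 + erf(z / sqrt(2))) / 2."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

print(normal_cdf(0.0))             # 0.5
print(round(normal_cdf(1.96), 3))  # 0.975
```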
Probit Model using MLE technique
Iteration 0: log likelihood = -3043.028
Iteration 1: log likelihood = -2875.021
Iteration 2: log likelihood = -2874.071
Iteration 3: log likelihood = -2874.0708

Probit regression Number of obs = 4877


LR chi2(19) = 337.91
Prob > chi2 = 0.0000
Log likelihood = -2874.0708 Pseudo R2 = 0.0555

------------------------------------------------------------------------------
y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
rr | 1.863475 1.127476 1.65 0.098 -.3463382 4.073287
rr2 | -2.980436 1.410589 -2.11 0.035 -5.74514 -.2157332
age | .0422141 .0142969 2.95 0.003 .0141927 .0702355
age2ten | -.0037741 .0018118 -2.08 0.037 -.0073251 -.0002231
tenure | .0176942 .0038077 4.65 0.000 .0102312 .0251572
slack | .3754931 .0424115 8.85 0.000 .2923681 .458618
abol | -.0223137 .071845 -0.31 0.756 -.1631274 .1185
seasonal | .1612071 .1039498 1.55 0.121 -.0425308 .3649451
head | -.1247463 .0491627 -2.54 0.011 -.2211034 -.0283892
married | .1454763 .0477579 3.05 0.002 .0518725 .2390801
dkids | -.0965778 .051813 -1.86 0.062 -.1981294 .0049738
dykids | .1236098 .058581 2.11 0.035 .0087931 .2384265
smsa | -.1001521 .04183 -2.39 0.017 -.1821373 -.0181668
nwhite | .0517939 .0559871 0.93 0.355 -.0579388 .1615266
yrdispl | -.0384797 .0090685 -4.24 0.000 -.0562535 -.0207058
school12 | -.0415518 .0497067 -0.84 0.403 -.1389751 .0558714
male | -.1067169 .0527926 -2.02 0.043 -.2101885 -.0032454
statemb | .0036399 .0006071 6.00 0.000 .0024499 .0048298
stateur | .0568271 .0094492 6.01 0.000 .038307 .0753472
_cons | -1.699991 .3622682 -4.69 0.000 -2.410024 -.9899586
------------------------------------------------------------------------------

Derivation of the above two models using a latent variable model

 The latent variable approach is a common way of specifying discrete choice models. Both probit and logit models can be interpreted as latent variable models. Let y* be an unobserved or latent variable given by

y* = β0 + xβ + e,   y = 1[y* > 0]    (13)

where 1[.] is the indicator function:

y = 1 if y* > 0
y = 0 otherwise (i.e. if y* ≤ 0)

Note that cov(x, e) = 0: exogeneity of x!

y* is a latent propensity variable; it is unobservable, and y* ∈ (−∞, ∞).
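A sketch of eq. (13) with hypothetical numbers (coefficients and "error draws" invented for illustration): the observed y simply indicates whether the latent propensity y* crosses zero.

```python
# Hypothetical latent-variable model: y* = b0 + b1*x + e, y = 1[y* > 0].
b0, b1 = -1.0, 0.5
x = [0.0, 1.0, 2.0, 3.0, 4.0]
e = [0.2, -0.1, 0.3, -0.8, 0.1]  # fixed "draws" for reproducibility

y_star = [b0 + b1 * xi + ei for xi, ei in zip(x, e)]
y = [1 if ys > 0 else 0 for ys in y_star]
print(y)  # [0, 0, 1, 0, 1]: y = 1 exactly where y* is positive
```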
e can follow either a logistic or a normal distribution, and it is symmetrically distributed about zero.

Therefore, 1 − G(−z) = G(z).

Economists prefer the latter (normal) distribution, as it makes it easier to handle specification problems (e.g. selectivity bias).

Note:

 y* rarely has a well-defined unit of measurement (e.g. it might be the difference in utility levels from two different actions).

 The response probability from eq. (13) can be derived as follows.
P(y = 1|x) = P(y* > 0|x) = P[e > −(β0 + xβ)|x]
           = 1 − G[−(β0 + xβ)] = G(β0 + xβ)
or
           = 1 − F(−xi'β) = F(xi'β)    (14)

We know that both the cumulative normal distribution function (i.e. Φ(z) for the probit model) and the logistic function (i.e. Λ(z) for the logit model) are monotonically increasing functions of z:

F(xi'β) → 0 as xi'β → −∞
F(xi'β) → 1 as xi'β → +∞

Therefore, the probabilities are bounded between 0 and 1 in the case of the logit and probit models.
Economic theory foundations for the latent variable approach (Duncan, 2000)

For the purposes of exposition, let y represent the labour force participation choice (y = 1 if a woman works, 0 otherwise), and let the two outcomes, working and not working, be described by state-specific utilities U*y:

U*(y=1) = x'β1 + u1
U*(y=0) = x'β0 + u0

where x represents a common set of control variables, β0 and β1 are vectors of unknown parameters, and u0 and u1 represent unobservable (state-specific) taste components.

Under this characterisation, an individual will choose to participate if the utility enjoyed when working (i.e. U*(y=1)) exceeds the utility gained when out of work (i.e. U*(y=0)). That is, the choice to participate is made when U*(y=1) > U*(y=0), such that
y = 1(U*(y=1) > U*(y=0))
  = 1(x'β1 + u1 > x'β0 + u0)
  = 1(u1 − u0 > −x'(β1 − β0))

Clearly, we cannot identify both sets of parameters β0 and β1; however, we can identify the difference (β1 − β0), and we implicitly parameterise the choice model as

y = 1(y* > 0)

where y* = x'(β1 − β0) + (u1 − u0) = x'β + u.

In other words, the latent variable approach to binary choice model specification can be derived from an economic model of behaviour.

 While the underlying preference specification is by necessity fairly restrictive, the fact that the latent variable approach can be presented as having foundations in economic theory lends weight to its application in applied work.
Goodness-of-fit
 Contrary to the linear regression model, there is no single measure of goodness-of-fit in binary response (choice) models.

 Often, goodness-of-fit measures are implicitly or explicitly based on a comparison with a model that contains only a constant as explanatory variable.

 Let log L1 denote the maximum likelihood value of the model of interest, and let log L0 denote the maximum value of the log-likelihood function when all parameters, except the intercept, are set to zero.

Clearly, log L1 ≥ log L0. The larger the difference between the two log-likelihood values, the more the extended model adds to the very restrictive model. A first goodness-of-fit measure is defined as follows.
Pseudo-R² = 1 − 1 / [1 + 2(log L1 − log L0)/N]

where N denotes the number of observations.

McFadden (1974) suggested an alternative measure,

R²(McFadden) = 1 − log L1 / log L0
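As a sketch, plugging in the two log-likelihood values reported in the logit output above (taking the iteration 0 value as the intercept-only log-likelihood) reproduces the Pseudo R2 that STATA reports:

```python
# Log-likelihoods from the logit output above.
log_l1 = -2873.1965  # full model
log_l0 = -3043.028   # intercept-only model (iteration 0)

mcfadden_r2 = 1.0 - log_l1 / log_l0
print(round(mcfadden_r2, 4))  # 0.0558, matching the reported Pseudo R2
```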
which is sometimes referred to as the Likelihood Ratio Index.

Because the log-likelihood is the sum of log probabilities, it follows that log L0 ≤ log L1 < 0, from which it is straightforward to show that both measures take on values in the interval [0, 1] only.

If all estimated slope coefficients are equal to 0, we have log L0 = log L1, such that both R-squareds are equal to zero.

If the model were able to generate (estimated) probabilities that correspond exactly to the observed values (that is, p̂i = yi for all i), all probabilities in the log-likelihood would be equal to one, such that the log-likelihood would be exactly equal to zero.

Consequently, the upper limit for the two measures above is obtained for log L1 = 0.

The upper bound of 1 can therefore, in theory, only be attained by McFadden's measure.
Marginal Effects [MEs]

 How can we explain the impact of the regressors (x's) on the response probability, or specifically on the probability of success [i.e. P(y = 1|x)]?

 The MEs are not straightforward, due to the non-linear function G(.), and we compute them using calculus.

If xj is a continuous variable, its ME is obtained from the following partial derivative:

∂p(x)/∂xj = g(β0 + xβ) βj, where g(z) = dG(z)/dz.    (11)

G is the cdf of a continuous random variable and g is the corresponding pdf. G(.) is strictly increasing, so g(z) > 0, ∀z.

Eq. (11) tells us that the partial effect of xj on p(x) depends on x through the positive quantity g(β0 + xβ), which means that the partial effect always has the same sign as βj.
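For the logit case the density is g(z) = G(z)[1 − G(z)], so the ME of a continuous regressor is G(z)[1 − G(z)]βj. A sketch with hypothetical values of z and βj:

```python
import math

def logistic_cdf(z):
    return math.exp(z) / (1.0 + math.exp(z))

def logit_marginal_effect(z, beta_j):
    """ME of a continuous regressor in the logit model: g(z)*beta_j,
    where g(z) = G(z)*(1 - G(z)) is the logistic pdf."""
    g = logistic_cdf(z) * (1.0 - logistic_cdf(z))
    return g * beta_j

# g(z) peaks at z = 0, where g(0) = 0.25, so the ME is largest there.
print(logit_marginal_effect(0.0, 0.5))  # 0.125
print(logit_marginal_effect(3.0, 0.5))  # smaller: far from z = 0
```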
What is the ME if the regressor is binary?

Let x1 be the binary regressor. The ME is the difference

G(β0 + β1 + β2x2 + ... + βkxk) − G(β0 + β2x2 + ... + βkxk)    (12)

Note that x1 is 1 in the first term of eq. (12) and 0 in the second term. Only the sign, not the magnitude, of the coefficient is important.

Example:
 Suppose y is an illness indicator and the regressor is a location dummy (e.g. urban vs rural).

 The parameter estimate of the location dummy relates to the probability of illness due to location.

 Note that knowing the sign of the parameter estimate is sufficient to tell whether being in an urban area has a positive or a negative effect on the probability of having an illness.

We can use the difference in eq. (12) for other kinds of discrete variables (such as the number of children in a given household).

If xk denotes this variable, then the effect on the employment probability of xk going from ck to ck + 1 is

G(β0 + β1x1 + ... + βk(ck + 1)) − G(β0 + β1x1 + ... + βk ck)    (13)
Other standard functional forms can be included among the regressors (e.g. polynomials of different orders).

Example:
In the model

P(y = 1|z) = G(β0 + β1z1 + β2z1² + β3 log(z2) + β4z3)    (14)

the ME of z1 on P(y = 1|z) is

∂P(y = 1|z)/∂z1 = g(β0 + xβ)(β1 + 2β2z1)    (15)

and the ME of z2 is

∂P(y = 1|z)/∂z2 = g(β0 + xβ)(β3/z2)    (16)

where

xβ = β1z1 + β2z1² + β3 log(z2) + β4z3.    (17)

Thus

g(β0 + xβ)(β3/100)

is the approximate change in the response probability when z2 increases by 1%.
Computing Marginal Effects using STATA
(Type mfx after estimating the equation.)

Note:
 Interactions among regressors (i.e. including those between discrete and continuous variables) are handled similarly.

Estimation
 We know that we have different ways of generating estimators, i.e. the method of moments, least squares and maximum likelihood estimation (MLE).

 All of the discrete choice models discussed above are estimated using the MLE technique. To estimate the LPM, we can use OLS, or WLS (weighted least squares) in some cases.

 However, if E(y|x) is non-linear, we cannot use OLS to estimate either the logit or the probit model.

 Because MLE is based on the distribution of y given x, the heteroskedasticity in Var(y|x) is automatically accounted for.
Assume that we have a random sample of size n. To obtain the ML estimator conditional on the regressors, we need the density of yi given xi. This is

f(y|xi; β) = [G(xiβ)]^y [1 − G(xiβ)]^(1−y),  y = 0, 1    (18)

When y = 1, we get [G(xiβ)], and 1 − [G(xiβ)] when y = 0.

The log-likelihood function (i.e. the function we optimise), a function of the parameters and the data (xi, yi), is obtained by taking the log of (18):

ℓi(β) = yi log[G(xiβ)] + (1 − yi) log[1 − G(xiβ)].    (19)

The log-likelihood for a sample of size n is obtained by summing (19) across all observations:

L(β) = Σ(i=1 to n) ℓi(β)

Differentiating this function w.r.t. the parameters gives us the following first-order conditions (FOCs). Solving them for the parameters of interest will give us the ML estimates.

∂L(β)/∂β1 = 0; ∂L(β)/∂β2 = 0; ...; ∂L(β)/∂βk = 0    (20)
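The FOCs in (20) have no closed-form solution, so in practice they are solved numerically (statistical packages use Newton-type iterations). A minimal sketch for a one-parameter logit on hypothetical data, using plain gradient ascent on the log-likelihood until the score (the FOC) is approximately zero:

```python
import math

def logistic_cdf(z):
    return math.exp(z) / (1.0 + math.exp(z))

# Hypothetical data: single regressor, no intercept, binary outcome.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [0, 1, 0, 1, 1]

def score(b):
    """dlnL/db for the logit: sum_i x_i * (y_i - G(b * x_i))."""
    return sum(xi * (yi - logistic_cdf(b * xi)) for xi, yi in zip(x, y))

b = 0.0
for _ in range(5000):        # gradient ascent with a small fixed step size
    b += 0.05 * score(b)

print(round(b, 3))           # ML estimate of b: the score is ~0 here
assert abs(score(b)) < 1e-8  # first-order condition (20) holds
```

The log-likelihood is concave in b, so this simple iteration converges; real software replaces the fixed step with Newton-Raphson updates.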
If G(.) in eq. (19) is the standard logistic cdf, we obtain β̂ (i.e. the vector of parameters) as the logit estimator, and if G(.) is the standard normal cdf, then the vector gives us the probit estimator.

 The non-linear nature of the maximisation problem makes it computationally difficult to write formulas for the logit or probit ML estimates.

 However, under general conditions, the MLE is:
  consistent,
  asymptotically normal, and
  asymptotically efficient.

Testing for Normality (Background)

 Statistically, the characteristics of the normal distribution are well-known. In particular, if we let the error term ε represent a standard normally distributed random variable, then we may list the properties of the moments of
E ( ) as follows:
j

E ( 1 )  0 (mean) (c.1)
E ( )  1 (variance)
2
(c.2)
E ( )  0 (skewness)
3
(c.3)
E ( )  0 (kurtosis)
4
(c.4)

9/19/2023 84
When presented with a regression of the
form yi  x'i   u i in which the error term
u i is maintained as normal, the error term
should respect characteristics (c.1) to
(c.4).

9/19/2023 85
An obvious way to carry out such tests is to
compare the sample moments of
~ ~
standardised residuals;  i  ( yi  x'i  ) * ,
1

with those which one would expect under


the null of normality.

9/19/2023 86
Note that the first two sample moments of ε̃i respect (c.1) and (c.2) by definition.

Consider the FOCs for OLS:

N⁻¹ Σ(i=1 to N) ε̃i = 0 (mean);
and
N⁻¹ Σ(i=1 to N) ε̃i² = 1 (variance),

where ε̃i = (yi − xi'β̃) σ̃⁻¹.

A test of normality utilises the 3rd and 4th sample moments,

N⁻¹ Σ(i=1 to N) ε̃i³  and  N⁻¹ Σ(i=1 to N) ε̃i⁴,

respectively, as test statistics.

Under the null of normality, these two statistics ought to respect conditions (c.3) and (c.4).

Why?
The skewness of any symmetric
distribution, such as the normal
distribution, is zero:

  E(ε^3) = Σ (ε_i - μ)^3 / ((N - 1) s^3),

where s is the standard deviation and ε is
a univariate random variable with mean μ.

We also know that the kurtosis of a
standard normal distribution,

  E(ε^4) = Σ (ε_i - μ)^4 / ((N - 1) s^4),

is 3. For this reason, excess kurtosis is
defined as

  Σ (ε_i - μ)^4 / ((N - 1) s^4) - 3,

so that the standard normal distribution
has an excess kurtosis of zero.
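The sample skewness and excess kurtosis can be checked numerically. The sketch below (simulated standard normal "residuals", not data from the notes) computes both moments and shows they are close to zero under normality:

```python
# Sketch: compute the 3rd and 4th standardised sample moments of simulated
# normal residuals and compare them with the values a standard normal
# error implies (skewness 0, excess kurtosis 0).
import math
import random

random.seed(0)
eps = [random.gauss(0.0, 1.0) for _ in range(20000)]  # stand-in residuals

n = len(eps)
mu = sum(eps) / n
s = math.sqrt(sum((e - mu) ** 2 for e in eps) / (n - 1))

# Sample skewness: sum((e_i - mu)^3) / ((N - 1) s^3); zero for symmetric distributions
skewness = sum((e - mu) ** 3 for e in eps) / ((n - 1) * s ** 3)

# Excess kurtosis: sum((e_i - mu)^4) / ((N - 1) s^4) - 3; zero for the normal
excess_kurtosis = sum((e - mu) ** 4 for e in eps) / ((n - 1) * s ** 4) - 3.0
```

A markedly non-zero value of either statistic in a real sample is evidence against the normality of the errors.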
Likelihood Ratio (LR) Test for non-normality
in probit models

1. Estimate y*_i = x_i'β + ε_i to obtain the ML
   estimate β̃ and the maximised
   log-likelihood LogL0;

2. Add the test variables (x_i'β̃)^2 and (x_i'β̃)^3 to
   an auxiliary regression

   y*_i = x_i'β + γ1 (x_i'β̃)^2 + γ2 (x_i'β̃)^3 + ε_i;

3. Obtain the maximised log-likelihood
   LogLN from the auxiliary regression;

4. The test statistic 2(LogLN - LogL0) should
   be distributed as χ²(2) under the null of
   normality.
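Once LogL0 and LogLN are in hand, the test itself is simple arithmetic. A sketch with hypothetical log-likelihood values (made-up numbers, not output from a fitted probit):

```python
# Sketch of the LR mechanics only: the statistic 2*(LogLN - LogL0) is
# compared with a chi-squared(2) critical value.
logL_restricted = -412.7   # LogL0 from the probit under the null (hypothetical)
logL_auxiliary = -409.1    # LogLN from the auxiliary regression (hypothetical)

lr_stat = 2.0 * (logL_auxiliary - logL_restricted)

CHI2_2_CRIT_5PCT = 5.991   # 5% critical value of chi-squared with 2 df
reject_normality = lr_stat > CHI2_2_CRIT_5PCT
```

The same 2 × (difference in maximised log-likelihoods) recipe applies to the heteroskedasticity and omitted-variables LR tests; only the auxiliary regression and the degrees of freedom (m instead of 2) change.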
LR test for heteroskedasticity in probit
models

1. Estimate y*_i = x_i'β + ε_i to obtain the ML
   estimate β̃ and the maximised
   log-likelihood LogL0;

2. Add the test variables (x_i'β̃) · z_i to an
   auxiliary regression

   y*_i = x_i'β + γ1 (x_i'β̃) · z_i + ε_i,

   where z_i represents an m-vector of
   variables which may potentially cause the
   heteroskedasticity.

3. Obtain the maximised log-likelihood
   LogLH from the auxiliary regression.

4. The test statistic 2(LogLH - LogL0)
   should be distributed as χ²(m) under the
   null of homoskedasticity.

LR test for Omitted Variables in Probit
Models

1. Estimate y*_i = x_i'β + ε_i to obtain the ML
   estimate β̃ and the maximised
   log-likelihood LogL0;

2. Add the test variables z_i to an auxiliary
   regression

   y*_i = x_i'β + γ1 z_i + ε_i,

   where z_i represents an m-vector of
   potentially omitted variables.

3. Obtain the maximised log-likelihood LogLM
   from the auxiliary regression.

4. The test statistic 2(LogLM - LogL0) should
   be distributed as χ²(m) under the null of
   no incorrect omission.

Testing exclusion restrictions in logit and
probit models
 In many applications, we do not go
beyond t and F-tests to assess the
statistical significance of parameters.

 However, it is useful to have other ways to


test multiple exclusion restrictions. For the
logit and probit models, there are several
ways of doing so.
i. Lagrange Multiplier (LM) or score test
 The Lagrange multiplier statistic is obtained from
constrained optimization. In a linear regression
framework, it is simple to motivate the LM statistic.

 Note that the LM test or the score test only requires


estimating the model under the null hypothesis.

 The statistic is based on the assumptions that justify


the F statistic in large samples. We do not need the
normality assumption.

Suppose we have the following model:

  y = β0 + β1 x1 + ... + βk xk + u        (21)

Note: y can be continuous or discrete.

We test whether the last q of these variables all
have zero population parameters. Thus, the null
is

  H0: β_{k-q+1} = 0, ..., β_k = 0          (22)

 which puts q exclusion restrictions on the model
given in eq(21).

 As in the F-test case, the alternative is that at


least one of the parameters is different from
zero.

 The LM statistic requires the estimation of the


restricted model only.

Step I:
The estimates from the restricted model can be
given as

  y = β̃0 + β̃1 x1 + ... + β̃_{k-q} x_{k-q} + ũ    (23)

If the omitted variables x_{k-q+1}, ..., x_k truly have
zero population coefficients, then, at least
approximately, ũ should be uncorrelated with
each of these variables in the sample.

Step II:
 This suggests running a regression of these
residuals on those independent variables
excluded under the null, which is almost what
the LM test does.

 In practice, to get a usable test statistic, we
must include all regressors.

This takes the form:

  ũ = δ0 + δ1 x1 + ... + δk xk + residual    (24)

This is referred to as an auxiliary regression,
which is used to compute a test statistic but
whose coefficients are not of direct interest.
Step III:
Compute the LM statistic. Under the null, this
turns out to be the product of the sample size
and the R-squared obtained from the auxiliary
regression:

  LM = n R²_ũ ~ χ²(q)    (25)

Step IV:
 Compare the statistic with the chi-squared
critical value and decide.

 Unlike the F-test, the degrees of freedom
in the unrestricted model play no role
under the LM test.

 This is because of the asymptotic nature
of the LM statistic.

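Steps I to III can be sketched end-to-end with simulated data. Everything below is hypothetical (the data-generating process and coefficient values are made up); OLS is done by hand via the normal equations so the example is self-contained:

```python
# Sketch of the LM test: fit the restricted model by OLS, regress its
# residuals on ALL regressors, and form LM = n * R^2 ~ chi-squared(q) under H0.
import random

def ols_residuals_and_r2(y, X):
    """OLS by solving the normal equations (X'X)b = X'y; returns (residuals, R-squared)."""
    n, k = len(y), len(X[0])
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    M = [xtx[a][:] + [xty[a]] for a in range(k)]   # augmented matrix
    for col in range(k):                           # Gauss-Jordan elimination with pivoting
        piv = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(k):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[col])]
    b = [M[a][k] / M[a][a] for a in range(k)]
    resid = [y[i] - sum(bj * xij for bj, xij in zip(b, X[i])) for i in range(n)]
    ybar = sum(y) / n
    sst = sum((yi - ybar) ** 2 for yi in y)
    r2 = 1.0 - sum(e * e for e in resid) / sst if sst > 0 else 0.0
    return resid, r2

random.seed(1)
n, q = 500, 1                                      # q = one exclusion restriction
x1 = [random.gauss(0.0, 1.0) for _ in range(n)]
x2 = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [1.0 + 0.5 * x1[i] + random.gauss(0.0, 1.0) for i in range(n)]  # x2 truly excluded

# Step I: restricted model (x2 left out)
u_tilde, _ = ols_residuals_and_r2(y, [[1.0, x1[i]] for i in range(n)])
# Step II: auxiliary regression of the residuals on ALL regressors
_, r2_aux = ols_residuals_and_r2(u_tilde, [[1.0, x1[i], x2[i]] for i in range(n)])
# Step III: LM statistic
lm_stat = n * r2_aux
```

Step IV then compares lm_stat with the χ²(q) critical value (3.84 at the 5% level for q = 1).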
Caution:
 If in Step I we mistakenly regress y on all of the
independent variables and obtain the residuals
from this unrestricted regression, we do not get
an interesting statistic; the resulting R-squared
will be exactly zero.

 This is because OLS chooses the estimates so
that the residuals are uncorrelated in the sample
with all included regressors.

ii.) Wald test

 This test requires estimation of only the


unrestricted model.

 In the linear model case, the Wald Statistic,


after a simple transformation, is essentially the F
statistic, so there is no need to cover the Wald
statistic separately.

 This statistic is computed by econometrics
packages (such as STATA) that allow exclusion
restrictions to be tested after the unrestricted
model has been estimated.

 For instance, the command ‘test’ in STATA
performs Wald tests of simple and composite
linear hypotheses about the parameters of the
most recently fitted model.

 The statistic has a chi-square distribution, the
number of restrictions tested being equal to the
degrees of freedom.

 see Greene (2005) for the formula of this test.

 If both the restricted and unrestricted models


are easy to estimate, the LR test is attractive
relative to the Wald test.

 In conducting the above tests using
STATA, one needs good knowledge of
post-estimation commands.

The Multinomial Logit (Mlogit) model
 When we are interested in modelling decisions
among multiple alternatives, where the outcomes
are unordered, we cannot use ordered or
bivariate models.

 A typical example might be the decision to use
some form of public transport, where one is
confronted with the choice of using train, bus,
tube, taxi, etc.

 Here there is no obvious ordering or sequence to
these alternatives.
A more appropriate formulation of
the choice process in these
circumstances is the multinomial
logit or probit model (Nerlove and
Press, 1973) or the conditional logit
model (McFadden, 1974).

Imagine that we have M possible
alternatives to a discrete (multiple) choice,
each with an associated probability P_mi,
m = 1, ..., M, for i = 1, ..., N.

Essentially, the Mlogit expresses these
probabilities (relative to some benchmark
outcome, say P_Mi) in relation to some
non-linear transformation of a linear
combination of a set of k explanatory
variables x_i.

Suppressing the subscript i for ease of
exposition, let

  P_m / (P_m + P_M) = F(x'β_m)    (26)

for m = 1, ..., M-1, which implies that

  P_m / P_M = F(x'β_m) / (1 - F(x'β_m)) = ψ(x'β_m)    (27)

Since P_m ∈ (0,1), we therefore have that

  P_m / (P_m + P_M) → 0  as  P_m → 0, P_M → 1;
  P_m / (P_m + P_M) → 1  as  P_m → 1, P_M → 0.    (28)

Hence, it must also be the case that F(.) is
a monotone increasing function of its
argument, such that F(u) → 0 as u → -∞,
and F(u) → 1 as u → +∞. Since
Σ_{m=1}^{M} P_m = 1, we have that

M 1 Pj 1  PM 1

j 1 PM

PM

PM
1
(29)
which therefore defines the Mth
probabilities(the benchmark) as

1
 Pj 
M 1
PM  1   
 j 1 PM 
1
 M 1  (30)
 1    ( x'  j )
 j 1 
9/19/2023 119
and the remaining M-1 probabilities P_m as

  P_m = ψ(x'β_m) / [ 1 + Σ_{j=1}^{M-1} ψ(x'β_j) ]
      = exp(x'β_m) / [ 1 + Σ_{j=1}^{M-1} exp(x'β_j) ]    (31)

for all m = 1, ..., M-1.

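Equations (30) and (31) can be verified numerically. A minimal sketch (the index values x'β_j are hypothetical) that recovers all M probabilities from the M-1 odds exp(x'β_j):

```python
# Sketch of eqs. (30)-(31): given index values x'b_j for the M-1
# non-benchmark alternatives, recover all M probabilities.
import math

def mlogit_probs(indices):
    """indices: list of x'b_j for j = 1..M-1; returns [P_1, ..., P_{M-1}, P_M]."""
    denom = 1.0 + sum(math.exp(v) for v in indices)
    probs = [math.exp(v) / denom for v in indices]
    probs.append(1.0 / denom)  # benchmark alternative M, eq. (30)
    return probs

p = mlogit_probs([0.4, -1.1])  # hypothetical x'b_1 and x'b_2, so M = 3
```

By construction the M probabilities lie in (0, 1) and sum to one for any index values.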
In other words, each probability can be
expressed in terms of the set of
explanatory variables and an unknown set
of parameter vectors β_1, β_2, ..., β_{M-1}.

Estimation of Mlogit
In order to estimate the model, we need a
specification for ψ(.), and the formulation for the
mlogit turns out to be ψ(u) = exp(u), such that the
implied specification of F(.) is logistic (why?).

Estimation of the mlogit for a given sample of
data is by ML.

 The likelihood function itself is a logical
extension of the likelihoods in the binary case.

 In particular, we formulate the likelihood as
the product of the probabilities of each
observation, conditional on the data and the
parameters of the model.

 For the ith observation in the sample, the
likelihood contribution is:

  l_i = Π_{m=1}^{M} Pr(y_i = m | x_i)^(z_im) = Π_{m=1}^{M} P_im^(z_im)    (32)

where z_im = 1(y_i = m), for m = 1, ..., M, and P_im
represents the probability that the ith dependent-
variable observation takes the mth value.

The overall likelihood can then be written as

  L(β_1, β_2, ..., β_{M-1}) = Π_{i=1}^{n} Π_{m=1}^{M} P_im^(z_im)    (33)

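The likelihood contribution in eq. (32) collapses to the probability of the outcome actually observed, because z_im is one for that outcome and zero otherwise. A toy sketch (the fitted probabilities are made up):

```python
# Sketch of eqs. (32)-(33) in log form: each observation contributes the
# log of the probability of its observed outcome.
import math

def log_likelihood(prob_rows, outcomes):
    """prob_rows[i][m-1] = P_im; outcomes[i] is the observed m in {1, ..., M}."""
    ll = 0.0
    for probs, m in zip(prob_rows, outcomes):
        ll += math.log(probs[m - 1])  # z_im = 1 only for the observed m
    return ll

# Hypothetical fitted probabilities for 3 observations, M = 3 alternatives
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.1, 0.8]]
y = [1, 2, 3]
ll = log_likelihood(P, y)
```

ML estimation searches over the parameter vectors β_1, ..., β_{M-1} (which determine the P_im) to make this log-likelihood as large as possible.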
 with the probabilities defined using the
formulae derived above.

 Maximisation of the (log-)likelihood yields ML
estimates that best replicate the observed
data.

 The simplicity of the Mlogit lies in the ease with
which the probabilities can be calculated. In
general, logit models have ‘nice’
computational properties.

Additional Example
Consider the outcomes 1, 2, 3, ..., m
recorded in y, and the explanatory
variables X. Assume there are m = 3
outcomes (e.g. choice of health care
provider: public hospital, private hospital
and traditional healer) with probabilities
Pr(y=1), Pr(y=2) and Pr(y=3).

In the multinomial logit model, you
estimate a set of coefficients β_1, β_2 and
β_3, corresponding to each outcome:
e x1
Pr( y  1) 
e x1  e x 2  e x 3

e x 2
Pr( y  2)  x1 x 2 x 3
e e e

x 3
e
Pr( y  3)  x1 x 2 x 3
e e e

9/19/2023 127
The model, however, is unidentified in the
sense that there is more than one solution
for β_1, β_2 and β_3 that leads to the same
probabilities for y=1, y=2 and y=3.

To identify the model, you arbitrarily set
one of β_1, β_2 or β_3 to 0; it does not
matter which. That is, if you arbitrarily set
β_1 = 0, the remaining coefficients β_2 and
β_3 will measure the change relative to
the y=1 group.
When you change the reference
category, the coefficients will differ
because they have different
interpretations, but the predicted
probabilities for y = 1, 2 and 3 will still be
the same. Thus, either parameterisation is
a solution to the same underlying model.

Setting β_1 = 0, the equations become:
  Pr(y=1) = 1 / (1 + e^(xβ_2) + e^(xβ_3))

  Pr(y=2) = e^(xβ_2) / (1 + e^(xβ_2) + e^(xβ_3))

  Pr(y=3) = e^(xβ_3) / (1 + e^(xβ_2) + e^(xβ_3))

The relative probability of y=2 to the base
outcome is

  Pr(y=2) / Pr(y=1) = e^(xβ_2)

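The normalisation can be checked directly. In this sketch the index values x'β_2 and x'β_3 are hypothetical; it confirms that the three probabilities sum to one and that the relative probability Pr(y=2)/Pr(y=1) equals exp(x'β_2):

```python
# Numerical sketch of the normalisation beta_1 = 0 for a 3-outcome mlogit.
import math

xb2, xb3 = 0.8, -0.4   # hypothetical x'b_2 and x'b_3 after setting b_1 = 0
denom = 1.0 + math.exp(xb2) + math.exp(xb3)

p1 = 1.0 / denom
p2 = math.exp(xb2) / denom
p3 = math.exp(xb3) / denom

rel_2_vs_1 = p2 / p1   # equals exp(x'b_2): relative probability to the base
```

Note that the common denominator cancels in the ratio, which is why the relative probability depends only on the coefficient vector of the comparison outcome.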
The Hessian of the log-likelihood of the
mlogit model is everywhere negative
definite. Therefore, any stationary value is
a global maximum.

In fact, this is true if we rule out linear


dependence among regressors.

The mlogit model can be derived from a


latent variable model under the
assumption of iid disturbances.

The errors in the latent variable model
giving us the mlogit model may have the
conventional interpretation of the impact
of factors known to the decision maker
but not to the observer/econometrician.

Note that the mlogit model is used in


psychometrics and is termed the Luce
strict utility model (see Maddala, 1983).

