
Limited Dependent Variable Models¹

Review of the classical linear regression model framework:

$$y = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k + \varepsilon$$

where we made several assumptions for OLS to have good properties, one of which was that the error term is normal, which implies that y is normal and hence has a continuous distribution.

After estimating this model, we have $\hat{\beta}_j = \partial \hat{y} / \partial x_j$, which is the (continuous) change in the estimated expected level of y as $x_j$ changes, ceteris paribus.

Today we will depart from the OLS estimation of the linear regression model.
We will deal with cases where:
- The dependent variable in the model will no longer be continuously varying over an infinite range, as we have assumed in the OLS world.
- Most models we will cover will not be linear in parameters, so the Ordinary Least Squares (OLS) technique will not be applicable for estimating these models. Instead we will see how Maximum Likelihood Estimation (MLE) is used to estimate them.
- The parameter estimates from these models will no longer represent the marginal change in the expected outcome, so we will have to learn new ways of interpreting model estimates and finding marginal effects.

The models can be classified into two main groups: qualitative response models and models with a limited dependent variable.
I. Qualitative response (discrete choice) models:
1. Binary choice models (logit, probit)
Example: being in the labor force (=working or looking for work)
o No (=0)
o Yes (=1)
2. Multiple-choice models: choosing from unranked options (multinomial, conditional, or mixed logit models)
Example: choice of candidates for the Democratic Party (US)

¹ Based on lecture notes for the Advanced Econometrics course by Karine Torosyan


o Yang (=1)
o Biden (=2)
o Warren (=3)
o Sanders (=4)
3. Ordered choice models: choosing from ranked options (ordered probit
model)
Example: citizens' trust in government:
o Fully distrust (=1)
o Somewhat distrust (=2)
o Neutral (=3)
o Somewhat trust (=4)
o Fully trust (=5)
4. Count data models
Example: number of cars owned by a family (y=0, 1, 2, 3, 4, …, 9, …).
Once again, this is a discrete outcome, but now we have order and a unit
of measurement (=one car) in the data.

II. Models with a limited dependent variable:

In the "OLS world" we require that the dependent variable (y) is a continuous random variable with unlimited range: we assume that y can take values ranging from minus to plus infinity.
But there are clearly cases when this assumption is not right. In some situations,
the outcome variable is continuous in nature, but its observed range is limited
for whatever reasons. In those cases, we can’t model the distribution of y as
having an infinite range (normal). Instead, we have to consider the limitations y
has on its range when modeling its distribution.
1. Truncated data models (truncated regression) – an example with traffic data: suppose driving speed is recorded only for cars travelling at 80 or faster. If you want to use these data to model how driving speed depends on control variables, your speed values (= y) will lie in the range [80; +∞): you will have a truncated sample rather than a random sample of cars, and OLS will not be an appropriate estimation technique in this case.

2. Censored data models (censored or Tobit regression) – the Philharmonic example. Suppose you want to study the demand for concert tickets using data on ticket purchases for concerts taking place in the Tbilisi Philharmonic, which has a seating capacity of 10,000 people. All levels of demand above that capacity will be recorded as 10,000: demand is censored from above. To estimate the demand function for tickets correctly (note: we study demand, not sales!) you have to take this censoring into account.

3. Sample selection models or incidentally truncated models (Heckman selection model) – labor force data. Suppose you want to estimate a wage equation:

$$Wage = \beta_0 + \beta_1 Education + \ldots + \beta_k Gender + \varepsilon$$

but you have wage data only for individuals who are in the labor force, while you are missing wages for those individuals who are currently out of the labor force. If you only use data for people in the labor force to estimate the relationship of interest, you will no longer have a random sample of the population: people who are in the labor force might be a selective subsample of the population. In this case OLS will no longer be unbiased, and we have to find a better way of estimating the model.

Our focus will be on binary choice models (logit & probit) and the optimization
approach you need to take in order to estimate these models.

Can't we use the OLS approach to deal with a binary dependent variable and estimate a linear probability model?
A Linear Probability Model (LPM) is a regression model used to estimate the probability of a binary outcome (e.g., yes/no, 0/1) as a linear function of explanatory variables. It is a special case of linear regression where the dependent variable is binary.
Technically, we could use a linear probability model when our dependent variable is binary and estimate it using the Ordinary Least Squares approach, but there are certain problems associated with this approach.
Some of the undesirable properties of the linear probability model estimated via OLS include:
 Fit of the regression line is not flexible enough (linear vs. logistic)
 Estimated (fitted) values of Y can fall above 1 or below 0, while their interpretation is "probability of Y=1"

Linear probability model:
- OLS

Non-linear probability models:
- Probit
- Logit

Let's look a bit closer at what we would get if we estimated a model with a binary dependent variable (labor force participation) using a linear model and the OLS technique:
Model #1: Linear Probability Model (LPM) – estimated using OLS
A simple linear regression model where Yi is a binary dependent variable:
reg lfp k5 k618 age wc hc lwg inc

$$E(Y \mid X_1, X_2, \ldots, X_k) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_k X_{ki}$$

where $E(Y \mid X_1, X_2, \ldots, X_k)$ for a binary Y variable is simply equal to:

$$E(Y \mid X_1, \ldots, X_k) = 1 \cdot P(Y=1 \mid X_1, \ldots, X_k) + 0 \cdot P(Y=0 \mid X_1, \ldots, X_k) = P(Y=1 \mid X_1, \ldots, X_k)$$

So, we get that:

$$E(Y \mid X_1, \ldots, X_k) = P(Y=1 \mid X_1, \ldots, X_k) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_k X_{ki}$$

$\beta_j$ can be interpreted as the change in the probability that $Y_i = 1$ associated with a unit change in $X_j$, holding the other regressors constant. If $\beta_{income} = 0.09$, that means that as average income increases by $1, the probability of LF participation increases by 9 percentage points.
If the intercept ($\beta_0$) coefficient is positive, the predicted probability of participating in the LF is positive even if all variables are 0.

Interpretation of fitted values (y_hat): if the fitted value for a particular set of X's is 0.70, this means individuals with those X values have a 70% chance of having a score of 1 (being in the labor force). In other words, we would expect that 70% of the people who have this particular combination of values on X would fall into category 1 of the dependent variable, while the other 30% would fall into category 0. But the problem is that fitted values can go beyond 0 and 1, so they are not exactly probabilities.
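To make this concrete, here is a minimal sketch on simulated data (not the lfp dataset; all names are illustrative) showing that an LPM fitted by OLS produces fitted values outside [0, 1]:

```python
# A minimal sketch (simulated data): fit an LPM by OLS and show that some
# fitted values fall outside the [0, 1] range of a probability.
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(0, 2, n)                      # a single regressor
p = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))       # true P(Y=1) is logistic in x
y = rng.binomial(1, p)                       # binary outcome

X = np.column_stack([np.ones(n), x])         # add an intercept column
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS estimates
fitted = X @ beta_hat

print("share of fitted values below 0:", np.mean(fitted < 0))
print("share of fitted values above 1:", np.mean(fitted > 1))
```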

So, how can we move from this linear probability model to a non-linear probability model, and how can we estimate it?
What the logit model tries to estimate is the following model equation:

$$\ln\left(\frac{P(Y=1 \mid X)}{1 - P(Y=1 \mid X)}\right) = \beta_0 + \beta_1 X_1 + \ldots + \beta_k X_k$$

where the coefficients represent the change in the log odds (the left-hand side of the equation).
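For example, if $P(Y=1 \mid X) = 0.75$, the odds are $0.75/0.25 = 3$ and the log odds are $\ln 3 \approx 1.10$.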
But let’s see where this is coming from:
Non-linear probability models: logit & probit
The binary choice model is derived from an underlying linear model with a
continuous response/dependent variable yi* being determined by observable
factors (x1…xk) and an error:

yi *  x i ' β  ui   0  1 x1i  ...   k xki  ui (1)


But in practice, $y_i^*$ is unobservable. What we observe is the binary variable $y_i$ defined as:

$$y_i = 1 \text{ if } y_i^* \ge 0, \qquad y_i = 0 \text{ if } y_i^* < 0$$

In model (1) let's assume that the error term is normal, $u_i \sim N(0, \sigma^2)$. In this case the conditional distribution of $y_i^*$ is $N(x_i'\beta, \sigma^2)$, and we can write:

$$P(y_i = 1) = P(y_i^* \ge 0 \mid x_i) = P\left(\frac{y_i^* - x_i'\beta}{\sigma} \ge \frac{0 - x_i'\beta}{\sigma}\right) = P\left(z_i \ge -\frac{x_i'\beta}{\sigma}\right) = P\left(z_i < \frac{x_i'\beta}{\sigma}\right) = \Phi\left(\frac{x_i'\beta}{\sigma}\right)$$

which is the CDF (the probability, or area, below that value), and it depends on the individual characteristics (X's) of a particular person.

Since $P(y=0) = 1 - P(y=1)$, we have:

$$P(y_i = 0) = 1 - \Phi\left(\frac{x_i'\beta}{\sigma}\right) \quad (\text{the probability above that value})$$
We can put these two probabilities into one expression:
$$P(y_i \mid x_i, \beta, \sigma) = \left[\Phi\left(\frac{x_i'\beta}{\sigma}\right)\right]^{y_i} \left[1 - \Phi\left(\frac{x_i'\beta}{\sigma}\right)\right]^{1-y_i}$$
What happens to this expression when $y_i$ is equal to 1 (meaning that the person participates in the labor force)? What happens to this expression when $y_i$ is equal to 0?
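As a quick numeric check (with a hypothetical standardized index value $x_i'\beta/\sigma = 0.8$), the exponents switch the combined expression between $\Phi(\cdot)$ when $y_i = 1$ and $1 - \Phi(\cdot)$ when $y_i = 0$:

```python
# A small numeric check with a hypothetical index value xb = x_i'beta/sigma:
# the exponents switch the expression between Phi(xb) and 1 - Phi(xb).
from scipy.stats import norm

xb = 0.8                                           # hypothetical index value
for y in (1, 0):
    contribution = norm.cdf(xb) ** y * (1 - norm.cdf(xb)) ** (1 - y)
    print(f"y={y}: likelihood contribution = {contribution:.4f}")
# y=1 gives Phi(0.8) ~ 0.7881; y=0 gives 1 - Phi(0.8) ~ 0.2119
```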
We need to develop some optimization process that will use the given dataset
(all n observations!) and will produce the “best” guess of unknown parameters
(betas and sigma). One of the key steps in the process is finding a good way to
put all the data together – go from looking at one observation at a time to
having all n observations aggregated together.
In the OLS case we defined the squared distance from the true regression line for each observation/point and then aggregated this measure across all observations.
In the case of a discrete variable we ask the following question: what should the optimal regression coefficients and sigma be, given the data that we observe? Based on this question, and by pooling together the "individual likelihood functions" of all n individuals, we build the likelihood function for the given sample (the entire sample with n observations):
$$L(\beta, \sigma \mid y, x) = \prod_{i=1}^{n} \left[\Phi\left(\frac{x_i'\beta}{\sigma}\right)\right]^{y_i} \left[1 - \Phi\left(\frac{x_i'\beta}{\sigma}\right)\right]^{1-y_i}$$

Main question: given the observed data y, what distribution (what parameters) is the most likely to have generated these data? In other words: what θ is the most compatible with our observed sample y?
Given the likelihood function, the goal is to find the θ that maximizes the likelihood of the sample => hence the name: Maximum Likelihood Estimation.
MLE is a popular technique because it is flexible and because MLE estimators have good large sample properties: they are consistent, efficient, and asymptotically normally distributed, given certain assumptions.
So, which distribution (and corresponding CDF) should we assume?

The logit model assumes that the error terms follow a logistic distribution, while the probit model assumes that the error terms follow a normal distribution.

We substitute the chosen CDF into the likelihood function and then optimize in order to find the values of the betas and sigma; for the probit case:

$$L(\beta, \sigma \mid y, x) = \prod_{i=1}^{n} \left[\Phi\left(\frac{x_i'\beta}{\sigma}\right)\right]^{y_i} \left[1 - \Phi\left(\frac{x_i'\beta}{\sigma}\right)\right]^{1-y_i}$$

From these likelihood functions we find the necessary coefficients!

This is a standard optimization problem. To solve it, we need to:
 Derive the First Order Conditions – F.O.C. and solve those:
o Differentiate L with respect to each component in θ (if there are k
unknown parameters, there will be k derivatives)
o Set each of the k derivatives to 0 to generate the system of k
equations
o Solve this system (if possible) for unknown parameters (θ)
 Check the Second Order Conditions (S.O.C):
o Derive the Hessian matrix for this problem
o Check that the Hessian matrix is negative definite

Since the F.O.C. require taking derivatives of a product of (often complicated) functions of the unknown parameters, this step might be quite challenging. It is usually simpler to work with the log of the likelihood function (lnL), which allows us to replace optimization of a multiplicative function with optimization of an additive function, which is much easier.
In any case, given that $\frac{\partial \ln L}{\partial \theta} = \frac{1}{L}\frac{\partial L}{\partial \theta}$ and $L > 0$, maximization of L and maximization of lnL will give the same result.

If we take the log of the likelihood function, we will get the log-likelihood
function:

$$\ln L = \ln\left[L(\theta_1, \theta_2, \ldots, \theta_k \mid y)\right] = \ln\left[\prod_{i=1}^{n} l_i(\theta_1, \theta_2, \ldots, \theta_k \mid y_i)\right] = \sum_{i=1}^{n} \ln l_i(\theta_1, \theta_2, \ldots, \theta_k \mid y_i)$$

The logit model will have the following log-likelihood function (here $\Lambda(z) = e^z/(1+e^z)$ is the logistic CDF):

$$L(\beta \mid y, x) = \prod_{i=1}^{n} \left[\Lambda(x_i'\beta)\right]^{y_i} \left[1 - \Lambda(x_i'\beta)\right]^{1-y_i}$$

$$\ln L(\beta \mid \text{data}) = \sum_{i=1}^{n} \left[ y_i \ln \Lambda(x_i'\beta) + (1 - y_i) \ln\left(1 - \Lambda(x_i'\beta)\right) \right]$$

And by maximizing this function, we can derive the (k+1) parameter estimators: β0, β1, …, βk (note: we are not estimating σ), which results in the vector of coefficients.
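Here is a sketch of this optimization on simulated data (illustrative names, not the lfp dataset): we code the logit log-likelihood from above, using $\ln \Lambda(z) = z - \ln(1+e^z)$, and hand its negative to a numerical optimizer instead of solving the F.O.C. analytically.

```python
# A sketch of MLE for the logit model on simulated data: minimize the
# negative log-likelihood numerically rather than solving the F.O.C. by hand.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([0.3, 1.0, -0.5])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))   # P(y=1) = Lambda(x'beta)

def neg_loglik(beta):
    xb = X @ beta
    # ln L = sum_i [ y_i * xb_i - ln(1 + exp(xb_i)) ]
    return -np.sum(y * xb - np.log1p(np.exp(xb)))

res = minimize(neg_loglik, x0=np.zeros(3), method="BFGS")
print("MLE estimates:", res.x)    # should be close to beta_true
```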
Predicting probabilities
After estimating a probit or logit model, we can use the values of the parameter estimates to find the predicted probability of success:

$$\hat{P}(y_i = 1) = F\left(\hat{\beta}_0 + \hat{\beta}_1 age_i + \hat{\beta}_2 educ_i + \hat{\beta}_3 exper_i + \ldots + \hat{\beta}_k kids_i\right)$$
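A sketch with statsmodels (assumed installed), again on simulated data: after fitting a logit model, .predict() returns the fitted probabilities $F(x_i'\hat{\beta})$ directly, and they always lie inside (0, 1):

```python
# A sketch: fit a logit model with statsmodels and predict probabilities.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 500
X = sm.add_constant(rng.normal(size=(n, 2)))     # intercept + 2 regressors
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([0.2, 0.8, -0.4])))))

res = sm.Logit(y, X).fit(disp=0)
p_hat = res.predict(X)                           # predicted P(y_i = 1)
print(p_hat.min(), p_hat.max())                  # strictly inside (0, 1)
```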

Interpretation of coefficients in Logit and Probit:

We can't directly interpret the magnitude of the coefficients as representing marginal effects. We can only interpret their signs. We'll see below how to derive marginal effects.
The marginal effect in a Logit/Probit model is the change in the probability of success when some x-variable changes (holding everything else constant):

$$\frac{\partial E(y_i \mid X)}{\partial x_{ij}} = \frac{\partial P(y_i = 1 \mid X)}{\partial x_{ij}}$$

This shows how P(Y=1) changes if x changes by 1 unit. The marginal effect can be derived as:

$$ME_{j,i} = \frac{\partial P(y_i \mid X)}{\partial x_{ij}} = \frac{\partial \Phi(x_i'\beta)}{\partial x_{ij}} = \frac{\partial \Phi(x_i'\beta)}{\partial (x_i'\beta)} \cdot \frac{\partial (x_i'\beta)}{\partial x_{ij}} = \phi(x_i'\beta)\,\beta_j = f_i \cdot \beta_j$$

where $\phi(x_i'\beta)$ is the PDF corresponding to the CDF $\Phi(x_i'\beta)$.

Since the value of the PDF is always positive, the sign of the marginal effect always matches the sign of $\beta_j$ obtained from the regression.
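A sketch of this formula in code (simulated data, illustrative names): the average marginal effect for a probit model is the sample mean of $\phi(x_i'\hat{\beta})\hat{\beta}_j$, which we can check against statsmodels' built-in get_margeff():

```python
# A sketch of probit marginal effects: ME_{j,i} = phi(x_i'beta) * beta_j,
# averaged over observations to get the average marginal effect (AME).
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(3)
n = 500
X = sm.add_constant(rng.normal(size=(n, 2)))
y = rng.binomial(1, norm.cdf(X @ np.array([0.2, 0.8, -0.4])))

res = sm.Probit(y, X).fit(disp=0)
ame_manual = (norm.pdf(X @ res.params)[:, None] * res.params).mean(axis=0)
print("manual AME:  ", ame_manual[1:])             # skip the intercept
print("statsmodels: ", res.get_margeff().margeff)  # default is the AME
```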
Odds ratios:
Studies using logit regression often report odds ratios, which describe the effect of each explanatory variable X on the odds of the outcome, P(Y=1)/P(Y=0): in the logit model, a one-unit increase in $X_j$ multiplies the odds by $e^{\beta_j}$.
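A minimal sketch of the conversion (hypothetical coefficient values, not estimates from any real dataset):

```python
# Converting logit coefficients into odds ratios: exp(beta_j).
import numpy as np

params = np.array([0.2, 0.8, -0.4])   # hypothetical logit estimates
print(np.exp(params))                 # e.g., exp(0.8) ~ 2.23 means a one-unit
                                      # increase in X_2 multiplies the odds by ~2.23
```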
Example: let's take a look at an empirical study using logistic regression:
https://www.sciencedirect.com/science/article/pii/S1877050920319505?ref=pdf_download&fr=RR-2&rr=81901a6199245aa1
Practice example: Suppose you have data on labor force participation: you observe whether a person i is in the workforce (yi=1) or not (yi=0). In this case y is a binary (discrete) variable. Given a sample of n observations on y: (0,0,1,…,0,1), what is the best estimate of the probability of success (Yi=1), assuming that each Yi follows a Bernoulli distribution:

$$P(y_i = 1) = \theta \quad \text{and} \quad P(y_i = 0) = 1 - \theta$$

where the parameter θ is the probability of success (success = being in the workforce) from the interval [0; 1]. The only unknown parameter of this distribution is θ. Given the observed sample, can we estimate it?
For MLE we start by writing the individual probability function and then the joint one for all n observations:

$$P(y_i) = \theta^{y_i}(1-\theta)^{(1-y_i)}$$

$$P(y_1, \ldots, y_n \mid \theta) = \prod_{i=1}^{n} \theta^{y_i}(1-\theta)^{(1-y_i)} = \theta^{\sum_{i=1}^{n} y_i}(1-\theta)^{\sum_{i=1}^{n}(1-y_i)}$$

If you have 10 observations looking like this (0,1,0,0,1,1,0,0,0,1), then $\sum y_i = 4$: this sum counts the "success" (=1) outcomes; and $\sum(1-y_i) = 6$: this sum counts the number of "failure" (=0) outcomes.
If we look at this as a function of θ, where we know all yi’s -
(0,1,0,0,1,1,0,0,0,1), we will have the likelihood function:
$$L(\theta \mid y) = L(\theta \mid y_1, \ldots, y_n) = \theta^{\sum_{i=1}^{n} y_i}(1-\theta)^{\sum_{i=1}^{n}(1-y_i)} \qquad \left(\Rightarrow L(\theta \mid y) = \prod_i l_i, \text{ where } l_i = l(\theta \mid y_i) = \theta^{y_i}(1-\theta)^{1-y_i}\right)$$

The log-likelihood function will be:

$$\ln L(\theta \mid y) = \sum_{i=1}^{n} y_i \ln\theta + \sum_{i=1}^{n}(1-y_i)\ln(1-\theta)$$

Based on the log-likelihood function, we can do the maximization and derive the F.O.C. with respect to our parameter (theta). Setting the F.O.C. equal to zero, we can solve for θ:

$$g(\theta \mid y_1, \ldots, y_n) = \frac{1}{\theta}\sum_{i=1}^{n} y_i - \frac{1}{1-\theta}\sum_{i=1}^{n}(1-y_i) = 0$$

Solve:

$$(1-\theta)\sum_{i=1}^{n} y_i = \theta\left(n - \sum_{i=1}^{n} y_i\right)$$

$$\sum_{i=1}^{n} y_i - \theta\sum_{i=1}^{n} y_i = n\theta - \theta\sum_{i=1}^{n} y_i \;\Rightarrow\; \hat{\theta} = \frac{\sum_{i=1}^{n} y_i}{n} = \bar{y}$$
Basically, we find that the sample mean is the MLE estimator of the unknown parameter θ.
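As a quick numeric check, we can maximize this Bernoulli log-likelihood for the sample above and confirm that the optimizer lands on the sample mean:

```python
# Numeric check: the MLE of theta for (0,1,0,0,1,1,0,0,0,1) equals y-bar.
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([0, 1, 0, 0, 1, 1, 0, 0, 0, 1])

def neg_loglik(theta):
    return -(y.sum() * np.log(theta) + (len(y) - y.sum()) * np.log(1 - theta))

res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print("MLE of theta:", res.x)     # ~0.4
print("sample mean: ", y.mean())  # 0.4
```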
Final comment:
Logit and probit models follow a similar procedure for estimating the regression coefficients (betas); however, they assume different error distributions (logistic or normal), and thus their likelihood functions also look different. Still, the steps used in the optimization and derivation of the coefficients in these models are similar to the steps used in this example.
