Werkstuk Wittink - tcm235 237206
Choice modelling
An overview of theory and development in individual choice behaviour modelling

L.T. Wittink
BMI Paper
August 2011
Supervised by Alwin Haensel
Table of contents

Introduction
Choice Theory
  Framework
  Rational Behaviour
  Discrete and Probabilistic Choice Theory
    Discrete Choice Theory
    Probabilistic Choice Theory
  Utility Theory
    Cardinal Utility
    Ordinal Utility
    Constant Utility
    Random Utility
    Expected Utility
  Stated and Revealed Preference
    Stated Preference
    Revealed Preference
  Exogenous-, Locational- and Utility-Based Choice Models
    Exogenous-based Models
    Locational-based Models
Binary Choice Models
Logit and Probit
  Linear Probability Models
  Probit
  Logit
  Estimation
  Alternative estimation models
Multinomial Logit
  Multinomial choice
  Multinomial logit
  Estimation
Nested Logit
  Multidimensional choice sets
  Nested Logit
  Estimation of Nested Logit
  Higher level Nested Logit and expansion on the Nested Logit Model
  Cross-Nested Logit
  Estimation of Cross-Nested Logit
Mixed Logit
  Estimation
  Repeated Choice
Latent Class Logit
  Estimation
Variations: different choice models
  The Generalized Extreme Value Model
  Joint Logit
  Multinomial Probit
  Mixed Probit
Further Research
Summary
Acknowledgement
References
“As far as the laws of math refer to reality, they are not certain; and as far as they are
certain, they do not refer to reality.”
Albert Einstein (1879 – 1955)
Introduction
To some degree, all decisions, or even most of the actions we take in life, involve choice. When we go to the supermarket, we have to decide on the way to travel. At the supermarket we have to choose from a selection of vegetables, for example. At home we have to decide on what to cook, et cetera. A day in our life is full of sequences of choices we have to make. And not just for us: everyone in society goes through similar thought processes. The fact that a whole population goes through such processes makes them worth investigating. If it were possible to indicate how a population behaves in certain processes, these processes could be adjusted accordingly. If it were possible to discover some pattern in behaviour, or even better, to discover a certain demand in a process, one could adjust to these discoveries. This could of course be of great help.

Thurstone (1927) is often said to be one of the first to do research in individual choice behaviour, in his case into food preferences. He is considered the first to describe this preference with some sort of utility function. Nowadays choice models are used in various areas: for example psychology, transport, energy, housing, marketing, voting and many more.

Since Thurstone's research there has been considerable development in choice models. As in most fields of research, a new topic often triggers more, and more elaborate, research. Since the late 1920s new models have been developed, theories have been adjusted and original assumptions can now be avoided. As newer models were developed, not all of them were applicable due to computational constraints. With technological advancements in society, choice models that were unusable suddenly became usable. Models that were not usable thirty years ago are usable now, and computational possibilities only increase.

In this paper an attempt is made to describe the theory behind choice models and, after that, the actual models. The theory behind the models can be considered a framework, some sort of foundation for the models. To be able to understand the models, where they come from and what assumptions are made, this framework will be discussed first. In this first section on individual choice behaviour the framework will be explained, and most of the definitions needed for the choice models will be given. After that comes the section on choice models, where the most important and, through time, most used and most referenced models will be discussed. No derivations will be given here; for derivations of the models the reader is referred to more extensive literature such as Ben-Akiva and Lerman (1985) and Train (2003). Finally the paper will be concluded with a section containing the summary, comments and acknowledgement. Included in this section is a chapter with variations on the models discussed before. Unfortunately it was not possible to include all these models in a more elaborate way; choices had to be made on what literature to discuss. The models that are discussed build on each other and are instrumental either because they are often referenced and important in the development of new models, or because they are currently used. I hope the reader finds this paper informative and insightful and, in the end, has a better understanding of how choice models work and how they have developed over the years.
Individual Choice Behaviour: Framework
“Go down deep into anything and you will find mathematics.”
Charles Schlichter (unknown)
Choice Theory

Observing the choices of one individual is interesting, but when statements can be made about a larger group of individuals, or even a whole population, then really something can be achieved. We could therefore state that we are not just interested in the choices of a single person, but rather in those of large groups of individuals. Think of market demand for some kind of service or commodity. Predicting that demand can be done by modelling individual choice behaviour, thus with the use of choice models. This chapter will mostly be used to describe principles of choice theories and to give a framework, which will be useful when formulating the different discrete choice models.

When examining the behaviour of individuals, in theory we look for a description of behaviour that is, according to Ben-Akiva and Lerman (1985), descriptive, abstract and operational. Descriptive, so that the theory describes how individuals actually behave and not how we expect individuals to behave. We would also like to formalize their behaviour independent of specific circumstances, therefore abstract. Finally we look for an operational theory, meaning that it results in actual models with measurable parameters and variables, or at least parameters and variables that can be estimated. However, there is no choice theory that satisfies all these requirements. There are choice theories that have these requirements as an ideal, though different models differ in the level of detail in which they idealize the thought process behind observed behaviour. There are some common assumptions though, which are used for the different models. These assumptions will be described as a framework for the models that will be described later on.
Framework

Ben-Akiva and Lerman (1985) state that 'a choice can be viewed as an outcome of a sequential decision-making process that includes the following steps:'

1. Definition of the choice problem
2. Generation of alternatives
3. Evaluation of attributes of the alternatives
4. Choice
5. Implementation

This means that a choice is not viewed as a single choice at a specific time, but rather as a process. An example would be the way someone travels to work. He could take the bus, go by car, take the bike or walk. Here the definition of the problem would be: how to get to work? The alternatives are stated above. Now the choice does not depend on the alternatives themselves, but rather on their characteristics, or attributes: how expensive is every alternative? How much time would every alternative take? Is it really feasible to walk, meaning what level of comfort does it provide? Eventually the decision maker applies some decision rule, which is some sort of calculation to select the best alternative.

In order to define the process above, we need to define the elements decision maker, alternatives, attributes of alternatives and decision rule. Note that we consider an actual decision-making process here; choices following from habit, intuition, imitation or any other form of behaviour without a rational process are represented as a choice process with only one alternative. Rational behaviour will be discussed later.
The decision maker can be an individual, but also a household, a family or an organization. Because in this case we consider the organization as an individual, we abstract away the internal interactions. We are not so much interested in the different individual choices, because every individual or family has different interests and backgrounds. We are more interested in predicting aggregate demand, though we must still treat the differences in decision-making behaviour explicitly.
Luce (1959) defined the different choices in a situation as alternative choices, or simply alternatives, and every choice is made from a set of alternatives. The environment of the decision maker determines the universal set of alternatives, but a single decision maker does not consider all alternatives. For example, when one goes to work, the alternative of taking the car could be excluded because the decision maker does not own a car. Therefore, each decision maker considers not the universal set of alternatives, but a subset thereof. This subset is called a choice set. This set includes the alternatives that are feasible and known during the decision process. Finally, all alternatives should be mutually exclusive, the choice set needs to be exhaustive (meaning that all possible alternatives are included in the choice set) and the number of alternatives in the choice set must be finite.
As stated before, choices are not based on the alternatives as such, but rather on the characteristics, or attributes, of the alternatives. Ben-Akiva and Lerman (1985) state that 'the attractiveness of an alternative is evaluated in terms of a vector of attribute values.' Attributes can be ordinal or cardinal.
If there are multiple alternatives in a choice set, the decision maker needs a decision rule to choose. The decision rule describes the internal process used by a decision maker to process the available information and make a unique choice. Slovic (1977) and Svenson (1979) give rules that can be classified in four categories:

1. Dominance means that an alternative is better than another alternative when at least one attribute is better and all other attributes are no worse. In most situations this does not lead to a unique choice. It is more often used to exclude the worse alternatives from the choice set. It can be made more complex by using a threshold: one attribute is only better if the difference between both alternatives exceeds a certain threshold.

2. Another decision rule concerns a level of satisfaction. This means that every attribute of an alternative must attain a level of satisfaction. This level is set by the decision maker and should be attainable.

3. The third type of decision rule is called lexicographic rules. This means that the attributes are ordered by importance. The decision maker chooses by the attribute he values the most. If attributes are qualitative, all alternatives that do not possess the desired quality will be excluded from the choice set. In case the decision maker cannot make a decision based on the most important attribute, he will continue and try to make a decision based on the second most important attribute.

4. The last type of decision rule assumes that a vector defining an objective function, which expresses the attractiveness of the attributes of an alternative, expresses the attractiveness of the alternative as a whole. This attractiveness is referred to as the utility. The utility is a measure that the decision maker tries to maximize. In the section on utility theory this will be discussed more elaborately.

Of these four categories, the utility has been used most in recent models. We mostly refer to the utility as a function (of a vector of attributes), the utility function.
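The first and third of these decision rules can be made concrete with a small sketch. The alternatives, attribute values and the convention that higher values are better are purely hypothetical assumptions for illustration, not taken from the paper:

```python
# A minimal sketch of dominance-based elimination and a lexicographic choice.
# Attribute values are hypothetical; higher is assumed better for every attribute.

def dominates(a, b):
    """a dominates b: at least one attribute strictly better, none worse."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def dominance_filter(alternatives):
    """Exclude alternatives that are dominated by some other alternative."""
    return {name: attrs for name, attrs in alternatives.items()
            if not any(dominates(other, attrs)
                       for o, other in alternatives.items() if o != name)}

def lexicographic_choice(alternatives, order):
    """Choose by the most important attribute first, break ties with the next."""
    return max(alternatives,
               key=lambda name: tuple(alternatives[name][i] for i in order))

# attributes: (comfort, speed, cheapness) -- purely illustrative values
alts = {"car": (3, 3, 1), "bus": (2, 2, 2), "walk": (1, 1, 3), "old bus": (1, 2, 2)}

surviving = dominance_filter(alts)   # "old bus" is dominated by "bus"
choice = lexicographic_choice(surviving, order=[0, 1, 2])  # comfort first
print(sorted(surviving), choice)
```

Dominance alone leaves three alternatives in the choice set; the lexicographic rule is then needed to arrive at a unique choice.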
Rational Behaviour

The term rational behaviour is based on the beliefs of an observer about what the outcome of a decision maker's decision should be. It seems only natural that different observers have different beliefs; therefore there cannot be one universal type of rational behaviour. Thus rationality as such is not really a useful concept when applied to individual choice behaviour. The concept described in the literature is one opposing impulsiveness, which means that individuals do not make decisions based on their variable psychological state at the time the decision has to be made. It means that a decision follows a consistent, calculated decision process. This does not mean that the individual cannot follow his or her own objectives.

In 1957 Simon described the distinction between what he called perfect and bounded rationality. In a perfect world an all-knowing individual exists, capable of gathering and storing large quantities of data, performing complex calculations on these data and making consequent decisions based on them. Bounded rationality recognizes the bounded capacity of humankind as problem solvers with limited information-processing capabilities. This means that rationality is a quite ambiguous concept. It is therefore necessary to introduce a specific set of rules to be able to use the concept. Simply said, this means we assume that if alternative A is more feasible than alternative B, the decision maker will choose A every time he faces that same decision. And if alternative A is more feasible than B and B is more feasible than C, the decision maker will prefer A to C.
Discrete and Probabilistic Choice Theory

This section can be seen as an expansion of the section about the framework of choice theory. The concepts used in choice theory are similar to the concepts in Economic Consumer Theory, which will not be treated in this paper. The view on demand in Economic Consumer Theory applies well when the feasible choices have continuous variables, but this might not always be the case. In discrete choice theory these types of problems are better described as discrete bundles of attributes. Furthermore, in probabilistic choice theory the probability that a decision maker chooses a certain alternative can be provided, which makes it a powerful framework when working with discrete choice situations.
Discrete Choice Theory

Table 1: Choice in travelling to work

  Alternative | Travel Time (t) | Cost (c) | Comfort (o)
  Car         | t1              | c1       | o1
  Bus         | t2              | c2       | o2
  Walk        | t3              | c3       | o3

For the section on discrete choice theory, consider a simple example that is often used in the literature (Ben-Akiva & Lerman 1985; Train 2003). Consider a decision maker who has to travel to work. He has three options: take the car, take the bus or walk. The attributes of the alternatives are travel time, travel cost and comfort (see table 1). The choice will have the utility function U = U(q1, q2, q3), where q indicates the alternative chosen. Obviously the decision maker can only choose one alternative, thus qi = 1 if mode i is chosen and qi = 0 otherwise, for all i in the choice set, and q1q2 = q2q3 = q1q3 = 0. Because the possibilities U(1,0,0), U(0,0,1) and U(0,1,0) are not differentiable, we apply a maximization on the function with the attributes as parameters: Ui = U(ti, ci, oi). Now we can see that U1 > U2 and U1 > U3 have to hold for alternative 1, taking the car, to be chosen.

As for the form of the utility function, in most literature an additive utility function is assumed:

  Ui = -β1 ti - β2 ci + β3 oi, for all i in the choice set, with βi > 0 for all i.

With this formula we can try to predict changes in U for different numerical values of the parameters. This approach to the utility function is called revealed preference and will be discussed more elaborately in the next chapter.
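The additive utility rule above can be sketched in a few lines. The coefficient and attribute values below are hypothetical, chosen only to show the mechanics of picking the alternative with the largest Ui:

```python
# A minimal sketch of the additive utility rule U_i = -b1*t_i - b2*c_i + b3*o_i
# for the travel-to-work example. All numbers are assumed, not estimated.

beta = {"time": 0.10, "cost": 0.05, "comfort": 1.0}   # all beta_i > 0 (assumed)

# (travel time in minutes, cost in euros, comfort score) per alternative
attributes = {
    "car":  (20, 5.0, 3),
    "bus":  (30, 2.5, 2),
    "walk": (60, 0.0, 1),
}

def utility(t, c, o):
    return -beta["time"] * t - beta["cost"] * c + beta["comfort"] * o

utilities = {alt: utility(*x) for alt, x in attributes.items()}
chosen = max(utilities, key=utilities.get)   # U1 > U2 and U1 > U3 must hold
print(utilities, chosen)
```

With these assumed values the car has the highest utility, so a deterministic utility maximizer would choose it.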
Lancaster (1966) defined the utility function Uin = U(xin), where xin is the vector of attribute values, for every alternative i and every decision maker n. Ben-Akiva and Lerman expand this formula a bit, due to variability in the population: Uin = U(xin, sn), where sn is a vector of characteristics of the decision maker, for example income, age, education and ethnic background.
Probabilistic Choice Theory

In probabilistic choice theory it is argued that we cannot approximate human behaviour by deterministic parameters. It seems plausible to state that human behaviour has a probabilistic nature. Furthermore, it can be argued that whilst the decision maker has knowledge of his or her utility function, the researcher or analyst does not know its exact form. Therefore Train (2003) introduces the term representative utility. In the section about the framework the utility function of the form U was introduced. The decision maker chooses alternative i if Uin > Ujn ∀ j ≠ i, where the j are the different choices from the choice set (Cn) and n labels the decision maker. Since there are aspects of the utility function of a decision maker that the researcher does not know, we introduce the representative utility function Vjn = V(xjn, sn), with xjn ∀ j again the attributes of the alternatives and sn some attributes of the decision maker. Because U depends on characteristics the researcher cannot know, it makes sense that Vjn ≠ Ujn. Train states that the utility can be decomposed as Ujn = Vjn + εjn, where εjn captures the factors that affect utility but are not known to the researcher and therefore are not included in Vjn. Simply said, εjn is the difference between Ujn and Vjn and can be considered an error term. It seems logical that if the εjn are factors that affect the utility but are not known by the researcher, the form of εjn is unknown as well. Therefore these terms are treated as random. The joint density of the vector of these 'errors' is denoted f(εn). The probability that decision maker n chooses alternative i is then

  Pin = Pr(εjn - εin < Vin - Vjn ∀ j ≠ i)
      = ∫ I(εjn - εin < Vin - Vjn ∀ j ≠ i) f(εn) dεn.

The first part is the probability that the decision maker chooses alternative i. Here I(...) is the indicator function, which is equal to 1 if the statement between the parentheses is true and 0 if not. This is a multidimensional integral and only takes a closed form for specific forms of the density function f. For example, logit and nested logit, models that will be discussed later on, have a closed form for this integral. Probit and mixed logit are derived differently and do not have a closed form for this integral. They are not calculated exactly but are evaluated numerically.
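One common numerical evaluation is simulation: draw the errors from the assumed density f and count how often alternative i comes out on top. The sketch below assumes i.i.d. standard normal errors and hypothetical representative utilities; it is an illustration of the idea, not any specific model from the paper:

```python
# A minimal sketch of evaluating the choice-probability integral by
# simulation: draw the unobserved errors e from an assumed density f
# (here i.i.d. standard normal) and count how often each alternative
# has the highest total utility V + e.
import random

random.seed(0)
V = {"car": 4.0, "bus": 3.0, "walk": 3.0}   # hypothetical representative utilities

def simulated_probabilities(V, draws=100_000):
    wins = dict.fromkeys(V, 0)
    for _ in range(draws):
        u = {alt: v + random.gauss(0.0, 1.0) for alt, v in V.items()}
        wins[max(u, key=u.get)] += 1
    return {alt: w / draws for alt, w in wins.items()}

P = simulated_probabilities(V)
print(P)   # the probabilities sum to 1; "car" is most likely but not certain
```

More draws give a better approximation of the integral; the same idea underlies the simulation-based estimation of probit and mixed logit mentioned above.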
An example might help to clarify. Consider the example used in the last section, but this time without the attribute comfort. We can define Vi = -β1 ti - β2 ci with i ∈ {car, bus, walk}. Now suppose that after analysis the researcher finds that Vcar = 4, Vbus = 3 and Vwalk = 3. This does not mean that the decision maker will choose to go to work by car. It simply means that by observed factors it seems best to go by car, but there are still factors that are unobserved by the researcher. The probability that the decision maker walks to work instead of taking the car is the probability that the unobserved factors for walking are sufficiently better than those for taking the car, as stated in the formula above.
  P(i | Cn) / P(j | Cn) = P(i | C̄n) / P(j | C̄n), with i, j ∈ C̄n ⊆ Cn,

which simply said means that removing irrelevant alternatives from the choice set Cn, resulting in the subset C̄n, has no influence on the relative choice probabilities.
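This ratio property can be checked numerically. The sketch below assumes Luce-style probabilities P(i | C) proportional to exp(Vi); this functional form is not derived here (it corresponds to the logit models discussed later in the paper), and the utility values are hypothetical:

```python
# A small numerical check of the property above, under the assumption
# that P(i | C) is proportional to exp(V_i) within any choice set C.
import math

V = {"car": 4.0, "bus": 3.0, "walk": 3.0}   # hypothetical representative utilities

def probabilities(V, choice_set):
    denom = sum(math.exp(V[a]) for a in choice_set)
    return {a: math.exp(V[a]) / denom for a in choice_set}

full = probabilities(V, ["car", "bus", "walk"])
subset = probabilities(V, ["car", "bus"])        # "walk" removed

# The ratio P(car)/P(bus) is unchanged by removing "walk":
print(full["car"] / full["bus"], subset["car"] / subset["bus"])
```

Both ratios equal exp(Vcar - Vbus), even though the individual probabilities change once "walk" is removed.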
Random utility

Manski (1977) formalized this approach, which is more in line with consumer theory than the constant utility approach. The observed inconsistencies, or errors, are now viewed as the result of observational inaccuracies on the researcher's side. In this approach we again assume that a decision maker tries to maximize his or her utility, as is in line with economic consumer theory. But, as stated in the section about probabilistic choice theory, the researcher does not know the utility of a decision maker with full certainty, and therefore the utilities are treated as random variables. We can say that the researcher defines the probability of the choice for a specific alternative i in the choice set as

  P(i | Cn) = Pr(Uin ≥ Ujn, ∀ j ∈ Cn),

as stated by Ben-Akiva and Lerman (1985). Here we assume a joint PDF for the set of (random) utilities, because a logical argument can be made about the underlying sources of randomness in the utilities. Manski (1973) identified four sources:
1. Unobserved attributes: the vector of attributes that affects the decision is incomplete to the researcher. Therefore there is an element included in the utility function that is observationally random, and thus the utility is random as well.

2. Unobserved taste variations: the researcher knows all attributes, but there is an unobserved argument, namely the specific taste, or preference, of the decision maker. The variation of this argument is unknown, making the utility random.

3. Measurement errors: the attributes of the alternatives are not observable. That is why the researcher estimates the attributes, with a measurement error accounting for probable inaccuracy in the measurement. This error term is unknown, resulting in the utility becoming random.

4. Instrumental variables: the true utility function is known, but some elements in the vector of attributes are not observable. The researcher approximates these variables by a function derived from their relation to known variables. This means the utility function contains instrumental variables, which by nature express an imperfect relation between estimate and actual attribute. Again this term contains a random error, making the utility random.
Expected utility

Expected utility is also well known, especially because it is one of the underlying assumptions in game theory. This approach deals with the analysis of choices in risky projects. It is one of the oldest utility approaches, as it was formulated in 1713 by Nicholas Bernoulli and solved in 1738 by Daniel Bernoulli. Savage (1954) formulated the subjective expected utility theory, which is a more up-to-date work and was reviewed by Anand (1993). If an uncertain event has a number of possible outcomes zi, which all have utilities U(zi), and there is a subjective probability P(zi) for each outcome, then the subjective expected utility would be:

  Σi U(zi) P(zi).
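The sum above is easy to sketch. The outcomes and probabilities below are the hypothetical lotteries from axiom 5 further on, with utility taken (as a simplifying assumption) to be the monetary outcome itself:

```python
# A minimal sketch of the subjective expected utility sum above.
# Utility is assumed equal to the monetary outcome, purely for illustration.

def subjective_expected_utility(outcomes):
    """outcomes: list of (U(z_i), P(z_i)) pairs; the P(z_i) should sum to 1."""
    return sum(u * p for u, p in outcomes)

# lottery A: 15% chance of a 100 euro reward; lottery B: 5% chance of 1000 euro
lottery_a = [(100.0, 0.15), (0.0, 0.85)]
lottery_b = [(1000.0, 0.05), (0.0, 0.95)]

print(subjective_expected_utility(lottery_a))
print(subjective_expected_utility(lottery_b))
```

Note that with utility linear in money, lottery B has the larger expected utility (50 versus 15); the preference for A described in axiom 5 therefore corresponds to a subjective utility that is not linear in money.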
Savage also stated eight axioms that suit the other utility approaches well. These axioms are:

1. Completeness: if x and y are two alternatives, either x or y is preferred, or x and y are equally desirable.
2. Transitivity: if x is preferred to y and y to z, then x is preferred to z.
3. Independence: x and y should be independent of each other.
4. Resolution independence: preference for an alternative x or y only depends on the attributes of the alternative ex ante.
5. Expected wealth independence: the preference for an alternative depends on the chance of winning and not on the size of the stakes. If one can choose between lottery A, which has a 15% chance of winning a 100 euro reward, and lottery B, which has a 5% chance of winning a 1000 euro reward, the decision maker will choose to participate in lottery A.
6. Minimal strict preference: there is at least one vector of attributes that is strictly preferred to the other vectors of attributes.
7. Continuity in probability: very unlikely events should be regarded as having zero probability.
8. Partial resolution independence: if the attributes of x are preferable to the attributes of y for different states, then x is preferred to y if one of those states is obtained.
Stated and Revealed Preference

Now that the main framework behind choice models has been treated, we turn to the data. When considering the data involved in choice models, it is possible to divide these data into two distinct types: stated preference (SP) data and revealed preference (RP) data. In the following two sections I will give a description of both and point out strengths and weaknesses of the two data types. These sections will not contain specific methods of either approach, but will give insight into these approaches and how they differ from each other.
Stated preference

According to Kroes and Sheldon (1988), SP methods refer to 'a family of techniques which use statements of individual respondents about their preferences' in a set of alternatives to estimate utility functions. SP data are collected through experimental situations or surveys where the respondents are faced with hypothetical choice problems. For example, the respondent is asked to choose between five bikes. In this hypothetical situation only these five bikes exist. The response is the stated choice. Another way to describe the SP approach would be the direct approach, because the data come directly from the respondents, the hypothetical decision makers. Due to this approach the data do not describe actual behaviour, but rather how decision makers state they would behave. A strong point of these data is that they can give an indication of how respondents would behave in a situation that does not (yet) exist. So if it is the researcher's objective to examine behaviour towards, for example, a product that does not yet exist, the SP approach would be suitable. SP data also work very well in settings with little or no variation, because the questionnaire can be designed in a way that results in the data having the desired variation.

The main disadvantage of SP data seems obvious: the way respondents expect themselves to behave, or rather the way respondents say they will behave, is not necessarily the way they actually will behave. This phenomenon may arise because respondents do not actually know how they would respond, or because they feel it is expected of them to respond in a specific way.
Revealed preference

In contrast to SP data, RP data relate to actual behaviour. They are called RP because decision makers reveal their preference through the choices they make. Using the example from the SP section, in the RP approach the respondent would be asked what bike he or she bought last, instead of choosing from a selected set in a hypothetical situation. Therefore we can state that purchasing or choosing habits reveal preference. In this approach utility functions are defined by observing behaviour. Where the SP approach is called the direct approach, the RP approach is called the indirect approach: actual behaviour is observed, instead of confronting the respondent with a hypothetical situation.

The largest advantage of RP data is that the data represent actual choices. The downside of this is that they are not suitable for situations that do not currently exist. Because we observe behaviour, there is too much uncertainty in stating behaviour in new situations. Of course approximations can be made, but SP data are simply better suited for these situations. RP data are also not suitable in situations with little or no variation, because relations between different attributes cannot be estimated well without variation.

By using an estimation procedure that allows the relative importance of the attributes to be estimated primarily through SP data and at the same time allows the alternative-specific constants and the overall scale of the parameters to be determined by RP data (Train, 2003), the strengths of both approaches can be combined. This will not be discussed in detail in this paper, but Hensher et al. (1999) and Brownstone et al. (2000) describe this approach for logit models and mixed logit models respectively.
Exogenous, Locational and Utility-based Choice Models

Until now only choice models based on utility, and especially random utility, have been considered. Utility models involve a set of alternatives, a decision maker and some utility function that describes how the decision maker chooses the alternative most attractive to them. In other words, the decision maker bases his or her decision on the attributes of the alternatives and chooses the most preferable alternative through some decision process described by the utility function. There are, however, different types of models besides utility-based models. These models will not be discussed elaborately, but it is good for the reader to be aware that other types of models exist.
Exogenous-based Models

Paul Waddell (1993) investigates whether the assumption, made in models of residential location, that the choice of workplace is exogenously determined is true. This research builds very much on McFadden's (1978) research. Exogenous-based models state that choices are driven by outside factors and are therefore very different from utility-based models, where the decision made depends on characteristics of the decision maker as well as attributes of the alternatives: one could say all endogenous variables. Since the paper by Waddell there has been debate on whether or not locational-based models are somewhat exogenous. Before the 1990s locational-based models were assumed to be exogenous, as residential location was assumed to be driven mostly by workplace. Even now one could argue whether this makes a model exogenous, as workplace is not the only deciding factor and other factors might not be exogenous. In his paper, Waddell reaffirms many of the influences assumed in urban economic theory; the same holds for the assumption on the relation between workplace and residential location.
Locational-based Choice Models

This type of model is not as different from utility-based models as one might think. The earlier models, developed in the 1960s (Alonso 1964, Muth 1969), originate from a model called the monocentric model or are derived from gravity model (Lowry 1964) derivatives. These models will not be discussed here, but they do have a very important assumption in common: workplace choice is exogenous in determining the residential location choice of households. So there is a link with exogenous-based models. Researchers still very often refer to the work of McFadden (1978), in which he described how utility-maximizing consumers are assumed when considering residential housing, thus locational choice. In this type of choice model there is also some random part in the utility. McFadden points out that the MNL and NL models, which will be discussed later in this paper, are very usable in locational-based choice modelling. A difference, to some extent, in locational-based choice modelling is that the characteristics of alternatives are not just decided by their own attributes, but also by some external attributes. Think of climate, the image of the location or employment in the area. It is possible that alternatives should be placed in a Nested Logit model that allows for overlap (Cross-Nested Logit), one that also allows classes to overlap, as houses belong to different classes but are also defined by exogenously driven factors. McFadden (1978) concludes that the problem of modelling disaggregate choice of housing location is impractically large. So to a certain extent locational-based choice models do not differ as much from utility-based choice models, as their foundation is similar.

Head et al. (1995) discussed location choice through the example of Japanese manufacturing investments in the United States. They state that firms in the same industry are drawn to similar locations because proximity causes positive externalities. This is very much in line with McFadden, who stated that with location choice many more exogenously driven variables play a role. Head et al. also state that chance events can have a lasting influence on the geographical pattern of manufacturing in that case. This is also a big difference with utility-based modelling. There the assumption is made that, in the largest sense, rational processes take place, and therefore if the same decision maker conducts the same decision process, similar outcomes will occur. In the case of location choice this apparently is not true, as choices made also have an effect on the actual alternatives.
Binary
Choice
Models
In
this
chapter
a
general
background
of
binary
choice
models
will
be
given,
that
will
be
used
in
the
following
chapters
when
specific
models
will
be
discussed.
As
stated
in
the
chapter
on
choice
theory,
we
have
a
decision
maker
facing
a
set
of
feasible
discrete
choice
alternatives
and
he
or
she
will
select
the
alternative
with
the
greatest
utility,
with
the
utility
a
random
variable
(r.v.).
As
in
random
utility
theory,
the
probability
that
a
decision
maker
will
select
a
certain
alternative
is
P(i | Cn) = Pr(Uin ≥ Ujn, ∀j ∈ Cn).
If
the
choice
set
Cn
consists of
only
two
alternatives,
i
and
j,
we
have
a
so
called
binary
choice
model.
In
this
case
we
can
state
the
probability
that
decision
maker
n
will
choose
alternative
i
or
j
is:
Pn(i) = Pr(Uin ≥ Ujn) and Pn(j) = 1 − Pn(i).
Ben-‐Akiva
and
Lerman
(1985)
describe
how
theory
described
in
the
last
chapter,
random
utility
theory,
can
be
made
operational:
1. Separate
the
total
utility
into
deterministic
and
random
components
of
the
utility
function
2. Specify
the
deterministic
component
3. Specify
the
random
component
Remember
that
in
the
section
of
probabilistic
choice
theory
we
stated
that
for
the
utility
we
have
an
observed
part
and
an
unobserved
part, which we also call the disturbance:
Uin = Vin + εin, Ujn = Vjn + εjn.
Here
Vin
and
Vjn
are
the
systematic
components.
In
the
chapter
on
choice
theory
V
is
described
as
the
part
of
the
utility
that
can
be
observed
by
the
researcher.
These
components
are
assumed
to
be
deterministic.
V
can
be
thought
of
as
the
means
of
U.
We
can
shift
the
scale
of
measurement
by
transforming
both
Uin
and
Ujn
by
any
strictly
monotonic
increasing
function.
Ben-‐Akiva
and
Lerman
show
that
adding
a
constant
to
both
utilities
has
no
effect
on
the
choice
probabilities.
It
does
change
Vin
and
Vjn,
but
eventually
that
is
no
problem.
The
absolute
levels
of
V
and
ε
do
not
matter,
what
does
matter is whether Vin − Vjn ≥ εjn − εin.
Though binary choice models could be developed by specifying just the differences instead of the individual components,
usually
each
utility
function
is
specified
separately
for
the
sake
of
continuity.
There
exist
choice
models
with
more
than
two
alternatives;
therefore
the
same
notation
is
used
for
binary
choice
models.
After
separating
the
utility
into
a
deterministic
and
a
random
part,
we
specify
both
parts,
starting
with
the
deterministic
or
systematic
part.
As
V
does
not
just
depend
on
the
underlying
attributes,
but
also
on
attributes
of
the
decision
maker,
we
can
define
V
as
V(zin,
Sn),
as
is
similar
to
a
description
in
the
chapter
on
choice
theory.
Seeing
as
these
two
vectors
z
and
S
are
actually
combined
to
describe
V,
we
define
a
new
vector
xin
=
h(zin,
Sn),
where
h
is
some
vector-‐valued
function.
Now
we
can
write
Vin
=
V(xin)
and
Vjn
=
V(xjn).
Secondly,
a
functional
form
for
V
is
needed.
Because
we
would
like
it
if
the
function
was
to
reflect
theory
about
how
the
elements
in
x
influence
utility
and
we
want
the
function’s
parameters
to
be
estimated
easily,
most
researchers
have
chosen
to
use
functions
that
are
linear
in
the
parameters.
If
we
define
β
=
[β1, β2, …, βK]
as
the
vector
of
K
unknown
parameters,
we
can
write
Vin = β1 xin1 + β2 xin2 + … + βK xinK, Vjn = β1 xjn1 + β2 xjn2 + … + βK xjnK
for
both
observed
utilities.
One
important
characteristic
is
that
linearity
in
parameters
does
not
mean
linearity
in
the
attributes
z
and
S.
This
totally
depends
on
the
form
of
h.
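The distinction between linearity in the parameters and linearity in the attributes can be sketched in a few lines. All numbers and the transform h below are hypothetical, invented purely for illustration:

```python
import math

# Hypothetical attribute transform h: the utility is linear in the
# parameters beta, but not in the underlying attributes z and S.
def h(z, S):
    # z: attributes of the alternative (e.g. travel time, cost)
    # S: attributes of the decision maker (e.g. household size)
    return [z[0], math.log(z[1]), z[1] / S[0]]  # nonlinear in z and S

def V(beta, x):
    # Systematic utility, linear in the parameters.
    return sum(b * xi for b, xi in zip(beta, x))

beta = [-0.05, -1.2, -0.3]
x_in = h(z=[20.0, 2.5], S=[1.5])
print(V(beta, x_in))
```

Any monotone transform of the raw attributes can be absorbed into h while the estimation problem stays linear in β.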
Finally,
here
it
is
assumed
that
the
parameters
β1, β2, …, βK are
the
same
for
the
whole
population.
It
is
possible
however,
that
market
segmentation
is
present.
Then
β1, β2, …, βK
are
treated
as
r.v.’s
distributed
across
the
population.
Finally
we
need
to
specify
the
disturbances
to
obtain
an
operational
binary
choice
model.
Where
in
the
last
paragraph
the
functions
for
V
were
depicted
separately,
now
the
functions
for
the
disturbances
ε
will
be
depicted
in
the
most
convenient
way.
So
the
differences
εjn
–
εin
are
discussed.
As
stated
before,
the
choice
probabilities
are
unaffected
if
we
add
a
constant
to
both
disturbances.
Besides
this
it
also
will
not
make
any
difference
if
the
mean
of
the
disturbance
is
shifted,
as
long
as
the
systematic
component
is
shifted
by
the
same
amount.
From
this
follows
that
the
means
of
the
disturbances
can
be
fully
represented
by
any
constant
without
loss
of
generality.
Usually
it
is
assumed
that
all
disturbances
have
zero
means.
In
addition
to
the
mean,
the
scale
of
the
disturbances
should
be
consistent
with
the
scale
of
the
functions
V.
As
for
the
functional
form
of
the
distribution
of
the
disturbances,
it
does
not
make
sense
to
think
of
the
distribution
of
the
ε’s
to
be
different
from
the
V’s
distribution.
Especially
since
the
disturbances
reflect
the
different
sources
of
observational
error,
different
specifications
of
V
will
lead
to
different,
fitting,
distributions
for
ε.
Because
there
are
many
different
unobserved
factors
that
affect
the
overall
distribution,
it
is
hard
to
make
strong
statements
about
this
distribution.
However,
nowadays
more
and
more
we
gain
insight
about
what
is
included
into
the
disturbances.
Choice models
“If people do not believe that mathematics is simple, it is only because they do not realize
how complicated life is.”
John von Neumann (1903-1957)
Logit
and
Probit
Now
that
the
framework
of
choice
theory
in
general
and
binary
choice
models
has
been
set,
we
are
now
able
to
talk
about
specific
models.
There
are
three
common
binary
models:
the
linear
probability
model,
the
(binary)
logit
model
and
the
(binary)
probit
model.
All three models were
discussed
by
Thurstone
(1927)
to
some
extent.
The
differences
between
these
models
are
based
on
the
assumption
that
is
made
about
the
distribution
of
the
disturbances,
or
the
difference
between
the
disturbances.
To
obtain
the
eventual
model,
the
choice
probabilities
can
be
derived
under
the
assumption
about
the
disturbances.
Linear
Probability
Model
The
easiest
of
the
three
models
is
the
linear
probability
model.
In
this
model
the
difference
in
the
disturbances
is
uniformly
distributed:
εjn − εin ~ Unif(−L, L),
where
L
>
0.
The
difference
between
the
disturbances
is
defined
as
εjn
–
εin
=
εn,
with
density
function
f(εn).
Here
Pn(i) = Pr(εn ≤ Vin − Vjn).
The
choice
probability
is
given
by
the
cumulative
distribution
function
of
εn.
When
V
is
linear
in
its
parameters,
the
probability
function
is
linear
as
well
between
–L
and
L.
According
to
Cox
(1970)
this
model
has
a
major
drawback:
unless
restrictions
are
placed
on
the
β’s
(which
are
again
used
to
estimate
V),
the
estimated
coefficients
can
imply
probabilities
outside
the
interval
[-‐L,
L].
Therefore
the
logit
and
probit
models
are
used
more
often.
Besides
this
drawback,
it
is
unrealistic
to
assume
the
interval
[-‐L,
L],
and
zero
probabilities
outside
this
interval.
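The linear probability model's choice probability is just the Unif(−L, L) cumulative distribution function evaluated at Vin − Vjn, which can be sketched as follows (values are illustrative):

```python
def linear_prob_choice(V_i, V_j, L):
    """Linear probability model: eps_jn - eps_in ~ Unif(-L, L),
    so P_n(i) is the uniform CDF evaluated at V_i - V_j."""
    d = V_i - V_j
    if d <= -L:
        return 0.0   # probability is forced to zero below -L
    if d >= L:
        return 1.0   # and to one above L
    return (d + L) / (2 * L)  # linear between -L and L

print(linear_prob_choice(1.0, 0.0, 2.0))
```

The hard cutoff at ±L is exactly the unrealistic feature criticized above.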
Probit
Another
way
to
view
the
disturbances
is
as
being
the
sum
of
a
large
number
of
unobserved,
independent
constituents.
Due
to
the
large
number
and
the
central
limit
theorem
the
disturbances
tend
to
be
normally
distributed.
Now
we
can
state
that
εin
and
εjn
both
have
a
normal
distribution
with
mean
zero
and
variance
σi² and σj² respectively.
Now
the
difference
between
the
disturbances
also
has
a
normal
distribution
with
mean
zero
and
variance
σi² + σj² − 2σij = σ².
When
Vin = ! ' xin
and
Vjn = ! ' x jn
we
can
state
for
the
choice
probabilities:
Pn(i) = Φ(β′(xin − xjn) / σ),
where
Φ
denotes
the
standardized
cumulative
normal
distribution.
The
choice
probability
here
only
depends
on
σ,
not
on
the
variance
of
either
disturbance
or
the
covariance.
Moreover,
the
choice
for
σ
is
arbitrary,
as
rescaling
σ
or
β
by
any
positive
constant
will
not
affect
the
choice
probability.
Usually
σ
=
1
is
chosen.
Of
course
the
assumption
on
normality
is
a
very
convenient
assumption,
as
it
improves
possibilities
considering
calculations
compared
to
the
linear
probability
model,
but
it
can
also
be
a
limitation.
Now
a
normal
distribution
is
required
for
all
unobserved
components
in
the
utility.
Also
the
integral
for
the
choice
probabilities
has
an
open
form
for
probit
models.
This
is
not
a
big
problem,
but
is
not
considered
convenient
analytically.
Logit
Logit
models
are
very
much
alike
probit
models,
but
a
big
difference
is
that
the
integral
for
the
choice
probability
has
a
closed
form,
which
makes
these
types
of
models
analytically
more
convenient.
In
the
logit
model
it
is
assumed
that
εjn
–
εin
=
εn
is
logistically
distributed.
The
logistic
distribution
approximates
the
normal
distribution,
but
has
fatter
tails.
Under
this
assumption,
the
choice
probabilities
are:
Pn(i) = e^{μVin} / (e^{μVin} + e^{μVjn}).
If
the
V
functions
are
linear
in
their
parameters,
the
choice
probabilities
reduce to
Pn(i) = 1 / (1 + e^{−μβ′(xin − xjn)}).
Here
μ
is
a
positive
scale
parameter.
For
convenience
it
is
assumed,
as
has
been
done
similarly
for
probit,
that
μ
=
1.
But
for
probit
σ
=
1
is
chosen,
which
corresponds
with
var(εjn
–
εin)
=
1.
This
also
implies
that
the
scaled
logit
coefficients
are
π/√3 (≈ 1.8) times
larger
than
scaled
probit
coefficients.
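The binary logit and probit choice probabilities above can be sketched in a few lines (a minimal illustration; the standard normal CDF is computed from the error function):

```python
import math

def binary_logit(V_i, V_j, mu=1.0):
    # P_n(i) = 1 / (1 + exp(-mu (V_i - V_j)))
    return 1.0 / (1.0 + math.exp(-mu * (V_i - V_j)))

def binary_probit(V_i, V_j, sigma=1.0):
    # P_n(i) = Phi((V_i - V_j) / sigma), via the error function
    z = (V_i - V_j) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Both give 0.5 when the systematic utilities are equal,
# and very similar probabilities elsewhere (logit has fatter tails).
print(binary_logit(1.0, 0.0), binary_probit(1.0, 0.0))
```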
Train
(2000)
describes
a
couple
of
characteristics
of
the
logit
model,
which
prove
to
be
as
well
the
power
of
the
model,
but
in
some
sense
also
the
limitations.
Firstly
the
logit
model
is
able
to
represent
systematic
taste
variation
very
well,
but
the
flipside
is
that
it
cannot
represent
random
taste
variation.
Secondly,
if
the
unobserved
factors
are
independent
over
time
in
repeated
choice
situations,
the
model
can
capture
the
dynamics
of
repeated
choice.
On the other hand this seems restrictive, as the model exhibits proportional substitution patterns.
There
are
some
limiting
cases
for
all
three
models.
If
µ ! ", ! ! 0 or L ! 0 ,
Pn(i)
will
become
1
if
Vin
–
Vjn
>
0
and
0
otherwise.
If
μ → 0, σ → ∞ or L → ∞,
there
is
exactly
0.5
probability
for
both
alternatives.
Estimation
For
both
logit
and
probit
models,
usually
maximum
likelihood
estimators
are
used
to
estimate
the
parameters
β1, β2, …, βK from
a
(random)
sample
of
observations
from
the
population.
An
indicator
variable
yin
is
constructed
and
defined
as
1
if
person
n
chose
alternative
i
and
0
if
that
decision
maker
chose
alternative
j.
Also
two
vectors
xin
and
xjn
are
constructed,
both
contain
all
K
values
of
the
relevant
variables.
Now
given
a
sample
of
N
observations,
we
now
have
to
find
estimates
β̂1, β̂2, …, β̂K.
Now
we
consider
the
likelihood,
which
is
eventually
equal
to
the
probability
of
the
observed
outcomes
given
the
parameter
values
β̂1, β̂2, …, β̂K.
Since
the
assumption
is
that
they
are
drawn
at
random
from
the
whole
population,
we
can
state
that
the
likelihood
of
the
sample
in
total
is
the
product
of
the
likelihood
of
all
individual
observations.
Analytically
it
is
more
convenient
to
consider
the
logarithm
of
the
likelihood
function,
denoted
as
log
L.
Now
we
can
write
the
likelihood
as:
log L(β1, β2, …, βK) = Σ_{n=1}^{N} [yin log Pn(i) + yjn log Pn(j)],
where
Pn(i)
is
a
function
of
β1, β2, …, βK.
Now
the
log
L
function
is
solved
to
maximize
it
by
differentiation
with
respect
to
the
β’s
and
then
setting
the
partial
derivatives
to
zero,
thus
we
solve
max
log
L(β̂1, β̂2, …, β̂K),
while
we
seek
the
estimates
β̂1, β̂2, …, β̂K that
solve
this
function.
Often
if
a
solution
to
the
first-‐order
conditions
exists,
it
is
a
unique
solution.
However
it
is
quite
possible
that
there
will
be
multiple
solutions
to
the
first-‐order
conditions.
Just
one
of
these
solutions
constitutes
the
maximum
likelihood
estimate.
The
estimates
are
consistent
and
asymptotically
normal.
The variance-covariance matrix of the estimates is based on the matrix of the second derivatives of the logarithmic likelihood function with respect to the parameters, evaluated at the true parameters. The element in the kth row and the lth column is ∂²log L / ∂βk∂βl. Since we do not know the actual values of the parameters at which to evaluate the second derivatives, nor the distribution of the x vectors, usually the matrix is estimated at the estimated parameters β̂1, β̂2, …, β̂K, using the sample distribution of the vectors x. Therefore

Σ_{n=1}^{N} [∂²(yin log Pn(i) + yjn log Pn(j)) / ∂βk∂βl], evaluated at β = β̂,
is
used
as
a
consistent
estimator
of
the
actual
value.
As
for
the
computational
aspect
of
this
problem,
it
is
known
that
the
solutions
of
the
first-‐order
conditions
are
typically
non-‐linear
and
the
use
of
a
computer
is
needed
to
solve,
even
for
two-‐variable
problems.
Ben-‐Akiva
and
Lerman
(1985)
describe
how
the
Newton-‐Raphson
algorithm
can
be
used.
Briefly
described
the
algorithm
works
as
follows:
firstly
an
initial
guess
for
the
parameters
is
made.
Then a quadratic approximation of the log likelihood, based on the second derivatives, is formed around that guess. The linearized first-order conditions of this approximation are then solved, and finally we look at the difference between successive approximations.
If
it
is
small
enough
(Ben-‐Akiva
and
Lerman
describe
different
criteria)
then
these
approximations
are
used.
Most
other
procedures
are
similar,
but
with
different
steps
for
the
second
and
third
step.
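The Newton-Raphson procedure described above can be sketched for the binary logit model. This is a minimal illustration on synthetic data (the data-generating values are invented), not the estimation code of any cited study:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary-choice data (illustrative, not from the paper).
N, K = 5000, 2
dx = rng.normal(size=(N, K))             # x_in - x_jn, attribute differences
beta_true = np.array([1.0, -0.5])
p = 1.0 / (1.0 + np.exp(-dx @ beta_true))
y = (rng.random(N) < p).astype(float)    # y_in = 1 if alternative i was chosen

beta = np.zeros(K)                       # step 1: initial guess
for _ in range(25):                      # Newton-Raphson iterations
    p_hat = 1.0 / (1.0 + np.exp(-dx @ beta))
    grad = dx.T @ (y - p_hat)            # first derivatives of log L (score)
    w = p_hat * (1.0 - p_hat)
    hess = -(dx * w[:, None]).T @ dx     # second derivatives of log L
    step = np.linalg.solve(hess, grad)   # solve the linearized conditions
    beta = beta - step
    if np.abs(step).max() < 1e-10:       # stopping criterion on the step size
        break

print(beta)  # close to beta_true
```

At convergence the score is essentially zero, i.e. the first-order conditions of the log likelihood are satisfied.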
Alternative
estimation
models
The
maximum
likelihood
estimation
procedure
is
used
mostly
for
the
logit
and
probit
model.
For
the
linear
probability
model
the
least
squares
method,
as
is
more
common
for
regression
models,
or
Berkson’s
procedure
is
used
more
often.
As
probit
and
logit
are
the
main
models
to
be
discussed
here,
least
squares
and
Berkson’s
procedure
will
not
be
discussed
in
this
chapter.
Multinomial
Logit
The
last
two
chapters
the
main
focus
has
revolved
around
binary
choice
models
and
the
estimation
technique
behind
these
models.
In
most
decision
processes
however,
the
number
of
alternatives
in
the
choice
set
is
not
limited
to
two.
This
type
of
model
is
called
a
multinomial
choice
model.
Again,
the
choice
set
is
different
for
every
individual,
as
each
individual
has
their
own
index
of
attributes
and
a
different
subset
of
the
universal
set.
In this case, where more than two alternatives can be chosen, the derivation of choice models and estimation models is more complex than for binary choice models.
Instead
of
using
just
the
difference
between
the
disturbances,
we
now
need
to
characterize
the
whole
joint
distribution
of
all
disturbances.
Different
types
of
multinomial
choice
models
exist.
It
is
possible
to
expand
the
binary
logit
and
binary
probit
model
to
multinomial
models.
Dow
and
Endersby
(2003)
compare
multinomial
logit
(MNL)
and
multinomial
probit
(MNP)
for
voting
research.
They
state
that
the
MNL
model
is
preferable
to
the
MNP
model.
As
explained
in
the
last
chapter,
the
logit
model
has
a
closed
form
integral
whilst
the
probit
model
has
an
open
form.
Therefore
MNP
is
more
complex
than
MNL
and
it
could
give
some
estimation
problems.
Burda
et
al.
(2008)
present
a
model
that
is
a
mix
between
MNL
and
MNP
where
estimation
is
conducted
by
using
a
Bayesian
Markov
Chain
Monte
Carlo
technique.
However
in
this
chapter
we
focus
specifically
on
the
multinomial
logit
model,
thus
we
will
elaborate
and
expand
on
the
theory
treated
in
the
last
chapter.
First some background concerning multinomial choice theory will be sketched.
Then
some
definition
of
the
MNL
model
and
its
characteristics,
strengths
and
weaknesses
will
be
given
before
considering
estimation
models.
Multinomial
choice
As
stated
before,
every
decision
maker
has as his or her
choice
set
some
subset
of
the
universal
set,
and
every
decision
maker
can
have
a
different
subset.
Manski
(1977)
calls
this
process
of
generating
a
subset
from
the
universal
set
“the
choice
set
generation
process”.
However
for
the
researcher
it
makes
the
model
a
lot
more
complex
if
every
individual
decision
maker
can
choose
a
different
choice
set
than
other
decision
makers.
It
is
defined
that
every
choice
set
Cn
has
Jn ≤ J feasible
choices.
Now
we
state
that
the
probability
that
an
alternative
i
is
chosen
follows
directly
from
the
probability
described
in
the
section
on
random
utility
theory:
Pn(i) = Pr(Uin ≥ Ujn, ∀j ∈ Cn, j ≠ i).
Here
we
can
distinguish
a
deterministic
and
a
random
component
in
the
utility:
Pn(i) = Pr(Vin + εin ≥ Vjn + εjn, ∀j ∈ Cn, j ≠ i).
Define f(ε1n, …, εJnn) as
the
joint
density
function
of
the
disturbances.
Now
there
are
different
ways
to
express
the
choice
probabilities
described
in
literature.
Ben-‐Akiva
and
Lerman
offer
three
ways
of
deriving
Pn(i).
The
most
insightful
way
is
to
reduce
the
multinomial
problem
to
a
binary
problem,
as
we
discussed
this
earlier.
We
can
state
that
Uin ≥ Ujn, ∀j ∈ Cn, j ≠ i
and
from
this
follows
that
if
alternative
i
is
preferred
over
all
other
alternatives
j,
that
also:
Uin ≥ max_{j∈Cn, j≠i} Ujn.

For the binary logit model the choice probability was

Pn(i) = e^{μVin} / (e^{μVin} + e^{μVjn}).
For
the
MNL
model
the
choice
probability
is
similar:
Pn(i) = e^{μVin} / Σ_{j∈Cn} e^{μVjn}.
These
formulas
are
equal
if
Jn
=
2
and
μ
=
1.
Here
it
is
evident
that
the
MNL model is an extension, a development of the binary logit model.
Also
in
the
function
it
can
be
seen
that
it
is
a
proper
probability
mass
function,
as
0 ≤ Pn(i) ≤ 1
and
its
sum
over
all
i
is
equal
to
1.
Here
again
Uin = Vin + εin, ∀i ∈ Cn,
with
the
disturbances
independently
and
identically
distributed
(iid)
and
Gumbel-‐distributed
with
location
and
scale
parameters
η
and
μ.
As
with
binary
logit,
as
long
as
all
systematic
terms
of
the
utility
include
a
constant,
it
is
not
restrictive
to
take
η
=
0.
To
calculate
the
probability
of
one
of
the
alternatives
in
the
choice
set,
order
the
alternatives
so
that
alternative
i
=
1.
Now
we
still
have
an
unidentifiable
parameter
μ,
but
it
is
common
to
set
this
parameter
to
a
convenient
value,
usually
μ
=
1
is
used.
Obviously
a
big
advantage
of
the
MNL
model
is
that it is able to
analyze
a
choice
set
consisting
of
multiple
alternatives,
as
is
possible
for
the
MNP
model.
Though
in
literature
(Bunch
1991,
Alvarez
and
Michael
2001)
use
of
MNP
in
several
fields
is
recommended,
MNP
cannot
work
optimally
with
a
large
number
of
observations.
The
MNL
model
however
is
able
to
work
with
larger
datasets.
Both
models
do
not
work
effectively
with
small
datasets.
Also
the
MNL
model
is
criticized
because
of
the
independence
of
irrelevant
alternatives
(IIA)
property,
which
states
that
for
a
specific
individual
the
ratio
of
the
choice
probabilities
of
any
two
alternatives
is
completely
unaffected
by
systematic
utilities
of
other
alternatives.
This
is
closely
related
to
the
assumption
that
all
disturbances
are
mutually
independent.
Ben-‐Akiva
and
Lerman
state
that
the
problem
does
not
per
se
lie
with
the
IIA
property,
but
rather
every
model
that
has
an
underlying
assumption
that
the
disturbances
are
mutually
independent shows similar results.
Dow
and
Endersby
(2003)
state
that
for
most
applications
the
IIA
property
is
not
particularly
restrictive
and
for
most
applications
not
even
relevant.
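The MNL choice probability and the IIA property can be illustrated with a short sketch (all utility values are hypothetical). Changing the systematic utility of a third alternative leaves the ratio of the other two choice probabilities untouched:

```python
import math

def mnl_probs(V, mu=1.0):
    """MNL choice probabilities P_n(i) = e^{mu V_i} / sum_j e^{mu V_j}."""
    m = max(V)                               # subtract max for numerical stability
    expV = [math.exp(mu * (v - m)) for v in V]
    s = sum(expV)
    return [e / s for e in expV]

# IIA: the ratio P(1)/P(2) is unaffected by the third alternative's utility.
p1 = mnl_probs([1.0, 0.5, -0.2])
p2 = mnl_probs([1.0, 0.5, 3.0])    # only the third utility changed
print(p1[0] / p1[1], p2[0] / p2[1])
```

Both printed ratios equal e^{1.0−0.5}, regardless of the third alternative.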
Besides
these
strengths
and
weaknesses
there
are
two
limiting
cases
for
the
MNL
model,
as
was
the
case
for
the
binary
models.
Firstly if μ → 0 then:

Pn(i) = 1/Jn, ∀i ∈ Cn.

This means that if μ → 0 the variance of the disturbances approaches infinity, so the model will not provide any information: all alternatives are equally likely.
The
other
limiting
case
is
when
µ ! " .
Now
the
variance
of
the
disturbances
approaches
zero
and
the
model
becomes
deterministic.
Estimation
For
the
MNL
model
maximum
likelihood
is
commonly
used
for
estimation
of
the
parameters
as
well.
For
the
maximum
likelihood
estimation
procedure
for
the
MNL
model
there
are
no
big
differences
with
binary
logit,
but
the
computational
burden
grows
with
the
number
of
alternatives.
McFadden
(1974)
showed
that the
MNL
model
has
some
special
properties
that
can
simplify
estimation
of
its
parameters
under
certain
circumstances.
Again,
most
of
this
theory
is
expansion
on
the
section
on
maximum
likelihood
estimation
for
the
binary
logit
model.
Again
let
yin
be
1
if
decision
maker
or
observation
n
chose
alternative
i
and
0
otherwise.
We
write
the
likelihood
function:
N
L = " " Pn (i) yin ,
n=1 i!Cn
with
for
the
choice
parameters
or
as
Ben-‐Akiva
and
Lerman
(1985)
state,
the
linear-‐in-‐
parameters
logit:
e ! ' xin
Pn (i) = ! ' x jn .
" e
j!Cn
As
in
the
section
on
binary
logit,
we
rewrite
the
likelihood
function
to
a
log
likelihood
function:
log L = Σ_{n=1}^{N} Σ_{i∈Cn} yin (β′xin − ln Σ_{j∈Cn} e^{β′xjn}).
When
setting
the
derivatives
of
the
log
likelihood
function
to
zero,
the
first-‐order
conditions
can
be
obtained:

Σ_{n=1}^{N} Σ_{i∈Cn} (yin − Pn(i)) xin = 0.
This
shows
that
the
average
value
of
an
attribute
for
the
chosen
alternatives
equals
the
average
value
predicted
by
the
estimated
choice
probabilities.
Moreover,
this
means
that
if
an
alternative-‐specific
constant
is
defined
for
the
alternative
i
at
the
maximum
likelihood
estimates
the
sum
of
the
choice
probabilities
is
equal
to
the
number
in
the
sample
that
chose
i.
All
properties
of
the
maximum
likelihood
estimation
of
binary
logit
extend
to
the
MNL
model.
This
also
applies
for
the
computational
methods
that
are
used
for
solving
the
system
of
K
equations.
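As an illustration of maximum likelihood estimation for the MNL model, the sketch below fits β by plain gradient ascent on the log likelihood (a simple substitute for the Newton-type procedures discussed earlier) on synthetic data with invented parameter values. At the maximum the score is approximately zero, reflecting the first-order conditions above:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: N choices among J alternatives with K attributes each
# (illustrative values, not from the paper).
N, J, K = 4000, 3, 2
x = rng.normal(size=(N, J, K))
beta_true = np.array([0.8, -0.6])
util = x @ beta_true + rng.gumbel(size=(N, J))   # random utility, Gumbel errors
y = np.eye(J)[util.argmax(axis=1)]               # chosen alternative, one-hot

def probs(beta):
    v = x @ beta
    v = v - v.max(axis=1, keepdims=True)         # numerical stability
    e = np.exp(v)
    return e / e.sum(axis=1, keepdims=True)

beta = np.zeros(K)
for _ in range(1500):                            # gradient ascent on log L
    grad = np.einsum('nj,njk->k', y - probs(beta), x) / N  # average score
    beta = beta + 0.5 * grad

print(beta)  # close to beta_true
```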
Nested
Logit
Hensher
et
al.
(2005)
state
that
the
bulk
of
choice
behaviour
study
applications
do
not
go
farther
than
the
simple
MNL
model,
because
of
the
ease
of
computation
and
because
of
a
wide
availability
of
software
packages.
They
also
state
this
it
does
come
with
a
price
in
the
form
of
the
IID
assumption
on
the
disturbances
and
the
IIA
property,
which
are
violated
at
times.
The
nested
logit
(NL)
model
includes
a
partial
relaxation
on
both
assumptions.
Like the MNL model, the NL model has a
closed
form
solution
in
contrast
to
for
example
the
multinomial
probit
(MNP)
and
mixed
logit
(ML)
model.
The
ML
model
will
be
discussed
in
a
later
chapter.
The
NL
model
is
a
so-‐called
multidimensional
choice
model.
Many
choice
situations
are
not
just
situations
where
a
decision
maker
has
to
choose
from
some
list
of
alternatives,
but
where
the
set
of
alternatives
are
combinations
of
underlying
choice
dimensions,
for
example
as
shown
in
figure
1,
where
a
NL
model
is
depicted
with
two
dimensions.
Figure
1.
An
example
of
the
structure
of
a
NL
model.
It
is
not
the
case
that
the
MNL
model
cannot
be
used
as
a
multidimensional
model.
We
can
distinguish
two
cases
of
multidimensional
models:
multidimensional
choice
sets
with
shared
observed
attributes
and
multidimensional
choice
sets
with
shared
unobserved
attributes.
The
multidimensional
MNL
model,
called
the
joint
logit
(JL)
model
is
an
example
of
the
former;
the
NL
and
MNP
models
are
examples
of
the
latter.
Before
it
is
able
to
discuss
NL
models
properly,
it
is
necessary
to
have
a
look
at
multidimensional
choice
sets
in
general
first.
Multidimensional
choice
sets
In
multidimensional
choice
theory,
every
decision
process
consists
of
more
than
one
choice
set.
The
simplest
example
is
probably
the
scenario
with
two
choice
sets:
C1
and
C2.
Both
choice
sets
have
J1
and
J2
elements.
Therefore
the
choice
set
C1 × C2 (Cartesian product)
will
consist
of
J1 × J2
elements.
The
multinomial
choice
set
for
a
decision
maker
n
will
be
Cn = C1 × C2 − Cn*,
where
Cn*
is
the
set
of
elements
that
are
not
feasible
for
that
decision
maker.
This
obviously
also
goes
for
choice
sets
with
higher
levels
of
dimensionality.
The
example
Ben-‐Akiva
and
Lerman
give
is
a
choice
situation
where
M
denotes
possible
modes
for
shopping
and
D
denotes
shopping
destinations.
The
choice
set
becomes
M
× D.
It
is
possible
however
to
add
extra
dimensions
as
for
example
time
of
day
or
route.
The
difference
with
multinomial
choice
sets
is
that
elements
are
somewhat
ordered,
meaning
that
elements
share
common
elements
along
one
or
more
dimensions.
This
linkage
between
the
elements
makes
analysis
useful,
because
it
implies
that
for
a
linkage
to
exist
either
some
of
the
observed
attributes
of
elements
in
the
choice
set
must
be
equal
across
subsets
of
alternatives
or
this
may
be
true
for
some
of
the
unobserved
attributes.
Consequences
of
the
former
will
result
in
the
JL
model;
the
latter
will
result
in
the
NL
model.
Ben-‐Akiva
and
Lerman
(1985)
state
that
the
results
for
multidimensional
choice
situations
will
not
be
different
from
multinomial
situations,
as
long
as
elements
of
the
choice
set
share
either
observed
or
unobserved
attributes.
Nested
logit
Train
(2003)
states
that
a
nested
logit
model
is
appropriate
when
the
choice
set
can
be
partitioned
into
subsets,
or
nests,
in
such
a
way
that
two
properties
hold.
The
first
of
these
being
that
the
IIA
property
holds
within
each
nest.
The
second
of
these
is
that
the
IIA
property
does
not
hold
in
general
for
alternatives
in
different
nests.
The
NL
model
can
be
derived
from
the
General
Extreme
Value
(GEV)
model,
but
that
will
not
be
covered
in
this
paper.
So
when
designing
the
model,
one
should
note
that
the
two
properties
should
hold.
A
way
to
test
this
is
by
removing
one
of
the
alternatives
from
the
choice
set.
If
choice
probabilities
rise
equally
for
certain
alternatives,
these
would
fit
in
one
nest.
Otherwise
they
would
have
to
be
in
two
different
nests,
because
the
IIA
property
does
not
hold.
The
IIA
property
should
hold
within
each
nest
but
not
across
nests.
The
NL
model
is
consistent
with
utility
maximization
(Daly
and
Zachary
(1978),
McFadden
(1978)
and
Williams
(1977)).
Let
the
total
choice
set
be
partitioned
in
K
non-‐
overlapping
subsets
(nests)
B1, …, BK.
The
utility
of
alternative
i
in
nest
Bk
is
Uin = Vin + εin,
again
with
Vin
the
observed
part
of
the
utility
and
εin
the
random,
unobserved
part.
The
NL
model
is
obtained
by
assuming
that
the
vector
of
disturbances
has
a
cumulative
distribution
of
a
GEV
type
distribution:
exp(−Σ_{k=1}^{K} (Σ_{j∈Bk} e^{−εjn/λk})^{λk}).
This
is
a
generalization
of
the
distribution
that
is
underlying
to
the
logit
model.
The
unobserved
utilities
are
correlated
within
nests.
For
any
alternatives
in
different
nests,
the
correlation
is
zero.
In
the
function
above
the
parameter
λk
is
a
measure
of
the
degree
of
independence
in
the
random
part
of
the
utility
among
the
alternatives
in
nest
k.
A
higher
value
causes
greater
independence
and
thus
less
correlation.
When
λk = 1
for
all
k,
the
GEV
distribution
becomes
the
product
of
extreme
values
terms
that
are
independent,
as
λk = 1
represents
independence
among
all
alternatives.
This
means
that
the
NL
model
reduces
to
the
MNL
model.
The distribution of the unobserved components yields the choice probability for alternative i in nest Bk:

Pin = e^{Vin/λk} (Σ_{j∈Bk} e^{Vjn/λk})^{λk−1} / Σ_{ℓ=1}^{K} (Σ_{j∈Bℓ} e^{Vjn/λℓ})^{λℓ}.
(1977)
describes
a
way
to
calculate
!k
based
on
a
vector
of
characteristics
of
the
decision
maker
and
a
vector
of
parameters
that
need
to
be
estimated.
The
choice
probability
as
given
before
is
still
quite
a
hard
formula
to
grasp.
It
is
possible
to
express
this
choice
probability
in
a
different
fashion
without
loss
of
generality.
The
observed
component
of
the
utility
function
can
be
distinct
in
two
parts:
Uin = Wnk + Yin + εin.
Here
Wnk
is
the
part
that
is
constant
for
all
alternatives
within
a
nest:
this
variable
depends
only
on
variables
that
describe
nest
k,
therefore
they
differ
over
nests
but
not
over
the
alternatives
within
a
nest.
Yin
depends
on
variables
that
describe
alternative
i,
so
they
vary
over
alternatives
within
nest
k.
Finally
εin
is
the
unobserved
part
of
the
utility.
Note
that
Yin
is
simply
defined
as
Vin
–
Wnk.
Now
the
NL
probability
can
be
written
as
the
product
of
two
logit
probabilities.
The
probability
that
an
alternative
is
chosen
can
be
written
as
the
product
of
the
probability
that
a
certain
nest
is
chosen
multiplied
with
the
probability
that
an
alternative
within
that
nest
is
chosen:
Pin = Pin|Bk PnBk ,
where
Pin|Bk is
the
conditional
probability
that
given
an
alternative
in
nest
Bk
is
chosen
an
alternative
i
is
chosen.
PnBk
is
the
marginal
probability
of
choosing
an
alternative
in
nest
Bk.
Any
probability
can
be
written
as
the
product
of
a
marginal
and
conditional
probability,
therefore
it
is
exactly
the
same
as
the
situation
before.
Now
both
can
take
the
form
of
logits:
PnBk = e^{Wnk + λk Ink} / Σ_{ℓ=1}^{K} e^{Wnℓ + λℓ Inℓ},

Pin|Bk = e^{Yin/λk} / Σ_{j∈Bk} e^{Yjn/λk},

where Ink = ln Σ_{i∈Bk} e^{Yin/λk}.
These
expressions
are
derived
from
the
choice
probabilities
stated
earlier.
Train
(2003)
gives
the
derivation
by
algebraic
rearrangement.
It
is
customary
to
refer
to
the
marginal
probability
as
the
upper
model
and
to
the
conditional
probability
as
the
lower
model.
The
quantity
Ink
links
the
lower
and
upper
model
by
transferring
information
from
the
lower
model
to
the
upper
model
(Ben-‐Akiva
(1973)).
This
term
is
the
logarithm
of
the
denominator
of
the
lower
model,
which
means
that
λk Ink
is
the
expected
utility
that
the
decision
maker
obtains
from
the
choice
among
the
alternatives
in
nest
Bk.
The
formula
for
the
expected
utility
is
the
same
as
the
utility
for
logit,
as
the
lower
and
upper
model
are
both
logit
models.
Ink
is
often
referred
to
as
the
inclusive
utility
of
nest
Bk.
It
includes
the
term
Wnk,
which
is
the
utility
the
decision
maker
receives
no
matter
what
alternative
in
that
nest
he
chooses.
Added
to
that
is
the
extra
utility
he
or
she
obtains
by
choosing
the
alternative
with
highest
utility
in
that
nest,
λk Ink.
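The decomposition of the NL probability into a marginal (upper) and a conditional (lower) logit can be checked numerically. The sketch below uses hypothetical utilities and dissimilarity parameters, takes Wnk = 0 so that Yin = Vin, and verifies that the product of the two logits equals the direct NL formula:

```python
import math

# Illustrative two-nest example: nests B1 = {0, 1}, B2 = {2, 3}.
V = [1.0, 0.6, 0.2, -0.4]
nests = [[0, 1], [2, 3]]
lam = [0.5, 0.8]                 # dissimilarity parameters lambda_k

def nl_direct(i):
    # Direct NL formula: e^{V_i/l_k} (sum_j e^{V_j/l_k})^{l_k - 1} / sum_l (...)^{l_l}
    k = 0 if i in nests[0] else 1
    num = math.exp(V[i] / lam[k]) * \
        sum(math.exp(V[j] / lam[k]) for j in nests[k]) ** (lam[k] - 1)
    den = sum(sum(math.exp(V[j] / lam[l]) for j in nests[l]) ** lam[l]
              for l in range(2))
    return num / den

def nl_decomposed(i):
    # Product of lower (conditional) and upper (marginal) logit models,
    # linked by the inclusive utility I_k.
    k = 0 if i in nests[0] else 1
    I = [math.log(sum(math.exp(V[j] / lam[l]) for j in nests[l]))
         for l in range(2)]
    P_nest = math.exp(lam[k] * I[k]) / sum(math.exp(lam[l] * I[l])
                                           for l in range(2))
    P_cond = math.exp(V[i] / lam[k]) / sum(math.exp(V[j] / lam[k])
                                           for j in nests[k])
    return P_cond * P_nest

print([round(nl_direct(i), 6) for i in range(4)])
```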
Estimation
of
nested
logit
For
the
NL
model
the
same
applies
as
for
the
MNL:
its
parameters
can
be
estimated
by
standard
maximum
likelihood
techniques:
L = Π_{n=1}^{N} Π_{i∈Bk} (Pin|Bk PnBk)^{yin},
Train
describes
that
the
NL
model
can
also
be
estimated
in
a
sequential
fashion
in
a
bottom
up
way:
the
lower
models
are
estimated
first.
Then
the
inclusive
utility
is
calculated
for
each
lower
model.
Then
the
upper
model
is
estimated
with
the
inclusive
utility
as
explanatory
variables.
However
Train
(2003)
also
describes
that
two
difficulties
come
with
using
estimation
in
a
sequential
fashion.
Firstly, the standard errors of the upper model parameters are biased downward. Secondly, some parameters appear in several sub-models.
the
lower
and
upper
model
separately
causes
separate
estimates
of
whatever
common
parameters
appear
in
the
model.
When maximum likelihood estimation is conducted simultaneously for both models,
the
common
parameters
are
constrained
to
be
the
same
wherever
in
the
model
they
appear.
It
is
stated
that
maximum
likelihood
estimation
is
the
most
efficient
estimation
technique
for
the
NL
model.
we
have
parameter
λk
that
indicates
to
what
extent
alternatives
among
a
nest
are
independent
and
a
higher
value
can
be
explained
as
greater
independence
and
less
correlation.
Now
the
probability
that
decision
maker
n
chooses
alternative
i
is:
Pin = Σ_{k=1}^{K} (αik e^{Vin})^{1/λk} (Σ_{j∈Bk} (αjk e^{Vjn})^{1/λk})^{λk−1} / Σ_{ℓ=1}^{K} (Σ_{j∈Bℓ} (αjℓ e^{Vjn})^{1/λℓ})^{λℓ}.
Now
if
αik
is
equal
to
one
for
all
alternatives
in
the
choice
set
and
enters
only
one
nest,
we
get
the
same
choice
probability
as
for
the
two-‐level
NL
model.
If
λk
is
equal
to
one
for
all
nests
next
to
this,
the
model
becomes
a
standard
logit
model.
Also
for
higher-‐level,
overlapping
models
it
is
possible
to
decompose
the
model:
Pin = Σ_k Pin|Bk Pnk,
where
marginal
and
conditional
probabilities
are:
Pnk = (Σ_{j∈Bk} (αjk e^{Vjn})^{1/λk})^{λk} / Σ_{ℓ=1}^{K} (Σ_{j∈Bℓ} (αjℓ e^{Vjn})^{1/λℓ})^{λℓ},

Pin|Bk = (αik e^{Vin})^{1/λk} / Σ_{j∈Bk} (αjk e^{Vjn})^{1/λk}.
The
model
was
first
used
by
Vovsha
(1997)
who
used
the
model
for
a
mode
choice
survey
in
Israel.
The
model
is
appealing
because
it
is
able
to
capture
a
wide
variety
of
correlation
structures.
The
CNL
model
has
a
closed
form,
as
it
is
derived
from
the
GEV
model
and
an
expansion
on
the
NL
model.
As shown before, the standard logit model is in some ways a special case of the CNL model.
This
makes
the
CNL
model
analytically
interesting.
One
of
the
most
obvious
merits
is
the
ability
to
capture
complex
situations
where
the
NL
model
cannot
handle
correlations,
because
it
does
not
allow
for
overlap.
The closed form of the PDF also keeps the model computationally tractable.
A
disadvantage
of
the
model,
maybe
because
it
is
quite
a
new
model,
is
that
the
issue
of
identification
still
remains
open.
There
are
different
estimation
techniques
and
a
maximum
likelihood
estimator
can
be
identified.
However, if the model is overspecified, the speed of the algorithm may decrease significantly, or the algorithm may not perform well at all.
Estimation of cross-nested logit
The
first
estimation
procedures
for
the
CNL
model
were
proposed
by
Small
(1987)
and
Vovsha
(1997)
and
are
based
on
heuristics.
However,
currently
maximum
likelihood
estimation
techniques
are
used.
Again, these techniques aim at identifying the set of parameters that maximizes the probability that a given model reproduces the observations (Bierlaire (2001)).
The
objective
function
of
the
maximum
likelihood
estimation
problem
for
the
CNL
model
is
a
nonlinear
analytical
function,
as
the
PDF
has
a
closed
form.
Most
nonlinear
programming
algorithms
are
designed
to
identify
local
optima
of
the
objective
function.
As
the
CNL
model
has
a
closed
form,
the
log
likelihood
does
as
well:
28
\[
\ln L = \sum_{n \in \text{sample}} \ln P_{in|C_n},
\]
where
this
is
the
probability
that
alternative
i
is
chosen
by
decision
maker
n
and
Cn
is
the
choice
set
for
that
specific
decision
maker.
Ben-Akiva
and
Bierlaire
(1999)
give
a
more
elaborate
derivation
of
the
log
likelihood
function.
Whatever algorithm is preferred, it is essential that different initial solutions are used, since no meta-heuristic can guarantee a global optimum.
Ben-Akiva
and
Bierlaire
also
give
a
number
of
steps
that
need
to
be
taken.
First,
constraints
to
guarantee
the
model
validity
have
to
be
defined.
Then
constraints
imposing
a
correct
intuitive
interpretation
could
be
important.
Finally
normalization
constraints
are
necessary;
otherwise
the
model
would
not
be
estimable.
Mixed
Logit
According to McFadden and Train (2000), mixed logit (MXL) is a highly flexible model that can approximate any random utility model.
The limitations stated for the logit model are avoided because MXL allows for random taste variation, unrestricted substitution patterns and correlation in unobserved factors over time.
In contrast to the probit model, it is not restricted to normal distributions for the error terms. Like the probit model, it has been known for years, but it has only become applicable since simulation became computationally accessible.
MXL
models
can
be
derived
under
different
behavioural
specifications
and
each
derivation
provides
a
different
interpretation.
The
MXL
model
is
defined
on
basis
of
the
functional
form
for
its
choice
probabilities.
MXL
probabilities
are
the
integrals
of
logit
probabilities
over
a
density
of
parameters:
\[
P_{in} = \int L_{in}(\beta) f(\beta)\, d\beta,
\]
here
Lin
is
the
logit
probability
evaluated
at
parameters
β,
as
described
in
the
chapter
on
MNL
and
f(β)
is
a
density
function.
If
the
utility
is
linear
in
β,
then
Vin(β)
=
β’xin.
Then
the
MXL
probability
takes
its
usual
form:
" ! ' xin %
e
Pin = ( $ ' f (! )d ! .
$ ! e ! ' x jn '
# j &
This
probability
is
a
weighted
average
of
the
logit
formula,
but
evaluated
at
different
values
of
β,
with
weights
given
by
f(β).
In the literature, a mixed function is a weighted average of several functions, and the density that provides the weights is called a mixing distribution.
MXL
is
a
mix
of
the
logit
function
evaluated
at
different
β’s
with
f(β)
as
the
mixing
distribution.
Standard logit is the special case in which the mixing distribution is degenerate: f(β) = 1 at fixed parameters b and zero otherwise. The formula then becomes the normal choice probability of the MNL model.
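This reduction can be checked numerically. The sketch below (attributes x and coefficients b are invented for illustration) compares a plain MNL probability with a Monte Carlo mixed logit probability whose mixing density is an almost-degenerate normal around b:

```python
import math, random

# When the mixing distribution collapses to a point mass at b, the mixed
# logit probability equals the plain MNL probability. Data are illustrative.
x = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # attributes x_jn per alternative
b = [0.8, -0.3]                            # fixed coefficient vector

def logit(beta, i):
    # standard logit probability of alternative i at parameters beta
    expv = [math.exp(sum(bk * xk for bk, xk in zip(beta, xj))) for xj in x]
    return expv[i] / sum(expv)

p_mnl = logit(b, 0)

# Monte Carlo mixed logit with an (almost) degenerate normal mixing density
random.seed(0)
R = 20000
draws = [[random.gauss(bk, 1e-6) for bk in b] for _ in range(R)]
p_mxl = sum(logit(beta, 0) for beta in draws) / R

assert abs(p_mxl - p_mnl) < 1e-4           # the two probabilities coincide
```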
The
mixing
distribution
can
also
be
discrete,
with
β
taking
a
finite
set
of
distinct
values.
This
results
in
the
latent
class
model
that
will
be
discussed
later.
However
in
most
cases
the
mixing
distribution
is
specified
as
continuous.
By specifying the explanatory variables and density appropriately, it is possible to represent any utility-maximizing behaviour (and even some forms of non-utility-maximizing behaviour) by a MXL model.
There is one issue concerning notation, though. There are two sets of parameters in a MXL model: the parameters β that enter the logit formula and have density f(β), and a second set of parameters that describes that density.
Denote the parameters that describe the density of β by θ, so the density is best denoted as f(β|θ).
The mixed logit choice probabilities do not depend on the values of β, but they are functions of θ; the parameters β are integrated out.
In that sense the β's are similar to the disturbances: both are random terms that are integrated out in order to obtain the choice probability.
One
of
the
positive
points
about
the
MXL
model
is
that
it
exhibits
neither
the
IIA
property
nor
the
restrictive
substitution
patterns
of
the
logit
model.
The
ratio
Pin/Pjn
depends
on
all
data,
including
alternatives
and
attributes
other
than
i
or
j.
The
percentage
change
in
probability
depends
on
the
relation
between
Lin
and
Ljn,
the
logit
probabilities
of
alternatives
i
and
j.
Besides
this,
Greene
and
Hensher
(2002)
state
that
the
MXL
model
is
flexible,
even
though
it
is
fully
parametric.
Therefore
it
can
provide
the
researcher
with
a
large
range
to
specify
the
individual
and
unobserved
heterogeneity.
To
some
extent
this
flexibility
even
offsets
the
specificity
of
the
distributional
assumptions.
They
also
state
that
with
the
MXL
model
it
is
possible
for
the
researcher
to
harvest
a
rich
variety
of
information
about
behaviour
from
a
panel
or
repeated
measures
data
set.
Estimation
Train
(2003)
states
that
MXL
is
well
suited
for
simulation
methods
for
estimation.
Utility is again Uin = βn'xin + εin, and the β coefficients are distributed with the mixing distribution f(β|θ), where θ denotes the parameters of this distribution (the mean and covariance of β).
Now the choice probabilities are
\[
P_{in} = \int L_{in}(\beta) f(\beta \mid \theta)\, d\beta,
\qquad
L_{in}(\beta) = \frac{e^{\beta' x_{in}}}{\sum_{j=1}^{J} e^{\beta' x_{jn}}}.
\]
Here
Lin
again
is
the
logit
probability.
The probabilities Pin are approximated through simulation for any given value of θ.
According
to
Train
(2003)
this
is
conducted
through
three
steps:
1. Draw a value of β from the mixing distribution f(β|θ) and label it βr with the superscript r = 1, referring to the first draw.
2. Calculate
the
logit
formula
Lin(βr)
with
this
draw.
3. Repeat steps 1 and 2 many times and average the results.
This average is the simulated probability:
\[
\check{P}_{in} = \frac{1}{R} \sum_{r=1}^{R} L_{in}(\beta^r),
\]
where
R
is
the
number
of
draws.
Here the simulated probability is an unbiased estimator of Pin.
Its variance decreases as R increases, and it is strictly positive and twice differentiable in the parameters θ and the variables x, which facilitates the numerical search for the maximum of the likelihood function.
Train
denotes
this
simulated
log
likelihood
(SLL)
function
as
follows:
\[
\mathrm{SLL} = \sum_{n=1}^{N} \sum_{j=1}^{J} d_{jn} \ln \check{P}_{jn}.
\]
Here
djn
is
an
indicator
function,
it
is
equal
to
one
if
decision
maker
n
chose
alternative
j
and
zero
otherwise.
The maximum simulated likelihood estimator (MSLE) is the value of θ that maximizes the SLL function.
The
exact
properties
of
this
simulated
estimator
will
not
be
discussed
in
this
paper,
but
can
be
found
in
Train
(2003)
and
Greene
(2001).
This method of estimation is related to accept-reject (AR) methods of simulation. The AR method will not be discussed extensively, as the MSLE method is more often used. AR simulation can be applied generally.
It
is
constructed
as
follows
(Train
(2003),
Greene
(2001)):
1. A
draw
of
the
random
terms
is
taken.
2. The
utility
of
each
alternative
is
calculated
from
this
draw
and
the
alternative
with
the
highest
utility
is
identified.
3. Steps 1 and 2 are repeated many times.
4. The
simulated
probability
for
an
alternative
is
calculated
as
the
proportion
of
draws
for
which
that
alternative
has
the
highest
utility.
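The four AR steps amount to a frequency simulator. In the sketch below the systematic utilities and the normal error scale are invented for illustration (any error distribution could be drawn in step 1):

```python
import math, random

# Accept-reject (AR) frequency simulator: the simulated probability of an
# alternative is the share of draws in which it has the highest utility.
random.seed(2)
V = [1.0, 0.8, 0.2]                        # systematic utilities (illustrative)
R = 20000                                  # number of draws

wins = [0, 0, 0]
for _ in range(R):
    # step 1: draw the random terms; step 2: identify the max-utility alternative
    u = [v + random.gauss(0.0, 1.0) for v in V]
    wins[u.index(max(u))] += 1

p_ar = [w / R for w in wins]               # step 4: proportions of draws
assert abs(sum(p_ar) - 1.0) < 1e-9
assert p_ar[0] > p_ar[2]                   # higher systematic utility, higher share
```

Because each draw contributes either 0 or 1, the simulated probability moves in discrete jumps as the parameters change, which is exactly the step-function behaviour noted below.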
The
AR
simulator
is
also
unbiased
by
construction.
It is, however, not strictly positive and not always twice differentiable; it behaves like a step function. Therefore the MSLE method is more often used.
Repeated
choice
This model is easily generalized to repeated choices by each sampled decision maker.
The
simplest
specification
treats
the
coefficients
that
enter
utility
as
varying
over
different
decision
makers
but
constant
over
choice
situations
for
each
decision
maker.
Therefore the utility from alternative i in choice situation t for decision maker n is
\[
U_{itn} = \beta_n' x_{itn} + \varepsilon_{itn},
\]
with εitn being iid extreme value over alternatives, time and decision makers.
Now consider a sequence of alternatives, one for each time period: i = {i1, … , iT}.
The
probability
that
the
decision
maker
makes
this
sequence
of
choices
is
the
product
of
the
logit
formulas,
conditional
on
β:
\[
L_{in}(\beta) = \prod_{t=1}^{T} \frac{e^{\beta_n' x_{i_t tn}}}{\sum_j e^{\beta_n' x_{jtn}}}.
\]
The εitn terms are independent over time.
The
unconditional
probability
is
the
integral
of
this
product
over
all
values
of
β:
\[
P_{in} = \int L_{in}(\beta) f(\beta)\, d\beta.
\]
The
probability
is
simulated
in
a
similar
way
to
the
probability
with
just
one
choice
period:
1. A
draw
of
β
is
taken
from
its
distribution.
2. The
logit
formula
is
calculated
for
each
period
and
the
product
of
the
logits
is
taken.
3. Repeat
steps
1
and
2
often.
4. The
MSLE
is
the
average
of
the
simulated
probabilities.
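The four steps above can be sketched as follows. The two-period panel (attributes x, chosen sequence, and the normal mixing parameters) is invented for illustration; one draw of β is shared across the whole sequence, which is what distinguishes this from the single-period case:

```python
import math, random

# Simulated probability of a sequence of choices: per draw of beta, take
# the product of the period logits, then average over draws. Data are
# illustrative.
random.seed(3)
x = [[[1.0], [0.3]], [[0.4], [1.2]]]       # x[t][j]: attribute of alt j in period t
sequence = [0, 1]                          # chosen alternative i_t per period
mu, sigma = 0.7, 0.5                       # mean and std dev of f(beta)
R = 5000

def logit(beta, xs, i):
    expv = [math.exp(beta * xj[0]) for xj in xs]
    return expv[i] / sum(expv)

total = 0.0
for _ in range(R):
    beta = random.gauss(mu, sigma)         # step 1: one draw per sequence
    prod = 1.0
    for t, i in enumerate(sequence):       # step 2: product of period logits
        prod *= logit(beta, x[t], i)
    total += prod

P_seq = total / R                          # steps 3-4: average over draws
assert 0.0 < P_seq < 1.0
```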
Latent
Class
Logit
As
stated
in
the
last
chapter,
the
MXL
model
and
the
Latent
Class
Logit
(LCL)
model
are
not
that
different.
Remember
the
usual
form
that
the
mixed
logit
probability
takes:
" ! ' xin %
e
Pin = ( $ ' f (! )d ! .
$ ! e ! ' x jn '
# j &
This
choice
probability
is
the
same
for
the
LCL
model,
but
instead
of
f(β)
being
continuous,
the
mixing
distribution
is
discrete
with
β
taking
a
finite
set
of
distinct
values.
Here
β
takes
M
possible
values
labelled
b1,
…
,
bM
with
probability
sm
that
β
=
bm.
This
model
has
been
popular
in
psychology
and
marketing
for
some
time
now
(Kamakura
and
Russell
(1989)
and
Chintagunta
et
al.
(1991)).
Greene
and
Hensher
(2002)
state
that
the
LCL
model
is
similar
to
the
MXL
model
by
McFadden and Train (2000),
but
it
relaxes
the
requirement
that
the
researcher
makes
specific
assumptions
about
the
(continuous)
distributions
of
parameters
across
each
decision
maker.
Sagebiel (2011) states that the unconditional probability that decision maker n chooses a certain alternative i is the weighted average over the M parameter values bm:
\[
P_{in} = \sum_{m=1}^{M} s_m P_{in|m},
\]
with Pin|m being the conditional logit (CL) probability of choosing alternative i given membership of class m:
\[
P_{in|m} = \frac{e^{b_m' x_{in}}}{\sum_{j=1}^{J} e^{b_m' x_{jn}}}.
\]
The class membership probabilities sm are unknown, but can be estimated with an MNL model with the help of a case-specific variable vector that includes age, income or any other characteristic of the decision maker.
The
underlying
theory
of
the
LCL
model
presumes
that
individual
choice
behaviour
depends
on
the
observable
attributes
and
on
latent
heterogeneity
that
varies
with
the
unobserved
factors
(Greene
and
Hensher
(2002)).
They
analyse
this
heterogeneity
through
a
model
of
discrete
parameter
variation.
Here
decision
makers,
as
in
the
analysis
by
Sagebiel
(2011),
are
sorted
in
a
set
of
Q
classes,
but
it
is
unknown
to
the
researcher
which
class
contains
which
decision
makers,
even
if
the
decision
makers
themselves
know
it.
Decision maker n chooses among Jn alternatives in each of Tn choice situations.
As
is
clear,
Greene
and
Hensher
directly
explain
the
model
for
repeated
choice.
Here the choice probability that alternative i is chosen by decision maker n in choice situation t, given class q, is:
\[
P_{itn|q} = \frac{e^{x_{itn}' \beta_q}}{\sum_{j=1}^{J_n} e^{x_{jtn}' \beta_q}} = F(i, t, n \mid q).
\]
The
size
of
the
choice
set
and
the
number
of
observations
may
vary
per
decision
maker.
Moreover, the choice set may vary across choice situations as well.
The probability of the specific choice made by a decision maker can be formulated in different manners; for convenience, ytn denotes the specific choice made, so the model provides
\[
P_{tn|q}(i) = \Pr(y_{tn} = i \mid \text{class} = q).
\]
Given the class assignment, it is assumed that the Tn choices are independent events, and the contribution of decision maker n to the likelihood is the joint probability of the sequence yn = [y1n, … , yTn]:
\[
P_{n|q} = \prod_{t=1}^{T_n} P_{tn|q}.
\]
Here
the
class
assignment
is
unknown.
Now
Hnq
is
denoted
as
the
prior
probability
for
class
q
for
decision
maker
n.
For
Hnq
the
form
of
the
MNL
is
used:
\[
H_{nq} = \frac{e^{z_n' \theta_q}}{\sum_{q'=1}^{Q} e^{z_n' \theta_{q'}}},
\qquad q = 1, \ldots, Q, \quad \theta_Q = 0,
\]
where
zn
is
a
set
of
observable
characteristics,
as
described
in
the
chapter
on
binary
choice
models.
θq again denotes the parameters for class q.
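The MNL form of the prior class probabilities is a softmax over the class scores, with the last class normalized to zero. In the sketch below the case-specific vector zn (a constant and an age variable) and the θq vectors are invented for illustration:

```python
import math

# Prior class probabilities H_nq in MNL form, with theta_Q = 0 as the
# normalization for the last class. All numbers are illustrative.
z_n = [1.0, 35.0]                          # constant and e.g. age of person n
thetas = [[0.2, -0.01], [-0.5, 0.02], [0.0, 0.0]]  # theta_q; last is zero

scores = [math.exp(sum(t * z for t, z in zip(theta_q, z_n))) for theta_q in thetas]
H_n = [s / sum(scores) for s in scores]    # prior probability of each class

assert abs(sum(H_n) - 1.0) < 1e-9
assert all(h > 0 for h in H_n)
```

Fixing θQ = 0 is the normalization constraint: without it, adding any constant vector to every θq would leave all H_nq unchanged, so the parameters would not be identified.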
Greene
and
Hensher
(2002)
compared
the
MXL
model
and
the
LCL
model
for
a
transport
situation,
with
the
objective
to
seek
understanding
of
the
relative
merits
of
both
modelling
strategies.
They
state
that
it
is
not
possible
to
state
that
one
of
these
models
is
unambiguously
preferred
to
the
other.
The LCL model has the advantage that it is a semiparametric specification.
This
means
that
the
researcher
is
freed
from
possibly
strong
or
unwarranted
distributional
assumptions,
in
their
case
about
individual
heterogeneity.
For the LCL model the same applies as for the MXL model: it allows the researcher to harvest a rich variety of information on behaviour from a panel or repeated measures data set.
Estimation
The likelihood is now the expectation over classes of the class-specific contributions:
\[
P_n = \sum_{q=1}^{Q} H_{nq} P_{n|q}.
\]
This
is
the
likelihood
for
one
decision
maker.
Now the log likelihood for the sample is
\[
\ln L = \sum_{n=1}^{N} \ln P_n = \sum_{n=1}^{N} \ln \left[ \sum_{q=1}^{Q} H_{nq} \left( \prod_{t=1}^{T_n} P_{tn|q} \right) \right].
\]
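For a single decision maker, the likelihood contribution is the class-probability-weighted product of conditional logit probabilities over the choice situations. The sketch below uses one decision maker with two choice situations and two latent classes; all data and parameters are invented for illustration:

```python
import math

# Latent class likelihood contribution of one decision maker: weight the
# product of conditional logits per class by the prior class probability.
x = [[[1.0], [0.4]], [[0.6], [1.1]]]       # x[t][j]: attribute of alt j, situation t
choices = [0, 1]                           # y_tn: chosen alternative per situation
betas = [[0.9], [-0.4]]                    # beta_q per latent class, Q = 2
H = [0.6, 0.4]                             # prior class probabilities H_nq

def clogit(beta, xs, i):
    expv = [math.exp(beta[0] * xj[0]) for xj in xs]
    return expv[i] / sum(expv)

def P_given_class(q):
    # P_{n|q}: product over the T_n choice situations
    prod = 1.0
    for t, i in enumerate(choices):
        prod *= clogit(betas[q], x[t], i)
    return prod

P_n = sum(H[q] * P_given_class(q) for q in range(len(H)))
lnL_n = math.log(P_n)                      # this decision maker's contribution
assert lnL_n < 0.0
```

Summing lnL_n over all decision makers gives the sample log likelihood to be maximized.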
Maximization
of
this
function
with
respect
to
the
structural
parameters
and
the
latent
class
parameter
vectors
is
a
conventional
problem
in
maximum
likelihood
estimation.
Greene
(2001)
discusses
the
mechanics
and
other
aspects
of
estimation,
as
it
is
a
relatively
difficult
optimization
problem
in
comparison
to
other
situations.
The
choice
of
a
good
starting
value
for
Q
is
crucial.
Testing
down
to
the
appropriate
Q
by
comparing
the
log
likelihoods
of
smaller
models
is
not
a
proper
approach.
Roeder et al. (1999) suggest that the Bayesian Information Criterion (BIC) can be used:
\[
\mathrm{BIC(model)} = \ln L + \frac{(\text{model size}) \ln N}{N}.
\]
Now that we know the parameter estimates θ̂q, the prior estimates of the class probabilities are Ĥnq.
Using
Bayes
theorem
(Bayes
and
Price,
1763)
it
is
possible
to
obtain
a
posterior
estimate
of
the
latent
class
probabilities:
\[
\hat{H}_{q|n} = \frac{\hat{P}_{n|q} \hat{H}_{nq}}{\sum_{q'=1}^{Q} \hat{P}_{n|q'} \hat{H}_{nq'}}.
\]
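The posterior update is a direct application of Bayes' theorem: the prior class probability is reweighted by how well each class explains the decision maker's observed choices. The numbers in the sketch below are invented for illustration:

```python
# Posterior latent class probabilities via Bayes' theorem. The prior H_nq
# is reweighted by the class-conditional likelihood P_{n|q}. Illustrative data.
H_hat = [0.6, 0.4]                         # estimated prior class probabilities
P_hat = [0.05, 0.20]                       # estimated P_{n|q} per class

denom = sum(p * h for p, h in zip(P_hat, H_hat))
posterior = [p * h / denom for p, h in zip(P_hat, H_hat)]

assert abs(sum(posterior) - 1.0) < 1e-9
# class 2 explains this decision maker's choices better, so its weight rises
assert posterior[1] > H_hat[1]
```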
These data can also be used to estimate marginal effects in the logit model.
Summary, comments and acknowledgement
“I abhor averages. I like the individual case. A man may have six meals one day and none the
next, making an average of three meals per day, but that is not a good way to live.”
Louis D. Brandeis (1856 – 1941)
Variations:
different
choice
models
Until
now,
the
most
instrumental
models
have
been
discussed.
However,
this
does
not
mean
that
there
are
no
other
choice
models
known.
The
models
discussed
are
mostly
models
that
are
used
very
often
and
are
instrumental
in
the
development
of
new
models.
For
example,
the
standard
logit
models
are
not
used
that
often
anymore,
but
serve
more
as
reference
models.
The
MNL
model
however,
is
still
used
in
some
fields.
Dow and Endersby (2003) argue that in political voting research the MNL model is still usable: it is easily understandable, the differences in results are very small, and computationally it can be preferred over more complex models.
In
this
chapter
some
other
models
will
be
discussed
briefly.
The
reader
should
be
aware
that
the
models
discussed
until
now
are
not
the
only
models
and
in
different
areas
different
models
are
used.
Models such as the MNP model are much used, but are analytically more complex and did not fit the line of this paper. They are nevertheless worth mentioning.
The
Generalized
Extreme
Value
(GEV)
Model
This
model
has
been
mentioned
before
in
this
paper.
The
GEV
model
is
actually
more
a
class
of
models
than
a
specific
model.
The
MNL
model
and
the
NL
model
are
both
models
that
are
part
of
the
GEV
family
of
models.
McFadden
(1978)
gave
a
generalization
of
the
MNL
model
that
is
in
fact
the
GEV
model.
The
unifying
attribute
of
the
models
of
the
GEV
family
is
that
the
unobserved
components
of
the
utility
for
all
alternatives
are
jointly
distributed
as
a
generalized
extreme
value.
This
allows
for
correlations
over
alternatives.
When
all
these
correlations
become
zero,
the
GEV
model
becomes
the
standard
logit
model.
Currently the most widely used members of the family are the NL and CNL models. These models are applied in, for example, energy, transportation, housing and telecommunication.
Karlstrom
(2001)
showed
that
only
a
small
portion
of
the
possible
models
within
the
GEV
classes
has
ever
been
implemented.
This
means
that
the
full
capability
of
this
class
of
models
has
not
yet
been
fully
exploited
and
new
research
can
be
used
to
further
investigate
the
potential
of
this
group
of
models.
Joint
logit
Regarding multidimensional choice, mostly the NL and CNL models have been discussed: models based on multidimensional choice sets with shared unobserved attributes.
The
Joint
logit
(JL)
model
on
the
other
hand
is
a
model
based
on
multidimensional
choice
sets
with
shared
observed
attributes.
Here
there
are
observed
factors
in
the
choice
set
that
are
shared,
therefore
overlap
in
some
sense.
As
in
the
MXL
model,
marginal
and
conditional
probabilities
are
derived.
The
distribution
of
the
disturbances
will
affect
the
form
of
the
conditional
choice
probabilities.
A
normal
distribution
is
not
always
guaranteed.
Currently
joint
models
are
mostly
used
for
combining
stated
and
revealed
preference
data.
Multinomial
Probit
Just as the MNL model is an extension of the binary logit model, the MNP model is an extension of the binary probit model.
Now the vector of disturbances is extended from two to Jn dimensions. This requires solving a (Jn − 1)-dimensional integral to evaluate the choice probabilities.
The
concept
of
this
model
appeared
a
very
long
time
ago
in
writings
by
Thurstone
(1927)
in
applications
of
mathematical
psychology.
Because of the open form of the integral in evaluating the choice probability, its computational difficulty made it unusable until the early 1980s.
Since
the
rise
of
simulation
it
has
been
possible
to
use
the
MNP
model
more
extensively.
Nowadays
the
MNP
model
is
used
in
different
areas.
Dow
and
Endersby
(2003)
indicate
that
it
is
often
used
in
political
voting
and
in
psychology
(Train
(2003)).
As
for
estimation,
this
is
a
bit
more
complicated
than
for
the
MNL
model,
as
indicated
before.
Because
the
choice
probabilities
have
an
open
form
integral,
simple
estimation
techniques
are
often
insufficient.
Geweke
(1996)
gives
an
explanation
of
quadrature
methods,
which
approximate
the
integral
by
a
weighted
function
of
specially
chosen
evaluation
points.
Nowadays, however, non-simulation techniques are rarely used, partly because simulation is more general.
Hajivassiliou
et
al.
(1996)
give
an
overview
of
different
simulation
techniques.
The most straightforward and most used techniques are accept-reject (AR), which was briefly noted in the chapter on the MXL model, smoothed AR, and GHK (after Geweke, Hajivassiliou and Keane), which can be combined with maximum likelihood or sampling.
Train
(2003)
describes
these
models
more
elaborately.
Mixed
Probit
Again, mixed probit (MXP) can be seen as a development of the probit model, just as MXL is of MNL.
A constraint of the MNP model is that all random terms enter utility linearly and are distributed such that utility itself is normally distributed.
This
constraint
is
removed
in
MXP
models.
These models, like the MXL models, have conditional probabilities whose integral also has an open form. Therefore the model has long run times with the GHK simulator.
Train (2003) states that the MXP model provides a way to avoid some of the practical difficulties of the MXL model, such as representing pure heteroskedasticity or fixed correlation patterns among alternatives.
In the MXP model, estimation can proceed without specifying the numerous error components of the MXL model. MXP is also suitable for some non-normal random terms, which is not possible in the MNP model.
However
MXP
is
more
complex
and
computationally
more
burdensome.
Further
research
As stated before, new models will continue to arise now that possibilities have opened up with the advent of simulation.
In 2002, Ben-Akiva et al. wrote a paper on hybrid choice models.
Burda et al. (2008), for example, present a Bayesian mixed logit-probit model for multinomial choice.
It is expected that in the coming years more of these hybrid models will surface as more becomes computationally feasible.
Summary
In
this
paper
the
reader
is
given
an
overview
of
theory
behind
choice
models.
Choice
models
have
many
applications
in
society:
psychology,
transport,
energy,
housing,
marketing,
voting
and
many
more
areas
make
use
of
these
models.
Since
the
1920s,
when
Thurstone
(1927)
introduced
the
binary
logit
and
probit
model,
these
models
have
developed
significantly.
First, the theory behind the models was discussed: choice theory, probabilistic choice theory and utility theory.
Most models assume rational, utility-maximizing behaviour. The framework for binary models was also given.
In
the
second
part
of
the
paper
different
models
have
been
discussed.
Binary
logit
and
probit
models
can
be
considered
predecessors
of
multinomial
logit
and
probit
models.
In a similar way, nested logit and mixed logit models are derived from the multinomial logit model.
It is still discernible that the multinomial logit model is a special case of both the nested logit model and the mixed logit model, as with the right parameters these models reduce to the multinomial model.
For
the
nested
logit
model
it
applies
that
both
the
marginal
and
conditional
probabilities
are
logits.
The cross-nested logit model can also be seen as an extension of the nested logit model, as it allows for overlap between nests.
The latent class model, like the mixed logit model, is derived from the generalized extreme value model. The difference between these models lies in the underlying assumptions, chiefly the choice of mixing distribution.
In this part, estimation techniques were also discussed for each model.
For
two
models
that
were
discussed
last,
the
latent
class
model
and
the
mixed
logit
model,
simulation
techniques
are
necessary
because
of
the
open
form
of
the
integral
in
the
choice
probability.
This
makes
these
models
more
complex
and
computationally
more
burdensome.
These models, however, avoid some of the assumptions underlying, for example, the multinomial model.
In
the
third
and
last
section
a
short
overview
was
given
of
models
that
were
not
discussed
elaborately
in
the
second
section,
but
were
worth
mentioning
due
to
their
value
and
use
in
research.
The generalized extreme value model is often referenced, and the multinomial probit model is still very much used.
Acknowledgement
My sincere thanks go out to Alwin Haensel, who, besides helping me with my questions, put me on the path to writing this paper by making this subject public on the website of the VU.
References
Ai C. and Norton E.C. (2003), Interaction terms in logit and probit models
Alvarez R.M. and Nagler J. (1995), Economics, Issues and the Perot Candidacy: Voter Choice in the 1992 Presidential Election
Alvarez R.M. and Nagler J. (2001), Correlated disturbances in discrete choice models: a comparison of multinomial probit and logit models
Adamowicz W., Louviere J. and Williams M. (1994), Combining Revealed and Stated Preference Methods for Valuing Environmental Amenities – Journal of Environmental Economics and Management
Anand P. (1993), Foundations of rational choice under risk
Bayes T. and Price M. (1763), An Essay towards solving a Problem in the Doctrine of Chances – Philosophical Transactions of the Royal Society of London
Ben-Akiva M.E. and Lerman S.R. (1985), Discrete Choice Analysis: Theory and Application to Travel Demand – MIT Press
Ben-Akiva M.E. and Bierlaire M. (1999), Discrete choice methods and their applications to short-term travel decisions – Handbook of Transportation Science
Ben-Akiva M.E., McFadden D., Train K., Walker J., Bhat C., Bierlaire M., Bolduc D., Boersch-Supan A., Brownstone D., Bunch D.S., Daly A., De Palma A., Gopinath D., Karlstrom A. and Munizaga M. (2002), Hybrid Choice Models: Progress and Challenges
Bierlaire M. (2001), A theoretical analysis of the cross-nested logit model
Collins L.M. and Lanza S.T. (2010), Latent class and latent transition analysis for the social, behavioural and health sciences
Cox D.R. (1970), Analysis of Binary Data
Dow J.K. and Endersby J.W. (2003), Multinomial probit and multinomial logit: a comparison of choice models for voting research
Geweke J. (1996), Monte Carlo simulation and numerical integration – Handbook of Computational Economics, Elsevier Science
Glasgow G. (2001), Mixed logit models for multiparty elections
Greene W. (2001), Fixed and Random Effects in Nonlinear Models – Working paper, Stern School of Business, Department of Economics
Greene W.H. and Hensher D.A. (2003), A latent class model for discrete choice analysis: contrasts with mixed logit
Hajivassiliou V., McFadden D. and Ruud P. (1996), Simulation of multivariate normal rectangle probabilities and their derivatives: Theoretical and computational results – Journal of Econometrics
Hensher D.A., Rose J.M. and Greene W.H. (2005), Applied Choice Analysis: A Primer
Karlstrom A. (2001), Developing generalized extreme value models using the Pickands representation theorem – Working paper, Infrastructure and Planning, Royal Institute of Technology, Stockholm
Kreps D.M. (1988), Notes on the theory of choice
Kroes E.P. and Sheldon R.J. (1988), Stated Preference Methods – Journal of Transport Economics and Policy
Luce D. (1959), Individual Choice Behavior: A Theoretical Analysis
McCutcheon A.L. (1987), Latent class analysis
McFadden D. (1978), Modelling the choice of residential location – Spatial Interaction Theory and Residential Location
McFadden D. and Train K. (2000), Mixed MNL models for discrete response – Journal of Applied Econometrics
Papola A. (2003), Some developments on the cross-nested logit model – Elsevier
Stevens T.H. (2005), Can Stated Preference Valuations Help Improve Environmental Decision Making? – Choices Magazine, a publication of the American Agricultural Economics Association
Train K. (2003), Discrete Choice Methods with Simulation
Thurstone L.L. (1927), A law of comparative judgment – Psychological Review
Varian H.R. (2005), Revealed Preference
Vovsha P. (1997), Cross-nested logit model: an application to mode choice in the Tel-Aviv metropolitan area
Wassenaar H.J. and Chen W. (2003), An approach to decision-based design with discrete choice analysis for demand modeling
Wardman M. (1988), A comparison of revealed preference and stated preference models of travel behavior – Journal of Transport Economics and Policy
Wooldridge J.M. (2002), Econometric Analysis of Cross Section and Panel Data – MIT Press