Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

This document summarizes the key points of the ICML 2001 paper "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data". It introduces conditional random fields (CRFs), a discriminative framework for building probabilistic models to label and segment sequence data. CRFs address the label bias problem that can occur in maximum entropy Markov models (MEMMs) by normalizing probabilities over entire label sequences rather than per state. Experimental results show that CRFs outperform HMMs and MEMMs on tasks such as part-of-speech tagging, thanks to their ability to incorporate diverse, overlapping features while avoiding label bias.


ICML 2001

Conditional Random Fields:


Probabilistic Models for Segmenting and
Labeling Sequence Data
John Lafferty, Andrew McCallum, Fernando Pereira


Presentation by Rongkun Shen
Nov. 20, 2003
Sequence Segmenting and Labeling
Goal: mark up sequences with content tags

Applications in computational biology
DNA and protein sequence alignment
Sequence homolog searching in databases
Protein secondary structure prediction
RNA secondary structure analysis

Applications in computational linguistics & computer science
Text and speech processing, including topic segmentation and part-of-speech (POS) tagging
Information extraction
Syntactic disambiguation
Example: Protein secondary structure prediction
Conf: 977621015677468999723631357600330223342057899861488356412238
Pred: CCCCCCCCCCCCCEEEEEEECCCCCCCCCCCCCHHHHHHHHHHHHHHHCCCCEEEEHHCC
AA: EKKSINECDLKGKKVLIRVDFNVPVKNGKITNDYRIRSALPTLKKVLTEGGSCVLMSHLG
10 20 30 40 50 60


Conf: 855764222454123478985100010478999999874033445740023666631258
Pred: CCCCCCCCCCCCCCCCCCCCCCCCCCHHHHHHHHHHHHHCCCCCCCCCCCCHHHHHHCCC
AA: RPKGIPMAQAGKIRSTGGVPGFQQKATLKPVAKRLSELLLRPVTFAPDCLNAADVVSKMS
70 80 90 100 110 120


Conf: 874688611002343044310017899999875053355212244334552001322452
Pred: CCCEEEECCCHHHHHHCCCCCHHHHHHHHHHHHHCCEEEECCCCCCCCCCCCCCCCHHHH
AA: PGDVVLLENVRFYKEEGSKKAKDREAMAKILASYGDVYISDAFGTAHRDSATMTGIPKIL
130 140 150 160 170 180
Generative Models
Hidden Markov models (HMMs) and stochastic grammars
Assign a joint probability to paired observation and label sequences
The parameters are typically trained to maximize the joint likelihood of training
examples
Generative Models (contd)
Difficulties and disadvantages
Need to enumerate all possible observation sequences
Not practical to represent multiple interacting features or long-range
dependencies of the observations
Very strict independence assumptions on the observations
Conditional Models
Conditional probability P(label sequence y | observation sequence x) rather
than joint probability P(y, x)
Specify the probability of possible label sequences given an observation
sequence

Allow arbitrary, non-independent features on the observation sequence X

The probability of a transition between labels may depend on past and
future observations
Relax strong independence assumptions in generative models
Discriminative Models
Maximum Entropy Markov Models (MEMMs)
Exponential model
Given training set X with label sequence Y:
Train a model that maximizes P(Y|X, θ)
For a new data sequence x, the predicted label sequence y maximizes P(y|x, θ)
Notice the per-state normalization
MEMMs (contd)
MEMMs have all the advantages of Conditional Models

Per-state normalization: all the mass that arrives at a state must be
distributed among the possible successor states (conservation of score
mass)

Subject to Label Bias Problem

Bias toward states with fewer outgoing transitions
Label Bias Problem
Consider this MEMM (observations r, i, o; label values 1 and 2):

P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r)
P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r)

In the training data, label value 2 is the only label value observed after label value 1
Therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for all x
Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri)

However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro)

Per-state normalization does not allow the required expectation
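A minimal numeric sketch of this effect in Python (hypothetical scores, not the paper's exact automaton): with per-state normalization, a state with a single outgoing transition passes on all of its probability mass regardless of the observation, while a state with several successors can still be swayed by it.

    import math

    def per_state_softmax(scores):
        # Per-state normalization: softmax over the successors of one state only.
        z = sum(math.exp(v) for v in scores.values())
        return {s: math.exp(v) / z for s, v in scores.items()}

    # Label 1 has a single successor (label 2). Whatever score the observation
    # produces, normalization turns it into probability 1: P(2 | 1, i) = P(2 | 1, o) = 1.
    print(per_state_softmax({2: 4.0}))    # observation 'i' -> {2: 1.0}
    print(per_state_softmax({2: -4.0}))   # observation 'o' -> {2: 1.0}

    # A state with two successors, by contrast, can still respond to the observation.
    print(per_state_softmax({2: 4.0, 3: 0.0}))   # mass mostly on label 2
    print(per_state_softmax({2: -4.0, 3: 0.0}))  # mass mostly on label 3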
Solving the Label Bias Problem
Change the state-transition structure of the model
Not always practical to change the set of states
Start with a fully-connected model and let the training
procedure figure out a good structure
Precludes the use of prior structural knowledge, which is very valuable (e.g. in
information extraction)
Random Field
Conditional Random Fields (CRFs)
CRFs have all the advantages of MEMMs without the label bias problem
MEMM uses a per-state exponential model for the conditional probabilities of
next states given the current state
CRF has a single exponential model for the joint probability of the entire
sequence of labels given the observation sequence
Undirected acyclic graph
Allow some transitions to vote more strongly than others, depending on the
corresponding observations
Definition of CRFs
X is a random variable over data sequences to be labeled
Y is a random variable over corresponding label sequences
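Formally, following the paper's definition: let G = (V, E) be a graph such that Y = (Y_v), v ∈ V, so Y is indexed by the vertices of G. Then (X, Y) is a conditional random field when, conditioned on X, the random variables Y_v obey the Markov property with respect to the graph:

p(Y_v | X, Y_w, w ≠ v) = p(Y_v | X, Y_w, w ~ v)

where w ~ v means that w and v are neighbors in G.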
Example of CRFs
Graphical comparison among HMMs, MEMMs and CRFs
(Graphical models shown side by side: HMM, MEMM, CRF)
Conditional Distribution
If the graph G = (V, E) of Y is a tree, the conditional distribution over the
label sequence Y = y, given X = x, by the fundamental theorem of random
fields is:

p_θ(y | x) ∝ exp( Σ_{e∈E, k} λ_k f_k(e, y|_e, x) + Σ_{v∈V, k} μ_k g_k(v, y|_v, x) )

x is a data sequence
y is a label sequence
v is a vertex from the vertex set V = set of label random variables
e is an edge from the edge set E over V
f_k and g_k are given and fixed; g_k is a Boolean vertex feature and f_k is a Boolean edge feature
k is the index over features
θ = (λ_1, λ_2, …, λ_n; μ_1, μ_2, …, μ_k); the λ_k and μ_k are parameters to be estimated
y|_e is the set of components of y defined by edge e
y|_v is the set of components of y defined by vertex v
Conditional Distribution (contd)
CRFs use the observation-dependent normalization Z(x) for the
conditional distributions:

p_θ(y | x) = (1 / Z(x)) exp( Σ_{e∈E, k} λ_k f_k(e, y|_e, x) + Σ_{v∈V, k} μ_k g_k(v, y|_v, x) )

Z(x) is a normalization over the data sequence x
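A minimal brute-force sketch of this distribution in Python (hypothetical toy features and weights, not the paper's experiments): score every candidate label sequence, sum the exponentiated scores to get Z(x), and divide.

    import math
    from itertools import product

    LABELS = ['A', 'B']

    def score(y, x, lam, mu):
        # Toy weighted features: mu * g for each vertex, lam * f for each edge.
        s = sum(mu * (label == obs) for label, obs in zip(y, x))       # vertex features g
        s += sum(lam * (y[e] == y[e + 1]) for e in range(len(y) - 1))  # edge features f
        return s

    def crf_probability(y, x, lam=1.0, mu=2.0):
        # Z(x): sum of exp(score) over every possible label sequence of length len(x).
        Z = sum(math.exp(score(cand, x, lam, mu))
                for cand in product(LABELS, repeat=len(x)))
        return math.exp(score(y, x, lam, mu)) / Z

    print(crf_probability(('A', 'A', 'B'), ('A', 'B', 'B')))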
Parameter Estimation for CRFs
The paper provides iterative scaling algorithms

It turns out to be very inefficient

Prof. Dietterich's group applied a gradient descent algorithm, which is quite efficient
Training of CRFs (From Prof. Dietterich)
First, we take the log of the conditional distribution:

log p_θ(y | x) = Σ_{e∈E, k} λ_k f_k(e, y|_e, x) + Σ_{v∈V, k} μ_k g_k(v, y|_v, x) − log Z(x)

Then we take the derivative with respect to each parameter:

∂ log p_θ(y | x) / ∂λ_k = Σ_{e∈E} f_k(e, y|_e, x) − ∂ log Z(x) / ∂λ_k
∂ log p_θ(y | x) / ∂μ_k = Σ_{v∈V} g_k(v, y|_v, x) − ∂ log Z(x) / ∂μ_k

For training, the first terms are easy to get. For example, for each λ_k, f_k
evaluated along the sequence is a string of Boolean values, such
as 00101110100111, and Σ_{e∈E} f_k(e, y|_e, x) is just the total number of 1s in that string.

The hardest thing is how to calculate Z(x).
Training of CRFs (From Prof. Dietterich) (contd)
Maximal cliques: for the chain of label variables y1 – y2 – y3 – y4, the maximal
cliques are c1 = {y1, y2}, c2 = {y2, y3}, c3 = {y3, y4}

Assign each vertex feature to exactly one clique and define a potential per clique
(here g(y_i, x) stands for the weighted vertex-feature sum Σ_k μ_k g_k and
f(y_i, y_j, x) for the weighted edge-feature sum Σ_k λ_k f_k):

c1(y1, y2, x) := exp( g(y1, x) + g(y2, x) + f(y1, y2, x) )
c2(y2, y3, x) := exp( g(y3, x) + f(y2, y3, x) )
c3(y3, y4, x) := exp( g(y4, x) + f(y3, y4, x) )

Then

Z(x) = Σ_{y1, y2, y3, y4} c1(y1, y2, x) c2(y2, y3, x) c3(y3, y4, x)
     = Σ_{y1} Σ_{y2} Σ_{y3} Σ_{y4} c1(y1, y2, x) c2(y2, y3, x) c3(y3, y4, x)
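Because each clique potential couples only adjacent labels, the sums can be pushed inward and Z(x) computed by dynamic programming instead of enumerating every label sequence. A minimal sketch for this four-variable chain (hypothetical potentials; any non-negative functions of the clique's labels and x would do):

    from itertools import product

    LABELS = ['A', 'B']

    # Hypothetical clique potentials c1, c2, c3.
    def c1(y1, y2, x): return 1.0 + (y1 == y2)
    def c2(y2, y3, x): return 1.0 + (y2 == x[2])
    def c3(y3, y4, x): return 1.0 + (y3 == y4)

    def z_brute_force(x):
        # Sum the product of clique potentials over every label sequence.
        return sum(c1(y1, y2, x) * c2(y2, y3, x) * c3(y3, y4, x)
                   for y1, y2, y3, y4 in product(LABELS, repeat=4))

    def z_dynamic_programming(x):
        # Push the sums inward so each label variable is summed out once.
        a = {y2: sum(c1(y1, y2, x) for y1 in LABELS) for y2 in LABELS}          # sum out y1
        b = {y3: sum(a[y2] * c2(y2, y3, x) for y2 in LABELS) for y3 in LABELS}  # sum out y2
        return sum(b[y3] * c3(y3, y4, x) for y3 in LABELS for y4 in LABELS)     # sum out y3, y4

    x = ['A', 'B', 'B', 'A']
    print(z_brute_force(x), z_dynamic_programming(x))  # the two values agree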
Modeling the label bias problem
In a simple HMM, each state generates its designated symbol with probability
29/32 and each of the other symbols with probability 1/32

Train the MEMM and the CRF with the same topologies

A run consists of 2,000 training examples and 500 test examples, trained to
convergence using the iterative scaling algorithm

CRF error is 4.6%, and MEMM error is 42%

The MEMM fails to discriminate between the two branches

The CRF solves the label bias problem
MEMM vs. HMM
The HMM outperforms the MEMM
MEMM vs. CRF
CRF usually outperforms the MEMM
CRF vs. HMM
Each open square represents a data set with α < 1/2, and a solid circle indicates
a data set with α ≥ 1/2; when the data is mostly second order (α ≥ 1/2), the
discriminatively trained CRF usually outperforms the HMM
POS tagging Experiments
POS tagging Experiments (contd)
Compared HMMs, MEMMs, and CRFs on Penn Treebank POS tagging
Each word in a given input sentence must be labeled with one of 45 syntactic tags
Add a small set of orthographic features: whether a spelling begins with a number
or upper case letter, whether it contains a hyphen, and if it contains one of the
following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies
oov = out-of-vocabulary (not observed in the training set)
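A small sketch of what these orthographic features might look like in code (hypothetical helper names, following the feature list above):

    SUFFIXES = ('ing', 'ogy', 'ed', 's', 'ly', 'ion', 'tion', 'ity', 'ies')

    def orthographic_features(word):
        # Boolean spelling features of the kind added to the taggers.
        feats = {
            'starts_with_digit': word[0].isdigit(),
            'starts_with_uppercase': word[0].isupper(),
            'contains_hyphen': '-' in word,
        }
        for suffix in SUFFIXES:
            feats['suffix_' + suffix] = word.endswith(suffix)
        return feats

    print(orthographic_features('Walking'))
    print(orthographic_features('well-known'))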
Summary
Per-state normalized discriminative models such as MEMMs are prone to the label bias problem

CRFs provide the benefits of discriminative models

CRFs solve the label bias problem well, and demonstrate good
performance
Thanks for your attention!

Special thanks to
Prof. Dietterich & Tadepalli!
