
CS11-711 Advanced NLP

Word Representation and Text Classifiers
Graham Neubig

https://phontron.com/class/anlp-fall2024/
Reminder:
Bag of Words (BOW)
[Figure: "I hate this movie" — a feature vector is looked up for each word, the vectors are summed, and a dot product with the weights produces a score.]

Features f are based on word identity, weights w learned


Which problems mentioned before would this solve?
What’s Missing in BOW?
• Handling of conjugated or compound words → Subword Models
  • I love this movie -> I loved this movie
• Handling of word similarity → Word Embeddings
  • I love this movie -> I adore this movie
• Handling of combination features → Neural Networks
  • I love this movie -> I don’t love this movie
  • I hate this movie -> I don’t hate this movie
• Handling of sentence structure → Sequence Models
  • It has an interesting story, but is boring overall
Subword Models
Basic Idea
• Split less common words into multiple subword tokens
the companies are expanding

the compan _ies are expand _ing

• Benefits:
  • Share parameters between word variants, compound words
  • Reduce parameter size, save compute+memory
Byte Pair Encoding
(Sennrich+ 2015)
• Incrementally combine the most frequent token pairs
{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

pairs = get_stats(vocab)

[(('e', 's'), 9), (('s', 't'), 9), (('t', '</w>'), 9), (('w', 'e'), 8), (('l', 'o'), 7), …]

vocab = merge_vocab(pairs[0], vocab)

{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w es t </w>': 6, 'w i d es t </w>': 3}

pairs = get_stats(vocab)
[(('es', 't'), 9), (('t', '</w>'), 9), (('l', 'o'), 7), (('o', 'w'), 7), (('n', 'e'), 6)]

vocab = merge_vocab(pairs[0], vocab)


{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w est </w>': 6, 'w i d est </w>': 3}

Example code:
https://github.com/neubig/anlp-code/tree/main/02-subwords
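
As a rough sketch (not necessarily identical to the linked example code), get_stats and merge_vocab could look like the following; get_stats is assumed here to return pairs sorted by frequency so that pairs[0] is the most frequent:

import re
from collections import defaultdict

def get_stats(vocab):
    """Count adjacent symbol pairs in the vocabulary, most frequent first."""
    counts = defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return sorted(counts.items(), key=lambda kv: -kv[1])

def merge_vocab(pair_with_count, vocab):
    """Merge the given symbol pair into a single symbol everywhere it occurs."""
    (a, b), _ = pair_with_count
    pattern = re.compile(r'(?<!\S)' + re.escape(a + ' ' + b) + r'(?!\S)')
    return {pattern.sub(a + b, word): freq for word, freq in vocab.items()}

vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(2):
    pairs = get_stats(vocab)
    vocab = merge_vocab(pairs[0], vocab)
# -> {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w est </w>': 6, 'w i d est </w>': 3}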
Unigram Models
(Kudo 2018)
• Use a unigram LM that generates all words in the
sequence independently (more next lecture)
• Pick a vocabulary that maximizes the log likelihood
of the corpus given a fixed vocabulary size
• Optimization performed using the EM algorithm
(details not important for most people)
• Find the segmentation of the input that maximizes
unigram probability
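
To make the last point concrete, here is a toy sketch (with made-up subword log-probabilities, not the actual Kudo 2018 training procedure) of finding the segmentation that maximizes unigram probability via dynamic programming:

import math

# hypothetical unigram log-probabilities over subword pieces
logp = {'h': -5.0, 'u': -5.0, 'g': -5.0, 'hu': -4.5, 'ug': -3.0, 'hug': -2.0}

def best_segmentation(word):
    """Viterbi search over all segmentations of `word` into known pieces."""
    best = [0.0] + [-math.inf] * len(word)   # best[i]: best score of word[:i]
    back = [0] * (len(word) + 1)             # backpointer to the previous split
    for i in range(1, len(word) + 1):
        for j in range(i):
            piece = word[j:i]
            if piece in logp and best[j] + logp[piece] > best[i]:
                best[i], back[i] = best[j] + logp[piece], j
    pieces, i = [], len(word)
    while i > 0:
        pieces.append(word[back[i]:i])
        i = back[i]
    return pieces[::-1]

print(best_segmentation('hug'))   # ['hug'] beats ['h', 'ug'] and ['hu', 'g']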
SentencePiece
• A highly optimized library that makes it possible to
train and use BPE and Unigram models

% spm_train --input=<input> \
--model_prefix=<model_name>
--vocab_size=8000 --character_coverage=1.0
--model_type=<type>

% spm_encode --model=<model_file>
--output_format=piece < input > output

• Python bindings also available

https://github.com/google/sentencepiece
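
A rough equivalent using the Python bindings (a sketch assuming a recent version of the sentencepiece package and an input file corpus.txt):

import sentencepiece as spm

# Train a model; the keyword arguments mirror the spm_train flags above.
spm.SentencePieceTrainer.train(
    input='corpus.txt', model_prefix='m',
    vocab_size=8000, character_coverage=1.0, model_type='unigram')

# Load the model and segment text into subword pieces.
sp = spm.SentencePieceProcessor(model_file='m.model')
print(sp.encode('the companies are expanding', out_type=str))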
Subword Considerations
• Multilinguality: Subword models are hard to use
multilingually because, trained naively, they will
over-segment less common languages (Ács 2019)
• Work-around: Upsample less represented
languages
• Arbitrariness: Do we do “es t” or “e st”?
• Work-around: “Subword regularization" samples
different segmentations at training time to make
models robust (Kudo 2018)
Continuous Word
Embeddings
Basic Idea
• Previously we represented words with a sparse vector
with a single “1” — a one-hot vector
• Continuous word embeddings look up a dense vector

[Figure: "I hate this movie" under one-hot representations vs. dense representations — each word is looked up, yielding either a sparse one-hot vector or a dense embedding.]
Continuous Bag of Words
(CBOW)
[Figure: dense embeddings for "I hate this movie" are looked up and summed; multiplying by a weight matrix W and adding a bias gives the scores.]
What do Our Vectors Represent?
• No guarantees, but we hope that:
  • Words that are similar (syntactically, semantically,
same language, etc.) are close in vector space
  • Each vector element is a feature (e.g. is this an
animate object? is this a positive word?, etc.)

great
excellent
angel
sun
nice
Shown in 2D, but
cat basket in reality we use
dog 512, 1024, etc.
bad disease
monster
A Note: “Lookup”
• Lookup can be viewed as “grabbing” a single
vector from a big matrix of word embeddings
(vector size × num. words), e.g. lookup(2) takes column 2
• Similarly, can be viewed as multiplying the
embedding matrix by a “one-hot” vector
• The former tends to be faster
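
A small numpy sketch of the equivalence (sizes are arbitrary):

import numpy as np

vocab_size, emb_size = 5, 3
E = np.random.randn(emb_size, vocab_size)   # embedding matrix (vector size x num. words)

v_lookup = E[:, 2]                           # "grab" column 2 directly

one_hot = np.zeros(vocab_size)               # multiply by a one-hot vector
one_hot[2] = 1.0
v_matmul = E @ one_hot

assert np.allclose(v_lookup, v_matmul)       # same result; the lookup is faster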
Training a More Complex
Model
Reminder: Simple Training of BOW Models

• Use an algorithm called “structured perceptron”

feature_weights = {}
for x, y in data:
    # Make a prediction
    features = extract_features(x)
    predicted_y = run_classifier(features)
    # Update the weights if the prediction is wrong
    if predicted_y != y:
        for feature in features:
            feature_weights[feature] = (
                feature_weights.get(feature, 0) +
                y * features[feature]
            )

Full Example:
https://github.com/neubig/anlp-code/tree/main/01-simpleclassifier
How do we Train More
Complex Models?
• We use gradient descent
• Write down a loss function
• Calculate derivatives of the loss function w.r.t. the parameters
• Move the parameters in the direction that reduces the loss function
Loss Function
• A value that gets lower as the model gets better
• Examples from binary classification using score s(x)

Hinge Loss:
  ℓ = max(0, −y · s)

Sigmoid + Negative Log Likelihood:
  σ(y · s) = 1 / (1 + e^(−y·s)),   ℓ = −log σ(y · s)

[Figure: both losses plotted against the score for y = 1 and y = −1.]

Hinge loss is more closely linked to accuracy; sigmoid + NLL has a probabilistic interpretation and non-zero gradients everywhere.
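
As a small numerical sketch of the two losses (numpy; the scores are chosen arbitrarily):

import numpy as np

def hinge_loss(s, y):
    """Hinge loss for label y in {-1, +1} and score s."""
    return np.maximum(0.0, -y * s)

def sigmoid_nll_loss(s, y):
    """-log sigma(y * s), written stably as log(1 + exp(-y * s))."""
    return np.log1p(np.exp(-y * s))

for s in [-2.0, 0.0, 2.0]:
    print(s, hinge_loss(s, y=1), sigmoid_nll_loss(s, y=1))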
Calculating Derivatives
• Calculate the derivative of the loss function with respect to each parameter

• Example from BOW model + hinge loss


  ∂/∂w_i  max(0, −y · Σ_{j=1}^{|V|} w_j · freq(v_j, x))
    = −y · freq(v_i, x)   if −y · Σ_{j=1}^{|V|} w_j · freq(v_j, x) > 0
    = 0                   otherwise
Optimizing Parameters
• Standard stochastic gradient descent does

  g_t = ∇_{θ_{t−1}} ℓ(θ_{t−1})      (gradient of the loss)
  θ_t = θ_{t−1} − η g_t             (η: learning rate)

• There are many other optimization options! (see
Ruder 2016 in references)
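
A minimal sketch of this update rule on a toy one-dimensional problem (the loss and its gradient here are made up purely for illustration):

import numpy as np

def sgd_step(theta, grad, eta=0.1):
    """theta_t = theta_{t-1} - eta * g_t"""
    return theta - eta * grad

# toy loss l(theta) = (theta - 3)^2, with gradient 2 * (theta - 3)
theta = np.array([0.0])
for _ in range(100):
    g = 2 * (theta - 3.0)
    theta = sgd_step(theta, g)
print(theta)   # close to [3.0], the minimizer of the loss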
What is this Algorithm?
feature_weights = {}
for x, y in data:
    # Make a prediction
    features = extract_features(x)
    predicted_y = run_classifier(features)
    # Update the weights if the prediction is wrong
    if predicted_y != y:
        for feature in features:
            feature_weights[feature] = (
                feature_weights.get(feature, 0) +
                y * features[feature]
            )

• Loss function: Hinge Loss

• Optimizer: SGD w/ learning rate 1


Combination Features
[Figure: "I don’t love this movie" and "There’s nothing I don’t love about this movie", each to be classified on a scale from very good to very bad — examples where the label depends on combinations of words, not individual word identities.]
Basic Idea of Neural Networks
(for NLP Prediction Tasks)
[Figure: embeddings for "I hate this movie" are looked up and fed through some complicated function that extracts combination features (a neural net), producing scores and then probabilities via softmax.]
Deep CBOW
[Figure: the summed word embeddings are passed through two hidden layers, tanh(W1*h + b1) and tanh(W2*h + b2), then a final linear layer W·h + bias produces the scores.]
What do Our Vectors
Represent?

• Now things are more interesting!
• We can learn feature combinations (a node in the
second layer might be “feature 1 AND feature 5 are
active”)
  • e.g. capture things such as “not” AND “hate”

What is a Neural Net?:
Computation Graphs
“Neural” Nets
Original Motivation: Neurons in the Brain

Current Conception: Computation Graphs


[Figure: a neuron (image credit: Wikipedia) next to a computation graph over inputs x, A, b, c, with nodes computing f(x1, x2, x3) = Σ_i x_i, f(M, v) = Mv, f(U, V) = UV, f(u) = uᵀ, and f(u, v) = u · v.]


expression:
  y = xᵀ A x + b · x + c

graph (built up node by node in the original slides):

• A node is a {tensor, matrix, vector, scalar} value
• An edge represents a function argument (and also a data dependency). They are just pointers to nodes.
• A node with an incoming edge is a function of that edge’s tail node.
• A node knows how to compute its value and the value of its derivative w.r.t. each argument (edge) times a derivative of an arbitrary input ∂F/∂f(u), e.g. for f(u) = uᵀ:
  (∂f(u)/∂u) (∂F/∂f(u)) = (∂F/∂f(u))ᵀ
• Functions can be nullary, unary, binary, … n-ary. Often they are unary or binary.
• Computation graphs are directed and acyclic (in DyNet).
• The nodes in this graph compute f(M, v) = Mv, f(U, V) = UV, f(u) = uᵀ, f(u, v) = u · v, and f(x1, x2, x3) = Σ_i x_i; the quadratic term can also be treated as a single node f(x, A) = xᵀ A x with
  ∂f(x, A)/∂x = (Aᵀ + A) x    and    ∂f(x, A)/∂A = x xᵀ
• Variable names (e.g. y) are just labelings of nodes.
Algorithms (1)

• Graph construction
• Forward propagation
  • In topological order, compute the value of the node given its inputs
Forward Propagation
[Figure: the graph for y = xᵀ A x + b · x + c is evaluated in topological order — first xᵀ, then xᵀ A and b · x, then xᵀ A x, and finally xᵀ A x + b · x + c.]
Algorithms (2)
• Back-propagation:
  • Process examples in reverse topological order
  • Calculate the derivatives of the parameters with
respect to the final value
(This is usually a “loss function”, a value we want
to minimize)
• Parameter update:
  • Move the parameters in the direction of this
derivative:  W -= α * dl/dW
Back Propagation
[Figure: derivatives flow backward through the same graph for y = xᵀ A x + b · x + c, from the output down to the inputs x, A, b, c.]
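
As an illustration, forward and back propagation for this same expression can be written in a few lines of PyTorch (a sketch, not part of the course code); the computed gradient matches the analytical (Aᵀ + A)x + b from the earlier slide:

import torch

# inputs of the expression y = xᵀ A x + b · x + c
x = torch.randn(3, requires_grad=True)
A = torch.randn(3, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
c = torch.randn((), requires_grad=True)

y = x @ A @ x + b @ x + c     # forward propagation (graph built on the fly)
y.backward()                  # back propagation: fills .grad on every input

print(torch.allclose(x.grad, (A.T + A) @ x + b))   # True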
Concrete Implementation
Examples
Neural Network Frameworks

[Two frameworks compared (logos shown in the original slide):]

Developed by FAIR/Meta     | Developed by Google
Most widely used in NLP    | Used in some NLP projects
Favors dynamic execution   | Favors definition+compilation
More flexibility           | Conceptually simple parallelization
Most vibrant ecosystem     |
Basic Process in Neural
Network Frameworks
• Create a model
• For each example:
  • create a graph that represents the computation you want
  • calculate the result of that computation
  • if training, perform back propagation and update
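
A minimal PyTorch sketch of this process (model, data, and the loss choice are placeholders, not the course code):

import torch

def train(model, data, n_epochs=10, lr=0.1):
    """Generic loop: forward computation, loss, back propagation, update."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(n_epochs):
        for x, y in data:             # x: batch of inputs, y: gold labels
            scores = model(x)         # create the graph / run the computation
            loss = loss_fn(scores, y) # calculate the result
            optimizer.zero_grad()
            loss.backward()           # back propagation
            optimizer.step()          # parameter update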
Bag of Words (BOW)
[Figure: per-word weight vectors for "I hate this movie" are looked up and summed with a bias to give scores; softmax gives probabilities.]

https://github.com/neubig/anlp-code/tree/main/02-textclass
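
A hedged PyTorch sketch of such a BOW classifier (class and variable names are mine, not necessarily those in the linked code):

import torch
import torch.nn as nn

class BoW(nn.Module):
    """Sum per-word weight vectors (one weight per label) plus a bias."""
    def __init__(self, vocab_size, num_labels):
        super().__init__()
        self.word_scores = nn.Embedding(vocab_size, num_labels)
        self.bias = nn.Parameter(torch.zeros(num_labels))

    def forward(self, word_ids):                    # word_ids: 1-D LongTensor
        return self.word_scores(word_ids).sum(dim=0) + self.bias

model = BoW(vocab_size=10000, num_labels=2)
scores = model(torch.tensor([4, 25, 3, 72]))        # e.g. "I hate this movie"
probs = torch.softmax(scores, dim=-1)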
Continuous Bag of Words
(CBOW)
[Figure: dense embeddings are looked up and summed; a weight matrix W and bias map the sum to scores.]

https://github.com/neubig/anlp-code/tree/main/02-textclass
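
A corresponding CBOW sketch (again hypothetical names: dense embeddings followed by a single linear layer):

import torch
import torch.nn as nn

class CBoW(nn.Module):
    """Sum dense word embeddings, then a linear layer (W, bias) gives scores."""
    def __init__(self, vocab_size, emb_size, num_labels):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        self.linear = nn.Linear(emb_size, num_labels)

    def forward(self, word_ids):
        h = self.embed(word_ids).sum(dim=0)          # sum of embeddings
        return self.linear(h)                        # W h + bias = scores

model = CBoW(vocab_size=10000, emb_size=64, num_labels=2)
scores = model(torch.tensor([4, 25, 3, 72]))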
Deep CBOW
[Figure: summed embeddings pass through tanh(W1*h + b1) and tanh(W2*h + b2), then a final linear layer with bias produces scores.]

https://github.com/neubig/anlp-code/tree/main/02-textclass
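
And a Deep CBOW sketch adding the two tanh layers from the figure (layer sizes are arbitrary):

import torch
import torch.nn as nn

class DeepCBoW(nn.Module):
    """Sum embeddings, two tanh hidden layers, then a linear output layer."""
    def __init__(self, vocab_size, emb_size, hid_size, num_labels):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        self.hidden = nn.Sequential(
            nn.Linear(emb_size, hid_size), nn.Tanh(),    # tanh(W1 h + b1)
            nn.Linear(hid_size, hid_size), nn.Tanh(),    # tanh(W2 h + b2)
        )
        self.output = nn.Linear(hid_size, num_labels)    # W h + bias = scores

    def forward(self, word_ids):
        h = self.embed(word_ids).sum(dim=0)
        return self.output(self.hidden(h))

model = DeepCBoW(vocab_size=10000, emb_size=64, hid_size=64, num_labels=2)
scores = model(torch.tensor([4, 25, 3, 72]))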
A Few More Important
Concepts
A Better Optimizer: Adam
• Most standard optimization option in NLP and beyond
• Considers rolling average of gradient, and momentum
  m_t = β1 · m_{t−1} + (1 − β1) · g_t          (momentum)
  v_t = β2 · v_{t−1} + (1 − β2) · g_t ⊙ g_t    (rolling average of the squared gradient)

• Correction of bias early in training

  m̂_t = m_t / (1 − β1^t)        v̂_t = v_t / (1 − β2^t)

• Final update

  θ_t = θ_{t−1} − η · m̂_t / (√v̂_t + ε)
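
A minimal numpy sketch of one Adam step following these equations (in practice one would just use the optimizer provided by the framework, e.g. torch.optim.Adam):

import numpy as np

def adam_step(theta, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at (1-based) step t for parameters theta and gradient g."""
    m = beta1 * m + (1 - beta1) * g          # momentum
    v = beta2 * v + (1 - beta2) * g * g      # rolling average of squared gradient
    m_hat = m / (1 - beta1 ** t)             # bias correction early in training
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v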
Visualization of Embeddings
• Reduce high-dimensional embeddings into 2/3D
for visualization (e.g. Mikolov et al. 2013)
Non-linear Projection
• Non-linear projections group things that are close in high-
dimensional space
• e.g. SNE/t-SNE (van der Maaten and Hinton 2008) group things
that give each other a high probability according to a Gaussian
[Figure: the same embeddings projected with PCA vs. t-SNE; t-SNE groups similar points more tightly. (Image credit: Derksen 2016)]
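
For example, a t-SNE projection with scikit-learn (a sketch with random vectors standing in for real embeddings):

import numpy as np
from sklearn.manifold import TSNE

embeddings = np.random.randn(1000, 512)        # rows = word vectors
coords = TSNE(n_components=2, perplexity=30, init='pca').fit_transform(embeddings)
print(coords.shape)                            # (1000, 2), ready to scatter-plot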


t-SNE Visualization can be
Misleading! (Wattenberg et al. 2016)
• Settings matter

• Linear correlations cannot be interpreted


Any Questions?
(sequence models in next class)
