
CS11-711 Advanced NLP

Word Representation and Text Classifiers
Graham Neubig

https://phontron.com/class/anlp-fall2024/
Reminder:
Bag of Words (BOW)
[Figure: "I hate this movie" — a feature vector is looked up for each word, the vectors are summed, and a dot product with the weights produces a score.]

Features f are based on word identity, weights w learned


Which problems mentioned before would this solve?
What’s Missing in BOW?
• Handling of conjugated or compound words → Subword Models
  • I love this movie -> I loved this movie
• Handling of word similarity → Word Embeddings
  • I love this movie -> I adore this movie
• Handling of combination features → Neural Networks
  • I love this movie -> I don’t love this movie
  • I hate this movie -> I don’t hate this movie
• Handling of sentence structure → Sequence Models
  • It has an interesting story, but is boring overall
Subword Models
Basic Idea
• Split less common words into multiple subword tokens
the companies are expanding

the compan _ies are expand _ing

• Benefits:
  • Share parameters between word variants, compound words
  • Reduce parameter size, save compute+memory
Byte Pair Encoding
(Sennrich+ 2015)
• Incrementally combine the most frequent token pairs
{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

pairs = get_stats(vocab)

[(('e', 's'), 9), (('s', 't'), 9), (('t', '</w>'), 9), (('w', 'e'), 8), (('l', 'o'), 7), …]

vocab = merge_vocab(pairs[0], vocab)

{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w es t </w>': 6, 'w i d es t </w>': 3}

pairs = get_stats(vocab)
[(('es', 't'), 9), (('t', '</w>'), 9), (('l', 'o'), 7), (('o', 'w'), 7), (('n', 'e'), 6)]

vocab = merge_vocab(pairs[0], vocab)


{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w est </w>': 6, 'w i d est </w>': 3}

Example code:
https://github.com/neubig/anlp-code/tree/main/02-subwords
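
As a rough sketch (not necessarily identical to the linked example code), get_stats and merge_vocab could look like the following; get_stats is assumed here to return pairs sorted by frequency so that pairs[0] is the most frequent:

import re
from collections import defaultdict

def get_stats(vocab):
    """Count adjacent symbol pairs in the vocabulary, most frequent first."""
    counts = defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return sorted(counts.items(), key=lambda kv: -kv[1])

def merge_vocab(pair_with_count, vocab):
    """Merge the given symbol pair into a single symbol everywhere it occurs."""
    (a, b), _ = pair_with_count
    pattern = re.compile(r'(?<!\S)' + re.escape(a + ' ' + b) + r'(?!\S)')
    return {pattern.sub(a + b, word): freq for word, freq in vocab.items()}

vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(2):
    pairs = get_stats(vocab)
    vocab = merge_vocab(pairs[0], vocab)
# -> {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w est </w>': 6, 'w i d est </w>': 3}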
Unigram Models
(Kudo 2018)
• Use a unigram LM that generates all words in the
sequence independently (more next lecture)
• Pick a vocabulary that maximizes the log likelihood
of the corpus given a fixed vocabulary size
• Optimization performed using the EM algorithm
(details not important for most people)
• Find the segmentation of the input that maximizes
unigram probability
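
To make the last point concrete, here is a toy sketch (with made-up subword log-probabilities, not the actual Kudo 2018 training procedure) of finding the segmentation that maximizes unigram probability via dynamic programming:

import math

# hypothetical unigram log-probabilities over subword pieces
logp = {'h': -5.0, 'u': -5.0, 'g': -5.0, 'hu': -4.5, 'ug': -3.0, 'hug': -2.0}

def best_segmentation(word):
    """Viterbi search over all segmentations of `word` into known pieces."""
    best = [0.0] + [-math.inf] * len(word)   # best[i]: best score of word[:i]
    back = [0] * (len(word) + 1)             # backpointer to the previous split
    for i in range(1, len(word) + 1):
        for j in range(i):
            piece = word[j:i]
            if piece in logp and best[j] + logp[piece] > best[i]:
                best[i], back[i] = best[j] + logp[piece], j
    pieces, i = [], len(word)
    while i > 0:
        pieces.append(word[back[i]:i])
        i = back[i]
    return pieces[::-1]

print(best_segmentation('hug'))   # ['hug'] beats ['h', 'ug'] and ['hu', 'g']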
SentencePiece
• A highly optimized library that makes it possible to
train and use BPE and Unigram models

% spm_train --input=<input> \
--model_prefix=<model_name>
--vocab_size=8000 --character_coverage=1.0
--model_type=<type>

% spm_encode --model=<model_file>
--output_format=piece < input > output

• Python bindings also available

https://github.com/google/sentencepiece
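
A rough equivalent using the Python bindings (a sketch assuming a recent version of the sentencepiece package and an input file corpus.txt):

import sentencepiece as spm

# Train a model; the keyword arguments mirror the spm_train flags above.
spm.SentencePieceTrainer.train(
    input='corpus.txt', model_prefix='m',
    vocab_size=8000, character_coverage=1.0, model_type='unigram')

# Load the model and segment text into subword pieces.
sp = spm.SentencePieceProcessor(model_file='m.model')
print(sp.encode('the companies are expanding', out_type=str))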
Subword Considerations
• Multilinguality: Subword models are hard to use
multilingually because, trained naively, they will
over-segment less common languages (Ács 2019)
• Work-around: Upsample less represented
languages
• Arbitrariness: Do we do “es t” or “e st”?
• Work-around: “Subword regularization" samples
different segmentations at training time to make
models robust (Kudo 2018)
Continuous Word
Embeddings
Basic Idea
• Previously we represented words with a sparse vector
with a single “1” — a one-hot vector
• Continuous word embeddings look up a dense vector

[Figure: "I hate this movie" under one-hot representations vs. dense representations — each word is looked up, yielding either a sparse one-hot vector or a dense embedding.]
Continuous Bag of Words
(CBOW)
[Figure: dense embeddings for "I hate this movie" are looked up and summed; multiplying by a weight matrix W and adding a bias gives the scores.]
What do Our Vectors Represent?
• No guarantees, but we hope that:
  • Words that are similar (syntactically, semantically,
same language, etc.) are close in vector space
  • Each vector element is a feature (e.g. is this an
animate object? is this a positive word?, etc.)

great
excellent
angel
sun
nice
Shown in 2D, but
cat basket in reality we use
dog 512, 1024, etc.
bad disease
monster
A Note: “Lookup”
• Lookup can be viewed as “grabbing” a single
vector from a big matrix of word embeddings
(vector size × num. words), e.g. lookup(2) takes column 2
• Similarly, can be viewed as multiplying the
embedding matrix by a “one-hot” vector
• The former tends to be faster
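
A small numpy sketch of the equivalence (sizes are arbitrary):

import numpy as np

vocab_size, emb_size = 5, 3
E = np.random.randn(emb_size, vocab_size)   # embedding matrix (vector size x num. words)

v_lookup = E[:, 2]                           # "grab" column 2 directly

one_hot = np.zeros(vocab_size)               # multiply by a one-hot vector
one_hot[2] = 1.0
v_matmul = E @ one_hot

assert np.allclose(v_lookup, v_matmul)       # same result; the lookup is faster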
Training a More Complex
Model
Reminder: Simple Training of BOW Models

• Use an algorithm called “structured perceptron”

feature_weights = {}
for x, y in data:
    # Make a prediction
    features = extract_features(x)
    predicted_y = run_classifier(features)
    # Update the weights if the prediction is wrong
    if predicted_y != y:
        for feature in features:
            feature_weights[feature] = (
                feature_weights.get(feature, 0) +
                y * features[feature]
            )

Full Example:
https://github.com/neubig/anlp-code/tree/main/01-simpleclassifier
How do we Train More
Complex Models?
• We use gradient descent
• Write down a loss function
• Calculate derivatives of the loss function w.r.t. the parameters
• Move the parameters in the direction that reduces the loss function
Loss Function
• A value that gets lower as the model gets better
• Examples from binary classification using score s(x)

Hinge Loss:
  ℓ = max(0, −y · s)

Sigmoid + Negative Log Likelihood:
  σ(y · s) = 1 / (1 + e^(−y·s)),   ℓ = −log σ(y · s)

[Figure: both losses plotted against the score for y = 1 and y = −1.]

Hinge loss is more closely linked to accuracy; sigmoid + NLL has a probabilistic interpretation and non-zero gradients everywhere.
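
As a small numerical sketch of the two losses (numpy; the scores are chosen arbitrarily):

import numpy as np

def hinge_loss(s, y):
    """Hinge loss for label y in {-1, +1} and score s."""
    return np.maximum(0.0, -y * s)

def sigmoid_nll_loss(s, y):
    """-log sigma(y * s), written stably as log(1 + exp(-y * s))."""
    return np.log1p(np.exp(-y * s))

for s in [-2.0, 0.0, 2.0]:
    print(s, hinge_loss(s, y=1), sigmoid_nll_loss(s, y=1))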
Calculating Derivatives
• Calculate the derivative of the loss function with respect to each parameter

• Example from BOW model + hinge loss


  ∂/∂w_i  max(0, −y · Σ_{j=1}^{|V|} w_j · freq(v_j, x))
    = −y · freq(v_i, x)   if −y · Σ_{j=1}^{|V|} w_j · freq(v_j, x) > 0
    = 0                   otherwise
Optimizing Parameters
• Standard stochastic gradient descent does

  g_t = ∇_{θ_{t−1}} ℓ(θ_{t−1})      (gradient of the loss)
  θ_t = θ_{t−1} − η g_t             (η: learning rate)

• There are many other optimization options! (see
Ruder 2016 in references)
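
A minimal sketch of this update rule on a toy one-dimensional problem (the loss and its gradient here are made up purely for illustration):

import numpy as np

def sgd_step(theta, grad, eta=0.1):
    """theta_t = theta_{t-1} - eta * g_t"""
    return theta - eta * grad

# toy loss l(theta) = (theta - 3)^2, with gradient 2 * (theta - 3)
theta = np.array([0.0])
for _ in range(100):
    g = 2 * (theta - 3.0)
    theta = sgd_step(theta, g)
print(theta)   # close to [3.0], the minimizer of the loss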
What is this Algorithm?
feature_weights = {}
for x, y in data:
    # Make a prediction
    features = extract_features(x)
    predicted_y = run_classifier(features)
    # Update the weights if the prediction is wrong
    if predicted_y != y:
        for feature in features:
            feature_weights[feature] = (
                feature_weights.get(feature, 0) +
                y * features[feature]
            )

• Loss function: Hinge Loss

• Optimizer: SGD w/ learning rate 1


Combination Features
[Figure: "I don’t love this movie" and "There’s nothing I don’t love about this movie", each to be classified on a scale from very good to very bad — examples where the label depends on combinations of words, not individual word identities.]
Basic Idea of Neural Networks
(for NLP Prediction Tasks)
[Figure: embeddings for "I hate this movie" are looked up and fed through some complicated function that extracts combination features (a neural net), producing scores and then probabilities via softmax.]
Deep CBOW
[Figure: the summed word embeddings are passed through two hidden layers, tanh(W1*h + b1) and tanh(W2*h + b2), then a final linear layer W·h + bias produces the scores.]
What do Our Vectors
Represent?

• Now things are more interesting!
• We can learn feature combinations (a node in the
second layer might be “feature 1 AND feature 5 are
active”)
  • e.g. capture things such as “not” AND “hate”

What is a Neural Net?:
Computation Graphs
“Neural” Nets
Original Motivation: Neurons in the Brain

Current Conception: Computation Graphs


[Figure: a neuron (image credit: Wikipedia) next to a computation graph over inputs x, A, b, c, with nodes computing f(x1, x2, x3) = Σ_i x_i, f(M, v) = Mv, f(U, V) = UV, f(u) = uᵀ, and f(u, v) = u · v.]


expression:
  y = xᵀ A x + b · x + c

graph (built up node by node in the original slides):

• A node is a {tensor, matrix, vector, scalar} value
• An edge represents a function argument (and also a data dependency). They are just pointers to nodes.
• A node with an incoming edge is a function of that edge’s tail node.
• A node knows how to compute its value and the value of its derivative w.r.t. each argument (edge) times a derivative of an arbitrary input ∂F/∂f(u), e.g. for f(u) = uᵀ:
  (∂f(u)/∂u) (∂F/∂f(u)) = (∂F/∂f(u))ᵀ
• Functions can be nullary, unary, binary, … n-ary. Often they are unary or binary.
• Computation graphs are directed and acyclic (in DyNet).
• The nodes in this graph compute f(M, v) = Mv, f(U, V) = UV, f(u) = uᵀ, f(u, v) = u · v, and f(x1, x2, x3) = Σ_i x_i; the quadratic term can also be treated as a single node f(x, A) = xᵀ A x with
  ∂f(x, A)/∂x = (Aᵀ + A) x    and    ∂f(x, A)/∂A = x xᵀ
• Variable names (e.g. y) are just labelings of nodes.
Algorithms (1)

• Graph construction
• Forward propagation
  • In topological order, compute the value of the node given its inputs
Forward Propagation
[Figure: the graph for y = xᵀ A x + b · x + c is evaluated in topological order — first xᵀ, then xᵀ A and b · x, then xᵀ A x, and finally xᵀ A x + b · x + c.]
Algorithms (2)
• Back-propagation:
  • Process examples in reverse topological order
  • Calculate the derivatives of the parameters with
respect to the final value
(This is usually a “loss function”, a value we want
to minimize)
• Parameter update:
  • Move the parameters in the direction of this
derivative:  W -= α * dl/dW
Back Propagation
[Figure: derivatives flow backward through the same graph for y = xᵀ A x + b · x + c, from the output down to the inputs x, A, b, c.]
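
As an illustration, forward and back propagation for this same expression can be written in a few lines of PyTorch (a sketch, not part of the course code); the computed gradient matches the analytical (Aᵀ + A)x + b from the earlier slide:

import torch

# inputs of the expression y = xᵀ A x + b · x + c
x = torch.randn(3, requires_grad=True)
A = torch.randn(3, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
c = torch.randn((), requires_grad=True)

y = x @ A @ x + b @ x + c     # forward propagation (graph built on the fly)
y.backward()                  # back propagation: fills .grad on every input

print(torch.allclose(x.grad, (A.T + A) @ x + b))   # True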
Concrete Implementation
Examples
Neural Network Frameworks

[Two frameworks compared (logos shown in the original slide):]

Developed by FAIR/Meta     | Developed by Google
Most widely used in NLP    | Used in some NLP projects
Favors dynamic execution   | Favors definition+compilation
More flexibility           | Conceptually simple parallelization
Most vibrant ecosystem     |
Basic Process in Neural
Network Frameworks
• Create a model
• For each example:
  • create a graph that represents the computation you want
  • calculate the result of that computation
  • if training, perform back propagation and update
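
A minimal PyTorch sketch of this process (model, data, and the loss choice are placeholders, not the course code):

import torch

def train(model, data, n_epochs=10, lr=0.1):
    """Generic loop: forward computation, loss, back propagation, update."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(n_epochs):
        for x, y in data:             # x: batch of inputs, y: gold labels
            scores = model(x)         # create the graph / run the computation
            loss = loss_fn(scores, y) # calculate the result
            optimizer.zero_grad()
            loss.backward()           # back propagation
            optimizer.step()          # parameter update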
Bag of Words (BOW)
[Figure: per-word weight vectors for "I hate this movie" are looked up and summed with a bias to give scores; softmax gives probabilities.]

https://github.com/neubig/anlp-code/tree/main/02-textclass
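
A hedged PyTorch sketch of such a BOW classifier (class and variable names are mine, not necessarily those in the linked code):

import torch
import torch.nn as nn

class BoW(nn.Module):
    """Sum per-word weight vectors (one weight per label) plus a bias."""
    def __init__(self, vocab_size, num_labels):
        super().__init__()
        self.word_scores = nn.Embedding(vocab_size, num_labels)
        self.bias = nn.Parameter(torch.zeros(num_labels))

    def forward(self, word_ids):                    # word_ids: 1-D LongTensor
        return self.word_scores(word_ids).sum(dim=0) + self.bias

model = BoW(vocab_size=10000, num_labels=2)
scores = model(torch.tensor([4, 25, 3, 72]))        # e.g. "I hate this movie"
probs = torch.softmax(scores, dim=-1)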
Continuous Bag of Words
(CBOW)
[Figure: dense embeddings are looked up and summed; a weight matrix W and bias map the sum to scores.]

https://github.com/neubig/anlp-code/tree/main/02-textclass
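
A corresponding CBOW sketch (again hypothetical names: dense embeddings followed by a single linear layer):

import torch
import torch.nn as nn

class CBoW(nn.Module):
    """Sum dense word embeddings, then a linear layer (W, bias) gives scores."""
    def __init__(self, vocab_size, emb_size, num_labels):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        self.linear = nn.Linear(emb_size, num_labels)

    def forward(self, word_ids):
        h = self.embed(word_ids).sum(dim=0)          # sum of embeddings
        return self.linear(h)                        # W h + bias = scores

model = CBoW(vocab_size=10000, emb_size=64, num_labels=2)
scores = model(torch.tensor([4, 25, 3, 72]))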
Deep CBOW
[Figure: summed embeddings pass through tanh(W1*h + b1) and tanh(W2*h + b2), then a final linear layer with bias produces scores.]

https://github.com/neubig/anlp-code/tree/main/02-textclass
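
And a Deep CBOW sketch adding the two tanh layers from the figure (layer sizes are arbitrary):

import torch
import torch.nn as nn

class DeepCBoW(nn.Module):
    """Sum embeddings, two tanh hidden layers, then a linear output layer."""
    def __init__(self, vocab_size, emb_size, hid_size, num_labels):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        self.hidden = nn.Sequential(
            nn.Linear(emb_size, hid_size), nn.Tanh(),    # tanh(W1 h + b1)
            nn.Linear(hid_size, hid_size), nn.Tanh(),    # tanh(W2 h + b2)
        )
        self.output = nn.Linear(hid_size, num_labels)    # W h + bias = scores

    def forward(self, word_ids):
        h = self.embed(word_ids).sum(dim=0)
        return self.output(self.hidden(h))

model = DeepCBoW(vocab_size=10000, emb_size=64, hid_size=64, num_labels=2)
scores = model(torch.tensor([4, 25, 3, 72]))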
A Few More Important
Concepts
A Better Optimizer: Adam
• Most standard optimization option in NLP and beyond
• Considers rolling average of gradient, and momentum
  m_t = β1 · m_{t−1} + (1 − β1) · g_t          (momentum)
  v_t = β2 · v_{t−1} + (1 − β2) · g_t ⊙ g_t    (rolling average of the squared gradient)

• Correction of bias early in training

  m̂_t = m_t / (1 − β1^t)        v̂_t = v_t / (1 − β2^t)

• Final update

  θ_t = θ_{t−1} − η · m̂_t / (√v̂_t + ε)
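
A minimal numpy sketch of one Adam step following these equations (in practice one would just use the optimizer provided by the framework, e.g. torch.optim.Adam):

import numpy as np

def adam_step(theta, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at (1-based) step t for parameters theta and gradient g."""
    m = beta1 * m + (1 - beta1) * g          # momentum
    v = beta2 * v + (1 - beta2) * g * g      # rolling average of squared gradient
    m_hat = m / (1 - beta1 ** t)             # bias correction early in training
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v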
Visualization of Embeddings
• Reduce high-dimensional embeddings into 2/3D
for visualization (e.g. Mikolov et al. 2013)
Non-linear Projection
• Non-linear projections group things that are close in high-
dimensional space
• e.g. SNE/t-SNE (van der Maaten and Hinton 2008) group things
that give each other a high probability according to a Gaussian
[Figure: the same embeddings projected with PCA vs. t-SNE; t-SNE groups similar points more tightly. (Image credit: Derksen 2016)]
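
For example, a t-SNE projection with scikit-learn (a sketch with random vectors standing in for real embeddings):

import numpy as np
from sklearn.manifold import TSNE

embeddings = np.random.randn(1000, 512)        # rows = word vectors
coords = TSNE(n_components=2, perplexity=30, init='pca').fit_transform(embeddings)
print(coords.shape)                            # (1000, 2), ready to scatter-plot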


t-SNE Visualization can be
Misleading! (Wattenberg et al. 2016)
• Settings matter

• Linear correlations cannot be interpreted


Any Questions?
(sequence models in next class)
