
Word Embeddings II

CSC484 IR

Some of the slides are adapted from Dr. Yates et al. (Amsterdam, Waterloo), Dr. Mitra et al. (Microsoft, UCL), and Dr. Bamman (Berkeley)
Example GloVe vectors (50 dimensions shown as dim 1 ... dim 50), from https://nlp.stanford.edu/projects/glove/

word         dim 1      dim 2      dim 3      dim 4     …   dim 50
the          0.418      0.24968    -0.41242   0.1217    …   -0.17862
,            0.013441   0.23682    -0.16899   0.40951   …   -0.55641
.            0.15164    0.30177    -0.16763   0.17684   …   -0.31086
of           0.70853    0.57088    -0.4716    0.18048   …   -0.52393
to           0.68047    -0.039263  0.30186    -0.17792  …   0.13228
…            …          …          …          …         …   …
chanty       0.23204    0.025672   -0.70699   -0.04547  …   0.34108
kronik       -0.60921   -0.67218   0.23521    -0.11195  …   0.85632
rolonda      -0.51181   0.058706   1.0913     -0.55163  …   0.079711
zsombor      -0.75898   -0.47426   0.4737     0.7725    …   0.84014
sandberger   0.072617   -0.51393   0.4728     -0.52202  …   0.23096
[Figure: 2D embedding space in which "dog", "cat", and "puppy" cluster together while "wrench" and "screwdriver" cluster elsewhere.]
Word embedding importance
[Chart: fraction of papers per year mentioning "word embedding", 2001-2019, on a y-axis from 0 to 0.7. Data from ACL papers in the ACL Anthology (https://www.aclweb.org/anthology/)]

emoji2vec

Eisner et al. (2016), “emoji2vec: Learning Emoji Representations from their Description”
Word2vec in IR
• In the previous lecture we learned about word2vec.
• How can we use w2v in IR?
• The possibilities are limitless!
Retrieval using vector representations
• Generate a vector representation of the query
• Generate a vector representation of the document
• Estimate relevance from the query and document vectors (see the sketch below)
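As a minimal sketch of this pipeline, assuming a hypothetical embed() function that maps a text to a fixed-size vector (e.g., an average of pre-trained word vectors), documents could be ranked by cosine similarity against the query vector:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_documents(query_vec, doc_vecs):
    # Score every document vector against the query vector and return
    # document indices sorted from most to least similar.
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)

# Hypothetical usage: embed() is assumed to return a NumPy vector for a text.
# query_vec = embed("brown bears in yosemite")
# doc_vecs  = [embed(d) for d in documents]
# ranking   = rank_documents(query_vec, doc_vecs)
```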
Popular approaches to incorporating embeddings for matching/search/ranking
• Compare the query and document directly in the embedding space to estimate relevance, e.g., re-rank the results of a first-stage ranker such as BM25 using word embeddings.
• Use embeddings to generate suitable query expansions, then estimate relevance with a traditional ranker (a query-expansion sketch follows this list).
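As one hedged illustration of the query-expansion approach, pre-trained word vectors can suggest expansion terms via nearest neighbors. The sketch below uses gensim's KeyedVectors; the vector file path is a placeholder, and the expanded terms would then be passed to a standard ranker such as BM25.

```python
from gensim.models import KeyedVectors

# Placeholder path: any file of pre-trained vectors in word2vec text format.
kv = KeyedVectors.load_word2vec_format("vectors.txt", binary=False)

def expand_query(query_terms, per_term=3):
    # For each query term, add its nearest neighbors in the embedding space.
    expanded = list(query_terms)
    for term in query_terms:
        if term in kv:
            expanded += [w for w, _ in kv.most_similar(term, topn=per_term)]
    return expanded

# e.g., expand_query(["car", "insurance"]) might add terms like "vehicle"
# or "automobile", which are then fed to BM25 along with the original terms.
```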
Document embeddings
• Wait, we only have word vectors?
• How do we get document embeddings?
• There are many techniques to generate document embeddings.
• One simple way is to combine the word vectors contained in the document (by sum, average, or weighted sum, as illustrated below).
[Figures: the word vectors for "I loved the movie !" are combined into a single document vector by (a) element-wise sum, (b) element-wise average, and (c) weighted sum.]
Iyyer et al. (2015), "Deep Unordered Composition Rivals Syntactic Methods for Text Classification" (ACL)
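A minimal sketch of these three composition strategies, assuming each word already has a pre-trained vector (the lookup table and weights below are hypothetical, for illustration only):

```python
import numpy as np

# Hypothetical lookup: word -> pre-trained vector (random here, for illustration).
rng = np.random.default_rng(0)
vectors = {w: rng.standard_normal(5) for w in ["i", "loved", "the", "movie", "!"]}

tokens = ["i", "loved", "the", "movie", "!"]
word_vecs = np.stack([vectors[t] for t in tokens])

doc_sum = word_vecs.sum(axis=0)    # element-wise sum
doc_avg = word_vecs.mean(axis=0)   # element-wise average

# Weighted sum, e.g., with IDF-style weights (made-up values here).
weights = np.array([0.1, 0.9, 0.1, 0.8, 0.2])
doc_weighted = (weights[:, None] * word_vecs).sum(axis=0)
```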

OOV in word embeddings
• How do we tackle OOV (out-of-vocabulary) words in word2vec?
• What is the embedding/vector of a word we have never seen before?
• Unfortunately, there's no easy way to tackle this problem in w2v.
• Pre-trained word embeddings work great for words that appear frequently in the data.
• Unseen words are treated as UNKs (unknown) and assigned zero or random vectors; everything unseen gets the same representation.
Shared structure
Even in languages like English that aren't highly inflected, words share important structure: friend, friended, friendless, friendly, friendship, unfriend, unfriendly.
Even if we never see the word "unfriendly" in our data, we should be able to reason about it as: un + friend + ly.
In Arabic, this is even more important.
Subword embedding models: FastText
• FastText: a word embedding technique from Facebook (2017)
• Aims to fix the OOV (out-of-vocabulary) problem with word embeddings by utilizing subwords/character n-grams
• Subword models need less data to achieve comparable performance.
• Can produce vectors for any word, even one we have never seen before!
• http://www.fasttext.cc/
FastText represents "where" (wrapped in the boundary markers "<" and ">") as the sum of its character n-gram embeddings plus the whole-word embedding, where e(*) denotes the embedding of *:

e(where) =   e(<wh) + e(whe) + e(her) + e(ere) + e(re>)    (3-grams)
           + e(<whe) + e(wher) + e(here) + e(ere>)         (4-grams)
           + e(<wher) + e(where) + e(here>)                (5-grams)
           + e(<where) + e(where>)                         (6-grams)
           + e(<where>)                                    (whole word)
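A minimal sketch of how these character n-grams can be enumerated, and how an OOV word's vector could be assembled from a (hypothetical) table of n-gram embeddings:

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    # FastText-style n-grams over the word wrapped in boundary markers,
    # plus the whole bracketed word itself.
    w = f"<{word}>"
    grams = [w[i:i + n] for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1)]
    grams.append(w)
    return grams

# char_ngrams("where") -> ['<wh', 'whe', 'her', 'ere', 're>', '<whe', ...,
#                          '<where', 'where>', '<where>']

def oov_vector(word, ngram_vectors, dim=300):
    # ngram_vectors is a hypothetical dict mapping n-gram -> embedding.
    # Sum the embeddings of whichever n-grams of the word are known.
    vec = np.zeros(dim)
    for g in char_ngrams(word):
        if g in ngram_vectors:
            vec += ngram_vectors[g]
    return vec
```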

Both w2v and fastText don't capture context
• Models for learning static embeddings (w2v, fastText, etc.) learn a single representation for a word type.
Types and tokens
• Type: "bears"
• Tokens (with a static embedding, every occurrence gets the same vector, e.g., [3.1, 1.4, -2.7, 0.3]):
• The bears ate the honey
• We spotted the bears from the highway
• Yosemite has brown bears
• The Chicago Bears didn't make the playoffs
[Figure: 2D embedding space where "bears" sits between an animal cluster (elk, moose) and a football cluster (football, 49ers, packers).]

Types and tokens
• Type: الهلال
• Tokens (again, every occurrence gets the same static vector, e.g., [3.1, 1.4, -2.7, 0.3]):
• فاز الهلال على النصر ("Al-Hilal beat Al-Nassr")
• رأيت الهلال المضيء ("I saw the bright crescent")
• محمد الهلال أحد طلاب مادة عال١١١ ("Mohammed Al-Hilal is one of the students in the CS 111 course")
• الهلال يذكرني برمضان ("The crescent reminds me of Ramadan")
[Figure: 2D embedding space where الهلال sits between a moon cluster (القمر "the moon", البدر "the full moon") and a football cluster (كرة القدم "football", النصر "Al-Nassr", الاتحاد "Al-Ittihad").]
Contextualized embeddings
• Big idea: transform the representation of a token (e.g., starting from a static word embedding) so that it is sensitive to its local context in the sentence.
BERT
• BERT: Bidirectional Encoder Representations from Transformers
• Bidirectional: reads the text in both directions.
• Transformer: a deep learning model/architecture.
• Most state-of-the-art AI models are Transformer-based: BERT, GPT, GPT-2, GPT-3, ChatGPT, LaMDA.
• BERT is a Transformer-based model that can give contextual word embeddings (see the sketch below).
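A minimal sketch of extracting contextual token embeddings with the Hugging Face transformers library; the model name and the use of the last hidden layer are just one common setup, not a prescribed recipe from these slides.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def token_embeddings(sentence):
    # Returns the wordpiece tokens and one contextual vector per token.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return inputs.tokens(), outputs.last_hidden_state[0]  # (num_tokens, 768)

# The same surface form gets different vectors in different contexts:
tok1, emb1 = token_embeddings("Yosemite has brown bears")
tok2, emb2 = token_embeddings("Da bears lost again!")
```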
[Examples: with contextual embeddings, "bears" in "Yosemite has brown bears" ends up close to "moose" in "We saw a moose in Alaska", while "bears" in "Da bears lost again!" ends up close to "pack" in "Go pack go!". Likewise, الهلال in رأيت الهلال المضيء ("I saw the bright crescent") ends up close to القمر in حير القمر الشعراء ("The moon has puzzled the poets"), while الهلال in فاز الهلال على النصر ("Al-Hilal beat Al-Nassr") ends up close to النصر in وقع رونالدو مع النصر ("Ronaldo signed with Al-Nassr").]

Adoption by Commercial Search Engines
• Google Search: "We're making a significant improvement to how we understand queries, representing the biggest leap forward in the past five years, and one of the biggest leaps forward in the history of Search." [source]
• MS Bing: "Starting from April of this year (2019), we used large transformer models to deliver the largest quality improvements to our Bing customers in the past year." [source]
One problem with using embeddings for IR
• Given a query q, we need to compare it with all documents so we can rank them by estimated relevance.
• If we have 100 million documents, that means 100 million comparisons!
Vector databases/libraries
• A vector database indexes and stores vector embeddings for fast retrieval and similarity search [pinecone].
• Vector databases excel at Nearest Neighbor Search (NNS).
• Nearest neighbor search (NNS): the problem of finding the point in a given set that is closest (or most similar) to a given point. [wikipedia]
• kNN (k-Nearest Neighbor) search: the problem of finding the top-k points in a given set that are closest (or most similar) to a given point.
Vector databases/libraries
• Recently, many libraries have been developed to make vector search fast and scalable (a small Faiss sketch follows this list).
• ScaNN (Scalable Nearest Neighbors) from Google [2020]
• Faiss from Facebook/Meta [2019]
• SPANN from Microsoft [2021]
• Annoy (Approximate Nearest Neighbors) from Spotify
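As a hedged sketch of how such a library is typically used, here is brute-force (exact) kNN with Faiss. The dimensionality, data, and k are made up for illustration; real deployments at scale would usually switch to one of Faiss's approximate index types.

```python
import numpy as np
import faiss

d = 128                                    # embedding dimensionality (illustrative)
rng = np.random.default_rng(0)
doc_vecs = rng.standard_normal((10_000, d)).astype("float32")   # document embeddings
query_vec = rng.standard_normal((1, d)).astype("float32")       # one query embedding

index = faiss.IndexFlatL2(d)               # exact (brute-force) L2 index
index.add(doc_vecs)                        # index the document vectors

k = 5
distances, doc_ids = index.search(query_vec, k)   # top-k nearest documents
print(doc_ids[0], distances[0])
```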
Conclusion
• Word embeddings are essential to modern NLP pipelines.
• Subword embeddings allow you to create embeddings for words not present in the training data, and they require much less data to train.
• Transformers can transform word embeddings to be sensitive to their use in context.
• Static word embeddings (word2vec, fastText) provide representations of word types; contextualized word representations (BERT) provide representations of tokens in context.
• Vector libraries allow us to perform efficient vector similarity search.
