14-Word Embeddings II
CSC484 IR
Some of these slides are adapted from Dr. Yates et al. (Amsterdam, Waterloo), Dr. Mitra et al. (Microsoft, UCL), and Dr. Bamman (Berkeley)
[Figure: GloVe pre-trained word vectors (dimensions 1–50 per word), https://nlp.stanford.edu/projects/glove/]
[Figure: 2D embedding space: "dog", "cat", and "puppy" cluster together, while "wrench" and "screwdriver" form a separate cluster]
Word embedding importance
[Figure: "Word embedding" in NLP papers, 2001–2019]
Eisner et al. (2016), “emoji2vec: Learning Emoji Representations from their Description”
Word2vec in IR
• In the previous lecture we learned about word2vec.
• How can we use w2v in IR?
• The possibilities are limitless!
Retrieval using vector representations
• Generate a vector representation of the query
• Generate a vector representation of the document
• Estimate relevance by comparing the two representations (e.g., BM25 does this with term-based representations); a minimal sketch follows
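A minimal sketch of this comparison step, assuming the query and document vectors already exist (the example vectors and the choice of cosine similarity are illustrative assumptions, not prescribed by the slides):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical query and document embeddings (however they were generated).
query_vec = np.array([0.7, -0.2, 1.1, 0.4])
doc_vec = np.array([0.6, -0.1, 0.9, 0.5])

score = cosine(query_vec, doc_vec)   # higher score, more likely relevant
```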
Document embeddings
• Wait, we only have word vectors?
• How do we get document embeddings?
• There are many techniques to generate document embeddings.
• One simple way is to combine the word vectors contained in the document (e.g., a sum, average, or weighted sum, as illustrated below).
[Figure: document vector computed as the sum of its word vectors]
[Figure: document vector computed as the average of its word vectors]
Iyyer et al. (2015), “Deep Unordered Composition Rivals Syntactic Methods for Text Classification” (ACL)
[Figure: document vector computed as a weighted sum of its word vectors]
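A minimal sketch of these three pooling strategies, assuming a small dictionary of pre-trained word vectors; the vectors and the IDF-style weights below are illustrative assumptions:

```python
import numpy as np

# Hypothetical pre-trained word vectors (e.g., word2vec or GloVe), 4 dims for brevity.
word_vectors = {
    "the":   np.array([0.1, 0.0, -0.1, 0.2]),
    "bears": np.array([3.1, 1.4, -2.7, 0.3]),
    "ate":   np.array([0.5, -0.3, 0.8, 1.0]),
    "honey": np.array([1.2, 0.9, -0.4, 0.7]),
}

doc = ["the", "bears", "ate", "the", "honey"]
vecs = np.array([word_vectors[w] for w in doc])

doc_sum = vecs.sum(axis=0)    # sum of word vectors
doc_avg = vecs.mean(axis=0)   # average of word vectors

# Weighted sum: here each word is weighted by an assumed IDF-style weight.
idf = {"the": 0.1, "bears": 2.3, "ate": 1.7, "honey": 2.0}
weights = np.array([idf[w] for w in doc])
doc_weighted = (weights[:, None] * vecs).sum(axis=0) / weights.sum()
```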
OOV in word embeddings
• How do we tackle OOV (out-of-vocabulary) words in word2vec?
• What is the embedding/vector of a word we have never seen before?
• Unfortunately, there’s no easy way to tackle this problem in w2v.
• Pre-trained word embeddings are great for words that appear frequently in the data.
• Unseen words are treated as UNKs (unknown) and assigned zero or random vectors; every unseen word gets the same representation (illustrated below).
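A minimal illustration of this fallback behavior; the lookup table and the zero UNK vector are assumptions made for the example:

```python
import numpy as np

EMB_DIM = 4
word_vectors = {"bears": np.array([3.1, 1.4, -2.7, 0.3])}
unk_vector = np.zeros(EMB_DIM)   # every unseen word maps to this same vector

def lookup(word):
    """Return the word's vector, or the shared UNK vector if it is out of vocabulary."""
    return word_vectors.get(word, unk_vector)

lookup("bears")     # learned vector
lookup("bearzz")    # UNK: zero vector
lookup("covfefe")   # UNK: the exact same zero vector
```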
Shared structure: friend → friended
FastText
e(where) = e(whe) + e(her) + e(ere) + e(re>) (3-grams)
         + e(<whe) + e(wher) + e(here) + e(ere>) (4-grams)
         + e(<wher) + e(where) + e(here>) (5-grams)
         + e(<where) + e(where>) (6-grams)
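FastText composes a word vector from the vectors of its character n-grams (with < and > marking word boundaries), so even unseen words get a representation. A minimal sketch of that idea, assuming a hypothetical n-gram embedding table; the helper names, the 3–6 n-gram range, and the random placeholder vectors are assumptions (real fastText learns and hashes its n-gram vectors):

```python
import numpy as np

EMB_DIM = 4
rng = np.random.default_rng(0)
ngram_vectors = {}   # hypothetical n-gram embedding table

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of '<word>' with boundary markers."""
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

def embed(word):
    """Word vector = sum of its character n-gram vectors (works for OOV words too)."""
    grams = char_ngrams(word)
    for g in grams:
        ngram_vectors.setdefault(g, rng.normal(size=EMB_DIM))  # placeholder vectors
    return np.sum([ngram_vectors[g] for g in grams], axis=0)

embed("where")      # built from <wh, whe, her, ..., <where, where>
embed("whereish")   # unseen word, but it shares many n-grams with "where"
```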
Both w2v and fastText don’t capture context
• Models for learning static embeddings (w2v, fastText, etc.) learn a
single representation for a word type.
Types and tokens
• Type: “bears”
• Tokens (each occurrence gets the same static vector):
• The bears ate the honey → 3.1 1.4 -2.7 0.3
• We spotted the bears from the highway → 3.1 1.4 -2.7 0.3
• Yosemite has brown bears → 3.1 1.4 -2.7 0.3
• The Chicago Bears didn’t make the playoffs → 3.1 1.4 -2.7 0.3
[Figure: static embedding space: “bears” lies between the animal cluster (elk, moose) and the football cluster (football, 49ers, packers)]
Types and tokens
• Type: “الهلال” (Al-Hilal; literally “the crescent”)
• Tokens:
• فاز الهلال على النصر (“Al-Hilal beat Al-Nassr”) → 3.1 1.4 -2.7 0.3
[Figure: static embedding space: “الهلال” plotted near the football cluster: كرة القدم (football), النصر (Al-Nassr), الاتحاد (Al-Ittihad)]
Contextualized embeddings
• We saw a moose in Alaska
• Da bears lost again!
• Go pack go!
• رأيت الهلال المضيء (“I saw the bright crescent”)
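A minimal sketch of getting context-dependent vectors for the same word type, using a generic pre-trained transformer through the Hugging Face transformers library; the model choice, helper name, and the assumption that “bears” stays a single wordpiece are illustrative, and this is not the specific system quoted below:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def token_vectors(sentence):
    """Tokenize one sentence and return (tokens, one contextual vector per token)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # shape: (num_tokens, hidden_dim)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return tokens, hidden

tok1, vec1 = token_vectors("We spotted the bears from the highway")
tok2, vec2 = token_vectors("The Chicago Bears didn't make the playoffs")

# Unlike a static embedding, the vector for "bears" now differs across the two contexts.
bears_in_ctx1 = vec1[tok1.index("bears")]
bears_in_ctx2 = vec2[tok2.index("bears")]
```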
Google: “We’re making a significant improvement to how we understand queries, representing the biggest leap forward in the past five years, and one of the biggest leaps forward in the history of Search.” (source)
Microsoft Bing: “Starting from April of this year (2019), we used large transformer models to deliver the largest quality improvements to our Bing customers in the past year.” (source)
One problem with using embeddings for IR
• Given a query q
• We need to compare it with all documents in the collection to estimate relevance (a minimal sketch of this brute-force scoring follows)
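A minimal sketch of why that comparison is costly: the query must be scored against every document embedding in the collection. The collection size, dimensionality, random vectors, and cosine scoring below are illustrative assumptions:

```python
import numpy as np

num_docs, dim = 100_000, 300                   # illustrative collection size and dimensionality
doc_matrix = np.random.rand(num_docs, dim)     # one row per document embedding
query_vec = np.random.rand(dim)

# Brute-force scoring: one similarity per document, O(num_docs * dim) work per query.
doc_norms = np.linalg.norm(doc_matrix, axis=1)
scores = doc_matrix @ query_vec / (doc_norms * np.linalg.norm(query_vec))
top10 = np.argsort(-scores)[:10]               # indices of the highest-scoring documents
```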