
Fine-tuning and information retrieval with LMs


NLP Week 9

Thanks to Dan Jurafsky for most of the slides this week!


Plan for today
1. Contextualized embeddings

2. Fine-tuning for classification

3. Fine-tuning for sequence labeling

4. Practical considerations for fine-tuning

5. Information retrieval powered by language models

6. Group exercises
Problem with static embeddings (word2vec)
They are static! The embedding for a word doesn't reflect how
its meaning changes in context.

The chicken didn't cross the road because it was too tired

What is the meaning represented in the static embedding for "it"?
Contextual Embeddings
• Intuition: a representation of meaning of a word should be
different in different contexts!
• Contextual Embedding: each word has a different vector
that expresses different meanings depending on the
surrounding words
• How to compute contextual embeddings?
• Deep neural networks
• Bi-LSTMs
• Self-attention
Contextual Embeddings

The chicken didn't cross the road because it ______

What should be the properties of "it"?

The chicken didn't cross the road because it was too tired
The chicken didn't cross the road because it was too wide

At this point in the sentence, "it" is probably referring to either the chicken or the road.

Levesque et al. 2012 - The Winograd Schema Challenge


Word sense
Words are ambiguous
A word sense is a discrete representation of one aspect of meaning

Contextual embeddings offer a continuous, high-dimensional model of meaning that is more fine-grained than discrete senses.
Static vs Contextual Embeddings
Static embeddings represent word types (dictionary entries)
Contextual embeddings represent word instances (one for each
time the word occurs in any context/sentence)
Contextual Embeddings from BERT
Word sense disambiguation (WSD)
The task of selecting the correct sense for a word.
1-nearest neighbor algorithm for WSD
Melamud et al. (2016), Peters et al. (2018)

At training time, take a sense-labeled corpus like SemCor.

Run the corpus through BERT to get a contextual embedding for each token
• E.g., pool representations from the last 4 BERT transformer layers

Then for each sense s of word w, pool (average) the embeddings of the n tokens labeled with that sense:

v_s = (1/n) Σ_{i=1}^{n} v_i,   v_i ∈ tokens(s)

At test time, given a token of a target word t, compute its contextual embedding t and choose the nearest-neighbor sense from the training set.
1-nearest neighbor algorithm for WSD
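As a concrete illustration, here is a minimal sketch of the 1-nearest-neighbor WSD algorithm above. The `embed(tokens, i)` helper is a hypothetical stand-in for BERT (e.g., pooling the last 4 layers for the token at position i); this is not the code from the cited papers.

```python
# Minimal sketch of 1-NN WSD with contextual embeddings.
# `embed(tokens, i)` is a hypothetical helper returning the contextual
# embedding of the token at position i (e.g., BERT, pooling the last 4 layers).
import numpy as np

def build_sense_vectors(labeled_corpus, embed):
    """labeled_corpus: iterable of (tokens, token_index, sense) triples."""
    by_sense = {}
    for tokens, i, sense in labeled_corpus:
        by_sense.setdefault(sense, []).append(embed(tokens, i))
    # v_s = mean of the contextual embeddings of all tokens labeled with sense s
    return {s: np.mean(vs, axis=0) for s, vs in by_sense.items()}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def disambiguate(tokens, i, sense_vectors, embed):
    t = embed(tokens, i)  # contextual embedding of the target token
    # choose the sense whose pooled vector is the nearest neighbor of t
    return max(sense_vectors, key=lambda s: cosine(t, sense_vectors[s]))
```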
Similarity and contextual embeddings
• We generally use cosine as for static embeddings
• But some issues:
• Contextual embeddings tend to be anisotropic: all point in roughly
the same direction so have high inherent cosines (Ethayarajh 2019)
• Cosine measure are dominated by a small number of "rogue"
dimensions with very high values (Timkey and van Schijndel 2021)
• Cosine tends to underestimate human judgments on similarity of
word meaning for very frequent words (Zhou et al., 2022)
Fine-tuning pre-trained models

• Recap: Basic ingredients for training ML models


• A model p_θ(y ∣ x)
• A dataset 𝒟 = {(x_i, y_i)}_{i=1}^n
• An objective function, e.g., cross-entropy loss
• A learning algorithm: gradient descent
Fine-tuning pre-trained models
• So far, models were initialized randomly, i.e., θ ∼ 𝒩(μ, σ)
• Fine-tuning → Initialize weights from a pre-trained model
• Why should this work? → Transfer learning
• Pre-training encodes useful information about language
• This is helpful for various NLP downstream tasks

Howard & Ruder (2018) - Universal Language Model Fine-tuning for Text Classification

Ruder (2019) - The State of Transfer Learning in NLP
Fine-tuning pre-trained models
• Nowadays, fine-tuning means many things
• Adaptation to a single task, e.g., classification
• Continual pre-training
• Language adaptation
• Instruction fine-tuning
• Safety fine-tuning, RLHF
• Post-training
• Mid-training
• Reinforcement fine-tuning
• etc.
Using BERT for classification

Devlin et al. 2018 - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Sentence classification

Assign a label to a single sentence:
• Sentiment analysis (does a given sentence have positive or negative sentiment?)
• Acceptability judgements (is a given sentence grammatical?)
• Hate speech detection (e.g., does a given sentence express hate or encourage violence?)
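To make the fine-tuning recipe concrete, here is a minimal sketch using the Hugging Face transformers library; the toy dataset, number of epochs, and learning rate are placeholder assumptions, not the setup from the slides.

```python
# Minimal sketch of fine-tuning BERT for sentence classification with the
# Hugging Face transformers library. The dataset, epochs, and learning rate
# are toy placeholders.
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

train_data = [("a touching and funny film", 1), ("tedious and overlong", 0)]  # toy data
optimizer = AdamW(model.parameters(), lr=2e-5)  # small step size (see the stability slides)

model.train()
for epoch in range(3):
    for text, label in train_data:
        batch = tokenizer(text, return_tensors="pt", truncation=True)
        loss = model(**batch, labels=torch.tensor([label])).loss  # cross-entropy on the [CLS] head
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```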
Using BERT for sentence pair classification

Devlin et al. 2018 - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Sequence-pair classification

Assign a label to pairs of sentences:
• Paraphrase detection (are the two sentences paraphrases of each other?)
• Logical entailment (does sentence A logically entail sentence B?)
• Discourse coherence (how coherent is sentence B as a follow-on to sentence A?)
Example: Natural Language Inference

Pairs of sentences are given one of 3 labels (e.g., entailment, neutral, contradiction).

How? → Pass the premise/hypothesis pairs through BERT and use the output vector for the [CLS] token as the input to the classification head.
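A hedged sketch of the pair-encoding step: the tokenizer packs the premise and hypothesis as [CLS] premise [SEP] hypothesis [SEP], and a 3-way classification head reads the [CLS] vector. The head below is freshly initialized, so it would still need fine-tuning on NLI data before its predictions mean anything.

```python
# Sketch of sentence-pair classification: the tokenizer produces
# [CLS] premise [SEP] hypothesis [SEP]; the 3-way head reads the [CLS] vector.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

premise = "A man is playing a guitar on stage."
hypothesis = "A person is making music."
inputs = tokenizer(premise, hypothesis, return_tensors="pt")  # pair encoding with [SEP]

with torch.no_grad():
    logits = model(**inputs).logits          # shape (1, 3), one score per label
prediction = logits.argmax(dim=-1).item()    # index of the predicted label
```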
Using BERT for sequence labeling

Devlin et al. 2018 - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Using BERT for sequence labeling

Assign a label from a small fixed set of labels to each token in the sequence:
• Named entity recognition
• Part-of-speech tagging
Named Entity Recognition
A named entity is anything that can be referred to with a
proper name: a person, a location, an organization
Named entity recognition (NER): find spans of text that
constitute proper names and tag the type of the entity
Named Entity Recognition
BIO Tagging
Ramshaw and Marcus (1995)

A method that lets us turn a segmentation task (finding boundaries of entities) into a classification task
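A small sketch (a toy example, not from the slides) of how labeled entity spans are converted to BIO tags, which is what turns NER into per-token classification:

```python
# Toy sketch: convert labeled entity spans into BIO tags so that NER
# becomes a per-token classification problem.
def spans_to_bio(tokens, spans):
    """spans: list of (start, end, entity_type) with end exclusive (token indices)."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags

tokens = ["Jane", "Villanueva", "of", "United", "Airlines", "Holding"]
spans = [(0, 2, "PER"), (3, 6, "ORG")]
print(spans_to_bio(tokens, spans))
# ['B-PER', 'I-PER', 'O', 'B-ORG', 'I-ORG', 'I-ORG']
```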
Sequence labeling
Evaluating pre-trained language models
• How good are fine-tuned language models?
• GLUE, SuperGLUE, SQuAD, MasakhaNER, etc.
• Performance (relative to human performance) saturates over time

Kiela et al. (2021) - Dynabench: Rethinking Benchmarking in NLP
Practical considerations: Out-of-domain generalization
• Do models truly learn the task they are supposed to learn?
• What happens if we evaluate models outside the training distribution?
• Models can learn shortcuts present in the training data

McCoy et al. (2019) - Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference
Practical considerations: Fine-tuning (in)stability

• Repeat fine-tuning many times and change only the data order + weights of the classification layer
• Substantial variations in task performance

Mosbach et al. (2021) - On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines
Practical considerations: Fine-tuning (in)stability

• Fine-tuning can fail due to optimization problems


• You have to choose your step-size (learning rate) carefully

Mosbach et al. (2021) - On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines
Practical considerations: Parameter-efficient fine-tuning

• As models grow in size, fine-tuning can be expensive in both memory and time
• But do we really need to fine-tune all weights of a pre-trained model?
• There exist many methods for fine-tuning only a subset of the weights, and these often achieve very similar performance compared to fine-tuning all weights
LoRA
• Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning (PEFT) method
• Fine-tuning introduces changes to the model's weights: W′ = W + ΔW
• Let's assume ΔW is low rank
• Parameterize ΔW = AB
• Now learn only A and B
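A simplified PyTorch sketch of the idea, wrapping an existing linear layer; this is an illustration under the assumptions above, not the reference LoRA implementation. A is initialized to zero so that ΔW = AB is zero at the start of fine-tuning.

```python
# Simplified LoRA sketch: freeze the pre-trained weight W and learn
# a low-rank update ΔW = A B. A starts at zero so ΔW = 0 at initialization.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, linear: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad_(False)                          # freeze W (and bias)
        d_out, d_in = linear.weight.shape
        self.A = nn.Parameter(torch.zeros(d_out, r))         # (d_out, r)
        self.B = nn.Parameter(torch.randn(r, d_in) * 0.01)   # (r, d_in)
        self.scale = alpha / r

    def forward(self, x):
        # W'x = Wx + (AB)x, computed without ever materializing ΔW
        delta = (x @ self.B.T) @ self.A.T
        return self.linear(x) + self.scale * delta

# Usage: wrap a projection and train only A and B
layer = LoRALinear(nn.Linear(768, 768))
trainable = [p for p in layer.parameters() if p.requires_grad]  # just A and B
```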

Information retrieval
• Retrieve media based on a user’s information need
• We consider a special case of IR: ad-hoc retrieval
• A user provides a question to a system, which then returns
one or more documents from a collection of documents
Information retrieval
• Document: the unit of text that the IR system indexes
• Collection: set of documents
• Term: a word (or phrase) that occurs in the collection
• Query: information need expressed as a set of terms
Information retrieval
Recall from Week 5: Embedding types

• There are three main types of embeddings used in NLP, created using three different methods:

1. Sparse embeddings and term-document frequency

2. Dense embeddings and word2vec algorithms

3. Contextual word embeddings and language modeling


Recall from Week 5: Sparse embeddings

• Term-document matrix: each document is represented by a vector of terms
• Term frequency (tf) in the tf-idf algorithm. We could imagine using the raw count:

tf_{t,d} = count(t, d)

• But instead of using the raw count, we usually squash it a bit:

tf_{t,d} = 1 + log10(count(t, d))   if count(t, d) > 0
tf_{t,d} = 0                        otherwise

The tf-idf weighting (the '-' here is a hyphen, not a minus sign) is the product of two terms, each capturing one of two intuitions. The first is the term frequency (Luhn, 1957): the frequency of the word t in the document d. More commonly than the raw count, we use the log10 of the frequency; the intuition is that a word appearing 100 times in a document doesn't make that word 100 times more likely to be relevant to the meaning of the document. We also need to do something special with counts of 0, since we can't take the log of 0.
Recall from Week 5: Inverse document frequency

• The document frequency df_t of a term t is the number of documents it occurs in. In many collections, we squash this measure with a log function. The resulting definition for inverse document frequency (idf) is thus:

idf_t = log10(N / df_t)

• N is the total number of documents in the collection
• In the Shakespeare corpus, idf values range from words which occur in only one play like Romeo, to a few plays like Falstaff, to very common words like fool, to completely non-discriminative words that occur in all 37 plays.
tf-idf scoring

• Weighted value for each term:

w_{t,d} = tf_{t,d} × idf_t

• How do we score queries and documents? Cosine similarity:

score(q, d) = cos(q, d) = (q ⋅ d) / (|q| |d|)

• Rewrite the dot product as a sum of products over terms
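A toy sketch of tf-idf weighting and cosine scoring over a tiny collection; whitespace tokenization and the log-scaled tf from above are assumptions for illustration, not an optimized IR implementation.

```python
# Toy sketch of tf-idf weighting and cosine scoring.
import math
from collections import Counter

docs = ["the chicken crossed the road",
        "sweet sorrow of parting",
        "the road to information retrieval"]
doc_tf = [Counter(d.split()) for d in docs]
N = len(docs)
df = Counter(t for counts in doc_tf for t in counts)     # document frequency

def tfidf(term, counts):
    if counts[term] == 0 or df[term] == 0:
        return 0.0
    tf = 1 + math.log10(counts[term])
    idf = math.log10(N / df[term])
    return tf * idf

def cosine_score(query, counts):
    q_counts = Counter(query.split())
    terms = set(q_counts) | set(counts)
    q = [tfidf(t, q_counts) for t in terms]
    d = [tfidf(t, counts) for t in terms]
    dot = sum(qi * di for qi, di in zip(q, d))
    nq = math.sqrt(sum(qi * qi for qi in q))
    nd = math.sqrt(sum(di * di for di in d))
    return dot / (nq * nd) if nq and nd else 0.0

query = "chicken road"
ranking = sorted(range(N), key=lambda i: cosine_score(query, doc_tf[i]), reverse=True)
```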
BM25
• A variant of tf-idf that introduces two parameters: k (controls term-frequency saturation) and b (controls document-length normalization)
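A hedged sketch of one common BM25 formulation; exact idf variants and default values of k and b differ across implementations.

```python
# One common BM25 formulation (idf variants and defaults differ across systems).
# k controls term-frequency saturation, b controls document-length normalization.
import math
from collections import Counter

def bm25(query_terms, doc_index, doc_tfs, k=1.2, b=0.75):
    """doc_tfs: list of Counter term-frequency maps, one per document."""
    N = len(doc_tfs)
    avgdl = sum(sum(c.values()) for c in doc_tfs) / N          # average document length
    df = Counter(t for c in doc_tfs for t in c)                # document frequencies
    tf = doc_tfs[doc_index]
    dl = sum(tf.values())                                      # this document's length
    score = 0.0
    for t in query_terms:
        if df[t] == 0 or tf[t] == 0:
            continue
        idf = math.log10(N / df[t])
        score += idf * tf[t] * (k + 1) / (tf[t] + k * (1 - b + b * dl / avgdl))
    return score
```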
Information retrieval
• Efficiently find documents that contain terms of interest
• Inverted index
• Dictionary: list of terms (+ document frequency)
• Postings list: list of document ids that contain the term
(+ term frequency in each of the documents)
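A toy sketch of the inverted index described above: the dictionary maps each term to its document frequency, and the postings list stores (document id, term frequency) pairs, so candidate documents can be found without scanning the whole collection.

```python
# Toy inverted index: dictionary maps term -> document frequency,
# postings maps term -> [(doc_id, term frequency), ...].
from collections import Counter, defaultdict

docs = {0: "the chicken crossed the road", 1: "the road was wide"}

postings = defaultdict(list)
for doc_id, text in docs.items():
    for term, tf in Counter(text.split()).items():
        postings[term].append((doc_id, tf))

dictionary = {term: len(plist) for term, plist in postings.items()}

# Candidate documents for a query: union of the postings lists of its terms
query = ["road", "wide"]
candidates = {doc_id for t in query for doc_id, _ in postings.get(t, [])}
print(candidates)   # {0, 1}
```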
Evaluation of information retrieval systems
Precision and recall in IR
• Each document is either relevant or not relevant
• Precision: Fraction of retrieved documents that are relevant
• Recall: Fraction of all relevant documents that are retrieved
• Let T be the set of retrieved documents, R the relevant documents in T, N the irrelevant documents in T, and U all relevant documents in the collection
• Then precision = |R| / |T| and recall = |R| / |U|
Evaluation of information retrieval systems

• We care about the rank of the target document(s)


• Precision@k: fraction of the top k retrieved documents that are relevant
• Recall@k: fraction of all relevant documents that are found in the top k
Evaluation of information retrieval systems

• Mean average precision (MAP)


• Iterate over the ranked list from top to bottom
• Note the precision only at positions where a relevant item has been encountered
• Average these precisions over the return set
• Formally: let R_r be the set of relevant documents at or above rank r, and let Precision_r(d) be the precision measured at the rank at which document d was found
• Average precision: AP = (1 / |R_r|) Σ_{d ∈ R_r} Precision_r(d)
Evaluation of information retrieval systems

• Mean average precision (MAP)


• Average precision: AP = (1 / |R_r|) Σ_{d ∈ R_r} Precision_r(d), where Precision_r(d) is the precision measured at the rank where d was found
• Given a set of queries Q, mean average precision is MAP = (1 / |Q|) Σ_{q ∈ Q} AP(q)
Evaluation of information retrieval systems
• Mean average precision (MAP)

• Relevant docs at: 1, 3, 5, 6, 8


• Precisions: 1.0, 0.66, 0.60, 0.66, 0.63
• AP = (1.0 + 0.66 + 0.60 + 0.66 + 0.63) / 5 = 3.55 / 5 = 0.71
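A small sketch that reproduces the average-precision computation from the example above and extends it to MAP over several queries:

```python
# Sketch of average precision (AP) and mean average precision (MAP).
def average_precision(relevant_ranks, num_retrieved):
    """relevant_ranks: 1-based ranks at which relevant documents appear."""
    precisions, seen = [], 0
    for rank in range(1, num_retrieved + 1):
        if rank in relevant_ranks:
            seen += 1
            precisions.append(seen / rank)      # precision at this rank
    return sum(precisions) / len(precisions) if precisions else 0.0

# Example from the slide: relevant documents at ranks 1, 3, 5, 6, 8
print(round(average_precision({1, 3, 5, 6, 8}, 8), 2))   # 0.71

def mean_average_precision(relevant_ranks_per_query, num_retrieved):
    aps = [average_precision(r, num_retrieved) for r in relevant_ranks_per_query]
    return sum(aps) / len(aps)
```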
Information retrieval with pre-trained LMs

• Instead of using word-count vectors, use dense vectors from pre-trained language models, e.g., BERT
Information retrieval with pre-trained LMs
• Two potential ways to index queries and documents
• Jointly encode query and each document
• Bi-encoder: encode query and documents independently
Information retrieval with pre-trained LMs
• Jointly encode query and every document

z = BERT(q; [SEP]; d)[CLS]

score(z) = softmax(Uz)
Information retrieval with pre-trained LMs
• This approach is expensive
• Need to re-encode every document for each query
• This quickly becomes impractical when dealing with large document collections

z = BERT(q; [SEP]; d)[CLS]
score(z) = softmax(Uz)
Information retrieval with pre-trained LMs
• Bi-encoder: encode queries and documents independently

zq = BERTquery(q)[CLS]
zd = BERTdoc(d)[CLS]
score(q, d) = zq ⋅ zd
Information retrieval with pre-trained LMs
• Documents are encoded in advance
• When a new query comes in, we only have to encode that query
• Much more efficient, but tends to give worse results
• Question: How to combine the two approaches?

zq = BERTquery(q)[CLS]
zd = BERTdoc(d)[CLS]
score(q, d) = zq ⋅ zd
Information retrieval with pre-trained LMs
• Question: How to combine the two approaches?
• Answer: re-ranking. Retrieve candidates with the efficient bi-encoder, then re-score the top-k documents with the more accurate joint encoder (see the sketch below).
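A hedged sketch of the bi-encoder scoring above, using the [CLS] vector of a generic BERT checkpoint (a deployed system would use a retrieval-tuned model such as DPR). Document vectors are computed once offline; only the query is encoded at search time.

```python
# Sketch of bi-encoder retrieval with the [CLS] vector of a generic BERT checkpoint.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")   # one shared encoder here;
                                                           # DPR uses separate query/doc encoders

def cls_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state       # (1, seq_len, hidden_dim)
    return hidden[0, 0]                                    # the [CLS] vector

docs = ["Toy document about crossing the road.",
        "Toy document about dense passage retrieval."]
doc_vecs = torch.stack([cls_embedding(d) for d in docs])   # computed once, offline

query_vec = cls_embedding("what is dense retrieval?")
scores = doc_vecs @ query_vec                              # dot-product scores z_q · z_d
top_doc = int(scores.argmax())
# Re-ranking: re-score the top-k candidates with the expensive joint encoder.
```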
Information retrieval with pre-trained LMs
• Question Answering (QA) as a retrieval problem
• Find supporting documents for a query
• Extract or generate an answer based on the top-k docs
Information retrieval with pre-trained LMs
• Consider the document on the
right
• There exist many queries for which
this document might be relevant
• When was HEC founded?
• Can I do a PhD at HEC?
• What’s Viger Square?
• Etc.
• Ideally, all of this needs to be
encoded in the document
representation
How to improve dense retrieval?
• Use a collection {(q_i, d_i^+, d_i^-)}_{i=1}^n of queries paired with positive and negative documents
• Learn an implicit relevance definition based on this collection via contrastive learning
• Intuition: move positive documents close to the query, push negative documents away from the query

ℒ(q_i, d_i^+, {d_{i,j}^-}_{j=1}^k) = −log [ exp(q_i ⋅ d_i^+) / ( exp(q_i ⋅ d_i^+) + Σ_{j=1}^k exp(q_i ⋅ d_{i,j}^-) ) ]
Karpukhin et al. (2020) - Dense Passage Retrieval for Open-Domain Question Answering
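A minimal PyTorch sketch of this contrastive objective using in-batch negatives (the variant used in DPR): each query's positive document serves as a negative for the other queries, and the loss is the cross-entropy of a softmax over dot-product scores, matching the formula above.

```python
# Sketch of the contrastive objective with in-batch negatives (as in DPR).
import torch
import torch.nn.functional as F

def contrastive_loss(query_vecs, pos_doc_vecs):
    """query_vecs, pos_doc_vecs: (batch, dim) outputs of the two encoders."""
    scores = query_vecs @ pos_doc_vecs.T      # (batch, batch) dot-product scores
    targets = torch.arange(scores.size(0))    # diagonal entries are the positives
    # cross-entropy = -log softmax over {positive, negatives}
    return F.cross_entropy(scores, targets)
```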
Dense Passage Retrieval for Open-Domain QA

• How does it compare to BM25?

Karpukhin et al. (2020) - Dense Passage Retrieval for Open-Domain Question Answering
Dense Passage Retrieval for Open-Domain QA

• What happens outside of the training distribution?


• Question: what could be the reason?

Thakur et al. (2021) - BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models
Practical considerations
• How to efficiently search over dense representations?
• Approximate nearest neighbour search
• Fast embedding search on GPUs with libraries such as FAISS (see the sketch below)
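A hedged FAISS sketch: exact inner-product search with a flat index, and approximate search with an IVF index that probes only a few clusters. The vectors are random placeholders standing in for pre-computed embeddings.

```python
# FAISS sketch: exact inner-product search with a flat index, approximate
# search with an IVF index. Vectors are random placeholders.
import numpy as np
import faiss

dim = 768
doc_vecs = np.random.rand(10_000, dim).astype("float32")   # pre-computed document embeddings
query_vecs = np.random.rand(4, dim).astype("float32")

# Exact maximum inner-product search
flat = faiss.IndexFlatIP(dim)
flat.add(doc_vecs)
scores, ids = flat.search(query_vecs, 10)                  # top-10 documents per query

# Approximate search: cluster the collection and probe only a few clusters
quantizer = faiss.IndexFlatIP(dim)
ivf = faiss.IndexIVFFlat(quantizer, dim, 100, faiss.METRIC_INNER_PRODUCT)
ivf.train(doc_vecs)
ivf.add(doc_vecs)
ivf.nprobe = 8                                             # accuracy/speed trade-off
scores, ids = ivf.search(query_vecs, 10)
```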
[15 minute break]
