
Fine-tuning and information retrieval with LMs


NLP Week 9

Thanks to Dan Jurafsky for most of the slides this week!


Plan for today
1. Contextualized embeddings

2. Fine-tuning for classification

3. Fine-tuning for sequence labeling

4. Practical considerations for fine-tuning

5. Information retrieval powered by language models

6. Group exercises
Problem with static embeddings (word2vec)
They are static! The embedding for a word doesn't reflect how
its meaning changes in context.

The chicken didn't cross the road because it was too tired

What is the meaning represented in the static embedding for "it"?
Contextual Embeddings
• Intuition: a representation of meaning of a word should be
different in different contexts!
• Contextual Embedding: each word has a different vector
that expresses different meanings depending on the
surrounding words
• How to compute contextual embeddings?
• Deep neural networks
• Bi-LSTMs
• Self-attention
Contextual Embeddings

The chicken didn't cross the road because it ______

What should be the properties of "it"?

The chicken didn't cross the road because it was too tired
The chicken didn't cross the road because it was too wide

At this point in the sentence, "it" is probably referring to either the chicken or the road.

Levesque et al. 2012 - The Winograd Schema Challenge


Word sense
Words are ambiguous
A word sense is a discrete representation of one aspect of meaning

Contextual embeddings offer a continuous, high-dimensional model of meaning that is more fine-grained than discrete senses.
Static vs Contextual Embeddings
Static embeddings represent word types (dictionary entries)
Contextual embeddings represent word instances (one for each
time the word occurs in any context/sentence)
Contextual Embeddings from BERT
Word sense disambiguation (WSD)
The task of selecting the correct sense for a word.
1-nearest neighbor algorithm for WSD
Melamud et al. (2016), Peters et al. (2018)

At training time, take a sense-labeled corpus like SemCor.

Run the corpus through BERT to get a contextual embedding for each token
• E.g., pool representations from the last 4 BERT transformer layers

Then for each sense s of word w, pool (average) the embeddings of the n tokens labeled with that sense:

v_s = (1/n) Σ_{i=1}^{n} v_i,   v_i ∈ tokens(s)

At test time, given a token of a target word t, compute its contextual embedding t and choose the nearest-neighbor sense from the training set.
1-nearest neighbor algorithm for WSD
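As a concrete illustration, here is a minimal sketch of the 1-nearest-neighbor WSD algorithm above. The `embed(tokens, i)` helper is a hypothetical stand-in for BERT (e.g., pooling the last 4 layers for the token at position i); this is not the code from the cited papers.

```python
# Minimal sketch of 1-NN WSD with contextual embeddings.
# `embed(tokens, i)` is a hypothetical helper returning the contextual
# embedding of the token at position i (e.g., BERT, pooling the last 4 layers).
import numpy as np

def build_sense_vectors(labeled_corpus, embed):
    """labeled_corpus: iterable of (tokens, token_index, sense) triples."""
    by_sense = {}
    for tokens, i, sense in labeled_corpus:
        by_sense.setdefault(sense, []).append(embed(tokens, i))
    # v_s = mean of the contextual embeddings of all tokens labeled with sense s
    return {s: np.mean(vs, axis=0) for s, vs in by_sense.items()}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def disambiguate(tokens, i, sense_vectors, embed):
    t = embed(tokens, i)  # contextual embedding of the target token
    # choose the sense whose pooled vector is the nearest neighbor of t
    return max(sense_vectors, key=lambda s: cosine(t, sense_vectors[s]))
```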
Similarity and contextual embeddings
• We generally use cosine as for static embeddings
• But some issues:
• Contextual embeddings tend to be anisotropic: all point in roughly
the same direction so have high inherent cosines (Ethayarajh 2019)
• Cosine measure are dominated by a small number of "rogue"
dimensions with very high values (Timkey and van Schijndel 2021)
• Cosine tends to underestimate human judgments on similarity of
word meaning for very frequent words (Zhou et al., 2022)
Fine-tuning pre-trained models

• Recap: Basic ingredients for training ML models


• A model p_θ(y ∣ x)
• A dataset 𝒟 = {(x_i, y_i)}_{i=1}^n
• An objective function, e.g., cross-entropy loss
• A learning algorithm: gradient descent
Fine-tuning pre-trained models
• So far, models were initialized randomly, i.e., θ ∼ 𝒩(μ, σ)
• Fine-tuning → Initialize weights from a pre-trained model
• Why should this work? → Transfer learning
• Pre-training encodes useful information about language
• This is helpful for various NLP downstream tasks

Howard & Ruder (2018) - Universal Language Model Fine-tuning for Text Classification

Ruder (2019) - The State of Transfer Learning in NLP
Fine-tuning pre-trained models
• Nowadays, fine-tuning means many things
• Adaptation to a single task, e.g., classification
• Continual pre-training
• Language adaptation
• Instruction fine-tuning
• Safety fine-tuning, RLHF
• Post-training
• Mid-training
• Reinforcement fine-tuning
• etc.
Using BERT for classification

Devlin et al. 2018 - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Sentence classification

Assign a label to a single sentence:
• Sentiment analysis (does a given sentence have positive or negative sentiment?)
• Acceptability judgements (is a given sentence grammatical?)
• Hate speech detection (e.g., does a given sentence express hate or encourage violence?)
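To make the fine-tuning recipe concrete, here is a minimal sketch using the Hugging Face transformers library; the toy dataset, number of epochs, and learning rate are placeholder assumptions, not the setup from the slides.

```python
# Minimal sketch of fine-tuning BERT for sentence classification with the
# Hugging Face transformers library. The dataset, epochs, and learning rate
# are toy placeholders.
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

train_data = [("a touching and funny film", 1), ("tedious and overlong", 0)]  # toy data
optimizer = AdamW(model.parameters(), lr=2e-5)  # small step size (see the stability slides)

model.train()
for epoch in range(3):
    for text, label in train_data:
        batch = tokenizer(text, return_tensors="pt", truncation=True)
        loss = model(**batch, labels=torch.tensor([label])).loss  # cross-entropy on the [CLS] head
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```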
Using BERT for sentence pair classification

Devlin et al. 2018 - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Sequence-pair classification

Assign a label to pairs of sentences:
• Paraphrase detection (are the two sentences paraphrases of each other?)
• Logical entailment (does sentence A logically entail sentence B?)
• Discourse coherence (how coherent is sentence B as a follow-on to sentence A?)
Example: Natural Language Inference

Pairs of sentences are given one of 3 labels (e.g., entailment, neutral, contradiction).

How? → Pass the premise/hypothesis pairs through BERT and use the output vector for the [CLS] token as the input to the classification head.
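A hedged sketch of the pair-encoding step: the tokenizer packs the premise and hypothesis as [CLS] premise [SEP] hypothesis [SEP], and a 3-way classification head reads the [CLS] vector. The head below is freshly initialized, so it would still need fine-tuning on NLI data before its predictions mean anything.

```python
# Sketch of sentence-pair classification: the tokenizer produces
# [CLS] premise [SEP] hypothesis [SEP]; the 3-way head reads the [CLS] vector.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

premise = "A man is playing a guitar on stage."
hypothesis = "A person is making music."
inputs = tokenizer(premise, hypothesis, return_tensors="pt")  # pair encoding with [SEP]

with torch.no_grad():
    logits = model(**inputs).logits          # shape (1, 3), one score per label
prediction = logits.argmax(dim=-1).item()    # index of the predicted label
```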
Using BERT for sequence labeling

Devlin et al. 2018 - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Using BERT for sequence labeling

Assign a label from a small fixed set of labels to each token in the sequence:
• Named entity recognition
• Part-of-speech tagging
Named Entity Recognition
A named entity is anything that can be referred to with a
proper name: a person, a location, an organization
Named entity recognition (NER): find spans of text that
constitute proper names and tag the type of the entity
Named Entity Recognition
BIO Tagging
Ramshaw and Marcus (1995)

A method that lets us turn a segmentation task (finding boundaries of entities) into a classification task
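A small sketch (a toy example, not from the slides) of how labeled entity spans are converted to BIO tags, which is what turns NER into per-token classification:

```python
# Toy sketch: convert labeled entity spans into BIO tags so that NER
# becomes a per-token classification problem.
def spans_to_bio(tokens, spans):
    """spans: list of (start, end, entity_type) with end exclusive (token indices)."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags

tokens = ["Jane", "Villanueva", "of", "United", "Airlines", "Holding"]
spans = [(0, 2, "PER"), (3, 6, "ORG")]
print(spans_to_bio(tokens, spans))
# ['B-PER', 'I-PER', 'O', 'B-ORG', 'I-ORG', 'I-ORG']
```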
Sequence labeling
Evaluating pre-trained language models
• How good are fine-tuned language models?
• GLUE, SuperGLUE, SQuAD, MasakhaNER, etc.
• Performance (relative to human performance) saturates over time

Kiela et al. (2021) - Dynabench: Rethinking Benchmarking in NLP
Practical considerations: Out-of-domain generalization
• Do models truly learn the task they are supposed to learn?
• What happens if we evaluate models outside the training distribution?
• Models can learn shortcuts present in the training data

McCoy et al. (2019) - Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference
Practical considerations: Fine-tuning (in)stability

• Repeat fine-tuning many times and change only the data order + weights of the classification layer
• Substantial variations in task performance

Mosbach et al. (2021) - On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines
Practical considerations: Fine-tuning (in)stability

• Fine-tuning can fail due to optimization problems


• You have to choose your step-size (learning rate) carefully

Mosbach et al. (2021) - On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines
Practical considerations: Parameter-efficient fine-tuning

• As models grow in size, fine-tuning can be expensive in both memory and time
• But do we really need to fine-tune all weights of a pre-trained model?
• There exist many methods for fine-tuning only a subset of the weights, and these often achieve very similar performance compared to fine-tuning all weights
LoRA
• Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning (PEFT) method
• Fine-tuning introduces changes to the model's weights: W′ = W + ΔW
• Let's assume ΔW is low rank
• Parameterize ΔW = AB
• Now learn only A and B
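A simplified PyTorch sketch of the idea, wrapping an existing linear layer; this is an illustration under the assumptions above, not the reference LoRA implementation. A is initialized to zero so that ΔW = AB is zero at the start of fine-tuning.

```python
# Simplified LoRA sketch: freeze the pre-trained weight W and learn
# a low-rank update ΔW = A B. A starts at zero so ΔW = 0 at initialization.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, linear: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad_(False)                          # freeze W (and bias)
        d_out, d_in = linear.weight.shape
        self.A = nn.Parameter(torch.zeros(d_out, r))         # (d_out, r)
        self.B = nn.Parameter(torch.randn(r, d_in) * 0.01)   # (r, d_in)
        self.scale = alpha / r

    def forward(self, x):
        # W'x = Wx + (AB)x, computed without ever materializing ΔW
        delta = (x @ self.B.T) @ self.A.T
        return self.linear(x) + self.scale * delta

# Usage: wrap a projection and train only A and B
layer = LoRALinear(nn.Linear(768, 768))
trainable = [p for p in layer.parameters() if p.requires_grad]  # just A and B
```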

Information retrieval
• Retrieve media based on a user’s information need
• We consider a special case of IR: ad-hoc retrieval
• A user provides a question to a system, which then returns
one or more documents from a collection of documents
Information retrieval
• Document: the unit of text that the IR system indexes
• Collection: set of documents
• Term: a word (or phrase) that occurs in the collection
• Query: information need expressed as a set of terms
Information retrieval
Recall from Week 5: Embedding types

• There are three main types of embeddings used in NLP, created using three different methods:

1. Sparse embeddings and term-document frequency

2. Dense embeddings and word2vec algorithms

3. Contextual word embeddings and language modeling


Recall from Week 5: Sparse embeddings

• Term-document matrix: each document is represented by a vector of terms
• Term frequency (tf) in the tf-idf algorithm. We could imagine using the raw count:

tf_{t,d} = count(t, d)

• But instead of using the raw count, we usually squash it a bit:

tf_{t,d} = 1 + log10(count(t, d))   if count(t, d) > 0
tf_{t,d} = 0                        otherwise

The tf-idf weighting (the '-' here is a hyphen, not a minus sign) is the product of two terms, each capturing one of two intuitions. The first is the term frequency (Luhn, 1957): the frequency of the word t in the document d. More commonly than the raw count, we use the log10 of the frequency; the intuition is that a word appearing 100 times in a document doesn't make that word 100 times more likely to be relevant to the meaning of the document. We also need to do something special with counts of 0, since we can't take the log of 0.
Recall from Week 5: Inverse document frequency

• The document frequency df_t of a term t is the number of documents it occurs in. In many collections, we squash this measure with a log function. The resulting definition for inverse document frequency (idf) is thus:

idf_t = log10(N / df_t)

• N is the total number of documents in the collection
• In the Shakespeare corpus, idf values range from words which occur in only one play like Romeo, to a few plays like Falstaff, to very common words like fool, to completely non-discriminative words that occur in all 37 plays.
tf-idf scoring

• Weighted value for each term:

w_{t,d} = tf_{t,d} × idf_t

• How do we score queries and documents? Cosine similarity:

score(q, d) = cos(q, d) = (q ⋅ d) / (|q| |d|)

• Rewrite the dot product as a sum of products over terms
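A toy sketch of tf-idf weighting and cosine scoring over a tiny collection; whitespace tokenization and the log-scaled tf from above are assumptions for illustration, not an optimized IR implementation.

```python
# Toy sketch of tf-idf weighting and cosine scoring.
import math
from collections import Counter

docs = ["the chicken crossed the road",
        "sweet sorrow of parting",
        "the road to information retrieval"]
doc_tf = [Counter(d.split()) for d in docs]
N = len(docs)
df = Counter(t for counts in doc_tf for t in counts)     # document frequency

def tfidf(term, counts):
    if counts[term] == 0 or df[term] == 0:
        return 0.0
    tf = 1 + math.log10(counts[term])
    idf = math.log10(N / df[term])
    return tf * idf

def cosine_score(query, counts):
    q_counts = Counter(query.split())
    terms = set(q_counts) | set(counts)
    q = [tfidf(t, q_counts) for t in terms]
    d = [tfidf(t, counts) for t in terms]
    dot = sum(qi * di for qi, di in zip(q, d))
    nq = math.sqrt(sum(qi * qi for qi in q))
    nd = math.sqrt(sum(di * di for di in d))
    return dot / (nq * nd) if nq and nd else 0.0

query = "chicken road"
ranking = sorted(range(N), key=lambda i: cosine_score(query, doc_tf[i]), reverse=True)
```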
BM25
• A variant of tf-idf that introduces two parameters: k (controls term-frequency saturation) and b (controls document-length normalization)
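A hedged sketch of one common BM25 formulation; exact idf variants and default values of k and b differ across implementations.

```python
# One common BM25 formulation (idf variants and defaults differ across systems).
# k controls term-frequency saturation, b controls document-length normalization.
import math
from collections import Counter

def bm25(query_terms, doc_index, doc_tfs, k=1.2, b=0.75):
    """doc_tfs: list of Counter term-frequency maps, one per document."""
    N = len(doc_tfs)
    avgdl = sum(sum(c.values()) for c in doc_tfs) / N          # average document length
    df = Counter(t for c in doc_tfs for t in c)                # document frequencies
    tf = doc_tfs[doc_index]
    dl = sum(tf.values())                                      # this document's length
    score = 0.0
    for t in query_terms:
        if df[t] == 0 or tf[t] == 0:
            continue
        idf = math.log10(N / df[t])
        score += idf * tf[t] * (k + 1) / (tf[t] + k * (1 - b + b * dl / avgdl))
    return score
```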
Information retrieval
• Efficiently find documents that contain terms of interest
• Inverted index
• Dictionary: list of terms (+ document frequency)
• Postings list: list of document ids that contain the term
(+ term frequency in each of the documents)
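A toy sketch of the inverted index described above: the dictionary maps each term to its document frequency, and the postings list stores (document id, term frequency) pairs, so candidate documents can be found without scanning the whole collection.

```python
# Toy inverted index: dictionary maps term -> document frequency,
# postings maps term -> [(doc_id, term frequency), ...].
from collections import Counter, defaultdict

docs = {0: "the chicken crossed the road", 1: "the road was wide"}

postings = defaultdict(list)
for doc_id, text in docs.items():
    for term, tf in Counter(text.split()).items():
        postings[term].append((doc_id, tf))

dictionary = {term: len(plist) for term, plist in postings.items()}

# Candidate documents for a query: union of the postings lists of its terms
query = ["road", "wide"]
candidates = {doc_id for t in query for doc_id, _ in postings.get(t, [])}
print(candidates)   # {0, 1}
```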
Evaluation of information retrieval systems
Precision and recall in IR
• Each document is either relevant or not relevant
• Precision: Fraction of retrieved documents that are relevant
• Recall: Fraction of all relevant documents that are retrieved
• Let T be the set of retrieved documents, R the relevant documents in T, N the irrelevant documents in T, and U all relevant documents in the collection
• Then precision = |R| / |T| and recall = |R| / |U|
Evaluation of information retrieval systems

• We care about the rank of the target document(s)


• Precision@k: fraction of the top k retrieved documents that are relevant
• Recall@k: fraction of all relevant documents that are found in the top k
Evaluation of information retrieval systems

• Mean average precision (MAP)


• Iterate over the ranked list from top to bottom
• Note the precision only at positions where a relevant item has been encountered
• Average these precisions over the return set
• Formally: let R_r be the set of relevant documents at or above rank r, and let Precision_r(d) be the precision measured at the rank at which document d was found
• Average precision: AP = (1 / |R_r|) Σ_{d ∈ R_r} Precision_r(d)
Evaluation of information retrieval systems

• Mean average precision (MAP)


• Average precision: AP = (1 / |R_r|) Σ_{d ∈ R_r} Precision_r(d), where Precision_r(d) is the precision measured at the rank where d was found
• Given a set of queries Q, mean average precision is MAP = (1 / |Q|) Σ_{q ∈ Q} AP(q)
Evaluation of information retrieval systems
• Mean average precision (MAP)

• Relevant docs at: 1, 3, 5, 6, 8


• Precisions: 1.0, 0.66, 0.60, 0.66, 0.63
• AP = (1.0 + 0.66 + 0.60 + 0.66 + 0.63) / 5 = 3.55 / 5 = 0.71
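A small sketch that reproduces the average-precision computation from the example above and extends it to MAP over several queries:

```python
# Sketch of average precision (AP) and mean average precision (MAP).
def average_precision(relevant_ranks, num_retrieved):
    """relevant_ranks: 1-based ranks at which relevant documents appear."""
    precisions, seen = [], 0
    for rank in range(1, num_retrieved + 1):
        if rank in relevant_ranks:
            seen += 1
            precisions.append(seen / rank)      # precision at this rank
    return sum(precisions) / len(precisions) if precisions else 0.0

# Example from the slide: relevant documents at ranks 1, 3, 5, 6, 8
print(round(average_precision({1, 3, 5, 6, 8}, 8), 2))   # 0.71

def mean_average_precision(relevant_ranks_per_query, num_retrieved):
    aps = [average_precision(r, num_retrieved) for r in relevant_ranks_per_query]
    return sum(aps) / len(aps)
```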
Information retrieval with pre-trained LMs

• Instead of using word-count vectors, use dense vectors from pre-trained language models, e.g., BERT
Information retrieval with pre-trained LMs
• Two potential ways to index queries and documents
• Jointly encode query and each document
• Bi-encoder: encode query and documents independently
Information retrieval with pre-trained LMs
• Jointly encode query and every document

z = BERT(q; [SEP]; d)[CLS]

score(z) = softmax(Uz)
Information retrieval with pre-trained LMs
• This approach is expensive
• Need to re-encode every document for each query
• This quickly becomes impractical when dealing with large document collections

z = BERT(q; [SEP]; d)[CLS]
score(z) = softmax(Uz)
Information retrieval with pre-trained LMs
• Bi-encoder: encode queries and documents independently

zq = BERTquery(q)[CLS]
zd = BERTdoc(d)[CLS]
score(q, d) = zq ⋅ zd
Information retrieval with pre-trained LMs
• Documents are encoded in advance
• When a new query comes in, we only have to encode that query
• Much more efficient, but tends to give worse results
• Question: How to combine the two approaches?

zq = BERTquery(q)[CLS]
zd = BERTdoc(d)[CLS]
score(q, d) = zq ⋅ zd
Information retrieval with pre-trained LMs
• Question: How to combine the two approaches?
• Answer: re-ranking. Retrieve candidates with the efficient bi-encoder, then re-score the top-k documents with the more accurate joint encoder (see the sketch below).
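A hedged sketch of the bi-encoder scoring above, using the [CLS] vector of a generic BERT checkpoint (a deployed system would use a retrieval-tuned model such as DPR). Document vectors are computed once offline; only the query is encoded at search time.

```python
# Sketch of bi-encoder retrieval with the [CLS] vector of a generic BERT checkpoint.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")   # one shared encoder here;
                                                           # DPR uses separate query/doc encoders

def cls_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state       # (1, seq_len, hidden_dim)
    return hidden[0, 0]                                    # the [CLS] vector

docs = ["Toy document about crossing the road.",
        "Toy document about dense passage retrieval."]
doc_vecs = torch.stack([cls_embedding(d) for d in docs])   # computed once, offline

query_vec = cls_embedding("what is dense retrieval?")
scores = doc_vecs @ query_vec                              # dot-product scores z_q · z_d
top_doc = int(scores.argmax())
# Re-ranking: re-score the top-k candidates with the expensive joint encoder.
```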
Information retrieval with pre-trained LMs
• Question Answering (QA) as a retrieval problem
• Find supporting documents for a query
• Extract or generate an answer based on the top-k docs
Information retrieval with pre-trained LMs
• Consider the document on the
right
• There exist many queries for which
this document might be relevant
• When was HEC founded?
• Can I do a PhD at HEC?
• What’s Viger Square?
• Etc.
• Ideally, all of this needs to be
encoded in the document
representation
How to improve dense retrieval?
• Use a collection {(q_i, d_i^+, d_i^-)}_{i=1}^n of queries paired with positive and negative documents
• Learn an implicit relevance definition based on this collection via contrastive learning
• Intuition: move positive documents close to the query, push negative documents away from the query

ℒ(q_i, d_i^+, {d_{i,j}^-}_{j=1}^k) = −log [ exp(q_i ⋅ d_i^+) / ( exp(q_i ⋅ d_i^+) + Σ_{j=1}^k exp(q_i ⋅ d_{i,j}^-) ) ]
Karpukhin et al. (2020) - Dense Passage Retrieval for Open-Domain Question Answering
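A minimal PyTorch sketch of this contrastive objective using in-batch negatives (the variant used in DPR): each query's positive document serves as a negative for the other queries, and the loss is the cross-entropy of a softmax over dot-product scores, matching the formula above.

```python
# Sketch of the contrastive objective with in-batch negatives (as in DPR).
import torch
import torch.nn.functional as F

def contrastive_loss(query_vecs, pos_doc_vecs):
    """query_vecs, pos_doc_vecs: (batch, dim) outputs of the two encoders."""
    scores = query_vecs @ pos_doc_vecs.T      # (batch, batch) dot-product scores
    targets = torch.arange(scores.size(0))    # diagonal entries are the positives
    # cross-entropy = -log softmax over {positive, negatives}
    return F.cross_entropy(scores, targets)
```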
Dense Passage Retrieval for Open-Domain QA

• How does it compare to BM25?

Karpukhin et al. (2020) - Dense Passage Retrieval for Open-Domain Question Answering
Dense Passage Retrieval for Open-Domain QA

• What happens outside of the training distribution?


• Question: what could be the reason?

Thakur et al. (2021) - BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models
Practical considerations
• How to efficiently search over dense representations?
• Approximate nearest neighbour search
• Fast embedding search on GPUs with libraries such as FAISS (see the sketch below)
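A hedged FAISS sketch: exact inner-product search with a flat index, and approximate search with an IVF index that probes only a few clusters. The vectors are random placeholders standing in for pre-computed embeddings.

```python
# FAISS sketch: exact inner-product search with a flat index, approximate
# search with an IVF index. Vectors are random placeholders.
import numpy as np
import faiss

dim = 768
doc_vecs = np.random.rand(10_000, dim).astype("float32")   # pre-computed document embeddings
query_vecs = np.random.rand(4, dim).astype("float32")

# Exact maximum inner-product search
flat = faiss.IndexFlatIP(dim)
flat.add(doc_vecs)
scores, ids = flat.search(query_vecs, 10)                  # top-10 documents per query

# Approximate search: cluster the collection and probe only a few clusters
quantizer = faiss.IndexFlatIP(dim)
ivf = faiss.IndexIVFFlat(quantizer, dim, 100, faiss.METRIC_INNER_PRODUCT)
ivf.train(doc_vecs)
ivf.add(doc_vecs)
ivf.nprobe = 8                                             # accuracy/speed trade-off
scores, ids = ivf.search(query_vecs, 10)
```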
[15 minute break]
