Why NSP?
Masking focuses on predicting words from surrounding contexts so as to
produce effective word-level representations.
Many applications require understanding the relationship between two sentences, e.g.,
▶ paraphrase detection (detecting if two sentences have similar meanings),
▶ entailment (detecting if the meanings of two sentences entail or contradict
each other)
▶ discourse coherence (deciding if two neighboring sentences form a
coherent discourse)
Finite vocabulary assumptions make even less sense in many languages with
complex morphology
When do we stop?
When the desired vocabulary size is met.
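As an illustration of this stopping criterion, here is a minimal BPE-style merge loop: a toy sketch rather than any library's implementation, with the corpus, function name, and target size made up for the example.

```python
from collections import Counter

def learn_bpe_merges(word_freqs, target_vocab_size):
    """Toy BPE-style learner: repeatedly merge the most frequent adjacent
    symbol pair until the vocabulary reaches the desired size."""
    # Start with each word as a sequence of characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    vocab = {ch for word in corpus for ch in word}
    merges = []

    while len(vocab) < target_vocab_size:
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab.add(best[0] + best[1])
        # Apply the chosen merge to every word in the corpus.
        new_corpus = {}
        for word, freq in corpus.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_corpus[tuple(merged)] = new_corpus.get(tuple(merged), 0) + freq
        corpus = new_corpus
    return vocab, merges

# Stop once the vocabulary reaches the desired size (20 symbols here).
vocab, merges = learn_bpe_merges({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 20)
print(len(vocab), merges)
```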
With RNNs, we used the hidden layer associated with the final input element to stand for the entire sequence. In BERT, the [CLS] token plays the role of a sentence embedding.
This unique token is added to the vocabulary and is prepended to the
start of all input sequences, both during pretraining and encoding.
The output vector C ∈ R^{d_h} in the final layer of the model for the [CLS] input serves as the input to a classifier head.
The only new parameters introduced during fine-tuning are classification layer weights W_C ∈ R^{K×d_h}, where K is the number of labels.
y = softmax(W_C C)
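A minimal sketch of such a classifier head in PyTorch; this is a hedged illustration of the formula above, not BERT's actual fine-tuning code, and the dimension d_h = 768 and the class/variable names are assumptions.

```python
import torch
import torch.nn as nn

class ClsHead(nn.Module):
    """Maps the final-layer [CLS] vector C (d_h-dim) to K label probabilities."""
    def __init__(self, d_h: int, num_labels: int):
        super().__init__()
        # W_C in R^{K x d_h}: the only new parameters learned during fine-tuning.
        self.W_C = nn.Linear(d_h, num_labels, bias=False)

    def forward(self, cls_vector: torch.Tensor) -> torch.Tensor:
        # y = softmax(W_C C)
        return torch.softmax(self.W_C(cls_vector), dim=-1)

head = ClsHead(d_h=768, num_labels=3)   # e.g. 3 NLI labels
C = torch.randn(1, 768)                 # stand-in for the [CLS] output vector
print(head(C))                          # probability distribution over the 3 labels
```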
Example: MultiNLI
Pairs of sentences are given one of three labels: entails, contradicts, or neutral.
These labels describe a relationship between the meaning of the first
sentence (the premise) and the second sentence (the hypothesis).
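A hedged sketch of how such a premise/hypothesis pair is packed into a single BERT input, using the Hugging Face transformers library; the checkpoint name, label count, and example sentences are illustrative, and the classification head here is untrained.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)  # entails / contradicts / neutral

premise = "A soccer game with multiple males playing."
hypothesis = "Some men are playing a sport."

# The tokenizer packs the pair as: [CLS] premise [SEP] hypothesis [SEP]
inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(torch.softmax(logits, dim=-1))  # distribution over the 3 labels
```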
Here, the final output vector corresponding to each input token is passed
to a classifier that produces a softmax distribution over the possible set of
tags.
The set of weights to be learned for this additional layer is W_K ∈ R^{k×d_h}, where k is the number of possible tags for the task.
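A minimal PyTorch sketch of this per-token classifier; the shapes, tag count, and names are illustrative assumptions rather than the slides' own code.

```python
import torch
import torch.nn as nn

class TokenTagHead(nn.Module):
    """Applies W_K in R^{k x d_h} to every token's final-layer vector,
    giving a softmax distribution over k tags for each token."""
    def __init__(self, d_h: int, num_tags: int):
        super().__init__()
        self.W_K = nn.Linear(d_h, num_tags, bias=False)

    def forward(self, token_vectors: torch.Tensor) -> torch.Tensor:
        # token_vectors: (batch, seq_len, d_h) -> (batch, seq_len, k)
        return torch.softmax(self.W_K(token_vectors), dim=-1)

head = TokenTagHead(d_h=768, num_tags=17)   # e.g. 17 POS tags (illustrative)
H = torch.randn(1, 12, 768)                 # final-layer vectors for 12 tokens
print(head(H).shape)                        # torch.Size([1, 12, 17])
```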
POS Tagging
Supervised training data for tasks like named entity recognition (NER) is
typically in the form of BIO tags associated with text segmented at the word
level. For example: