
Introduction to BERT and Transformer: pre-trained self-attention models to leverage unlabeled corpus data

PremiLab @ XJTLU, 4 April 2019
Presented by Hang Dong
Presentation of the two papers:

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for
language understanding. (NAACL 2019)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need.
(NIPS 2017)
Acknowledgement to all used slides, figures, tables, equations and texts from the papers, blogs and codes!
Background image from http://ruder.io/nlp-imagenet/
Pre-training general language representations

• Feature-based approaches
• Non-neural word representations
• Neural embedding
• Word embedding: Word2Vec, GloVe, …
• Sentence embedding, paragraph embedding, …

• Deep contextualised word representation (ELMo, Embeddings from Language Models)


(Peters et al., 2018)

• Fine-tuning approaches
• OpenAI GPT (Generative Pre-trained Transformer) (Radford et al., 2018a)
• BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2018)
Content
• ELMo (Peters et al., 2018)
• OpenAI GPT (Radford et al., 2018a)
• Transformer (especially self-attention) (Vaswani et al., 2017)
• BERT (Devlin et al., 2018)
• Analyses & Future Studies
ELMo: deep contextualised word representation
(Peters et al., 2018)
• “Instead of using a fixed embedding for each word, ELMo looks at the entire sentence before
assigning each word in it an embedding.”

Acknowledgement to Figure from http://jalammar.github.io/illustrated-bert/
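To make the "one vector per token occurrence" idea concrete, a minimal NumPy sketch (not ELMo's released code) of the scalar mix ELMo uses: a task-specific, softmax-weighted sum of the biLM layer outputs scaled by γ, so the same word type gets a different vector in every sentence. The function and variable names are illustrative, and the layer states are assumed to come from some pre-trained biLM.

```python
import numpy as np

def elmo_representation(layer_states, s_weights, gamma=1.0):
    """Task-specific ELMo vector: gamma * sum_j softmax(s)_j * h_j (Peters et al., 2018)."""
    s = np.exp(s_weights - s_weights.max())
    s = s / s.sum()                               # softmax over the biLM layers
    return gamma * np.einsum('l,lsd->sd', s, layer_states)

# Toy example: 3 biLM layers (token layer + 2 biLSTM layers), 5 tokens, 8 dimensions.
states = np.random.randn(3, 5, 8)                 # assumed to come from a pre-trained biLM
elmo = elmo_representation(states, s_weights=np.zeros(3))
print(elmo.shape)                                 # (5, 8): one contextual vector per token occurrence
```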


ELMo (figure slides)
Acknowledgement to slides from https://www.slideshare.net/shuntaroy/a-review-of-deep-contextualized-word-representations-peters-2018
OpenAI GPT (Generative Pre-trained
Transformer) – (1) pre-training
• Unsupervised pre-training, maximising the log-likelihood

  L1(𝒰) = Σ_i log 𝑃(u_i | u_{i−k}, …, u_{i−1}; Θ)

• where 𝒰 = {u_1, …, u_n} is an unsupervised corpus of tokens, 𝑘 is the size of the context
window, and 𝑃 is modelled as a neural network with parameters Θ:

  h_0 = 𝑈 W_e + W_p,   h_l = transformer_block(h_{l−1}) for l = 1, …, 𝑛,   𝑃(u) = softmax(h_n W_e^T)

• where 𝑈 is the one-hot representation of the tokens in the window, 𝑛 is the total number of
transformer layers, and transformer_block() denotes the decoder block of the Transformer
model (multi-headed self-attention and position-wise feed-forward layers); a sketch follows below.

Equations in (Radford et al., 2018a)
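A minimal NumPy sketch of the equations above, assuming the Transformer decoder blocks are supplied as callables (stubbed here with an identity function). The names gpt_lm_logits, W_e and W_p are illustrative, not taken from the released GPT code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gpt_lm_logits(token_ids, W_e, W_p, blocks):
    """h_0 = U W_e + W_p;  h_l = transformer_block(h_{l-1});  logits = h_n W_e^T (weight tying)."""
    h = W_e[token_ids] + W_p[: len(token_ids)]   # token embeddings + learned position embeddings
    for block in blocks:                         # n Transformer decoder blocks
        h = block(h)
    return h @ W_e.T

def lm_log_likelihood(token_ids, logits):
    """L1 = sum_i log P(u_i | u_{<i}); the logits at position i predict token i+1."""
    log_p = np.log(softmax(logits[:-1]))
    return sum(log_p[i, token_ids[i + 1]] for i in range(len(token_ids) - 1))

# Toy run with an identity "block" standing in for a real decoder block.
vocab, d, ctx = 100, 16, 10
W_e = np.random.randn(vocab, d) * 0.02
W_p = np.random.randn(ctx, d) * 0.02
tokens = np.array([5, 17, 42, 8])
print(lm_log_likelihood(tokens, gpt_lm_logits(tokens, W_e, W_p, blocks=[lambda h: h])))
```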


GPT: (2) Fine-tuning
Given labelled data 𝐶, with each input a sequence of tokens x_1, x_2, …, x_m and each label y,
feed the sequence through the pre-trained model and predict y from the final transformer state h_l^m:

  𝑃(y | x_1, …, x_m) = softmax(h_l^m W_y),   L2(𝐶) = Σ_{(x, y)} log 𝑃(y | x_1, …, x_m)

Then maximise the final objective function, keeping language modelling as an auxiliary objective:

  L3(𝐶) = L2(𝐶) + 𝜆 · L1(𝐶)

𝜆 is set to 0.5 in the experiments (a sketch of this combined objective follows below).

Acknowledgement to Figure from http://jalammar.github.io/illustrated-bert/   Equations in (Radford et al., 2018a)
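A hedged sketch of the combined fine-tuning objective L3 = L2 + λ·L1 for a single labelled example; finetune_objective, h_final and W_y are illustrative names, and the language-modelling term is assumed to be computed as in the pre-training sketch above.

```python
import numpy as np

def finetune_objective(h_final, W_y, label, lm_log_lik, lam=0.5):
    """L3(C) = L2(C) + lambda * L1(C) for a single labelled sequence.

    h_final:    final transformer state of the last input token, shape (d,)
    W_y:        added classification matrix, shape (d, num_labels)
    label:      index of the gold class
    lm_log_lik: auxiliary language-modelling log-likelihood L1 of the same sequence
    """
    logits = h_final @ W_y
    log_probs = logits - logits.max() - np.log(np.exp(logits - logits.max()).sum())
    L2 = log_probs[label]                    # supervised log-likelihood, log P(y | x_1..x_m)
    return L2 + lam * lm_log_lik             # lambda = 0.5 in Radford et al. (2018a)

h = np.random.randn(768)
W_y = np.random.randn(768, 2) * 0.02
print(finetune_objective(h, W_y, label=1, lm_log_lik=-42.0))
```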


Transformer: a seq2seq model
• N = 6 layers in both the encoder and the decoder; 𝑑model = 512
• Residual connections & layer normalisation

Acknowledgement to Figure from http://jalammar.github.io/illustrated-bert/   Figure in (Vaswani et al., 2017)


Self-attention (1)
"The animal didn't cross the street because it was too tired"
"The animal didn't cross the street because it was too wide"

Acknowledgement to Figure from http://jalammar.github.io/illustrated-bert/   Equation and Figure in (Vaswani et al., 2017) (a NumPy sketch of the attention equation follows below)
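A minimal NumPy sketch of the scaled dot-product attention equation, Attention(Q, K, V) = softmax(QKᵀ/√d_k)·V; the optional mask argument is reused later for the decoder's masked attention. Function names are illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (Vaswani et al., 2017)."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)     # (..., seq_q, seq_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)          # block disallowed positions (see masked attention)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ V, weights

# Toy self-attention: queries, keys and values all come from the same 4-token sequence.
x = np.random.randn(4, 8)                              # d_model = 8 just for illustration
out, attn = scaled_dot_product_attention(x, x, x)
print(out.shape, attn.shape)                           # (4, 8) (4, 4)
```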
Self-attention (2)

Acknowledgement to Figure from http://jalammar.github.io/illustrated-bert/


Self-attention (3)

Acknowledgement to Figure from http://jalammar.github.io/illustrated-bert/


Acknowledgement to Figure from http://jalammar.github.io/illustrated-bert/
Multi-head attention

ℎ = 8,  𝑑𝑘 = 𝑑𝑣 = 𝑑model / ℎ = 64

(Vaswani et al., 2017)


Multi-head attention

Acknowledgement to Figure from http://jalammar.github.io/illustrated-bert/
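A sketch of multi-head attention built on the scaled_dot_product_attention function from the earlier sketch: project into h = 8 heads of d_k = d_v = 64 dimensions, attend in all heads in parallel, concatenate and project back. Weight shapes are illustrative and biases are omitted.

```python
import numpy as np

def multi_head_attention(x, W_q, W_k, W_v, W_o, h=8):
    """Project into h heads, attend per head in parallel, concatenate, project back."""
    seq_len, d_model = x.shape
    d_k = d_model // h                                       # 512 / 8 = 64
    def split(W):                                            # (seq, d_model) -> (h, seq, d_k)
        return (x @ W).reshape(seq_len, h, d_k).transpose(1, 0, 2)
    Q, K, V = split(W_q), split(W_k), split(W_v)
    heads, _ = scaled_dot_product_attention(Q, K, V)         # batched over the head axis
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

d_model = 512
x = np.random.randn(10, d_model)                             # 10 tokens
W_q, W_k, W_v, W_o = (np.random.randn(d_model, d_model) * 0.02 for _ in range(4))
print(multi_head_attention(x, W_q, W_k, W_v, W_o).shape)     # (10, 512)
```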


Acknowledgement to Figure from
http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture14-
transformers.pdf

Modelling the dependencies between:
(1) the input and output tokens
(2) the input tokens themselves
(3) the output tokens themselves

Acknowledgement to Figure from (keitakurita, 2019)
http://mlexplained.com/2017/12/29/attention-is-all-you-need-explained/
Three Multi-Head attention blocks
• Encoder Multi-Head Attention (left)
• Keys, values and queries are the output of the previous layer
in the encoder.
• Multiple word-word alignments.

• Decoder Masked Multi-Head Attention (lower right)


• Set the word-word attention weights for the connections to illegal "future" words to −∞ (see the mask sketch after this slide).

• Encoder-Decoder Multi-Head Attention (upper right)


• Keys and values from the output of the encoder, queries
from the previous decoder layer.

Figure in (Vaswani et al., 2017)
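A small illustration of the decoder's masking rule, reusing the attention sketch from earlier: a lower-triangular mask blocks attention to "future" positions. The paper sets the illegal scores to −∞; the sketch uses −1e9, which has the same effect after the softmax.

```python
import numpy as np

# Lower-triangular mask: position i may only attend to positions j <= i.
seq_len = 5
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Reusing scaled_dot_product_attention from the earlier sketch: masked-out scores get a large
# negative value, so their softmax weights vanish.
x = np.random.randn(seq_len, 8)
_, attn = scaled_dot_product_attention(x, x, x, mask=causal_mask)
print(np.triu(attn, k=1).max())   # ~0: no attention paid to "future" tokens
```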


Acknowledgement to Figure from
http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture14-
transformers.pdf
Why self-attention? - Efficiency and Path Length

Table in (Vaswani et al., 2017)


Maximum Path Length in RNN and Self-attention

Acknowledgement to Figure from http://mlexplained.com/2017/12/29/attention-is-all-you-need-explained/
Acknowledgement to Figure from http://jalammar.github.io/illustrated-bert/
Acknowledgement to Figure from http://jalammar.github.io/illustrated-bert/
Positional Embedding
• Added in order to inject information about the position (order) of tokens in the sequence:

  PE(pos, 2i) = sin(pos / 10000^(2i/𝑑model)),   PE(pos, 2i+1) = cos(pos / 10000^(2i/𝑑model))

• Each dimension of the positional encoding corresponds to a sinusoid (a small sketch follows below).

• For any fixed offset 𝑘, 𝑃𝐸(𝑝𝑜𝑠+𝑘) can be represented as a linear transformation of 𝑃𝐸(𝑝𝑜𝑠). This
would allow the model to easily learn to attend by relative positions.

Equations in (Vaswani et al., 2017)
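A minimal NumPy sketch of the sinusoidal encodings defined above; the function and variable names are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model));  PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(max_len)[:, None]                   # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even dimensions
    pe[:, 1::2] = np.cos(angles)                        # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)   # (50, 512); added to the token embeddings before the first layer
```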


Acknowledgement to the slide adapted from http://web.stanford.edu/class/cs224n/slides/cs224n-
2019-lecture14-transformers.pdf

Adopted by GPT
Evaluation for Transformer

Table in (Vaswani et al., 2017)


Table in (Vaswani et al., 2017)
Evaluation for Transformer – parameter tuning
What is BERT (Bidirectional Encoder Representations from Transformers)?

Figure in (Devlin et al., 2018)


Input Representation
Hidden state corresponding to [CLS] will be
used as the sentence representation

• Token Embeddings: WordPiece embedding (Wu et al., 2016)


• Segment Embeddings: randomly initialised and learned; a single-sentence input only uses the sentence A embedding (EA)
• Position Embeddings: randomly initialised and learned (see the input-embedding sketch below)
Figure in (Devlin et al., 2018)
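A sketch of how the three embeddings are summed into the input representation. Sizes are toy values (BERT-Base uses a WordPiece vocabulary of roughly 30k entries and hidden size 768); the token ids are hypothetical, with 101/102 standing in for [CLS]/[SEP] as in the uncased vocabulary.

```python
import numpy as np

def bert_input_embeddings(token_ids, segment_ids, tok_emb, seg_emb, pos_emb):
    """BERT input = token embedding + segment embedding + position embedding (summed element-wise)."""
    positions = np.arange(len(token_ids))
    return tok_emb[token_ids] + seg_emb[segment_ids] + pos_emb[positions]

# Toy sizes; BERT-Base uses a ~30k WordPiece vocabulary and hidden size 768.
vocab, hidden, max_pos = 1000, 768, 512
tok_emb = np.random.randn(vocab, hidden) * 0.02
seg_emb = np.random.randn(2, hidden) * 0.02        # learned; two segment types (A / B)
pos_emb = np.random.randn(max_pos, hidden) * 0.02  # learned, unlike the Transformer's sinusoids

# Hypothetical ids for "[CLS] a b [SEP] c [SEP]".
token_ids   = np.array([101, 7, 8, 102, 9, 102])
segment_ids = np.array([0, 0, 0, 0, 1, 1])
print(bert_input_embeddings(token_ids, segment_ids, tok_emb, seg_emb, pos_emb).shape)  # (6, 768)
```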
Training tasks (1) - Masked Language Model
• Masked Language Model: Cloze task

• Masking(input_seq):
  For every input_seq:
  • Randomly select 15% of the tokens (not more than 20 per sequence)
    • For 80% of the time: replace the word with the [MASK] token.
    • For 10% of the time: replace the word with a random word.
    • For 10% of the time: keep the word unchanged.

• For related code see def create_masked_lm_predictions(…) in
  https://github.com/google-research/bert/blob/master/create_pretraining_data.py
  (a simplified sketch follows below)

Acknowledgement to the Figure from http://jalammar.github.io/illustrated-bert/
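A simplified, hedged sketch of the masking procedure above; the real create_masked_lm_predictions also handles WordPiece continuation tokens and other details that are omitted here.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, max_predictions=20):
    """Simplified 80/10/10 masking for the masked LM task (cf. create_masked_lm_predictions)."""
    tokens = list(tokens)
    candidates = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]
    random.shuffle(candidates)
    num_to_mask = min(max_predictions, max(1, round(len(candidates) * mask_prob)))
    labels = {}                                  # position -> original token to be predicted
    for i in candidates[:num_to_mask]:
        labels[i] = tokens[i]
        r = random.random()
        if r < 0.8:
            tokens[i] = "[MASK]"                 # 80% of the time: replace with [MASK]
        elif r < 0.9:
            tokens[i] = random.choice(vocab)     # 10% of the time: replace with a random word
        # else: 10% of the time, keep the word unchanged
    return tokens, labels

masked, labels = mask_tokens(["[CLS]", "my", "dog", "is", "hairy", "[SEP]"],
                             vocab=["cat", "house", "runs"])
print(masked, labels)
```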


Training tasks (2) – Next Sentence Prediction
• Next sentence prediction – binary classification
• For every input document, as a sentence-token 2D list:
  • Randomly select a split over the sentences:
    • Store segment A
    • For 50% of the time: sample a random sentence split from another document as segment B.
    • For 50% of the time: use the actual following sentences as segment B.
  • Masking(Truncate([segment A, segment B]))
• For related code see def create_instances_from_document(…) in
  https://github.com/google-research/bert/blob/master/create_pretraining_data.py
  (a simplified sketch follows below)

Acknowledgement to the Figure adapted from http://jalammar.github.io/illustrated-bert/
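A simplified sketch of the sentence-pair construction; the real create_instances_from_document also avoids sampling from the same document, packs sentences up to the maximum sequence length and truncates the pair, which is omitted here. Names are illustrative.

```python
import random

def make_nsp_instance(doc_sentences, all_documents):
    """Simplified next-sentence-prediction pairing (cf. create_instances_from_document)."""
    split = random.randrange(1, len(doc_sentences))          # random split point over the sentences
    segment_a = doc_sentences[:split]
    if random.random() < 0.5:                                # 50%: segment B from another document
        other = random.choice(all_documents)                 # (the real code avoids the same document)
        segment_b, is_next = other[:random.randrange(1, len(other) + 1)], False
    else:                                                    # 50%: the actual following sentences
        segment_b, is_next = doc_sentences[split:], True
    return segment_a, segment_b, is_next                     # then: Masking(Truncate([A, B]))

doc = [["the", "dog", "barked"], ["it", "was", "loud"]]
corpus = [doc, [["completely", "unrelated"], ["sentences", "here"]]]
print(make_nsp_instance(doc, corpus))
```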
Pre-Training datasets and details
• Training loss L is the sum of the mean masked LM likelihood and mean next sentence
prediction likelihood.

• Dataset: Long contiguous word sequences.


• BooksCorpus (800M words), about 7,000 unique unpublished books from a variety of
genres including Adventure,
Fantasy, and Romance.
• English Wikipedia (2,500M words), excluding lists, tables, headers.

• Sequence length 512; Batch size 256; trained for 1M steps (approximately 40 epochs);
learning rate 1e-4; Adam optimiser, 𝛽1 as 0.9, 𝛽2 as 0.999; dropout as 0.1 on all layers; GELU
activation; L2 weight decay of 0.01; learning rate warmup over the first 10,000 steps, linear
decay of learning rate …
• BERTBASE : N = 12, 𝑑model = 768, ℎ = 12, Total Parameters = 110M (see the configuration sketch after this slide)
• 4 cloud TPUs in Pod configuration (16 TPU chips total)

• BERTLARGE : N = 24, 𝑑model = 1024, ℎ = 16, Total Parameters=340M


• 16 Cloud TPUs (64 TPU chips total)

• Each pretraining took 4 days to complete.
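For reference, a sketch of a BERT-Base style configuration in the spirit of bert_config.json from google-research/bert; the field names follow that file, and the values are the commonly reported BERT-Base (uncased) settings rather than a verbatim copy of the released config.

```python
# Field names follow bert_config.json in google-research/bert; values are the commonly
# reported BERT-Base (uncased) settings, not a verbatim copy of the released file.
bert_base_config = {
    "vocab_size": 30522,                    # WordPiece vocabulary size (uncased model)
    "hidden_size": 768,                     # d_model
    "num_hidden_layers": 12,                # N (Transformer encoder layers)
    "num_attention_heads": 12,              # h
    "intermediate_size": 3072,              # position-wise feed-forward size
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "attention_probs_dropout_prob": 0.1,
    "max_position_embeddings": 512,         # maximum sequence length
    "type_vocab_size": 2,                   # segment A / B
    "initializer_range": 0.02,
}
```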


Fine-tuning with BERT

• Context vector 𝐶: take the final hidden state corresponding to the first token of the input, [CLS].
• Transform it into a probability distribution over the class labels with a single additional layer 𝑊:

  𝑃 = softmax(𝐶 𝑊^T)

  (a minimal sketch follows below)

Figure in (Devlin et al., 2018)
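A minimal sketch of the classification head: take the final hidden state of [CLS] and apply softmax(C·Wᵀ). Names and shapes are illustrative, and a bias term is included only for completeness.

```python
import numpy as np

def classify_with_cls(final_hidden_states, W, b):
    """P = softmax(C W^T + b), where C is the final hidden state of the [CLS] token."""
    C = final_hidden_states[0]             # [CLS] is the first token of the sequence
    logits = C @ W.T + b                   # W: (num_labels, hidden), b: (num_labels,)
    logits -= logits.max()
    probs = np.exp(logits)
    return probs / probs.sum()

hidden, num_labels = 768, 3
states = np.random.randn(6, hidden)        # e.g. BERT's final hidden states for a 6-token input
W = np.random.randn(num_labels, hidden) * 0.02
print(classify_with_cls(states, W, np.zeros(num_labels)))   # probability distribution over 3 labels
```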


Evaluation for BERT: GLUE
• General Language Understanding Evaluation (GLUE) benchmark: standard split of the data into train, validation and test sets, where the labels for the test set are only held on the evaluation server.

• Sentence pair tasks


• MNLI, Multi-Genre Natural Language Inference
• QQP, Quora Question Pairs
• QNLI, Question Natural Language Inference
• STS-B The Semantic Textual Similarity Benchmark
• MRPC Microsoft Research Paraphrase Corpus
• RTE Recognizing Textual Entailment
• WNLI Winograd NLI is a small natural language inference dataset
• Single sentence classification
• SST-2 The Stanford Sentiment Treebank
• CoLA The Corpus of Linguistic Acceptability
Evaluation for BERT: GLUE

Table in (Devlin et al., 2018)


Evaluation on SQUAD
• The Stanford Question Answering
Dataset (SQuAD) is a collection of 100k
crowdsourced question/answer pairs.

Table in (Devlin et al., 2018)


Evaluation on Named Entity Recognition
• The CoNLL 2003 Named Entity
Recognition (NER) dataset. This dataset
consists of 200k training words which
have been annotated as Person,
Organization, Location, Miscellaneous,
or Other (non-named entity).

Table in (Devlin et al., 2018)


Ablation Study (1) – on pre-train tasks

Table in (Devlin et al., 2018)


Ablation Study (2) – on model sizes

Table in (Devlin et al., 2018)


Ablation Study (3) – on pre-training steps

Figure in (Devlin et al., 2018)


Ablation Study (4) – using BERT as feature extractor
(without fine-tuning)

Table in (Devlin et al., 2018)


Why does BERT work?
• Leveraging huge unlabeled, high-quality data: ~7,000 books + English Wikipedia (together 3,300M words)

• Multi-head self-attention blocks in the Transformer:
  • modelling the intra- and extra-sentence word-word relations
  • parallelisable within an instance and thus efficient

• Task similarity: masked language modelling + next sentence prediction
How to improve BERT?
• Pre-training
• Better tasks for pre-training for more complex usage
• Better (larger, high-quality) data
• Cross-lingual BERT for unsupervised learning (Lample & Conneau, 2019)
• Even larger models, e.g. GPT-2: zero-shot results that outperform the SOTA (Radford et al., 2018b)

• Fine-tuning
• Better loss in fine-tuning
• Introduce new tasks in fine-tuning
An architecture for multi-label classification
(Dong, 2019)
[Figure: Joint Multi-Label Attention Network – the Title and the Sentences (in Content) are encoded
with Bi-GRUs; word-level and (title-guided) sentence-level attentions produce context vectors that
feed a sigmoid output layer trained with a cross-entropy loss L_CE plus a semantic-based loss
regularisation λ1·L_sim + λ2·L_sub.]
In H. Dong, W. Wang, K. Huang, F. Coenen. Joint Multi-Label Attention Networks for Social Text Annotation, in Proceedings of the 2019 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Volume 2 (Short Papers), Minneapolis, USA, 2-7 June 2019.
Is it possible? Any further thought?

[Figure: the same annotation setup with BERT as the encoder – the Title and the Sentences (in Content)
are fed into BERT, whose output goes through a FFNN + sigmoid layer trained with L_CE plus the
semantic-based loss regularisation λ1·L_sim + λ2·L_sub.]
Recommended Learning Resources
• Jay Alammar. The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning). Dec
2018. http://jalammar.github.io/illustrated-bert/
• Jay Alammar. The Illustrated Transformer. http://jalammar.github.io/illustrated-transformer/.
June 2018.
• Ashish Vaswani and Anna Huang. Transformers and Self-Attention For Generative Models.
Feb 2019. CS224n. Stanford University. http://web.stanford.edu/class/cs224n/slides/cs224n-
2019-lecture14-transformers.pdf
• Kevin Clark. Future of NLP + Deep Learning. Mar 2019. CS224n. Stanford University.
http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture20-future.pdf
• keitakurita. Paper Dissected: “BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding” Explained http://mlexplained.com/2019/01/07/paper-dissected-
bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding-explained/
• keitakurita. Paper Dissected: “Attention is All You Need” Explained
http://mlexplained.com/2017/12/29/attention-is-all-you-need-explained/
References
• Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional
transformers for language understanding. In Proceedings of the 2019 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long
Papers)
• Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep
Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long
Papers) (Vol. 1, pp. 2227-2237).
• Lample, G., & Conneau, A. (2019). Cross-lingual Language Model Pretraining. arXiv preprint
arXiv:1901.07291.
• Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018a). Improving Language Understanding by
Generative Pre-Training.
• Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2018b). Language models are
unsupervised multitask learners. Technical report, OpenAI.
• Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention
is all you need. In Advances in Neural Information Processing Systems (pp. 5998-6008).
• Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., ... & Klingner, J. (2016). Google's neural
machine translation system: Bridging the gap between human and machine translation. arXiv preprint
arXiv:1609.08144.
