
Introduction to BERT and Transformer: pre-trained self-attention models to leverage unlabeled corpus data

PremiLab @ XJTLU, 4 April 2019
Presented by Hang Dong
Presentation of the two papers:

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for
language understanding. (NAACL 2019)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need.
(NIPS 2017)
Acknowledgement to all used slides, figures, tables, equations and texts from the papers, blogs and codes!
Background image from http://ruder.io/nlp-imagenet/
Pre-training general language representations

• Feature-based approaches
• Non-neural word representations
• Neural embedding
• Word embedding: Word2Vec, GloVe, …
• Sentence embedding, paragraph embedding, …

• Deep contextualised word representation (ELMo, Embeddings from Language Models)


(Peters et al., 2018)

• Fine-tuning approaches
• OpenAI GPT (Generative Pre-trained Transformer) (Radford et al., 2018a)
• BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2018)
Content
• ELMo (Peters et al., 2018)
• OpenAI GPT (Radford et al., 2018a)
• Transformer (especially self-attention) (Vaswani et al., 2017)
• BERT (Devlin et al., 2018)
• Analyses & Future Studies
ELMo: deep contextualised word representation
(Peters et al., 2018)
• “Instead of using a fixed embedding for each word, ELMo looks at the entire sentence before
assigning each word in it an embedding.”

Acknowledgement to Figure from http://jalammar.github.io/illustrated-bert/
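To make the "one vector per token occurrence" idea concrete, a minimal NumPy sketch (not ELMo's released code) of the scalar mix ELMo uses: a task-specific, softmax-weighted sum of the biLM layer outputs scaled by γ, so the same word type gets a different vector in every sentence. The function and variable names are illustrative, and the layer states are assumed to come from some pre-trained biLM.

```python
import numpy as np

def elmo_representation(layer_states, s_weights, gamma=1.0):
    """Task-specific ELMo vector: gamma * sum_j softmax(s)_j * h_j (Peters et al., 2018)."""
    s = np.exp(s_weights - s_weights.max())
    s = s / s.sum()                               # softmax over the biLM layers
    return gamma * np.einsum('l,lsd->sd', s, layer_states)

# Toy example: 3 biLM layers (token layer + 2 biLSTM layers), 5 tokens, 8 dimensions.
states = np.random.randn(3, 5, 8)                 # assumed to come from a pre-trained biLM
elmo = elmo_representation(states, s_weights=np.zeros(3))
print(elmo.shape)                                 # (5, 8): one contextual vector per token occurrence
```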


ELMo (figure slides)
Acknowledgement to slides from https://www.slideshare.net/shuntaroy/a-review-of-deep-contextualized-word-representations-peters-2018
OpenAI GPT (Generative Pre-trained
Transformer) – (1) pre-training
• Unsupervised pre-training, maximising the log-likelihood

  L1(𝒰) = Σ_i log 𝑃(u_i | u_{i−k}, …, u_{i−1}; Θ)

• where 𝒰 = {u_1, …, u_n} is an unsupervised corpus of tokens, 𝑘 is the size of the context
window, and 𝑃 is modelled as a neural network with parameters Θ:

  h_0 = 𝑈 W_e + W_p,   h_l = transformer_block(h_{l−1}) for l = 1, …, 𝑛,   𝑃(u) = softmax(h_n W_e^T)

• where 𝑈 is the one-hot representation of the tokens in the window, 𝑛 is the total number of
transformer layers, and transformer_block() denotes the decoder block of the Transformer
model (multi-headed self-attention and position-wise feed-forward layers); a sketch follows below.

Equations in (Radford et al., 2018a)
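A minimal NumPy sketch of the equations above, assuming the Transformer decoder blocks are supplied as callables (stubbed here with an identity function). The names gpt_lm_logits, W_e and W_p are illustrative, not taken from the released GPT code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gpt_lm_logits(token_ids, W_e, W_p, blocks):
    """h_0 = U W_e + W_p;  h_l = transformer_block(h_{l-1});  logits = h_n W_e^T (weight tying)."""
    h = W_e[token_ids] + W_p[: len(token_ids)]   # token embeddings + learned position embeddings
    for block in blocks:                         # n Transformer decoder blocks
        h = block(h)
    return h @ W_e.T

def lm_log_likelihood(token_ids, logits):
    """L1 = sum_i log P(u_i | u_{<i}); the logits at position i predict token i+1."""
    log_p = np.log(softmax(logits[:-1]))
    return sum(log_p[i, token_ids[i + 1]] for i in range(len(token_ids) - 1))

# Toy run with an identity "block" standing in for a real decoder block.
vocab, d, ctx = 100, 16, 10
W_e = np.random.randn(vocab, d) * 0.02
W_p = np.random.randn(ctx, d) * 0.02
tokens = np.array([5, 17, 42, 8])
print(lm_log_likelihood(tokens, gpt_lm_logits(tokens, W_e, W_p, blocks=[lambda h: h])))
```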


GPT: (2) Fine-tuning
Given labelled data 𝐶, with each input a sequence of tokens x_1, x_2, …, x_m and each label y,
feed the sequence through the pre-trained model and predict y from the final transformer state h_l^m:

  𝑃(y | x_1, …, x_m) = softmax(h_l^m W_y),   L2(𝐶) = Σ_{(x, y)} log 𝑃(y | x_1, …, x_m)

Then maximise the final objective function, keeping language modelling as an auxiliary objective:

  L3(𝐶) = L2(𝐶) + 𝜆 · L1(𝐶)

𝜆 is set to 0.5 in the experiments (a sketch of this combined objective follows below).

Acknowledgement to Figure from http://jalammar.github.io/illustrated-bert/   Equations in (Radford et al., 2018a)
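A hedged sketch of the combined fine-tuning objective L3 = L2 + λ·L1 for a single labelled example; finetune_objective, h_final and W_y are illustrative names, and the language-modelling term is assumed to be computed as in the pre-training sketch above.

```python
import numpy as np

def finetune_objective(h_final, W_y, label, lm_log_lik, lam=0.5):
    """L3(C) = L2(C) + lambda * L1(C) for a single labelled sequence.

    h_final:    final transformer state of the last input token, shape (d,)
    W_y:        added classification matrix, shape (d, num_labels)
    label:      index of the gold class
    lm_log_lik: auxiliary language-modelling log-likelihood L1 of the same sequence
    """
    logits = h_final @ W_y
    log_probs = logits - logits.max() - np.log(np.exp(logits - logits.max()).sum())
    L2 = log_probs[label]                    # supervised log-likelihood, log P(y | x_1..x_m)
    return L2 + lam * lm_log_lik             # lambda = 0.5 in Radford et al. (2018a)

h = np.random.randn(768)
W_y = np.random.randn(768, 2) * 0.02
print(finetune_objective(h, W_y, label=1, lm_log_lik=-42.0))
```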


Transformer: a seq2seq model
• N = 6 layers in both the encoder and the decoder; 𝑑model = 512
• Residual connections & layer normalisation

Acknowledgement to Figure from http://jalammar.github.io/illustrated-bert/   Figure in (Vaswani et al., 2017)


Self-attention (1)
"The animal didn't cross the street because it was too tired"
"The animal didn't cross the street because it was too wide"

Acknowledgement to Figure from http://jalammar.github.io/illustrated-bert/   Equation and Figure in (Vaswani et al., 2017) (a NumPy sketch of the attention equation follows below)
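A minimal NumPy sketch of the scaled dot-product attention equation, Attention(Q, K, V) = softmax(QKᵀ/√d_k)·V; the optional mask argument is reused later for the decoder's masked attention. Function names are illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (Vaswani et al., 2017)."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)     # (..., seq_q, seq_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)          # block disallowed positions (see masked attention)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ V, weights

# Toy self-attention: queries, keys and values all come from the same 4-token sequence.
x = np.random.randn(4, 8)                              # d_model = 8 just for illustration
out, attn = scaled_dot_product_attention(x, x, x)
print(out.shape, attn.shape)                           # (4, 8) (4, 4)
```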
Self-attention (2)

Acknowledgement to Figure from http://jalammar.github.io/illustrated-bert/


Self-attention (3)

Acknowledgement to Figure from http://jalammar.github.io/illustrated-bert/


Acknowledgement to Figure from http://jalammar.github.io/illustrated-bert/
Multi-head attention

ℎ = 8,  𝑑𝑘 = 𝑑𝑣 = 𝑑model / ℎ = 64

(Vaswani et al., 2017)


Multi-head attention

Acknowledgement to Figure from http://jalammar.github.io/illustrated-bert/
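A sketch of multi-head attention built on the scaled_dot_product_attention function from the earlier sketch: project into h = 8 heads of d_k = d_v = 64 dimensions, attend in all heads in parallel, concatenate and project back. Weight shapes are illustrative and biases are omitted.

```python
import numpy as np

def multi_head_attention(x, W_q, W_k, W_v, W_o, h=8):
    """Project into h heads, attend per head in parallel, concatenate, project back."""
    seq_len, d_model = x.shape
    d_k = d_model // h                                       # 512 / 8 = 64
    def split(W):                                            # (seq, d_model) -> (h, seq, d_k)
        return (x @ W).reshape(seq_len, h, d_k).transpose(1, 0, 2)
    Q, K, V = split(W_q), split(W_k), split(W_v)
    heads, _ = scaled_dot_product_attention(Q, K, V)         # batched over the head axis
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

d_model = 512
x = np.random.randn(10, d_model)                             # 10 tokens
W_q, W_k, W_v, W_o = (np.random.randn(d_model, d_model) * 0.02 for _ in range(4))
print(multi_head_attention(x, W_q, W_k, W_v, W_o).shape)     # (10, 512)
```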


Acknowledgement to Figure from
http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture14-
transformers.pdf

Modelling the dependencies between:
(1) the input and output tokens
(2) the input tokens themselves
(3) the output tokens themselves

Acknowledgement to Figure from (keitakurita, 2019)
http://mlexplained.com/2017/12/29/attention-is-all-you-need-explained/
Three Multi-Head attention blocks
• Encoder Multi-Head Attention (left)
• Keys, values and queries are the output of the previous layer
in the encoder.
• Multiple word-word alignments.

• Decoder Masked Multi-Head Attention (lower right)


• Set the word-word attention weights for the connections to illegal "future" words to −∞ (see the mask sketch after this slide).

• Encoder-Decoder Multi-Head Attention (upper right)


• Keys and values from the output of the encoder, queries
from the previous decoder layer.

Figure in (Vaswani et al., 2017)
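A small illustration of the decoder's masking rule, reusing the attention sketch from earlier: a lower-triangular mask blocks attention to "future" positions. The paper sets the illegal scores to −∞; the sketch uses −1e9, which has the same effect after the softmax.

```python
import numpy as np

# Lower-triangular mask: position i may only attend to positions j <= i.
seq_len = 5
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Reusing scaled_dot_product_attention from the earlier sketch: masked-out scores get a large
# negative value, so their softmax weights vanish.
x = np.random.randn(seq_len, 8)
_, attn = scaled_dot_product_attention(x, x, x, mask=causal_mask)
print(np.triu(attn, k=1).max())   # ~0: no attention paid to "future" tokens
```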


Acknowledgement to Figure from
http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture14-
transformers.pdf
Why self-attention? - Efficiency and Path Length

Table in (Vaswani et al., 2017)


Maximum Path Length in RNN and Self-attention

Acknowledgement to Figure from http://mlexplained.com/2017/12/29/attention-is-all-you-need-explained/
Acknowledgement to Figure from http://jalammar.github.io/illustrated-bert/
Acknowledgement to Figure from http://jalammar.github.io/illustrated-bert/
Positional Embedding
• Added in order to inject information about the position (order) of tokens in the sequence:

  PE(pos, 2i) = sin(pos / 10000^(2i/𝑑model)),   PE(pos, 2i+1) = cos(pos / 10000^(2i/𝑑model))

• Each dimension of the positional encoding corresponds to a sinusoid (a small sketch follows below).

• For any fixed offset 𝑘, 𝑃𝐸(𝑝𝑜𝑠+𝑘) can be represented as a linear transformation of 𝑃𝐸(𝑝𝑜𝑠). This
would allow the model to easily learn to attend by relative positions.

Equations in (Vaswani et al., 2017)
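A minimal NumPy sketch of the sinusoidal encodings defined above; the function and variable names are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model));  PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(max_len)[:, None]                   # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even dimensions
    pe[:, 1::2] = np.cos(angles)                        # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)   # (50, 512); added to the token embeddings before the first layer
```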


Acknowledgement to the slide adapted from http://web.stanford.edu/class/cs224n/slides/cs224n-
2019-lecture14-transformers.pdf

Adopted by GPT
Evaluation for Transformer

Table in (Vaswani et al., 2017)


Table in (Vaswani et al., 2017)
Evaluation for Transformer – parameter tuning
What is BERT (Bidirectional Encoder Representations from Transformers)?

Figure in (Devlin et al., 2018)


Input Representation
Hidden state corresponding to [CLS] will be
used as the sentence representation

• Token Embeddings: WordPiece embedding (Wu et al., 2016)


• Segment Embeddings: randomly initialised and learned; a single-sentence input only uses the sentence A embedding (EA)
• Position Embeddings: randomly initialised and learned (see the input-embedding sketch below)
Figure in (Devlin et al., 2018)
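A sketch of how the three embeddings are summed into the input representation. Sizes are toy values (BERT-Base uses a WordPiece vocabulary of roughly 30k entries and hidden size 768); the token ids are hypothetical, with 101/102 standing in for [CLS]/[SEP] as in the uncased vocabulary.

```python
import numpy as np

def bert_input_embeddings(token_ids, segment_ids, tok_emb, seg_emb, pos_emb):
    """BERT input = token embedding + segment embedding + position embedding (summed element-wise)."""
    positions = np.arange(len(token_ids))
    return tok_emb[token_ids] + seg_emb[segment_ids] + pos_emb[positions]

# Toy sizes; BERT-Base uses a ~30k WordPiece vocabulary and hidden size 768.
vocab, hidden, max_pos = 1000, 768, 512
tok_emb = np.random.randn(vocab, hidden) * 0.02
seg_emb = np.random.randn(2, hidden) * 0.02        # learned; two segment types (A / B)
pos_emb = np.random.randn(max_pos, hidden) * 0.02  # learned, unlike the Transformer's sinusoids

# Hypothetical ids for "[CLS] a b [SEP] c [SEP]".
token_ids   = np.array([101, 7, 8, 102, 9, 102])
segment_ids = np.array([0, 0, 0, 0, 1, 1])
print(bert_input_embeddings(token_ids, segment_ids, tok_emb, seg_emb, pos_emb).shape)  # (6, 768)
```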
Training tasks (1) - Masked Language Model
• Masked Language Model: Cloze task

• Masking(input_seq):
  For every input_seq:
  • Randomly select 15% of the tokens (not more than 20 per sequence)
    • For 80% of the time: replace the word with the [MASK] token.
    • For 10% of the time: replace the word with a random word.
    • For 10% of the time: keep the word unchanged.

• For related code see def create_masked_lm_predictions(…) in
  https://github.com/google-research/bert/blob/master/create_pretraining_data.py
  (a simplified sketch follows below)

Acknowledgement to the Figure from http://jalammar.github.io/illustrated-bert/
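A simplified, hedged sketch of the masking procedure above; the real create_masked_lm_predictions also handles WordPiece continuation tokens and other details that are omitted here.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, max_predictions=20):
    """Simplified 80/10/10 masking for the masked LM task (cf. create_masked_lm_predictions)."""
    tokens = list(tokens)
    candidates = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]
    random.shuffle(candidates)
    num_to_mask = min(max_predictions, max(1, round(len(candidates) * mask_prob)))
    labels = {}                                  # position -> original token to be predicted
    for i in candidates[:num_to_mask]:
        labels[i] = tokens[i]
        r = random.random()
        if r < 0.8:
            tokens[i] = "[MASK]"                 # 80% of the time: replace with [MASK]
        elif r < 0.9:
            tokens[i] = random.choice(vocab)     # 10% of the time: replace with a random word
        # else: 10% of the time, keep the word unchanged
    return tokens, labels

masked, labels = mask_tokens(["[CLS]", "my", "dog", "is", "hairy", "[SEP]"],
                             vocab=["cat", "house", "runs"])
print(masked, labels)
```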


Training tasks (2) – Next Sentence Prediction
• Next sentence prediction – binary classification
• For every input document, as a sentence-token 2D list:
  • Randomly select a split over the sentences:
    • Store segment A
    • For 50% of the time: sample a random sentence split from another document as segment B.
    • For 50% of the time: use the actual following sentences as segment B.
  • Masking(Truncate([segment A, segment B]))
• For related code see def create_instances_from_document(…) in
  https://github.com/google-research/bert/blob/master/create_pretraining_data.py
  (a simplified sketch follows below)

Acknowledgement to the Figure adapted from http://jalammar.github.io/illustrated-bert/
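A simplified sketch of the sentence-pair construction; the real create_instances_from_document also avoids sampling from the same document, packs sentences up to the maximum sequence length and truncates the pair, which is omitted here. Names are illustrative.

```python
import random

def make_nsp_instance(doc_sentences, all_documents):
    """Simplified next-sentence-prediction pairing (cf. create_instances_from_document)."""
    split = random.randrange(1, len(doc_sentences))          # random split point over the sentences
    segment_a = doc_sentences[:split]
    if random.random() < 0.5:                                # 50%: segment B from another document
        other = random.choice(all_documents)                 # (the real code avoids the same document)
        segment_b, is_next = other[:random.randrange(1, len(other) + 1)], False
    else:                                                    # 50%: the actual following sentences
        segment_b, is_next = doc_sentences[split:], True
    return segment_a, segment_b, is_next                     # then: Masking(Truncate([A, B]))

doc = [["the", "dog", "barked"], ["it", "was", "loud"]]
corpus = [doc, [["completely", "unrelated"], ["sentences", "here"]]]
print(make_nsp_instance(doc, corpus))
```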
Pre-Training datasets and details
• Training loss L is the sum of the mean masked LM likelihood and mean next sentence
prediction likelihood.

• Dataset: Long contiguous word sequences.


• BooksCorpus (800M words), about 7,000 unique unpublished books from a variety of
genres including Adventure,
Fantasy, and Romance.
• English Wikipedia (2,500M words), excluding lists, tables, headers.

• Sequence length 512; Batch size 256; trained for 1M steps (approximately 40 epochs);
learning rate 1e-4; Adam optimiser, 𝛽1 as 0.9, 𝛽2 as 0.999; dropout as 0.1 on all layers; GELU
activation; L2 weight decay of 0.01; learning rate warmup over the first 10,000 steps, linear
decay of learning rate …
• BERTBASE : N = 12, 𝑑model = 768, ℎ = 12, Total Parameters = 110M (see the configuration sketch after this slide)
• 4 cloud TPUs in Pod configuration (16 TPU chips total)

• BERTLARGE : N = 24, 𝑑model = 1024, ℎ = 16, Total Parameters=340M


• 16 Cloud TPUs (64 TPU chips total)

• Each pretraining took 4 days to complete.
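For reference, a sketch of a BERT-Base style configuration in the spirit of bert_config.json from google-research/bert; the field names follow that file, and the values are the commonly reported BERT-Base (uncased) settings rather than a verbatim copy of the released config.

```python
# Field names follow bert_config.json in google-research/bert; values are the commonly
# reported BERT-Base (uncased) settings, not a verbatim copy of the released file.
bert_base_config = {
    "vocab_size": 30522,                    # WordPiece vocabulary size (uncased model)
    "hidden_size": 768,                     # d_model
    "num_hidden_layers": 12,                # N (Transformer encoder layers)
    "num_attention_heads": 12,              # h
    "intermediate_size": 3072,              # position-wise feed-forward size
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "attention_probs_dropout_prob": 0.1,
    "max_position_embeddings": 512,         # maximum sequence length
    "type_vocab_size": 2,                   # segment A / B
    "initializer_range": 0.02,
}
```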


Fine-tuning with BERT

• Context vector 𝐶: take the final hidden state corresponding to the first token of the input, [CLS].
• Transform it into a probability distribution over the class labels with a single additional layer 𝑊:

  𝑃 = softmax(𝐶 𝑊^T)

  (a minimal sketch follows below)

Figure in (Devlin et al., 2018)
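A minimal sketch of the classification head: take the final hidden state of [CLS] and apply softmax(C·Wᵀ). Names and shapes are illustrative, and a bias term is included only for completeness.

```python
import numpy as np

def classify_with_cls(final_hidden_states, W, b):
    """P = softmax(C W^T + b), where C is the final hidden state of the [CLS] token."""
    C = final_hidden_states[0]             # [CLS] is the first token of the sequence
    logits = C @ W.T + b                   # W: (num_labels, hidden), b: (num_labels,)
    logits -= logits.max()
    probs = np.exp(logits)
    return probs / probs.sum()

hidden, num_labels = 768, 3
states = np.random.randn(6, hidden)        # e.g. BERT's final hidden states for a 6-token input
W = np.random.randn(num_labels, hidden) * 0.02
print(classify_with_cls(states, W, np.zeros(num_labels)))   # probability distribution over 3 labels
```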


Evaluation for BERT: GLUE
• General Language Understanding Evaluation (GLUE) benchmark: standard split of the data into train, validation and test sets, where the labels for the test set are only held on the evaluation server.

• Sentence pair tasks


• MNLI, Multi-Genre Natural Language Inference
• QQP, Quora Question Pairs
• QNLI, Question Natural Language Inference
• STS-B The Semantic Textual Similarity Benchmark
• MRPC Microsoft Research Paraphrase Corpus
• RTE Recognizing Textual Entailment
• WNLI Winograd NLI is a small natural language inference dataset
• Single sentence classification
• SST-2 The Stanford Sentiment Treebank
• CoLA The Corpus of Linguistic Acceptability
Evaluation for BERT: GLUE

Table in (Devlin et al., 2018)


Evaluation on SQUAD
• The Stanford Question Answering
Dataset (SQuAD) is a collection of 100k
crowdsourced question/answer pairs.

Table in (Devlin et al., 2018)


Evaluation on Named Entity Recognition
• The CoNLL 2003 Named Entity
Recognition (NER) dataset. This dataset
consists of 200k training words which
have been annotated as Person,
Organization, Location, Miscellaneous,
or Other (non-named entity).

Table in (Devlin et al., 2018)


Ablation Study (1) – on pre-train tasks

Table in (Devlin et al., 2018)


Ablation Study (2) – on model sizes

Table in (Devlin et al., 2018)


Ablation Study (3) – on pre-training steps

Figure in (Devlin et al., 2018)


Ablation Study (4) – using BERT as feature extractor
(without fine-tuning)

Table in (Devlin et al., 2018)


Why does BERT work?
• Leveraging huge unlabeled, high-quality data: ~7,000 books + English Wikipedia (together 3,300M words)

• Multi-head self-attention blocks in the Transformer:
  • modelling the intra- and extra-sentence word-word relations
  • parallelisable within an instance and thus efficient

• Task similarity: masked language modelling + next sentence prediction
How to improve BERT?
• Pre-training
• Better tasks for pre-training for more complex usage
• Better (larger, high-quality) data
• Cross-lingual BERT for unsupervised learning (Lample & Conneau, 2019)
• Even larger models, e.g. GPT-2: zero-shot results that outperform the SOTA (Radford et al., 2018b)

• Fine-tuning
• Better loss in fine-tuning
• Introduce new tasks in fine-tuning
An architecture for multi-label classification
(Dong, 2019)
[Figure: Joint Multi-Label Attention Network – the Title and the Sentences (in Content) are encoded
with Bi-GRUs; word-level and (title-guided) sentence-level attentions produce context vectors that
feed a sigmoid output layer trained with a cross-entropy loss L_CE plus a semantic-based loss
regularisation λ1·L_sim + λ2·L_sub.]
In H. Dong, W. Wang, K. Huang, F. Coenen. Joint Multi-Label Attention Networks for Social Text Annotation, in Proceedings of the 2019 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Volume 2 (Short Papers), Minneapolis, USA, 2-7 June 2019.
Is it possible? Any further thought?

[Figure: the same annotation setup with BERT as the encoder – the Title and the Sentences (in Content)
are fed into BERT, whose output goes through a FFNN + sigmoid layer trained with L_CE plus the
semantic-based loss regularisation λ1·L_sim + λ2·L_sub.]
Recommended Learning Resources
• Jay Alammar. The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning). Dec
2018. http://jalammar.github.io/illustrated-bert/
• Jay Alammar. The Illustrated Transformer. http://jalammar.github.io/illustrated-transformer/.
June 2018.
• Ashish Vaswani and Anna Huang. Transformers and Self-Attention For Generative Models.
Feb 2019. CS224n. Stanford University. http://web.stanford.edu/class/cs224n/slides/cs224n-
2019-lecture14-transformers.pdf
• Kevin Clark. Future of NLP + Deep Learning. Mar 2019. CS224n. Stanford University.
http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture20-future.pdf
• keitakurita. Paper Dissected: “BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding” Explained http://mlexplained.com/2019/01/07/paper-dissected-
bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding-explained/
• keitakurita. Paper Dissected: “Attention is All You Need” Explained
http://mlexplained.com/2017/12/29/attention-is-all-you-need-explained/
References
• Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional
transformers for language understanding. In Proceedings of the 2019 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long
Papers)
• Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep
Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long
Papers) (Vol. 1, pp. 2227-2237).
• Lample, G., & Conneau, A. (2019). Cross-lingual Language Model Pretraining. arXiv preprint
arXiv:1901.07291.
• Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018a). Improving Language Understanding by
Generative Pre-Training.
• Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2018b). Language models are
unsupervised multitask learners. Technical report, OpenAI.
• Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention
is all you need. In Advances in Neural Information Processing Systems (pp. 5998-6008).
• Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., ... & Klingner, J. (2016). Google's neural
machine translation system: Bridging the gap between human and machine translation. arXiv preprint
arXiv:1609.08144.
