BERT and Transformer
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for
language understanding. (NAACL 2019)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need.
(NIPS 2017)
Acknowledgement to all used slides, figures, tables, equations and texts from the papers, blogs and codes!
Background image from http://ruder.io/nlp-imagenet/
Pre-training general language representations
• Feature-based approaches
• Non-neural word representations
• Neural embedding
• Word embedding: Word2Vec, Glove, …
• Sentence embedding, paragraph embedding, …
• Fine-tuning approaches
• OpenAI GPT (Generative Pre-trained Transformer) (Radford et al., 2018a)
• BERT (Bi-directional Encoder Representations from Transformers) (Devlin et al., 2018)
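The practical difference between the two families above can be sketched in a few lines of PyTorch; the encoder and head below are hypothetical toy stand-ins, not any of the models on these slides:

import torch
import torch.nn as nn

pretrained_encoder = nn.Embedding(30000, 256)   # stand-in for a pre-trained representation
task_head = nn.Linear(256, 2)                   # new task-specific classifier

# Feature-based (e.g. ELMo-style usage): freeze the pre-trained weights, train only the head.
for p in pretrained_encoder.parameters():
    p.requires_grad = False
feature_based_optimiser = torch.optim.Adam(task_head.parameters(), lr=1e-3)

# Fine-tuning (e.g. GPT/BERT-style): update the pre-trained weights jointly with the head.
for p in pretrained_encoder.parameters():
    p.requires_grad = True
fine_tuning_optimiser = torch.optim.Adam(
    list(pretrained_encoder.parameters()) + list(task_head.parameters()), lr=2e-5)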
Content
• ELMo (Peters et al., 2018)
• OpenAI GPT (Radford et al., 2018a)
• Transformer (especially self-attention) (Vaswani et al., 2017)
• BERT (Devlin et al., 2018)
• Analyses & Future Studies
ELMo: deep contextualised word representation
(Peters et al., 2018)
• “Instead of using a fixed embedding for each word, ELMo looks at the entire sentence before
assigning each word in it an embedding.”
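Concretely, ELMo's task-specific representation of token k is a learned weighted combination of the biLM layer states (equation from Peters et al., 2018):

\mathrm{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task} \, \mathbf{h}_{k,j}^{LM}

where \mathbf{h}_{k,j}^{LM} is the hidden state of biLM layer j for token k, the s_j^{task} are softmax-normalised layer weights, and \gamma^{task} is a scalar, all learned for the downstream task.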
Acknowledgement to slides from https://www.slideshare.net/shuntaroy/a-review-of-deep-contextualized-word-representations-peters-2018
OpenAI GPT (Generative Pre-trained
Transformer) – (1) pre-training
• Unsupervised pre-training, maximising the log-likelihood,
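For a token corpus \mathcal{U} = \{u_1, \ldots, u_n\} and a context window of size k, the pre-training objective is the standard language-modelling likelihood (Radford et al., 2018a):

L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)

modelled with a multi-layer Transformer decoder with parameters \Theta.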
Residual connection
& Layer normalisation
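Each Transformer sub-layer applies these two operations as (Vaswani et al., 2017):

\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))

where \mathrm{Sublayer}(x) is the self-attention or position-wise feed-forward function of that sub-layer.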
"The animal didn't cross the street because it was too wide" (the classic self-attention example: here "it" should attend to "the street")
Acknowledgement: figure from http://jalammar.github.io/illustrated-bert/; equation and figure from (Vaswani et al., 2017)
Self-attention (2)
h = 8, \quad d_k = d_v = d_{model} / h = 64
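These dimensions enter the scaled dot-product and multi-head attention of (Vaswani et al., 2017):

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^{O}, \quad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})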
• For any fixed offset 𝑘, 𝑃𝐸𝑝𝑜𝑠+𝑘 can be represented as a linear transformation of 𝑃𝐸𝑝𝑜𝑠 . This
would allow the model to easily learn to attend by relative positions.
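The sinusoidal positional encodings behind this property are (Vaswani et al., 2017):

PE_{(pos, 2i)} = \sin\!\left(pos / 10000^{2i/d_{model}}\right), \qquad PE_{(pos, 2i+1)} = \cos\!\left(pos / 10000^{2i/d_{model}}\right)

so each dimension is a sinusoid of a different wavelength, and PE_{pos+k} is a fixed linear function of PE_{pos}.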
The Transformer decoder was adopted by GPT.
Evaluation for Transformer
• Masking(input_seq):
  For every input_seq:
  • Randomly select 15% of the tokens (not more than 20 per sequence)
    • 80% of the time: replace the word with the [MASK] token.
    • 10% of the time: replace the word with a random word.
    • 10% of the time: keep the word unchanged.
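A minimal Python sketch of this masking procedure, assuming a plain list of token strings and a toy vocabulary (the real BERT code additionally handles WordPiece pieces and special tokens, which is omitted here):

import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, max_masked=20):
    # Returns (masked_tokens, labels); labels[i] holds the original token at masked positions, else None.
    tokens = list(tokens)
    labels = [None] * len(tokens)
    num_to_mask = min(max_masked, max(1, int(round(len(tokens) * mask_prob))))
    for i in random.sample(range(len(tokens)), num_to_mask):
        labels[i] = tokens[i]             # the model must predict the original token here
        r = random.random()
        if r < 0.8:                       # 80% of the time: replace with [MASK]
            tokens[i] = MASK_TOKEN
        elif r < 0.9:                     # 10% of the time: replace with a random word
            tokens[i] = random.choice(vocab)
        # remaining 10% of the time: keep the word unchanged
    return tokens, labels

# Toy usage
print(mask_tokens("my dog is hairy".split(), vocab=["the", "animal", "street", "wide"]))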
• Sequence length 512; batch size 256; trained for 1M steps (approximately 40 epochs); learning rate 1e-4 with the Adam optimiser (β1 = 0.9, β2 = 0.999); dropout of 0.1 on all layers; GELU activation; L2 weight decay of 0.01; learning-rate warmup over the first 10,000 steps, then linear decay of the learning rate … (a sketch of this schedule follows after this list)
• BERTBASE: L = 12 layers, 𝑑model (H) = 768, ℎ (A) = 12 attention heads, Total Parameters = 110M
• 4 cloud TPUs in Pod configuration (16 TPU chips total)
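A minimal sketch of the warmup-then-linear-decay learning-rate schedule listed above (peak rate, warmup steps and total steps are taken from the slide; decaying to exactly zero at the final step is an assumption for illustration):

PEAK_LR = 1e-4
WARMUP_STEPS = 10_000
TOTAL_STEPS = 1_000_000

def bert_lr(step):
    # Linear warmup to the peak learning rate, then linear decay (here: down to zero).
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR * max(0.0, (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS))

# e.g. bert_lr(5_000) == 5e-5 (half-way through warmup); bert_lr(1_000_000) == 0.0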
• Fine-tuning
• Better loss in fine-tuning
• Introduce new tasks in fine-tuning
An architecture for multi-label classification
(Dong, 2019)
[Figure: JMAN architecture — Bi-GRU encoders over the Title and the Sentences in the Content; word-level and sentence-level attentions (including title-guided attention); a sigmoid output layer over the document representation s_d producing the labels y_d with cross-entropy loss L_CE; plus semantic-based loss regularisation λ1·L_sim + λ2·L_sub.]
H. Dong, W. Wang, K. Huang, F. Coenen. Joint Multi-Label Attention Networks for Social Text Annotation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Volume 2 (Short Papers), Minneapolis, USA, 2-7 June 2019.
Is it possible? Any further thought?
[Figure: a possible BERT-based variant — the Title and the Sentences in the Content are encoded with BERT, and the resulting document representation c_d / s_d is fed to an FFNN + sigmoid layer producing the labels y_d with cross-entropy loss L_CE.]
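A minimal PyTorch sketch of the BERT + FFNN + sigmoid idea in the figure above; the hidden size of 768, the number of labels, and the stand-in encoder are illustrative assumptions, not the configuration of (Dong, 2019):

import torch
import torch.nn as nn

class BertMultiLabelHead(nn.Module):
    # FFNN + sigmoid head for multi-label classification on top of a BERT-style encoder.
    def __init__(self, encoder, hidden_size=768, num_labels=50):
        super().__init__()
        self.encoder = encoder                 # any module returning a [batch, hidden_size] document vector
        self.ffnn = nn.Linear(hidden_size, num_labels)
        self.loss_fn = nn.BCEWithLogitsLoss()  # sigmoid + per-label cross-entropy (the L_CE above)

    def forward(self, doc_inputs, labels=None):
        doc_repr = self.encoder(doc_inputs)    # e.g. the [CLS] / pooled representation c_d
        logits = self.ffnn(doc_repr)
        if labels is not None:
            return self.loss_fn(logits, labels.float()), torch.sigmoid(logits)
        return torch.sigmoid(logits)

# Toy usage with a stand-in encoder; a real run would plug in a pre-trained BERT here.
toy_encoder = nn.Sequential(nn.Linear(768, 768), nn.Tanh())
model = BertMultiLabelHead(toy_encoder)
loss, probs = model(torch.randn(4, 768), labels=torch.randint(0, 2, (4, 50)))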
Recommended Learning Resources
• Jay Alammar. The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning). Dec
2018. http://jalammar.github.io/illustrated-bert/
• Jay Alammar. The Illustrated Transformer. http://jalammar.github.io/illustrated-transformer/.
June 2018.
• Ashish Vaswani and Anna Huang. Transformers and Self-Attention For Generative Models.
Feb 2019. CS224n. Stanford University. http://web.stanford.edu/class/cs224n/slides/cs224n-
2019-lecture14-transformers.pdf
• Kevin Clark. Future of NLP + Deep Learning. Mar 2019. CS224n. Stanford University.
http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture20-future.pdf
• keitakurita. Paper Dissected: “BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding” Explained http://mlexplained.com/2019/01/07/paper-dissected-
bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding-explained/
• keitakurita. Paper Dissected: “Attention is All You Need” Explained
http://mlexplained.com/2017/12/29/attention-is-all-you-need-explained/
References
• Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional
transformers for language understanding. In Proceedings of the 2019 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long
Papers)
• Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep
Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long
Papers) (Vol. 1, pp. 2227-2237).
• Lample, G., & Conneau, A. (2019). Cross-lingual Language Model Pretraining. arXiv preprint
arXiv:1901.07291.
• Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018a). Improving Language Understanding by
Generative Pre-Training.
• Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2018b). Language models are
unsupervised multitask learners. Technical report, OpenAI.
• Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention
is all you need. In Advances in Neural Information Processing Systems (pp. 5998-6008).
• Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., ... & Klingner, J. (2016). Google's neural
machine translation system: Bridging the gap between human and machine translation. arXiv preprint
arXiv:1609.08144.