Quan Thanh Tho, “Modern Approaches in Natural Language Processing”, VNU Journal of Science: Computer Science and Communication Engineering, 2022.
Agenda
• Sequence data and sequence models
• Seq2Seq and attention
• Transformer model
• BERT and other variants
• Applications in NLP
Sequence data and sequence models
Sequence data
A series of data points that depend on each other
• Length can be varied
• Positions matter
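A minimal sketch of what this looks like in practice: two toy sentences of different lengths, mapped to made-up token IDs and padded to a common length. The vocabulary and IDs below are purely illustrative.

```python
# Illustrative only: two sentences tokenized into word IDs of different lengths.
sentences = [["the", "cat", "sat"], ["the", "dog", "chased", "the", "cat"]]
vocab = {"<pad>": 0, "the": 1, "cat": 2, "sat": 3, "dog": 4, "chased": 5}

ids = [[vocab[w] for w in s] for s in sentences]      # variable lengths: 3 and 5
max_len = max(len(s) for s in ids)
padded = [s + [vocab["<pad>"]] * (max_len - len(s)) for s in ids]

print(padded)   # [[1, 2, 3, 0, 0], [1, 4, 5, 1, 2]]
# Positions matter: swapping "dog" and "cat" changes the meaning of the sequence.
```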
The Problem with Standard Networks
RNN comes to the rescue
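Below is a minimal sketch (not the lecture's exact model) of the recurrence a vanilla RNN applies: the same weights are reused at every timestep, so the network can consume sequences of any length. All sizes and weights here are arbitrary.

```python
import numpy as np

# Minimal vanilla RNN cell: h_t = tanh(W_x x_t + W_h h_{t-1} + b).
# Dimensions are arbitrary; weights are random for illustration.
rng = np.random.default_rng(0)
d_in, d_hid = 4, 8
W_x = rng.normal(size=(d_hid, d_in)) * 0.1
W_h = rng.normal(size=(d_hid, d_hid)) * 0.1
b = np.zeros(d_hid)

def rnn_forward(xs):
    """Process a variable-length sequence, reusing the same weights at every step."""
    h = np.zeros(d_hid)
    for x in xs:                        # sequence length can vary
        h = np.tanh(W_x @ x + W_h @ h + b)
    return h                            # final hidden state summarizes the sequence

sequence = rng.normal(size=(5, d_in))   # a toy sequence of 5 timesteps
print(rnn_forward(sequence).shape)      # (8,)
```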
Intuition
Seq2Seq architecture
NLP researchers applied this idea to design a structure dubbed Sequence-to-Sequence (Seq2Seq), which extends the AutoEncoder architecture.
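A minimal encoder-decoder sketch, assuming toy vocabulary and hidden sizes and GRU units (any recurrent cell would do): the encoder compresses the whole source sequence into one context vector, which initializes the decoder.

```python
import torch
import torch.nn as nn

# A minimal GRU-based Seq2Seq sketch with assumed toy sizes.
class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=100, tgt_vocab=100, emb=32, hid=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, context = self.encoder(self.src_emb(src_ids))     # context: (1, B, hid)
        dec_states, _ = self.decoder(self.tgt_emb(tgt_ids), context)
        return self.out(dec_states)                          # (B, T_tgt, tgt_vocab)

model = Seq2Seq()
logits = model(torch.randint(0, 100, (2, 7)), torch.randint(0, 100, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 100])
```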
Seq2Seq: The bottleneck problem
Seq2Seq with attention
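A sketch of the attention step, assuming Luong-style dot-product scoring: at each decoding step, the current decoder state scores every encoder state, and the softmax-weighted sum of encoder states becomes the context vector. Shapes below are illustrative.

```python
import torch
import torch.nn.functional as F

# Sketch of dot-product attention in one RNN Seq2Seq decoder step.
# B = batch size, S = source length, H = hidden size (toy values).
B, S, H = 2, 6, 64
encoder_states = torch.randn(B, S, H)   # one vector per source position
decoder_state = torch.randn(B, H)       # current decoder hidden state (the "query")

scores = torch.bmm(encoder_states, decoder_state.unsqueeze(2)).squeeze(2)  # (B, S)
weights = F.softmax(scores, dim=1)                                         # attention distribution
context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)       # (B, H) weighted sum

print(weights.shape, context.shape)  # torch.Size([2, 6]) torch.Size([2, 64])
```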
Seq2Seq with another bottleneck
Transformer model
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in Neural Information Processing Systems. Curran Associates, Inc.
Inside an Encoder Block
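A minimal sketch of one encoder block, with assumed sizes (d_model=64, 4 heads, feed-forward width 256): multi-head self-attention followed by a position-wise feed-forward network, each wrapped in a residual connection and LayerNorm.

```python
import torch
import torch.nn as nn

# Minimal Transformer encoder block sketch (sizes are illustrative assumptions).
class EncoderBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                       # x: (B, T, d_model)
        attn_out, _ = self.attn(x, x, x)        # queries, keys, values all come from x
        x = self.norm1(x + attn_out)            # residual connection + LayerNorm
        x = self.norm2(x + self.ff(x))          # residual connection + LayerNorm
        return x

block = EncoderBlock()
print(block(torch.randn(2, 10, 64)).shape)      # torch.Size([2, 10, 64])
```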
Scaled Dot Product Attention
RNN-based Seq2Seq with attention:
• Keys and Values are the same: the encoder hidden states
• Queries are provided by the decoder
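The scaled dot-product attention of Vaswani et al. (2017), Attention(Q, K, V) = softmax(QKᵀ / √d_k) V, written out directly; the tensor shapes are toy values chosen for illustration.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (Vaswani et al., 2017)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (..., T_q, T_k)
    weights = F.softmax(scores, dim=-1)                  # each row sums to 1
    return weights @ V, weights

# Toy shapes: 1 batch, 5 query positions, 6 key/value positions, d_k = d_v = 8.
Q, K, V = torch.randn(1, 5, 8), torch.randn(1, 6, 8), torch.randn(1, 6, 8)
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)   # torch.Size([1, 5, 8]) torch.Size([1, 5, 6])
```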
Self-Attention in Transformer
• Attention maps a query and a set of key-value pairs to an output
• The query, keys, values, and output are all vectors
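A single-head self-attention sketch: queries, keys, and values are all learned linear projections of the same input sequence. The sizes and the random input are illustrative.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

# Single-head self-attention: Q, K, V are projections of the same sequence x.
d_model = 64
W_q, W_k, W_v = (nn.Linear(d_model, d_model) for _ in range(3))

x = torch.randn(2, 10, d_model)          # (batch, sequence length, d_model)
Q, K, V = W_q(x), W_k(x), W_v(x)         # same source sequence for queries, keys, values

scores = Q @ K.transpose(-2, -1) / math.sqrt(d_model)
weights = F.softmax(scores, dim=-1)      # each token attends over all tokens
out = weights @ V                        # one output vector per token
print(out.shape)                         # torch.Size([2, 10, 64])
```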
Self-Attention
Image source: https://jalammar.github.io/illustrated-transformer/
BERT and other variants
Transformer-based Language Models
BERT
• Bidirectional Encoder Representations from Transformers.
• Uses the Transformer Encoder architecture.
• Introduced in 2018 by Google AI.
Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In J. Burstein, C. Doran, & T. Solorio (Eds.), Proceedings of NAACL-HLT 2019.
Architecture
Pretraining
• Two unsupervised tasks:
1. Masked Language Model
2. Next Sentence Prediction
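The Masked Language Model objective can be illustrated at inference time with the Hugging Face fill-mask pipeline and the public bert-base-uncased checkpoint (assuming the transformers library is installed and the model can be downloaded).

```python
from transformers import pipeline

# Masked Language Model objective, illustrated with the `fill-mask` pipeline
# and the public `bert-base-uncased` checkpoint.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for pred in fill_mask("The capital of France is [MASK]."):
    print(f"{pred['token_str']:>10}  {pred['score']:.3f}")
```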
Text Classification
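One common downstream use is sketched below with a Hugging Face pipeline; the distilbert-base-uncased-finetuned-sst-2-english checkpoint is chosen purely as an illustrative fine-tuned encoder classifier, not as the lecture's own model.

```python
from transformers import pipeline

# Text classification with a BERT-family encoder fine-tuned for sentiment analysis.
classifier = pipeline("text-classification",
                      model="distilbert-base-uncased-finetuned-sst-2-english")

print(classifier("The lecture on Transformers was excellent."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```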
GPT
• Generative Pre-trained Transformer
• Uses the Transformer Decoder architecture.
• Introduced in 2018 by OpenAI.
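A decoder-only model generates text by continuing a prompt token by token; a minimal illustration with the public gpt2 checkpoint via the Hugging Face pipeline (again assuming transformers is available).

```python
from transformers import pipeline

# Decoder-only (GPT-style) generation: the model continues a prompt token by token.
generator = pipeline("text-generation", model="gpt2")

print(generator("Natural language processing is", max_new_tokens=20)[0]["generated_text"])
```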
Typical NLP pipeline
DL-based NLP pipeline
Pre-trained Neural Language Model
LM-based NLP pipeline
From BERT to BART
• BERT is not a full Seq2Seq model (i.e. not a generative model)
• BART is introduced as an extension/complement
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and
Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation,
and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages
7871–7880, Online. Association for Computational Linguistics.
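Because BART has a full encoder-decoder, it can generate text. A small illustration with the public facebook/bart-large-cnn summarization checkpoint, chosen only as a convenient example of a Seq2Seq task that BERT alone cannot perform.

```python
from transformers import pipeline

# BART as a generative Seq2Seq model, using a public summarization checkpoint.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = ("BERT is an encoder-only model pretrained with masked language modeling, "
           "while BART combines a bidirectional encoder with an autoregressive decoder, "
           "making it suitable for generation tasks such as summarization and translation.")
print(summarizer(article, max_length=40, min_length=10)[0]["summary_text"])
```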
From PhoBERT to BARTPho
BARTPho for Vietnamese translation applications
• Pretrained on Vietnamese text
• Implicitly handles the "aligning" task
• More powerful when the target language is linguistically similar to Vietnamese (Chinese, Bahnaric, etc.)
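A minimal loading example, mirroring the usage published with BARTPho: vinai/bartpho-syllable is the released checkpoint, and actual translation (e.g. Vietnamese-Bahnaric) would still require fine-tuning on a parallel corpus, so this only extracts features as a sanity check.

```python
from transformers import AutoModel, AutoTokenizer

# Load the released BARTPho checkpoint from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")
bartpho = AutoModel.from_pretrained("vinai/bartpho-syllable")

inputs = tokenizer("Chúng tôi là những nghiên cứu viên.", return_tensors="pt")
features = bartpho(**inputs)              # encoder-decoder hidden states
print(features.last_hidden_state.shape)   # (1, sequence length, hidden size)
```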
A Demo to Conclude
• https://www.ura.hcmut.edu.vn/bahnar/nmt
Thank you