
Transformer and its variants for NLP

Quan Thanh Tho


qttho@hcmut.edu.vn
Assoc. Prof. Quan Thanh Tho
Vice Dean
Faculty of Computer Science and Engineering
Ho Chi Minh City University of Technology (HCMUT)
Vietnam National University - Ho Chi Minh City
qttho@hcmut.edu.vn
http://www.cse.hcmut.edu.vn/qttho/

• BEng, HCMUT, Vietnam, 1998
• PhD, NTU, Singapore, 2006
• Research Interests: Artificial Intelligence, Natural Language Processing, intelligent systems, formal methods
NLP Milestones

Quan Thanh Tho, “Modern Approaches in Natural Language Processing”, VNU Journal of Science: Computer Science and Communication Engineering, 2022
Agenda
• Sequence data and sequence models
• Seq2Seq and attention
• Transformer model
• BERT and other variants
• Applications in NLP

Sequence data and sequence models

Sequence data
A series of data points that depend on each other
• Length can vary
• Positions matter
Problem of Standard Networks

• Inputs and outputs can have different lengths in different examples.
• Relations between positions are not well captured.
RNN comes to the rescue

RNN: an architecture tailored for sequence data:
1) It does not depend on the data length
2) It takes advantage of past information

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition, Volume 1: Foundations (pp. 318–362). MIT Press.
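
To make the recurrence concrete, here is a minimal NumPy sketch of an RNN step and forward pass. It is illustrative only (all names and sizes are assumptions), but it shows the two points above: the same cell is reused for any sequence length, and each hidden state carries past information forward.

    import numpy as np

    def rnn_step(h_prev, x_t, W_h, W_x, b):
        # One recurrence step: the new hidden state mixes the previous
        # state (past information) with the current input.
        return np.tanh(W_h @ h_prev + W_x @ x_t + b)

    def rnn_forward(x_seq, h0, W_h, W_x, b):
        # The same cell is applied at every position, so the network
        # does not depend on the sequence length.
        h, states = h0, []
        for x_t in x_seq:
            h = rnn_step(h, x_t, W_h, W_x, b)
            states.append(h)
        return states

    # Toy usage: hidden size 4, input size 3, sequence length 5
    rng = np.random.default_rng(0)
    W_h, W_x, b = rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), np.zeros(4)
    states = rnn_forward(rng.normal(size=(5, 3)), np.zeros(4), W_h, W_x, b)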
Seq2Seq and Attention

Intuition

Seq2Seq architecture
NLP researchers also employ this idea in designing a structure dubbed Sequence-to-Sequence (Seq2Seq), which extends the AutoEncoder architecture.

Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., & Fidler, S. (2015). Skip-thought vectors. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in neural information processing systems. Curran Associates, Inc.

Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. Advances in neural information processing systems, 27.
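
A hedged sketch of such an encoder-decoder in PyTorch (a GRU-based stand-in, not the exact architecture of either paper above; vocabulary sizes and dimensions are placeholders):

    import torch
    import torch.nn as nn

    class Seq2Seq(nn.Module):
        def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab, emb)
            self.tgt_emb = nn.Embedding(tgt_vocab, emb)
            self.encoder = nn.GRU(emb, hidden, batch_first=True)
            self.decoder = nn.GRU(emb, hidden, batch_first=True)
            self.out = nn.Linear(hidden, tgt_vocab)

        def forward(self, src_ids, tgt_ids):
            # The encoder compresses the whole source sentence into its
            # final hidden state (the "thought vector").
            _, h = self.encoder(self.src_emb(src_ids))
            # The decoder generates conditioned only on that single vector.
            dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), h)
            return self.out(dec_out)              # logits per target position

    model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
    logits = model(torch.randint(0, 1000, (2, 7)), torch.randint(0, 1200, (2, 5)))

Training would compare these logits against the shifted target tokens with cross-entropy; the single vector handed from encoder to decoder is exactly the bottleneck discussed next.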
Seq2Seq: The bottleneck problem
The entire source sentence must be compressed into a single fixed-size vector, which limits what the decoder can recover, especially for long inputs.
Seq2Seq with attention
At each decoding step, the decoder attends over all encoder hidden states instead of relying on a single summary vector (a sketch of this step follows below).
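
A hedged PyTorch sketch of the attention step itself (illustrative names and shapes): the current decoder state scores every encoder state and takes a weighted sum, so no single vector has to carry the whole sentence.

    import torch
    import torch.nn.functional as F

    def attend(dec_hidden, enc_states):
        # dec_hidden: (batch, hidden)           current decoder state (the "query")
        # enc_states: (batch, src_len, hidden)  all encoder hidden states
        scores = torch.bmm(enc_states, dec_hidden.unsqueeze(2)).squeeze(2)  # (batch, src_len)
        weights = F.softmax(scores, dim=1)       # attention distribution over source positions
        context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)    # (batch, hidden)
        return context, weights

    context, weights = attend(torch.randn(2, 128), torch.randn(2, 7, 128))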
Seq2Seq with another bottleneck
Even with attention, the recurrent encoder and decoder still process tokens one step at a time, so computation over a sequence cannot be parallelized.

Transformer model

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems. Curran Associates, Inc.
Inside an Encoder Block

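Since the block diagram is not reproduced here, the following hedged PyTorch sketch shows what one encoder block contains: multi-head self-attention followed by a position-wise feed-forward network, each wrapped in a residual connection and layer normalization (sizes follow the common d_model = 512, 8-head convention but are assumptions here).

    import torch
    import torch.nn as nn

    class EncoderBlock(nn.Module):
        def __init__(self, d_model=512, n_heads=8, d_ff=2048):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                    nn.Linear(d_ff, d_model))
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, x):
            # Self-attention sub-layer with residual connection + LayerNorm
            attn_out, _ = self.attn(x, x, x)
            x = self.norm1(x + attn_out)
            # Position-wise feed-forward sub-layer with residual + LayerNorm
            x = self.norm2(x + self.ff(x))
            return x

    block = EncoderBlock()
    y = block(torch.randn(2, 10, 512))   # (batch, seq_len, d_model) in and out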
Scaled Dot Product Attention

For comparison, in RNN-based Seq2Seq attention:
- Keys and Values are the same (the encoder hidden states)
- Queries are provided from the decoder
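
The formulation from the Transformer paper is Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, where d_k is the key dimension. A minimal NumPy sketch (no masking or batching; shapes are illustrative):

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)           # similarity of each query to each key
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
        return weights @ V                        # weighted sum of the values

    Q, K, V = np.random.randn(3, 64), np.random.randn(5, 64), np.random.randn(5, 64)
    out = scaled_dot_product_attention(Q, K, V)   # shape (3, 64)

Scaling by sqrt(d_k) keeps the dot products from growing with the dimension, which would otherwise push the softmax into regions with very small gradients.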
Self-Attention in Transformer
• Attention maps a query and a set of key-value pairs to an output
• The query, keys, values, and output are all vectors
Self-Attention

Image source: https://jalammar.github.io/illustrated-transformer/
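
In self-attention the queries, keys, and values are all linear projections of the same input sequence, so every position can attend to every other position. A hedged single-head PyTorch sketch (projection sizes are illustrative):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SelfAttention(nn.Module):
        def __init__(self, d_model=512):
            super().__init__()
            # Q, K, V are all computed from the same input x
            self.W_q = nn.Linear(d_model, d_model)
            self.W_k = nn.Linear(d_model, d_model)
            self.W_v = nn.Linear(d_model, d_model)

        def forward(self, x):                      # x: (batch, seq_len, d_model)
            Q, K, V = self.W_q(x), self.W_k(x), self.W_v(x)
            scores = Q @ K.transpose(-2, -1) / (Q.size(-1) ** 0.5)
            return F.softmax(scores, dim=-1) @ V   # each position attends to all positions

    out = SelfAttention()(torch.randn(2, 10, 512))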
BERT and other variants

Transformer-based Language Models
BERT
• Bidirectional Encoder Representations from Transformers.
• Uses the Transformer Encoder architecture.
• Introduced in 2018 by Google AI.

Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding (J. Burstein, C. Doran, & T. Solorio, Eds.).
Architecture
Pre-training
• Two unsupervised tasks:
1. Masked Language Model (see the toy sketch below)
2. Next Sentence Prediction
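
A toy illustration of the Masked Language Model objective, assuming whitespace tokenization and a fixed 15% masking rate (real BERT also sometimes replaces a chosen token with a random token or keeps it unchanged; that detail is omitted here):

    import random

    def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
        # Returns the corrupted input plus the positions/labels to predict.
        corrupted, labels = [], {}
        for i, tok in enumerate(tokens):
            if random.random() < mask_prob:
                labels[i] = tok              # the model must recover this token
                corrupted.append(mask_token)
            else:
                corrupted.append(tok)
        return corrupted, labels

    random.seed(0)
    print(mask_tokens("new york is a city".split()))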
Text Classification
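For text classification, BERT is typically fine-tuned with a small classification head on top of its encoder output. A hedged sketch with the Hugging Face transformers library (checkpoint name and label count are placeholders; the freshly added head still needs to be trained on labeled data before the predictions mean anything):

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)    # classification head is randomly initialized

    batch = tok(["the movie was great", "terrible service"],
                padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits        # (batch_size, num_labels)
    print(logits.argmax(dim=-1))              # predicted class per sentence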
GPT
• Generative Pre-trained Transformer
• Uses the Transformer Decoder architecture.
• Introduced in 2018 by OpenAI.

OpenAI. (2023). https://openai.com/ [Accessed: 2023-03-01]

How does it work?
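The decoder generates autoregressively: predict a distribution over the next token given everything produced so far, pick one, append it, and repeat. A hedged greedy-decoding sketch, assuming model is any causal language model whose forward pass returns .logits (as Hugging Face causal LMs do):

    import torch

    def greedy_generate(model, input_ids, max_new_tokens=20, eos_id=None):
        # input_ids: (1, prompt_len) tensor of token ids
        for _ in range(max_new_tokens):
            logits = model(input_ids).logits                  # (1, seq_len, vocab)
            next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
            input_ids = torch.cat([input_ids, next_id], dim=1)
            if eos_id is not None and next_id.item() == eos_id:
                break                                         # stop at end-of-sequence
        return input_ids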
XLNet
• Autoencoding (BERT):
  • [MASK] tokens do not appear during fine-tuning ⇒ pre-train/fine-tune discrepancy.
  • Assumes the predicted tokens are independent of each other given the unmasked tokens. Example: “New York is a city” ⇒ “[MASK] [MASK] is a city”
• Autoregressive (GPT):
  • Only trained to encode a unidirectional context (forward or backward).

Yang, Z. et al. (2019). XLNet: Generalized Autoregressive Pretraining for Language Understanding. NeurIPS.
XLNet
• XLNet combines the pros of both while avoiding their cons.
• Techniques:
  • Permutation Language Modeling (see the toy sketch below)
  • Two-Stream Self-Attention for Target-Aware Representations
  • Incorporating Ideas from Transformer-XL
  • Modeling Multiple Segments
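
A toy illustration of the Permutation Language Modeling idea in plain Python (not XLNet's actual implementation, which works through attention masks): sample a factorization order over positions and predict each token from the tokens that precede it in that order, so the model sees bidirectional context on average without ever inserting [MASK].

    import random

    def plm_prediction_contexts(tokens):
        # Sample one factorization order; each position is predicted from the
        # tokens that come before it in this order, not in the original order.
        order = list(range(len(tokens)))
        random.shuffle(order)
        contexts = []
        for i, pos in enumerate(order):
            visible = sorted(order[:i])           # positions already "revealed"
            contexts.append((tokens[pos], [tokens[j] for j in visible]))
        return order, contexts

    random.seed(1)
    print(plm_prediction_contexts("new york is a city".split()))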
Applications in NLP

NLP typical pipeline

NLP DL-based pipeline

Pre-trained Neural Language Model

ULMFiT (Howard and Ruder, 2018)


NLP LM-based pipeline
From BERT to BART
• BERT is not a full Seq2Seq model (i.e., not a generative model)
• BART is introduced as an extension/complement

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
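
Because BART keeps the full encoder-decoder, it can generate text directly. A hedged sketch with the Hugging Face transformers library, using the pre-trained denoising behavior to fill a masked span (the checkpoint name is only an example, and output quality is not guaranteed):

    from transformers import BartTokenizer, BartForConditionalGeneration

    tok = BartTokenizer.from_pretrained("facebook/bart-base")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

    # BART is pre-trained to reconstruct corrupted text, so it can fill the span.
    inputs = tok("HCMUT is a university in Ho Chi Minh <mask>.", return_tensors="pt")
    ids = model.generate(**inputs, max_new_tokens=20)
    print(tok.batch_decode(ids, skip_special_tokens=True))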
From PhoBERT to BARTPho

BARTPho for Vietnamese translation applications
• Pre-trained on Vietnamese
• Implicitly handles the “aligning” task
• More powerful when the target language is linguistically similar to Vietnamese (Chinese, Bahnaric, etc.)
A demo to conclude

• https://www.ura.hcmut.edu.vn/bahnar/nmt

Thank you
