
Natural Language Processing

AC3110E

1
Chapter 10: Advanced Deep Learning
Techniques for Text

Lecturer: PhD. DO Thi Ngoc Diep


SCHOOL OF ELECTRICAL AND ELECTRONIC ENGINEERING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
Outline

• Transformer networks
• Transformers as Language Models
• Bidirectional Transformer Encoders
• Transfer Learning through Fine-Tuning

3
10.1. The transformer blocks

• The most common architecture for language modeling (as of 2022)
• non-recurrent networks
• handle distant information
• more efficient to implement at scale
• Made from stacks of transformer blocks
• Each block: a multilayer network made by combining simple linear layers,
feedforward networks, and self-attention layers
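
• A minimal PyTorch sketch of how one block might compose these pieces (the post-layer-norm ordering, dimensions, and module names are illustrative assumptions, not the exact layout of any specific model):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One transformer block: a self-attention sublayer and a feedforward
    sublayer, each wrapped in a residual connection plus layer normalization."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, attn_mask=None):             # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x, attn_mask=attn_mask)
        x = self.norm1(x + attn_out)                   # residual + layer norm
        x = self.norm2(x + self.ff(x))                 # residual + layer norm
        return x

x = torch.randn(2, 10, 512)
y = TransformerBlock()(x)    # stacking several such blocks (plus embeddings) gives the full network
```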

4
Single self-attention layer

• Extract and use information from arbitrarily large contexts without the need
to pass through intermediate recurrent connections
• A single causal self-attention layer:
• Input sequence: (x1,...,xn)
• Output sequence: (y1,...,yn)
• Self-attention: The output y is the
result of a straightforward
computation over the inputs
• The computations at each time step are
independent of all the other steps and
therefore can be performed in parallel.

• Simple dot-product based self-attention:
• y_i = Σ_{j≤i} α_{ij} x_j
• α_{ij} = softmax_{j≤i}( score(x_i, x_j) )
• score(x_i, x_j) = x_i · x_j
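
• A minimal NumPy sketch of this simple causal dot-product attention (variable names are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def simple_causal_self_attention(X):
    """X: (n, d) array of input embeddings; returns the (n, d) outputs y_i."""
    n, d = X.shape
    Y = np.zeros_like(X)
    for i in range(n):
        scores = np.array([X[i] @ X[j] for j in range(i + 1)])  # score(x_i, x_j) = x_i . x_j
        alpha = softmax(scores)                                  # attention weights over j <= i
        Y[i] = sum(alpha[j] * X[j] for j in range(i + 1))        # y_i = sum_j alpha_ij x_j
    return Y
```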

5
Single self-attention layer

• A single causal self-attention layer:


• Transformers consider 3 different roles for each input embedding
• Query: the current focus of attention, compared against all of the preceding inputs
(weight matrix W^Q ∈ ℝ^{d×d_k})
• Key: a preceding input being compared to the current focus of attention
(weight matrix W^K ∈ ℝ^{d×d_k})
• Value: used to compute the output for the current focus of attention
(weight matrix W^V ∈ ℝ^{d×d_v})

• Transformer self-attention:
• q_i = x_i W^Q
• k_i = x_i W^K
• v_i = x_i W^V
• y_i = Σ_{j≤i} α_{ij} v_j
• α_{ij} = softmax_{j≤i}( score(q_i, k_j) )
• Scaled dot-product:
score(q_i, k_j) = (q_i · k_j) / √d_k

6
Single self-attention layer

• A single causal self-attention layer:


• Transformer self-attention for an input matrix X of N input tokens:
• X ∈ ℝ^{N×d}
• Q = X W^Q
• K = X W^K
• V = X W^V
• Y ∈ ℝ^{N×d} = SelfAttention(Q, K, V) = softmax( Q K^T / √d_k ) V
• “Masked Attention”
• For language models, we must not look at the future when predicting a sequence
=> mask out attention to future words
• The N×N matrix of q_i · k_j scores => mask its upper-triangular portion by setting it to −∞
(which the softmax turns into zero)
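
• A NumPy sketch of the matrix form with the causal mask (dimensions and names are illustrative; a full layer would add multiple heads, residual connections, and layer norm):

```python
import numpy as np

def masked_self_attention(X, W_Q, W_K, W_V):
    """X: (N, d); W_Q, W_K: (d, d_k); W_V: (d, d_v). Returns Y: (N, d_v)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (N, N) matrix of q_i . k_j / sqrt(d_k)
    mask = np.triu(np.ones_like(scores), k=1)       # upper triangle = future positions
    scores = np.where(mask == 1, -np.inf, scores)   # mask out attention to future tokens
    # row-wise softmax: row i gives the weights alpha_{i,:}
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                              # Y = softmax(QK^T / sqrt(d_k)) V
```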

7
Multihead self-attention layer

• To capture the different kinds of parallel relations that hold among the inputs
• Parallel sets of self-attention layers, called heads
• Each head learns different aspects of the relationships that exist among inputs
at the same level of abstraction
• Each head i:
• W_i^Q ∈ ℝ^{d×d_k}, W_i^K ∈ ℝ^{d×d_k}, W_i^V ∈ ℝ^{d×d_v}
• head_i = SelfAttention(X W_i^Q, X W_i^K, X W_i^V)
• MultiHeadAttention(X) = (head_1 ⊕ head_2 ⊕ … ⊕ head_h) W^O, with W^O ∈ ℝ^{h·d_v × d}
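
• A standalone NumPy sketch of the multi-head computation (the causal mask from the previous sketch is omitted for brevity; head count and dimensions are illustrative):

```python
import numpy as np

def softmax_rows(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multihead_attention(X, heads, W_O):
    """X: (N, d); heads: list of (W_Q, W_K, W_V) tuples, one per head;
    W_O: (h*d_v, d). Returns (N, d)."""
    outputs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        scores = Q @ K.T / np.sqrt(Q.shape[-1])      # scaled dot-product inside each head
        outputs.append(softmax_rows(scores) @ V)     # head_i = SelfAttention(XW_Q, XW_K, XW_V)
    # concatenate the h head outputs and project back to the model dimension d
    return np.concatenate(outputs, axis=-1) @ W_O

# example shapes: d = 8, d_k = d_v = 4, h = 2 heads
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(2)]
W_O = rng.normal(size=(2 * 4, 8))
Y = multihead_attention(X, heads, W_O)               # (5, 8)
```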

8
Layer Normalization (layer norm)

• To improve training performance


• Hidden values are normalized to zero mean and a standard deviation of one within
each layer
• Keeping the values of a hidden layer in a bounded range facilitates gradient-based
training
• Input:
• a vector x to normalize, with dimensionality d_h
• Calculate:
• μ = (1/d_h) Σ_{i=1}^{d_h} x_i ;  σ = √( (1/d_h) Σ_{i=1}^{d_h} (x_i − μ)² )
• Output:
• LayerNorm(x) = γ x̂ + β = γ (x − μ)/σ + β
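
• The same computation as a short NumPy sketch (γ and β are the learned gain and bias; the small eps is a standard numerical-stability term not shown in the formula):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize a hidden vector x of dimensionality d_h to zero mean and unit
    standard deviation, then rescale with the learned parameters gamma and beta."""
    mu = x.mean()
    sigma = x.std()
    return gamma * (x - mu) / (sigma + eps) + beta
```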

9
Positional embedding

• To model the position of each token in the input sequence (the word order)
• Absolute position (index) representation
• Sinusoidal position representation
• Learned absolute position representations
• Relative linear position attention [Shaw et al., 2018]
• Dependency syntax-based position
• etc.
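
• A NumPy sketch of one of these options, the sinusoidal position representation from the original Transformer (assumes an even model dimension):

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model):
    """Return an (n_positions, d_model) matrix of sinusoidal position encodings:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(n_positions)[:, None]          # (n_positions, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# these encodings are added to the token embeddings:
# X = token_embeddings + sinusoidal_positions(N, d)
```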

10
10.1.1. Transformers as Language Models

• Given a training corpus of plain text, train the model autoregressively to
predict the next token y_t in the sequence, using the cross-entropy loss
• Each training item can be processed in parallel, since the output for each element in
the sequence is computed separately.
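
• A PyTorch sketch of this objective (the model and tensor names are illustrative assumptions; any decoder-only transformer returning per-position logits would fit):

```python
import torch
import torch.nn.functional as F

def lm_loss(model, token_ids):
    """token_ids: (batch, seq_len) tensor of training tokens.
    'model' is an assumed stand-in mapping (batch, T) token ids to
    (batch, T, vocab_size) logits."""
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]   # predict token t+1 from tokens <= t
    logits = model(inputs)                                   # all positions scored in one parallel pass
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))              # mean of -log P(y_t | y_<t)
```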

11
10.1.2. Bidirectional Transformer Encoders

• Bidirectional encoders allow the self-attention mechanism to range over the
entire input
=> BERT (Bidirectional Encoder Representations from Transformers)
• In processing each element of the sequence, the model attends to all inputs, both
before and after the current one
• q_i = x_i W^Q ; k_i = x_i W^K ; v_i = x_i W^V
• y_i = Σ_{j=1}^{n} α_{ij} v_j
• α_{ij} = softmax_{1≤j≤n}( score(q_i, k_j) )
• Scaled dot-product:
score(q_i, k_j) = (q_i · k_j) / √d_k
• The matrix showing the complete set of q_i · k_j comparisons is used with no
masking => bidirectional context
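
• A minimal sketch of obtaining such bidirectional contextual vectors from a pretrained BERT checkpoint via the Hugging Face transformers library (the checkpoint name and sentence are just examples):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The chicken didn't cross the road.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# one contextual embedding per input token, each computed from the full
# left and right context (no causal masking)
print(outputs.last_hidden_state.shape)   # (1, num_tokens, 768)
```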

12
Bidirectional Transformer Encoders training

• BERT (Bidirectional Encoder Representations from Transformers)


• Masked Language Modeling (MLM)
approach
• Instead of trying to predict the
next word, the model learns to predict
the missing element
• A random sample of tokens from each
training sequence is selected for
learning (15% of the input tokens)
• Once chosen, a token is used in one
of three ways:
• It is replaced with the unique vocabulary token [MASK]. (80%)
• It is replaced with another token from the vocabulary, randomly sampled based on token unigram
probabilities. (10%)
• It is left unchanged. (10%)
• The objective is to predict the original input for each of the masked tokens
=> generate a probability distribution over the vocabulary for each of the missing items (see the sketch below)
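
• A sketch of this token-selection scheme (assuming integer token ids, a [MASK] id, and a simple uniform stand-in for BERT's unigram sampling):

```python
import random

def mask_for_mlm(token_ids, vocab, mask_id, select_prob=0.15):
    """Return (corrupted inputs, labels); labels hold the original ids at the
    selected positions and None elsewhere (only selected positions are scored)."""
    inputs, labels = list(token_ids), [None] * len(token_ids)
    for pos, tok in enumerate(token_ids):
        if random.random() >= select_prob:
            continue                          # token not selected for learning
        labels[pos] = tok                     # predict the original token here
        r = random.random()
        if r < 0.8:
            inputs[pos] = mask_id             # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[pos] = random.choice(vocab)  # 10%: random token (unigram-sampled in BERT)
        # else 10%: leave the token unchanged
    return inputs, labels
```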

• SpanBERT: masking spans of words


• Next Sentence Prediction (NSP): a second pretraining objective in BERT
• RoBERTa: trains longer on more data and removes the NSP objective

(Devlin et al., 2019), (Joshi et al., 2020) 13


10.2. Transfer Learning through Fine-Tuning

• Transfer learning: Acquiring knowledge from one task or domain, and then applying it
(transferring it) to solve a new task
• Fine-tuning :
• Pretrained language models contain rich representations of word meaning => they can
be leveraged in other downstream applications through fine-tuning.
• Fine-tuning process:
• Add application-specific parameters on top of pre-trained models
• Use labeled data from the application to train these additional application-specific parameters
• Can freeze or make only minimal adjustments to the pretrained language model parameters
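
• A PyTorch-style sketch of this recipe with the pretrained parameters frozen (the encoder here is a stand-in module; dimensions and data are illustrative):

```python
import torch
import torch.nn as nn

d, num_classes = 768, 2
encoder = nn.Linear(300, d)          # stand-in for a pretrained encoder (illustrative)
head = nn.Linear(d, num_classes)     # application-specific parameters added on top

# freeze the pretrained parameters; only the new head is trained
for p in encoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# toy labeled batch standing in for downstream application data
x = torch.randn(8, 300)
y = torch.randint(0, num_classes, (8,))

with torch.no_grad():
    z = encoder(x)                   # fixed pretrained representation
loss = loss_fn(head(z), y)
loss.backward()                      # gradients flow only into the head
optimizer.step()
optimizer.zero_grad()
```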

14
10.2. Transfer Learning to Downstream Tasks

• Neural architecture influences the type of pretraining


• Encoder architecture
• Encoder-decoder architecture
• Decoder architecture

https://jalammar.github.io/illustrated-bert/ 15
Encoder architecture: BERT Fine-Tuning

• Sentiment classification
• Fine-tune a set of classification weights W_C ∈ ℝ^{K×d} (K = number of classes)
using supervised training data
• Can also update a limited number of the final transformer layers

• A special token ([CLS]) is added at the start of every input sequence; its final output vector serves as the sequence representation
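
• A sketch of the resulting classification computation over the output vector of that start token ([CLS]), assuming the Hugging Face transformers library; the linear layer plays the role of W_C:

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
num_classes = 2                                          # e.g. positive / negative sentiment
W_C = nn.Linear(bert.config.hidden_size, num_classes)    # the added classification weights

inputs = tokenizer("A delightful, well-acted film.", return_tensors="pt")
z_cls = bert(**inputs).last_hidden_state[:, 0]           # output vector of the prepended [CLS] token
probs = torch.softmax(W_C(z_cls), dim=-1)                # y = softmax(W_C z_CLS)
```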


16
Encoder architecture: BERT Fine-Tuning

• Part-of-speech tagging, BIO-based named entity recognition


• The final output vector corresponding to each input token is passed to a classifier that
produces a softmax distribution over the possible set of tags
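
• A short PyTorch sketch of such a tagging head (the tag-set size and tensor shapes are illustrative; the hidden tensor stands in for the encoder outputs):

```python
import torch
import torch.nn as nn

num_tags = 9                                 # e.g. a small BIO tag set for NER
hidden = torch.randn(1, 12, 768)             # stand-in for encoder outputs: (batch, tokens, d)

tag_head = nn.Linear(768, num_tags)          # one classifier shared across positions
tag_probs = torch.softmax(tag_head(hidden), dim=-1)   # a tag distribution per input token
predicted_tags = tag_probs.argmax(dim=-1)    # (batch, tokens): most likely tag per token
```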

17
Encoder architecture: BERT Fine-Tuning

• Span-oriented approach
• Named entity recognition, question answering, syntactic parsing, semantic role
labeling and co-reference resolution.

18
Encoder architecture: BERT Fine-Tuning

• Fine-tuning BERT also led to new state-of-the-art results on a broad range of tasks:
• QQP: Quora Question Pairs (detect paraphrase questions)
• QNLI: natural language inference over question answering data
• SST-2: sentiment analysis
• CoLA: Sentence acceptability judgment (detect whether sentences are grammatical.)
• STS-B: semantic textual similarity
• MRPC: Paraphrasing/sentence similarity
• RTE: a small natural language inference corpus

19
Encoder-Decoder architecture: pretrained model T5

• Google model T5
• Span corruption as the pretraining objective (see the sketch below)

• Lots of downstream tasks, all cast in the same text-to-text format (e.g., translation, summarization, question answering, classification)
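
• A sketch of what a span-corruption training pair looks like (the sentinel-token format follows the T5 paper; the sentence is only an example):

```python
# Span corruption: drop out contiguous spans of the input and replace each span
# with a sentinel token; the target reproduces the dropped spans in order.
original        = "Thank you for inviting me to your party last week."
corrupted_input = "Thank you <X> me to your party <Y> week."   # encoder input
target          = "<X> for inviting <Y> last <Z>"              # decoder output to generate
```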

20
Decoder architecture: GPT model

• Generative Pretrained Transformer (GPT)


• Type of large language model (LLM)
• Based on the transformer architecture, pre-trained on large data sets of un-labelled
text, and able to generate novel human-like content
• Transformer decoder with 12 layers, 117M parameters.
• 768-dimensional hidden states, 3072-dimensional feed-forward hidden layers.
• Byte-pair encoding with 40,000 merges
• Trained on BooksCorpus: over 7000 unique books.
• Contains long spans of contiguous text, for learning long-distance dependencies.
• GPT-2: a larger version (1.5B) of GPT trained on more data
• GPT-3: in-context learning
• 175 billion parameters
• Trained on 300B tokens of text
• GPT-4 (March 2023)
• basis for more task-specific GPT systems, including models fine-tuned for instruction
following (ChatGPT chatbot service)
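
• A minimal sketch of autoregressive generation with the publicly released GPT-2 weights via the Hugging Face transformers library (checkpoint and decoding settings are illustrative):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer("Natural language processing is", return_tensors="pt").input_ids
# autoregressive decoding: repeatedly predict the next token and append it
output_ids = model.generate(input_ids, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(output_ids[0]))
```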

OpenAI (Radford et al., 2018) 21


• end of Chapter 10

22
