
Natural Language Processing

AC3110E

1
Chapter 10: Advanced Deep Learning
Techniques for Text

Lecturer: PhD. DO Thi Ngoc Diep


SCHOOL OF ELECTRICAL AND ELECTRONIC ENGINEERING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
Outline

• Transformer networks
• Transformers as Language Models
• Bidirectional Transformer Encoders
• Transfer Learning through Fine-Tuning

3
10.1. The transformer blocks

• The most common architecture for language modeling (as of 2022)
• non-recurrent networks
• handle distant information
• more efficient to implement at scale
• Made from stacks of transformer blocks
• Each block: a multilayer network made by combining simple linear layers,
feedforward networks, and self-attention layers
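
• A minimal PyTorch sketch of how one block might compose these pieces (the post-layer-norm ordering, dimensions, and module names are illustrative assumptions, not the exact layout of any specific model):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One transformer block: a self-attention sublayer and a feedforward
    sublayer, each wrapped in a residual connection plus layer normalization."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, attn_mask=None):             # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x, attn_mask=attn_mask)
        x = self.norm1(x + attn_out)                   # residual + layer norm
        x = self.norm2(x + self.ff(x))                 # residual + layer norm
        return x

x = torch.randn(2, 10, 512)
y = TransformerBlock()(x)    # stacking several such blocks (plus embeddings) gives the full network
```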

4
Single self-attention layer

• Extract and use information from arbitrarily large contexts without the need
to pass through intermediate recurrent connections
• A single causal self-attention layer:
• Input sequence: (x1,...,xn)
• Output sequence: (y1,...,yn)
• Self-attention: The output y is the
result of a straightforward
computation over the inputs
• The computations at each time step are
independent of all the other steps and
therefore can be performed in parallel.

• Simple dot-product based self-attention:
• y_i = Σ_{j≤i} α_{ij} x_j
• α_{ij} = softmax_{j≤i}( score(x_i, x_j) )
• score(x_i, x_j) = x_i · x_j
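
• A minimal NumPy sketch of this simple causal dot-product attention (variable names are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def simple_causal_self_attention(X):
    """X: (n, d) array of input embeddings; returns the (n, d) outputs y_i."""
    n, d = X.shape
    Y = np.zeros_like(X)
    for i in range(n):
        scores = np.array([X[i] @ X[j] for j in range(i + 1)])  # score(x_i, x_j) = x_i . x_j
        alpha = softmax(scores)                                  # attention weights over j <= i
        Y[i] = sum(alpha[j] * X[j] for j in range(i + 1))        # y_i = sum_j alpha_ij x_j
    return Y
```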

5
Single self-attention layer

• A single causal self-attention layer:


• Transformers consider 3 different roles for each input embedding
• Query: the current focus of attention, compared against all of the preceding inputs
(weight matrix W^Q ∈ ℝ^{d×d_k})
• Key: a preceding input being compared to the current focus of attention
(weight matrix W^K ∈ ℝ^{d×d_k})
• Value: used to compute the output for the current focus of attention
(weight matrix W^V ∈ ℝ^{d×d_v})

• Transformer self-attention:
• q_i = x_i W^Q
• k_i = x_i W^K
• v_i = x_i W^V
• y_i = Σ_{j≤i} α_{ij} v_j
• α_{ij} = softmax_{j≤i}( score(q_i, k_j) )
• Scaled dot-product:
score(q_i, k_j) = (q_i · k_j) / √d_k

6
Single self-attention layer

• A single causal self-attention layer:


• Transformer self-attention for an input matrix X of N input tokens:
• X ∈ ℝ^{N×d}
• Q = X W^Q
• K = X W^K
• V = X W^V
• Y ∈ ℝ^{N×d} = SelfAttention(Q, K, V) = softmax( Q K^T / √d_k ) V
• “Masked Attention”
• For language models, we must not look at the future when predicting a sequence
=> mask out attention to future words
• The N×N matrix of q_i · k_j scores => mask its upper-triangular portion by setting it to −∞
(which the softmax turns into zero)
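
• A NumPy sketch of the matrix form with the causal mask (dimensions and names are illustrative; a full layer would add multiple heads, residual connections, and layer norm):

```python
import numpy as np

def masked_self_attention(X, W_Q, W_K, W_V):
    """X: (N, d); W_Q, W_K: (d, d_k); W_V: (d, d_v). Returns Y: (N, d_v)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (N, N) matrix of q_i . k_j / sqrt(d_k)
    mask = np.triu(np.ones_like(scores), k=1)       # upper triangle = future positions
    scores = np.where(mask == 1, -np.inf, scores)   # mask out attention to future tokens
    # row-wise softmax: row i gives the weights alpha_{i,:}
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                              # Y = softmax(QK^T / sqrt(d_k)) V
```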

7
Multihead self-attention layer

• To capture the different kinds of parallel relations that hold among the inputs
• Parallel sets of self-attention layers, called heads
• Each head learns different aspects of the relationships that exist among inputs
at the same level of abstraction
• Each head i:
• W_i^Q ∈ ℝ^{d×d_k}, W_i^K ∈ ℝ^{d×d_k}, W_i^V ∈ ℝ^{d×d_v}
• head_i = SelfAttention(X W_i^Q, X W_i^K, X W_i^V)
• MultiHeadAttention(X) = (head_1 ⊕ head_2 ⊕ … ⊕ head_h) W^O, with W^O ∈ ℝ^{h·d_v × d}
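
• A standalone NumPy sketch of the multi-head computation (the causal mask from the previous sketch is omitted for brevity; head count and dimensions are illustrative):

```python
import numpy as np

def softmax_rows(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multihead_attention(X, heads, W_O):
    """X: (N, d); heads: list of (W_Q, W_K, W_V) tuples, one per head;
    W_O: (h*d_v, d). Returns (N, d)."""
    outputs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        scores = Q @ K.T / np.sqrt(Q.shape[-1])      # scaled dot-product inside each head
        outputs.append(softmax_rows(scores) @ V)     # head_i = SelfAttention(XW_Q, XW_K, XW_V)
    # concatenate the h head outputs and project back to the model dimension d
    return np.concatenate(outputs, axis=-1) @ W_O

# example shapes: d = 8, d_k = d_v = 4, h = 2 heads
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(2)]
W_O = rng.normal(size=(2 * 4, 8))
Y = multihead_attention(X, heads, W_O)               # (5, 8)
```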

8
Layer Normalization (layer norm)

• To improve training performance


• Hidden values are normalized to zero mean and a standard deviation of one within
each layer
• Keeping the values of a hidden layer in a bounded range facilitates gradient-based
training
• Input:
• a vector x to normalize, with dimensionality d_h
• Calculate:
• μ = (1/d_h) Σ_{i=1}^{d_h} x_i ;  σ = √( (1/d_h) Σ_{i=1}^{d_h} (x_i − μ)² )
• Output:
• LayerNorm(x) = γ x̂ + β = γ (x − μ)/σ + β
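
• The same computation as a short NumPy sketch (γ and β are the learned gain and bias; the small eps is a standard numerical-stability term not shown in the formula):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize a hidden vector x of dimensionality d_h to zero mean and unit
    standard deviation, then rescale with the learned parameters gamma and beta."""
    mu = x.mean()
    sigma = x.std()
    return gamma * (x - mu) / (sigma + eps) + beta
```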

9
Positional embedding

• To model the position of each token in the input sequence (the word order)
• Absolute position (index) representation
• Sinusoidal position representation
• Learned absolute position representations
• Relative linear position attention [Shaw et al., 2018]
• Dependency syntax-based position
• etc.
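
• A NumPy sketch of one of these options, the sinusoidal position representation from the original Transformer (assumes an even model dimension):

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model):
    """Return an (n_positions, d_model) matrix of sinusoidal position encodings:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(n_positions)[:, None]          # (n_positions, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# these encodings are added to the token embeddings:
# X = token_embeddings + sinusoidal_positions(N, d)
```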

10
10.1.1. Transformers as Language Models

• Given a training corpus of plain text, train the model autoregressively to
predict the next token y_t in the sequence, using the cross-entropy loss
• Each training item can be processed in parallel, since the output for each element in
the sequence is computed separately.
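
• A PyTorch sketch of this objective (the model and tensor names are illustrative assumptions; any decoder-only transformer returning per-position logits would fit):

```python
import torch
import torch.nn.functional as F

def lm_loss(model, token_ids):
    """token_ids: (batch, seq_len) tensor of training tokens.
    'model' is an assumed stand-in mapping (batch, T) token ids to
    (batch, T, vocab_size) logits."""
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]   # predict token t+1 from tokens <= t
    logits = model(inputs)                                   # all positions scored in one parallel pass
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))              # mean of -log P(y_t | y_<t)
```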

11
10.1.2. Bidirectional Transformer Encoders

• Bidirectional encoders allow the self-attention mechanism to range over the
entire input
=> BERT (Bidirectional Encoder Representations from Transformers)
• In processing each element of the sequence, the model attends to all inputs, both
before and after the current one
• q_i = x_i W^Q ; k_i = x_i W^K ; v_i = x_i W^V
• y_i = Σ_{j=1}^{n} α_{ij} v_j
• α_{ij} = softmax_{1≤j≤n}( score(q_i, k_j) )
• Scaled dot-product:
score(q_i, k_j) = (q_i · k_j) / √d_k
• The matrix showing the complete set of q_i · k_j comparisons is used with no
masking => bidirectional context
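
• A minimal sketch of obtaining such bidirectional contextual vectors from a pretrained BERT checkpoint via the Hugging Face transformers library (the checkpoint name and sentence are just examples):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The chicken didn't cross the road.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# one contextual embedding per input token, each computed from the full
# left and right context (no causal masking)
print(outputs.last_hidden_state.shape)   # (1, num_tokens, 768)
```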

12
Bidirectional Transformer Encoders training

• BERT (Bidirectional Encoder Representations from Transformers)


• Masked Language Modeling (MLM)
approach
• Instead of trying to predict the
next word, the model learns to predict
the missing element
• A random sample of tokens from each
training sequence is selected for
learning (15% of the input tokens)
• Once chosen, a token is used in one
of three ways:
• It is replaced with the unique vocabulary token [MASK]. (80%)
• It is replaced with another token from the vocabulary, randomly sampled based on token unigram
probabilities. (10%)
• It is left unchanged. (10%)
• The objective is to predict the original input for each of the masked tokens
=> generate a probability distribution over the vocabulary for each of the missing items (see the sketch below)
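
• A sketch of this token-selection scheme (assuming integer token ids, a [MASK] id, and a simple uniform stand-in for BERT's unigram sampling):

```python
import random

def mask_for_mlm(token_ids, vocab, mask_id, select_prob=0.15):
    """Return (corrupted inputs, labels); labels hold the original ids at the
    selected positions and None elsewhere (only selected positions are scored)."""
    inputs, labels = list(token_ids), [None] * len(token_ids)
    for pos, tok in enumerate(token_ids):
        if random.random() >= select_prob:
            continue                          # token not selected for learning
        labels[pos] = tok                     # predict the original token here
        r = random.random()
        if r < 0.8:
            inputs[pos] = mask_id             # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[pos] = random.choice(vocab)  # 10%: random token (unigram-sampled in BERT)
        # else 10%: leave the token unchanged
    return inputs, labels
```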

• SpanBERT: masking spans of words


• Next Sentence Prediction (NSP): a second pretraining objective in BERT
• RoBERTa: trains longer on more data and removes the NSP objective

(Devlin et al., 2019), (Joshi et al., 2020) 13


10.2. Transfer Learning through Fine-Tuning

• Transfer learning: Acquiring knowledge from one task or domain, and then applying it
(transferring it) to solve a new task
• Fine-tuning :
• Pretrained language models contain rich representations of word meaning => they can
be leveraged in other downstream applications through fine-tuning.
• Fine-tuning process:
• Add application-specific parameters on top of pre-trained models
• Use labeled data from the application to train these additional application-specific parameters
• Can freeze or make only minimal adjustments to the pretrained language model parameters
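
• A PyTorch-style sketch of this recipe with the pretrained parameters frozen (the encoder here is a stand-in module; dimensions and data are illustrative):

```python
import torch
import torch.nn as nn

d, num_classes = 768, 2
encoder = nn.Linear(300, d)          # stand-in for a pretrained encoder (illustrative)
head = nn.Linear(d, num_classes)     # application-specific parameters added on top

# freeze the pretrained parameters; only the new head is trained
for p in encoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# toy labeled batch standing in for downstream application data
x = torch.randn(8, 300)
y = torch.randint(0, num_classes, (8,))

with torch.no_grad():
    z = encoder(x)                   # fixed pretrained representation
loss = loss_fn(head(z), y)
loss.backward()                      # gradients flow only into the head
optimizer.step()
optimizer.zero_grad()
```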

14
10.2. Transfer Learning to Downstream Tasks

• Neural architecture influences the type of pretraining


• Encoder architecture
• Encoder-decoder architecture
• Decoder architecture

https://jalammar.github.io/illustrated-bert/ 15
Encoder architecture: BERT Fine-Tuning

• Sentiment classification
• Fine-tune a set of classification weights W_C ∈ ℝ^{K×d} (K = number of classes)
using supervised training data
• Can also update a limited number of the final transformer layers

• A special token ([CLS]) is added at the start of every input sequence; its final output vector serves as the sequence representation
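
• A sketch of the resulting classification computation over the output vector of that start token ([CLS]), assuming the Hugging Face transformers library; the linear layer plays the role of W_C:

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
num_classes = 2                                          # e.g. positive / negative sentiment
W_C = nn.Linear(bert.config.hidden_size, num_classes)    # the added classification weights

inputs = tokenizer("A delightful, well-acted film.", return_tensors="pt")
z_cls = bert(**inputs).last_hidden_state[:, 0]           # output vector of the prepended [CLS] token
probs = torch.softmax(W_C(z_cls), dim=-1)                # y = softmax(W_C z_CLS)
```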


16
Encoder architecture: BERT Fine-Tuning

• Part-of-speech tagging, BIO-based named entity recognition


• The final output vector corresponding to each input token is passed to a classifier that
produces a softmax distribution over the possible set of tags
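
• A short PyTorch sketch of such a tagging head (the tag-set size and tensor shapes are illustrative; the hidden tensor stands in for the encoder outputs):

```python
import torch
import torch.nn as nn

num_tags = 9                                 # e.g. a small BIO tag set for NER
hidden = torch.randn(1, 12, 768)             # stand-in for encoder outputs: (batch, tokens, d)

tag_head = nn.Linear(768, num_tags)          # one classifier shared across positions
tag_probs = torch.softmax(tag_head(hidden), dim=-1)   # a tag distribution per input token
predicted_tags = tag_probs.argmax(dim=-1)    # (batch, tokens): most likely tag per token
```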

17
Encoder architecture: BERT Fine-Tuning

• Span-oriented approach
• Named entity recognition, question answering, syntactic parsing, semantic role
labeling and co-reference resolution.

18
Encoder architecture: BERT Fine-Tuning

• Fine-tuning BERT also led to new state-of-the-art results on a broad range of tasks:
• QQP: Quora Question Pairs (detect paraphrase questions)
• QNLI: natural language inference over question answering data
• SST-2: sentiment analysis
• CoLA: Sentence acceptability judgment (detect whether sentences are grammatical.)
• STS-B: semantic textual similarity
• MRPC: Paraphrasing/sentence similarity
• RTE: a small natural language inference corpus

19
Encoder-Decoder architecture: pretrained model T5

• Google model T5
• Span corruption as the pretraining objective (see the sketch below)

• Lots of downstream tasks, all cast in the same text-to-text format (e.g., translation, summarization, question answering, classification)
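
• A sketch of what a span-corruption training pair looks like (the sentinel-token format follows the T5 paper; the sentence is only an example):

```python
# Span corruption: drop out contiguous spans of the input and replace each span
# with a sentinel token; the target reproduces the dropped spans in order.
original        = "Thank you for inviting me to your party last week."
corrupted_input = "Thank you <X> me to your party <Y> week."   # encoder input
target          = "<X> for inviting <Y> last <Z>"              # decoder output to generate
```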

20
Decoder architecture: GPT model

• Generative Pretrained Transformer (GPT)


• Type of large language model (LLM)
• Based on the transformer architecture, pre-trained on large data sets of un-labelled
text, and able to generate novel human-like content
• Transformer decoder with 12 layers, 117M parameters.
• 768-dimensional hidden states, 3072-dimensional feed-forward hidden layers.
• Byte-pair encoding with 40,000 merges
• Trained on BooksCorpus: over 7000 unique books.
• Contains long spans of contiguous text, for learning long-distance dependencies.
• GPT-2: a larger version (1.5B) of GPT trained on more data
• GPT-3: in-context learning
• 175 billion parameters
• Trained on 300B tokens of text
• GPT-4 (March 2023)
• basis for more task-specific GPT systems, including models fine-tuned for instruction
following (ChatGPT chatbot service)
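
• A minimal sketch of autoregressive generation with the publicly released GPT-2 weights via the Hugging Face transformers library (checkpoint and decoding settings are illustrative):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer("Natural language processing is", return_tensors="pt").input_ids
# autoregressive decoding: repeatedly predict the next token and append it
output_ids = model.generate(input_ids, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(output_ids[0]))
```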

OpenAI (Radford et al., 2018) 21


• end of Chapter 10

22
