
HKBK COLLEGE of ENGINEERING

Department of Computer Science and Engineering

A Technical Seminar On
“BIDIRECTIONAL ENCODER REPRESENTATIONS FROM TRANSFORMER”

By
Sheikh Junaid Nazir (1HK16CS143)

Guided By
J.Suneetha

Associate Professor

 
Contents
Introduction // Introduction to topic
Related Work // Literature Survey / Existing System
Methodology/Architecture // Proposed System, design process, implementation details
Applications/Usage // Uses
Advantages/Disadvantages
Conclusion and future scope // Summary
References
Introduction
BERT (Bidirectional Encoder Representations from Transformers) is a
recent paper published by researchers at Google AI Language.
It has caused a stir in the Machine Learning community by presenting
state-of-the-art results in a wide variety of NLP tasks, including
Question Answering (SQuAD v1.1), Natural Language Inference, and
others.
BERT’s key technical innovation is applying the bidirectional training
of Transformer, a popular attention model, to language modelling.
This is in contrast to previous efforts, which looked at a text sequence either from left to right or with combined left-to-right and right-to-left training.
The paper’s results show that a language model which is bidirectionally trained can have a deeper sense of language context and flow than single-direction language models.
Related work
BERT has its origins in pre-training contextual representations, including Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFiT.
Unlike previous models, BERT is a deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus. Context-free models such as word2vec or GloVe generate a single word embedding representation for each word in the vocabulary, whereas BERT generates a representation for each word that is informed by the words around it in both directions.



Methodology
Architecture:
The BERT architecture builds on top of the Transformer. Two variants are available:
BERT Base: 12 layers (transformer blocks), 12 attention heads, and 110 million parameters.
BERT Large: 24 layers (transformer blocks), 16 attention heads, and 340 million parameters.
The BERT Base architecture has the same model size as OpenAI’s GPT for comparison purposes. All of these Transformer layers are encoder-only blocks.
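As a quick illustration, the hedged sketch below (not part of the seminar) uses the Hugging Face transformers library to load the two published configurations and check the layer, head, and parameter counts listed above; the checkpoint names bert-base-uncased and bert-large-uncased are the standard Hub identifiers.

```python
# Illustrative sketch: inspect the two BERT variants with Hugging Face transformers.
from transformers import BertConfig, BertModel

for name in ["bert-base-uncased", "bert-large-uncased"]:
    config = BertConfig.from_pretrained(name)
    model = BertModel(config)  # random weights; BertModel.from_pretrained(name) loads trained ones
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {config.num_hidden_layers} layers, "
          f"{config.num_attention_heads} heads, ~{n_params / 1e6:.0f}M parameters")
```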
Proposed System:
On October 25, 2019, Google announced that it had started applying BERT models to its search algorithms within the US. On December 9, 2019, BERT was applied to search in over 70 languages.



Design Process and Implementation:
BERT makes use of Transformer, an attention mechanism
that learns contextual relations between words (or sub-
words) in a text.
In its vanilla form, Transformer includes two separate
mechanisms — an encoder that reads the text input and a
decoder that produces a prediction for the task.
 Since BERT’s goal is to generate a language model, only
the encoder mechanism is necessary. 
The chart below is a high-level description of the
Transformer encoder.



The input is a sequence of tokens, which are first
embedded into vectors and then processed in the neural
network. 
The output is a sequence of vectors of size H, in which
each vector corresponds to an input token with the same
index.
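For concreteness, the hedged sketch below shows this token-in, vector-out behaviour using the Hugging Face transformers library and the bert-base-uncased checkpoint (neither of which is specified in the seminar): one output vector of size H per input token.

```python
# Illustrative sketch: run a sentence through the BERT encoder and inspect the output shape.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The child came home from school.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector of size H (768 for BERT Base) per input token, in the same order.
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, 9, 768])
```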
When training language models, there is a challenge of
defining a prediction goal. Many models predict the next
word in a sequence (e.g. “The child came home from
___”), a directional approach which inherently limits
context learning.
To overcome this challenge, BERT uses two training
strategies:
Masked LM (MLM)
Before feeding word sequences into BERT, 15% of the
words in each sequence are replaced with a [MASK]
token.
The model then attempts to predict the original value of
the masked words, based on the context provided by the
other, non-masked, words in the sequence.
If the i-th token is chosen, BERT replaces the i-th token with:
1) the [MASK] token 80% of the time
2) a random token 10% of the time
3) the unchanged i-th token 10% of the time



The BERT loss function takes into consideration only the
prediction of the masked values and ignores the prediction
of the non-masked words.
 As a consequence, the model converges slower than
directional models, a characteristic which is offset by its
increased context awareness.
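A hedged sketch of this masking scheme follows; it is written directly from the description above (15% of positions selected, the 80/10/10 replacement rule, and a loss computed only on the selected positions) and is illustrative rather than the authors' exact implementation.

```python
# Illustrative MLM masking: select ~15% of tokens, apply the 80/10/10 rule,
# and mark all other positions so the loss ignores them.
import random
import torch

def mask_tokens(token_ids, mask_id, vocab_size, mlm_prob=0.15):
    input_ids = list(token_ids)
    labels = [-100] * len(token_ids)      # -100 = ignored by the loss (non-masked words)
    for i, tok in enumerate(token_ids):
        if random.random() < mlm_prob:    # ~15% of positions are chosen
            labels[i] = tok               # only chosen positions contribute to the loss
            r = random.random()
            if r < 0.8:
                input_ids[i] = mask_id                       # 80%: replace with [MASK]
            elif r < 0.9:
                input_ids[i] = random.randrange(vocab_size)  # 10%: replace with a random token
            # remaining 10%: keep the original token unchanged
    return torch.tensor(input_ids), torch.tensor(labels)
```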

Next Sentence Prediction (NSP)


In the BERT training process, the model receives pairs of
sentences as input and learns to predict if the second
sentence in the pair is the subsequent sentence in the
original document.



During training, 50% of the inputs are a pair in which the
second sentence is the subsequent sentence in the original
document, while in the other 50% a random sentence from
the corpus is chosen as the second sentence.
The assumption is that the random sentence will be
disconnected from the first sentence.
To help the model distinguish between the two sentences
in training, the input is processed in the following way
before entering the model:
1) A [CLS] token is inserted at the beginning of the first sentence and a [SEP] token is inserted at the end of each sentence.
2) A sentence embedding indicating Sentence A or Sentence B is added to each token. Sentence embeddings are similar in concept to token embeddings with a vocabulary of 2.
3) A positional embedding is added to each token to indicate its position in the sequence. The concept and implementation of positional embeddings are presented in the Transformer paper.
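The sketch below illustrates steps 1 and 2 with the Hugging Face tokenizer (an assumed tool, not part of the seminar): the pair is packed as [CLS] A [SEP] B [SEP], and the token_type_ids play the role of the Sentence A / Sentence B embedding; positional embeddings are added inside the model itself.

```python
# Illustrative sketch: how a sentence pair is packed before entering the model.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("The child came home from school.",
                "He went straight to his room.")

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))  # [CLS] ... [SEP] ... [SEP]
print(enc["token_type_ids"])  # 0 for Sentence A tokens, 1 for Sentence B tokens
```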



To predict if the second sentence is indeed connected to the
first, the following steps are performed:
1) The entire input sequence goes through the Transformer model.
2) The output of the [CLS] token is transformed into a 2×1 shaped vector, using a simple classification layer (learned matrices of weights and biases).
3) The probability of IsNextSequence is calculated with softmax.
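A minimal sketch of steps 2 and 3 is given below, assuming H = 768 (BERT Base) and using a stand-in tensor for the [CLS] output; it shows the learned classification layer followed by a softmax over the two outcomes.

```python
# Illustrative NSP head: a simple classification layer over the [CLS] output,
# followed by a softmax over {IsNext, NotNext}.
import torch
import torch.nn as nn

hidden_size = 768                          # H for BERT Base
cls_output = torch.randn(1, hidden_size)   # stand-in for the [CLS] vector from the encoder

nsp_head = nn.Linear(hidden_size, 2)       # learned matrix of weights and a bias
probs = torch.softmax(nsp_head(cls_output), dim=-1)
print(probs)                               # [P(IsNext), P(NotNext)]
```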

How to use BERT (Fine-tuning)


Using BERT for a specific task is relatively straightforward: BERT can be used for a wide variety of language tasks while adding only a small layer on top of the core model:



Classification tasks such as sentiment analysis are done
similarly to Next Sentence classification, by adding a
classification layer on top of the Transformer output for
the [CLS] token.
In Question Answering tasks (e.g. SQuAD v1.1), the
software receives a question regarding a text sequence and
is required to mark the answer in the sequence. Using
BERT, a Q&A model can be trained by learning two extra
vectors that mark the beginning and the end of the answer.
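The sketch below (assumed checkpoint names and the Hugging Face transformers API, not something prescribed by the seminar) shows both patterns: a classification layer over the [CLS] output, and a QA head whose two extra vectors score each token as the answer start or end. Note that these task heads are randomly initialised until the model is fine-tuned.

```python
# Illustrative fine-tuning setups: [CLS]-based classification and SQuAD-style QA.
from transformers import (BertTokenizer, BertForSequenceClassification,
                          BertForQuestionAnswering)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Sentiment-style classification: a small classification layer on the [CLS] output.
clf = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
enc = tokenizer("The movie was great.", return_tensors="pt")
print(clf(**enc).logits)  # untrained head: logits are meaningful only after fine-tuning

# Question answering: two extra vectors score each token as answer start / end.
qa = BertForQuestionAnswering.from_pretrained("bert-base-uncased")
enc = tokenizer("Where did the child come home from?",
                "The child came home from school.", return_tensors="pt")
out = qa(**enc)
start, end = out.start_logits.argmax().item(), out.end_logits.argmax().item()
print(tokenizer.decode(enc["input_ids"][0][start:end + 1]))
```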



Application/Usage
 Pre-trained models: models pretrained on a domain/application-specific corpus, such as BioBERT (biomedical text), SciBERT (scientific publications), and ClinicalBERT (clinical notes). Training on a domain-specific corpus has been shown to yield better performance when fine-tuning on downstream NLP tasks (e.g. NER) for those domains, compared with fine-tuning BERT (which was trained on BooksCorpus and Wikipedia); a loading sketch follows this list.



 A model pretrained on monolingual corpora from 104 languages (M-BERT) for zero-shot cross-lingual model transfer (task-specific annotations in one language are used to fine-tune the model for evaluation in another language).
 A model that jointly learns video and language representations (VideoBERT) by representing video frames as special descriptor tokens along with text for pretraining. This is used for video captioning.
 A model that combines the power of graph neural networks and BERT (G-BERT) for medical/diagnostic code representation and recommendation. Medical codes with hierarchical representations are encoded using GNNs and fed as input during pretraining with EHR data. This is then fine-tuned for making medical recommendations.
 Text classification
 Search engines
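As mentioned for the pre-trained domain models, a hedged loading sketch follows; the Hub identifier "dmis-lab/biobert-v1.1" is an assumed checkpoint name, and the number of NER labels is arbitrary for illustration.

```python
# Illustrative sketch: start from a domain-specific checkpoint (here BioBERT)
# and prepare it for fine-tuning on a downstream task such as biomedical NER.
from transformers import AutoTokenizer, AutoModelForTokenClassification

checkpoint = "dmis-lab/biobert-v1.1"   # assumed BioBERT checkpoint on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=5)
# The token-classification head is new and untrained; fine-tune it on an NER dataset.
```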



Advantages/Disadvantages
Advantages:
BERT is a bidirectional model based on the Transformer architecture; it replaces the sequential nature of Recurrent Neural Networks with a much faster attention-based approach.
BERT is bidirectional, which is quite interesting in certain NLP cases: it means the model can "read" the input in both directions. As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once.



BERT considers the "context", not just the single word (as window-based and n-gram models such as FastText do). In a few words, BERT considers the sentence and the sentences around that sentence. This allows the model to understand a particular word within a sentence, and the sentence itself within the surrounding passage.
BERT uses a minimal vocabulary, based on the idea of representing words as sub-words or n-grams. On average, a vocabulary of 8k to 30k sub-word units can represent almost any word in a large corpus. This has a significant advantage from a memory perspective.
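The sketch below illustrates the sub-word idea with the standard BERT WordPiece tokenizer (an assumed tool, used only for illustration): a vocabulary of roughly 30k pieces covers rare words by splitting them.

```python
# Illustrative sketch: a ~30k WordPiece vocabulary represents rare words as sub-word pieces.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.vocab_size)                # about 30,000 entries
print(tokenizer.tokenize("unbelievably"))  # split into sub-word pieces, e.g. ['un', '##bel', ...]
```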



Disadvantages
Will need to add non-pre-trained bidirectional model on
top
Right-to-left SQuAD model doesn’t see question
Need to train two models
Off-by-one: LTR predicts next word, RTL predicts
previous word
Not trivial to add arbitrary pre-training tasks.



Conclusion/Future Scope
To evaluate performance, we compared BERT to other state-
of-the-art NLP systems.
Importantly, BERT achieved all of its results with almost no task-specific changes to the neural network architecture. On SQuAD v1.1, BERT achieves a 93.2% F1 score (a measure of accuracy), surpassing the previous state-of-the-art score of 91.6% and the human-level score of 91.2%.
BERT is a groundbreaking technology, not only for NLP in
general but also for Google Search.
However, it's important to understand that BERT doesn't
change the way in which websites get ranked but simply
improves Google's understanding of natural language.
References
 R. H. Tu et al., "Berkeley reliability tools-BERT," in IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, vol. 12, no. 10, pp. 1524-1534, Oct. 1993.
 M. Sabbaghian and D. Falconer, "BERT Chart Analysis of Turbo Frequency Domain Equalization with
Imperfect Channel State Information," 2008 IEEE Wireless Communications and Networking Conference,
Las Vegas, NV, 2008, pp. 436-441.
 M. Munikar, S. Shakya and A. Shrestha, "Fine-grained Sentiment Classification using BERT," 2019
Artificial Intelligence for Transforming Business and Society (AITB), Kathmandu, Nepal, 2019, pp. 1-5.
 Z. Gao, A. Feng, X. Song and X. Wu, "Target-Dependent Sentiment Classification With BERT," in IEEE
Access, vol. 7, pp. 154290-154299, 2019.
 S. Yu, J. Su and D. Luo, "Improving BERT-Based Text Classification With Auxiliary Sentence and Domain
Knowledge," in IEEE Access, vol. 7, pp. 176600-176612, 2019.
 H. Huang, X. Jing, F. Wu, Y. Yao, X. Zhang and X. Dong, "DCNN-BiGRU Text Classification Model Based
on BERT Embedding," 2019 IEEE International Conferences on Ubiquitous Computing & Communications
(IUCC) and Data Science and Computational Intelligence (DSCI) and Smart Computing, Networking and
Services (SmartCNS), Shenyang, China, 2019, pp. 632-637.
 W. Li, S. Gao, H. Zhou, Z. Huang, K. Zhang and W. Li, "The Automatic Text Classification Method Based on
BERT and Feature Union," 2019 IEEE 25th International Conference on Parallel and Distributed Systems
(ICPADS), Tianjin, China, 2019, pp. 774-777.
 S. Liu, H. Tao and S. Feng, "Text Classification Research Based on Bert Model and Bayesian Network,"
2019 Chinese Automation Congress (CAC), Hangzhou, China, 2019, pp. 5842-5846.
