
HKBK COLLEGE of ENGINEERING

Department of Computer Science and Engineering

A Technical Seminar On
“BIDIRECTIONAL ENCODER REPRESENTATIONS FROM TRANSFORMER”

By
Sheikh Junaid Nazir (1HK16CS143)

Guided By
J.Suneetha

Associate Professor

 
Contents
Introduction // Introduction to topic
Related Work // Literature Survey / Existing System
Methodology/Architecture // Proposed System, design process, implementation details
Applications/Usage // Uses
Advantages/Disadvantages
Conclusion and future scope // Summary
References
Introduction
BERT (Bidirectional Encoder Representations from Transformers) is a
recent paper published by researchers at Google AI Language.
It has caused a stir in the Machine Learning community by presenting
state-of-the-art results in a wide variety of NLP tasks, including
Question Answering (SQuAD v1.1), Natural Language Inference, and
others.
BERT’s key technical innovation is applying the bidirectional training
of Transformer, a popular attention model, to language modelling.
This is in contrast to previous efforts, which looked at a text sequence either from left to right or with combined left-to-right and right-to-left training.
The paper’s results show that a language model which is bidirectionally trained can have a deeper sense of language context and flow than single-direction language models.
Related work
BERT has its origins in pre-training contextual representations, including Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFiT.
Unlike previous models, BERT is a deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus. Context-free models such as word2vec or GloVe generate a single word embedding representation for each word in the vocabulary, whereas BERT generates a representation for each word that is informed by the words around it in both directions.



Methodology
Architecture:
The BERT architecture builds on top of the Transformer. Two variants are available:
BERT Base: 12 layers (transformer blocks), 12 attention heads, and 110 million parameters.
BERT Large: 24 layers (transformer blocks), 16 attention heads, and 340 million parameters.
The BERT Base architecture has the same model size as OpenAI’s GPT for comparison purposes. All of these Transformer layers are encoder-only blocks.
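As a quick illustration, the hedged sketch below (not part of the seminar) uses the Hugging Face transformers library to load the two published configurations and check the layer, head, and parameter counts listed above; the checkpoint names bert-base-uncased and bert-large-uncased are the standard Hub identifiers.

```python
# Illustrative sketch: inspect the two BERT variants with Hugging Face transformers.
from transformers import BertConfig, BertModel

for name in ["bert-base-uncased", "bert-large-uncased"]:
    config = BertConfig.from_pretrained(name)
    model = BertModel(config)  # random weights; BertModel.from_pretrained(name) loads trained ones
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {config.num_hidden_layers} layers, "
          f"{config.num_attention_heads} heads, ~{n_params / 1e6:.0f}M parameters")
```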
Proposed System:
On October 25, 2019, Google announced that it had started applying BERT models to its search algorithms within the US. On December 9, 2019, BERT was applied to search in over 70 languages.



Design Process and Implementation:
BERT makes use of Transformer, an attention mechanism
that learns contextual relations between words (or sub-
words) in a text.
In its vanilla form, Transformer includes two separate
mechanisms — an encoder that reads the text input and a
decoder that produces a prediction for the task.
 Since BERT’s goal is to generate a language model, only
the encoder mechanism is necessary. 
The chart below is a high-level description of the
Transformer encoder.



The input is a sequence of tokens, which are first
embedded into vectors and then processed in the neural
network. 
The output is a sequence of vectors of size H, in which
each vector corresponds to an input token with the same
index.
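For concreteness, the hedged sketch below shows this token-in, vector-out behaviour using the Hugging Face transformers library and the bert-base-uncased checkpoint (neither of which is specified in the seminar): one output vector of size H per input token.

```python
# Illustrative sketch: run a sentence through the BERT encoder and inspect the output shape.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The child came home from school.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector of size H (768 for BERT Base) per input token, in the same order.
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, 9, 768])
```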
When training language models, there is a challenge of
defining a prediction goal. Many models predict the next
word in a sequence (e.g. “The child came home from
___”), a directional approach which inherently limits
context learning.
To overcome this challenge, BERT uses two training
strategies:
Masked LM (MLM)
Before feeding word sequences into BERT, 15% of the
words in each sequence are replaced with a [MASK]
token.
The model then attempts to predict the original value of
the masked words, based on the context provided by the
other, non-masked, words in the sequence.
If the i-th token is chosen, BERT replaces the i-th token with:
1) the [MASK] token 80% of the time
2) a random token 10% of the time
3) the unchanged i-th token 10% of the time



The BERT loss function takes into consideration only the
prediction of the masked values and ignores the prediction
of the non-masked words.
 As a consequence, the model converges slower than
directional models, a characteristic which is offset by its
increased context awareness.
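A hedged sketch of this masking scheme follows; it is written directly from the description above (15% of positions selected, the 80/10/10 replacement rule, and a loss computed only on the selected positions) and is illustrative rather than the authors' exact implementation.

```python
# Illustrative MLM masking: select ~15% of tokens, apply the 80/10/10 rule,
# and mark all other positions so the loss ignores them.
import random
import torch

def mask_tokens(token_ids, mask_id, vocab_size, mlm_prob=0.15):
    input_ids = list(token_ids)
    labels = [-100] * len(token_ids)      # -100 = ignored by the loss (non-masked words)
    for i, tok in enumerate(token_ids):
        if random.random() < mlm_prob:    # ~15% of positions are chosen
            labels[i] = tok               # only chosen positions contribute to the loss
            r = random.random()
            if r < 0.8:
                input_ids[i] = mask_id                       # 80%: replace with [MASK]
            elif r < 0.9:
                input_ids[i] = random.randrange(vocab_size)  # 10%: replace with a random token
            # remaining 10%: keep the original token unchanged
    return torch.tensor(input_ids), torch.tensor(labels)
```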

Next Sentence Prediction (NSP)


In the BERT training process, the model receives pairs of
sentences as input and learns to predict if the second
sentence in the pair is the subsequent sentence in the
original document.



During training, 50% of the inputs are a pair in which the
second sentence is the subsequent sentence in the original
document, while in the other 50% a random sentence from
the corpus is chosen as the second sentence.
The assumption is that the random sentence will be
disconnected from the first sentence.
To help the model distinguish between the two sentences
in training, the input is processed in the following way
before entering the model:
1) A [CLS] token is inserted at the beginning of the first sentence and a [SEP] token is inserted at the end of each sentence.
2) A sentence embedding indicating Sentence A or Sentence B is added to each token. Sentence embeddings are similar in concept to token embeddings with a vocabulary of 2.
3) A positional embedding is added to each token to indicate its position in the sequence. The concept and implementation of positional embeddings are presented in the Transformer paper.
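The sketch below illustrates steps 1 and 2 with the Hugging Face tokenizer (an assumed tool, not part of the seminar): the pair is packed as [CLS] A [SEP] B [SEP], and the token_type_ids play the role of the Sentence A / Sentence B embedding; positional embeddings are added inside the model itself.

```python
# Illustrative sketch: how a sentence pair is packed before entering the model.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("The child came home from school.",
                "He went straight to his room.")

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))  # [CLS] ... [SEP] ... [SEP]
print(enc["token_type_ids"])  # 0 for Sentence A tokens, 1 for Sentence B tokens
```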



To predict if the second sentence is indeed connected to the
first, the following steps are performed:
1) The entire input sequence goes through the Transformer model.
2) The output of the [CLS] token is transformed into a 2×1 shaped vector, using a simple classification layer (learned matrices of weights and biases).
3) The probability of IsNextSequence is calculated with softmax.
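A minimal sketch of steps 2 and 3 is given below, assuming H = 768 (BERT Base) and using a stand-in tensor for the [CLS] output; it shows the learned classification layer followed by a softmax over the two outcomes.

```python
# Illustrative NSP head: a simple classification layer over the [CLS] output,
# followed by a softmax over {IsNext, NotNext}.
import torch
import torch.nn as nn

hidden_size = 768                          # H for BERT Base
cls_output = torch.randn(1, hidden_size)   # stand-in for the [CLS] vector from the encoder

nsp_head = nn.Linear(hidden_size, 2)       # learned matrix of weights and a bias
probs = torch.softmax(nsp_head(cls_output), dim=-1)
print(probs)                               # [P(IsNext), P(NotNext)]
```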

How to use BERT (Fine-tuning)


Using BERT for a specific task is relatively straightforward: BERT can be used for a wide variety of language tasks while adding only a small layer on top of the core model:



Classification tasks such as sentiment analysis are done
similarly to Next Sentence classification, by adding a
classification layer on top of the Transformer output for
the [CLS] token.
In Question Answering tasks (e.g. SQuAD v1.1), the
software receives a question regarding a text sequence and
is required to mark the answer in the sequence. Using
BERT, a Q&A model can be trained by learning two extra
vectors that mark the beginning and the end of the answer.
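The sketch below (assumed checkpoint names and the Hugging Face transformers API, not something prescribed by the seminar) shows both patterns: a classification layer over the [CLS] output, and a QA head whose two extra vectors score each token as the answer start or end. Note that these task heads are randomly initialised until the model is fine-tuned.

```python
# Illustrative fine-tuning setups: [CLS]-based classification and SQuAD-style QA.
from transformers import (BertTokenizer, BertForSequenceClassification,
                          BertForQuestionAnswering)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Sentiment-style classification: a small classification layer on the [CLS] output.
clf = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
enc = tokenizer("The movie was great.", return_tensors="pt")
print(clf(**enc).logits)  # untrained head: logits are meaningful only after fine-tuning

# Question answering: two extra vectors score each token as answer start / end.
qa = BertForQuestionAnswering.from_pretrained("bert-base-uncased")
enc = tokenizer("Where did the child come home from?",
                "The child came home from school.", return_tensors="pt")
out = qa(**enc)
start, end = out.start_logits.argmax().item(), out.end_logits.argmax().item()
print(tokenizer.decode(enc["input_ids"][0][start:end + 1]))
```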



Application/Usage
 Pre-trained models: models pretrained on a domain/application-specific corpus, such as BioBERT (biomedical text), SciBERT (scientific publications), and ClinicalBERT (clinical notes). Training on a domain-specific corpus has been shown to yield better performance when fine-tuning on downstream NLP tasks (e.g. NER) for those domains, compared with fine-tuning BERT (which was trained on BooksCorpus and Wikipedia); a loading sketch follows this list.



 A model pretrained on monolingual corpora from 104 languages (M-BERT) for zero-shot cross-lingual model transfer (task-specific annotations in one language are used to fine-tune the model for evaluation in another language).
 A model that jointly learns video and language representations (VideoBERT) by representing video frames as special descriptor tokens along with text for pretraining. This is used for video captioning.
 A model that combines the power of graph neural networks and BERT (G-BERT) for medical/diagnostic code representation and recommendation. Medical codes with hierarchical representations are encoded using GNNs and fed as input during pretraining with EHR data. This is then fine-tuned for making medical recommendations.
 Text classification
 Search engines
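As mentioned for the pre-trained domain models, a hedged loading sketch follows; the Hub identifier "dmis-lab/biobert-v1.1" is an assumed checkpoint name, and the number of NER labels is arbitrary for illustration.

```python
# Illustrative sketch: start from a domain-specific checkpoint (here BioBERT)
# and prepare it for fine-tuning on a downstream task such as biomedical NER.
from transformers import AutoTokenizer, AutoModelForTokenClassification

checkpoint = "dmis-lab/biobert-v1.1"   # assumed BioBERT checkpoint on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=5)
# The token-classification head is new and untrained; fine-tune it on an NER dataset.
```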



Advantages/Disadvantages
Advantages:
BERT is a bidirectional model based on the Transformer architecture; it replaces the sequential nature of Recurrent Neural Networks with a much faster attention-based approach.
BERT is bidirectional, which is quite interesting in certain NLP cases: it means the model can "read" the input in both directions. As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once.



BERT considers the "context", not just the single word (as window-based and n-gram models such as FastText do). In a few words, BERT considers the sentence and the sentences around that sentence. This allows the model to understand a particular word within a sentence, and the sentence itself within the surrounding passage.
BERT uses a minimal vocabulary, based on the idea of representing words as sub-words or n-grams. On average, a vocabulary of 8k to 30k sub-word units can represent almost any word in a large corpus. This has a significant advantage from a memory perspective.
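The sketch below illustrates the sub-word idea with the standard BERT WordPiece tokenizer (an assumed tool, used only for illustration): a vocabulary of roughly 30k pieces covers rare words by splitting them.

```python
# Illustrative sketch: a ~30k WordPiece vocabulary represents rare words as sub-word pieces.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.vocab_size)                # about 30,000 entries
print(tokenizer.tokenize("unbelievably"))  # split into sub-word pieces, e.g. ['un', '##bel', ...]
```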



Disadvantages
Will need to add non-pre-trained bidirectional model on
top
Right-to-left SQuAD model doesn’t see question
Need to train two models
Off-by-one: LTR predicts next word, RTL predicts
previous word
Not trivial to add arbitrary pre-training tasks.



Conclusion/Future Scope
To evaluate performance, we compared BERT to other state-
of-the-art NLP systems.
Importantly, BERT achieved all of its results with almost no task-specific changes to the neural network architecture. On SQuAD v1.1, BERT achieves a 93.2% F1 score (a measure of accuracy), surpassing the previous state-of-the-art score of 91.6% and the human-level score of 91.2%.
BERT is a groundbreaking technology, not only for NLP in
general but also for Google Search.
However, it's important to understand that BERT doesn't
change the way in which websites get ranked but simply
improves Google's understanding of natural language.
References
 R. H. Tu et al., "Berkeley reliability tools-BERT," in IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, vol. 12, no. 10, pp. 1524-1534, Oct. 1993.
 M. Sabbaghian and D. Falconer, "BERT Chart Analysis of Turbo Frequency Domain Equalization with
Imperfect Channel State Information," 2008 IEEE Wireless Communications and Networking Conference,
Las Vegas, NV, 2008, pp. 436-441.
 M. Munikar, S. Shakya and A. Shrestha, "Fine-grained Sentiment Classification using BERT," 2019
Artificial Intelligence for Transforming Business and Society (AITB), Kathmandu, Nepal, 2019, pp. 1-5.
 Z. Gao, A. Feng, X. Song and X. Wu, "Target-Dependent Sentiment Classification With BERT," in IEEE
Access, vol. 7, pp. 154290-154299, 2019.
 S. Yu, J. Su and D. Luo, "Improving BERT-Based Text Classification With Auxiliary Sentence and Domain
Knowledge," in IEEE Access, vol. 7, pp. 176600-176612, 2019.
 H. Huang, X. Jing, F. Wu, Y. Yao, X. Zhang and X. Dong, "DCNN-BiGRU Text Classification Model Based
on BERT Embedding," 2019 IEEE International Conferences on Ubiquitous Computing & Communications
(IUCC) and Data Science and Computational Intelligence (DSCI) and Smart Computing, Networking and
Services (SmartCNS), Shenyang, China, 2019, pp. 632-637.
 W. Li, S. Gao, H. Zhou, Z. Huang, K. Zhang and W. Li, "The Automatic Text Classification Method Based on
BERT and Feature Union," 2019 IEEE 25th International Conference on Parallel and Distributed Systems
(ICPADS), Tianjin, China, 2019, pp. 774-777.
 S. Liu, H. Tao and S. Feng, "Text Classification Research Based on Bert Model and Bayesian Network,"
2019 Chinese Automation Congress (CAC), Hangzhou, China, 2019, pp. 5842-5846.
