
Project Report on Summarization and Translation

21BDS057 - Ravi Ranjan


21BDS036 - Manish Kumar
21BDS028 - Kola Kiriti Kumar
21BDS042 - Dilli Babu N

Under the guidance of


Dr. Sunil Saumya
Asst. Prof., Dept. of Data Science And Intelligent Systems

DEPARTMENT OF DATA SCIENCE AND ARTIFICIAL INTELLIGENCE


INDIAN INSTITUTE OF INFORMATION TECHNOLOGY DHARWAD

December 12, 2023

Contents

1 Introduction

2 Related Work
  2.1 Thematic Structure
  2.2 Rhetorical Roles

3 Data and Methods
  3.1 Data
  3.2 Models
  3.3 Problems
    3.3.1 Sequence Length
    3.3.2 Solutions
  3.4 Computation Limit
    3.4.1 Solutions
  3.5 Methodologies

4 Results and Discussions
  4.1 ROUGE
  4.2 Pre-trained Model Results

5 Conclusion

6 References

1 Introduction
Summarization is a well-defined problem in Natural Language Processing (NLP) wherein lengthy
documents belonging to a certain domain (e.g., health, legal cases, lecture notes) are
reduced to digestible, shorter paragraphs. Automatic summarization generally confronts issues
with understanding context, clustering related content, and reconstructing the selected statements into a summary.

Summarization tasks generally have two methodologies:


• Abstractive Summarization: In this approach, the model attempts to generate novel
statements using the knowledge it has of the domain.
• Extractive Summarization: In this approach, the model ranks and selects essential
statements from the source document to compose a summary.

This report proposes an implementation of an extractive summarization model and compares it against several abstractive models.

2 Related Work
The majority of previous work on domain adaptation for summary generation fails to identify
an optimal dataset for fine-tuning the language model. It also fails to adapt the structure of
the summary to that of the document, so the most important parts of the document are not
always captured in the summary.

2.1 Thematic Structure


Document-BERT is an attempt to materialize this approach. The study focuses on building three
versions of a domain-specific BERT model: (i) use BERT out of the box and fine-tune it on the task
data, (ii) use BERT, adapt the language model with domain corpora, and then fine-tune it further
on the task data, and (iii) pre-train the BERT model on domain corpora instead of generic
corpora.

2.2 Rhetorical Roles


Continuing the discussion of extractive summarization model types, unsupervised domain-dependent
models must consider, in addition to rhetorical structure and semantic roles, the creation and
assignment of a scoring system. In the ideal scenario, the summary should consist of roughly
10% introduction, 25% context, 60% case analysis, and 5% conclusion with ruling. Another
alternative is to use the traditional TF-IDF technique to rank the sentences; Saravanan et al.
use the K-Mixture Model as the deciding factor for including sentences in the summary.

3 Data and Methods


3.1 Data
We procured the data from the Indian government's G20 website. It was a PDF with a total of
214 pages, so performing summarization and translation on the whole document was computationally
difficult. We therefore took one article from the document and performed our experiments on it.

3.2 Models
This project involves using multiple models and testing and comparing their scores on this
dataset. For extractive summarization we use the BERT extractive summarizer; for abstractive
summarization we use GPT-2, T5, PEGASUS, and BART; and for translation we use
mBART-large-50-many-to-many-mmt.

The BERT extractive summarizer works by first embedding the sentences, then running a clustering
algorithm, and finally selecting the sentences closest to each cluster's centroid.

BERT MODEL
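As a minimal sketch of this embed-cluster-select pipeline (not necessarily the exact implementation behind the BERT extractive summarizer we used), the example below assumes the sentence-transformers and scikit-learn packages; the encoder name and sentence count are illustrative.

# Minimal sketch: extractive summarization by embedding sentences with a
# BERT-style encoder, clustering them, and keeping the sentences closest to
# each cluster centroid. Assumed deps: sentence-transformers, scikit-learn.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def extractive_summary(sentences, num_sentences=3):
    num_sentences = min(num_sentences, len(sentences))
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder
    embeddings = encoder.encode(sentences)

    # One cluster per summary sentence; each cluster acts as a "topic".
    kmeans = KMeans(n_clusters=num_sentences, n_init=10, random_state=0)
    kmeans.fit(embeddings)

    # Pick the sentence nearest to each centroid, then restore document order.
    chosen = set()
    for centroid in kmeans.cluster_centers_:
        chosen.add(int(np.argmin(np.linalg.norm(embeddings - centroid, axis=1))))
    return " ".join(sentences[i] for i in sorted(chosen))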

GPT-2 works in the following ways:

• Layered structure: GPT consists of multiple layers of the transformer model. Each layer
includes a combination of self-attention and feed-forward neural network sub-layers.
• Positional encoding: Since transformers do not inherently understand the order of tokens in a
sequence, positional encoding is added to provide information about the positions of tokens in
the input sequence.
• Attention mechanism: GPT uses a self-attention mechanism that allows the model to assign
different weights to different parts of the input sequence, enabling it to focus on relevant
information.
• Fine-tuning: After pre-training, GPT models can be fine-tuned on specific tasks with smaller
datasets to adapt to particular applications, such as language translation, summarization, or
question answering.
• Autoregressive generation: GPT is autoregressive, meaning it generates sequences one token at
a time. During inference, the model predicts the next token based on the preceding context.
• Parameter size: GPT models typically have a large number of parameters, contributing to their
ability to learn complex patterns and generate coherent and contextually relevant text.

GPT-2 MODEL
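As a hedged sketch (not necessarily the project's exact setup), GPT-2 can be prompted for zero-shot summarization by appending a "TL;DR:" marker; the checkpoint and generation settings below are illustrative.

# Hedged sketch: zero-shot GPT-2 summarization via a "TL;DR:" prompt.
# Assumes the Hugging Face transformers package; the "gpt2" checkpoint is illustrative.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def gpt2_summary(text, max_new_tokens=80):
    prompt = text + "\nTL;DR:"
    # GPT-2's context window is 1024 tokens, so truncate the prompt to leave
    # room for the generated summary.
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=900)
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        num_beams=4,
        no_repeat_ngram_size=3,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Keep only the tokens generated after the prompt (the summary itself).
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()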

The BART-Large-CNN model is the BART model pre-trained on English and fine-tuned on the
CNN/Daily Mail dataset. It was introduced in the paper BART: Denoising Sequence-to-Sequence
Pre-training for Natural Language Generation, Translation, and Comprehension and first released
in the accompanying repository.

BART, short for Bidirectional and Auto-Regressive Transformers, is an encoder-decoder model,
which is synonymous with a sequence-to-sequence model. It employs a bidirectional approach in
the encoder, similar to how BERT works, in which the input is read from both directions. This
helps the model build context from both ends of the input text at the same time. The encoder is
bidirectional, while the decoder is auto-regressive, meaning the model predicts future tokens
based on past tokens (similar to GPT). Pre-training for this model involves corrupting the input
text with a random noising function and then learning to reconstruct the original text. The BART
model is notably effective when fine-tuned for summarization. BART-Large-CNN is a fine-tuned
version of BART used for text generation tasks such as translation and summarization.

BART MODEL
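A minimal, hedged sketch of using this checkpoint through the Hugging Face pipeline API is shown below; the length limits are illustrative and not necessarily the settings used in the project.

# Hedged sketch: abstractive summarization with the pre-trained
# facebook/bart-large-cnn checkpoint via the transformers pipeline API.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = "..."  # placeholder: one article taken from the G20 document
result = summarizer(article, max_length=130, min_length=30, do_sample=False)
print(result[0]["summary_text"])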

PEGASUS, which stands for Pre-training with Extracted Gap-sentences for Abstractive
Summarization, is a transformer encoder-decoder model designed to improve fine-tuning
performance on abstractive summarization. During pre-training, several whole sentences are
removed from the documents and the model is tasked with recovering them. The authors of the
paper found that choosing the 'most important' sentences of the input document to mask produces
outputs closest to a summary. Based on their training results, the paper concluded that large
supervised training sets are no longer necessary, which opens up many low-cost use cases.

PEGASUS MODEL
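A hedged sketch of generating a summary with a pre-trained PEGASUS checkpoint follows; the checkpoint name (google/pegasus-cnn_dailymail) and beam settings are illustrative.

# Hedged sketch: abstractive summarization with a pre-trained PEGASUS checkpoint.
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = "google/pegasus-cnn_dailymail"  # illustrative checkpoint
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

text = "..."  # placeholder: article text
batch = tokenizer(text, truncation=True, padding="longest", return_tensors="pt")
summary_ids = model.generate(**batch, num_beams=4, max_length=128)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))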

T5, the Text-to-Text Transfer Transformer, is a state-of-the-art pre-trained language model
based on the transformer architecture. It adopts a unified text-to-text framework that can
handle any natural language processing (NLP) task by converting both the input and the output
into natural language text.

T5 Architecture
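A hedged sketch of T5's text-to-text framing for summarization follows, using the "summarize:" task prefix; the t5-small checkpoint and generation settings are illustrative.

# Hedged sketch: summarization with T5's unified text-to-text interface,
# where the task is selected by the "summarize:" prefix.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")  # illustrative checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-small")

text = "..."  # placeholder: article text
inputs = tokenizer("summarize: " + text, return_tensors="pt",
                   truncation=True, max_length=512)
summary_ids = model.generate(**inputs, num_beams=4, max_length=128, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))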

MBART MODEL

mBART is a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual
corpora in many languages using the BART objective. mBART is one of the first methods for
pre-training a complete sequence-to-sequence model by denoising full texts in multiple
languages, whereas previous approaches had focused only on the encoder, the decoder, or
reconstructing parts of the text.
mBART is a multilingual encoder-decoder (sequence-to-sequence) model primarily intended for
translation tasks. Because the model is multilingual, it expects sequences in a different
format: a special language-id token is added to both the source and target text.
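A hedged sketch of translating an English summary with this checkpoint is shown below; the target language (Hindi, hi_IN) is an illustrative choice and any language code supported by the model can be substituted.

# Hedged sketch: English-to-Hindi translation with mbart-large-50-many-to-many-mmt.
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name, src_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained(model_name)

summary_en = "..."  # placeholder: summary produced by the best summarization model
inputs = tokenizer(summary_en, return_tensors="pt", truncation=True)
translated_ids = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["hi_IN"],  # target language id token
    max_length=256,
)
print(tokenizer.decode(translated_ids[0], skip_special_tokens=True))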

3.3 Problems
3.3.1 Sequence Length
In our dataset, as previously mentioned in the Data and Methods section, the documents are
extremely long, ranging from 1,500 to 2,000 words. Most transformer-based models have input
sequence limits of 512 or 1,024 tokens. This proves to be a significant problem since our
document is several times longer than the token limit.

3.3.2 Solutions
1. Split the document into multiple chunks, generate summaries separately, and then combine
them into one summary. The problem with this method is that transformers use global context
for the summarization task, so splitting into chunks contradicts the purpose of using
transformer-based models, and context between the chunks is lost. Since our summarizer is
abstractive, it also tends to generate similar summaries for consecutive chunks. We can avoid
the latter problem by using cosine similarity between the selected sentences and dropping
sentences with extremely high similarity to others, as in the sketch below.
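The following is a minimal sketch of this chunk-then-deduplicate idea, not the project's exact code; the summarize_fn helper, the embedding model name, and the 0.85 similarity threshold are assumptions.

# Hedged sketch: summarize each chunk, then drop near-duplicate sentences using
# cosine similarity between sentence embeddings. Assumed dep: sentence-transformers.
import numpy as np
from sentence_transformers import SentenceTransformer

def dedupe_sentences(sentences, threshold=0.85):
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder
    emb = encoder.encode(sentences)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # dot product = cosine
    kept, kept_emb = [], []
    for sent, vec in zip(sentences, emb):
        if all(float(np.dot(vec, k)) < threshold for k in kept_emb):
            kept.append(sent)
            kept_emb.append(vec)
    return kept

def summarize_long_document(chunks, summarize_fn):
    # summarize_fn is any chunk-level summarizer (e.g., one of the models above).
    chunk_summaries = [summarize_fn(chunk) for chunk in chunks]
    sentences = [s.strip() for cs in chunk_summaries for s in cs.split(".") if s.strip()]
    return ". ".join(dedupe_sentences(sentences)) + "."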

3.4 Computation Limit


Our implementation involves fine-tuning a pre-trained model to summarize the long document
text. Fine-tuning a model for large sequence sizes is computationally expensive even for small
training and test set sizes. Even though we use a small dataset of 250 document-summary pairs,
we exceed the computation limits of online resources.

3.4.1 Solutions
1. Train on a sample for fewer epochs and decrease the size of the output summary, which allows
some online resources to train the model. This proves to be extremely difficult, and the
summaries generated by the fine-tuned models turn out to be illegible and grammatically
incorrect.
2. Use other services to train and evaluate the models, e.g., AutoTrain (Hugging Face), an
open-source online platform for training models on your own dataset.

3.5 Methodologies

Methods for extractive summarization and translation of the G20 text

We propose several methods for generating extractive and abstractive summaries of the text and
then translating them:

1. Directly use a pre-trained extractive model such as the BERT extractive summarizer.

2. Fine-tune an extractive summarizer on the Indian G20 document.

3. Fine-tune abstractive summarizers such as GPT-2, T5, PEGASUS, and BART on the dataset and
compare their similarity to a human reference summary; in our case, we use the extractive
summary as the reference.

4 Results and Discussions


4.1 ROUGE
In order to measure the degree of similarity between any two texts, we use the ROUGE score.
ROUGE, which stands for Recall-Oriented Understudy for Gisting Evaluation, is widely considered
the standard evaluation metric for summarization tasks. Below are the types of ROUGE used:

1. ROUGE-N: Counts matching n-grams between the generated summary and the gold-standard
summary. ROUGE-1 measures unigrams, i.e., single words that match between the generated
summary and the gold standard; similarly, ROUGE-2 looks at matching bigrams.

2. ROUGE-L: Counts the longest common sub-sequence shared between the generated summary and
the gold standard. The core idea is that a higher ROUGE-L score means a longer sub-sequence is
common, and hence the summary is closer to the gold standard.

ROUGE FORMULAE

ROUGE allows us to calculate three parameters, which are Recall, Precision, and F-1 score.
Recall is equal to the number of matching n-grams divided by the total number of n-grams in
the reference text. Precision is equal to the number of matching n-grams divided by the total
number of n-grams in our generated summary. The F-1 score is the simple harmonic mean of
recall and precision.
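As a small illustration (assuming the rouge-score package; the reference and candidate strings are placeholders), these three parameters can be computed as follows:

# Hedged sketch: computing ROUGE-1, ROUGE-2, and ROUGE-L precision/recall/F1
# with the rouge-score package.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "..."  # placeholder: extractive summary used as the reference
candidate = "..."  # placeholder: summary generated by one of the models
scores = scorer.score(reference, candidate)
for name, s in scores.items():
    print(f"{name}: precision={s.precision:.3f} recall={s.recall:.3f} f1={s.fmeasure:.3f}")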

4.2 Pre-trained Model Results


ROUGE scores of individual models

ROUGE scores of the combined model

Bar chart of ROUGE scores

Looking at the chart, we can see that GPT-2 is the best-performing summarization model on our
data. We therefore take that model's summary and translate it into the desired language using
the mBART model.

5 Conclusion
In the current scenario, the Indian government system is overloaded with documents, and people
do not have time to read every document. To ease this, our pipeline provides summarization and
translation, which can affect the lives of people at a large scale.

Looking at the future scope of the project, we can see that by training the model on just a
small part of the dataset we already produced decent scores, which can be improved by
increasing the available computational power.

6 References
Code: https://www.kaggle.com/code/raviranjan7284/summarization-and-translation

Citations

1. Narayan, S., Cohen, S.B. and Lapata, M., 2018. Don't give me the details, just the summary!
Topic-aware convolutional neural networks for extreme summarization. arXiv preprint
arXiv:1808.08745.

2. Erkan, G. and Radev, D.R., 2004. LexRank: Graph-based lexical centrality as salience in
text summarization. Journal of Artificial Intelligence Research, 22, pp.457-479.

3. Narayan, S., Cohen, S.B. and Lapata, M., 2018. Ranking sentences for extractive
summarization with reinforcement learning. arXiv preprint arXiv:1802.08636.

4. Nallapati, R., Zhai, F. and Zhou, B., 2017. SummaRuNNer: A recurrent neural network based
sequence model for extractive summarization of documents. In Proceedings of the AAAI
Conference on Artificial Intelligence (Vol. 31, No. 1).

Acknowledgement
We extend our heartfelt appreciation to all individuals who contributed to the successful
culmination of this project.
Our deepest gratitude is extended to Dr. Sunil Saumya Sir, our dedicated supervisor, whose
unwavering guidance, support, and invaluable insights were instrumental throughout the project.
His expertise and encouragement significantly influenced the trajectory of our undertaking.
We would also like to express our sincere thanks to our fellow teammates who, at various
stages of the project, provided invaluable support, shared insightful ideas, and offered assistance.
Their collaboration enriched the project experience and contributed to its overall success.
This project would not have been possible without the collective effort and encouragement
received from our academic community and beyond.
Thank you to everyone who played a role, directly or indirectly, in the realization of this
project.

Declaration
We declare that this written submission represents our ideas in our own words and that, where
others' ideas or words have been included, we have adequately cited and referenced the original sources.
We also declare that we have adhered to all principles of academic honesty and integrity and have
not misrepresented or fabricated or falsified any idea/data/fact/source in our submission. We
understand that any violation of the above will be cause for disciplinary action by the Institute
and can also evoke penal action from the sources which have thus not been properly cited or
from whom proper permission has not been taken when needed.

(Signature with date)


Ravi Ranjan
Roll No: 21BDS057

(Signature with date)


Manish Kumar
Roll No: 21BDS036

(Signature with date)


Kiriti Kumar Kola
Roll No: 21BDS028

(Signature with date)


Dilli Babu N
Roll No: 21BDS042

Approval Sheet
This project report entitled (Title) by (Author Name 1), (Author Name 2), and (Author
Name 3) is approved for the degree of Bachelor of Technology in Computer Science and Engi-
neering.

Supervisors

Head of Department

(Head of Department)

Examiners

Date:
Place:

