Report Group-8
Contents
1 Introduction
2 Related Work
2.1 Thematic Structure
2.2 Rhetorical Roles
5 Conclusion
6 References
1 Introduction
Summarization is a well-defined problem in Natural Language Processing (NLP) wherein lengthy
documents belonging to a certain domain (e.g., health, legal cases, lecture documents) are
reduced to digestible shorter paragraphs. Automatic summarization generally confronts issues
with understanding context, clustering related content, and reconstructing the selected statements into a coherent summary.
2 Related Work
Most previous work on domain adaptation of summary generation fails to identify an optimal
dataset for fine-tuning the language model. It also fails to adapt the structure of the summary,
relative to the source document, so as to capture the most important parts of the document in
the summary.
3.2 Models
This project uses multiple models and compares their scores on this dataset. For extractive
summarization we use a BERT-based extractive summarizer, and for abstractive summarization
we use GPT-2, T5, PEGASUS, and BART. For translation we use mBART-large-50-many-to-many-mmt.
The BERT-based extractive summarizer works by first embedding the sentences, then running a
clustering algorithm, and finally selecting the sentences closest to the cluster centroids.
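As a rough illustration of this pipeline, the sketch below embeds sentences, clusters the embeddings with k-means, and keeps the sentence nearest to each centroid. The encoder name, the naive sentence splitting, and the summary length are illustrative choices, not the exact setup used in this project.

```python
# Minimal sketch of centroid-based extractive summarization: embed sentences,
# cluster the embeddings, and keep the sentence nearest to each cluster centroid.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed available
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import euclidean_distances

def extractive_summary(text: str, num_sentences: int = 3) -> str:
    # Naive sentence splitting; a proper sentence tokenizer would be better.
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    if len(sentences) <= num_sentences:
        return text

    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder
    embeddings = encoder.encode(sentences)

    # Cluster sentence embeddings and pick the sentence closest to each centroid.
    kmeans = KMeans(n_clusters=num_sentences, n_init=10, random_state=0).fit(embeddings)
    chosen = []
    for centroid in kmeans.cluster_centers_:
        distances = euclidean_distances([centroid], embeddings)[0]
        chosen.append(int(np.argmin(distances)))

    # Preserve the original sentence order in the summary.
    return ". ".join(sentences[i] for i in sorted(set(chosen)))
```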
[Figure: BERT model]
[Figure: GPT-2 model]
The BART-Large-CNN model is the BART model pre-trained on English and fine-tuned on the
CNN Daily Mail dataset. It was introduced in the paper "BART: Denoising Sequence-to-Sequence
Pre-training for Natural Language Generation, Translation, and Comprehension" and first
released in the fairseq repository.
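A minimal usage sketch with the Hugging Face pipeline follows; the generation lengths and the placeholder input are illustrative, and the input must already fit within the model's 1,024-token limit (see Section 3.3.1).

```python
# Minimal sketch: summarizing one document with the pre-trained
# facebook/bart-large-cnn checkpoint via the Hugging Face pipeline.
# The generation lengths are illustrative; the input text is a placeholder.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
document_text = "..."  # placeholder: the decision to be summarized
summary = summarizer(document_text, max_length=130, min_length=30, do_sample=False)
print(summary[0]["summary_text"])
```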
[Figure: BART model]
PEGASUS, short for Pre-training with Extracted Gap-sentences for Abstractive Summarization,
is a transformer encoder-decoder model designed to improve fine-tuning performance on
abstractive summarization. During pre-training, several whole sentences are removed from the
documents and the model is tasked with recovering them. The authors found that choosing the
'most important' sentences of the input document to mask yields results closest to a summary.
The paper also showed that strong fine-tuning performance can be reached without large
supervised training sets, which opens up many low-cost use cases.
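To make the gap-sentence idea concrete, the sketch below scores each sentence by a crude word-overlap proxy for "importance" and masks the top-k sentences. This is only an illustration of the principle, not the ROUGE-based selection strategy used in the actual PEGASUS pre-training.

```python
# Illustrative sketch of gap-sentence selection (NOT the exact procedure from
# the PEGASUS paper): score each sentence by its word overlap with the rest of
# the document as a crude "importance" proxy, then mask the top-k sentences.
def gap_sentence_mask(text: str, k: int = 2, mask_token: str = "<mask>"):
    sentences = [s.strip() for s in text.split(". ") if s.strip()]

    def importance(i: int) -> float:
        sent_words = set(sentences[i].lower().split())
        rest_words = {w for j, s in enumerate(sentences) if j != i for w in s.lower().split()}
        return len(sent_words & rest_words) / max(len(sent_words), 1)

    # The k highest-scoring sentences become the "gap" sentences.
    masked_ids = set(sorted(range(len(sentences)), key=importance, reverse=True)[:k])
    masked_doc = ". ".join(mask_token if i in masked_ids else s for i, s in enumerate(sentences))
    target = ". ".join(sentences[i] for i in sorted(masked_ids))
    return masked_doc, target  # model input, and the sentences the model must recover
```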
[Figure: PEGASUS model]
[Figure: T5 architecture]
3.3 Problems
3.3.1 Sequence Length
In our dataset, as previously mentioned in the Data and Methods section, the decisions are
extremely long, ranging from 1,500 to 2,000 words. Most transformer-based models have input
limits of 512 or 1,024 tokens. This is a significant problem, since a single decision can be
nearly 5-10 times the token limit.
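A quick way to see the mismatch is to count tokens with the model's own tokenizer; the model name and the placeholder text below are illustrative.

```python
# Quick check of how far a single decision exceeds the model's input limit.
# The model name is one of those used above; `decision_text` is a placeholder.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
decision_text = "..."  # placeholder: full text of one court decision
num_tokens = len(tokenizer(decision_text)["input_ids"])
print(f"{num_tokens} tokens vs. a limit of {tokenizer.model_max_length}")
```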
3.3.2 Solutions
1. Split the decision into multiple chunks, generate a summary for each chunk separately, and
then combine the chunk summaries into one summary. The problem with this method is that
transformers use global context for summarization, so splitting into chunks contradicts the
purpose of using transformer-based models, and context between the chunks is lost. In
addition, since our summarizer is abstractive, it tends to generate similar summaries for
consecutive chunks. We can avoid the latter problem by using cosine similarity to select
sentences and dropping sentences that are extremely similar to ones already kept; a minimal
sketch of this approach is given below.
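The sketch below shows one way the chunk-and-combine idea could be implemented, assuming a BART summarizer and a TF-IDF cosine-similarity filter; the chunk size, checkpoint, and 0.8 threshold are assumptions, not the project's exact configuration.

```python
# Sketch of the chunk-and-combine approach, with a cosine-similarity filter
# that drops near-duplicate sentences across chunk summaries.
from transformers import pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize_long(document: str, chunk_words: int = 500) -> str:
    # Split the document into word-based chunks that stay under the token limit.
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]

    # Summarize each chunk independently (global context across chunks is lost).
    pieces = [
        summarizer(chunk, max_length=120, min_length=30, do_sample=False)[0]["summary_text"]
        for chunk in chunks
    ]

    # Combine the chunk summaries, then drop sentences that are extremely
    # similar to a sentence already kept (illustrative 0.8 threshold).
    sentences = [s.strip() for piece in pieces for s in piece.split(". ") if s.strip()]
    sims = cosine_similarity(TfidfVectorizer().fit_transform(sentences))
    kept = []
    for i in range(len(sentences)):
        if all(sims[i, j] < 0.8 for j in kept):
            kept.append(i)
    return ". ".join(sentences[i] for i in kept)
```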
3.4.1 Solutions
1. Train on a smaller sample for fewer epochs and decrease the length of the output summary,
which would allow some online services to train the model. This proves extremely difficult,
and the summaries generated by the fine-tuned models turn out to be illegible and
grammatically incorrect.
2. Use other services to train the models and evaluate their metrics, e.g., AutoTrain
(Hugging Face), an open online service for training models on your own dataset.
3.5 Methodologies
1. ROUGE-N: Measures the overlap of n-grams between the generated summary and the gold
standard summary. ROUGE-1 counts matching unigrams (single words) between the generated
summary and the gold standard; similarly, ROUGE-2 counts matching bigrams.
2. ROUGE-L: Counts the longest common sub-sequence shared by the generated summary and the
gold standard. The core idea is that a higher ROUGE-L score means a longer sub-sequence is
shared, and hence the summary is closer to the gold standard.
[Figure: ROUGE formulae]
ROUGE allows us to calculate three parameters, which are Recall, Precision, and F-1 score.
Recall is equal to the number of matching n-grams divided by the total number of n-grams in
the reference text. Precision is equal to the number of matching n-grams divided by the total
number of n-grams in our generated summary. The F-1 score is the simple harmonic mean of
recall and precision.
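Written out from the definitions above, the formulae are:

\[
\text{Recall} = \frac{\#\,\text{matching } n\text{-grams}}{\#\,n\text{-grams in the reference summary}},
\qquad
\text{Precision} = \frac{\#\,\text{matching } n\text{-grams}}{\#\,n\text{-grams in the generated summary}},
\]
\[
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
\]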
[Figure: Bar chart of ROUGE scores]
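The per-model scores summarized in the chart can be computed, for example, with the rouge_score package; the reference and generated summaries below are placeholders.

```python
# Sketch: computing ROUGE-1/2/L precision, recall, and F1 for one model output
# with the rouge_score package. The two texts below are placeholders.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "..."  # gold standard summary
generated = "..."  # summary produced by one of the models
scores = scorer.score(reference, generated)
for metric, result in scores.items():
    print(metric, round(result.precision, 3), round(result.recall, 3), round(result.fmeasure, 3))
```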
Looking at the chart, we can see that GPT-2 produces the best summaries. We therefore take
that model's summary and translate it into the desired language using the mBART model.
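A minimal translation sketch with the mBART-50 checkpoint follows; the English-to-Hindi language codes are an illustrative choice of target language, and the input summary is a placeholder.

```python
# Minimal sketch: translating the chosen summary with
# facebook/mbart-large-50-many-to-many-mmt. The language codes
# ("en_XX" -> "hi_IN") are an illustrative choice; the input is a placeholder.
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model_name = "facebook/mbart-large-50-many-to-many-mmt"
model = MBartForConditionalGeneration.from_pretrained(model_name)
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)

gpt2_summary = "..."  # placeholder: summary generated by the GPT-2 model
tokenizer.src_lang = "en_XX"  # source language: English
encoded = tokenizer(gpt2_summary, return_tensors="pt")
generated = model.generate(
    **encoded,
    forced_bos_token_id=tokenizer.lang_code_to_id["hi_IN"],  # target: Hindi
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```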
5 Conclusion
In the current scenario, the Indian government system is overloaded with documents, and people
do not have time to read every document in full. Our model, which combines summarization and
translation, therefore offers a convenient way to make these documents accessible and could
affect people's lives on a large scale.
Looking at the future scope of the project, we see that by training the model on just a small
part of the dataset we produced decent scores, which can be improved by increasing the
available computational power.
6 References
Code: https://www.kaggle.com/code/raviranjan7284/summarization-and-translation
Citations
1. Narayan, S., Cohen, S.B. and Lapata, M., 2018. Don't give me the details, just the summary!
Topic-aware convolutional neural networks for extreme summarization. arXiv preprint
arXiv:1808.08745.
2. Erkan, G. and Radev, D.R., 2004. LexRank: Graph-based lexical centrality as salience in
text summarization. Journal of Artificial Intelligence Research, 22, pp.457-479.
3. Narayan, S., Cohen, S.B. and Lapata, M., 2018. Ranking sentences for extractive
summarization with reinforcement learning. arXiv preprint arXiv:1802.08636.
4. Nallapati, R., Zhai, F. and Zhou, B., 2017. SummaRuNNer: A recurrent neural network based
sequence model for extractive summarization of documents. In Proceedings of the AAAI
Conference on Artificial Intelligence (Vol. 31, No. 1).
Acknowledgement
We extend our heartfelt appreciation to all individuals who contributed to the successful
culmination of this project.
Our deepest gratitude is extended to Dr. Sunil Saumya Sir, our dedicated supervisor, whose
unwavering guidance, support, and invaluable insights were instrumental throughout the project.
His expertise and encouragement significantly influenced the trajectory of our undertaking.
We would also like to express our sincere thanks to our fellow teammates who, at various
stages of the project, provided invaluable support, shared insightful ideas, and offered assistance.
Their collaboration enriched the project experience and contributed to its overall success.
This project would not have been possible without the collective effort and encouragement
received from our academic community and beyond.
Thank you to everyone who played a role, directly or indirectly, in the realization of this
project.
Declaration
We declare that this written submission represents our ideas in our own words and, where
others' ideas or words have been included, we have adequately cited and referenced the
original sources.
We also declare that we have adhered to all principles of academic honesty and integrity and have
not misrepresented or fabricated or falsified any idea/data/fact/source in our submission. We
understand that any violation of the above will be cause for disciplinary action by the Institute
and can also evoke penal action from the sources which have thus not been properly cited or
from whom proper permission has not been taken when needed.
Approval Sheet
This project report entitled (Title) by (Author Name 1), (Author Name 2), and (Author
Name 3) is approved for the degree of Bachelor of Technology in Computer Science and Engi-
neering.
Supervisors
Head of Department
(Head of Department)
Examiners
Date:
Place: