An Overview On Extractive Text Summarization
4258-4270
Abstract
With the increasing amount of online information and resource texts, text summarization has become an essential and popular domain for preserving and presenting the main purpose of textual information. It is very difficult for human beings to manually summarize large text documents. Text summarization is the process of automatically creating a condensed form of a given document, preserving its information content in a shorter version that conveys the overall meaning. Nowadays, text summarization is one of the most popular research areas in natural language processing and has attracted considerable attention from NLP researchers. There is also a close relationship between text mining and text summarization. Since different inputs demand different kinds of summaries, summarization systems should be created and classified based on the type of input text. In this study, the topic of text mining and its relationship with text summarization are first considered. Then a review of some summarization approaches and their important parameters for extracting predominant sentences is given, the main stages of the summarization process are identified, and the most significant extraction criteria are presented. Finally, the most fundamental proposed evaluation methods are considered.
1. Introduction
With an increasing amount of data available on the web, the rise of news websites, the publication of various electronic books, and significant growth in the number of articles published in different fields of study, one of the main challenges for researchers of the 21st century has been accessing accurate and reliable data. The widespread volume of available information on the one hand and time limitations on the other have directed researchers to the interesting area of text summarization and powerful systems for summarizing documents. Although much research has been done on the summarization subject and many articles have been published about it, there are still many weaknesses in the summarization field, and a large gap remains before achieving an efficient system that can act like a human agent. The problems outlined here are far more serious in the Persian language than in other languages. The complexity of the language and the lack of precise tools are the current problems facing Persian language processing. Therefore, reviewing the operations and procedures performed on other
Article History:
Received Date: Dec. 12, 2018
Accepted Date: Apr. 12, 2019
Available Online: Jul. 01, 2019
M. Abdolahi et al. / Vol. 9(33), Jul. 2019, PP. 4258-4270
languages, considering the semantic field and semantic relationships, and using tools such as graph theory, statistical methods, fuzzy logic and data mining techniques, can make a significant contribution to Persian language processing.
A text is made up of components such as words, phrases and sentences that are connected completely and meaningfully together. One of the main areas of natural language processing is text mining, which means discovering and extracting new information from documents. Text mining is the analysis of documents to extract valuable hidden patterns from the text. It also involves detecting the connections between words and sentences and classifying and summarizing texts. The main purpose of this research is an overview of text summarization.
More than half a century has passed since the first research on automatic text summarization [8]. Studies of text summarization systems emerged in the 1950s; due to the lack of powerful computers and other problems in natural language processing, they focused on some basic features of the text, such as the position of the sentences in the text. Since then, many methods with powerful tools have been presented to simulate text processing as the human brain performs it.
because eliminating words may lead to text noise such as grammatical errors. Apart from reducing the size of the text, preprocessing consists of some other steps, as follows:
Case Folding: At this stage, all words with uppercase and lowercase letters are brought into a uniform form; usually, uppercase letters are converted to lowercase.
Stemming: This stage extracts the roots of words, removing variations such as singular or plural form, tense, and all prefixes and suffixes. This requires knowledge of the particular language, and many algorithms have been proposed for each language. For example, in English, the words compressed and compression are both converted to compress.
Stop Words: Some frequent words carry no content. Examples of these words in English include do, does, will, and so on.
N-grams: N-grams are sequences of N words that occur together. They should be preserved, but it is better that they appear in a uniform form.
Tokenization: Tokenization is dividing the text into smaller units, which are often words. In some cases, however, it is difficult to define a word and its scope. A word as a unit of text has the following properties:
Each set of continuous characters is always a word; these characters can even be digits.
The boundary of a word can be a whitespace character, a symbol, or the end of the line.
Sometimes whitespace characters and punctuation do not mark a word or token boundary.
Even in languages such as English and Persian, in which words are separated by whitespace, defining word boundaries is not always an easy task: for example, acronyms that contain periods, words in hypertext links, words separated by symbolic characters, and words containing a space character, like New York [4].
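The preprocessing steps above can be sketched as a small pipeline. The stop-word list and suffix rules below are illustrative stand-ins, not production resources; a real system would use curated stop-word lists and a proper stemmer such as Porter's algorithm.

```python
import re

# Illustrative resources, not real linguistic data.
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "or", "do", "does", "will"}
SUFFIXES = ("ion", "ed", "ing", "s")  # toy English suffix list

def case_fold(text: str) -> str:
    # Bring upper- and lowercase letters into a uniform (lowercase) form.
    return text.lower()

def tokenize(text: str) -> list:
    # Runs of letters and digits are tokens; punctuation marks boundaries.
    return re.findall(r"[a-z0-9]+", text)

def remove_stop_words(tokens: list) -> list:
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token: str) -> str:
    # Naive suffix stripping: compression -> compress, compressed -> compress.
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text: str) -> list:
    return [stem(t) for t in remove_stop_words(tokenize(case_fold(text)))]
```

With this sketch, `preprocess("The compression does reduce Compressed texts")` folds case, drops the stop words and maps both compression and compressed to the same stem.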
One of the important functions of text mining is text clustering. The main aim of text clustering is to place similar sentences in the same clusters. The number of clusters can be determined by the user or automatically by the program. The first step of text clustering is dividing the text into its components and separating the sentences.
There are three major advantages of automatic generation of summaries by machines: the summary size is controllable, its content is predictable, and it can be determined which part of the original text each part of the summary relates to.
An abstractive summarization attempts to express the main concept of the text in clear natural language without necessarily reusing phrases from the text. Each abstractive summarizer consists of a comprehension part, which interprets the text and finds the new concepts, and a production part, which generates a new, shorter text carrying the most important information from the original document. In this method, sentences may be omitted or changed, and even new sentences may be generated. It should be noted that this method is very complicated, even more complicated than machine translation.
In a query-based summary, by contrast, it is assumed that the reader has general knowledge of the topic and is just looking for specific information in the text. In this case, a related summary is created based on the user's question. Most of these summarization systems are extractive.
Independent summarization is much like a generic summary: it accepts text from any field and generates a general summary regardless of the text's scope or type.
Domain-dependent summarization accepts texts from a specific field and of a specific type of literature. There are many specific text patterns, such as news, scientific text, fiction and sports, and the summaries generated by domain-dependent systems follow the input text type. Genre-specific summarization tries to summarize even more specialized fields; this group can abstract specific literature such as sports news, texts related to the field of geography, political news, and so on.
scoring them, and noun-phrase extraction, clustering and ranking. Morphological analysis of words plays an important role in natural language processing and helps to resolve ambiguity in words. The mentioned criteria study word roots and the prefixes and suffixes attached to a word.
Score_title(S_i) = |T ∩ S_i| / sqrt(|T| × |S_i|)   (1)

In Eq. (1), the numerator |T ∩ S_i| is the number of words shared between the sentence and the title, and the denominator is the square root of the product of the title length |T| and the sentence length |S_i|.
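The title-overlap criterion of Eq. (1) can be sketched as follows, with tokenization simplified to whitespace splitting:

```python
import math

def title_similarity(sentence: str, title: str) -> float:
    # Eq. (1): words shared with the title, divided by the square root of the
    # product of the title length and the sentence length.
    s_words = sentence.lower().split()
    t_words = title.lower().split()
    shared = len(set(s_words) & set(t_words))
    return shared / math.sqrt(len(t_words) * len(s_words))
```

For a four-word sentence sharing two words with a three-word title, the score is 2 / sqrt(12).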
PS_i = -C × X_i × log(X_i)   (2)

PS_i is the effective value of the relative position of the i-th sentence, calculated as an entropy-based weight. C is a constant value between zero and one and X_i is the value of the sentence's location.
4.6. Cue-Phrase
Cue-phrases are phrases like "finally", "as a result", "in this paper". Statements containing these words certainly address important topics and must be preserved. Some words also increase the importance of a noun and of its sentence, for example titles like Dr., Mr., and Miss.
Other words, placed at the beginning of a sentence, decrease the importance of the sentence. Such sentences come to complete the meaning of the previous sentence and can be ignored. Some examples are "because", "for example", "therefore" and "thus". The weight of statements containing these words, in both their positive and negative forms, is calculated as Eq. (4).
Score_cue(S_i) = (Sp_i - Sn_i) / S_i   (4)

In Eq. (4), S_i is the number of words in sentence i, Sp_i is the number of positive cue-phrases and Sn_i is the number of negative cue-phrases.
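One possible reading of the cue-phrase weight, the difference between positive and negative cue-phrase counts normalized by sentence length, can be sketched as follows; the cue lists here are illustrative, and real systems use larger curated lists:

```python
# Illustrative cue lists, not exhaustive resources.
POSITIVE_CUES = {"finally", "in conclusion", "as a result", "in this paper"}
NEGATIVE_CUES = {"because", "for example", "therefore", "thus"}

def cue_phrase_score(sentence: str) -> float:
    # One reading of Eq. (4): (positive cues - negative cues) / sentence length.
    text = sentence.lower()
    pos = sum(1 for cue in POSITIVE_CUES if cue in text)
    neg = sum(1 for cue in NEGATIVE_CUES if cue in text)
    return (pos - neg) / len(text.split())
```

Sentences opening with a negative cue such as "therefore" receive a negative weight, while sentences with bonus phrases such as "in this paper" are pushed up.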
4.8. Sentences containing words with different fonts and font effects
Some important or topic-specific words are emphasized: they begin with capital letters or appear in bold, italic or underlined form. Sentences containing such words have a higher chance of appearing in the output.

X_i = Sip / S_i   (5)

In Eq. (5), Sip is the number of such specially formatted terms in sentence i, S_i is the length of the sentence and X_i is the score of the sentence based on the mentioned criterion.
4.13. Pronouns
Sentences with pronouns are less important than those with nouns, since pronouns refer to nouns described in previous sentences. The position of the pronoun in the sentence is also important: a sentence with a pronoun at the beginning is less important, while a sentence whose pronoun is placed in the middle or at the end is somewhat more important. The score of a sentence containing pronouns is calculated as Eq. (7):
Score_pron(S_i) = 1 - (|Sp_i| + BP_i) / S_i   (7)

In Eq. (7), |Sp_i| is the number of pronouns in sentence i, S_i is the sentence length and BP_i is a constant parameter: if a pronoun occurs within the first three words of the sentence, BP_i = 1; otherwise BP_i = 0.
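One possible reading of the pronoun criterion (fewer pronouns, and no pronoun among the first three words, yields a higher score) can be sketched as follows; the pronoun list is illustrative:

```python
# Illustrative pronoun list, not a full closed-class inventory.
PRONOUNS = {"he", "she", "it", "they", "this", "that", "these", "those"}

def pronoun_score(sentence: str) -> float:
    # One reading of Eq. (7): penalize each pronoun, with an extra penalty
    # (BP = 1) when a pronoun appears among the first three words.
    words = sentence.lower().split()
    n_pronouns = sum(1 for w in words if w in PRONOUNS)
    bp = 1 if any(w in PRONOUNS for w in words[:3]) else 0
    return 1 - (n_pronouns + bp) / len(words)
```

A pronoun-free sentence keeps the full score of 1, while a sentence opening with a pronoun is penalized twice: once for the pronoun itself and once for its position.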
A summarization algorithm based on the LSA method consists of three steps: creating the input matrix, applying the SVD method to the created matrix, and extracting sentences. LSA also has some limitations; the most important of them are as follows:
The algorithm does not use information about the arrangement of words in sentences, grammar, or morphological relationships, although this information can be useful for a better understanding of words and sentences.
The algorithm does not use any world knowledge or word database.
As the number of distinct words grows and the data become more heterogeneous, the performance of the algorithm is greatly reduced. This reduction is due to the time and memory complexity of the SVD method.
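The three LSA steps can be sketched with NumPy. The selection strategy here (for each leading singular vector, take the sentence with the largest weight) is one common choice, not the only one:

```python
import numpy as np

def lsa_summarize(sentences: list, n_pick: int = 2) -> list:
    # Step 1: build a binary term-by-sentence matrix.
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    matrix = np.array([[1.0 if w in s.lower().split() else 0.0
                        for s in sentences] for w in vocab])
    # Step 2: apply SVD; each row of vt weights the sentences for one "topic".
    _, _, vt = np.linalg.svd(matrix, full_matrices=False)
    # Step 3: for each leading topic, keep the sentence with the largest weight.
    picked = []
    for row in vt:
        idx = int(np.argmax(np.abs(row)))
        if idx not in picked:
            picked.append(idx)
        if len(picked) == n_pick:
            break
    return sorted(picked)
```

Because the selection walks the singular vectors in order of decreasing singular value, the summary covers the strongest "topics" of the document first.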
S(V_a) = (1 - d) + d × Σ_{V_b ∈ In(V_a)} S(V_b) / |Out(V_b)|   (8)

In Eq. (8), the V are the graph nodes and the E are its edges, In(V_a) is the set of nodes with edges into V_a, Out(V_b) is the set of nodes that V_b points to, and d is an input parameter between zero and one. S(V_a) is the score of V_a and S(V_b) is the score of V_b. The algorithm can also be applied to an undirected graph, but the output summary differs and the time complexity is higher.
An effective proposed graph-based summarization algorithm is the TextRank algorithm [21]. It uses an unsupervised method to extract keywords and sentences, scoring the nodes based on the previously mentioned similarity measures [17]. The algorithm starts with arbitrary node values and iterates until convergence to a predefined threshold. The initial node values have no effect on the final scores.
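The iterative scoring of Eq. (8) can be sketched as a plain power iteration over a directed graph, with adjacency given as a list of out-edges per node and d as the usual damping factor:

```python
def graph_rank(out_edges, d=0.85, tol=1e-6, max_iter=100):
    # Eq. (8): S(Va) = (1 - d) + d * sum over incoming Vb of S(Vb) / |Out(Vb)|.
    n = len(out_edges)
    in_edges = [[] for _ in range(n)]  # who points at each node
    for a, outs in enumerate(out_edges):
        for b in outs:
            in_edges[b].append(a)
    scores = [1.0] * n  # arbitrary start; the fixed point is independent of it
    for _ in range(max_iter):
        new = [(1 - d) + d * sum(scores[b] / len(out_edges[b])
                                 for b in in_edges[a])
               for a in range(n)]
        if max(abs(x - y) for x, y in zip(new, scores)) < tol:
            return new
        scores = new
    return scores
```

In a sentence graph the edges would carry similarity weights; this sketch uses the unweighted form of the recurrence, which already shows that nodes with many incoming edges accumulate higher scores.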
J = Σ_{j=1}^{k} Σ_{x_i ∈ C_j} ||x_i - c_j||²   (9)

Eq. (9) is the k-means objective function: the sum of squared distances between each sentence vector x_i and the centroid c_j of its cluster C_j.
One of the most important issues in k-means clustering is determining the optimal number of clusters. Given that in this method the largest cluster is considered the main topic, the input text size has a direct impact on determining the number of clusters. For example, a large K leads to small clusters and, as a result, a small, dispersed summary with very low correlation, while a small K leads to big, dense clusters and, as a result, a text with little compression.
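The effect of K can be explored with a minimal k-means over bag-of-words sentence vectors. This is a toy sketch; a real system would use TF-IDF vectors and a library implementation:

```python
import random

def sentence_vectors(sentences):
    # Bag-of-words vectors over the joint vocabulary.
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    return [[float(s.lower().split().count(w)) for w in vocab]
            for s in sentences]

def kmeans(vectors, k, iters=20, seed=0):
    random.seed(seed)
    centroids = [list(v) for v in random.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment: nearest centroid by squared Euclidean distance.
        labels = [min(range(k),
                      key=lambda j: sum((a - b) ** 2
                                        for a, b in zip(v, centroids[j])))
                  for v in vectors]
        # Update: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels
```

Running this with different values of k on the same sentence set makes the trade-off above visible: large k fragments related sentences, small k merges distinct topics.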
N.I. Meghana and M.S. Bewoor proposed another technique to create a query-based summarization system using clustering methods [18]. They used the Expectation Maximization (EM) clustering algorithm, and their implementation methodology is divided into two parts: the EM algorithm is implemented in the first part, and query-based summarization is done in the second. The proposed method uses WordNet.
ignored. The problem is more common in multi-document summarization [20]. Sentences expressing similar concepts also create ambiguity in question answering systems.
Several methods have been proposed to combine similar sentences in order to extract more important sentences from the main text. One of the most important sentence-fusion approaches is based on dependency trees: Katja Filippova proposed a directed acyclic graph (DAG) for sentence fusion [12]. Clustering approaches can also be used for combining similar sentences; in these approaches, small clusters are formed to hold very similar sentences, and then each cluster is combined into one sentence.
Precision = |sum_ref ∩ sum_cand| / |sum_cand|   (10)

In Eq. (10), sum_ref denotes the summary extracted by experts and sum_cand denotes the summary extracted by the system.
Another evaluation criterion, ROUGE-N, is calculated by Eq. (11). The measure compares the number of N-grams shared by the machine summary and the human summary.

ROUGE-N = [Σ_{S ∈ sum_ref} Σ_{N-gram ∈ S} Count_match(N-gram)] / [Σ_{S ∈ sum_ref} Σ_{N-gram ∈ S} Count(N-gram)]   (11)

In Eq. (11), N is the N-gram length, Count_match(N-gram) is the maximum number of N-grams co-occurring in the machine summary and the human summary, and Count(N-gram) is the number of N-grams in the human summary.
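ROUGE-N can be sketched for a single reference summary as follows; production evaluations also handle stemming, stop-word options and multiple references:

```python
from collections import Counter

def ngrams(words, n):
    # Multiset of n-grams as tuples of words.
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def rouge_n(candidate, reference, n=1):
    # Eq. (11): clipped n-gram matches divided by the reference n-gram count.
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    matched = sum(min(count, cand[gram]) for gram, count in ref.items())
    return matched / sum(ref.values())
```

Clipping via `min` keeps a candidate from earning credit for repeating the same n-gram more often than the reference contains it.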
CONCLUSION
Text summarization is one of the most exciting research areas in natural language processing. It is an open research area, and a great deal of research is being done on it. Text summarization can be classified into different groups and approaches, the most notable of which were studied in this paper. In this study, the topic of text mining and its relationship with text summarization were first considered. Important design criteria for text summarization systems were presented. Different approaches to summarization and the important parameters needed to rate important sentences were introduced. Finally, important evaluation approaches were introduced.
References
[1] A. Agrawal, U. Gupta, "Extraction based approach for text summarization using k-means clustering", International Journal of Scientific and Research Publications, Vol. 4, Issue 11, 2014.
[2] E. Hovy, C. Lin, "Automated Text Summarization and the SUMMARIST System", in Advances in Automatic Text Summarization, MIT Press, pp. 81-94, 1999.
[4] G.O. Makbule, I. Cicekli, F. Nur Alpaslan, "Text Summarization of Turkish Texts using Latent Semantic Analysis", 23rd International Conference on Computational Linguistics, pp. 869-876, 2010.
[5] G.O. Makbule, "Text Summarization using Latent Semantic Analysis", M.S. thesis, Middle East Technical University, 2011.
[6] G. Salton, “Automatic Text Processing: The Transformation, Analysis, and Retrieval of
Information by Computer”, Addison-Wesley Publishing Company, 1989.
[7] H. Edmundson, "New Methods in Automatic Extracting", Journal of the Association for Computing Machinery, Vol. 16(2), pp. 264-285, 1969.
[8] H. Luhn, "The automatic creation of literature abstracts", IBM Journal of Research and Development, Vol. 2, pp. 159-165, 1958.
[13] Y. Ko, J. Seo, "An Effective Sentence Extraction Technique Using Contextual Information and Statistical Approaches for Text Summarization", Pattern Recognition Letters, 2008.
[14] L. Suanmali, M. Salem, B. Salim, N. Salim, "Sentence Features Fusion for Text Summarization using Fuzzy Logic", IEEE, pp. 142-145, 2009.
[15] M. Pourvali, M. Saniee Abadeh, "Automated text summarization base on lexical chain and graph using of wordnet and wikipedia knowledge base", IJCSI International Journal of Computer Science Issues, Vol. 9, No. 3, 2012.
[16] M. Wasson, "Using leading text for news summaries: Evaluation results and implications for commercial summarization applications", in Proc. 17th International Conference on Computational Linguistics and 36th Annual Meeting of the ACL, pp. 1364-1368, 1998.
[17] N. Alami, M. Meknassi, N. Rais, "Automatic Texts Summarization: Current State of the Art", Journal of Asian Scientific Research, Vol. 5(1), pp. 1-15, 2015.
[18] N.I. Meghana, M.S. Bewoor, S.H. Patil, "Text Summarization using Expectation Maximization Clustering Algorithm", International Journal of Engineering Research and Applications (IJERA), Vol. 2, Issue 4, pp. 168-171, 2012.
[19] R. Barzilay, M. Elhadad, "Using Lexical Chains for Text Summarization", MIT Press, pp. 111-121, 1999.
[21] R. Mihalcea, "Graph-based ranking algorithms for sentence extraction, applied to text summarization", in Proceedings of the ACL 2004 Interactive Poster and Demonstration Sessions, 2004.
[23] T. Hirao, Y. Sasaki, H. Isozaki, "An extrinsic evaluation for question-biased text summarization on QA tasks", in Proceedings of the NAACL Workshop on Automatic Summarization, 2001.
[25] V. Gupta, G.S. Lehal, "A Survey of Text Mining Techniques and Applications", Journal of Emerging Technologies in Web Intelligence, Vol. 1, No. 1, 2009.
[26] W. Al-Sanie, "Towards an infrastructure for Arabic text summarization using rhetorical structure theory", M.S. thesis, Department of Computer Science, King Saud University, Riyadh, Kingdom of Saudi Arabia, 2005.