ScienceDirect
Available online at www.sciencedirect.com
www.elsevier.com/locate/procedia
Procedia Computer Science 142 (2018) 270–277
Abstract

How can we know what is going on in the world with the click of a button? With the increase of digital data everywhere, it is becoming difficult to categorize and retrieve information from such huge data. Topic detection is considered a powerful way to mine data and relate similar documents together. Although the Arabic content on the web is increasing every day, the application of topic detection on Arabic text is not keeping pace with this increase. In this paper we investigate well-known topic detection techniques and the latest significant scholarly articles related to topic detection in general and in the Arabic domain in particular. This survey paper will help researchers interested in the domain of topic detection become familiar with commonly used techniques and stay updated with the latest technologies in this area.
© 2018 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/)
Peer-review under responsibility of the scientific committee of the 4th International Conference on Arabic Computational Linguistics.
Keywords: Topic modeling; LDA; Clustering; Data Mining
1. Introduction

Over the past few years the world has witnessed a huge increase in digital data across the internet. Nowadays people use various types of social media, news sites, blogs, etc. as a source of information and to express themselves in different ways. Along with this massive amount of data, the need to classify and sort these data has become crucial. Sophisticated approaches have been applied to classify these data in an organized manner.
* E-mail address: rafea@aucegypt.edu and nadaaym@aucegypt.edu

1877-0509 © 2018 The Authors. Published by Elsevier B.V.
From the massive amount of data streams coming from different social media, detecting new events and tracking current ones has been an area of interest to many researchers over the past years. The idea of topic detection and tracking originated in 1996 in the US Government Defense Advanced Research Projects Agency (DARPA) within its broadcast news work [1]. The idea has been developed over the years using different techniques, which can be summarized under three main categories: document-pivot approaches, feature-pivot approaches, and probabilistic approaches [2].
The remainder of the paper is structured as follows. Section 2 reviews the document-pivot approach. Section 3 provides an overview of the feature-pivot approach. Section 4 presents the probabilistic topic modelling approaches. Section 5 shows the application of topic detection methods on Arabic texts. Finally, in Section 6 we conclude the paper and provide our remarks.
2. Document-pivot approaches
The document-pivot approach is based on clustering documents about the same topic together based on document similarity. Different clustering methods have been used: in [3], hierarchical agglomerative clustering with time decay was used to identify events in news. The time decay feature helps cluster posts about the same event and detect a new event when one happens. The same method is applied in [4] to detect topics within financial news. Incremental clustering is widely used, especially in detecting topics from social media. The idea of the incremental clustering method is that if a received item exceeds the similarity threshold to existing clusters, it is added to the most similar one; otherwise a new cluster is created with this item [5]. Incremental clustering is well suited to online topic detection because the number of clusters does not need to be known in advance.
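As a minimal illustration of the incremental clustering idea described above (a sketch, not the exact algorithm of any cited system; the similarity threshold and the simple running-mean centroid update are assumptions made for the example):

```python
import numpy as np

def incremental_clustering(items, threshold=0.5):
    """Assign each incoming item (a feature vector) to the most similar
    existing cluster, or open a new cluster when no cosine similarity
    reaches the threshold."""
    centroids = []    # one running centroid vector per cluster
    assignments = []  # cluster index chosen for each item, in arrival order
    for x in items:
        x = np.asarray(x, dtype=float)
        if centroids:
            sims = [float(np.dot(x, c) / (np.linalg.norm(x) * np.linalg.norm(c)))
                    for c in centroids]
            best = int(np.argmax(sims))
            if sims[best] >= threshold:
                # merge into the most similar cluster and update its centroid
                centroids[best] = (centroids[best] + x) / 2.0
                assignments.append(best)
                continue
        # no cluster is similar enough: start a new one
        centroids.append(x)
        assignments.append(len(centroids) - 1)
    return assignments, centroids
```

Note that, unlike k-means, the number of clusters grows with the stream, which is exactly why this family of methods fits online detection.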
With the rapid growth of social media, Twitter needed different handling due to the nature of its short posts. An enhancement to incremental clustering for online event detection from Twitter is presented in [6]. The enhancement is based on representing the centroid of the cluster as a feature vector, so when a new tweet arrives, a pairwise similarity between the features of the tweet and the centroid is computed. The weights of the features are updated in the tweet vector and in the centroid vector of the active cluster. They called this method the Incremental Clustering Vector Expansion (ICVE) method. Their experiments showed a significant improvement in terms of precision, recall, and F-measure compared to the traditional incremental clustering method.
Tackling the area of online topic detection from social media, 'TwitterNews+' is presented in [7]. Although the incremental clustering method has lower computational complexity than other methods, when applied online this cost can still be high. The system reduces it by discarding old clusters after a time threshold to make room for new ones; a time stamp is assigned to each cluster based on the time stamps of the tweets in it. The system achieved a recall of 0.96 and a precision of 0.89 against state-of-the-art techniques, including an older version of the system, 'TwitterNews' [8], and the first story detection method in [9].
3. Feature-pivot method
This approach relies on statistical methods to extract a set of terms representing the topics. It uses the similarity and co-occurrence of terms to detect a topic. This method has been adopted in much of the research on topic detection from Twitter, as it suits the limited number of words in a tweet. In [10], emerging topics were detected by considering the posting time of a tweet and its growth/decay in a certain time interval; the author of the tweet was also considered as a feature to better cluster tweets related to the same topic together. 'TwitterMonitor', presented in [11], detects bursty terms that suddenly appear with high frequency, indicating a new topic. It clusters those terms based on their probabilistic co-occurrence to identify the topics. Additional information from the tweets, such as geo-location and news sources, is used in a post-processing phase to better visualize the results.
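As a rough illustration of bursty-term detection (a simplified sketch, not the actual TwitterMonitor algorithm; the burst ratio and minimum-count thresholds are assumptions), a term can be flagged as bursty when its frequency in the current time window far exceeds its historical average:

```python
from collections import Counter

def bursty_terms(current_window, history_windows, ratio=3.0, min_count=5):
    """Flag terms whose count in the current window exceeds `ratio` times
    their average count over past windows. Each window is a list of
    tokenized posts (each post a list of terms)."""
    now = Counter(t for post in current_window for t in post)
    past = Counter(t for w in history_windows for post in w for t in post)
    n_past = max(len(history_windows), 1)
    burst = []
    for term, count in now.items():
        avg_past = past[term] / n_past
        # require a minimum absolute count so rare noise is not flagged
        if count >= min_count and count > ratio * max(avg_past, 1.0):
            burst.append(term)
    return burst
```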
Four approaches based on the feature-pivot approach are presented in [12] and compared to document-pivot as a baseline. The first approach is the graph-based feature-pivot method, which uses the structural clustering algorithm for networks (SCAN) [13] to group terms together. The algorithm groups nodes that share similarities into what is called a community; a node connected to more than one community is called a hub. The nodes of the graph are the terms, and the communities represent the topics. By detecting the hubs, related topics can be grouped based on the number of hubs connecting them. The second algorithm presented is frequent pattern mining (FPM), which relies on
pairwise co-occurrence between unigrams. The third algorithm is Soft FPM (SFPM), which extends FPM to group sets of co-occurring unigrams, not just pairs. The fourth, called BNgram, considers n-gram co-occurrences. All the algorithms use a modified version of term frequency-inverse document frequency (TF-IDF) that considers the change of frequency over time. All algorithms were applied to three datasets collected from Twitter during three major events about sports, politics, and a social event in the USA in 2012, and were evaluated using topical recall, keyword precision, and keyword recall measures. The results showed that BNgram and the document-pivot method achieved the highest topical recall score of 0.7692, followed by SFPM with a score of 0.615. Regarding keyword precision, FPM achieved a precision score of 1 on one of the datasets and 0.75 on another, but zero on the third dataset. For keyword recall, SFPM achieved a recall score of 0.8982. It is worth noting that the superiority of one algorithm is not consistent across the three datasets. This is related to the nature of the targeted events, as the structure and coherence of topics differ from one domain to another.
A comparison between the feature-pivot and document-pivot approaches was presented in [14]. The feature-pivot approach is based on grouping co-occurring unigrams according to their proportional frequency in the set of tweets they occur in and their frequency in the whole dataset, while the document-pivot approach used the bisecting k-means clustering technique. Both approaches were applied to an Egyptian Twitter dataset; the feature-pivot approach achieved an F-score of 0.923 compared to 0.8 achieved by document-pivot. Validating the results on datasets of different sizes from three domains (sports, entertainment, and politics), the average F-score of the feature-pivot approach was 0.83 compared to 0.56 for the document-pivot approach.
For online event detection, feature-pivot approaches do not work as well in new event detection: in [7], 'TwitterNews+', which is based on the document-pivot approach, achieved a recall score of 0.96 and a precision score of 0.89, while two feature-pivot based approaches presented in [15] and [16] achieved recall scores of 0.58 and 0.71, and precision scores of 0.55 and 0.64, respectively.
4. Probabilistic topic modelling approaches

Topic modeling is an area of research that focuses on classifying data into groups [17]. It is prominent for modeling discrete data and gives a productive approach to finding hidden structures in huge data [18]. The way topic modeling works can be simply described as treating each document as a mixture of topics with different probabilities; mathematical formulas are applied to find the probability of each topic in each document.
The idea of topic modeling can be traced back to the 1990s, when Latent Semantic Analysis (LSA), formerly known as latent semantic indexing (LSI), was presented as a novel approach for retrieving documents not just by the occurrence of query words in them but based on the conceptual content these words imply [19]. Since then, clustering textual data based on content has been a growing area of research, improving each day. In 1999, probabilistic latent semantic analysis (PLSA) was presented in [20] and was a great contribution towards the enhancement of topic modeling approaches. In 2003 the famous Latent Dirichlet Allocation (LDA) approach was presented in [21]. LDA is a generative probabilistic model of a corpus; the main idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. The major advantage of LDA over LSA and PLSA is its ability to perform dimensionality reduction and to be embedded in more complex methods.
Many variants of topic modeling techniques have appeared since then, yielding approaches that can be applied in various domains, offering a better understanding of large data and more coherent resulting topics. In this section we present the well-known probabilistic topic modelling techniques, Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA). Then significant recent works, between 2017 and 2018, investigating variations of LDA are presented.
Before presenting PLSA, we give a brief background on LSA to provide more insight into why PLSA was introduced. LSA is a topic modeling technique in natural language processing. Its main objective is to create a vector-based representation of texts that helps group related words [22]. A detailed review of LSA in [23] explains the technical aspects of the approach. The mathematical foundation of Latent Semantic Analysis is the Vector Space
Model (VSM), an algebraic model for representing documents as vectors in a space where dictionary terms are used as dimensions. Using matrix notation, VSM represents a collection of d documents (a corpus) in a space of t dictionary terms as the t × d matrix X. Two main term reduction techniques are applied to matrix X: stop word removal and stemming. TF-IDF is one of the most common techniques used for weighting the entries of the matrix. The similarity between terms in documents is calculated using different similarity metrics, one of the most common being cosine similarity. The main contribution of LSA is that it considers not only the similarity of terms in documents but also takes related terms into consideration to give more insight into the topic. The LSA approach was compared in [19] against a straightforward term-matching method, and the precision of LSA in detecting relevant documents outperformed the other approach by 13%.
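A minimal LSA sketch, assuming scikit-learn for the TF-IDF weighting and truncated SVD (the toy documents and the choice of two latent dimensions are illustrative only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a cat and a dog played",
    "stock markets fell sharply today",
    "investors sold stock as markets dropped",
]

# TF-IDF weighting of the term-document structure (sklearn builds d x t)
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Truncated SVD projects documents into a low-dimensional latent space
lsa = TruncatedSVD(n_components=2, random_state=0)
Z = lsa.fit_transform(X)

# Documents about the same theme end up closer in the latent space
sims = cosine_similarity(Z)
```

Here the two financial documents (rows 2 and 3) come out far more similar to each other than either is to the animal-themed documents, even though they share few exact terms.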
LSA is used in many applications such as online customer support, spam filtering, and text summarization. Among the drawbacks of LSA are its high computational and memory costs, as well as the difficulty of determining the optimal number of dimensions to use for singular value decomposition (SVD). An application of LSA to the 2016 presidential debates in the USA is investigated in [24]; it could capture the policy adopted by each candidate and the change in topics based on people's reactions.
PLSA was first introduced in 1999 [20]. This approach has important theoretical advantages over standard LSA, since it is based on the likelihood principle, defines a generative data model, and directly minimizes word perplexity. It can also take advantage of standard statistical methods for model fitting, overfitting control, and model combination. The core of PLSA is a statistical model called the aspect model. It is a latent variable model for general co-occurrence data which associates an unobserved class variable z ∈ {z_1, …, z_K} with each occurrence of a word w ∈ {w_1, …, w_M} in a document d ∈ {d_1, …, d_N}.
The Expectation Maximization (EM) algorithm is commonly used for maximum likelihood estimation in latent variable models. It has two main steps:
• The expectation step (E), where posterior probabilities are calculated for the latent variable z based on the current estimates of the parameters.
• The maximization step (M), where the parameters are updated given the posterior probabilities calculated in the (E) step.
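The two EM steps for the PLSA aspect model can be sketched in NumPy as follows. This is a didactic dense implementation (real corpora need sparse handling); the iteration count, random initialization, and the small smoothing constant that avoids division by zero are assumptions made for the sketch:

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """EM for the PLSA aspect model on an (n_docs x n_words) count matrix.
    Returns P(z|d), shape (n_docs, n_topics), and P(w|z), shape (n_topics, n_words)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_z_d = rng.random((n_docs, n_topics)); p_z_d /= p_z_d.sum(1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words)); p_w_z /= p_w_z.sum(1, keepdims=True)
    for _ in range(n_iter):
        # E-step: posterior P(z|d,w) for every (document, word) pair
        post = p_z_d[:, :, None] * p_w_z[None, :, :]          # shape (d, z, w)
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate parameters from expected counts n(d,w)*P(z|d,w)
        weighted = counts[:, None, :] * post
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z
```

On a small block-structured count matrix, the recovered P(z|d) separates the two groups of documents into distinct dominant topics.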
Regarding word perplexity, both approaches were applied to more than one dataset and their performance was compared to a unigram baseline. PLSA could reduce perplexity by more than a factor of 3, while LSA achieved less than a factor of two.
PLSA suffers from drawbacks related to its number of parameters and to discovering new topics when the data increases significantly. A recent approach addressing the problem of discovering new topics from large textual data is presented in [25]: the Weighted Incremental PLSA (WPLSA) algorithm. The authors compared their approach to standard PLSA, maximum a posteriori PLSA (MAP-PLSA), and Quasi-Bayes PLSA (QB-PLSA). Using the perplexity measure, WPLSA achieved lower values than the other approaches across different numbers of topics.
The idea of latent Dirichlet allocation (LDA) continues from PLSA, trying to solve the model's disadvantages regarding the number of parameters, which grows linearly with the size of the corpus, and the difficulty of assigning a probability to a new document outside the trained corpus. LDA builds on the fundamental probabilistic feature both LSA and PLSA were built upon, the "bag of words" assumption: in the bag-of-words model, the order of words is not taken into consideration, applying the exchangeability property of words. The LDA approach was presented for the first time in [21]. It is defined as a generative probabilistic model of a corpus; the basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words.
The model assumes the following generative process for each document w in a corpus D:
1. Choose N ~ Poisson(ξ)
2. Choose θ ~ Dir(α)
3. For each of the N words w_n:
− Choose a topic z_n ~ Multinomial(θ)
− Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n
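The generative process above can be simulated directly. In this sketch the number of topics, vocabulary size, priors, and the randomly drawn per-topic word distributions β are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V = 3, 8                     # number of topics, vocabulary size
alpha, xi = 0.5, 10             # Dirichlet prior, Poisson mean document length
beta = rng.dirichlet(np.ones(V), size=K)   # per-topic word distributions

def generate_document():
    N = rng.poisson(xi)                        # 1. N ~ Poisson(xi)
    theta = rng.dirichlet(alpha * np.ones(K))  # 2. theta ~ Dir(alpha)
    words = []
    for _ in range(N):                         # 3. for each of the N words
        z = rng.choice(K, p=theta)             #    z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])           #    w_n ~ p(w | z_n, beta)
        words.append(int(w))
    return words

doc = generate_document()
```

Inference in LDA inverts this process: given only the observed words, it estimates θ and β.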
Comparing LDA to PLSA and the mixture of unigrams model in [21], the perplexity of LDA at 100 topics was lower than that of PLSA by 16% and lower than the mixture of unigrams model by 45% on one dataset. It was also shown that the other approaches did not perform well when presented with new documents different from those in the training set. Regarding the application of LDA on social media, especially Twitter, the feature-pivot based approaches in [12] achieved better topical recall, keyword recall, and keyword precision than LDA over three datasets; on one of the datasets LDA scored zero on all three evaluation measures. In [7], LDA achieved a recall score of 0.45 and a precision score of 0.49, while feature-pivot approaches and 'TwitterNews', a document-pivot based approach, achieved significantly better results for online event detection from Twitter.
With the variation of digital content, it became more intuitive to look beyond just the words in the documents, and researchers started to augment documents with associated auxiliary data for better topic quality. A supervised approach tackling the idea that different contexts hold different sentiments for the same word, which implies different topic features, discriminatively objective-subjective LDA (dosLDA), is proposed in [26]. The basic idea their approach is built on is the Bag of Discriminative Words (BoDW). They elaborated with an example: the same word, such as "bug", can appear in scientific research where it means an insect and carries a neutral or perhaps positive sentiment, but it can also appear in a software document meaning a "problem", which holds a negative sentiment. Taking this into consideration helps the resulting topics be of better quality. They applied their model to different datasets, including Twitter, Flickr, and a multi-domain sentiment dataset. They defined the subjective part as the sentiment (positive, neutral, and negative) and the objective part as the categories of documents, and showed that the BoDW representation is more predictive than the Bag of Topics (BoT) [27] representation used for discriminative tasks.
Probabilistic topic modeling is mostly based on a three-layer hierarchical Bayesian structure [21], where each document is modelled as a probability distribution over topics and each topic is a probability distribution over words. A new approach suggests adding a latent concept layer between the topic layer and the word layer, so that each topic is a probability distribution over concepts and each concept is a probability distribution over words. The unsupervised model Conceptual LDA (CLDA) and the supervised model Conceptual Labelled LDA (CLLDA) are presented in [28]. A concept knowledge base tool called Probase [29] is used to get the probability distribution of each concept over words, and Gibbs sampling is used in the estimation of the parameters. The models were applied to different datasets, compared to the traditional Labeled LDA (LLDA) [30] model, and showed better results.
5. Topic detection on Arabic texts

Although the topic detection area of research has been actively evolving over the past years, we can hardly find solid applications of topic modelling approaches to Arabic text. The amount of research on Arabic is falling behind the increase of Arabic content on the web [31], mainly due to the different dialects used over the internet, the decreasing use of standard Arabic in social media and other blogs, and the lack of annotated corpora for categorization. In this section we present the most significant works related to topic modelling and topic detection in the Arabic language from 2011 to 2018.
In 2011, LDA was applied to real-world Arabic datasets collected by the authors in [31], based on 'Echorouk', 'Reuters', and 'Xinhua' web articles between 2008 and 2009. The research focused on investigating the effect of different stemmers on the results of topic modelling; the experiments showed that proper use of stemming yields better topics. Another application of LDA is presented in [32] to discover the thematic structure of the Quran. Each chapter of the Quran is considered a document, and LDA was applied to extract the topics in each chapter. The algorithm was able to identify the major topics in each chapter of the Quran, characterized by the distinct themes of the Makki and Madani chapters.
A comparative study between LDA and the k-means clustering technique is presented in [33]. The objective of the research was to compare the influence of the morpho-syntactic characteristics of the Arabic language in the pre-processing phase on the performance of topic detection using both approaches. They used an Arabic corpus called "OSAC" (open
source Arabic corpus) collected from multiple sites, consisting of 11 topics. Their results showed that LDA performed better when applied to raw documents. After applying stop word removal, both approaches yielded better results. However, when a stemmer was applied, ambiguity in the words increased, resulting in performance degradation. The authors related this degradation to the parameters used in the pre-processing phase and suggested including other parameters, such as lemmatization, to enhance the performance.
Recently, tackling Arabic dialects and their challenges, a topic and sentiment model was applied to Colloquial (Maghrebi) Arabic in [34]. The corpus used was collected from Facebook pages; a supervised approach was applied to extract the sentiment, then LDA was applied to extract the topics. They proposed a new semi-supervised approach combining topic and sentiment to join each topic to a specific sentiment. A sentiment layer between the document layer and the topic layer was added, where sentiment labels are associated with documents, topics are associated with sentiment labels, and words are associated with both sentiment labels and topics. The results were promising, yet some challenges need to be addressed. One challenge is the lack of a lexicon for the dialect, so the authors had to build their own, which was not large enough. Another is that the stemmer did not perform well with the dialect, so more improvement is needed.
In 2018, an approach combining k-means and topic modeling was presented in [35]. They used a Modern Standard Arabic news dataset composed of 2700 documents over 9 topics. The k-means algorithm is applied to cluster the documents; to cluster topics, LDA is applied before the clustering algorithm. The mean-normalized vectors of the data act as input to the LDA algorithm. After that, the output probabilities of topics in documents are mean-normalized and k-means clustering is applied. In this manner the feature dimensionality is reduced thanks to LDA. The model was evaluated by applying it to different datasets: "Aljazeera", "Alkhaleej", "Alwatan", "BBC", and "CNN". The results showed that the combined approach could detect topics with an F-score of 0.8163, while k-means alone scored 0.551.
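A pipeline in this spirit (LDA for dimensionality reduction followed by k-means) can be sketched with scikit-learn. The toy documents are illustrative, and simple L2 normalization stands in for the mean normalization used in the paper:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

docs = [
    "parliament passed the new election law",
    "the minister discussed the election results",
    "the team won the football match",
    "the striker scored in the final match",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)

# LDA reduces each document to a low-dimensional topic distribution
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Normalize the topic proportions, then cluster documents with k-means
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    normalize(doc_topics))
```

The key point is that k-means operates on a dense, low-dimensional topic space instead of the sparse, high-dimensional term space.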
In the domain of Arabic social media, we can summarize the significant works in this area. For detecting events in Arabic social media, the research presented in [36] tackles the detection of disruptive events from Twitter; their model is based on the co-occurrence of terms over time. Another work in [37] presented an end-to-end event detection framework comprising six main components: data collection, pre-processing, classification, feature selection, topic clustering, and summarization. A dataset of over 16 million Arabic Twitter messages was used. They focused on the temporal, spatial, and textual features of each cluster to help detect the event. They compared their results to LDA and showed that LDA could not perform well on tweets as short messages. Considering the detection of bursty features from Arabic Twitter, [38] investigated a new technique based on TF-IDF, entropy, and stream chunking. They collected tweets from Egypt using the Twitter API between December 26th, 2015 and May 20th, 2016. The results were compared against known events that occurred on the days of streaming and showed that the technique could capture the bursty terms related to events happening in real life. A comparison of different clustering techniques in the application of the document-pivot approach to Egyptian Twitter datasets is presented in [39]. The results showed that bisecting k-means performed better than other methods such as agglomerative and traditional k-means.
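Bisecting k-means, the best performer reported in [39], repeatedly splits a cluster with 2-means until the target number of clusters is reached; a minimal sketch (splitting the largest cluster at each step is one common variant, assumed here):

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, seed=0):
    """Split the largest cluster with 2-means until k clusters exist.
    Returns an integer label per row of X."""
    clusters = [np.arange(len(X))]          # start with one cluster of all rows
    while len(clusters) < k:
        # pick the largest cluster and bisect it with standard 2-means
        biggest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        idx = clusters.pop(biggest)
        halves = KMeans(n_clusters=2, n_init=10,
                        random_state=seed).fit_predict(X[idx])
        clusters.append(idx[halves == 0])
        clusters.append(idx[halves == 1])
    labels = np.empty(len(X), dtype=int)
    for c, idx in enumerate(clusters):
        labels[idx] = c
    return labels
```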
6. Conclusion

This paper reviewed three main approaches in the topic detection area of research: document-pivot, feature-pivot, and probabilistic topic modeling. The performance of each approach relied heavily on the datasets used. The feature-pivot approach appeared more suitable for offline Twitter and short-message datasets than the document-pivot approach, while for online topic detection from Twitter, incremental clustering as a document-pivot approach performed better. The traditional LDA method could not perform well on social media, yet variants of it such as dosLDA [26] and CLDA [28] achieved significant improvements. Applying topic detection techniques to Arabic text still needs more research; traditional methods have been applied, yet more improvement is on the way. Probabilistic topic modelling methods are an active area of research, and new algorithms are presented each day to enhance the results of topics detected from large textual corpora. Future work includes applying word embeddings to enhance the capture of related topics based on their content. Combining LDA and word2vec representations for topic detection can be found in [40]. Dynamic embedding, presented in [41], formulates word embeddings with conditional probabilistic models. To enhance the results of topic detection for short texts, auxiliary word
embeddings are used in [42]. This paper provided a summary of different methods for topic detection and the recent research in this area, to serve as guidance for researchers interested in this domain.
References
[1] J. Allan, “Introduction to Topic Detection and Tracking,” in Topic Detection and Tracking, vol. 12, J. Allan,
Ed. Boston, MA: Springer US, 2002, pp. 1–16.
[2] N. Alkhamees and M. Fasli, “Event detection from social network streams using frequent pattern mining with
dynamic support values,” in 2016 IEEE International Conference on Big Data (Big Data), Washington, DC, USA, 2016, pp. 1670–1679.
[3] X. Dai and Y. Sun, “Event identification within news topics,” in 2010 International Conference
on Intelligent Computing and Integrated Systems, Guilin, China, 2010, pp. 498–502.
[4] X.-Y. Dai, Q.-C. Chen, X.-L. Wang, and J. Xu, “Online topic detection and tracking of financial news based
on hierarchical clustering,” in 2010 International Conference on Machine Learning and Cybernetics,
Qingdao, China, 2010, pp. 3341–3346.
[5] H. Becker, M. Naaman, and L. Gravano, “Beyond Trending Topics: Real-World Event Identification on
Twitter,” 2011.
[6] O. Ozdikis, P. Karagoz, and H. Oğuztüzün, “Incremental clustering with vector expansion for online event
detection in microblogs,” Social Network Analysis and Mining, vol. 7, no. 1, Dec. 2017.
[7] M. Hasan, M. A. Orgun, and R. Schwitter, “Real-time event detection from the Twitter data stream using the
TwitterNews+ Framework,” Information Processing & Management, Mar. 2018.
[8] M. Hasan, M. A. Orgun, and R. Schwitter, “TwitterNews: Real time event detection from the Twitter data
stream,” PeerJ PrePrints, vol. 4, 2016.
[9] S. Petrovic, M. Osborne, and V. Lavrenko, “Streaming First Story Detection with application to Twitter,”
presented at the Human language technologies: The 2010 annual conference of the north american chapter of
the association for computational linguistics, 2010, pp. 181–189.
[10] M. Cataldi, L. Di Caro, and C. Schifanella, “Emerging topic detection on Twitter based on temporal and
social terms evaluation,” in Proceedings of the Tenth International Workshop on Multimedia Data Mining -
MDMKDD ’10, Washington, D.C., 2010, pp. 1–10.
[11] M. Mathioudakis and N. Koudas, “TwitterMonitor: trend detection over the twitter stream,” in Proceedings of
the 2010 international conference on Management of data - SIGMOD ’10, Indianapolis, Indiana, USA, 2010,
p. 1155.
[12] L. M. Aiello et al., “Sensing Trending Topics in Twitter,” IEEE Transactions on Multimedia, vol. 15, no. 6,
pp. 1268–1282, Oct. 2013.
[13] X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, “SCAN: a structural clustering algorithm for networks,” in
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining -
KDD ’07, San Jose, California, USA, 2007, p. 824.
[14] N. A. Mostafa, “Trending Topic Extraction from Social Media,” American University in Cairo, Egypt, 2016. http://dar.aucegypt.edu/bitstream/handle/10526/4691/Trending_Topic_Extraction_from_Social_Media_Nada_Ayman.pdf?sequence=1
[15] A. Guille and C. Favre, “Event detection, tracking, and visualization in Twitter: a mention-anomaly-based
approach,” Social Network Analysis and Mining, vol. 5, no. 1, Dec. 2015.
[16] J. Benhardus and J. Kalita, “Streaming trend detection in Twitter,” International Journal of Web Based
Communities, vol. 9, no. 1, p. 122, 2013.
[17] C. H. V. R. Padmaja, S. Lakshmi Narayana, and C. H. Divakar, “Probabilistic topic modeling and its variants – a survey,” International Journal of Advanced Research in Computer Science, vol. 9, no. 3, pp. 173–177, Jun. 2018.
[18] H. Jelodar, Y. Wang, C. Yuan, and X. Feng, “Latent Dirichlet Allocation (LDA) and Topic modeling: models,
applications, a survey,” arXiv:1711.04305 [cs], Nov. 2017.
[19] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, “Indexing by latent semantic analysis,” Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391–407, 1990.
[20] T. Hofmann, “Probabilistic Latent Semantic Analysis,” Proceedings of the Fifteenth conference on