
Semantic Graph Based Approach for Text Mining

Chandra Shekhar Yadav, Aditi Sharan, Manju Lata Joshi
School of Computer & Systems Sciences, JNU, New Delhi, India
chandr28_scs@jnu.ac.in, aditisharan@jnu.ac.in, manjulatajoshi@gmail.com

Abstract- A semantic network is a graphical notation for representing knowledge in the form of interconnected nodes and arcs. In this paper we propose a novel approach to construct a semantic graph from a text document. Our approach considers all the nouns of a document and builds a semantic graph that represents the entire document. We believe that this graph captures many properties of the text document and can be used for different applications in the field of text mining and NLP, such as keyword extraction and determining the nature of the document. Our approach to constructing a semantic graph is independent of any language. We performed an experimental analysis to validate our results for extracting keywords of a document and deriving the nature of its graph. We present experimental results on graph construction over the FIRE data set and present its application for keyword extraction and for commenting on the nature of the document.

Keywords- Semantic graph, Language Network, Keyword extraction, Nature of Document, Text mining, WordNet.

I. INTRODUCTION

Traditional text mining generally follows a bag-of-words approach, in which a text is considered a collection of words or phrases. The most popular models in such cases are the vector space model (VSM) and n-gram models. Some work has been done where sentence based models have been proposed [15], although developing a computationally feasible model for such cases is a difficult task. In terms of computational processing, a text is considered unstructured data. The VSM considers a text document as a vector of words, and vector based measures are used for text mining tasks. However, text is made up of multiple units: words, phrases, lines and entire paragraphs. Text itself is a single entity, in which words, sentences, phrases and paragraphs are connected to each other through semantic relations that contribute to the overall meaning and maintain the cohesive structure and discourse unity of the text. A VSM based approach does not consider these relations. Considering these limitations, another approach can be used, in which text is represented as a graph whose nodes represent linguistic entities such as words or sentences and whose edges represent the relationships between these entities. The source text corpus can therefore be represented as a language network, which allows derivation of the overall meaning of the text with minimal loss of information about the relations between words, phrases, lines and paragraphs.

The objective of this paper is twofold. First, we propose a method for constructing a semantic graph for a text document. Once the semantic graph is constructed, graph theoretic and network analysis techniques can be used to analyze it. The result of this analysis can be used for various text mining applications. As a second objective, we discuss its importance for different text mining applications and present two specific applications: keyword extraction and finding the "nature of the document".

Our paper is divided into five sections. Section II presents the related work done in the area of language networks. Section III is a theoretical description of our work, including all the steps to preprocess the document, and then presents a new approach for building a semantic graph from a text document. Section IV deals with experiments and results on two different data sets; the experiments validate the two different applications of the language network. Section V concludes our work with future directions, and Dataset-2 is attached in the Appendix.

II. RELATED WORK

Graph theory and the fields of natural language processing and information retrieval are well-studied research disciplines, although these areas have traditionally been treated as distinct, with different algorithms, different applications and different potential end-users. Recent research [1, 2, 4] has shown that these disciplines are intimately connected, with a large variety of natural language processing and information retrieval applications finding efficient solutions within a graph-theoretical framework.

Any natural text can be represented as a language network. A language network is formed by considering the words, or the concepts, as nodes, with the relations among them represented as edges of the network. Once a text is represented as a network, a variety of tools for network and graph analysis can be applied to perform quantitative analysis and categorization of textual data, detect closely connected concepts, identify the key concepts that produce meaning, and perform comparative analysis of several texts [2]. Researchers have studied the various methods of quantitative network analysis and their potential applications in the area of text mining. In the last few years, the approach of visualizing and understanding text by converting it into a graph and analyzing its structural properties has gained momentum.

Till now, most of the work done has been based on co-occurrence based language networks. Dmitry [2] focused on identifying the pathways for meaning circulation within the text. They used the proximity of concepts and the density of their connections to encode the relations between the words, not their meanings or affective relations. They performed this by visualizing the

978-1-4799-2900-9/14/$31.00 ©2014 IEEE 596


text in the form of graphs. Further, they take into account the probabilistic co-occurrence of words in a text to identify topics, using graph analysis methods along with qualitative and quantitative metrics.

Another work in this direction was done by Kang and Kim [3]. Their work is based on concept based information retrieval (CBIR), which captures semantic relations between words in order to identify the importance of a word. These semantic relations were explored by using an ontology. Most of the work on CBIR, including Kang and Kim [3], has been done for the English language. Sharan and Joshi [4] modified their approach and applied it to Hindi text documents. The basic motivation of that work was to provide an efficient structure for representing concept clusters and to develop an algorithm for identifying them. They also suggested a way of assigning weights to words based on their semantic importance in the document. In paper [4], the use of the Hindi WordNet ontology was explored for CBIR from Hindi text documents. This work was done on a small set of documents taken from Wikipedia but achieved good results compared to the TF-IDF approach.

Liu and Wang [5] used a language network for keyword extraction: first they build a semantic network for a single document, and then weighted PageRank is applied on the network to decide the importance of each word. Finally, the top-ranked words are selected as keywords of the document. The weights they assign between nodes are based on the similarity of the words' senses.

III. PROPOSED APPROACH

In this paper we propose an approach and provide an algorithm to construct a semantic graph from a text document. This graph is in fact a semantic language network. The paper further explains two different applications of the semantic network: one is keyword extraction and the other is deriving the nature of the document. Keywords briefly describe the content of a document and thus play a key role in text indexing, summarization and categorization. However, most existing keyword extraction approaches require human-labeled training sets. In this paper, we propose an automatic keyword extraction approach using the semantic graph.

Our approach is able to construct a semantic graph for any document irrespective of its language and domain, provided a linguistic ontology for that language is available.

A. Steps of the Algorithm

Our approach takes a text file as input. First the file is preprocessed, tagged, and all nouns are extracted. The graph then shows ontology based relations on these nouns. We have used the WordNet ontology to find relations (synonymy, hypernymy/hyponymy and meronymy/holonymy) among the words. The steps for constructing the graph are as follows:

Step 1: Normalization and Stemming

First we normalize the text by transforming all capital letters to lowercase, so that two or more words are treated as the same if they differ only in capitalization. For example, "Nature" and "NATURE" are both transformed into "nature".

The stemming algorithm allows morphological variants of words to be considered the same. Our stemming process has two steps [9]. First we apply WordNet lemmatization, which transforms a word into the nearest word defined in WordNet. For example, the word "COOKS" is transformed into "COOK", but for "COOKING" the output remains "COOKING", because "COOKING" is separately defined in WordNet as a verb with its own sense id. The second step of the stemming process applies a regular expression stemmer that is used in very specific cases: we strip just "ing" from the word, after which "COOKING" becomes "COOK".

Step 2: Tag Document using a POS tagger

A part-of-speech tagger is a piece of software that reads text in some language and assigns a part of speech to each word (and other tokens), such as noun, verb, adjective, etc., although computational applications generally use more fine-grained POS tags such as 'noun-plural'. We use the online version of the Stanford parser [14] available at http://nlp.stanford.edu:8080/parser/.

Step 3: Extract nouns from the tagged document and form concepts

After step 2, the tagged document consists of words and their corresponding parts of speech. In this step we extract all nouns from the tagged document, considering proper nouns, singular nouns and plural nouns. The extracted nouns are converted to concepts using the following definition.

Definition 1 (concept): Let N = {N1, N2, ..., Ni} be the set of nouns in a document and R = {Synonymy} be a lexical relation. C is a concept if C consists of a group of nouns that are synonyms of each other.

Steps 4 and 5: Build the wordlist

After deriving the concepts, we make a list of all concepts. For each concept we select a representative word; to keep it simple, we select the representative word wj of concept ci by picking the first word in the concept. The output of this step is a wordlist, which can be represented as Wlist = {w1, w2, ..., wi}. In the next step we treat these words as nodes of the graph and find the relations between them using WordNet. It is important to note that WordNet does not cover proper nouns such as place or person names, so such words are treated as single nodes and we cannot find any relationship between them.

B. Proposed Algorithm

The proposed algorithm for construction of the semantic graph is described in Figure 1. Here a Relation-id is assigned just to distinguish the edges for better analysis when needed; otherwise the weight has no meaning. We have chosen Relation-ids 1, 2 and 3 for synonymy, member holonymy and hypernymy respectively.

In Section IV, we present exhaustive experimental work along with results for different applications of text mining. In these experiments we use a network analysis approach for identifying the nature of the graph and for keyword extraction.
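The concept-building and graph-construction steps above can be sketched in code. The fragment below is a minimal illustration, not the authors' implementation: a tiny hand-made lexicon stands in for the WordNet lookups (the paper uses NLTK with WordNet 3.0), and the word pairs in it are our own illustrative assumptions; only the Relation-id convention (1 = synonymy, 3 = hypernymy) follows the paper.

```python
from itertools import combinations

# Toy stand-in for WordNet lookups; the word pairs below are assumptions
# made for illustration only. Relation-ids follow the paper's convention:
# 1 = synonymy, 3 = hypernymy.
SYNONYMS = {("cook", "chef"), ("crisis", "emergency")}
HYPERNYMS = {("chopper", "aircraft")}

def relation_id(w1, w2):
    """Return the Relation-id between two words, or 0 if none is defined."""
    if (w1, w2) in SYNONYMS or (w2, w1) in SYNONYMS:
        return 1
    if (w1, w2) in HYPERNYMS or (w2, w1) in HYPERNYMS:
        return 3
    return 0

def build_concepts(nouns):
    """Step 3: group nouns into concepts; synonyms share one concept."""
    concepts = []
    for noun in nouns:
        for concept in concepts:
            if any(relation_id(noun, w) == 1 for w in concept):
                concept.append(noun)
                break
        else:
            concepts.append([noun])
    return concepts

def build_semantic_graph(nouns):
    """Steps 4-5: concepts -> wordlist -> adjacency matrix of Relation-ids."""
    concepts = build_concepts(nouns)
    wlist = [c[0] for c in concepts]   # first word represents the concept
    n = len(wlist)
    table = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        table[i][j] = relation_id(wlist[i], wlist[j])
    return wlist, table
```

For example, `build_semantic_graph(["cook", "chef", "chopper", "aircraft", "crisis"])` merges "cook" and "chef" into one concept represented by "cook", and records Relation-id 3 between the "chopper" and "aircraft" nodes.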

2014 International Conference on Issues and Challenges in Intelligent Computing Techniques (ICICT) 597
1. Preprocess the file: normalization and stemming.
2. Tag document terms using a POS tagger.
3. Extract nouns to obtain sets of concepts {C1, C2, C3, ..., Cn}.
4. Considering these concepts, build a list of words called wlist = {W1, W2, W3, ..., Wn}, Wi ∈ Ci : 1 < i <= n
5. Construct a semantic graph, where wlist provides the vertices of the graph.
   5.1 Declare an n x n matrix, initialize it with all 0 entries, and set i = j = 0.
   5.2 While (i < n) do
   5.3   for (j = i; j < n; j++) do
           If relation (Wi, Wj) is defined in WordNet, where Wi, Wj ∈ Ck {1 < k <= n},
           then table[i][j] = Relation-id
         End for loop
         Increment the current candidate pointer i
       End while

Figure 1: Proposed Algorithm for construction of Semantic-Net

IV. EXPERIMENTS AND RESULTS

This section is organized in two parts: the first is about the dataset and tools used, and the second describes the experiments. The experiments conducted in our work are divided into three parts: construction of a semantic network for a text document, extraction of keywords from the semantic graph, and commenting on the nature of the document based on graph visualization.

A. Dataset and Tools Used

The experiments were performed on the FIRE [13] data set and on a manually constructed document set drawn from different newspapers, i.e. Times of India (TOI), BBC, Dainik Bhaskar, The Hindu and Business Standard. We will refer to the FIRE data set as dataset-1 in the coming sections; the documents taken from the newspapers will be referred to as dataset-2.

The tools we used are NLTK (Natural Language Tool Kit) [9], WordNet 3.0, Python 2.7.3 and GEPHI for graph visualization.

GEPHI 0.8.2 [6] is "an interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs. The advantage of this tool is that graph visualization is easy and it provides different views of the same graph according to the need". It also produces some quantitative attributes of the graph for better understanding. For fast and efficient computation, GEPHI uses different algorithms [10, 11, 12].

B. Experiments on dataset 1

To perform experiments on the FIRE data set, we selected some specific queries and the documents relevant to each query. The semantic graphs were constructed by the proposed algorithm (Figure 1), and different graph based and statistical measures were applied for our objective as defined in the abstract.

For keyword extraction we performed experiments on the documents related to specific, manually selected queries. Here we considered some queries and their corresponding relevant documents as described in the query relevance file (provided by FIRE). We tried to extract keywords from these relevant documents. Keywords were extracted by applying graph theoretic measures such as Degree, Eccentricity and Closeness Centrality; the nodes with the highest values of these measures were considered keywords. We assume that if these extracted keywords are present in a query or are related to the query, then they were correctly identified by the experiment. We performed experiments on 50 queries and found motivating results. The result for documents related to query 27 is shown in Table 1; the semantic graph for the selected documents is presented in Figure 2.

Query 27: Relation between India and China.

Query Description: Information about the relationship between India and China with regard to economy, diplomacy, science, technology and trade is relevant.

Documents selected: 1040913_frontpage_story_3751658.utf8, 1041016_nation_story_3889236.utf8, 1050207_nation_story_4346014.utf8, 1040909_opinion_story_3732586.utf8 and 1041006_opinion_story_3819642.utf8

Figure 2: Semantic Graph for answer documents related to Query 27

In Table 1 we mention some keywords and the corresponding node values that are greater than a manually selected threshold. The keywords are highlighted in bold. Some important words also fall below the threshold, such as "technology" in betweenness centrality, but we get many important words if we consider the words above the threshold. For this experiment we found that eccentricity, closeness centrality and clustering coefficient give good results for keyword extraction. Figure 2 corresponds to the Table 1 results: the red colored nodes with larger size either appear in our query or are strongly related to the query words. The eccentricity statistic likewise shows that the terms above the threshold either occur in the query or are highly related to the query words; e.g. "Economy" is in the query description, and the words "Market", "Fund" and "Stock" are captured by our measures.

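The graph theoretic ranking described above (degree, eccentricity, closeness centrality) can be sketched with plain breadth-first search. This is our own minimal illustration on an adjacency-set graph, not the GEPHI computation the paper uses; eccentricity and closeness are computed here within each node's connected component.

```python
from collections import deque

def bfs_distances(graph, source):
    """Shortest-path (hop) distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def rank_keywords(graph, top_k=3):
    """Score every node by degree, eccentricity and closeness centrality,
    then return the top_k nodes by degree along with all scores."""
    scores = {}
    for node in graph:
        dist = bfs_distances(graph, node)
        reachable = [d for d in dist.values() if d > 0]
        scores[node] = {
            "degree": len(graph[node]),
            "eccentricity": max(reachable, default=0),
            # closeness restricted to the node's connected component
            "closeness": len(reachable) / sum(reachable) if reachable else 0.0,
        }
    by_degree = sorted(graph, key=lambda n: scores[n]["degree"], reverse=True)
    return by_degree[:top_k], scores
```

On a small star-shaped graph echoing query 27 (an "economy" node linked to "market", "fund" and "stock"), the hub node ranks first on degree and has the smallest eccentricity, matching the intuition behind Table 1.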
Table 1: Keyword Extraction for Documents related to Query 27

[Table 1 ranks nodes by Degree, Eccentricity, Closeness Centrality and Betweenness Centrality; top-ranked terms include Diplomacy, Work, Order, Force, Border, Stock, Fund, Economy and Technology. The numeric values are not recoverable from this extraction.]

Table 2: Keyword Extraction for Dataset-2 with different measures

[Table 2 ranks nodes by Degree, Eccentricity, Closeness Centrality, Betweenness Centrality and TF, and continues with further measures (Authority, HUB, Clustering Coefficient); top-ranked terms include TRAGEDY, CALAMITY, FORCE, RESCUE, RELIEF, FLOOD, KEDARNATH and UTTARAKHAND. The numeric values are not recoverable from this extraction.]

C. Experiment on Dataset 2

To perform the experiment on dataset 2, we constructed a dataset from newspapers about the tragedy that happened recently in India at Uttarakhand, where a cloudburst resulted in floods and landslides.

To achieve our objective, we first constructed the semantic graph shown in Figure 3 for dataset-2 using the same algorithm proposed in Figure 1, and applied different graph theoretical measures, together with the statistical measure TF, for finding representative keywords.

Figure 3: Semantic Graph for sample document from dataset 2

In Table 2 we show some keywords and their measured values under different quantitative measures, i.e. Degree, Eccentricity and Betweenness Centrality. Keywords are sorted in decreasing order of their values.
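One of the measures reported for dataset-2 is the clustering coefficient. The standard local clustering coefficient (the fraction of a node's neighbor pairs that are themselves connected) can be computed directly from the adjacency structure; the sketch below is our own minimal version, not GEPHI's implementation [12].

```python
from itertools import combinations

def clustering_coefficient(graph, node):
    """Local clustering coefficient of `node` in an undirected graph
    given as a dict mapping each node to its set of neighbors."""
    neighbors = list(graph[node])
    if len(neighbors) < 2:
        return 0.0   # no neighbor pairs to connect
    links = sum(1 for u, v in combinations(neighbors, 2) if v in graph[u])
    return 2 * links / (len(neighbors) * (len(neighbors) - 1))
```

For a triangle {a, b, c} with a pendant node d attached to a, node b scores 1.0 (its two neighbors are connected) while node a scores 1/3, which illustrates why this measure can favor tightly knit small clusters over mere high degree.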

To compare the efficiency of the results, we compared the extracted keywords with those selected by human judgment (only 30 are mentioned here due to space limitations). After observing the results, we found that the clustering coefficient provides the best results among all measures for dataset-2: the number of words agreeing with human judgment is higher than for any other measure. We can therefore infer that the clustering coefficient plays a good role in keyword extraction.

Analyzing the result, we can say that the clustering coefficient not only considers nodes with high degree but also takes into consideration how densely their neighbors are connected: if the neighbors of a node are densely connected, the node has a high clustering coefficient. This ensures transitivity in the relation, i.e. if x is connected to y and y is connected to z, there is a high probability that x is connected to z. Moreover, the clustering coefficient also allows nodes with lower degrees, which may be part of a smaller connected component of the graph, to be considered important. This is a unique property of the clustering coefficient; other measures give more importance to nodes that are directly or indirectly connected to a large number of nodes, and thus rank highly only the nodes belonging to large components.

Nature of Document:

By the nature of the document we mean a comment on what the text basically contains, based on visualization of the semantic graph. We created a semantic graph for a document prepared manually by taking news from different newspapers (enclosed in the Appendix); it is shown in Figure 3, and the corresponding keyword extraction results are presented in Table 2. After reading and understanding this document, we can say that it contains information about the tragedy that happened recently in India at Uttarakhand, where a cloudburst resulted in floods and landslides, and that the rescue operation was carried out by Indian forces, the army and the IAF with the help of choppers.

On observing the graph we can say that it contains many components, but there is one dominating component (found using DFS/BFS or a similar search algorithm). This indicates that the text is focused on one topic.

Looking at the semantic graph, we can say that it mainly informs about two areas. One is the tragedy and its consequences, shown through the arc "A2", highlighted in red. Second, the theme represented through the red arc "A1", with words like {search, operation, inspector, unit, forces, command, rank, police, team, crew, etc.}, clearly points towards the rescue operation carried out by the forces. Arc A3, drawn in blue, shows concepts of small size; these concepts also provide important information, such as {rain, cloudbursts}, {emergencies, crisis} and {fleet, aircraft}. (The arcs are drawn manually for better analysis.)

We also considered some multi-thematic documents and observed that their graphs contain 2-3 large connected components of comparable size instead of one dominating component. We are not presenting these results due to space limitations. This indicates that the semantic graph can be used for identifying whether a document is focused around one theme or is multi-thematic. Further, we observed that 40% of the words have zero degree and are not important; these words need not be considered for indexing. Thus we have a kind of semantic filtration that identifies stop words based on semantic importance. Almost all important words are in the largest component. This trend was also observed for all other FIRE documents on which we performed our experiments.

V. CONCLUSION AND FUTURE WORK

In this paper we presented a novel approach for constructing a semantic graph of text documents. Our approach considers all nouns of a document, converts them to concepts and automatically builds a graph from the extracted concepts. The nodes are the concepts, while the edges represent the semantic relations between concepts. The motivation is that we do not depend on measures like TF-IDF, which give more importance to words that are more frequent in the document. Our approach instead emphasizes the semantic importance of the words present in the document, and can even be applied to small documents where TF-IDF cannot be used. We believe that our graph captures many properties of text documents and can be used for different applications in the fields of text mining, NLP and computational linguistics. Here we presented its application for keyword extraction and for commenting on the nature of a document. An in-depth analysis of our semantic graph remains to be done. For commenting on the nature of the document, we find that the clustering coefficient plays an imperative role, since it not only covers the whole big component but also considers small clusters. Further, we find that the closeness centrality and eccentricity of a node provide good criteria for keyword extraction.

As this is new work, it requires further exploration. We intend to preprocess/post-process the graph to improve its efficiency for different applications. Further, we think that the semantic graph has the potential to be applied in many NLP applications, such as query expansion, topic detection and text summarization.

REFERENCES

[1] R. Mihalcea, D. Radev, "Graph based natural language processing and information retrieval," Cambridge University Press, 2011.
[2] P. Dmitry, "Identifying the Pathways for Meaning Circulation using Text Network Analysis," Nodus Labs, 2011.
[3] B. Kang, V. Kim, S. Lee, "Exploiting Concept Clusters for Content-based Information Retrieval," Information Sciences 179 (2-4), 2005, pp. 443-462.
[4] A. Sharan, M. Lata Joshi, A. Pandey, "Exploiting Ontology for Concept Based Information Retrieval," Information Systems for Indian Languages, Communications in Computer and Information Science Volume 139, 2011, pp. 157-164.
[5] J. Liu, J. Wang, "Keyword Extraction Using Language Network," in Natural Language Processing and Knowledge Engineering, 2007.
[6] M. Bastian, S. Heymann and M. Jacomy, "Gephi: an open source software for exploring and manipulating networks," Proceedings of the Third International ICWSM Conference (2009), pp. 361-362.
[7] G.A. Miller, "WordNet: A Lexical Database for English," Communications of the ACM Volume 38, 1995, pp. 39-41.
[8] Princeton University, "WordNet", Internet: http://wordnet.princeton.edu, Dec. 27, 2012 [July 26, 2013].
[9] S. Bird, E. Klein and E. Loper, "Natural Language Processing with Python," O'Reilly Media, 2009.
[10] B. Ulrik, "A Faster Algorithm for Betweenness Centrality," in Journal of Mathematical Sociology, pp. 163-177, 2001.
[11] L.C. Freeman, "Centrality in social networks, conceptual clarification," Social Networks 1, pp. 215-239, 1979.
[12] Wikimedia Foundation Inc., "Clustering coefficient", Internet: http://en.wikipedia.org/wiki/Clustering_coefficient, July 6, 2013 [July 26, 2013].
[13] ISICAL, "Forum for Information Retrieval and Evaluation", Internet: http://www.isical.ac.in/~clia/data.html [Aug. 23, 2013].
[14] T. Kristina, K. Dan, C. Manning, and Y. Singer, "Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network," HLT-NAACL 2003, pp. 252-259.
[15] G. Vishal and L. Gurpreet Singh, "Automatic keywords extraction for Punjabi language," IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 5, No 3, September 2011.
[16] J. F. Sowa, "Semantic Network," http://www.jfsowa.com/pubs/semnet.html, Feb. 02, 2006 [Jul. 26, 2013].

Appendix
Dataset-2

"Uttarakhand Flood: Missing untraced till July 15 will be presumed dead: Bahuguna
Racing against time, the Uttarakhand government on Thursday decided that those missing in the flood-ravaged state will be presumed dead if they remain untraced till July 15, and asked officials to remain vigilant in the wake of warnings of heavy rains over the next two days.
Chief Minister Vijay Bahuguna said the exact number of people missing after the tragedy is 3,064 and the deadline for finding them is July 15. Considering the magnitude of the crisis, the state Cabinet has decided that if the missing persons are not found by July 15, they will be presumed dead.
... relief operations in the state but is maintaining 10 choppers there, including the world's largest Mi-26 transport helicopter along with Mi-17s and ALH Dhruvs. In the last 24 hours, the Ministry said the IAF has evacuated around 310 and overall, so far, rescued 20,712 people using its helicopter fleet. The Army is continuing to maintain the troop level at 8,000 in the region for the relief operations. It has been maintaining this strength since June 18, when the armed forces were inducted into the state.
The Indian Air Force (IAF) has deployed 13 more aircraft for relief and rescue operations. Fifty-five helicopters have been pressed into service for rescue work. The IAF has also deployed its heavy lift Mi-26 helicopters for transporting fuel and heavy equipment required by the Border Road Organization (BRO) to clear roads closed due to landslide. New Delhi: Indian Air Force has airlifted over 18,000 persons and dropped more than 3 lakh kg of relief material in flood-hit Uttarakhand since June 17 in its biggest ever helicopter operation for rescue and relief in the state. IAF has airlifted a total of 18,424 persons, flying a total of 2,137 sorties and dropping/landing a total of 3,36,930 kgs of relief material and equipment, it said in a statement on Sunday. In operations for 'Op Rahat' that were undertaken since morning, a total of 749 persons were airlifted, flying a total of 93 sorties, and about 12,000 kgs of relief material and equipment were dropped, it said.
The Central Reserve Police Force (CRPF) on Saturday announced it will contribute one day's salary of its personnel to the Prime Minister's Relief Fund for the victims of the Uttarakhand tragedy. CRPF Director General Pranay Sahay said the force will contribute over Rs 18 crore to support the victims of the massive calamity in the hill state, a spokesperson said in Gurgaon. The CRPF deeply commiserates with the victims of the tragedy that has struck Uttarakhand. There has been large scale destruction of property and loss of lives in this disaster. The victims are our own brethren. The CRPF rank and file joins the countrymen in conveying its deepest concern for the victims of the tragedy. The officers and men have decided to donate one day's salary to the Prime Minister's Relief Fund. The amount will be over Rs. 18 crore, Sahay added.
Personnel of ITBP have been extensively active in the rescue and salvage operations conducted jointly with the Indian Air Force (IAF) and Army units from the Central Command ever since the news was flashed about the flash floods at several places in Uttarakhand. Rescue teams and police personnel have recovered 48 dead bodies from the River Ganga in Haridwar. We are keeping them in the morgue and documenting their details. We are also clicking their photographs, and flashing the details for identification,
we will presume that they are dead and the process of paying compensation to their ne.Tt said Haridwar Senior Superintendent of Police Rajiv Swaroop. Central Army
of kin will begin, he said With the MeT department issuing a warning of heavy rains at Commander Lt. Gen Anil Chait said on Friday that about 8,000 to 9,000 people are still
places in Kumaon region over the ne.;rt two days, Bahuguna said that for the ne.;rt 50 stranded in Badrinath. Over 73,000 people have so far been evacuated from the flood
hours the administration needs to be highly vigilant, adding 250 National Disaster and landslide-hit areas of Uttarakhand so far. Another 32,000 to 33,000 people are still
Response Force personnel have been deployed in these areas. Meanwhile, the Indian Air to be evacuated, even as rescuers intensified their efforts to help those in distress in
Force flew 70 civil administration personnel to the Kedarnath temple premises to clean different inaccessible parts of the hill state. In Joshimath sector, the army has
the surroundings there. A team of seven mountaineers is also engaged in a combing constn1Cted a temporary bridge across Alaknanda River near Govindghat to facilitate
operation in areas adjoining the shrine in search of bodies while over 50 members of a I-Jemkunt Sahib pilgrims to cross over. A road link has been opened from Sonprayag to
team of experts and volunteers is stationed in Kedarnath to clean the temple premises of Gaurikund.
tonnes of debris under which more bodies may be lying, an of f icial said. Meanwhile, BJP spokesperson Prakash Javadekar has said that party president Rajnath
In Delhi, the government announced it will rebuild 10,000 houses and undertake other Singh has formed a disaster relief force for Uttarakhand, consisting of party workers,
activities to develop infrastn/cture in all affected municipalities in the state. All affected volunteers and people from all classes of the society. We are making special
municipalities and notified area councils in Uttarakhand can be covered under Rajiv arrangements for those eager to work. It shall be a continuous process, Javadekar said.
Awas Yojana as a special case to support reconstn1Ction of houses of the poor and I-Ie also praised the efforts of the armed forces and the Indo-Tibetan Border Police,
reconstn1Ct and redevelop these devastated houses, Union minister Girija Vyas said in saying they were doing a laudable job. If we move down hill fi"om Badrinath, towards
Delhi. Joshimath, there is a place that falls on route called Gobindghat. There are three isolated
Mass cremation of bodies in Kedarghati held up for the past few days started with 23 segments (spots), which are completely cut-off, in between Gobindghat and Badrinath.
more consigned to flames at Gaurikund and Junglechatti last night, DIG Amit Sinha Some people are still stranded in these places, Lt. Gen Chait said."
under whose supervision the exercise is being undertaken said.Mass cremation of bodies
in Kedarghati held up for the past few days on Thursday started with 23 more consigned
to flames, taking the number of bodies disposed of so far to 59 even as a team of experts
worked on removal of debris and extricating bodies from under them at the Himalayan
shrine.23 more bodies were cremated at Gaurikund and Junglechatti on Wednesday
night, DIG Amit Sinha under whose supervision the exercise is being undertaken told
PTi. This takes the total number of bodies disposed of in Kedarghati to 59, he said. 36
bodies had been cremated earlier, the DIG said, adding the process is slow due to bad
weather and the precautions being taken not to risk the lives of personnel engaged in the
e.;rercise. 50 other members of the team are searching for bodies in Gaurikund and
Rambara areas. Despite continuing bad weather in affected areas amid a MeT
department prediction of heavy rains in the ne.;rt 48 hours at places, efforts were on to
airdrop relief material in affected villages totally cut off after the calamity in the worst­
hit Rudraprayag, Chamoli and Uttarkashi districts.It is still raining intermittently in the
area, he said. A team of seven mountaineers is engaged in a combing operation in areas
aqjoining the shrine in search of bodies while over 50 members of a team of experts and
volunteers is stationed in Kedarnath to clean the temple premises of tonnes of debris
under which more bodies may be lying, the official said.
Chief Minister Vijay Bahuguna has alerted the District Magistrates in Kumaon and
Garhwal regions to be prepared to deal with any emergency in case of heavy rains and
suspend all pilgrimage.The Indian Air Force (lAF) today flew 70 civil administration
personnel to the Kedarnath temple premises to clean the surroundings after the area was
ravaged by the recent flash floods in Uttarakhand. A total of 70 personnel have been
inducted at Kedarnath for cleaning of temple surroundings by the fAF choppers, a
Defence Ministry release said.
The eighth century shrine at Kedarnath withstood the cloudbursts and floods that swept
away its neighbourhood and much of the town last month. After evacuating all the people
stranded in the upper reaches of the hill state, the lAF is in the process of evacuating the
locals who want to move out. The lAF has pulled out majority of its assets deployed in the
2014 International Conference on Issues and Challenges in Intelligent Computing Techniques (ICICT) 601