Incorporating Semanfics Within a Connectionist Model and a Vector Processing Model Richard Boyd, James Driscoll, mien Syu Department of Computer Science University of Central Florida Orlando, Florida 32816 (407)823-2341 FAX: (407)823-5419 e-mail: driscoll@cs.ucf.edu Abstract Semantic information obtained from the public domain 1911 version of Roget's Thesaurus is combined with key- words to measure similarity between natural language topics and documents. Two approaches are explored. In one approach, a combination of keyword relevance and semantic relevance is achieved by using the vector processing model for calculating similarity, but extending the use of a keyword weight by using individual weights for each of its meanings. This approach is based on the database concept of semantic modeling and the linguistic concept of thematic roles. It is applicable to both routing and archival retrievaL The second approach is especially suited for routing. It is based on an Al connectionist model. In this approach, a probabilistic inference network is modified using semantic information to achieve a competitive activation mechanism that can be used for calculating similarity. Keywords: vector processing model, semantic data model, semantic lexicon, inference network, connectionist model. 1 . Introduction The experiments reported here use a relatively efficient method to detect the semantic representation of text. Our original method is based on semantic modeling and is described in [4,17,19). Semantic modeling was an object of considerable database research in the late 1970's and early 1980's. Abriefoverview can be found in [3]. Essentially, the semantic modeling approach identified concepts useful in talking informally about the real world. These concepts included the two notions of entities (objects in the real world) and relationships among entities (actions in the real world). Both entities and rela- tionships have properties. The properties of entities are often called attributes. There are basic or surface level attributes for entities in the real world. Examples of surface level entity attributes are General Dimensions, Color, and Position. These properties are prevalent in natural language. For example, consider the phrase "large, black book on the table" which indicates the General Dimensions, Color, and Position of the book. In linguistic research, the basic properties of relationships are discussed and called thematic roles. Thematic roles are also referred to in the literature as participant roles, semantic roles and case roles. Examples of thematic roles are Ben~ ficiary and Time. Thematic roles are prevalent in natural language; they reveal how sentence phrases and clauses are semantically related to the verbs in a sentence. For example, consider the phrase "purchase for Mary on Wednesday" which indicates who benefited from a purchase(13eneficiary) and when a purchase occurred (Fime). A main goal of our research has been to detect thematic information along with attribute information contained in natural language queries and documents. In order to use this additional information, the concept of text relevance needs to be modified. In [17,19] the major modifications included the addition of a lexicon with thematic and attribute information, and a modified computation of a vector processing similarity coefficient. That research concerned a Question/Answer environment where queries were the length of a sentence and documents were either a sentence or at most a paragraph. At that time, our lexicon was based on 36 semantic categories, and in that environment, our semantic approach produced a significant improvement in retrieval performance. However, for TREC-1 [4], document and topic length presented a problem and caused our semantic approach based on 36 semantic categories to be of little value. However, as reported in [4], by breaking the TREC documents into paragraphs, a significant improvement was demonstrated. This work has been supported in part by NASA KSC Cooperative Agreement NCC 10~3 Project 2, Florida High Technol- ogy and Industry Council Grants 494011-28-721 and 4940-1 1-2~728. 291 In Section 2, we describe our original semantic lexicon and an extension which uses a larger number of semantic categories. Section 3 presents an application of an Al connectionist model to the task of routing. Section 4 presents an approach different than reported in TREC-1 [4], using our extended semantic lexicon within the vector processing model. Section 5 summarizes our rasearch effort. 2. The Semantic Lexicon Our semantic approach uses a thesaurus as a source of semantic categories (thematic and attribute information). For example, Roget's Thesaurus contains a hierarchy of word classes to relate word senses [14]. In TREC-1 [4] and in earlier research [17,19], we selected several classes from this hierarchy to be used for semantic categories. We defined thirty-six semantic categories as shown in Figure 1. In order to explain the assignment of semantic categories to a given term using Roget's Thesaurus, consider the brief index quotation for the term "vapor": vapor n. fog 404.2 fume 401 illusion 519.1 spirit 4.3 steam 328.10 thing imagi~ed 535.3 v. be bombastic 601.6 bluster 911.3 boast 910.6 exhale 310.23 talk nonsense 547.5 The eleven different meanings of the term "vapor" are given in terms of a numerical category. We developed a mapping of the numerical categories in Roget's Thesaurus to the thematic role and attribute categories given in Figure 1. In this example, "fog" and "fume" correspond to the attribute State; "steam" maps to the attribute Temperature; and "ex- hale" is a trigger for the attribute Motion with Reference to Direction. The remaining seven meanings associated with "vapor" do not trigger any thematic roles or attributes. Since there are eleven meanings associated with "vapor," we indicated in the lexicon a probability of 1/11 each time a category is triggered. Hence, a probability of 2/11 is assigned to State, 1/11 to Temperature, and 1/11 to Motion with Reference to Direction. This technique of calculating prob- abilities is being used as a simple alternative to a corpus analysis. It should be pointed out that we are still experimenting with other ways of calculating probabilities. For example, as in [8], a probabilistic part-of-speech tagger could be used to further restrict the different meanings of a term, and existing lexical sources could be used to obtain an ordering based on frequency of use for the different meanings of a term. As reported in [4], the use of 36 semantic categories caused problems when dealing with TREC documents. When the size of a document is large, a greater number of the 36 semantic categories are triggered in the document. Also, when using the semantic approach described in [19] the probability present for each category in a document is often very close to one. Consequently, almost every one of the Thematic Role Categories Attribute Categories TACM Accomnaniment ACOL Color TAMT Amount AEID External and Internal Dimensions ThNF Beneficiarv AFRM Form TCSE Cause AOND Gender TCND Condition AODM General Dimensions TCMP Comnenson ALDM Linear Dimensions TCNV Conve ance AMFR Motion Conjoined with Foree ThOR De~e AOMT Motion in General ThST Destination AMDR Motion with Reference to Direction ThUR Duration AORD Order TOOL Ooal APIIP Phvsical Pronerties TINS Instrument APOS Position TSPL I:c~tion/Si,ace ASTE State TMAN Manner A~mrature TMNS Means AUSE Use ThUR Purpc~e AVAR Variation ThNO Ran~ i~FS Result TSRC Source TTIM Time Figure 1. Thirty-Six Semantic Categories. 292 36 semantic categories becomes present in every document. This causes semantic category weights to become very low and useless within that approach. As ~reportedin[4], one way to solve this problem is to break ThECdocum ents into paragraphs. But, another way to solve the problem of long documents causing semantic weights to be of little value is to have more semantic categories. A large number of "semantic" categories can be obtained (for example) by using ~ the categories and/or subcategories found in Roget's Thesaurus, instead of the 36 semantic categories we have used. This may be a deviation from database semantic modeling. In any case, it needs to be examined. Consequently, for the experiments reported here, a semantic lexicon was created based on all the word senses found in the public domain 1911 version of Roget's The- saurus. To provide an example, consider Topic 052 as shown in Figure 2. Fi~re 3 indicates the keywords and frequency information within Topic 052, along with the semantic categories obtained from our extended lexicon for those keywords. Note that stemming was not used for the pro- cessing of Topic 052; so, some keywords in Topic 052 were not located in our lexicon (e.g. sanctions). The categories recorded in our extended semantic lexicon usethe category numbers found in the 1911 version of Roget's Thesaurus. These numbers are then followed by a part-of- speech code also found in the 1911 version of Roget's Thesaurus. The number after the part-of-speech code represents a sub-category, but this number does not appear in the 1911 version of Roget's Thesaurus. That number was created based on groupings of words within the thesaurus. ~op> <head> TIpster Topic Description <num> Number: 052 <dom> Domain: International Economics <titie> Topic: South African Sanctions <desc> Description: Document discusses sanctions against South Africa <narr> Narrative: A relevant document will discuss any aspect 0' South African sanctions, such as: sanctions dccl~(po~ by a country against the South African government in response to its apaitheid poncy, or in response topressure by an indIvidual, organization or another country; intemational sanctions against Pretoria imposed bythe United Nations; the effects 0' sanctions against & Africa; opposition to sanctions; or' compliance with sanctions by a company. The document will identif~ the sanctions instituted or being considered, e.g., corporate disinvestment, trade ben' academic boycott, arms embargo. <con> Concept(s): 1. sanctions, international sanctions, economic sanctions 2. corporate exodus, corporate disinvestment, stock divestiture, ben on new investment, trade ban, import ben on South African diamonds, U.N. arms embargo, curtailment 0' delbrise contracts, cutoff 0' nonmUitary goods, academic boycott, reduction 0' cultural ties 3. ap&theid, white domination, racism 4. an-theid, black m~Jority rule 5. Pretoria ~c> Factor(s): <nat> Nationality: South Africa <`lac> <de~ D~flnition(s): 3. Connectionist Model Routing Experiments Recent work suggests that significant improvements in retrieval performance will require a technique that, in some sense, "understands" the content of documents and queries and can be used to infer probable relationships between documents and queries [2]. In this view, information retrieval is an inference or evidential reasoning process in which we estimate the probability that a user's information need is met given a document as "evidence". The techniques required to support this kind of inference are similar to those used in expert systems that must reason with uncertain information. Several probabilistically~oriented inference network models have been developed using experimental document collec- tions [5] during the past few years for information retrieval [15]. These models are generally characterreed by an architecture with two layers corresponding to documents and index terms. The documents and index terms are connected by direct links. Initially, the prior probabilities of all root nodes (nodes with no predecessors) and the conditional probabilities of all non-root nodes ~iven all possible combinations of their direct predecessors) must be specified. Aretrievalconsists of one or more documents with the highest posterior probability for the given set of index terms (evi- dences) which represent a user's information need. Over the last few years, the technique of automated inference using probabilistic inference networks has become popularwithin theM probability and uncertainty community, particular in the context of expert systems [6,7]. The most 293 Figure 2. Topic 052. important constraint on the use of a probabilistic network is the fact that in general, the computation of the exact posterior probabilities is NP-hard [1]. Thus it is unlikely that we could develop an efficient general-purpose algorithm which would work well for all kinds of inference networks. There are several alternatives, such as the use of approximation algo- rithms or heuristic algorithms, and creating special case algorithms [9,10]. The experiments here concern an attempt at a heuristic probabilistic inference network approach based on an Al connectionist model. The connectionist model uses a com- petitive activation rule to find the most probable retrievaL The term competitive activation rule refers to a spreading activation method in which nodes actively compete for available activation in a network. An initial formulation of a competitive activation mechanism was previously studied on three tw~layer, abstract networks for diagnostic problem solving [11,13]. The connectionist model proposed here consists of a two-layer network architecture. Document nodes and index term nodes corresponding to each layer are connected by links whose weights represent association strengths between nodes. These links are also viewed as channels for sending information between nodes. Figure 4 is a simple network consisting of two document nodes and three index term nodes. At each moment of time, each node Topic 52 curtailment 001 201n.1 38n.2 individual 001 372a.0 87n.0 considered 001 611d.0 compliance 001 602n3 reduction 001 103n.1 308n.1 economics 002 692n3 corporate 003 43a.0 response 002 462n.0 586v~ 714n.0 821n.0 888n.2 990n3 relevant 001 23a.1 476a.2 ~.2 majority 001 102n.1 131n.0 33n.O academic 002 514a.0 537a.0 54~.0 effects 001 7&)n.10 780n.5 798n.0 defense 001 717n.0 937n.2 country 002 181n.0 189n.1 189n.8 189n.9 266v.1 344n.0 371a.1 company 001 599n.14 712n.2 726n.8 72n.2 88n.1 892n.4 policy 001 626n.2 692n.2 aspect 001 183n.0 448n.3 7n.0 white 001 429a.1 430a.0 441n.5 996n.5 stock 001 lln.3 153n.4 166n.2 225n.13 25n.3 265a.0 31n.1 501n.2 613a.0 635n.0 636n.O 637v.O 637v.2 780n.10 798n.0 80()n.0 81 lv.0 001 780n.10 798n.0 001 349n2 421a.0 431a.0 431n.3 432n.1 752n.0 945a.2 rule 001 136n.1 200n.1 466n.2 613n.5 693v.1 697n.1 737n.1 737v.2 737v.3 74~.2 80n.0 82n.3 963n~1 arms 002 459d.9 71~.2 722n.0 727n.0 877n.3 894v.3 372n.2 549a.0 79a.0 87a.0 goods black 743n.0 762n.0 772n.0 144n.0 195n.0 201n.1 813n.O 85n3 international organization opposition investment government domination pressure identify document embargo discuss boycott another against united import exodus trade south being wrn new ban any 003 12a.0 89~.4 001 161n.0 001 179n.0 720n.0 001 225n.0 001 692n.3 737n.5 001 175n.0 001 157n.2 642n3 735n.1 001 13v.0 003 467n.3 001 265n.2 001 298v.0 002 297v.2 001 104a.1 15a.1 709v.0 714v.O 004 14a.0 17~.0 237d.0 276v.O 673d.O 704d.0 708d.0 70&1.1 708v.O 708v.2 713v.2 716v.1 716v.5 717a.0 71~.0 764v.0 898a.0 932v.8 001 46a.1 714a.0 001 228v.1 296v.0 642n.1 642v.0 001 293n.0 295v.1 002 625n.4 734v.1 006 278n.1 001 ln.O 3n.0 831n.0 ~6n.2 002 360v.1 60()n.0 6()0v.O 602d.0 604a.0 604n.O 737v.3 771n.11 784n.4 001 123a.0 146v.O 1~.0 614a.0 66n.0 004 761n.0 908n.0 98n.3 001 25a.0 Sln.0 609an.1 357n.1 71On.0 60n.O 71~.0 32~.0 237n.0 357n.0 708n.O 716n.2 693n.0 737n.1 175n.O 31%.0 464v.1 4&)av.5 551n2 761n.0 451v.0 460v.3 476v.0 809n3 737n.1 784n.0 699n3 300v.0 516n.0 516v.0 794n.1 794v.1 Figure 3: Word Frequency and Semantic Categories for Topic 052. 294 top I pI=O.02 0.9 0.9 0.9 e (army) (engineer) (plant) Figure 4: A Simple Network Consisting of Two Document Nodes and Three Index Term Nodes. receives information about the activation levels of its immediate neighboring nodes (nodes connected to it via direct links), and then uses this information to calculate its own activation leveL Through this process of spreading activation, the network setUes down to equilibrium repre- senting a retrieval to a user's information need. The computation of the information retrieval inference process is based on a lormaliration of the causal and proba- bilistic associative knowledge underlying diagnostic prob- lem-solving [18). We do not discuss the formulation architecture and activation mechanism of the connectionist model. This information can be found in [11,13,16,18). For TREC-2, we managed to complete only one official routing experiment for this approach, and itdid not involve semantics. The experiment was intended to be a baseline experiment for our semantic experiments. For ThEC-2, a specific network was constructed for 50 topics. A list of index terms was assembled based on keywords in the concept section of each topic. In this network, each output node represented a topic, and each input node represented akeyword. The prior probability assigned to each topic node was equal to 1/(total number of topics). The connection strengths were assigned equal weights (0.9). The network contained 50 topic nodes and 848 index term nodes. These nodes were connected via 1449 links. An example of this network is shown in Figure 5, where p~ is the prior probability of topic top~. The keywords "army", "engineer", and "plant" were obtained by processing the concept section of topic tO~~ Currently, the network is enhanced by using an estimated weighting scheme. We performed a Category B routing experiment. Using just keywords, the results were not good. The main problem was due to the fact that, in the document ranking, many documents had the same score used to generate the ranking. In order to satisfy the requirements for the ranking, we had to artificially rank those documents with the same score. This was done based on order of appearance. The performance was terrible except for Topic 66. This topic had only two 295 Figure 5: A Sample Network of the Experimental ModeL known relevant documents for Category B routing experi- ments and our inference network retrieved one of them in the top 20 documents! No further connectionist model experiments have been completed. We were unable to modify the baseline keyword experiment or perform semantic experiments for this approach. 4. Vector Processing Model Experiments In this section, we explain the manner in which semantics is incorporated within a vector processing model using the semantic lexicon explained in Section 2. Please note that an entry in our semantic lexicon has the form of a word followed by codes for each of the semantic categories theword triggers. We explain our approach usingatext relevance determination procedure intended to show what is being calculated rather than show the actual computations for the approach. The procedure presented here generates several outputs that are really not necessary, but are included just to help explain the approach. The relevance determination procedure is explained using the four documents and query shown in Figure 6. A few preliminary computations are reviewed in order to explain the procedure. First, the number of documents each word is in must be determined. Figure 7 shows a list of words from the four documents and the query of Figure 6 along with the number of documents each word is in (dJ). Next, the inverse document frequency (idi) of each word is determined by the equation 1og10(NIdJ), where N -4, the total number of documents. Figure 8 provides the idjof each word. Sometimes, the kif of a word is undefined. This can happen when a word does not occur in the documents but does occur a query. For example, the words "depart" "do" in and "when" do not appear in the four documents. Thus, the idf of these terms cannot be defined here. Later, we will see that an adjustment can be made for these undefined values. Next, the category probability of each query word is determined. Figure 9 shows an alphabetized list of all the unique words from the query, the frequency of each word in the query, and the semantic categories each word triggers. word idfof the word Iog10~Nf Document#1 L~comotives pull the trains. and .6 canopy .6 carry .6 Document #2 depart undefined do undefined People meet people under the canopy and within trains. freight .6 from .6 hourly .6 Document #3 leave .6 locomotives .6 Trains carry freight from the station. meet .6 noon .6 people .6 Document #4 pull .6 station .3 Trains leave the station hourly until noon. the 0 trains 0 under .6 Ouery until .6 when undefined When do trains depart the station? within .6 Figure 6. Four Documents and a Query. Figure 8. The i~of Each Word. word frequency category probability depart 1 AMDR 1/4 word number of documents TA~ 1/8 the word is in (dJ) do 1 AUSE 1/21 ATh{P 1/21 and 1 TCSE 1/21 canopy 1 TCNV 2/21 carry 1 ThEs 1/21 do 0 TSRC 1/21 depart 0 freight 1 station 1 APOS 3/16 from 1 AORD 1/8 hourly 1 TAMF 1/16 leave 1 TCND 1/8 locomotives 1 TDGR 1/16 meet 1 TSPL 3/16 noon 1 people 1 the 1 pull 1 trains 1 AORD 7/24 station 2 AMDR 1/12 the 4 AMFR 1/12 trains 4 TACM 1/24 under 1 TCNV 1/12 until 1 when 0 when 1 TAMT 1/3 within 1 TnM 2/3 Figure 7. Ust of Words in the Documents and Query. Figure 9. Words in the Query. 296 The semantic categories in our example are those shown in Figure 1. For example, consider the word "depart" which occurs one time in the query as shown in Figure 9. The semantic lexicon entry for the word 11depart't using the categories of Figure 1 is as follows: depart: NONE NONE NONE NONE NONE AMDR AMDR TA~ where NONE represents a word sense not included in the 36 semantic categories of Figure 1. If a uniform distribution is assumed, then AMDR is triggered 1/4 of the time and TA~ is triggered 1/8 of the time. This is shown in Figure 9 as the probabilities for each semantic category. A similar category probability determination is done for each document. Figure 10 is an alphabetized list of all the unique words in Document #4 of Figure 6. The semantic categories each word triggers along with probabilities are also shown. The text relevance determination procedure is shown in Figure 11. The procedure uses three input lists: a. List of words and the kif of each word, as shown in Figure 8. b. List of words in the query and the semantic categories they trigger along with the probability of triggering those categories, as shown in Figure 9. c. List of words in a document and the semantic categories they trigger along with the probability of triggering those categories, as shown in Figure 10. The procedure operates as follows: Step 1. This step determines the common meanings between the query and the document. Figure 12 corresponds to the output of Step 1 for Document #4. In Step 1, a new list is created as follows: For each word in the query, follow either subsection (a) or ~), whichever applies: a. For each category the word triggers, find each word in the document that triggers the category and output three things: 1) The word in the query and its frequency of occurrence. 2) The word in the document and its frequency of occurrence. 3) The category. b. If the word does not trigger a category, then look for the word in the document and if found, output two things and a ~ 1) The word in the query and its frequency of occurrence. 2) The word in the document and its frequency of occurrence. 297 word frequency category probability hourly 1 `rflM 1.0 leave 1 AMDR 1/7 TA~ 1/7 noon 1 AU)M 1/3 ~flM 2/3 the 1 station 1 APOS 3/16 AORD 1/8 TA~[~ 1/16 TCNP 1/8 ThGR 1/16 J~PL 3/16 trains 1 AORD 7/24 AMDR 1/12 AMFR 1/12 TACM 1/24 TCNV 1/12 until 1 THM 1.0 Figure 10. Words in Document #4. ~tep 1 - Refer to Figure 12. Determine common meaning between query and the document. ~tep 2- Refer to Figure 13. Adjust for words in the query that are not in any of the documents. ~tep 3 - Refer to Figure 14. Calculate the weight of a semantic component in the query and calculate the weight of a semantic component in the document. ~tep 4- Refer to Figure 15. Multiply the weight in the query by the weight in the document. ~tep 5 - Refer to Figure 15. Sum all the individual products of Step 4 into a single value which is the semantic similarity coefficient. I 1 Figure 11. Relevance Determination Procedure to Explain Semantic Similarity. FI~ Ust Item First Entry Second Entry Third Entry Number Word & Frequency Word & Frequency Category in Query in Document #4 1 (depart,1) ( 2 (de~rt,1) ( 3 (depart,1) ( 4 (depart,1) ( 5 (do1l) (i 6 (station,1) ( 7 (station,1) 8 (station,1) 9 (station,1) (((mm 10 (station,1) ( 11 (station,1) ( 12 (stntion,1) ( 13 (station,1) ( 14 (the,1) 15 (tmins,1) 16 (tmins,1) (] 17 (trains,1) 18 (trains,1) (i 19 (trains,1) (i 20 (trains,1) (i 21 (when,1) (I 22 (when,1) (] 23 (when,1) (] 24 (when,1) (I Ienve,1) AMDR ~ns,1) AMDR Icave,1) TAMr station,1) TAMr Irains,1) TCNV itation,1) APOS Btation,1) AORD Imins,1) AORD Ieave,1) TAM~ ~tntion,1) TAMr ~tation,1) TCND ~tation,1) TDGR stntion,1) TSPL Ihe,1) trains,1) AORD Icave,1) AMDR trains,1) AMDR trains,1) AMFR Lmins,1) TACM trains1l) TCNV Ienve,1) TAMr hourly1l) TrIM r'oon,1) TrIM ~ntiI,1) TrIM Figure 12. Common Meaning. Second Ust Item First Entry Second Entry Third Entry Number Word & Frequency Word & Frequency Category in Query in Document #4 1( 2( 3( 4( 5 6 7 8 9 10 11 12 ( 13 ( 14 ( 15 ( 16 ( 17 ( 18 ( 19 ( 20 ( 21 ( 22 ( 23 ( 24 ( 1eave,1) (Ieave,1) AMDR ,tmins,1) (tmins,1) AMDR *ienve,1) (Ieave,1) TAMr station,1) (station,1) TAMT `tmins,1) (trains,1) TCNV station,1) (station,1) APOS `station,1) (stntion,1) AORD `station,1) (tmins,1) AORD stadon,1) (Ienve,1) TAMT station,1) (stntion,1) TAMT station,1) (station,1) TCND station,1) (stndon,1) TDGR stntion,1) (station,1) TSPL the,1) (the,1) trains,1) (trains,1) AORD trnins,1) (Ieave1l) AMDR tmins,1) (tmlns,1) AMDR tmins,1) (tmins,1) AMFR tmins,1) (tnins,1) TACM tmins,1) (tmins,1) TCNV Ienve,1) (`eave,1) TAM~ bourly,1) ("oudy,1) TrIM i'oon,1) (noon,1) TrIM `intil,1) (until,1) TrIM Figure 13. Adjustment for Words with no idf 298 Considering Figure 12, the word "depart" occurs in the query one time and triggers the category AMDR. The word "leave" occurs in Document #4 once and also triggers the category AMDR. Thus, item 1 in Figure 12 corresponds to subsection (a) as described above. An example using sub- section (0) occurs in item 14 of Figure 12. Step 2. This step adjusts for words in the query that are not in any of the documents. Figure 13 shows the output of Step 2 for Document #4. In this step, another list is created from the list created in Step 1. For each item in the Step 1 list which has a word with undefined idi; this step replaces the word in the First Entry column by the word in the Second Entry column. For example, the word "depart" has an undefined idfas shown in Figure & Thus, the word "depart" in item 1 of Figure 12 should be replaced by the word "leave" from the Second Entry column. This is shown in item 1 of Figure 13. Likewise, the words "do" and "when" also have an undefined idf and are respectively replaced by the words from the Second Entry column. Step 3. This step calculates the weight of a semantic component in the query and calculates the weight of a semantic compo- nent in the document. Figure 14 shows the output of Step 3 for Document #4. In Step 3, another list is created from the list created in Step 2 as follows: For each item in the Step 2 list, follow either subsection (a) or (0), whichever applies: a. If the Third Entry specifies a category, then 1) Replace the First Entry by computing: ( i~of frequency ot\( probability the word~ word in word in ~ triggers the category First Entry)~ First Entry )~ in the Third Entry ) 2) Replace the Second Entry by computing: ( i~of frequency o~( probability the word~ word in ~ word in ~ triggers the category Second Entry) ksecond Entry)~ lathe Third Entry ) 3) Omit the Third Entry. b. If the Third Entry does not specify a category, then 1) Replace the First Entry by computing: ( i~of irequencyo~ wordin ~ wordin FirstEntry)~ FirstEntry) 2) Replace the Second Entry by computing: ( i~of frequency word in ~ word in Second Entry) kSecond Entry) 3) Omit the Third Entry. In Figure 14, item 1 is an example of using subsection (a), and item 14 is an example of using subsection (0). 299 Step 4. This step multiplies the weights in the query by the weights in the document. The top portion of Figure 15 shows the output of Step 4. In the list created here, the numerical value created in the First Entry column of Figure 14 is multiplied by the numerical value created in the Second Entry column of Figure 14. Step S. This step sums the values in the Step 4 list to compute the semantic similarity coefficient fora particular document. The bottom portion of Figure 15 shows the output of step 5 for Document #4. We have finally observed an improved Precision~Recall performance using the semantic similarity coefficient explained here. or example, in a Category B filtering experiment where the words being considered were only those in the topics and idf values were determined by the number of topics a word is in, we have observed the keyword and semantic results shown in Figure 16 and Figure 17, respec- tively. The 11-pt average for these two experiments reveals a 23% increase due to the use of semantic categories. According to Sparck Jones' criteria, this change would be classified as "significant" ~reater than 10.0%) [12]. We believe further improvement is possible by considering more words, stemming for plurals and tenses of words, better idf values (like those used for archival retrieval), a modem lexicon, and a focus on paragraphs instead of whole docu- ments. 5. Summary Our progress during ThEC-1 and ThEC-2 has been the following: a. We created efficient code for a UNIX platform. Originally our code used B+ tree structures for implementing inverted files on a DOS platform. We now use hashing to replace B+trees, establishing codes to replace character strings; and the UNIX platform provides faster processing than the DOS platform. b. We built an index forasemantic lexicon based on the public domain 1911 version of Roget's Thesaurus. To do this, we had to create our own category numbering system similar to today's version of Roget's Thesaurus. c. We solved part of the blend problem for semantic and keyword weights. We now base semantic category weights on the kifof words which generate the semantic categories. We can now index or scan TREC documents at rates faster than 60 Megabytes per hour depending on the workstation. We have a semantic lexicon of approximately 20,000 words with flexible category codes that allow a course (36 catego ries) through fine (more than 15,000 categories) semantic analysis. As shown in Section 4, our procedure for determining relevance is based on the senses of each word. For example, using the vector processing model and the similarity coefficient sim~,D~)- X Wqj~djj, i-i "liii" List Item Number First Entry Second Entry 1 .6*1*1fl~.0857 .6*1*1fl~.0857 2 0*1 *1/12m0 0*1*1I12~0 3 .6*1*1fl~.0857 .6*1*lPm.0857 4 .3*1*1/16~.0188 *3*1*1/16~.0188 5 0*1*1,12~0 0*1 *1/12~0 6 .3*1 *3/16m.0563 .3*1*3116~.o563 7 .3*1*7/24m.0875 .3*1 *7/24~ .0875 8 *3*1*1/8~.0375 0*1 *7/24mO 9 .3*1*1116m.0188 .6*1*1fl~.0857 10 .3*1*1I16~.0188 .3*1*1/16~.0188 11 .3* 1*1/8~.0375 .3*1*1/8~.0375 12 .3*1*1/16~.0188 .3*1*1/16~.0188 13 .3*1*3116m.0563 .3*1*3/16m.0563 14 0*1~0 0*lmO 15 0*1*7~~0 0*1*7/24~0 16 0*1*1/12m0 .6*1*1!7~.0857 17 0*1*1I12~0 0*1*1/12~0 18 0*1 *1/12~0 0*1 *1/12~0 19 0*1 *1/24~0 0*1 * 1/24-0 20 0*1 *1/12m0 0*1*1I12~0 21 .6*1*1rl~.0857 .6*1*1!7~.0857 22 T .6*1*1.0~.6000 .6*1*1.0~.6000 23 .6*1*~m.4000 .6*1 *213~ .4000 24 .6*1*1.0~.6000 .6*1*1.0~.6000 Figure 14. Weights of Semantic Components. Fourth List Item Number Value 1 .00734 2 0 3 .00734 4 .00035 5 0 6 .00317 7 .00734 8 0 9 .00170 10 .00035 11 .00141 12 .00035 13 .00317 14 0 15 0 16 0 17 0 18 0 19 0 20 0 21 .00734 22 36000 23 .16000 24 .36000 Sum of all values in Fourth List 0.91986 Figure 15. Multipiled Weights and Their Sum. 300 Queryid (Num): 47 of 50 Total number of documents over all queries Retrieved: 36610 Relevant: 2064 Rel~ret: 913 Interpolated Recall - Precision Averages at 0.00 0.3514 at 0.10 0.1968 at 0.20 0.1367 at 0.30 0.1082 at 0.40 0.0894 at 0.50 0.0752 at 0.60 0.0276 at 0.70 0.0105 at 0.80 0.0062 at 0.90 0.0013 at 1.00 0.0007 (non-interpolated) over all rel does 0.0746 Average precision Queryid ~um): 47 of 50 Total number of documents over all queries Retrieved: 36383 Relevant: 2064 Rel_ret: 956 Interpolated Recall at 0.00 at 0.10 at 0.20 at 0.30 at 0.40 at 0.50 at 0.60 at 0.70 at 0.80 at 0.90 at 1.00 Average precision - Precision Averages 0.3961 0.2479 0.1734 0.1258 0.1067 0.0838 0.0372 0.0195 0.0100 0.0029 0.0009 (non-interpolated) over all rel does 0.0919 Precision: Precision: At 5does: 0.1660 At Sdocs: 0.2426 At lOdocs: 0.1532 At lOdoes: 0.2149 At 15does: 0.1433 At lSdoes: 0.1801 At 20does: 0.1298 At 20does: 0.1574 At 30does: 0.1057 At 30does: 0.1383 At 100does: 0.0643 At l00does: 0.0745 At 200 does: 0.0465 At 200does: 0.0522 At 500 does: 0.0302 At 500 does: 0.0320 At 1000does: 0.0194 At 1000 does: 0.0203 R-Precision ([`recision after R (= num_rel for a query) R-Precision ([`recision after R (= num~rel for a query) does retrieved): does retrieved): Exact: 0.1035 Exact: 0.1283 Figure 16. Fillering Using Keywords. if the word "trains" is in the Query and the word "leaves "is in the Document and we look at the semantic category Motion with Reference to Direction (AMDR), then one of the vector product elements in the formula becomes: . p",abE.Iiy ~ Icavee" triggem AMDR~ where the probabilities are obtained from our semantic lexi- con. We plan to do more experiments incorporating the fol- lowing improvements: a. Modernize the semantic lexicon. Since our lexicon isbased on the 1911 version of Roget's Thesaurus, many modem words are not present and the senses of recorded words are not accurate. We plan to correct this. For example, we could try to get permission to use the current version of Roget's Thesaurus. b. Base similarity on paragraphs instead of whole documents. We have had success using as few as 36 categories in a paragraph environment. ~e also feel that relevance 301 Figure 17. Filtering Using Semantic Categories. decisions are made by humans looking at roughly a paragraph of information. We plan to modify our code to use paragraphs as a basis for the similarity measure. c. Experiment with the number of possible semantic cate- gories and the probability assigned to a triggered category. The experiment behind the performance improvement shown in Figure 16 and Figure 17 uses a very fine number of semantic categories and treats the triggered semantic categories for a word uniformly. We plan to experiment with a fewer number of categories, and we plan to obtain a probability distribution for categories based on word usage. Basically, we are trying to establish a statistically sound approach to using word sense information. Intuition is that word sense information should improve retrieval perform- ance. Furthermore, our approach to using word sense infor- mation has shown a significant performance improvement in a question/answer environment where paragraphs represent documents. We feel that other word sense approaches, such as query expansion or word sense disambiguation, may not be statistically sound, and that may be why successful experiments have not been reported. References [1] G. F. Cooper (1988). Probabilistic inference using belief networks is np-hard. Technical Report KSL-87-27, Stanford University, Stanford, CA. [2] W. B. Croft (1987). Approaches to intelligent infor- mation retrievaL Infonnation Processing & Manage- ment, 23(4):95-1 10. [3] C. Date (1990). An Introduction to Database Systems, Vol.1, Addison Wesley. [4] 3. Driscoll, L Lautenschlager and M. zhao (1993). The QA system, Proc. of the First TextRetrieval Conference (TREC-1), NIST Special Publication 500-207 (1). K. Harman, editor). [5] E. k Fox (1983) Characteriitation of two new exper- imental collections in computer and information science containing textual and bibliographic concepts. Technicat Report 83-561, Department of Computer Science, Cornell Univ., Ithaca, NY [6] L N. Kanal and 3. F. Lemmer Eds (1986). Uncertainty in Artificial Intelligence, North-Holland, Amsterdam. [7] J. F. Lemmer and L N. Kanal Eds (1988). Uncertainty in Artificial Jntelligence 2, North-Holland, Amsterdam. [8] E. L Liddy and S.H. Myaeng (1993). DR-Link, Proc. of the First Text Retrieval Conference (TREC-1), NIST Special Publication 500-207(1). K. Harman, editor). [9] R. E. Neapolitan (1990). Probabilistic Reasoning in Expert System, Theory and Algorithms, A Wiley- Interscience Publication, John Wiley & Sons, Inc. [10] J. Pearl (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann. [11] Y. Peng andj. A. Reggia (1989). Aconnectionist model for diagnostic problem solving. IEEE Transactions on Systems, Man, and Cybernetics, 19(2):285-298. (12] K. Sparck Jones and R. Bates (1977). Research on Automatic Indexing 1974-1976, Technical Report, Computer Laboratory, University of Cambridge. [13] M. Tagamets, J. Wald, M. Farach and J. A. Reggia (1989). Generating plausible diagnostic hypothesis with self-proccssin~ causal networks. Journal of ~er- imental Theories in Artif~cial Intelligence, 2:91-112. [14] Roget's International Thesaurus (1977). Harper & Row, New York, Fourth Edition. [15] H. Turtle and W. B. Croft (1991). Evaluation of an inference network-based retrieval model. ACM Trans- action on Information Systems, 9(3):187-222. (16] P. Van Learti and E. Arts (1987). SimulatedAnnealing: Theory andApplications. Boston: D. ReideL 302 [17] D. Voss and J. Driscoll (1992). Text Retrieval Using a Comprehensive Semantic Lexicon, Proceedings of ISMM First International Conference on Information and Rnowledge Management, Baltimore, Maryland. [18] P. Wang, J. Reggia, D. Nao and Y. Peng (1985). A formal model for diagnostic inference. Information Sciences, 37:227-285. [19] E. Wendlandt and J. Driscoll(1991). Incorporating a Semantic Analysis into a Document Retrieval Strategy, Proceedings of the Fourieenth Annual International ACMISIGIR Conference on Research andDevelopment in Information Retrieval, Chicago, Illinois, 270-279. <!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.0 Transitional//EN' 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'> <html xmlns='http://www.w3.org/1999/xhtml'> <head> <title>pFad - Phonifier reborn</title> <meta http-equiv='Content-Type' content='text/html; charset=utf-8' /> </head> <body> <h1>Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.</h1> <!-- Disclaimer --> <p>Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.</p> <br> <p>Alternative Proxies:</p><p><a href="http://rainy.clevelandohioweatherforecast.com/php-proxy/index.php?q=https://trec.nist.gov/pubs/trec2/papers/txt/29.txt" target="_blank">Alternative Proxy</a></p><p><a href="http://rainy.clevelandohioweatherforecast.com/pFad/index.php?u=https://trec.nist.gov/pubs/trec2/papers/txt/29.txt" target="_blank">pFad Proxy</a></p><p><a href="http://rainy.clevelandohioweatherforecast.com/pFad/v3index.php?u=https://trec.nist.gov/pubs/trec2/papers/txt/29.txt" target="_blank">pFad v3 Proxy</a></p><p><a href="http://rainy.clevelandohioweatherforecast.com/pFad/v4index.php?u=https://trec.nist.gov/pubs/trec2/papers/txt/29.txt" target="_blank">pFad v4 Proxy</a></p></body> </html>