Incorporating Semantics Within a Connectionist Model
                         and a Vector Processing Model

                         Richard Boyd, James Driscoll, I-Jen Syu
                           Department of Computer Science
                             University of Central Florida
                                Orlando, Florida 32816
                                  (407)823-2341
                                FAX: (407)823-5419
                             e-mail: driscoll@cs.ucf.edu

                      Abstract
   Semantic information obtained from the public domain
1911 version of Roget's Thesaurus is combined with key-
words to measure similarity between natural language topics
and documents.  Two approaches are explored. In one
approach, a combination of keyword relevance and semantic
relevance is achieved by using the vector processing model
for calculating similarity, but extending the use of a keyword
weight by using individual weights for each of its meanings.
This approach is based on the database concept of semantic
modeling and the linguistic concept of thematic roles. It is
applicable to both routing and archival retrieval. The second
approach is especially suited for routing. It is based on an AI
connectionist model.  In this approach, a probabilistic
inference network is modified using semantic information to
achieve a competitive activation mechanism that can be used
for calculating similarity.

Keywords: vector processing model, semantic data model,
semantic lexicon, inference network, connectionist model.

1.   Introduction
   The experiments reported here use a relatively efficient
method to detect the semantic representation of text. Our
original method is based on semantic modeling and is
described in [4,17,19].

   Semantic modeling was an object of considerable database
research in the late 1970's and early 1980's. A brief overview
can be found in [3]. Essentially, the semantic modeling
approach identified concepts useful in talking informally
about the real world. These concepts included the two notions
of entities (objects in the real world) and relationships among
entities (actions in the real world). Both entities and rela-
tionships have properties.

   The properties of entities are often called attributes. There
are basic or surface level attributes for entities in the real

world. Examples of surface level entity attributes are General
Dimensions, Color, and Position.  These properties are
prevalent in natural language. For example, consider the
phrase "large, black book on the table" which indicates the
General Dimensions, Color, and Position of the book.

 In linguistic research, the basic properties of relationships
are discussed and called thematic roles. Thematic roles are
also referred to in the literature as participant roles, semantic
roles and case roles. Examples of thematic roles are Bene-
ficiary and Time. Thematic roles are prevalent in natural
language; they reveal how sentence phrases and clauses are
semantically related to the verbs in a sentence. For example,
consider the phrase "purchase for Mary on Wednesday"
which indicates who benefited from a purchase (Beneficiary)
and when a purchase occurred (Time).

 A main goal of our research has been to detect thematic
information along with attribute information contained in
natural language queries and documents. In order to use this
additional information, the concept of text relevance needs
to be modified.

 In [17,19] the major modifications included the addition
of a lexicon with thematic and attribute information, and a
modified computation of a vector processing similarity
coefficient. That research concerned a Question/Answer
environment where queries were the length of a sentence and
documents were either a sentence or at most a paragraph. At
that time, our lexicon was based on 36 semantic categories,
and in that environment, our semantic approach produced a
significant improvement in retrieval performance.

 However, for TREC-1 [4], document and topic length
presented a problem and caused our semantic approach based
on 36 semantic categories to be of little value. As
reported in [4], though, breaking the TREC documents into
paragraphs produced a significant improvement.

This work has been supported in part by NASA KSC Cooperative Agreement NCC 10-003 Project 2, Florida High Technol-
ogy and Industry Council Grants 4940-11-28-721 and 4940-11-28-728.



  In Section 2, we describe our original semantic lexicon
and an extension which uses a larger number of semantic
categories.  Section 3 presents an application of an AI
connectionist model to the task of routing. Section 4 presents
an approach different from the one reported in TREC-1 [4],
using our extended semantic lexicon within the vector
processing model. Section 5 summarizes our research effort.

2.   The Semantic Lexicon
  Our semantic approach uses a thesaurus as a source of
semantic categories (thematic and attribute information). For
example, Roget's Thesaurus contains a hierarchy of word
classes to relate word senses [14]. In TREC-1 [4] and in
earlier research [17,19], we selected several classes from this
hierarchy to be used for semantic categories. We defined
thirty-six semantic categories as shown in Figure 1.

  In order to explain the assignment of semantic categories
to a given term using Roget's Thesaurus, consider the brief
index quotation for the term "vapor":

     vapor
           n. fog         404.2
              fume        401
              illusion    519.1
              spirit      4.3
              steam       328.10
              thing imagined 535.3
           v. be bombastic 601.6
              bluster     911.3
              boast       910.6
              exhale      310.23
              talk nonsense 547.5

The eleven different meanings of the term "vapor" are given
in terms of a numerical category. We developed a mapping
of the numerical categories in Roget's Thesaurus to the
thematic role and attribute categories given in Figure 1. In
this example, "fog" and "fume" correspond to the attribute
State; "steam" maps to the attribute Temperature; and "ex-
hale" is a trigger for the attribute Motion with Reference to
Direction. The remaining seven meanings associated with
"vapor" do not trigger any thematic roles or attributes. Since
there are eleven meanings associated with "vapor," we
indicated in the lexicon a probability of 1/11 each time a
category is triggered. Hence, a probability of 2/11 is assigned
to State, 1/11 to Temperature, and 1/11 to Motion with
Reference to Direction. This technique of calculating prob-
abilities is being used as a simple alternative to a corpus
analysis.
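
  To make this bookkeeping concrete, here is a minimal Python
sketch of the computation (our own illustration with toy data
mirroring the "vapor" entry; it is not the actual TREC code):

    from collections import Counter

    # Each sense of "vapor" either triggers one of our semantic
    # categories or triggers nothing (None).  Eleven senses in all.
    VAPOR_SENSES = ["ASTE", "ASTE", None, None, "ATMP", None,
                    None, None, None, "AMDR", None]

    def category_probabilities(senses):
        """Each sense contributes 1/len(senses) to the category it triggers."""
        n = len(senses)
        counts = Counter(s for s in senses if s is not None)
        return {cat: k / n for cat, k in counts.items()}

    print(category_probabilities(VAPOR_SENSES))
    # {'ASTE': 0.1818..., 'ATMP': 0.0909..., 'AMDR': 0.0909...}
    # i.e., 2/11 for State, 1/11 for Temperature, and 1/11 for
    # Motion with Reference to Direction, as in the text.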

 It should be pointed out that we are still experimenting
with other ways of calculating probabilities. For example, as
in [8], a probabilistic part-of-speech tagger could be used to
further restrict the different meanings of a term, and existing
lexical sources could be used to obtain an ordering based on
frequency of use for the different meanings of a term.

 As reported in [4], the use of 36 semantic categories caused
problems when dealing with TREC documents. When the
size of a document is large, a greater number of the 36
semantic categories are triggered in the document. Also,
when using the semantic approach described in [19], the
probability present for each category in a document is often
very close to one.
     Thematic Role Categories                      Attribute Categories
     TACM  Accompaniment                           ACOL  Color
     TAMT  Amount                                  AEID  External and Internal Dimensions
     TBNF  Beneficiary                             AFRM  Form
     TCSE  Cause                                   AGND  Gender
     TCND  Condition                               AGDM  General Dimensions
     TCMP  Comparison                              ALDM  Linear Dimensions
     TCNV  Conveyance                              AMFR  Motion Conjoined with Force
     TDGR  Degree                                  AGMT  Motion in General
     TDST  Destination                             AMDR  Motion with Reference to Direction
     TDUR  Duration                                AORD  Order
     TGOL  Goal                                    APHP  Physical Properties
     TINS  Instrument                              APOS  Position
     TSPL  Location/Space                          ASTE  State
     TMAN  Manner                                  ATMP  Temperature
     TMNS  Means                                   AUSE  Use
     TPUR  Purpose                                 AVAR  Variation
     TRNG  Range
     TRES  Result
     TSRC  Source
     TTIM  Time

                                 Figure 1. Thirty-Six Semantic Categories.



Consequently, almost every one of the 36 semantic categories
becomes present in every document. This causes semantic
category weights to become very low and useless within that
approach.

 As reported in [4], one way to solve this problem is to
break TREC documents into paragraphs. But another way
to solve the problem of long documents causing semantic
weights to be of little value is to have more semantic
categories. A large number of "semantic" categories can be
obtained (for example) by using all the categories and/or
subcategories found in Roget's Thesaurus, instead of the 36
semantic categories we have used. This may be a deviation
from database semantic modeling. In any case, it needs to be
examined.

 Consequently, for the experiments reported here, a
semantic lexicon was created based on all the word senses
found in the public domain 1911 version of Roget's The-
saurus. To provide an example, consider Topic 052 as shown
in Figure 2. Figure 3 indicates the keywords and frequency
information within Topic 052, along with the semantic
categories obtained from our extended lexicon for those
keywords. Note that stemming was not used for the pro-
cessing of Topic 052; so, some keywords in Topic 052 were
not located in our lexicon (e.g. sanctions).

 The categories recorded in our extended semantic lexicon
use the category numbers found in the 1911 version of Roget's
Thesaurus. These numbers are then followed by a part-of-
speech code also found in the 1911 version of Roget's
Thesaurus.  The number after the part-of-speech code
represents a sub-category, but this number does not appear
in the 1911 version of Roget's Thesaurus. That number was
created based on groupings of words within the thesaurus.
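
  For illustration, a recorded code such as 201n.1 splits into the
Roget category number (201), the part-of-speech code (n), and our
added sub-category number (1). A small Python sketch of this parse
(our own illustration, not the project's indexing code):

    import re

    # Codes look like "201n.1": Roget category number, part-of-speech
    # letter(s), and our added sub-category number.
    CODE = re.compile(r"(\d+)([a-z]+)\.(\d+)")

    def parse_code(code):
        m = CODE.fullmatch(code)
        if m is None:
            raise ValueError("unrecognized lexicon code: " + code)
        category, pos, subcat = m.groups()
        return int(category), pos, int(subcat)

    print(parse_code("201n.1"))   # (201, 'n', 1)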

<top>
<head> Tipster Topic Description
<num> Number: 052
<dom> Domain: International Economics
<title> Topic: South African Sanctions
<desc> Description:
Document discusses sanctions against South Africa.
<narr> Narrative:
A relevant document will discuss any aspect of South African sanctions, such
as: sanctions declared/proposed by a country against the South African
government in response to its apartheid policy, or in response to pressure by
an individual, organization or another country; international sanctions against
Pretoria imposed by the United Nations; the effects of sanctions against S.
Africa; opposition to sanctions; or, compliance with sanctions by a company.
The document will identify the sanctions instituted or being considered, e.g.,
corporate disinvestment, trade ban, academic boycott, arms embargo.
<con> Concept(s):
1. sanctions, international sanctions, economic sanctions
2. corporate exodus, corporate disinvestment, stock divestiture, ban on new
 investment, trade ban, import ban on South African diamonds, U.N. arms
 embargo, curtailment of defense contracts, cutoff of nonmilitary goods,
 academic boycott, reduction of cultural ties
3. apartheid, white domination, racism
4. anti-apartheid, black majority rule
5. Pretoria
<fac> Factor(s):
<nat> Nationality: South Africa
</fac>
<def> Definition(s):
</top>

                       Figure 2. Topic 052.

3. Connectionist Model Routing Experiments

  Recent work suggests that significant improvements in
retrieval performance will require a technique that, in some
sense, "understands" the content of documents and queries
and can be used to infer probable relationships between
documents and queries [2]. In this view, information retrieval
is an inference or evidential reasoning process in which we
estimate the probability that a user's information need is met
given a document as "evidence". The techniques required to
support this kind of inference are similar to those used in
expert systems that must reason with uncertain information.
Several probabilistically-oriented inference network models
have been developed using experimental document collec-
tions [5] during the past few years for information retrieval
[15].  These models are generally characterized by an
architecture with two layers corresponding to documents and
index terms. The documents and index terms are connected
by direct links. Initially, the prior probabilities of all root
nodes (nodes with no predecessors) and the conditional
probabilities of all non-root nodes (given all possible
combinations of their direct predecessors) must be specified.
A retrieval consists of one or more documents with the highest
posterior probability for the given set of index terms (evi-
dences) which represent a user's information need.
  Over the last few years, the technique of automated
inference using probabilistic inference networks has become
popular within the AI probability and uncertainty community,
particularly in the context of expert systems [6,7]. The most
important constraint on the use of a probabilistic network is
the fact that in general, the computation of the exact posterior
probabilities is NP-hard [1]. Thus it is unlikely that we could
develop an efficient general-purpose algorithm which would
work well for all kinds of inference networks. There are
several alternatives, such as the use of approximation algo-
rithms or heuristic algorithms, and creating special case
algorithms [9,10].
 The experiments here concern an attempt at a heuristic
probabilistic inference network approach based on an AI
connectionist model. The connectionist model uses a com-
petitive activation rule to find the most probable retrieval.
The term competitive activation rule refers to a spreading
activation method in which nodes actively compete for
available activation in a network. An initial formulation of
a competitive activation mechanism was previously studied
on three two-layer, abstract networks for diagnostic problem
solving [11,13]. The connectionist model proposed here
consists of a two-layer network architecture.  Document
nodes and index term nodes corresponding to each layer are
connected by links whose weights represent association
strengths between nodes. These links are also viewed as
channels for sending information between nodes. Figure 4
is a simple network consisting of two document nodes and
three index term nodes.

Topic 52

curtailment   001   201n.1 38n.2
individual    001   372a.0 87n.0
considered    001   611d.0
compliance    001   602n.3
reduction     001   103n.1 308n.1
economics     002   692n.3
corporate     003   43a.0
majority      001   102n.1 131n.0 33n.0
defense       001   717n.0 937n.2
policy        001   626n.2 692n.2
white         001   429a.1 430a.0 441n.5 996n.5

          Figure 3: Word Frequency and Semantic Categories
                    for Topic 052 (excerpt).



Figure 4: A Simple Network Consisting of Two Document
          Nodes and Three Index Term Nodes.

  At each moment of time, each node receives information
about the activation levels of its immediate neighboring nodes
(nodes connected to it via direct links), and then uses this
information to calculate its own activation level. Through this
process of spreading activation, the network settles down to
an equilibrium representing a retrieval for a user's information
need.
  The computation of the information retrieval inference
process is based on a formalization of the causal and proba-
bilistic associative knowledge underlying diagnostic prob-
lem-solving [18].  We do not discuss the formulation,
architecture, and activation mechanism of the connectionist
model; this information can be found in [11,13,16,18]. For
TREC-2, we managed to complete only one official routing
experiment for this approach, and it did not involve semantics.
The experiment was intended to be a baseline experiment for
our semantic experiments.
  For TREC-2, a specific network was constructed for 50
topics. A list of index terms was assembled based on
keywords in the concept section of each topic. In this network,
each output node represented a topic, and each input node
represented a keyword. The prior probability assigned to each
topic node was equal to 1/(total number of topics). The
connection strengths were assigned equal weights (0.9).
  The network contained 50 topic nodes and 848 index term
nodes. These nodes were connected via 1449 links. An
example of this network is shown in Figure 5, where p_i is the
prior probability of topic top_i. The keywords "army",
"engineer", and "plant" were obtained by processing the
concept section of topic top_i.  Currently, the network is
enhanced by using an estimated weighting scheme.

Figure 5: A Sample Network of the Experimental Model: topic
          node top_i (prior probability p_i = 0.02) linked with
          weight 0.9 to the index term nodes "army", "engineer",
          and "plant".
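
  The references above give the exact formulation; purely as an
illustration of the flavor of competitive activation (our own
simplification, not the rule of [11,13]), the following Python
sketch runs a miniature Figure 5 style network in which each
observed index term divides its activation among the topics it is
linked to, in proportion to link weight times current topic
activation, so that topics compete for the available evidence:

    import numpy as np

    # Toy network: rows are topic nodes, columns are index term nodes
    # (army, engineer, plant); 0.9 is the uniform connection strength
    # used in our TREC-2 routing network.
    W = np.array([[0.9, 0.9, 0.0],    # topic 1 linked to army, engineer
                  [0.0, 0.9, 0.9]])   # topic 2 linked to engineer, plant

    priors = np.full(2, 1.0 / 2)          # 1/(total number of topics)
    evidence = np.array([1.0, 1.0, 0.0])  # "army" and "engineer" observed

    a = priors.copy()
    for _ in range(100):
        claims = W * a[:, None]                         # each topic's claim per term
        shares = claims / (claims.sum(axis=0) + 1e-12)  # terms split activation
        received = (shares * evidence).sum(axis=1)      # evidence won by each topic
        a = 0.9 * a + 0.1 * received                    # relax toward the input

    print(a / a.sum())   # relative support: topic 1 dominates here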
  We performed a Category B routing experiment. Using
just keywords, the results were not good. The main problem
was due to the fact that, in the document ranking, many
documents had the same score used to generate the ranking.
In order to satisfy the requirements for the ranking, we had
to artificially rank those documents with the same score. This
was done based on order of appearance. The performance
was terrible except for Topic 66. This topic had only two
known relevant documents for Category B routing experi-
ments and our inference network retrieved one of them in the
top 20 documents!  No further connectionist model
experiments have been completed. We were unable to modify
the baseline keyword experiment or perform semantic
experiments for this approach.
4. Vector Processing Model Experiments
 In this section, we explain the manner in which semantics
is incorporated within a vector processing model using the
semantic lexicon explained in Section 2. Please note that an
entry in our semantic lexicon has the form of a word followed
by codes for each of the semantic categories the word triggers.
We explain our approach using a text relevance determination
procedure intended to show what is being calculated rather
than show the actual computations for the approach. The
procedure presented here generates several outputs that are
really not necessary, but are included just to help explain the
approach.  The relevance determination procedure is
explained using the four documents and query shown in
Figure 6. A few preliminary computations are reviewed in
order to explain the procedure.
 First, the number of documents each word is in must be
determined. Figure 7 shows a list of words from the four
documents and the query of Figure 6, along with the number
of documents each word is in (df).
 Next, the inverse document frequency (idf) of each word
is determined by the equation log10(N/df), where N = 4, the
total number of documents. Figure 8 provides the idf of each
word. Sometimes, the idf of a word is undefined. This can
happen when a word does not occur in the documents but
does occur in a query. For example, the words "depart," "do,"
and "when" do not appear in the four documents. Thus, the
idf of these terms cannot be defined here. Later, we will see
that an adjustment can be made for these undefined values.
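
  A minimal sketch of this idf computation (illustrative code only;
None stands in for an undefined idf):

    import math

    def idf(word, documents):
        """log10(N/df); None (undefined) when the word is in no document."""
        n = len(documents)
        df = sum(1 for doc in documents if word in doc)
        return None if df == 0 else math.log10(n / df)

    docs = [{"locomotives", "pull", "the", "trains"},
            {"people", "meet", "under", "the", "canopy", "and",
             "within", "trains"},
            {"trains", "carry", "freight", "from", "the", "station"},
            {"trains", "leave", "the", "station", "hourly", "until",
             "noon"}]

    print(idf("station", docs))   # log10(4/2), approximately .3
    print(idf("depart", docs))    # None: undefined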
 Next, the category probability of each query word is
determined. Figure 9 shows an alphabetized list of all the
unique words from the query, the frequency of each word in
the query, and the semantic categories each word triggers.

Document #1
Locomotives pull the trains.

Document #2
People meet people under the canopy and within trains.

Document #3
Trains carry freight from the station.

Document #4
Trains leave the station hourly until noon.

Query
When do trains depart the station?

       Figure 6. Four Documents and a Query.


       word       number of documents
                    the word is in (df)

       and                      1
       canopy                   1
       carry                    1
       do                       0
       depart                   0
       freight                  1
       from                     1
       hourly                   1
       leave                    1
       locomotives              1
       meet                     1
       noon                     1
       people                   1
       pull                     1
       station                  2
       the                      4
       trains                   4
       under                    1
       until                    1
       when                     0
       within                   1

  Figure 7. List of Words in the Documents and Query.


       word         idf of the word
                      log10(N/df)

       and               .6
       canopy            .6
       carry             .6
       depart         undefined
       do             undefined
       freight           .6
       from              .6
       hourly            .6
       leave             .6
       locomotives       .6
       meet              .6
       noon              .6
       people            .6
       pull              .6
       station           .3
       the                0
       trains             0
       under             .6
       until             .6
       when           undefined
       within            .6

         Figure 8. The idf of Each Word.


      word     frequency   category   probability

      depart       1       AMDR          1/4
                           TAMT          1/8

      do           1       AUSE          1/21
                           ATMP          1/21
                           TCSE          1/21
                           TCNV          2/21
                           TRES          1/21
                           TSRC          1/21

      station      1       APOS          3/16
                           AORD          1/8
                           TAMT          1/16
                           TCND          1/8
                           TDGR          1/16
                           TSPL          3/16

      the          1       (none)

      trains       1       AORD          7/24
                           AMDR          1/12
                           AMFR          1/12
                           TACM          1/24
                           TCNV          1/12

      when         1       TAMT          1/3
                           TTIM          2/3

           Figure 9. Words in the Query.



  The semantic categories in our example are those shown
in Figure 1. For example, consider the word "depart" which
occurs one time in the query as shown in Figure 9. The
semantic lexicon entry for the word "depart" using the
categories of Figure 1 is as follows:

depart: NONE NONE NONE NONE NONE AMDR
     AMDR TAMT

where NONE represents a word sense not included in the 36
semantic categories of Figure 1. If a uniform distribution is
assumed, then AMDR is triggered 1/4 of the time (two of the
eight senses) and TAMT is triggered 1/8 of the time. This is
shown in Figure 9 as the probabilities for each semantic
category.

  A similar category probability determination is done for
each document. Figure 10 is an alphabetized list of all the
unique words in Document #4 of Figure 6. The semantic
categories each word triggers along with probabilities are also
shown.

  The text relevance determination procedure is shown in
Figure 11. The procedure uses three input lists:

a. List of words and the idf of each word, as shown in Figure
  8.

b. List of words in the query and the semantic categories they
  trigger along with the probability of triggering those
  categories, as shown in Figure 9.

c. List of words in a document and the semantic categories
  they trigger along with the probability of triggering those
  categories, as shown in Figure 10.

  The procedure operates as follows:

Step 1.

  This step determines the common meanings between the
query and the document. Figure 12 corresponds to the output
of Step 1 for Document #4. In Step 1, a new list is created as
follows:

For each word in the query, follow either subsection (a) or
(b), whichever applies:

a. For each category the word triggers, find each word in the
  document that triggers the category and output three things:

  1) The word in the query and its frequency of occurrence.
  2) The word in the document and its frequency of
    occurrence.
  3) The category.

b. If the word does not trigger a category, then look for the
  word in the document and, if found, output two things and
  leave the category blank:

  1) The word in the query and its frequency of occurrence.
  2) The word in the document and its frequency of
    occurrence.



word      frequency   category   probability

hourly        1       TTIM        1.0
leave         1       AMDR        1/7
                      TAMT        1/7
noon          1       ALDM        1/3
                      TTIM        2/3
the           1       (none)
station       1       APOS        3/16
                      AORD        1/8
                      TAMT        1/16
                      TCND        1/8
                      TDGR        1/16
                      TSPL        3/16
trains        1       AORD        7/24
                      AMDR        1/12
                      AMFR        1/12
                      TACM        1/24
                      TCNV        1/12
until         1       TTIM        1.0

     Figure 10. Words in Document #4.


Step 1 - Refer to Figure 12.
   Determine common meaning
   between query and the document.

Step 2 - Refer to Figure 13.
   Adjust for words in the
   query that are not in any
   of the documents.

Step 3 - Refer to Figure 14.
   Calculate the weight of a
   semantic component in the query
   and calculate the weight of a
   semantic component in the document.

Step 4 - Refer to Figure 15.
   Multiply the weight in the query
   by the weight in the document.

Step 5 - Refer to Figure 15.
   Sum all the individual products
   of Step 4 into a single value which
   is the semantic similarity coefficient.

Figure 11. Relevance Determination Procedure to Explain
        Semantic Similarity.

                                First List
 Item          First Entry            Second Entry             Third Entry
Number      Word & Frequency         Word & Frequency           Category
               in Query               in Document #4

   1           (depart,1)              (leave,1)                  AMDR
   2           (depart,1)              (trains,1)                 AMDR
   3           (depart,1)              (leave,1)                  TAMT
   4           (depart,1)              (station,1)                TAMT
   5           (do,1)                  (trains,1)                 TCNV
   6           (station,1)             (station,1)                APOS
   7           (station,1)             (station,1)                AORD
   8           (station,1)             (trains,1)                 AORD
   9           (station,1)             (leave,1)                  TAMT
  10           (station,1)             (station,1)                TAMT
  11           (station,1)             (station,1)                TCND
  12           (station,1)             (station,1)                TDGR
  13           (station,1)             (station,1)                TSPL
  14           (the,1)                 (the,1)
  15           (trains,1)              (trains,1)                 AORD
  16           (trains,1)              (leave,1)                  AMDR
  17           (trains,1)              (trains,1)                 AMDR
  18           (trains,1)              (trains,1)                 AMFR
  19           (trains,1)              (trains,1)                 TACM
  20           (trains,1)              (trains,1)                 TCNV
  21           (when,1)                (leave,1)                  TAMT
  22           (when,1)                (hourly,1)                 TTIM
  23           (when,1)                (noon,1)                   TTIM
  24           (when,1)                (until,1)                  TTIM

                            Figure 12. Common Meaning.


                               Second List
 Item          First Entry            Second Entry             Third Entry
Number      Word & Frequency         Word & Frequency           Category
               in Query               in Document #4

   1           (leave,1)               (leave,1)                  AMDR
   2           (trains,1)              (trains,1)                 AMDR
   3           (leave,1)               (leave,1)                  TAMT
   4           (station,1)             (station,1)                TAMT
   5           (trains,1)              (trains,1)                 TCNV
   6           (station,1)             (station,1)                APOS
   7           (station,1)             (station,1)                AORD
   8           (station,1)             (trains,1)                 AORD
   9           (station,1)             (leave,1)                  TAMT
  10           (station,1)             (station,1)                TAMT
  11           (station,1)             (station,1)                TCND
  12           (station,1)             (station,1)                TDGR
  13           (station,1)             (station,1)                TSPL
  14           (the,1)                 (the,1)
  15           (trains,1)              (trains,1)                 AORD
  16           (trains,1)              (leave,1)                  AMDR
  17           (trains,1)              (trains,1)                 AMDR
  18           (trains,1)              (trains,1)                 AMFR
  19           (trains,1)              (trains,1)                 TACM
  20           (trains,1)              (trains,1)                 TCNV
  21           (leave,1)               (leave,1)                  TAMT
  22           (hourly,1)              (hourly,1)                 TTIM
  23           (noon,1)                (noon,1)                   TTIM
  24           (until,1)               (until,1)                  TTIM

                   Figure 13. Adjustment for Words with no idf.



  Considering Figure 12, the word "depart" occurs in the
query one time and triggers the category AMDR. The word
"leave" occurs in Document #4 once and also triggers the
category AMDR. Thus, item 1 in Figure 12 corresponds to
subsection (a) as described above. An example using sub-
section (b) occurs in item 14 of Figure 12.
Step 2.
  This step adjusts for words in the query that are not in any
of the documents. Figure 13 shows the output of Step 2 for
Document #4. In this step, another list is created from the list
created in Step 1. For each item in the Step 1 list which has
a word with undefined idf, this step replaces the word in the
First Entry column by the word in the Second Entry column.
For example, the word "depart" has an undefined idf as shown
in Figure 8. Thus, the word "depart" in item 1 of Figure 12
should be replaced by the word "leave" from the Second Entry
column. This is shown in item 1 of Figure 13. Likewise, the
words "do" and "when" also have an undefined idf and are
respectively replaced by the words from the Second Entry
column.
Step 3.
  This step calculates the weight of a semantic component
in the query and calculates the weight of a semantic compo-
nent in the document. Figure 14 shows the output of Step 3
for Document #4. In Step 3, another list is created from the
list created in Step 2 as follows:
For each item in the Step 2 list, follow either subsection (a)
or (b), whichever applies:
a. If the Third Entry specifies a category, then
  1) Replace the First Entry by computing:

     (idf of word in First Entry) x (frequency of word in
     First Entry) x (probability the word triggers the
     category in the Third Entry)

  2) Replace the Second Entry by computing:

     (idf of word in Second Entry) x (frequency of word in
     Second Entry) x (probability the word triggers the
     category in the Third Entry)

  3) Omit the Third Entry.
b. If the Third Entry does not specify a category, then
  1) Replace the First Entry by computing:

     (idf of word in First Entry) x (frequency of word in
     First Entry)

  2) Replace the Second Entry by computing:

     (idf of word in Second Entry) x (frequency of word in
     Second Entry)

  3) Omit the Third Entry.
  In Figure 14, item 1 is an example of using subsection (a),
and item 14 is an example of using subsection (b).



Step 4.
  This step multiplies the weights in the query by the weights
in the document. The top portion of Figure 15 shows the
output of Step 4. In the list created here, the numerical value
created in the First Entry column of Figure 14 is multiplied
by the numerical value created in the Second Entry column
of Figure 14.
Step 5.
  This step sums the values in the Step 4 list to compute the
semantic similarity coefficient for a particular document. The
bottom portion of Figure 15 shows the output of Step 5 for
Document #4.
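
  The whole calculation is small enough to sketch end to end. The
following Python is our illustrative reconstruction of Steps 1
through 5 (not the project's production code): the dictionaries
hold the Figure 9 and Figure 10 data, the idf values are the
rounded ones of Figure 8 (None marking an undefined idf), and
exact fractions keep the arithmetic transparent.

    from fractions import Fraction as F

    # Figure 8 (rounded): idf per word; None = undefined idf.
    IDF = {"depart": None, "do": None, "when": None,
           "the": F(0), "trains": F(0), "station": F(3, 10),
           "leave": F(6, 10), "hourly": F(6, 10),
           "noon": F(6, 10), "until": F(6, 10)}

    # Figure 9: categories triggered by each query word.
    QUERY = {"depart":  {"AMDR": F(1, 4), "TAMT": F(1, 8)},
             "do":      {"AUSE": F(1, 21), "ATMP": F(1, 21),
                         "TCSE": F(1, 21), "TCNV": F(2, 21),
                         "TRES": F(1, 21), "TSRC": F(1, 21)},
             "station": {"APOS": F(3, 16), "AORD": F(1, 8),
                         "TAMT": F(1, 16), "TCND": F(1, 8),
                         "TDGR": F(1, 16), "TSPL": F(3, 16)},
             "the":     {},
             "trains":  {"AORD": F(7, 24), "AMDR": F(1, 12),
                         "AMFR": F(1, 12), "TACM": F(1, 24),
                         "TCNV": F(1, 12)},
             "when":    {"TAMT": F(1, 3), "TTIM": F(2, 3)}}

    # Figure 10: categories triggered by each word of Document #4.
    DOC4 = {"hourly":  {"TTIM": F(1)},
            "leave":   {"AMDR": F(1, 7), "TAMT": F(1, 7)},
            "noon":    {"ALDM": F(1, 3), "TTIM": F(2, 3)},
            "the":     {},
            "station": QUERY["station"],
            "trains":  QUERY["trains"],
            "until":   {"TTIM": F(1)}}

    def similarity(query, doc, idf):
        """Steps 1-5 for one document (every frequency here is 1)."""
        total = F(0)
        for qw, qcats in query.items():
            if not qcats:                       # Step 1(b): plain keyword match
                if qw in doc:
                    total += (idf[qw] * 1) ** 2
                continue
            for cat, qprob in qcats.items():    # Step 1(a): shared category
                for dw, dcats in doc.items():
                    if cat not in dcats:
                        continue
                    # Step 2: replace a query word with undefined idf
                    # by the matching document word.
                    if idf[qw] is None:
                        wq = idf[dw] * 1 * dcats[cat]
                    else:
                        wq = idf[qw] * 1 * qprob
                    wd = idf[dw] * 1 * dcats[cat]    # Step 3
                    total += wq * wd                 # Steps 4 and 5
        return total

    print(float(similarity(QUERY, DOC4, IDF)))
    # ~0.914; Figure 15 reports 0.91986 (the gap comes from rounding
    # in the printed tables and from a few pairs the figure omits).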
  We have finally observed an improved Precision/Recall
performance using the semantic similarity coefficient
explained here.  For example, in a Category B filtering
experiment where the words being considered were only those
in the topics and idf values were determined by the number
of topics a word is in, we have observed the keyword and
semantic results shown in Figure 16 and Figure 17, respec-
tively. The 11-pt average for these two experiments reveals
a 23% increase due to the use of semantic categories.
According to Sparck Jones' criteria, this change would be
classified as "significant" (greater than 10.0%) [12]. We
believe further improvement is possible by considering more
words, stemming for plurals and tenses of words, better idf
values (like those used for archival retrieval), a modern
lexicon, and a focus on paragraphs instead of whole docu-
ments.

5. Summary
  Our progress during TREC-1 and TREC-2 has been the
following:
a. We created efficient code for a UNIX platform. Originally,
  our code used B+ tree structures for implementing inverted
  files on a DOS platform. We now use hashing to replace
  B+ trees and codes to replace character strings, and the
  UNIX platform provides faster processing than the DOS
  platform.
b. We built an index for a semantic lexicon based on the public
  domain 1911 version of Roget's Thesaurus. To do this,
  we had to create our own category numbering system
  similar to today's version of Roget's Thesaurus.
c. We solved part of the blend problem for semantic and
  keyword weights. We now base semantic category weights
  on the idf of words which generate the semantic categories.
  We can now index or scan TREC documents at rates faster
than 60 Megabytes per hour depending on the workstation.
We have a semantic lexicon of approximately 20,000 words
with flexible category codes that allow a coarse (36 catego-
ries) through fine (more than 15,000 categories) semantic
analysis.  As shown in Section 4, our procedure for
determining relevance is based on the senses of each word.
For example, using the vector processing model and the
similarity coefficient

                sim(Q, D_i) = Σ_j (w_qj x w_dij),

if the word "trains" is in the Query and the word "leaves" is
in the Document and we look at the semantic category Motion
with Reference to Direction (AMDR), then one of the vector
product elements in the formula becomes:

   (idf of "trains" x frequency of "trains" x probability
    "trains" triggers AMDR)
 x (idf of "leaves" x frequency of "leaves" x probability
    "leaves" triggers AMDR)

where the probabilities are obtained from our semantic lexi-
con.

                           "liii" List

Item Number            First Entry                      Second Entry

    1                .6*1*1fl~.0857                   .6*1*1fl~.0857
    2                0*1 *1/12m0                      0*1*1I12~0
    3                .6*1*1fl~.0857                   .6*1*lPm.0857
    4                .3*1*1/16~.0188                  *3*1*1/16~.0188
    5                0*1*1,12~0                       0*1 *1/12~0
    6                .3*1 *3/16m.0563                 .3*1*3116~.o563
    7                .3*1*7/24m.0875                  .3*1 *7/24~ .0875
    8                *3*1*1/8~.0375                   0*1 *7/24mO
    9                .3*1*1116m.0188                  .6*1*1fl~.0857
   10                .3*1*1I16~.0188                  .3*1*1/16~.0188
   11                .3* 1*1/8~.0375                  .3*1*1/8~.0375
   12                .3*1*1/16~.0188                  .3*1*1/16~.0188
   13                .3*1*3116m.0563                  .3*1*3/16m.0563
   14                0*1~0                            0*lmO
   15                0*1*7~~0                         0*1*7/24~0
   16                0*1*1/12m0                       .6*1*1!7~.0857
   17                0*1*1I12~0                       0*1*1/12~0
   18                0*1 *1/12~0                      0*1 *1/12~0
   19                0*1 *1/24~0                      0*1 * 1/24-0
   20                0*1 *1/12m0                      0*1*1I12~0
   21                .6*1*1rl~.0857                   .6*1*1!7~.0857
   22                                                                                       T
                     .6*1*1.0~.6000                   .6*1*1.0~.6000
   23                .6*1*~m.4000                     .6*1 *213~ .4000
   24                .6*1*1.0~.6000                   .6*1*1.0~.6000

            Figure 14. Weights of Semantic Components.


                          Fourth List

                 Item Number         Value
                     1               .00734
                     2               0
                     3               .00734
                     4               .00035
                     5               0
                     6               .00317
                     7               .00734
                     8               0
                     9               .00170
                    10               .00035
                    11               .00141
                    12               .00035
                    13               .00317
                    14               0
                    15               0
                    16               0
                    17               0
                    18               0
                    19               0
                    20               0
                    21               .00734
                    22               .36000
                    23               .16000
                    24               .36000

                  Sum of all values in Fourth List
                           0.91986

            Figure 15. Multiplied Weights and Their Sum.



Queryid (Num):             47 of 50
Total number of documents over all queries
    Retrieved:              36610
    Relevant:                2064
    Rel_ret:                  913
Interpolated Recall - Precision Averages
    at 0.00                 0.3514
    at 0.10                 0.1968
    at 0.20                 0.1367
    at 0.30                 0.1082
    at 0.40                 0.0894
    at 0.50                 0.0752
    at 0.60                 0.0276
    at 0.70                 0.0105
    at 0.80                 0.0062
    at 0.90                 0.0013
    at 1.00                 0.0007
Average precision (non-interpolated) over all rel docs
                            0.0746
Precision:
    At    5 docs:           0.1660
    At   10 docs:           0.1532
    At   15 docs:           0.1433
    At   20 docs:           0.1298
    At   30 docs:           0.1057
    At  100 docs:           0.0643
    At  200 docs:           0.0465
    At  500 docs:           0.0302
    At 1000 docs:           0.0194
R-Precision (Precision after R (= num_rel for a query)
docs retrieved):
    Exact:                  0.1035

           Figure 16. Filtering Using Keywords.


Queryid (Num):             47 of 50
Total number of documents over all queries
    Retrieved:              36383
    Relevant:                2064
    Rel_ret:                  956
Interpolated Recall - Precision Averages
    at 0.00                 0.3961
    at 0.10                 0.2479
    at 0.20                 0.1734
    at 0.30                 0.1258
    at 0.40                 0.1067
    at 0.50                 0.0838
    at 0.60                 0.0372
    at 0.70                 0.0195
    at 0.80                 0.0100
    at 0.90                 0.0029
    at 1.00                 0.0009
Average precision (non-interpolated) over all rel docs
                            0.0919
Precision:
    At    5 docs:           0.2426
    At   10 docs:           0.2149
    At   15 docs:           0.1801
    At   20 docs:           0.1574
    At   30 docs:           0.1383
    At  100 docs:           0.0745
    At  200 docs:           0.0522
    At  500 docs:           0.0320
    At 1000 docs:           0.0203
R-Precision (Precision after R (= num_rel for a query)
docs retrieved):
    Exact:                  0.1283

      Figure 17. Filtering Using Semantic Categories.

if the word "trains" is in the Query and the word "leaves "is
in the Document and we look at the semantic category Motion
with Reference to Direction (AMDR), then one of the vector
product elements in the formula becomes:
                         .           p",abE.Iiy  ~
                                  Icavee" triggem AMDR~


where the probabilities are obtained from our semantic lexi-
con.

  We plan to do more experiments incorporating the fol-
lowing improvements:

a. Modernize the semantic lexicon. Since our lexicon is based
  on the 1911 version of Roget's Thesaurus, many modern
  words are not present and the senses of recorded words are
  not accurate. We plan to correct this. For example, we
  could try to get permission to use the current version of
  Roget's Thesaurus.

b. Base similarity on paragraphs instead of whole documents.
  We have had success using as few as 36 categories in a
  paragraph environment. We also feel that relevance
  decisions are made by humans looking at roughly a
  paragraph of information. We plan to modify our code to
  use paragraphs as a basis for the similarity measure.

c. Experiment with the number of possible semantic cate-
  gories and the probability assigned to a triggered category.
  The experiment behind the performance improvement
  shown in Figure 16 and Figure 17 uses a very fine set
  of semantic categories and treats the triggered semantic
  categories for a word uniformly. We plan to experiment
  with fewer categories, and we plan to obtain
  a probability distribution for categories based on word
  usage.

  Basically, we are trying to establish a statistically sound
approach to using word sense information. Our intuition is
that word sense information should improve retrieval perform-
ance. Furthermore, our approach to using word sense infor-
mation has shown a significant performance improvement in
a question/answer environment where paragraphs represent
documents. We feel that other word sense approaches, such
as query expansion or word sense disambiguation, may not
be statistically sound, and that may be why successful
experiments have not been reported.

References
[1] G. F. Cooper (1988). Probabilistic inference using belief
   networks is NP-hard. Technical Report KSL-87-27,
   Stanford University, Stanford, CA.

[2] W. B. Croft (1987). Approaches to intelligent infor-
   mation retrieval. Information Processing & Manage-
   ment, 23(4):95-110.

[3] C. Date (1990). An Introduction to Database Systems,
   Vol. 1, Addison Wesley.

[4] J. Driscoll, L. Lautenschlager and M. Zhao (1993). The
   QA system, Proc. of the First Text Retrieval Conference
   (TREC-1), NIST Special Publication 500-207 (D. K.
   Harman, editor).

[5] E. A. Fox (1983). Characterization of two new exper-
   imental collections in computer and information science
   containing textual and bibliographic concepts.
   Technical Report 83-561, Department of Computer
   Science, Cornell Univ., Ithaca, NY.

[6] L. N. Kanal and J. F. Lemmer, Eds. (1986). Uncertainty
   in Artificial Intelligence, North-Holland, Amsterdam.

[7] J. F. Lemmer and L. N. Kanal, Eds. (1988). Uncertainty
   in Artificial Intelligence 2, North-Holland, Amsterdam.

[8] E. D. Liddy and S. H. Myaeng (1993). DR-LINK, Proc.
   of the First Text Retrieval Conference (TREC-1), NIST
   Special Publication 500-207 (D. K. Harman, editor).

[9] R. E. Neapolitan (1990). Probabilistic Reasoning in
   Expert Systems: Theory and Algorithms, A Wiley-
   Interscience Publication, John Wiley & Sons, Inc.

[10] J. Pearl (1988). Probabilistic Reasoning in Intelligent
   Systems: Networks of Plausible Inference. San Mateo,
   CA: Morgan Kaufmann.

[11] Y. Peng and J. A. Reggia (1989). A connectionist model
   for diagnostic problem solving. IEEE Transactions on
   Systems, Man, and Cybernetics, 19(2):285-298.

[12] K. Sparck Jones and R. Bates (1977). Research on
   Automatic Indexing 1974-1976, Technical Report,
   Computer Laboratory, University of Cambridge.

[13] M. Tagamets, J. Wald, M. Farach and J. A. Reggia
   (1989). Generating plausible diagnostic hypotheses with
   self-processing causal networks. Journal of Exper-
   imental and Theoretical Artificial Intelligence, 2:91-112.

[14] Roget's International Thesaurus (1977). Harper &
   Row, New York, Fourth Edition.

[15] H. Turtle and W. B. Croft (1991). Evaluation of an
   inference network-based retrieval model. ACM Trans-
   actions on Information Systems, 9(3):187-222.

[16] P. van Laarhoven and E. Aarts (1987). Simulated
   Annealing: Theory and Applications. Boston: D. Reidel.

[17] D. Voss and J. Driscoll (1992). Text Retrieval Using a
   Comprehensive Semantic Lexicon, Proceedings of the
   ISMM First International Conference on Information
   and Knowledge Management, Baltimore, Maryland.

[18] P. Wang, J. Reggia, D. Nau and Y. Peng (1985). A
   formal model for diagnostic inference. Information
   Sciences, 37:227-285.

[19] E. Wendlandt and J. Driscoll (1991). Incorporating a
   Semantic Analysis into a Document Retrieval Strategy,
   Proceedings of the Fourteenth Annual International
   ACM/SIGIR Conference on Research and Development
   in Information Retrieval, Chicago, Illinois, 270-279.

