Incorporating Semanfics Within a Connectionist Model
and a Vector Processing Model

Richard Boyd, James Driscoll, mien Syu
Department of Computer Science
University of Central Florida
Orlando, Florida 32816
(407)823-2341
FAX: (407)823-5419
e-mail: driscoll@cs.ucf.edu

Abstract
Semantic information obtained from the public domain
1911 version of Roget's Thesaurus is combined with key-
words to measure similarity between natural language topics
and documents. Two approaches are explored. In one
approach, a combination of keyword relevance and semantic
relevance is achieved by using the vector processing model
for calculating similarity, but extending the use of a keyword
weight by using individual weights for each of its meanings.
This approach is based on the database concept of semantic
modeling and the linguistic concept of thematic roles. It is
applicable to both routing and archival retrievaL The second
approach is especially suited for routing. It is based on an Al
connectionist model. In this approach, a probabilistic
inference network is modified using semantic information to
achieve a competitive activation mechanism that can be used
for calculating similarity.

Keywords: vector processing model, semantic data model,
semantic lexicon, inference network, connectionist model.

1 . Introduction
The experiments reported here use a relatively efficient
method to detect the semantic representation of text. Our
original method is based on semantic modeling and is
described in [4,17,19).

Semantic modeling was an object of considerable database
research in the late 1970's and early 1980's. Abriefoverview
can be found in [3]. Essentially, the semantic modeling
approach identified concepts useful in talking informally
about the real world. These concepts included the two notions
of entities (objects in the real world) and relationships among
entities (actions in the real world). Both entities and rela-
tionships have properties.

The properties of entities are often called attributes. There
are basic or surface level attributes for entities in the real

world. Examples of surface level entity attributes are General
Dimensions, Color, and Position. These properties are
prevalent in natural language. For example, consider the
phrase "large, black book on the table" which indicates the
General Dimensions, Color, and Position of the book.

In linguistic research, the basic properties of relationships
are discussed and called thematic roles. Thematic roles are
also referred to in the literature as participant roles, semantic
roles and case roles. Examples of thematic roles are Ben~
ficiary and Time. Thematic roles are prevalent in natural
language; they reveal how sentence phrases and clauses are
semantically related to the verbs in a sentence. For example,
consider the phrase "purchase for Mary on Wednesday"
which indicates who benefited from a purchase(13eneficiary)
and when a purchase occurred (Fime).

A main goal of our research has been to detect thematic
information along with attribute information contained in
natural language queries and documents. In order to use this
additional information, the concept of text relevance needs
to be modified.

In [17,19] the major modifications included the addition
of a lexicon with thematic and attribute information, and a
modified computation of a vector processing similarity
coefficient. That research concerned a Question/Answer
environment where queries were the length of a sentence and
documents were either a sentence or at most a paragraph. At
that time, our lexicon was based on 36 semantic categories,
and in that environment, our semantic approach produced a
significant improvement in retrieval performance.

However, for TREC-1 [4], document and topic length
presented a problem and caused our semantic approach based
on 36 semantic categories to be of little value. However, as
reported in [4], by breaking the TREC documents into
paragraphs, a significant improvement was demonstrated.

This work has been supported in part by NASA KSC Cooperative Agreement NCC 10~3 Project 2, Florida High Technol-
ogy and Industry Council Grants 494011-28-721 and 4940-1 1-2~728.

291

In Section 2, we describe our original semantic lexicon
and an extension which uses a larger number of semantic
categories. Section 3 presents an application of an Al
connectionist model to the task of routing. Section 4 presents
an approach different than reported in TREC-1 [4], using our
extended semantic lexicon within the vector processing
model. Section 5 summarizes our rasearch effort.

2. The Semantic Lexicon
Our semantic approach uses a thesaurus as a source of
semantic categories (thematic and attribute information). For
example, Roget's Thesaurus contains a hierarchy of word
classes to relate word senses [14]. In TREC-1 [4] and in
earlier research [17,19], we selected several classes from this
hierarchy to be used for semantic categories. We defined
thirty-six semantic categories as shown in Figure 1.

In order to explain the assignment of semantic categories
to a given term using Roget's Thesaurus, consider the brief
index quotation for the term "vapor":

vapor
n. fog 404.2
fume 401
illusion 519.1
spirit 4.3
steam 328.10
thing imagi~ed 535.3
v. be bombastic 601.6
bluster 911.3
boast 910.6
exhale 310.23
talk nonsense 547.5

The eleven different meanings of the term "vapor" are given
in terms of a numerical category. We developed a mapping
of the numerical categories in Roget's Thesaurus to the
thematic role and attribute categories given in Figure 1. In
this example, "fog" and "fume" correspond to the attribute
State; "steam" maps to the attribute Temperature; and "ex-
hale" is a trigger for the attribute Motion with Reference to
Direction. The remaining seven meanings associated with
"vapor" do not trigger any thematic roles or attributes. Since
there are eleven meanings associated with "vapor," we
indicated in the lexicon a probability of 1/11 each time a
category is triggered. Hence, a probability of 2/11 is assigned
to State, 1/11 to Temperature, and 1/11 to Motion with
Reference to Direction. This technique of calculating prob-
abilities is being used as a simple alternative to a corpus
analysis.

It should be pointed out that we are still experimenting
with other ways of calculating probabilities. For example, as
in [8], a probabilistic part-of-speech tagger could be used to
further restrict the different meanings of a term, and existing
lexical sources could be used to obtain an ordering based on
frequency of use for the different meanings of a term.

As reported in [4], the use of 36 semantic categories caused
problems when dealing with TREC documents. When the
size of a document is large, a greater number of the 36
semantic categories are triggered in the document. Also,
when using the semantic approach described in [19] the
probability present for each category in a document is often
very close to one. Consequently, almost every one of the

Thematic Role Categories Attribute Categories
TACM Accomnaniment ACOL Color
TAMT Amount AEID External and Internal Dimensions
ThNF Beneficiarv AFRM Form
TCSE Cause AOND Gender
TCND Condition AODM General Dimensions
TCMP Comnenson ALDM Linear Dimensions
TCNV Conve ance AMFR Motion Conjoined with Foree
ThOR De~e AOMT Motion in General
ThST Destination AMDR Motion with Reference to Direction
ThUR Duration AORD Order
TOOL Ooal APIIP Phvsical Pronerties
TINS Instrument APOS Position
TSPL I:c~tion/Si,ace ASTE State
TMAN Manner A~mrature
TMNS Means AUSE Use
ThUR Purpc~e AVAR Variation
ThNO Ran~
i~FS Result
TSRC Source
TTIM Time

Figure 1. Thirty-Six Semantic Categories.

292

36 semantic categories becomes present in every document.
This causes semantic category weights to become very low
and useless within that approach.

As ~reportedin[4], one way to solve this problem is to
break ThECdocum ents into paragraphs. But, another way
to solve the problem of long documents causing semantic
weights to be of little value is to have more semantic
categories. A large number of "semantic" categories can be
obtained (for example) by using ~ the categories and/or
subcategories found in Roget's Thesaurus, instead of the 36
semantic categories we have used. This may be a deviation
from database semantic modeling. In any case, it needs to be
examined.

Consequently, for the experiments reported here, a
semantic lexicon was created based on all the word senses
found in the public domain 1911 version of Roget's The-
saurus. To provide an example, consider Topic 052 as shown
in Figure 2. Fi~re 3 indicates the keywords and frequency
information within Topic 052, along with the semantic
categories obtained from our extended lexicon for those
keywords. Note that stemming was not used for the pro-
cessing of Topic 052; so, some keywords in Topic 052 were
not located in our lexicon (e.g. sanctions).

The categories recorded in our extended semantic lexicon
usethe category numbers found in the 1911 version of Roget's
Thesaurus. These numbers are then followed by a part-of-
speech code also found in the 1911 version of Roget's
Thesaurus. The number after the part-of-speech code
represents a sub-category, but this number does not appear
in the 1911 version of Roget's Thesaurus. That number was
created based on groupings of words within the thesaurus.

~op>
<head> TIpster Topic Description
<num> Number: 052
<dom> Domain: International Economics
<titie> Topic: South African Sanctions
<desc> Description:
Document discusses sanctions against South Africa
<narr> Narrative:
A relevant document will discuss any aspect 0' South African sanctions, such
as: sanctions dccl~(po~ by a country against the South African
government in response to its apaitheid poncy, or in response topressure by
an indIvidual, organization or another country; intemational sanctions against
Pretoria imposed bythe United Nations; the effects 0' sanctions against &
Africa; opposition to sanctions; or' compliance with sanctions by a company.
The document will identif~ the sanctions instituted or being considered, e.g.,
corporate disinvestment, trade ben' academic boycott, arms embargo.
<con> Concept(s):
1. sanctions, international sanctions, economic sanctions
2. corporate exodus, corporate disinvestment, stock divestiture, ben on new
investment, trade ban, import ben on South African diamonds, U.N. arms
embargo, curtailment 0' delbrise contracts, cutoff 0' nonmUitary goods,
academic boycott, reduction 0' cultural ties
3. ap&theid, white domination, racism
4. an-theid, black m~Jority rule
5. Pretoria
~c> Factor(s):
<nat> Nationality: South Africa
<`lac>

<de~ D~flnition(s):
3. Connectionist Model Routing Experiments

Recent work suggests that significant improvements in
retrieval performance will require a technique that, in some
sense, "understands" the content of documents and queries
and can be used to infer probable relationships between
documents and queries [2]. In this view, information retrieval
is an inference or evidential reasoning process in which we
estimate the probability that a user's information need is met
given a document as "evidence". The techniques required to
support this kind of inference are similar to those used in
expert systems that must reason with uncertain information.
Several probabilistically~oriented inference network models
have been developed using experimental document collec-
tions [5] during the past few years for information retrieval
[15]. These models are generally characterreed by an
architecture with two layers corresponding to documents and
index terms. The documents and index terms are connected
by direct links. Initially, the prior probabilities of all root
nodes (nodes with no predecessors) and the conditional
probabilities of all non-root nodes ~iven all possible
combinations of their direct predecessors) must be specified.
Aretrievalconsists of one or more documents with the highest
posterior probability for the given set of index terms (evi-
dences) which represent a user's information need.
Over the last few years, the technique of automated
inference using probabilistic inference networks has become
popularwithin theM probability and uncertainty community,
particular in the context of expert systems [6,7]. The most

293

Figure 2. Topic 052.

important constraint on the use of a probabilistic network is
the fact that in general, the computation of the exact posterior
probabilities is NP-hard [1]. Thus it is unlikely that we could
develop an efficient general-purpose algorithm which would
work well for all kinds of inference networks. There are
several alternatives, such as the use of approximation algo-
rithms or heuristic algorithms, and creating special case
algorithms [9,10].
The experiments here concern an attempt at a heuristic
probabilistic inference network approach based on an Al
connectionist model. The connectionist model uses a com-
petitive activation rule to find the most probable retrievaL
The term competitive activation rule refers to a spreading
activation method in which nodes actively compete for
available activation in a network. An initial formulation of
a competitive activation mechanism was previously studied
on three tw~layer, abstract networks for diagnostic problem
solving [11,13]. The connectionist model proposed here
consists of a two-layer network architecture. Document
nodes and index term nodes corresponding to each layer are
connected by links whose weights represent association
strengths between nodes. These links are also viewed as
channels for sending information between nodes. Figure 4
is a simple network consisting of two document nodes and
three index term nodes. At each moment of time, each node

Topic 52

curtailment 001 201n.1 38n.2
individual 001 372a.0

87n.0

considered 001 611d.0

compliance 001 602n3

reduction 001 103n.1
308n.1

economics 002 692n3

corporate 003 43a.0
response 002 462n.0 586v~ 714n.0 821n.0
888n.2 990n3
relevant 001 23a.1 476a.2 ~.2

majority 001 102n.1 131n.0 33n.O

academic 002 514a.0 537a.0 54~.0

effects 001 7&)n.10 780n.5 798n.0

defense 001 717n.0 937n.2

country 002 181n.0 189n.1 189n.8 189n.9
266v.1 344n.0 371a.1

company 001 599n.14 712n.2 726n.8 72n.2
88n.1 892n.4

policy 001 626n.2 692n.2

aspect 001 183n.0 448n.3 7n.0

white 001 429a.1 430a.0 441n.5 996n.5

stock 001 lln.3 153n.4 166n.2 225n.13 25n.3
265a.0 31n.1 501n.2 613a.0 635n.0
636n.O 637v.O 637v.2 780n.10
798n.0 80()n.0 81 lv.0

001 780n.10 798n.0

001 349n2 421a.0 431a.0 431n.3 432n.1
752n.0 945a.2

rule 001 136n.1 200n.1 466n.2 613n.5 693v.1
697n.1 737n.1 737v.2 737v.3 74~.2
80n.0 82n.3 963n~1

arms 002 459d.9 71~.2 722n.0 727n.0 877n.3
894v.3

372n.2 549a.0 79a.0 87a.0

goods

black

743n.0 762n.0 772n.0

144n.0 195n.0 201n.1
813n.O 85n3

international

organization

opposition

investment

government

domination

pressure

identify

document

embargo

discuss

boycott

another

against

united

import

exodus

trade

south

being

wrn

new

ban

any

003 12a.0 89~.4

001 161n.0

001 179n.0
720n.0

001 225n.0

001 692n.3 737n.5

001 175n.0

001 157n.2 642n3 735n.1

001 13v.0

003 467n.3

001 265n.2

001 298v.0

002 297v.2

001 104a.1 15a.1 709v.0 714v.O

004 14a.0 17~.0 237d.0 276v.O 673d.O
704d.0 708d.0 70&1.1 708v.O 708v.2
713v.2 716v.1 716v.5 717a.0 71~.0
764v.0 898a.0 932v.8

001 46a.1 714a.0

001 228v.1 296v.0
642n.1 642v.0
001 293n.0 295v.1

002 625n.4 734v.1

006 278n.1

001 ln.O 3n.0 831n.0 ~6n.2

002 360v.1 60()n.0 6()0v.O 602d.0 604a.0
604n.O 737v.3 771n.11 784n.4

001 123a.0 146v.O 1~.0 614a.0 66n.0

004 761n.0 908n.0 98n.3

001 25a.0 Sln.0 609an.1

357n.1

71On.0

60n.O

71~.0

32~.0

237n.0

357n.0

708n.O

716n.2

693n.0

737n.1

175n.O 31%.0
464v.1 4&)av.5

551n2

761n.0

451v.0 460v.3 476v.0

809n3

737n.1

784n.0

699n3

300v.0 516n.0 516v.0

794n.1 794v.1

Figure 3: Word Frequency and Semantic Categories for Topic 052.

294

top I
pI=O.02

0.9 0.9 0.9

e
(army) (engineer) (plant)

Figure 4: A Simple Network Consisting of Two Document
Nodes and Three Index Term Nodes.

receives information about the activation levels of its
immediate neighboring nodes (nodes connected to it via
direct links), and then uses this information to calculate its
own activation leveL Through this process of spreading
activation, the network setUes down to equilibrium repre-
senting a retrieval to a user's information need.
The computation of the information retrieval inference
process is based on a lormaliration of the causal and proba-
bilistic associative knowledge underlying diagnostic prob-
lem-solving [18). We do not discuss the formulation
architecture and activation mechanism of the connectionist
model. This information can be found in [11,13,16,18). For
TREC-2, we managed to complete only one official routing
experiment for this approach, and itdid not involve semantics.
The experiment was intended to be a baseline experiment for
our semantic experiments.
For ThEC-2, a specific network was constructed for 50
topics. A list of index terms was assembled based on
keywords in the concept section of each topic. In this network,
each output node represented a topic, and each input node
represented akeyword. The prior probability assigned to each
topic node was equal to 1/(total number of topics). The
connection strengths were assigned equal weights (0.9).
The network contained 50 topic nodes and 848 index term
nodes. These nodes were connected via 1449 links. An
example of this network is shown in Figure 5, where p~ is the
prior probability of topic top~. The keywords "army",
"engineer", and "plant" were obtained by processing the
concept section of topic tO~~ Currently, the network is
enhanced by using an estimated weighting scheme.
We performed a Category B routing experiment. Using
just keywords, the results were not good. The main problem
was due to the fact that, in the document ranking, many
documents had the same score used to generate the ranking.
In order to satisfy the requirements for the ranking, we had
to artificially rank those documents with the same score. This
was done based on order of appearance. The performance
was terrible except for Topic 66. This topic had only two

295

Figure 5: A Sample Network of the Experimental ModeL

known relevant documents for Category B routing experi-
ments and our inference network retrieved one of them in the
top 20 documents! No further connectionist model
experiments have been completed. We were unable to modify
the baseline keyword experiment or perform semantic
experiments for this approach.
4. Vector Processing Model Experiments
In this section, we explain the manner in which semantics
is incorporated within a vector processing model using the
semantic lexicon explained in Section 2. Please note that an
entry in our semantic lexicon has the form of a word followed
by codes for each of the semantic categories theword triggers.
We explain our approach usingatext relevance determination
procedure intended to show what is being calculated rather
than show the actual computations for the approach. The
procedure presented here generates several outputs that are
really not necessary, but are included just to help explain the
approach. The relevance determination procedure is
explained using the four documents and query shown in
Figure 6. A few preliminary computations are reviewed in
order to explain the procedure.
First, the number of documents each word is in must be
determined. Figure 7 shows a list of words from the four
documents and the query of Figure 6 along with the number
of documents each word is in (dJ).
Next, the inverse document frequency (idi) of each word
is determined by the equation 1og10(NIdJ), where N -4, the
total number of documents. Figure 8 provides the idjof each
word. Sometimes, the kif of a word is undefined. This can
happen when a word does not occur in the documents but
does occur a query. For example, the words "depart" "do"
in
and "when" do not appear in the four documents. Thus, the
idf of these terms cannot be defined here. Later, we will see
that an adjustment can be made for these undefined values.
Next, the category probability of each query word is
determined. Figure 9 shows an alphabetized list of all the
unique words from the query, the frequency of each word in
the query, and the semantic categories each word triggers.

word idfof the word
Iog10~Nf
Document#1
L~comotives pull the trains. and .6
canopy .6
carry .6
Document #2 depart undefined
do undefined
People meet people under the canopy and within trains. freight .6
from .6
hourly .6
Document #3 leave .6
locomotives .6
Trains carry freight from the station. meet .6
noon .6
people .6
Document #4 pull .6
station .3
Trains leave the station hourly until noon. the 0
trains 0
under .6
Ouery until .6
when undefined
When do trains depart the station? within .6

Figure 6. Four Documents and a Query. Figure 8. The i~of Each Word.

word frequency category probability

depart 1 AMDR 1/4
word number of documents TA~ 1/8
the word is in (dJ)
do 1 AUSE 1/21
ATh{P 1/21
and 1
TCSE 1/21
canopy 1
TCNV 2/21
carry 1 ThEs 1/21
do 0 TSRC 1/21
depart 0
freight 1 station 1 APOS 3/16
from 1 AORD 1/8
hourly 1 TAMF 1/16
leave 1 TCND 1/8
locomotives 1 TDGR 1/16
meet 1 TSPL 3/16
noon 1
people 1 the 1
pull 1 trains 1 AORD 7/24
station 2 AMDR 1/12
the 4 AMFR 1/12
trains 4 TACM 1/24
under 1 TCNV 1/12
until 1
when 0 when 1 TAMT 1/3
within 1 TnM 2/3

Figure 7. Ust of Words in the Documents and Query. Figure 9. Words in the Query.

296

The semantic categories in our example are those shown
in Figure 1. For example, consider the word "depart" which
occurs one time in the query as shown in Figure 9. The
semantic lexicon entry for the word 11depart't using the
categories of Figure 1 is as follows:

depart: NONE NONE NONE NONE NONE AMDR
AMDR TA~

where NONE represents a word sense not included in the 36
semantic categories of Figure 1. If a uniform distribution is
assumed, then AMDR is triggered 1/4 of the time and TA~
is triggered 1/8 of the time. This is shown in Figure 9 as the
probabilities for each semantic category.

A similar category probability determination is done for
each document. Figure 10 is an alphabetized list of all the
unique words in Document #4 of Figure 6. The semantic
categories each word triggers along with probabilities are also
shown.

The text relevance determination procedure is shown in
Figure 11. The procedure uses three input lists:

a. List of words and the kif of each word, as shown in Figure
8.

b. List of words in the query and the semantic categories they
trigger along with the probability of triggering those
categories, as shown in Figure 9.

c. List of words in a document and the semantic categories
they trigger along with the probability of triggering those
categories, as shown in Figure 10.

The procedure operates as follows:

Step 1.

This step determines the common meanings between the
query and the document. Figure 12 corresponds to the output
of Step 1 for Document #4. In Step 1, a new list is created as
follows:

For each word in the query, follow either subsection (a) or
~), whichever applies:

a. For each category the word triggers, find each word in the
document that triggers the category and output three things:

1) The word in the query and its frequency of occurrence.
2) The word in the document and its frequency of
occurrence.
3) The category.

b. If the word does not trigger a category, then look for the
word in the document and if found, output two things and
a ~

1) The word in the query and its frequency of occurrence.
2) The word in the document and its frequency of
occurrence.

297

word frequency category probability

hourly 1 `rflM 1.0
leave 1 AMDR 1/7
TA~ 1/7
noon 1 AU)M 1/3
~flM 2/3
the 1
station 1 APOS 3/16
AORD 1/8
TA~[~ 1/16
TCNP 1/8
ThGR 1/16
J~PL 3/16
trains 1 AORD 7/24
AMDR 1/12
AMFR 1/12
TACM 1/24
TCNV 1/12
until 1 THM 1.0

Figure 10. Words in Document #4.

~tep 1 - Refer to Figure 12.
Determine common meaning
between query and the document.

~tep 2- Refer to Figure 13.
Adjust for words in the
query that are not in any
of the documents.

~tep 3 - Refer to Figure 14.
Calculate the weight of a
semantic component in the query
and calculate the weight of a
semantic component in the document.

~tep 4- Refer to Figure 15.
Multiply the weight in the query
by the weight in the document.

~tep 5 - Refer to Figure 15.
Sum all the individual products
of Step 4 into a single value which
is the semantic similarity coefficient.

Figure 11. Relevance Determination Procedure to Explain
Semantic Similarity.

FI~ Ust
Item First Entry Second Entry Third Entry
Number Word & Frequency Word & Frequency Category
in Query in Document #4

1 (depart,1) (
2 (de~rt,1) (
3 (depart,1) (
4 (depart,1) (
5 (do1l) (i
6 (station,1) (
7 (station,1)
8 (station,1)
9 (station,1) (((mm
10 (station,1) (
11 (station,1) (
12 (stntion,1) (
13 (station,1) (
14 (the,1)
15 (tmins,1)
16 (tmins,1) (]
17 (trains,1)
18 (trains,1) (i
19 (trains,1) (i
20 (trains,1) (i
21 (when,1) (I
22 (when,1) (]
23 (when,1) (]
24 (when,1) (I

Ienve,1) AMDR
~ns,1) AMDR
Icave,1) TAMr
station,1) TAMr
Irains,1) TCNV
itation,1) APOS
Btation,1) AORD
Imins,1) AORD
Ieave,1) TAM~
~tntion,1) TAMr
~tation,1) TCND
~tation,1) TDGR
stntion,1) TSPL
Ihe,1)
trains,1) AORD
Icave,1) AMDR
trains,1) AMDR
trains,1) AMFR
Lmins,1) TACM
trains1l) TCNV
Ienve,1) TAMr
hourly1l) TrIM
r'oon,1) TrIM
~ntiI,1) TrIM

Figure 12. Common Meaning.

Second Ust
Item First Entry Second Entry Third Entry
Number Word & Frequency Word & Frequency Category
in Query in Document #4

1(
2(
3(
4(
5
6
7
8
9
10
11
12 (
13 (
14 (
15 (
16 (
17 (
18 (
19 (
20 (
21 (
22 (
23 (
24 (

1eave,1) (Ieave,1) AMDR
,tmins,1) (tmins,1) AMDR
*ienve,1) (Ieave,1) TAMr
station,1) (station,1) TAMT
`tmins,1) (trains,1) TCNV
station,1) (station,1) APOS
`station,1) (stntion,1) AORD
`station,1) (tmins,1) AORD
stadon,1) (Ienve,1) TAMT
station,1) (stntion,1) TAMT
station,1) (station,1) TCND
station,1) (stndon,1) TDGR
stntion,1) (station,1) TSPL
the,1) (the,1)
trains,1) (trains,1) AORD
trnins,1) (Ieave1l) AMDR
tmins,1) (tmlns,1) AMDR
tmins,1) (tmins,1) AMFR
tmins,1) (tnins,1) TACM
tmins,1) (tmins,1) TCNV
Ienve,1) (`eave,1) TAM~
bourly,1) ("oudy,1) TrIM
i'oon,1) (noon,1) TrIM
`intil,1) (until,1) TrIM

Figure 13. Adjustment for Words with no idf

298

Considering Figure 12, the word "depart" occurs in the
query one time and triggers the category AMDR. The word
"leave" occurs in Document #4 once and also triggers the
category AMDR. Thus, item 1 in Figure 12 corresponds to
subsection (a) as described above. An example using sub-
section (0) occurs in item 14 of Figure 12.
Step 2.
This step adjusts for words in the query that are not in any
of the documents. Figure 13 shows the output of Step 2 for
Document #4. In this step, another list is created from the list
created in Step 1. For each item in the Step 1 list which has
a word with undefined idi; this step replaces the word in the
First Entry column by the word in the Second Entry column.
For example, the word "depart" has an undefined idfas shown
in Figure & Thus, the word "depart" in item 1 of Figure 12
should be replaced by the word "leave" from the Second Entry
column. This is shown in item 1 of Figure 13. Likewise, the
words "do" and "when" also have an undefined idf and are
respectively replaced by the words from the Second Entry
column.
Step 3.
This step calculates the weight of a semantic component
in the query and calculates the weight of a semantic compo-
nent in the document. Figure 14 shows the output of Step 3
for Document #4. In Step 3, another list is created from the
list created in Step 2 as follows:
For each item in the Step 2 list, follow either subsection (a)
or (0), whichever applies:
a. If the Third Entry specifies a category, then
1) Replace the First Entry by computing:
( i~of frequency ot\( probability the word~
word in word in ~ triggers the category
First Entry)~ First Entry )~ in the Third Entry )

2) Replace the Second Entry by computing:
( i~of frequency o~( probability the word~
word in ~ word in ~ triggers the category
Second Entry) ksecond Entry)~ lathe Third Entry )

3) Omit the Third Entry.
b. If the Third Entry does not specify a category, then
1) Replace the First Entry by computing:
( i~of irequencyo~
wordin ~ wordin
FirstEntry)~ FirstEntry)

2) Replace the Second Entry by computing:
( i~of frequency
word in ~ word in
Second Entry) kSecond Entry)

3) Omit the Third Entry.
In Figure 14, item 1 is an example of using subsection (a),
and item 14 is an example of using subsection (0).

299

Step 4.
This step multiplies the weights in the query by the weights
in the document. The top portion of Figure 15 shows the
output of Step 4. In the list created here, the numerical value
created in the First Entry column of Figure 14 is multiplied
by the numerical value created in the Second Entry column
of Figure 14.
Step S.
This step sums the values in the Step 4 list to compute the
semantic similarity coefficient fora particular document. The
bottom portion of Figure 15 shows the output of step 5 for
Document #4.
We have finally observed an improved Precision~Recall
performance using the semantic similarity coefficient
explained here. or example, in a Category B filtering
experiment where the words being considered were only those
in the topics and idf values were determined by the number
of topics a word is in, we have observed the keyword and
semantic results shown in Figure 16 and Figure 17, respec-
tively. The 11-pt average for these two experiments reveals
a 23% increase due to the use of semantic categories.
According to Sparck Jones' criteria, this change would be
classified as "significant" ~reater than 10.0%) [12]. We
believe further improvement is possible by considering more
words, stemming for plurals and tenses of words, better idf
values (like those used for archival retrieval), a modem
lexicon, and a focus on paragraphs instead of whole docu-
ments.

5. Summary
Our progress during ThEC-1 and ThEC-2 has been the
following:
a. We created efficient code for a UNIX platform. Originally
our code used B+ tree structures for implementing inverted
files on a DOS platform. We now use hashing to replace
B+trees, establishing codes to replace character strings;
and the UNIX platform provides faster processing than the
DOS platform.
b. We built an index forasemantic lexicon based on the public
domain 1911 version of Roget's Thesaurus. To do this,
we had to create our own category numbering system
similar to today's version of Roget's Thesaurus.
c. We solved part of the blend problem for semantic and
keyword weights. We now base semantic category weights
on the kifof words which generate the semantic categories.
We can now index or scan TREC documents at rates faster
than 60 Megabytes per hour depending on the workstation.
We have a semantic lexicon of approximately 20,000 words
with flexible category codes that allow a course (36 catego
ries) through fine (more than 15,000 categories) semantic
analysis. As shown in Section 4, our procedure for
determining relevance is based on the senses of each word.
For example, using the vector processing model and the
similarity coefficient

sim~,D~)- X Wqj~djj,
i-i

"liii" List

Item Number First Entry Second Entry

1 .6*1*1fl~.0857 .6*1*1fl~.0857
2 0*1 *1/12m0 0*1*1I12~0
3 .6*1*1fl~.0857 .6*1*lPm.0857
4 .3*1*1/16~.0188 *3*1*1/16~.0188
5 0*1*1,12~0 0*1 *1/12~0
6 .3*1 *3/16m.0563 .3*1*3116~.o563
7 .3*1*7/24m.0875 .3*1 *7/24~ .0875
8 *3*1*1/8~.0375 0*1 *7/24mO
9 .3*1*1116m.0188 .6*1*1fl~.0857
10 .3*1*1I16~.0188 .3*1*1/16~.0188
11 .3* 1*1/8~.0375 .3*1*1/8~.0375
12 .3*1*1/16~.0188 .3*1*1/16~.0188
13 .3*1*3116m.0563 .3*1*3/16m.0563
14 0*1~0 0*lmO
15 0*1*7~~0 0*1*7/24~0
16 0*1*1/12m0 .6*1*1!7~.0857
17 0*1*1I12~0 0*1*1/12~0
18 0*1 *1/12~0 0*1 *1/12~0
19 0*1 *1/24~0 0*1 * 1/24-0
20 0*1 *1/12m0 0*1*1I12~0
21 .6*1*1rl~.0857 .6*1*1!7~.0857
22 T
.6*1*1.0~.6000 .6*1*1.0~.6000
23 .6*1*~m.4000 .6*1 *213~ .4000
24 .6*1*1.0~.6000 .6*1*1.0~.6000

Figure 14. Weights of Semantic Components.

Fourth List

Item Number Value
1 .00734
2 0
3 .00734
4 .00035
5 0
6 .00317
7 .00734
8 0
9 .00170
10 .00035
11 .00141
12 .00035
13 .00317
14 0
15 0
16 0
17 0
18 0
19 0
20 0
21 .00734
22 36000
23 .16000
24 .36000

Sum of all values in Fourth List
0.91986

Figure 15. Multipiled Weights and Their Sum.

300

Queryid (Num): 47 of 50
Total number of documents over all queries
Retrieved: 36610
Relevant: 2064
Rel~ret: 913
Interpolated Recall - Precision Averages
at 0.00 0.3514
at 0.10 0.1968
at 0.20 0.1367
at 0.30 0.1082
at 0.40 0.0894
at 0.50 0.0752
at 0.60 0.0276
at 0.70 0.0105
at 0.80 0.0062
at 0.90 0.0013
at 1.00 0.0007
(non-interpolated) over all rel does
0.0746

Average precision

Queryid ~um): 47 of 50
Total number of documents over all queries
Retrieved: 36383
Relevant: 2064
Rel_ret: 956
Interpolated Recall
at 0.00
at 0.10
at 0.20
at 0.30
at 0.40
at 0.50
at 0.60
at 0.70
at 0.80
at 0.90
at 1.00
Average precision

- Precision Averages
0.3961
0.2479
0.1734
0.1258
0.1067
0.0838
0.0372
0.0195
0.0100
0.0029
0.0009
(non-interpolated) over all rel does
0.0919

Precision: Precision:
At 5does: 0.1660 At Sdocs: 0.2426
At lOdocs: 0.1532 At lOdoes: 0.2149
At 15does: 0.1433 At lSdoes: 0.1801
At 20does: 0.1298 At 20does: 0.1574
At 30does: 0.1057 At 30does: 0.1383
At 100does: 0.0643 At l00does: 0.0745
At 200 does: 0.0465 At 200does: 0.0522
At 500 does: 0.0302 At 500 does: 0.0320
At 1000does: 0.0194 At 1000 does: 0.0203
R-Precision ([`recision after R (= num_rel for a query) R-Precision ([`recision after R (= num~rel for a query)
does retrieved): does retrieved):
Exact: 0.1035 Exact: 0.1283

Figure 16. Fillering Using Keywords.

if the word "trains" is in the Query and the word "leaves "is
in the Document and we look at the semantic category Motion
with Reference to Direction (AMDR), then one of the vector
product elements in the formula becomes:
. p",abE.Iiy ~
Icavee" triggem AMDR~

where the probabilities are obtained from our semantic lexi-
con.

We plan to do more experiments incorporating the fol-
lowing improvements:

a. Modernize the semantic lexicon. Since our lexicon isbased
on the 1911 version of Roget's Thesaurus, many modem
words are not present and the senses of recorded words are
not accurate. We plan to correct this. For example, we
could try to get permission to use the current version of
Roget's Thesaurus.

b. Base similarity on paragraphs instead of whole documents.
We have had success using as few as 36 categories in a
paragraph environment. ~e also feel that relevance

301

Figure 17. Filtering Using Semantic Categories.

decisions are made by humans looking at roughly a
paragraph of information. We plan to modify our code to
use paragraphs as a basis for the similarity measure.

c. Experiment with the number of possible semantic cate-
gories and the probability assigned to a triggered category.
The experiment behind the performance improvement
shown in Figure 16 and Figure 17 uses a very fine number
of semantic categories and treats the triggered semantic
categories for a word uniformly. We plan to experiment
with a fewer number of categories, and we plan to obtain
a probability distribution for categories based on word
usage.

Basically, we are trying to establish a statistically sound
approach to using word sense information. Intuition is that
word sense information should improve retrieval perform-
ance. Furthermore, our approach to using word sense infor-
mation has shown a significant performance improvement in
a question/answer environment where paragraphs represent
documents. We feel that other word sense approaches, such
as query expansion or word sense disambiguation, may not
be statistically sound, and that may be why successful
experiments have not been reported.

References
[1] G. F. Cooper (1988). Probabilistic inference using belief
networks is np-hard. Technical Report KSL-87-27,
Stanford University, Stanford, CA.

[2] W. B. Croft (1987). Approaches to intelligent infor-
mation retrievaL Infonnation Processing & Manage-
ment, 23(4):95-1 10.

[3] C. Date (1990). An Introduction to Database Systems,
Vol.1, Addison Wesley.

[4] 3. Driscoll, L Lautenschlager and M. zhao (1993). The
QA system, Proc. of the First TextRetrieval Conference
(TREC-1), NIST Special Publication 500-207 (1). K.
Harman, editor).

[5] E. k Fox (1983) Characteriitation of two new exper-
imental collections in computer and information science
containing textual and bibliographic concepts.
Technicat Report 83-561, Department of Computer
Science, Cornell Univ., Ithaca, NY

[6] L N. Kanal and 3. F. Lemmer Eds (1986). Uncertainty
in Artificial Intelligence, North-Holland, Amsterdam.

[7] J. F. Lemmer and L N. Kanal Eds (1988). Uncertainty
in Artificial Jntelligence 2, North-Holland, Amsterdam.

[8] E. L Liddy and S.H. Myaeng (1993). DR-Link, Proc.
of the First Text Retrieval Conference (TREC-1), NIST
Special Publication 500-207(1). K. Harman, editor).

[9] R. E. Neapolitan (1990). Probabilistic Reasoning in
Expert System, Theory and Algorithms, A Wiley-
Interscience Publication, John Wiley & Sons, Inc.

[10] J. Pearl (1988). Probabilistic Reasoning in Intelligent
Systems: Networks of Plausible Inference. San Mateo,
CA: Morgan Kaufmann.

[11] Y. Peng andj. A. Reggia (1989). Aconnectionist model
for diagnostic problem solving. IEEE Transactions on
Systems, Man, and Cybernetics, 19(2):285-298.

(12] K. Sparck Jones and R. Bates (1977). Research on
Automatic Indexing 1974-1976, Technical Report,
Computer Laboratory, University of Cambridge.

[13] M. Tagamets, J. Wald, M. Farach and J. A. Reggia
(1989). Generating plausible diagnostic hypothesis with
self-proccssin~ causal networks. Journal of ~er-
imental Theories in Artif~cial Intelligence, 2:91-112.

[14] Roget's International Thesaurus (1977). Harper &
Row, New York, Fourth Edition.

[15] H. Turtle and W. B. Croft (1991). Evaluation of an
inference network-based retrieval model. ACM Trans-
action on Information Systems, 9(3):187-222.

(16] P. Van Learti and E. Arts (1987). SimulatedAnnealing:
Theory andApplications. Boston: D. ReideL

302

[17] D. Voss and J. Driscoll (1992). Text Retrieval Using a
Comprehensive Semantic Lexicon, Proceedings of
ISMM First International Conference on Information
and Rnowledge Management, Baltimore, Maryland.

[18] P. Wang, J. Reggia, D. Nao and Y. Peng (1985). A
formal model for diagnostic inference. Information
Sciences, 37:227-285.

[19] E. Wendlandt and J. Driscoll(1991). Incorporating a
Semantic Analysis into a Document Retrieval Strategy,
Proceedings of the Fourieenth Annual International
ACMISIGIR Conference on Research andDevelopment
in Information Retrieval, Chicago, Illinois, 270-279.

<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.0 Transitional//EN' 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'>
<html xmlns='http://www.w3.org/1999/xhtml'>
<head>
<title>pFad - Phonifier reborn</title>
<meta http-equiv='Content-Type' content='text/html; charset=utf-8' />
</head>
<body>
<h1>Pfad - The Proxy pFad of &#169; 2024 Garber Painting. All rights reserved.</h1>

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.
 
Alternative Proxies:<a href="http://rainy.clevelandohioweatherforecast.com/php-proxy/index.php?q=https://trec.nist.gov/pubs/trec2/papers/txt/29.txt" target="_blank">Alternative Proxy</a><a href="http://rainy.clevelandohioweatherforecast.com/pFad/index.php?u=https://trec.nist.gov/pubs/trec2/papers/txt/29.txt" target="_blank">pFad Proxy</a><a href="http://rainy.clevelandohioweatherforecast.com/pFad/v3index.php?u=https://trec.nist.gov/pubs/trec2/papers/txt/29.txt" target="_blank">pFad v3 Proxy</a><a href="http://rainy.clevelandohioweatherforecast.com/pFad/v4index.php?u=https://trec.nist.gov/pubs/trec2/papers/txt/29.txt" target="_blank">pFad v4 Proxy</a></body>
</html>