Eis 05 J
ADVIS Lab
Department of Computer Science
University of Illinois at Chicago, USA
{ifc | hxiao}@cs.uic.edu
Abstract
In this paper, we discuss the use of ontologies for data integra-
tion. We consider two different settings depending on the system
architecture: central and peer-to-peer data integration. Within those
settings, we discuss five different case studies that illustrate the use of
ontologies in metadata representation, in global conceptualization, in
high-level querying, in declarative mediation, and in mapping support.
Each case study is described in detail and accompanied by examples.
1 Introduction
1.1 Data Integration
Data integration provides the ability to manipulate data transparently across
multiple data sources. It is relevant to a number of applications including
enterprise information integration, medical information management, geo-
graphical information systems, and E-Commerce applications. Based on the
architecture, there are two different kinds of systems: central data integra-
tion systems [1, 3, 7, 10, 18, 22] and peer-to-peer data integration systems
[2, 4, 5, 11, 16, 19]. A central data integration system usually has a global
schema, which provides the user with a uniform interface to access informa-
tion stored in the data sources. In contrast, in a peer-to-peer data integration
system, there are no global points of control on the data sources (or peers).
Instead, any peer can accept user queries for the information distributed in
the whole system.
The two most important approaches for building a data integration sys-
tem are Global-as-View (GaV) and Local-as-View (LaV) [22, 17]. In the
GaV approach, every entity in the global schema is associated with a view
over the local source schemas. Querying strategies are therefore simple, but
the evolution of the local source schemas is not easily supported. In
contrast, the LaV approach permits changes to source schemas without
affecting the global schema, since the local schemas are defined as views over
the global schema, but query processing can be complex.
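The GaV idea can be illustrated with a minimal sketch. The source relations, their rows, and the global entity Publication(title, author) below are hypothetical and chosen only for illustration; the point is that a query on the global schema is answered simply by unfolding the view definition.

```python
# Hypothetical local source relations; names and rows are illustrative only.
SOURCE_1_BOOKS = [
    {"booktitle": "b1", "name": "a1"},
    {"booktitle": "b2", "name": "a2"},
]
SOURCE_2_ARTICLES = [
    {"title": "b2", "fullname": "a3"},
]


def publication_view():
    """GaV: the global entity Publication(title, author) is defined as a
    view (here, a union of projections) over the local source relations."""
    for row in SOURCE_1_BOOKS:
        yield {"title": row["booktitle"], "author": row["name"]}
    for row in SOURCE_2_ARTICLES:
        yield {"title": row["title"], "author": row["fullname"]}


def authors_of(title):
    """A query on the global schema is answered by unfolding the view."""
    return sorted(p["author"] for p in publication_view() if p["title"] == title)
```

In this sketch, renaming an attribute in a source forces a change to publication_view, which is the evolution problem noted above. Under LaV the sources would instead be described as views over the global schema, and answering a query would require rewriting it in terms of those views.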
common use of ontologies is data standardization and conceptualization via a
formal machine-understandable ontology language. For example, the global
schema in a data integration system may be an ontology, which then acts as a
mediator for reconciling the heterogeneities between different sources. As
an example of the use of ontologies in peer-to-peer data integration, we can
produce for each source schema a local ontology, which is made accessible
to other peers so as to support semantic mappings between different local
ontologies.
2 Ontologies
An ontology is a formal, explicit specification of a shared conceptualization
[13]. In this definition, “conceptualization” refers to an abstract model of
some domain knowledge in the world that identifies that domain’s relevant
concepts. “Shared” indicates that an ontology captures consensual knowl-
edge, that is, it is accepted by a group. “Explicit” means that the type of
concepts in an ontology and the constraints on these concepts are explic-
itly defined. Finally, “formal” means that the ontology should be machine
understandable.
Typical “real-world” ontologies include taxonomies on the Web (e.g., Ya-
hoo! categories), catalogs for on-line shopping (e.g., Amazon.com’s product
catalog), and domain-specific standard terminology (e.g., UMLS
(http://www.nlm.nih.gov/research/umls/) and Gene Ontology). As an online
lexicon database, WordNet is widely used for
discovery of semantic relationships between concepts.
Existing ontology languages include:
XML Schema. Strictly speaking, XML Schema is a semantic markup language
for Web data. The database-compatible data types supported
by XML Schema provide a way to specify a hierarchical model. However,
there are no explicit constructs for defining classes and properties
in XML Schema, so ambiguities may arise when mapping an
XML-based data model to a semantic model.
Other ontology languages include SHOE (Simple HTML Ontology Extensions),
XOL (Ontology Exchange Language), and UML (Unified Modeling Language).
Among all these ontology languages, we are most interested in XML
Schema and RDFS for their particular roles in data integration and the
“Semantic Web” [12]. More specifically, XML Schema and RDFS use the
same syntax and can both be used for data modeling and ontology representation.
They differ, however, in that XML data has a document structure, given by
the nesting of elements in an individual XML document, whereas RDF data
has a domain structure, formed by the concepts and the relationships
between them [11, 16]. We shall discuss this issue in
detail in Section 4.
existing mappings. Our layered framework [9] is an example of this
approach.
The single and hybrid approaches are appropriate for building central data
integration systems, the former being more appropriate for GaV systems
and the latter for LaV systems. A hybrid peer-to-peer system, where a
global ontology exists in a “super-peer”, can also use the hybrid ontology
approach [11]. The multiple ontology approach is best suited to constructing
pure peer-to-peer data integration systems, where there are no super-peers.
We identify the following five uses of ontologies in data integration:
4 Central Data Integration
In this section, we will describe three case studies of ontologies in the context
of central data integration. To make the issues concrete, we use a running
example involving two XML sources and demonstrate how to enable semantic
interoperation between them.
Example 1. Figure 1 displays two XML schemas (S1 and S2) and their
respective documents (D1 and D2), which are represented as trees. The two
XML documents conform to different schemas but represent data with similar
semantics. In particular, both schemas represent a many-to-many relationship
between two concepts: book and author in S1 (equivalently denoted by
article and writer in S2). However, structurally speaking, they are
different: S1 (book-centric schema) has the author element nested under the
book element, whereas S2 (author-centric schema) has the article element
nested under the writer element.
Semantically equivalent data elements, such as the authors of publication
“b2”, can be reached using different XML path patterns, respectively for
schema S1 and schema S2:
/books/book[@booktitle="b2"]/author/@name
and
/writers/writer[article/@title="b2"]/@fullname
where the contents in the square brackets specify the constraints for the search
patterns.
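To make the two path patterns concrete, the following sketch evaluates equivalent lookups with Python's standard xml.etree.ElementTree module. The two sample documents are assumptions, loosely modeled on the book-centric and author-centric schemas, and are not the actual contents of D1 and D2. Because ElementTree supports only a subset of XPath, the second pattern is expressed as an explicit filter rather than as a nested predicate.

```python
import xml.etree.ElementTree as ET

# Assumed sample documents (not the actual D1 and D2 of Figure 1).
BOOKS_XML = """<books>
  <book booktitle="b1"><author name="a1"/></book>
  <book booktitle="b2"><author name="a2"/><author name="a3"/></book>
</books>"""

WRITERS_XML = """<writers>
  <writer fullname="a1"><article title="b1"/></writer>
  <writer fullname="a2"><article title="b2"/></writer>
  <writer fullname="a3"><article title="b2"/></writer>
</writers>"""


def authors_book_centric(xml_text, title):
    """Mirrors /books/book[@booktitle=title]/author/@name."""
    root = ET.fromstring(xml_text)  # root is the <books> element
    found = root.findall(f"./book[@booktitle='{title}']/author")
    return [author.get("name") for author in found]


def authors_author_centric(xml_text, title):
    """Mirrors /writers/writer[article/@title=title]/@fullname."""
    root = ET.fromstring(xml_text)  # root is the <writers> element
    return [
        writer.get("fullname")
        for writer in root.findall("./writer")
        if any(a.get("title") == title for a in writer.findall("./article"))
    ]
```

Both functions return the same authors for "b2" on these sample documents, even though the navigation through each tree is different.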
[Figure 1: XML schema S1 and XML document D1 ("books.xml"), and XML schema S2 and XML document D2 ("writers.xml"), represented as trees.]
The example demonstrates that multiple XML schemas (or structures)
can exist for a single conceptual model. In comparison, the schema or on-
tology languages (e.g., RDFS, DAML+OIL, and OWL) that operate on
the conceptual level are structurally flat so that the user can formulate a
query from a conceptual perspective without considering the structure of the
source [1, 7, 23, 10].
Figure 2 shows the architecture of a system that interoperates among
[Figure 2: system architecture. Recoverable labels: RDF-based global ontology, mapping table, query translator, query in data-integration direction.]
Relational Schema. Relations are converted into RDF classes and attributes
into RDF properties, which are attached to the class corresponding to
the relation to which the attributes belong. Foreign key dependencies
between two relations are represented by two properties (corresponding
to the two relations) sharing the same value in the target local ontology.
XML Schema. Complex-type elements are converted into RDF classes and
simple-type elements and attributes are converted into RDF properties.
This transformation process encodes the mapping information between
each concept in the local RDF ontology and the path to the corre-
sponding element in the XML source. Nesting relationships between
XML elements are represented using a meta-property rdfx:contains; rdfx
stands for the namespace where contains is defined. This meta-property
enables the RDF representation of the XML nesting structure, by con-
necting two RDF classes representing the two nesting XML elements.
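The XML Schema conversion rules above can be sketched as follows. This is a simplified illustration under stated assumptions, not the transformation used by the system: the Element class is a hypothetical stand-in for a parsed schema, an element counts as complex-type when it has attributes or children, class names are obtained by capitalizing element names, and rdfx:contains is assumed to point from the enclosing class to the nested one.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical, minimal stand-in for an XML Schema element declaration."""
    name: str
    attributes: list = field(default_factory=list)  # attribute names
    children: list = field(default_factory=list)    # nested Element objects


def to_local_ontology(root):
    """Return (triples, paths): RDF-style triples for the local ontology and
    a mapping from each concept to the path of its XML counterpart."""
    triples, paths = [], {}

    def add_property(name, domain, path):
        triples.append((name, "rdf:type", "rdf:Property"))
        if domain is not None:
            triples.append((name, "rdfs:domain", domain))
        paths[name] = path

    def visit(elem, parent_class, parent_path):
        path = f"{parent_path}/{elem.name}"
        if not (elem.attributes or elem.children):
            # Simple-type element: becomes a property of the enclosing class.
            add_property(elem.name, parent_class, path)
            return
        # Complex-type element: becomes a class.
        cls = elem.name.capitalize()
        triples.append((cls, "rdf:type", "rdfs:Class"))
        paths[cls] = path
        if parent_class is not None:
            # Nesting is recorded with the meta-property rdfx:contains.
            triples.append((parent_class, "rdfx:contains", cls))
        for attr in elem.attributes:
            add_property(attr, cls, f"{path}/@{attr}")
        for child in elem.children:
            visit(child, cls, path)

    visit(root, None, "")
    return triples, paths


# Book-centric schema in the spirit of S1: books > book(@booktitle) > author(@name).
S1 = Element("books", children=[
    Element("book", attributes=["booktitle"], children=[
        Element("author", attributes=["name"]),
    ]),
])
TRIPLES, PATHS = to_local_ontology(S1)
```

The paths dictionary plays the role of the mapping information mentioned above: each concept of the local ontology is associated with the path to the corresponding element or attribute in the XML source.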
• Copying a class and/or its properties: classes and properties that do
not exist in the target ontology are copied into it.
[Figure: the global RDF ontology and its correspondences. Recoverable labels: classes Publications, Person, Books, Book, Author, Authors; relations rdfs:subClassOf and rdfx:contains.]
subquery for each source. The subqueries over sources are subject to the
structure of source schemas, and may be expressed in a different language
from that of the high-level query. An inference mechanism may be needed in
the query rewriting, for example, when a concept involved in the query has
super-concepts or sub-concepts.
In addition to handling high-level queries on the global ontology, a bidi-
rectional query translation algorithm is also supported [10] (see Figure 2).
In this case, we can translate a query posed against an XML source to an
equivalent query against any other XML source.
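The role of sub-concepts in query rewriting can be sketched as follows. The dictionaries are assumed, hand-written stand-ins for the global ontology and the mapping table, using the concept names that appear in Example 4; the source names are hypothetical, and the sketch identifies only which local concept each subquery must target, not the subquery text itself.

```python
# Assumed fragment of the global ontology: concept -> direct sub-concepts.
SUB_CONCEPTS = {"Person": ["Author"]}

# Assumed mapping table: global concept -> {source name: local concept}.
MAPPING = {"Author": {"source1": "Author", "source2": "Writer"}}


def expand(concept):
    """Return the concept followed by all of its transitive sub-concepts."""
    result = [concept]
    for sub in SUB_CONCEPTS.get(concept, []):
        result.extend(expand(sub))
    return result


def rewrite(concept):
    """Map a global concept to the local concepts to query in each source."""
    subqueries = {}
    for c in expand(concept):
        for source, local in MAPPING.get(c, {}).items():
            subqueries.setdefault(source, []).append(local)
    return subqueries
```

Here a query on Person has no direct mapping, so the inference step over sub-concepts is what makes the two subqueries (on Author and on Writer) reachable.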
Example 4. Suppose the user asks the query “Find the persons who have
written publication b2.” This query will be expressed in an RDF query
language such as RDQL (http://www.hpl.hp.com/semweb/rdql.htm). First,
Person has sub-concept Author, which corresponds
to two different concepts (Author and Writer) in two different RDF
local databases. Therefore the initial query will be rewritten as two sub-queries
to those databases. In turn, those queries may be further rewritten using an
XML query language incorporating the path expressions of Example 1 (unless
the data was materialized under the RDF local ontologies). Using the
bidirectional query translation mechanism, a query involving the concepts Book
and Author in one source will be translated into a query involving Article and
[Figure: hybrid peer-to-peer architecture. Recoverable labels: peer 1, peer i, peer n, super peer, global RDF ontology, XML schema, XML to local RDF wrapper, local RDF schema, mapping table; queries Q1 and Q2 and rewritten queries Q11', Q1i', Q1n', Q2i', Q2n'; legend: query processing in data-integration fashion, query processing in hybrid P2P fashion, mapping process.]
Q1: List all publications
answer. Meanwhile, the source query is rewritten into a target query over
every connected peer. The query rewriting utilizes the global ontology and
the composition of mappings from the original peer to the super-peer with
mappings from the super-peer to the target peers. By executing the target
query, each peer returns an answer to the original peer, called the remote
answer. The local and remote answers are integrated and returned to the
user at the site of the originating peer.
Example 5. Consider two XML sources, one in peer p1 and the other in
peer p2, and a global ontology expressed in RDF in a super-peer. As shown
in Figure 6, the global ontology consists of a class Publication and two
subclasses Paper and Book. The Publication class is mapped to the publication
element of the XML source in p1, while the class Book corresponds to book
of the XML source in p2. An XML query Q1 on p1 involving publication will
be rewritten to a target query Q2 on p2 involving book. The XML
fragments inside the dashed-line boxes are integrated and returned as answers.
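The composition of mappings in Example 5 can be sketched as below. The three dictionaries are hypothetical encodings of the mappings and of the subclass relation described in the example, and the sketch considers direct subclasses only; it determines which element of the target peer a source element is rewritten to, not the full target query.

```python
# Assumed global ontology in the super-peer: class -> direct subclasses.
SUBCLASSES = {"Publication": ["Paper", "Book"]}

# Assumed mappings: peer element -> global class, and global class -> peer element.
PEER_TO_GLOBAL = {"p1": {"publication": "Publication"}}
GLOBAL_TO_PEER = {"p2": {"Book": "book"}}


def rewrite_element(source_peer, element, target_peer):
    """Compose the source-to-global mapping with the global-to-target mapping,
    also considering direct subclasses of the global class."""
    global_class = PEER_TO_GLOBAL[source_peer][element]
    candidates = [global_class] + SUBCLASSES.get(global_class, [])
    targets = GLOBAL_TO_PEER.get(target_peer, {})
    return [targets[c] for c in candidates if c in targets]
```

With these assumed mappings, publication in p1 is rewritten to book in p2 because Book is a subclass of Publication in the global ontology.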
A thesaurus-based schema matching approach has been devised for peer-
to-peer data integration [24]; this approach consists of the following three
steps (as illustrated in Figure 7):
1. Path Exploration. Among the semantic relations between synsets
in WordNet, we choose those of synonymy, hyponymy/hypernymy (i.e., more
specific/more general), and related-to, when enumerating the paths between
two arbitrary concepts from different local ontologies in peers. As shown in
Figure 7, six paths are found from Quantity to Number.
2. Path Selection. When multiple paths are found between two concepts,
we choose the optimal path, which corresponds to the most likely semantic
relation between the two concepts. For this purpose, semantic similarities
(i.e., the number above each path in the figure) are calculated for all
the paths. The calculation assigns a different weight to each kind of semantic
relation (e.g., 1 for synonymy and 0.8 for hypernymy) and then takes the
average of the weights along the path. The path with the highest similarity
is then chosen as the optimal path. If there is more than one such
path, then the user’s intervention is needed.
3. Semantic Derivation. The last step is to derive the (direct) semantic
relationship, Sem, between the two concepts by reasoning on the semantic
relations along the optimal path p between them. More specifically,
$Sem(p) = Sem(p_n)$ is computed based on the following recursive algorithm,
where $p_n = (r_1, r_2, \ldots, r_n)$, and $r_i$ ($1 \le i \le n$) are the edges
(semantic relations) along p.
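The path selection step (step 2) can be sketched as follows, using the weights given in the text and in the legend of Figure 7 (1 for synonymy, 0.8 for hypernymy and hyponymy, 0.5 for related-to). The representation of a path as a tuple of relation labels and the sample paths are assumptions made for illustration; they are not the six paths of the figure.

```python
# Weights per semantic relation, as listed in the text and the figure legend.
WEIGHTS = {"SYN": 1.0, "HYPER": 0.8, "HYPO": 0.8, "REL": 0.5}


def similarity(path):
    """Average of the weights of the semantic relations along a non-empty path."""
    return sum(WEIGHTS[relation] for relation in path) / len(path)


def select_optimal(paths):
    """Return the unique path with the highest similarity, or None when
    several paths tie and the user's intervention is needed."""
    best = max(similarity(p) for p in paths)
    winners = [p for p in paths if abs(similarity(p) - best) < 1e-9]
    return winners[0] if len(winners) == 1 else None


# Hypothetical candidate paths between two concepts.
CANDIDATES = [("SYN", "HYPO"), ("SYN", "SYN"), ("HYPO", "HYPO"), ("HYPER", "REL")]
```

Returning None on a tie is a design choice of this sketch; it marks the case in which the text calls for the user's intervention.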
[Figure 7: the three steps of thesaurus-based matching (1. Path Exploration, 2. Path Selection, 3. Semantic Derivation), illustrated on paths from Quantity to Number in WordNet. Legend of relation weights: SYN (Synonym): 1; HYPER (Hypernym): 0.8; HYPO (Hyponym): 0.8; REL (Related-to): 0.5. Recoverable intermediate concepts: Amount, Total, Definite Quantity, Product, Constant, Sum.]
for the semantic relation of synonymy, hypernymy, hyponymy, and related-to.
The operation ∧ obeys the rules that are shown in Table 1.
 ∧ | ≈   ⊇   ⊆   ∼
---+---------------
 ≈ | ≈   ⊇   ⊆   ∼
 ⊇ | ⊇   ⊇   ?   ∼
 ⊆ | ⊆   ?   ⊆   ∼
 ∼ | ∼   ∼   ∼   ∼
Table 1: Inference rules for semantic relations: a white cell (at the intersec-
tion of each pair of grey cells) contains the result of the operation on the
relations in the two grey cells, and a question mark indicates that human
intervention is needed.
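The inference rules of Table 1 can be applied along a path as sketched below. Two assumptions are made here: the symbols ≈, ⊇, ⊆, and ∼ are taken to stand for synonymy, hypernymy, hyponymy, and related-to in the order in which the relations are listed, and the recursion is read as Sem(p_1) = r_1 and Sem(p_i) = Sem(p_{i-1}) ∧ r_i. The sketch also propagates "?" once it appears, since Table 1 does not define ∧ on an undetermined operand.

```python
from functools import reduce

SYN, HYPER, HYPO, REL = "≈", "⊇", "⊆", "∼"
UNKNOWN = "?"  # human intervention is needed

SYMBOLS = [SYN, HYPER, HYPO, REL]
# Rows and columns follow the order of Table 1.
ROWS = [
    [SYN,   HYPER,   HYPO,    REL],
    [HYPER, HYPER,   UNKNOWN, REL],
    [HYPO,  UNKNOWN, HYPO,    REL],
    [REL,   REL,     REL,     REL],
]
TABLE = {
    (a, b): ROWS[i][j]
    for i, a in enumerate(SYMBOLS)
    for j, b in enumerate(SYMBOLS)
}


def combine(left, right):
    """The operation ∧ of Table 1; an undetermined operand stays undetermined."""
    if UNKNOWN in (left, right):
        return UNKNOWN
    return TABLE[(left, right)]


def sem(path):
    """Fold ∧ over the relations (r_1, ..., r_n) of a non-empty path."""
    return reduce(combine, path)
```

For instance, a synonymy edge followed by a hyponymy edge yields ⊆, whereas a hypernymy edge followed by a hyponymy edge yields "?".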
6 Conclusions
The advent of XML has created a syntactic platform for Web data stan-
dardization and exchange. However, XML has several problems. First of
all, documents expressed in XML share the same syntax, but can be other-
wise heterogeneous, for example by having different structures and naming
conventions. Also, an XML document does not express the semantics of the
elements or of the relationships among elements explicitly. Therefore, it is
not a suitable language for metadata representation.
Ontologies provide an explicit and formal specification of a shared concep-
tualization, and are able to facilitate knowledge sharing and reuse. We use
ontologies expressed in RDFS, a semantically rich schema language, to bridge
across syntactic, schematic, and semantic heterogeneities in data sources.
In this paper, we have presented five different case studies that illustrate
the role that ontologies play in the process of data integration, in centralized
and peer-to-peer architectures.
Related research includes research on ontology generation, ontology map-
ping, and ontology evolution. An ontology can be generated manually using
an authoring tool or (semi-)automatically from various knowledge sources
(e.g., database schemas). Techniques used for ontology mapping, including
ontology alignment and ontology merging [20, 8], overlap to a large extent
with those techniques for schema matching [21]. Finally, ontology evolution,
also called ontology versioning, involves changes to the representation, structure,
and semantics of ontologies. Each step of such an evolution must ensure the
consistency between the old version and the improved version of the ontology,
just as a database schema’s evolution must guarantee the consistency
of the new schema with the data.
References
[1] B. Amann, C. Beeri, I. Fundulaki, and M. Scholl. Ontology-Based Inte-
gration of XML Web Resources. In Proceedings of the 1st International
Semantic Web Conference (ISWC 2002), pages 117–131, 2002.
[6] Y. A. Bishr. Overcoming the semantic and other barriers to GIS inter-
operability. International Journal of Geographical Information Science,
12(4):299–314, 1998.
[9] I. F. Cruz and H. Xiao. Using a Layered Approach for Interoperability
on the Semantic Web. In Proceedings of the 4th International Conference
on Web Information Systems Engineering (WISE 2003), pages 221–232,
Rome, Italy, December 2003.
[18] E. Mena, V. Kashyap, A. P. Sheth, and A. Illarramendi. OBSERVER:
An Approach for Query Processing in Global Information Systems based
on Interoperation across Pre-existing Ontologies. In Proceedings of the
1st IFCIS International Conference on Cooperative Information Systems
(CoopIS 1996), pages 14–25, 1996.
[20] N. F. Noy and M. A. Musen. PROMPT: Algorithm and Tool for Au-
tomated Ontology Merging and Alignment. In Proceedings of the 17th
National Conference on Artificial Intelligence and 12th Conference on
Innovative Applications of Artificial Intelligence (AAAI/IAAI 2000),
pages 450–455, 2000.
[25] H. Xiao, I. F. Cruz, and F. Hsu. Semantic Mappings for the Integration
of XML and RDF Sources. In Workshop on Information Integration on
the Web (IIWeb 2004), August 2004.