
Data Fusion – Resolving Data Conflicts for Integration

Tutorial proposal, intended length 1.5 hours

Xin (Luna) Dong
AT&T Labs Inc.
Florham Park, NJ, USA
lunadong@research.att.com

Felix Naumann
Hasso Plattner Institute (HPI)
Potsdam, Germany
naumann@hpi.uni-potsdam.de

1. MOTIVATION

The amount of information produced in the world increases by 30% every year and this rate will only go up. With advanced network technology, more and more sources are available either over the Internet or in enterprise intranets. Modern data management applications, such as setting up Web portals, managing enterprise data, managing community data, and sharing scientific data, often require integrating available data sources and providing a uniform interface for users to access data from different sources; such requirements have been driving fruitful research on data integration over the last two decades [13, 15].

Data integration systems face two kinds of challenges. First, data from disparate sources are often heterogeneous. Heterogeneity can exist at the schema level, where different data sources often describe the same domain using different schemas; it can also exist at the instance level, where different sources can represent the same real-world entity in different ways. There has been a rich body of work on resolving heterogeneity in data, including, at the schema level, schema mapping and matching [17], model management [1], answering queries using views [14], and data exchange [10], and, at the instance level, record linkage (a.k.a. entity resolution, object matching, reference linkage, etc.) [9, 18] and string similarity comparison [6].

Second, different sources can provide conflicting data. Conflicts can arise because of incomplete data, erroneous data, and out-of-date data. Returning incorrect data in a query result can be misleading and even harmful: one may contact a person by an out-of-date phone number, visit a clinic at a wrong address, carry wrong knowledge of the real world, and even make poor business decisions. It is thus critical for data integration systems to resolve conflicts from various sources and identify true values from false ones. This problem becomes especially prominent with the ease of publishing and spreading false information on the Web and has recently received increasing attention.

This tutorial focuses on data fusion, which addresses the second challenge by fusing records on the same real-world entity into a single record and resolving possible conflicts from different data sources. Data fusion plays an important role in data integration systems: it detects and removes dirty data and increases correctness of the integrated data.

Objectives and Coverage. The main objective of the proposed tutorial is to gather models, techniques, and systems of the wide but as yet unconsolidated field of data fusion and present them in a concise and consolidated manner. In the 1.5-hour tutorial we will provide an overview of the causes and challenges of data fusion. We will cover a wide set of both simple and advanced techniques to resolve data conflicts in different types of settings and systems. Finally, we provide a classification of existing information management systems with respect to their ability to perform data fusion.

Intended audience. Data fusion touches many aspects of the very basics of data integration. Thus, we expect the tutorial to appeal to a large portion of the VLDB community:

• Researchers in the fields of data integration, data cleansing, data consolidation, data extraction, data mining, and Web information management.

• Practitioners developing and distributing products in the data integration, data cleansing, ETL & data warehousing, and master data management areas.

We expect that attendees will take home from this seminar (i) an understanding of the causes and challenges of conflicting data along with different application scenarios, (ii) knowledge about concrete methods to resolve data conflicts both within relational DBMS and through dedicated applications, and (iii) an overview of existing tools and systems to perform data fusion.

Assumed background. Apart from a basic understanding of database technology and data integration, there are no prerequisites for this tutorial.

The proposed tutorial is based on a recent survey on data fusion [4] and various techniques proposed for truth discovery (including, but not limited to, [2, 7, 8, 19, 21]). We acknowledge the great contributions of authors of relevant papers.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM.
VLDB ’09, August 24-28, 2009, Lyon, France
Copyright 2009 VLDB Endowment, ACM 000-0-00000-000-0/00/00.

2. TUTORIAL OUTLINE

Our tutorial starts with an overview of the importance of data fusion in data integration and possible reasons for data conflicts. We then present a classification of existing data fusion techniques and introduce relational operations for conflict resolution. After that, we describe several advanced
techniques for finding the best (true) values in the presence of data conflicts. We end our tutorial by surveying data fusion techniques in existing data integration systems and suggesting future research directions.

2.1 Overview

Data integration has three broad goals: increasing the completeness, conciseness, and correctness of data that is available to users and applications. Completeness measures the amount of data, in terms of both the number of tuples and the number of attributes. Conciseness measures the uniqueness of object representations in the integrated data, in terms of both the number of unique objects and the number of unique attributes of the objects. Finally, correctness measures whether the data conform to the real world.

Whereas high completeness can be obtained by adding more data sources to the system, achieving the other two goals is non-trivial. To meet these requirements, a data integration system needs to perform three levels of tasks (Fig. 1):

1. Schema mapping: First, a data integration system needs to resolve heterogeneity at the schema level by establishing semantic mappings between contents of disparate data sources.

2. Duplicate detection: Second, a data integration system needs to resolve heterogeneity at the instance level by detecting records that refer to the same real-world entity.

3. Data fusion: Third, a data integration system needs to combine records that refer to the same real-world entity by fusing them into a single representation and resolving possible conflicts from different data sources.

Figure 1: Three tasks in data integration (from [16]). Layers, top to bottom: Application; Phase 3: Data Fusion; Phase 2: Duplicate Detection; Phase 1: Schema Mapping; Data sources.

Among these three tasks, schema mapping and duplicate detection (record linkage) aim at removing redundancy and increasing conciseness of the data. Data fusion, which is the focus of this tutorial, aims at resolving conflicts in the data and increasing correctness of the data.

We distinguish two kinds of data conflicts: uncertainty and contradiction. Uncertainty is a conflict between a non-null value and one or more null values that are all used to describe the same property of a real-world entity. Uncertainty is caused by missing information, such as null values in a source or a completely missing attribute in a source. Contradiction is a conflict between two or more different non-null values that are all used to describe the same property of the same entity. Contradiction is caused by different sources providing different values for the same attribute of a real-world entity.

Table 1: Motivating example: five data sources provide information on the affiliations of five researchers. Only S1 provides complete and correct information.

              S1      S2        S3     S4     S5
Stonebraker   MIT     Berkeley  MIT    MIT    null
Dewitt        MSR     MSR       UWisc  UWisc  null
Bernstein     MSR     MSR       MSR    MSR    null
Carey         UCI     AT&T      BEA    BEA    BEA
Halevy        Google  Google    UW     UW     UW

Example 2.1. Consider the five data sources in Table 1. There exists uncertainty on the affiliations of Stonebraker, Dewitt, and Bernstein because of the null values provided by source S5, and contradiction on the affiliations of Stonebraker, Dewitt, Carey, and Halevy. □

There are two key issues in data fusion. First, how to find the best values among conflicting values? Second, how to do so efficiently? We next survey existing work on solving these problems.

2.2 Conflict resolution and data merging

Data conflicts, in the form of uncertainties or contradictions, can be resolved in numerous ways. After introducing a broad classification we turn to the relational algebra, which already provides several possibilities. More elaborate data integration systems and their fusion capabilities are analyzed in Section 2.4.

Conflict resolution strategies. There are many different data integration and fusion systems, each with its own solution. Fig. 2 classifies existing strategies to approach data conflicts and Table 2 lists some of the strategies and their classification. In particular, conflict ignoring strategies are not aware of conflicts, perform no resolution, and thus may produce inconsistent results. Conflict avoiding strategies are aware of conflicts but do not perform individual resolution for each conflict. Rather, a single decision is made, e.g., preference of a source, and applied to all conflicts. Finally, conflict resolving strategies provide the means for individual fusion decisions for each conflict.

Such decisions can be instance-based, i.e., they regard the actual conflicting data values, or they can be metadata-based, i.e., they choose values based on metadata, such as freshness of data or the reliability of a source. Finally, strategies can be classified by the result they are able to produce: deciding strategies choose a preferred value among the existing values, while mediating strategies can produce an entirely new value, such as the average of a set of conflicting numbers.

Relational operations. Both join and union (and their relatives) perform data fusion of sorts. Joining two tables enlarges the schema of the original individual relations and thus appends previously unknown values to tuples. Outer-join variants avoid the loss of tuples without a join partner. Full disjunction combines two or more input relations by combining all matching tuples into a single result tuple [11].
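To make the join-based view of fusion concrete, the following is a minimal sketch of a full outer join over two small relations, written in plain Python rather than in any particular DBMS. The relation contents, the function name `full_outer_join`, and the city values are assumptions made for illustration only (the city for Halevy in particular is invented); the sketch also assumes the join key is the only attribute the two relations share.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def full_outer_join(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Join two relations on `key`, keeping tuples that have no join partner."""
    # The result schema is the union of both input schemas (the enlarged schema).
    columns: List[str] = []
    for row in left + right:
        for col in row:
            if col not in columns:
                columns.append(col)

    def pad(row: Row) -> Row:
        # Attributes unknown to a tuple are filled with null (None).
        return {col: row.get(col) for col in columns}

    result: List[Row] = []
    matched_right = set()
    for l_row in left:
        partners = [i for i, r_row in enumerate(right) if r_row[key] == l_row[key]]
        if not partners:
            result.append(pad(l_row))  # left tuple without partner is kept
        for i in partners:
            matched_right.add(i)
            result.append(pad({**right[i], **l_row}))
    for i, r_row in enumerate(right):
        if i not in matched_right:
            result.append(pad(r_row))  # right tuple without partner is kept
    return result


# Hypothetical inputs: one source knows affiliations, the other knows cities.
affiliations = [
    {"name": "Bernstein", "affiliation": "MSR"},
    {"name": "Carey", "affiliation": "UCI"},
]
cities = [
    {"name": "Bernstein", "city": "Redmond"},
    {"name": "Halevy", "city": "Seattle"},  # invented value, illustration only
]

joined = full_outer_join(affiliations, cities, key="name")
for row in joined:
    print(row)
```

The Bernstein tuple gains a previously unknown city, while the Carey and Halevy tuples survive with a null in the attribute their source did not provide, which is exactly the kind of uncertainty that later fusion steps have to deal with.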
Table 2: Conflict resolution strategies (from [3]).

Strategy                    Classification                         Short description
Pass it on                  ignoring                               escalates conflicts to user or application
Consider all possibilities  ignoring                               creates all possible value combinations
Take the information        avoiding, instance based               prefers values over null-values
No Gossiping                avoiding, instance based               returns only “consistent” tuples
Trust your friends          avoiding, metadata based               takes the value of a preferred source
Cry with the wolves         resolution, instance based, deciding   takes the most often occurring value
Roll the dice               resolution, instance based, deciding   takes a random value
Meet in the middle          resolution, instance based, mediating  takes an average value
Keep up to date             resolution, metadata based, deciding   takes the most recent value
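Several of the strategies in Table 2 can be sketched as small functions over the list of values that the sources report for one attribute of one entity. This is only an illustrative reading of the short descriptions, not the definition used in [3]: the function signatures, the tie-breaking rule in voting, and the numeric and timestamp inputs are assumptions. The affiliation data are taken from Table 1, in source order S1 to S5.

```python
from collections import Counter
from statistics import mean
from typing import Dict, List, Optional, Sequence

# Affiliations from Table 1, in source order S1..S5 (None stands for null).
TABLE_1: Dict[str, List[Optional[str]]] = {
    "Stonebraker": ["MIT", "Berkeley", "MIT", "MIT", None],
    "Dewitt": ["MSR", "MSR", "UWisc", "UWisc", None],
    "Bernstein": ["MSR", "MSR", "MSR", "MSR", None],
    "Carey": ["UCI", "AT&T", "BEA", "BEA", "BEA"],
    "Halevy": ["Google", "Google", "UW", "UW", "UW"],
}


def take_the_information(values: Sequence[Optional[str]]) -> Optional[str]:
    """Prefer values over nulls; here: the first non-null value (an assumption)."""
    return next((v for v in values if v is not None), None)


def trust_your_friends(values: Sequence[Optional[str]], preferred: int) -> Optional[str]:
    """Take the value of one preferred source, identified by its position."""
    return values[preferred]


def cry_with_the_wolves(values: Sequence[Optional[str]]) -> Optional[str]:
    """Take the most frequent non-null value; ties go to the value seen first."""
    counts = Counter(v for v in values if v is not None)
    return counts.most_common(1)[0][0] if counts else None


def meet_in_the_middle(values: Sequence[Optional[float]]) -> Optional[float]:
    """Mediate between numeric values by averaging the non-null ones."""
    present = [v for v in values if v is not None]
    return mean(present) if present else None


def keep_up_to_date(values: Sequence[Optional[str]], timestamps: Sequence[int]) -> Optional[str]:
    """Take the value with the largest timestamp (timestamps are hypothetical metadata)."""
    return max(zip(timestamps, values), key=lambda pair: pair[0])[1]


voted = {name: cry_with_the_wolves(vals) for name, vals in TABLE_1.items()}
trusted = {name: trust_your_friends(vals, preferred=0) for name, vals in TABLE_1.items()}
print(voted)
print(trusted)
```

On Table 1, whose caption designates S1 as the correct source, voting over all five sources selects BEA for Carey and UW for Halevy, whereas preferring S1 returns UCI and Google. The Dewitt result under voting depends entirely on the assumed tie-breaking rule, since MSR and UWisc each occur twice.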

Figure 2: A classification of conflict resolution strategies (from [4]). Node labels in the figure: conflict handling strategies; conflict ignorance, conflict avoidance, conflict resolution; instance based, metadata based (twice); deciding, mediating (twice).

The union of two relations performs data fusion by fusing identical tuples, i.e., pairs of tuples that have the same values in all attributes. In the example of Table 1, a union of all 25 tuples would remove all exact duplicates and thus reduce the data set by 12 tuples. For instance, the fact that Bernstein works at MSR would be represented only once, increasing the readability of the result. A slight enhancement is given by the minimum union operation, which additionally removes subsumed tuples, i.e., tuples that agree with another tuple in all non-null values but have more null values than that tuple. In the example, a further 3 tuples would be removed. For instance, the tuple (Bernstein, null) is subsumed by the tuple (Bernstein, MSR). This definition is further extended to complementing tuples, i.e., tuples that have mutual uncertainties but no contradicting values [5]. For example, assume the tuples of the example had an additional attribute ‘city’. Tuples t1 and t2 in the following table are complementing tuples and would be fused into a more complete tuple:

              name       affiliation  city
t1            Bernstein  MSR          null
t2            Bernstein  null         Redmond
Fused result  Bernstein  MSR          Redmond

Besides removing uncertainties, there have been relational approaches to remove contradictions. The match-join operator creates all possible tuples in a first step and, in a second step, reduces this number in a user-defined manner, for instance by selecting a random tuple as a representative from a set of duplicates [20]. The prioritized merge operator [12] is similar but can give preference to values of certain sources.

Finally, we discuss fusion through the SQL-based techniques of user-defined functions, the coalesce function, and aggregation functions. All have the goal of resolving data conflicts by collecting possible values and producing a single, possibly new value for the fusion result.

2.3 Advanced techniques for conflict resolution

None of the methods in Table 2 is perfect in resolving conflicts. They all fall short in some or all of the following three aspects. First, data sources are of different quality and we often trust data from more accurate sources, but accurate sources can make mistakes as well; thus, neither treating all sources as the same nor taking all data from accurate sources without verification is appropriate. Second, the real world is dynamic and the true value often evolves over time (such as a person’s affiliation and a business’s contact phone number), but it is hard to distinguish incorrect values from out-of-date values; thus, taking the most common value may end up with an out-of-date value, whereas taking the most recent value may end up with a wrong value. Third, data sources can copy from each other and errors can be propagated quickly; thus, ignoring possible dependencies among sources can lead to biased decisions due to copied information.

We next describe several advanced techniques that consider accuracy of sources, freshness of sources, and dependencies between sources to solve these problems.

Considering accuracy of sources: Data sources differ in accuracy, and some are more trustworthy than others. To illustrate, consider the first three sources in the motivating example. If we realize that S1 is more accurate than the other two sources and give its values higher weights, we are able to make more precise decisions, such as correctly deciding that Carey is at UCI (there is a tie in voting between S1, S2, and S3). It is proposed in [7, 19, 21] that we should consider accuracy of sources when deciding the true values. We describe their probabilistic models that iteratively compute accuracy of sources and decide the true values.

Considering freshness of sources: The world is often changing dynamically and a value, in addition to being true or false, can be in a subtle third case: out-of-date. Some sources, though appearing to provide wrong values, actually just have low freshness and provide stale data (S3 in the motivating example falls in this category). It is proposed in [8] that we should consider freshness of sources and treat incorrect values and out-of-date values differently in truth discovery, and we describe their probabilistic model accordingly.

Considering dependence between sources: In many
domains, especially on the Web, data sources may copy from each other for some of their data. In the motivating example, S4 and S5 copy all or part of the data from S3. If we treat S4 and S5 the same as other sources, we will incorrectly decide that all data provided by S3 are correct. It is proposed in [2, 7] that we should consider dependence between sources in truth discovery. We describe their algorithms that iteratively detect dependence between sources and discover the true values taking such dependence into consideration.

2.4 Data fusion in existing DI systems

This part of the tutorial examines relevant properties of both commercial and prototypical data fusion systems. The tutorial itself will not rattle off long lists of properties and systems, but will rather highlight certain relevant properties and especially interesting features of these systems. The supplemental material can include the corresponding lists and tables found in [4]. An example is Table 3, which lists the fusion capabilities of different integration systems.

Among the analyzed research prototypes with some fusion capabilities are Multibase, Hermes, FusionPlex, HumMer, Ajax, TSIMMIS, SIMS, Ariadne, ConQuer, Infomix, HIPPO, and Rainbow (see [4] for references). Among the analyzed commercial data integration systems are several DBMS and ETL tools, such as IBM’s Information Server or Microsoft’s SQL Server Integration Services.

Table 3: Data fusion capabilities, possible strategies, and fusion specification in existing data integration systems (from [4]).

System          Fusion possible  Fusion strategy                                                 Fusion specification
Multibase       resolution       Trust your friends, Meet in the middle                          manually, in query
Hermes          resolution       Keep up to date, Trust your friends, . . .                      manually, in mediator
Fusionplex      resolution       Keep up to date                                                 manually, in query
HumMer          resolution       Keep up to date, Trust your friends, Meet in the middle, . . .  manually, in query
Ajax            resolution       various                                                         manually, in workflow definition
TSIMMIS         avoidance        Trust your friends                                              manually, rules in mediator
SIMS/Ariadne    avoidance        Trust your friends                                              automatically
Infomix         avoidance        No Gossiping                                                    automatically
Hippo           avoidance        No Gossiping                                                    automatically
ConQuer         avoidance        No Gossiping                                                    automatically
Rainbow         avoidance        No Gossiping                                                    automatically
Pegasus         ignorance        Pass it on                                                      manually
Nimble          ignorance        Pass it on                                                      manually
Carnot          ignorance        Pass it on                                                      automatically
InfoSleuth      unknown          Pass it on                                                      unknown
Potter’s Wheel  ignorance        Pass it on                                                      manually, transformation

2.5 Open problems

We conclude the tutorial with a discussion of open problems and desiderata for data fusion systems. These include:

• Complex fusion functions: Often, the fusion decision is not based on the conflicting values themselves, but possibly on other data values of the affected tuples, such as a time stamp. In addition, fusion decisions on different attributes of the same tuples often need to be coordinated, for instance in an effort to keep associations between first and last names and not to mix them from different tuples. Providing a language to express such fusion functions and developing algorithms for their efficient execution are open problems.

• Incremental fusion: Non-associative fusion functions, such as voting or average, are subject to incorrect results if new conflicting values appear. Techniques such as retaining data lineage or maintaining simple metadata or statistics need to be developed to facilitate incremental fusion.

• Online fusion: In some applications it is infeasible to fuse data from different sources in advance, either because it is impossible to obtain all data from some sources, or because the total amount of data from various sources is huge. In such cases we need to efficiently perform data fusion in an online fashion at the time of query answering.

• Data lineage: Database administrators and data owners are notoriously hesitant to merge data and thus lose the original values, in particular if the merged result is not the same as at least one of the original values. Retaining data lineage despite merging is similar to the problem of data lineage through aggregation operators. Effective and efficient management of data lineage in the context of fusion is yet to be examined.

• Combining truth discovery and record linkage: Although Fig. 1 positions data fusion as the last phase in data integration, the results of data fusion can often benefit other tasks. For example, correcting wrong values in some records can help link these records with records that represent the same entity. To obtain the best results in schema mapping, record linkage, and data fusion, we may need to combine them and perform them iteratively.

3. ABOUT THE PRESENTERS

Xin Luna Dong received a Bachelor’s Degree in Computer Science from Nankai University in China in 1988, and a Master’s Degree in Computer Science from Peking University in China in 2001. She obtained her Ph.D. in Computer Science and Engineering from the University of Washington in 2007 and joined AT&T Labs–Research after graduation. Dr. Dong’s research interests include databases, information retrieval
and machine learning, with an emphasis on data integration, data cleaning, probabilistic data management, schema matching, personal information management, web search, web-service discovery and composition, and Semantic Web. Dr. Dong has led development of the Semex personal information management system, which won the best demo award (one of the top 3 demos) in Sigmod’05.

Felix Naumann studied mathematics at the Technical University of Berlin and received his diploma in 1997. As a member of the graduate school “Distributed Information Systems” at Humboldt-University of Berlin, he finished his PhD thesis in 2000. His dissertation in the area of data quality received the dissertation prize of the German Society of Informatics (GI) for the best dissertation in Germany in 2000. In the following two years Felix Naumann worked at the IBM Almaden Research Center. From 2003 to 2006 he was an assistant professor at Humboldt-University of Berlin heading the Information Integration group. Since 2006 he has been a full professor at the Hasso Plattner Institute, which is affiliated with the University of Potsdam. There he heads the information systems department. His experience in the area of data integration and data fusion is demonstrated by many publications in that area and numerous relevant industrial cooperations. Felix Naumann has served on the program committees of many international conferences and has served as a reviewer for many journals. He is the associate editor of the ACM Journal on Data and Information Quality and will be the general chair of the International Conference on Information Quality (ICIQ) in 2009.

4. REFERENCES

[1] P. A. Bernstein and S. Melnik. Model management 2.0: manipulating richer mappings. In Proc. of SIGMOD, pages 1–12, 2007.
[2] L. Berti-Equille, A. D. Sarma, X. Dong, A. Marian, and D. Srivastava. Sailing the information ocean with awareness of currents: Discovery and application of source dependence. In CIDR, 2009.
[3] J. Bleiholder and F. Naumann. Conflict handling strategies in an integrated information system. In Proceedings of the International Workshop on Information Integration on the Web (IIWeb), Edinburgh, UK, 2006.
[4] J. Bleiholder and F. Naumann. Data fusion. ACM Computing Surveys, 41(1):1–41, 2008.
[5] J. Bleiholder, S. Szott, M. Herschel, F. Kaufer, and F. Naumann. Algorithms for computing subsumption and complementation. 2009. Submitted.
[6] W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A comparison of string distance metrics for name-matching tasks. In Proc. of IIWEB, pages 73–78, 2003.
[7] X. Dong, L. Berti-Equille, and D. Srivastava. Integrating conflicting data: the role of source dependence. Technical report, AT&T Labs–Research, Florham Park, NJ, 2009.
[8] X. Dong, L. Berti-Equille, and D. Srivastava. Truth discovery and copying detection from source update history. Technical report, AT&T Labs–Research, Florham Park, NJ, 2009.
[9] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering (TKDE), 19(1):1–16, 2007.
[10] R. Fagin, P. G. Kolaitis, and L. Popa. Data exchange: Getting to the core. ACM Transactions on Database Systems (TODS), 30(1):174–201, 2005.
[11] C. A. Galindo-Legaria. Outerjoins as disjunctions. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), pages 348–358, Minneapolis, Minnesota, May 1994.
[12] S. Greco, L. Pontieri, and E. Zumpano. Integrating and managing conflicting data. In Revised Papers from the 4th International Andrei Ershov Memorial Conference on Perspectives of System Informatics, pages 349–362, 2001.
[13] L. M. Haas. Beauty and the beast: The theory and practice of information integration. In Proc. of ICDT, pages 28–43, 2007.
[14] A. Y. Halevy. Answering queries using views: A survey. VLDB Journal, 10(4):270–294, 2001.
[15] A. Y. Halevy, A. Rajaraman, and J. J. Ordille. Data integration: The teenage years. In Proc. of VLDB, pages 9–16, 2006.
[16] F. Naumann, A. Bilke, J. Bleiholder, and M. Weis. Data fusion in three steps: Resolving inconsistencies at schema-, tuple-, and value-level. IEEE Data Engineering Bulletin, 29(2):21–31, 2006.
[17] E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. The VLDB Journal, 10(4):334–350, 2001.
[18] W. Winkler. Overview of record linkage and current research directions. Technical report, Statistical Research Division, U. S. Bureau of the Census, 2006.
[19] M. Wu and A. Marian. Corroborating answers from multiple web sources. In Proc. of WebDB, 2007.
[20] L. L. Yan and M. T. Özsu. Conflict tolerant queries in AURORA. In Proceedings of the International Conference on Cooperative Information Systems (CoopIS), pages 279–290, 1999.
[21] X. Yin, J. Han, and P. S. Yu. Truth discovery with multiple conflicting information providers on the web. In Proc. of SIGKDD, 2007.
