Resolving Data Integration Conflicts
Figure 1: Three tasks in data integration (from [16]).

techniques for finding the best (true) values in the presence of data conflicts. We end our tutorial by surveying data fusion techniques in existing data integration systems and suggesting future research directions.

2.1 Overview

Data integration has three broad goals: increasing the completeness, conciseness, and correctness of data that is available to users and applications. Completeness measures the amount of data, in terms of both the number of tuples and the number of attributes. Conciseness measures the uniqueness of object representations in the integrated data, in terms of both the number of unique objects and the number of unique attributes of the objects. Finally, correctness measures whether the data conform to the real world.

Whereas high completeness can be obtained by adding more data sources to the system, achieving the other two goals is non-trivial. To meet these requirements, a data integration system needs to perform three levels of tasks (Fig. 1):

1. Schema mapping: First, a data integration system needs to resolve heterogeneity at the schema level by establishing semantic mappings between contents of disparate data sources.

2. Duplicate detection: Second, a data integration system needs to resolve heterogeneity at the instance level by detecting records that refer to the same real-world entity.

3. Data fusion: Third, a data integration system needs to combine records that refer to the same real-world entity by fusing them into a single representation and resolving possible conflicts from different data sources.

Among these three tasks, schema mapping and duplicate detection (record linkage) aim at removing redundancy and increasing conciseness of the data. Data fusion, which is the focus of this tutorial, aims at resolving conflicts in the data and increasing its correctness.

We distinguish two kinds of data conflicts: uncertainty and contradiction. Uncertainty is a conflict between a non-null value and one or more null values that are all used to describe the same property of a real-world entity. Uncertainty is caused by missing information, such as null values in a source or a completely missing attribute in a source. Contradiction is a conflict between two or more different non-null values that are all used to describe the same property of the same entity. Contradiction is caused by different sources providing different values for the same attribute of a real-world entity.

Example 2.1. Consider the five data sources in Table 1. There exists uncertainty on the affiliations of Stonebraker, Dewitt and Bernstein because of the null values provided by source S5, and contradiction on the affiliations of Stonebraker, Dewitt, Carey and Halevy.

There are two key issues in data fusion. First, how to find the best values among conflicting values? Second, how to do so efficiently? We next survey existing work on these problems.

2.2 Conflict resolution and data merging

Data conflicts, in the form of uncertainties or contradictions, can be resolved in numerous ways. After introducing a broad classification we turn to the relational algebra, which already provides several possibilities. More elaborate data integration systems and their fusion capabilities are analyzed in Section 2.4.

Conflict resolution strategies. There are many different data integration and fusion systems, each with its own solution. Fig. 2 classifies existing strategies to approach data conflicts and Table 2 lists some of the strategies and their classification. In particular, conflict ignoring strategies are not aware of conflicts, perform no resolution, and thus may produce inconsistent results. Conflict avoiding strategies are aware of conflicts but do not perform individual resolution for each conflict. Rather, a single decision is made, e.g., preference of a source, and applied to all conflicts. Finally, conflict resolving strategies provide the means for individual fusion decisions for each conflict.

Such decisions can be instance-based, i.e., they regard the actual conflicting data values, or they can be metadata-based, i.e., they choose values based on metadata, such as freshness of data or the reliability of a source. Finally, strategies can be classified by the result they are able to produce: deciding strategies choose a preferred value among the existing values, while mediating strategies can produce an entirely new value, such as the average of a set of conflicting numbers.

Relational operations. Both join and union (and their relatives) perform data fusion of sorts. Joining two tables enlarges the schema of the original individual relations and thus appends previously unknown values to tuples. Outer-join variants avoid the loss of tuples without a join partner. Full disjunction combines two or more input relations by combining all matching tuples into a single result-tuple [11].
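
As a rough illustration of how an outer join puts the values of different sources side by side, and how the two kinds of conflict then become visible, consider the following minimal sketch. It is not taken from any of the cited systems: the function names `full_outer_join` and `conflict_kind`, the dictionary-based representation of relations, and the sample data are all assumptions made for this example. It joins only two sources on a shared key and does not implement full disjunction.

```python
from typing import Dict, Optional, Sequence, Tuple

# Hypothetical representation: entity key -> {attribute: value or None (null)}.
Relation = Dict[str, Dict[str, Optional[str]]]
Joined = Dict[str, Dict[str, Tuple[Optional[str], Optional[str]]]]


def full_outer_join(left: Relation, right: Relation) -> Joined:
    """Join two sources on their key, keeping tuples without a join partner.

    Each attribute maps to a pair (value in left, value in right); a missing
    tuple or attribute is padded with None, as an outer join pads with nulls.
    """
    joined: Joined = {}
    for key in sorted(set(left) | set(right)):
        l_row = left.get(key, {})
        r_row = right.get(key, {})
        joined[key] = {
            attr: (l_row.get(attr), r_row.get(attr))
            for attr in sorted(set(l_row) | set(r_row))
        }
    return joined


def conflict_kind(values: Sequence[Optional[str]]) -> str:
    """Classify the values that several sources give for one attribute."""
    distinct = {v for v in values if v is not None}
    if len(distinct) > 1:
        return "contradiction"      # two or more different non-null values
    if len(distinct) == 1 and any(v is None for v in values):
        return "uncertainty"        # one non-null value next to null values
    return "none"


# Invented sample data, not the sources of Table 1.
s1: Relation = {"alice": {"affiliation": "Org A"}, "bob": {"affiliation": "Org B"}}
s2: Relation = {
    "alice": {"affiliation": None},
    "bob": {"affiliation": "Org C"},
    "carol": {"affiliation": "Org D"},
}

for key, row in full_outer_join(s1, s2).items():
    for attr, values in row.items():
        print(key, attr, values, conflict_kind(values))
```

In this sketch a tuple that appears in only one source is padded with nulls and therefore reported as an uncertainty; whether that is the desired reading depends on the application.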
Table 2: Conflict resolution strategies (from [3]).

Strategy                    Classification                         Short Description
Pass it on                  ignoring                               escalates conflicts to user or application
Consider all possibilities  ignoring                               creates all possible value combinations
Take the information        avoiding, instance based               prefers values over null-values
No Gossiping                avoiding, instance based               returns only “consistent” tuples
Trust your friends          avoiding, metadata based               takes the value of a preferred source
Cry with the wolves         resolution, instance based, deciding   takes the most often occurring value
Roll the dice               resolution, instance based, deciding   takes a random value
Meet in the middle          resolution, instance based, mediating  takes an average value
Keep up to date             resolution, metadata based, deciding   takes the most recent value
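
Several of the strategies in Table 2 can be sketched as small functions over the values that different sources report for one attribute of one entity. The following is only an illustration under stated assumptions: the `Claim` record, the function signatures, the integer timestamps, and in particular the choice to ignore null values in the deciding and mediating strategies are assumptions of this sketch, not definitions from [3].

```python
import random
import statistics
from collections import Counter
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Claim:
    """Hypothetical record: one source's value for one attribute."""
    value: Any       # None stands for a null value
    source: str
    timestamp: int   # e.g. a year; larger means more recent


def non_null(claims: List[Claim]) -> List[Claim]:
    return [c for c in claims if c.value is not None]


def take_the_information(claims: List[Claim]) -> List[Any]:
    """Prefer values over null values (several values may remain)."""
    return [c.value for c in non_null(claims)]


def trust_your_friends(claims: List[Claim], preferred_source: str) -> Any:
    """Take whatever the preferred source says, even if it is null."""
    for c in claims:
        if c.source == preferred_source:
            return c.value
    return None


def cry_with_the_wolves(claims: List[Claim]) -> Any:
    """Take the most frequent non-null value (ties: first one seen)."""
    counts = Counter(c.value for c in non_null(claims))
    return counts.most_common(1)[0][0] if counts else None


def roll_the_dice(claims: List[Claim], rng: Optional[random.Random] = None) -> Any:
    """Take a random non-null value."""
    candidates = take_the_information(claims)
    if rng is None:
        rng = random.Random()
    return rng.choice(candidates) if candidates else None


def meet_in_the_middle(claims: List[Claim]) -> Any:
    """Mediate: average of the non-null (numeric) values."""
    numbers = take_the_information(claims)
    return statistics.mean(numbers) if numbers else None


def keep_up_to_date(claims: List[Claim]) -> Any:
    """Take the non-null value with the latest timestamp."""
    candidates = non_null(claims)
    return max(candidates, key=lambda c: c.timestamp).value if candidates else None


# Invented example: four sources report an age for the same person.
claims = [
    Claim(30, "S1", 2005),
    Claim(33, "S2", 2008),
    Claim(30, "S3", 2006),
    Claim(None, "S4", 2009),
]
print(cry_with_the_wolves(claims))   # 30
print(meet_in_the_middle(claims))    # 31
print(keep_up_to_date(claims))       # 33
```

Note that only `meet_in_the_middle` can return a value that no source provided, which is what distinguishes a mediating strategy from a deciding one.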
domains, especially on the Web, data sources may copy from each other for some of their data. In the motivating example, S4 and S5 copy all or part of the data from S3. If we treat S4 and S5 the same as other sources, we will incorrectly decide that all data provided by S3 are correct. It is proposed in [2, 7] that we should consider dependence between sources in truth discovery. We describe their algorithms, which iteratively detect dependence between sources and discover the true values while taking such dependence into consideration.

2.4 Data fusion in existing DI systems

This part of the tutorial examines relevant properties of both commercial and prototypical data fusion systems. The tutorial itself will not rattle off long lists of properties and systems, but will rather highlight certain relevant properties and especially interesting features of these systems. The supplemental material can include the corresponding lists and tables found in [4]. An example is Tab. 3, which lists the fusion capabilities of different integration systems.

Among the analyzed research prototypes with some fusion capabilities are Multibase, Hermes, FusionPlex, HumMer, Ajax, TSIMMIS, SIMS, Ariadne, ConQuer, Infomix, HIPPO, and Rainbow (see [4] for references). Among the analyzed commercial data integration systems are several DBMS and ETL tools, such as IBM’s Information Server or Microsoft’s SQL Server Integration Services.

2.5 Open problems

We conclude the tutorial with a discussion of open problems and desiderata for data fusion systems. These include:

• Complex fusion functions: Often, the fusion decision is not based on the conflicting values themselves, but possibly on other data values of the affected tuples, such as a time stamp. In addition, fusion decisions on different attributes of the same tuples often need to be coordinated, for instance in an effort to keep associations between first and last names and not to mix them from different tuples. Providing a language to express such fusion functions and developing algorithms for their efficient execution are open problems.

• Incremental fusion: Non-associative fusion functions, such as voting or average, are subject to incorrect results if new conflicting values appear. Techniques such as retaining data lineage or maintaining simple metadata or statistics need to be developed to facilitate incremental fusion.

• Online fusion: In some applications it is infeasible to fuse data from different sources in advance, either because it is impossible to obtain all data from some sources, or because the total amount of data from various sources is huge. In such cases we need to efficiently perform data fusion in an online fashion at the time of query answering.

• Data lineage: Database administrators and data owners are notoriously hesitant to merge data and thus lose the original values, in particular if the merged result is not the same as at least one of the original values. Retaining data lineage despite merging is similar to the problem of data lineage through aggregation operators. Effective and efficient management of data lineage in the context of fusion is yet to be examined.

• Combining truth discovery and record linkage: Although Fig. 1 positions data fusion as the last phase in data integration, the results of data fusion can often benefit other tasks. For example, correcting wrong values in some records can help link these records with records that represent the same entity. To obtain the best results in schema mapping, record linkage, and data fusion, we may need to combine them and perform them iteratively.

3. ABOUT THE PRESENTERS

Xin Luna Dong received a Bachelor’s Degree in Computer Science from Nankai University in China in 1988, and a Master’s Degree in Computer Science from Peking University in China in 2001. She obtained her Ph.D. in Computer Science and Engineering from University of Washington in 2007 and joined AT&T Labs–Research after graduation. Dr. Dong’s research interests include databases, information retrieval
and machine learning, with an emphasis on data integration, data cleaning, probabilistic data management, schema matching, personal information management, web search, web-service discovery and composition, and Semantic Web. Dr. Dong has led development of the Semex personal information management system, which won the best demo award (one of the top 3 demos) in Sigmod’05.

Felix Naumann studied mathematics at the Technical University of Berlin and received his diploma in 1997. As a member of the graduate school “Distributed Information Systems” at Humboldt-University of Berlin, he finished his PhD thesis in 2000. His dissertation in the area of data quality received the dissertation prize of the German Society of Informatics (GI) for the best dissertation in Germany in 2000. In the following two years Felix Naumann worked at the IBM Almaden Research Center. From 2003 to 2006 he was an assistant professor at Humboldt-University of Berlin heading the Information Integration group. Since 2006 he has been a full professor at the Hasso Plattner Institute, which is affiliated with the University of Potsdam. There he heads the information systems department. His experience in the area of data integration and data fusion is demonstrated by many publications in that area and numerous relevant industrial cooperations. Felix Naumann has served on the program committees of many international conferences and has served as a reviewer for many journals. He is the associate editor of the ACM Journal on Data and Information Quality and will be the general chair of the International Conference on Information Quality (ICIQ) in 2009.

4. REFERENCES

[1] P. A. Bernstein and S. Melnik. Model management 2.0: manipulating richer mappings. In Proc. of SIGMOD, pages 1–12, 2007.
[2] L. Berti-Equille, A. D. Sarma, X. Dong, A. Marian, and D. Srivastava. Sailing the information ocean with awareness of currents: Discovery and application of source dependence. In CIDR, 2009.
[3] J. Bleiholder and F. Naumann. Conflict handling strategies in an integrated information system. In Proceedings of the International Workshop on Information Integration on the Web (IIWeb), Edinburgh, UK, 2006.
[4] J. Bleiholder and F. Naumann. Data fusion. ACM Computing Surveys, 41(1):1–41, 2008.
[5] J. Bleiholder, S. Szott, M. Herschel, F. Kaufer, and F. Naumann. Algorithms for computing subsumption and complementation. 2009. Submitted.
[6] W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A comparison of string distance metrics for name-matching tasks. In Proc. of IIWEB, pages 73–78, 2003.
[7] X. Dong, L. Berti-Equille, and D. Srivastava. Integrating conflicting data: the role of source dependence. Technical report, AT&T Labs–Research, Florham Park, NJ, 2009.
[8] X. Dong, L. Berti-Equille, and D. Srivastava. Truth discovery and copying detection from source update history. Technical report, AT&T Labs–Research, Florham Park, NJ, 2009.
[9] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering (TKDE), 19(1):1–16, 2007.
[10] R. Fagin, P. G. Kolaitis, and L. Popa. Data exchange: Getting to the core. ACM Transactions on Database Systems (TODS), 30(1):174–201, 2005.
[11] C. A. Galindo-Legaria. Outerjoins as disjunctions. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), pages 348–358, Minneapolis, Minnesota, May 1994.
[12] S. Greco, L. Pontieri, and E. Zumpano. Integrating and managing conflicting data. In Revised Papers from the 4th International Andrei Ershov Memorial Conference on Perspectives of System Informatics, pages 349–362, 2001.
[13] L. M. Haas. Beauty and the beast: The theory and practice of information integration. In Proc. of ICDT, pages 28–43, 2007.
[14] A. Y. Halevy. Answering queries using views: A survey. VLDB Journal, 10(4):270–294, 2001.
[15] A. Y. Halevy, A. Rajaraman, and J. J. Ordille. Data integration: The teenage years. In Proc. of VLDB, pages 9–16, 2006.
[16] F. Naumann, A. Bilke, J. Bleiholder, and M. Weis. Data fusion in three steps: Resolving inconsistencies at schema-, tuple-, and value-level. IEEE Data Engineering Bulletin, 29(2):21–31, 2006.
[17] E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. The VLDB Journal, 10(4):334–350, 2001.
[18] W. Winkler. Overview of record linkage and current research directions. Technical report, Statistical Research Division, U. S. Bureau of the Census, 2006.
[19] M. Wu and A. Marian. Corroborating answers from multiple web sources. In Proc. of WebDB, 2007.
[20] L. L. Yan and M. T. Özsu. Conflict tolerant queries in AURORA. In Proceedings of the International Conference on Cooperative Information Systems (CoopIS), pages 279–290, 1999.
[21] X. Yin, J. Han, and P. S. Yu. Truth discovery with multiple conflicting information providers on the web. In Proc. of SIGKDD, 2007.