
International Journal of Current Engineering and Technology E-ISSN 2277 – 4106, P-ISSN 2347 – 5161

©2021 INPRESSCO®, All Rights Reserved Available at http://inpressco.com/category/ijcet

Research Article

Record Normalization by Eliminating Duplicate Entries from Multiple Sources
Kalyani Ashok Sankpal and Kalpana V. Metre

Department of Computer Engineering, MET's BKC Institute of Engineering, Adgaon, Nashik, Maharashtra

Received 10 Nov 2020, Accepted 10 Dec 2020, Available online 01 Feb 2021, Special Issue-8 (Feb 2021)

Abstract

Bulk data is generated from various sources, and these sources may provide duplicate data with small representational differences. Mining such big data and producing a single representative record is a challenging task. The value of the data increases when it is linked with similar resources and similar data is fused into one source. A large body of research has addressed producing a single representative record for every real-world entity by removing duplicate records; this task is called record normalization. The proposed technique focuses on improving the precision of record normalization compared with existing strategies. It applies normalization at the record level, the field level and the value level, and the precision of the unique representation of a record increases at each level. Along with producing a unique representation, the data is linked with similar resources by comparing similar record field values. The system is tested on a citation-record dataset, and its accuracy and execution time are compared.

Keywords: Record normalization, data clustering, data fusion, data linking, data integration

Introduction

Bulk data is generated on the world wide web. Based on the user's search parameters, data is collected from various sources. Structured data contents are stored in web warehouses containing web databases and web tables. Relevant data is collected from various warehouses such as Google and Bing Shopping; Google Scholar is an important mining domain. This is known as web data integration. In web data integration, the structured data coming from various web warehouses should be matched automatically. Data containing similar records, i.e. records that point to the same entity, should be grouped together as a standard record set.

The result set generated after a query is fired on a search engine contains redundant results, showing multiple entries of the same record coming from various sources. This record representation contains duplicate and unnecessary entries, and such a result set is inconvenient for the end user to analyze.

Record normalization is important in a variety of domains. For example, in the research publication domain, Citeseer and Google Scholar are important integrator websites that collect data from various sources using automatic data collection techniques. The data is displayed to the user based on the user's query. The data should be clear and in normalized form. The search result should be:

1. Best match search
2. Data should be de-duplicated

If ad-hoc approaches to data matching are followed, or all matched records are displayed to the end user, it is very frustrating for the end user to sort and extract useful information from the generated result set. Ad-hoc extraction of records may lead to records with missing values or incorrect data representation. Record normalization is a challenging problem because various resources provide the same data in various formats. There are conflicts in the data collected from various sources due to erroneous data, incomplete data, different data representations or missing attribute values.

Consider an example: a user fires the search query "Data integration: the teenage years"; based on the title, various matching records are fetched, as shown in Table I.

Table I. Publication records

Sr. No. | Author                                | Title                               | Venue                                          | Date | Pages
1       | Halevy, A.; Rajaraman A.; Ordille, J. | Data integration: the teenage years | in proc 32nd int conf on Very large data bases | 2006 |
2       | A. Halevy, A. Rajaraman, J. Ordille   | Data integration: the teenage years | in VLDB                                        | 2006 | 9-16
3       | A. Halevy, A. Rajaraman, J. Ordille   | Data integration: the teenage years | in proc 32nd conf on Very large data bases     | 2006 | pp.916
4       | A. Halevy, A. Rajaraman, J. Ordille   | Data integration: the teenage years |                                                | 2006 | 9-16

In the above table, the same author names are represented in various forms, and the venue and pages fields contain missing values or variations in the representation of the same data. By analyzing all the records, the normal record should be generated as:

Table II. Normalized Records

Sr. No. | Author                              | Title                               | Venue                                          | Date | Pages
1       | A. Halevy, A. Rajaraman, J. Ordille | Data integration: the teenage years | in proc 32nd int conf on Very large data bases | 2006 | pp.9-16

For normalized record generation, record-level duplication should be removed. Along with the record-level comparison, a field-level comparison should be done; in the above example, author, title, venue, date and pages are the various fields of a record. For more precision, the values within each field should also be normalized. In the following section the literature survey is discussed, followed by the problem formulation. Based on the analyzed problem, a new system is proposed in Section IV. Implementation details are discussed in Section V, followed by the conclusion.

Literature Survey

Culotta et al. were the first to propose record normalization. The normalization technique is also called canonicalization: the process of converting data into one standard canonical form by analyzing various parameters. The authors propose a technique for record normalization on a database. Three types of solutions, expressed in terms of field values, are provided:

1. String edit distance to find the most relevant central record
2. Optimizing the edit distance parameters
3. A feature-based solution to improve the performance of canonicalization.

The paper does not consider value-component-level normalization, and hence the normalized record database contains many instances of repetitive data and unnecessary normalized records [2].

Swoosh treats the data duplication problem as an entity resolution (ER) problem. The problem is modelled as a black-box function: the black box matches and merges the records, and the ER algorithm is defined to invoke these functions. The system generates de-duplicated records but does not generate normalized records, which increases the complexity of the record matching problem [3].

Wick et al. propose a technique for data integration using schema matching. It also addresses co-reference resolution and record canonicalization, implemented with a discriminatively-trained model. Due to the combined objective, the system complexity increases. The paper only deals with field-level record matching and not with the value level, and hence the system does not generate completely normalized records [4].

Tejada et al. propose a technique for database record normalization called object normalization. The system collects data from various web sources and saves it collectively in a database. At search time, these database objects are normalized with duplicates removed. The system uses attribute ranking as well as string ranking within attributes, based on the user's confidence score [5].

Wang et al. work on a shopping dataset, which is normalized in terms of records. The work covers data integration and data cleaning: it performs record matching and replaces missing values with the most relevant values, and it also corrects the data that best suits the record by comparing the other dataset record entries. It does not work at the value level, operating globally at the field level of normalization [6].

Chaturvedi et al. work on pattern discovery in records. The technique does not focus on data normalization or the removal of duplicate records; instead it extracts patterns from duplicate records and finds the most important and prevalent patterns in the dataset. This approach can be applied to data normalization [7].

Dragut et al. work on automatic labeling, called label normalization. Label normalization is used for record normalization and for assigning meaningful labels to the elements of an integrated query interface. It works on field-level labeling and assigns labels to each attribute within the global interface [8].

S. Raunich et al. propose the ATOM system, which performs ontology merging, which is essentially a record normalization. However, user involvement is required in the merging phase; the approach should be automated with less end-user involvement [9].

Yongquan Dong et al. work on automatic record normalization performed at three levels: record level, field level and value level. The normalization accuracy increases at each level of data pruning. Duplicate records are removed, and a single record entry is created by analyzing the duplicate entries; a single representation of the record is created, but the related entries are not clubbed together. For a more informative data representation, the data should be normalized and linked together [1].

Problem Formulation

Let E1 be a real-world entity, and let Re be the set of records collected from various sources that represent the same entity E1, Re = {R1, R2, ..., Rp}. Each record is a collection of various fields, and each field contains string values. Let FS = {f1, f2, ..., fq} be the set of fields, and let ri[fj] denote the value of record ri in field fj. The problem is defined as a record normalization and linking problem: from the set Re, generate a new customized record that represents the entity E1 more accurately and in a very descriptive manner. Records of entities similar to E1 should be linked together by matching the field-level and value-level components.
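To make the notation concrete, the following is a minimal Java sketch of one way the record set Re and its fields could be represented; the class and method names are illustrative assumptions, not part of the paper.

    import java.util.*;

    // Illustrative data model: a record is an ordered map from field names
    // (author, title, venue, date, pages) to raw string values.
    class SourceRecord {
        final Map<String, String> fields = new LinkedHashMap<>();
        SourceRecord put(String field, String value) { fields.put(field, value); return this; }
        String get(String field) { return fields.getOrDefault(field, ""); }
    }

    // Re: all records collected from different sources for one real-world entity E1.
    class EntityRecordSet {
        final String entityId;
        final List<SourceRecord> records = new ArrayList<>();
        EntityRecordSet(String entityId) { this.entityId = entityId; }
    }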
Proposed Methodology

A. Preliminaries:

1. Frequency Ranker: The frequency ranker ranks the most frequently occurring units u in the list of distinct units.
FR(U) = [u1, u2, ..., up]
where FR(U) is a list of units sorted in descending order of occurrence frequency.

2. Length Ranker: The length ranker ranks the units u in the list of distinct units by their length.
LR(U) = [u1, u2, ..., up]
where LR(U) is a list of units sorted in descending order of the number of characters present in the unit.

3. Centroid Ranker: This gives an ordered list of distinct units. It first calculates the similarity scores among the units and then finds the centroid, where
U = bag of units,
U' = distinct units in U,
Au and Av = occurrence frequencies of units u and v.
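As an illustration of the frequency and length rankers, a minimal Java sketch is given below (the centroid ranker is omitted because its defining formula is not reproduced in this text; the method names are illustrative).

    import java.util.*;
    import java.util.stream.*;

    class Rankers {

        // FR(U): distinct units sorted in descending order of occurrence frequency.
        static List<String> frequencyRank(List<String> units) {
            Map<String, Long> freq = units.stream()
                    .collect(Collectors.groupingBy(u -> u, Collectors.counting()));
            return freq.entrySet().stream()
                    .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toList());
        }

        // LR(U): distinct units sorted in descending order of character length.
        static List<String> lengthRank(List<String> units) {
            return units.stream().distinct()
                    .sorted(Comparator.comparingInt(String::length).reversed())
                    .collect(Collectors.toList());
        }
    }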
4. Edit-distance based similarity measure: The edit distance is the number of edits required to transform one string into another. The edit-distance based similarity between two strings a and b is computed from this distance, where |a| and |b| are the lengths of a and b respectively.

5. Bigram similarity measure: This measure is based on the 2-character substrings present in a string. The similarity Sim-bigram(a, b) between strings a and b is computed from bigram(a) and bigram(b), the sets of 2-grams of a and b respectively.
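The defining equations for these two similarity measures are not reproduced in this text, so the sketch below assumes the standard formulations: edit-distance similarity as 1 − ED(a, b)/max(|a|, |b|), and bigram similarity as the Dice overlap of the two bigram sets. These are assumptions, not necessarily the exact formulas used by the authors.

    import java.util.*;

    class SimilarityMeasures {

        // Assumed form: 1 - editDistance(a, b) / max(|a|, |b|).
        static double editDistanceSimilarity(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++)
                for (int j = 1; j <= b.length(); j++)
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                            d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
            int maxLen = Math.max(a.length(), b.length());
            return maxLen == 0 ? 1.0 : 1.0 - (double) d[a.length()][b.length()] / maxLen;
        }

        // Assumed form: Dice coefficient over 2-character substrings.
        static double bigramSimilarity(String a, String b) {
            Set<String> ba = bigrams(a), bb = bigrams(b);
            if (ba.isEmpty() && bb.isEmpty()) return 1.0;
            Set<String> common = new HashSet<>(ba);
            common.retainAll(bb);
            return 2.0 * common.size() / (ba.size() + bb.size());
        }

        private static Set<String> bigrams(String s) {
            Set<String> out = new HashSet<>();
            for (int i = 0; i + 2 <= s.length(); i++) out.add(s.substring(i, i + 2));
            return out;
        }
    }

For example, bigramSimilarity("conference", "conf") scores the two strings by their shared prefix bigrams, which is the kind of evidence the text feature below relies on.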
6. Feature-based rankers: Feature-based rankers are divided into two sections:
a. Strategy feature: a binary indicator that indicates whether the unit is the representative unit ranked by some ranking criterion.
b. Text feature: this feature examines the properties of the string. It checks whether the string is an acronym or abbreviation of a certain representative string. For example, "conf" is an abbreviation of "conference", whereas VLDB is an acronym for Very Large Data Bases.

7. Collocation: a collocation is a sequence of consecutive terms with an inverse term document frequency (idf) value less than the given threshold. An n-collocation consists of n consecutive terms.

8. Sub-collocation: a substring of an n-collocation with k consecutive terms. For example, "in the conference" is a sub-collocation of "in the conference of VLDB".

9. Template collocation: an n-collocation is called a template collocation if its inverse term document frequency (idf) is greater than the given threshold.

10. Twin template collocation: the terms tc1 and tc2 are twin collocations if they satisfy the following conditions:
P(tc1, tc2) > P(tc1, tc), for all tc ∈ TC and tc1 <> tc2
P(tc1, tc2) / P(tc2) > threshold
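A rough sketch of how n-collocations could be extracted under an idf threshold is given below, treating each field value as one document. The idf formula and the containment test are illustrative assumptions rather than the authors' exact definitions.

    import java.util.*;

    class Collocations {

        // idf of a term sequence over the collection of field values (each value is one "document").
        static double idf(String phrase, List<String> values) {
            long containing = values.stream()
                    .filter(v -> v.toLowerCase().contains(phrase)).count();
            return Math.log((double) values.size() / (1 + containing));
        }

        // n-collocations: sequences of n consecutive terms whose idf is below the threshold.
        static Set<String> nCollocations(List<String> values, int n, double idfThreshold) {
            Set<String> result = new LinkedHashSet<>();
            for (String value : values) {
                String[] terms = value.toLowerCase().split("\\s+");
                for (int i = 0; i + n <= terms.length; i++) {
                    String phrase = String.join(" ", Arrays.copyOfRange(terms, i, i + n));
                    if (idf(phrase, values) < idfThreshold) result.add(phrase);
                }
            }
            return result;
        }
    }

Flipping the test to idf(phrase, values) > threshold would instead select the rarer sequences described as template collocations in item 9.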
B. System Architecture

The redundant record set is the input to the system. After processing, the system generates a non-redundant normalized record set along with the data linking. The data processing is mainly categorized into 5 stages:

1. Data preprocessing
2. Record level normalization
3. Field level normalization
4. Value level normalization
5. Field-based clusters

The following figure shows the architecture of the system.

Fig. 1. System Architecture

C. System Description:

1. Pre-processing step: Initially, each record is separated from the given data, and the various fields are extracted from each record. For example, consider the following citation: A. Halevy, A. Rajaraman, J. Ordille, "Data integration: the teenage years", in proc 32nd int conf on Very large data bases, 2006, pp.9-16. From this citation the following fields can be separated: Author: A. Halevy, A. Rajaraman, J. Ordille; Title: Data integration: the teenage years; Venue: in proc 32nd int conf on Very large data bases; Date: 2006; Pages: pp.9-16. All the comma-separated values are extracted and added to the respective fields (a sketch of this step is given after this list).

2. Record selection: The record is generated by combining the various fields. All values should be present in each field so that a complete, informative citation can be generated as a representative of all the redundant data. This is the selection criterion for record-level data filtering. The selected records are further processed at the field and value levels.

3. Field selection: The normalized record is generated by combining the most descriptive features of all fields. Each field's data is normalized across all the records, and then a new record is generated. For record normalization, the frequency ranker, length ranker, centroid ranker and feature-based rankers are used.

4. Value selection: The values of each field are extracted. Abbreviations and acronyms are replaced using the Mining Abbreviation-Definition Pairs algorithm. Afterwards, collocations, sub-collocations and twin collocations are identified using the Mining TemplateCollocation-SubCollocation Pairs (MTS) algorithm, and the normalized record is generated at the value level.

5. Field-based clusters: Based on the normalized value extracted for each field in the record, relevant records are linked as per the field value details.
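For the pre-processing step, a minimal sketch of how a citation string could be separated into fields is given below; the comma-splitting heuristic and the field-detection rules are illustrative assumptions, not the authors' exact parser.

    import java.util.*;

    class CitationPreprocessor {

        // Illustrative heuristic: split the raw citation on commas and assign each
        // piece to a field (Author, Title, Venue, Date, Pages) by simple pattern tests.
        static Map<String, String> extractFields(String citation) {
            List<String> authors = new ArrayList<>();
            String title = "", venue = "", date = "", pages = "";
            for (String raw : citation.split(",")) {
                String p = raw.trim();
                if (p.matches("\\d{4}")) date = p;                      // e.g. 2006
                else if (p.toLowerCase().startsWith("pp.")) pages = p;  // e.g. pp.9-16
                else if (p.startsWith("\"") || p.endsWith("\"")) title = p.replace("\"", "");
                else if (p.toLowerCase().startsWith("in ")) venue = p;  // e.g. in proc 32nd int conf ...
                else authors.add(p);
            }
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("Author", String.join(", ", authors));
            fields.put("Title", title);
            fields.put("Venue", venue);
            fields.put("Date", date);
            fields.put("Pages", pages);
            return fields;
        }
    }

Applied to the example citation in step 1, this produces the Author, Title, Venue, Date and Pages values listed there.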
D. Algorithms

1. Mining Abbreviation-Definition Pairs

Input: collection of all values of the field fi; Tlen, Tidf, Tpos: threshold values
Output: AWP: a set of abbreviation-word pairs
Processing:
2. cwords = EMPTY; AWP = EMPTY;
3. pwords = tokenize data in fi
4. uwords = find unique words in pwords;
5. for each uword in uwords do
6.   if len(uword) >= Tlen and idf(uword, Re) ≤ Tidf then
7.     insert uword into cwords
8.   end if
9. end for
10. for each cword in cwords do
11.   pa-words = FindWordsInSameContext(cword, uwords, , Tpos)
12.   if pa-words <> EMPTY then
13.     abbreviations = findAbbreviations(cword, pa-words)
14.   end if
15.   if abbreviations <> EMPTY then
16.     for each abbreviation in abbreviations do
17.       insert (abbreviation, cword) into AWP;
18.     end for
19.   end if
20. end for
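The findAbbreviations step is not spelled out in the listing above; the sketch below shows two simple checks it could plausibly use (prefix abbreviation and initial-letter acronym). These are assumptions rather than the authors' exact rules.

    class AbbreviationRules {

        // "conf" is a prefix abbreviation of "conference".
        static boolean isPrefixAbbreviation(String shortForm, String word) {
            String s = shortForm.replace(".", "").toLowerCase();
            return s.length() < word.length() && word.toLowerCase().startsWith(s);
        }

        // "VLDB" is an acronym of "Very Large Data Bases" (initial letters of the phrase).
        static boolean isAcronym(String shortForm, String phrase) {
            StringBuilder initials = new StringBuilder();
            for (String w : phrase.trim().split("\\s+")) {
                initials.append(Character.toUpperCase(w.charAt(0)));
            }
            return initials.toString().equals(shortForm.toUpperCase());
        }
    }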

2. Mining TemplateCollocation-SubCollocation Pairs (MTS)

Input: CVal(f) – abbreviations in val(f); Tidf: threshold value
Output: TCSP: a list of template collocation Tc and its sub-collocations Stc pairs
Processing:
1. Initialize TCSP = EMPTY;
2. m = getMaxWordCount(CVal(f)); 1-collocs = FindOneWordCollocations(CVal(f));
3. if 1-collocs <> EMPTY then
4.   for each 1-colloc ∈ 1-collocs do
5.     add (1-colloc, NULL) to TCSP
6.   end for
7.   ews = FindCandidateExpandWords(CVal(f))
8.   for n = 2 to m do
9.     n-collocs = FindNCollocations(CVal(f), n, Tidf)
10.    if n-collocs == EMPTY then
11.      break
12.    end if
13.    Y = EMPTY
14.    for each n-colloc ∈ n-collocs do
15.      cspairs = FindExpandedSubcollocationPairs(n-colloc, ews, TCSP)
16.      if cspairs <> EMPTY then
17.        for each cspair ∈ cspairs do
18.          X = {c} ∪ Sc;
19.          insert (n-colloc, X) into TCSP
20.          add cspair to Y
21.        end for
22.      end if
23.    end for
24.    TCSP = TCSP − Y
25.  end for
26.  remove the pairs of the form (c, NULL) from TCSP
27. end if
28. return TCSP
Result and Discussions

The system is implemented on a Windows machine with 8 GB RAM and an i3 processor. For programming, the Java Development Kit (JDK 1.8) is used.

A. Dataset:
The PVCD [10] dataset is used. This is a publication dataset containing publication venue information. It contains 3,683 publications and 100 distinct publication records. The dataset contains acronyms, abbreviations, and misspellings.

B. Performance Measures:
1. Accuracy: Fivefold cross validation is performed, and accuracy is measured in terms of correctly normalized units at the record and field level with respect to the predicted normalized units. The accuracy is measured for record-level, field-level and value-level normalization.
2. Processing Time: The processing time for each level of processing is measured.

C. Implementation Status:
The system is implemented partially; the frequency and length rankers are applied to the dataset. The dataset contains venue information. For the frequency ranker, the distinct venue fields are extracted from the dataset along with their occurrence frequencies, and the list is sorted in descending order of frequency count. For the length ranker, the character length of each field value is calculated, and the field list is sorted in descending order of length. The following table shows the time required for processing the frequency and length rankers.
Table III: Time Evaluation

Number of records | Frequency ranker (time in seconds) | Length ranker (time in seconds)
500               | 0.91                               | 0.71
1000              | 1.58                               | 1.22
1500              | 2.12                               | 1.72
2000              | 2.37                               | 2.16
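A minimal sketch of how the per-ranker timings in Table III could be collected is shown below; the harness is illustrative, not the authors' benchmarking code, and it reuses the Rankers sketch from the Preliminaries section.

    import java.util.*;

    class TimingHarness {

        // Measures wall-clock time, in seconds, for frequency-ranking the first `count` venue values.
        static double timeFrequencyRanker(List<String> venues, int count) {
            List<String> subset = venues.subList(0, Math.min(count, venues.size()));
            long start = System.nanoTime();
            Rankers.frequencyRank(subset);
            return (System.nanoTime() - start) / 1e9;
        }
    }

Calling this for 500, 1000, 1500 and 2000 records corresponds to the record counts reported in Table III.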
Conclusions

The proposed system generates normalized records by removing duplicate entries that point to the same entity. For data normalization, processing is applied at three levels: record level, field level and value level. The precision of deduplication increases from the record level to the value level. Along with duplicate removal, similar entities are grouped together using field-level and value-level data comparison, and the grouped data is linked together to generate a more representative record. In future, the system can be extended to handle numeric and more complex values.

References

[1] Yongquan Dong, Eduard C. Dragut and Weiyi Meng, "Normalization of Duplicate Records from Multiple Sources," IEEE Transactions on Knowledge and Data Engineering, vol. 31, no. 4, April 2019, pp. 769–782.
[2] A. Culotta, M. Wick, R. Hall, M. Marzilli, and A. McCallum, "Canonicalization of database records using adaptive similarity measures," in SIGKDD, 2007, pp. 201–209.
[3] O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S. E. Whang, and J. Widom, "Swoosh: A generic approach to entity resolution," VLDBJ, vol. 18, no. 1, pp. 255–276, 2009.
[4] M. L. Wick, K. Rohanimanesh, K. Schultz, and A. McCallum, "A unified approach for schema matching, coreference and canonicalization," in SIGKDD, 2008, pp. 722–730.
[5] S. Tejada, C. A. Knoblock, and S. Minton, "Learning object identification rules for information integration," Inf. Sys., vol. 26, no. 8, pp. 607–633, 2001.
[6] L. Wang, R. Zhang, C. Sha, X. He, and A. Zhou, "A hybrid framework for product normalization in online shopping," in DASFAA, vol. 7826, 2013, pp. 370–384.
[7] S. Chaturvedi et al., "Automating pattern discovery for rule based data standardization systems," in ICDE, 2013, pp. 1231–1241.
[8] E. C. Dragut, C. Yu, and W. Meng, "Meaningful labeling of integrated query interfaces," in VLDB, 2006, pp. 679–690.
[9] S. Raunich and E. Rahm, "Atom: Automatic target-driven ontology merging," in ICDE, 2011, pp. 1276–1279.
