Paper73335 339
Research Article
Department of Computer Engineering MET’s BKC Institute Of Engineering, Adgaon, Nashik, Maharashtra
Received 10 Nov 2020, Accepted 10 Dec 2020, Available online 01 Feb 2021, Special Issue-8 (Feb 2021)
Abstract
Bulk data is generated from various sources, and these sources may provide duplicate records with small representational differences. Mining such big data to create a single representative record is a challenging task. The value of the data increases when it is linked with similar resources and similar data is fused into one source. Considerable research has been done on producing a single representative record for each real-world entity by removing duplicate records; this task is called record normalization. The technique presented here focuses on the precision of record normalization compared with existing strategies. It applies normalization at the record level, the field level, and the value level, and the precision of the unique representation of a record increases at each level. Along with the unique representation, the data is linked with similar resources by comparing similar record field values. The system is tested on a citation-record dataset, and its accuracy and execution time are compared.
Keywords: Record normalization, data clustering, data fusion, data linking, data integration
335| cPGCON 2020(9th post graduate conference of computer engineering), Amrutvahini college of engineering, Sangamner, India
International Journal of Current Engineering and Technology, Special Issue-8 (Feb 2021)
Problem Formulation

Let E1 be a real-world entity, and let Re = {R1, R2, …, Rp} be the set of records collected from various sources, each representing the same entity E1. A record is a collection of fields, and each field contains string values. Let FS = {f1, f2, …, fq} be the set of fields, and let ri[fi] denote the value of field fi in record ri. The record normalization and linking problem is defined as follows: from the set Re, generate a new customized record that represents the entity E1 more accurately and descriptively, and link the records of entities similar to E1 by matching their field-level and value-level components.

Proposed Methodology

A. Preliminaries:

1. Data preprocessing
2. Record level normalization
3. Field level normalization
4. Value level normalization
5. Field based clusters

The following figure shows the architecture of the system.

5. Bigram similarity measure:

This distance is based on the 2-character substrings present in a string. The similarity measure between strings a and b is given as:

Sim-bigram(a, b) = 2 × |bigram(a) ∩ bigram(b)| / (|a| + |b|)

where |a| and |b| are the lengths of a and b respectively, and bigram(a) and bigram(b) are the sets of 2-grams of a and b respectively.

6. Feature-based rankers:

Feature-based rankers are divided into two sections:

a. Strategy feature: a binary indicator of whether the unit is the representative unit ranked by some ranking criterion.

b. Text feature: this feature examines the properties of a string. It checks whether the string is an acronym or an abbreviation of a certain representative string. For example, "conf" is an abbreviation of "conference", whereas VLDB is an acronym for Very Large Data Bases.
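The bigram similarity measure above can be sketched in a few lines of Python (the function and variable names here are illustrative, not from the paper):

```python
def bigrams(s: str) -> set:
    """Return the set of 2-character substrings of s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def sim_bigram(a: str, b: str) -> float:
    """Bigram similarity: twice the number of shared 2-grams,
    normalized by the combined lengths of the two strings."""
    if len(a) + len(b) == 0:
        return 0.0
    shared = bigrams(a) & bigrams(b)
    return 2 * len(shared) / (len(a) + len(b))
```

For example, "night" and "nacht" share only the bigram "ht", giving a similarity of 2 × 1 / (5 + 5) = 0.2.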
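The text feature described for feature-based rankers can be illustrated with simple heuristic checks (a rough sketch; the paper does not give its exact rules, so these predicates are assumptions):

```python
def is_abbreviation(short: str, full: str) -> bool:
    """Heuristic: short is an abbreviation of full if it is a prefix
    of full, e.g. 'conf' -> 'conference'."""
    return full.lower().startswith(short.lower().rstrip("."))

def is_acronym(short: str, full: str) -> bool:
    """Heuristic: short is an acronym of full if it matches the
    initials of the words in full, e.g. 'VLDB' -> 'Very Large Data Bases'."""
    initials = "".join(word[0] for word in full.split())
    return short.lower() == initials.lower()
```

A ranker could use such predicates as binary text features when deciding which of several candidate values is the representative one.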
D. Algorithms
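As context, the normalization levels listed under Preliminaries could compose into a pipeline along these lines (a hypothetical sketch; every function and rule here is illustrative and is not the paper's actual algorithm):

```python
def normalize_records(records):
    """Illustrative pipeline for the levels listed in the Preliminaries:
    preprocessing, then record-, field-, and value-level normalization."""
    # 1. Data preprocessing: trim whitespace and lower-case field values.
    cleaned = [{f: v.strip().lower() for f, v in r.items()} for r in records]
    # 2. Record level: pick the most complete record as a starting point.
    base = max(cleaned, key=lambda r: sum(1 for v in r.values() if v))
    # 3. Field level: for each field, take the most frequent non-empty value.
    normalized = {}
    for field in base:
        values = [r.get(field, "") for r in cleaned if r.get(field)]
        normalized[field] = max(set(values), key=values.count) if values else ""
    # 4. Value level: placeholder for finer fixes such as expanding
    # abbreviations inside the chosen field values.
    return normalized
```

Given three duplicate citation records with slightly different titles, the sketch returns the majority form of each field as the representative record.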