Abstract
As an important operation in data cleaning, near-duplicate Web page detection and data mining, similarity joins have received much attention recently. Existing similarity joins fall into two broad categories: the similarity-threshold-based similarity join and the top-k similarity join (TopkJoin). Compared with the threshold-based join, TopkJoin is more suitable for cases where the similarity threshold is not known beforehand. In this paper, we focus on the performance optimization of TopkJoin. In particular, we observe that the state-of-the-art TopkJoin algorithm has three serious performance issues: inappropriate use of the hash table, inefficient use of suffix filtering, and unnecessary evaluation of an excessive number of unqualified candidates. To resolve these problems, we propose a novel algorithm, SETJoin, which combines the existing event-driven framework with three simple yet efficient optimization techniques: (1) reducing the cost of hashing by rearranging the order of the candidate filtering and hash table lookup operations; (2) maximizing the pruning capability of suffix filtering by judiciously choosing the (near) optimal recursion depth; and (3) terminating join operations earlier by setting a much tighter stop condition for the iteration. The experimental results show that SETJoin achieves speedups ranging from 1.26x to 3.49x over the state-of-the-art algorithm on several real datasets.
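To make the problem statement concrete, the following is a minimal sketch of a brute-force top-k set similarity join. It is not SETJoin and uses none of the filtering or early-termination techniques described above; it only illustrates what a top-k join returns when no threshold is given. The choice of Jaccard similarity and the names `jaccard` and `naive_topk_join` are assumptions made for illustration.

```python
import heapq
from itertools import combinations


def jaccard(r, s):
    """Jaccard similarity of two token sets (assumed similarity function)."""
    if not r and not s:
        return 0.0
    return len(r & s) / len(r | s)


def naive_topk_join(records, k):
    """Return the k most similar record pairs as (similarity, i, j) tuples.

    Brute-force baseline: every pair is scored and the best k are kept in
    a min-heap. Hypothetical helper, not the algorithm from the paper.
    """
    heap = []  # min-heap; heap[0] is the weakest pair currently kept
    for (i, r), (j, s) in combinations(enumerate(records), 2):
        item = (jaccard(r, s), i, j)
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # The new pair beats the current k-th best, so replace it.
            heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)


records = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "b", "c", "d"}, {"x", "y"}]
top2 = naive_topk_join(records, 2)
```

The similarity of the weakest pair in the heap plays the role of a threshold that rises as better pairs are found. This baseline still scores every pair, which is the cost that candidate filtering and an earlier stop condition aim to avoid.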
Notes
This will be discussed in more detail in Sect. 5.
For instance, during the execution of the top-500 query, over two hundred million candidate pairs are generated.
We do not present the details of prefix and positional filtering in Algorithm 1 for the sake of conciseness.
Please note that the suffixes of the two records are passed to r and s when SuffixFilter is invoked.
ppjoin+ is the state-of-the-art SimJoin algorithm proposed in Xiao et al. (2008).
http://www.informatik.uni-trier.de/~ley/db.
http://trec.nist.gov/data/t9-filtering.html.
http://www.cs.cmu.edu/~enron.
Please note that the number of hash lookup operations is equal to the number of generated candidates in topk-join.
References
Arasu A, Ganti V, Kaushik R (2006) Efficient exact set-similarity joins. In: VLDB, pp 918–929
Arasu A, Chaudhuri S, Kaushik R (2008) Transformation-based framework for record matching. In: ICDE, pp 40–49
Baraglia R, Morales GDF, Lucchese C (2010) Document similarity self-join with mapreduce. In: Webb GI, Zhang C, Gunopulos D, Wu X (eds) ICDM. IEEE Computer Society, Washington, pp 731–736
Bayardo RJ, Ma Y, Srikant R (2007) Scaling up all pairs similarity search. In: WWW, pp 131–140
Behm A, Li C, Carey MJ (2011) Answering approximate string queries on large data sets using external memory. In: ICDE, pp 888–899
Broder AZ, Charikar M, Frieze AM, Mitzenmacher M (1998) Min-wise independent permutations (extended abstract). In: STOC, pp 327–336
Charikar M (2002) Similarity estimation techniques from rounding algorithms. In: STOC, pp 380–388
Chaudhuri S, Ganti V, Kaushik R (2006) A primitive operator for similarity joins in data cleaning. In: ICDE, p 5
Corral A, Manolopoulos Y, Theodoridis Y, Vassilakopoulos M (2000) Closest pair queries in spatial databases. In: SIGMOD, pp 189–200
Deng D, Li G, Hao S, Wang J, Feng J (2014) Massjoin: a mapreduce-based method for scalable string similarity joins. In: ICDE, pp 340–351
Fries S, Boden B, Stepien G, Seidl T (2014) Phidj: parallel similarity self-join for high-dimensional vector data with mapreduce. In: ICDE, pp 796–807
Gravano L, Ipeirotis PG, Jagadish HV, Koudas N, Muthukrishnan S, Srivastava D (2001) Approximate string joins in a database (almost) for free. In: VLDB, pp 491–500
Hernández MA, Stolfo SJ (1998) Real-world data is dirty: data cleansing and the merge/purge problem. Data Min Knowl Discov 2(1):9–37
Hu H, Li G, Bao Z, Feng J, Wu Y, Gong Z, Xu Y (2016) Top-k spatio-textual similarity join. IEEE Trans Knowl Data Eng 28(2):551–565
Huang J, Zhang R, Buyya R, Chen J (2014) MELODY-JOIN: efficient earth mover’s distance similarity joins using mapreduce. In: ICDE, pp 808–819
Jestes J, Li F, Yan Z, Yi K (2010) Probabilistic string similarity joins. In: SIGMOD, pp 327–338
Jiang Y, Li G, Feng J, Li W (2014) String similarity joins: an experimental evaluation. PVLDB 7(8):625–636
Kim Y, Shim K (2012) Parallel top-k similarity join algorithms using mapreduce. In: ICDE, pp 510–521
Lam HT, Dung DV, Perego R, Silvestri F (2010) An incremental prefix filtering approach for the all pairs similarity search problem. In: APWeb, pp 188–194
Li G, He J, Deng D, Li J (2015) Efficient similarity join and search on multi-attribute data. In: SIGMOD, pp 1137–1151
Mann W, Augsten N, Bouros P (2016) An empirical evaluation of set similarity join techniques. Proc VLDB Endow 9(9):636–647
Metwally A, Faloutsos C (2012) V-smart-join: a scalable mapreduce framework for all-pair similarity joins of multisets and vectors. PVLDB 5(8):704–715
Quirino RD, Ribeiro-Junior S, Ribeiro LA, Martins WS (2018) Efficient filter-based algorithms for exact set similarity join on GPUs. In: Hammoudi S, Śmiałek M, Camp O, Filipe J (eds) Enterprise information systems. ICEIS 2017. Lecture notes in business information processing, vol 321. Springer, Cham, pp 74–95
Sarawagi S, Kirpal A (2004) Efficient set joins on similarity predicates. In: SIGMOD, pp 743–754
Sarma AD, He Y, Chaudhuri S (2014) Clusterjoin: a similarity joins framework using map-reduce. PVLDB 7(12):1059–1070
SriUsha I, Choudary KR, Sasikala T et al (2018) Data mining techniques used in the recommendation of e-commerce services. In: Second international conference on electronics, communication and aerospace technology (ICECA). IEEE, pp 379–382
Vernica R, Carey MJ, Li C (2010) Efficient parallel set-similarity joins using mapreduce. In: SIGMOD, pp 495–506
Wang J, Li G, Feng J (2012) Can we beat the prefix filtering? An adaptive framework for similarity join and search. In: SIGMOD, pp 85–96
Wang X, Qin L, Lin X, Zhang Y, Chang L (2017) Leveraging set relations in exact set similarity join. Proc VLDB Endow 10(9):925–936
Willi M, Augsten N, Jensen CS (2017) Swoop: top-k similarity joins over set streams. arXiv: Databases
Xiao C, Wang W, Lin X, Yu JX (2008) Efficient similarity joins for near duplicate detection. In: WWW, pp 131–140
Xiao C, Wang W, Lin X, Shang H (2009) Top-k set similarity joins. In: ICDE, pp 916–927
Xiong Y, Zhu Y, Yu PS (2015) Top-k similarity join in heterogeneous information networks. IEEE Trans Knowl Data Eng 27(6):1710–1723
Zhu M, Papadias D, Zhang J, Lee DL (2005) Top-k spatial joins. IEEE Trans Knowl Data Eng 17(4):567–579
Acknowledgements
The work reported in this paper is partially supported by the NSFC under Grant Number 61370205, the NSF of Shanghai under Grant Number 13ZR1400800, and the Fundamental Research Funds for the Central Universities.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Human participants or animals rights
This article does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Communicated by V. Loia.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Wang, H., Yang, L. & Xiao, Y. SETJoin: a novel top-k similarity join algorithm. Soft Comput 24, 14577–14592 (2020). https://doi.org/10.1007/s00500-020-04807-w
DOI: https://doi.org/10.1007/s00500-020-04807-w