CLC number: TP311.13
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2017-11-06
Cited: 0
Clicked: 6442
Ji-zhou Luo, Sheng-fei Shi, Hong-zhi Wang, Jian-zhong Li. FrepJoin: an efficient partition-based algorithm for edit similarity join[J]. Frontiers of Information Technology & Electronic Engineering, 2017, 18(10): 1499-1510.
@article{title="FrepJoin: an efficient partition-based algorithm for edit similarity join",
author="Ji-zhou Luo, Sheng-fei Shi, Hong-zhi Wang, Jian-zhong Li",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="18",
number="10",
pages="1499-1510",
year="2017",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.1601347"
}
%0 Journal Article
%T FrepJoin: an efficient partition-based algorithm for edit similarity join
%A Ji-zhou Luo
%A Sheng-fei Shi
%A Hong-zhi Wang
%A Jian-zhong Li
%J Frontiers of Information Technology & Electronic Engineering
%V 18
%N 10
%P 1499-1510
%@ 2095-9184
%D 2017
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.1601347
TY - JOUR
T1 - FrepJoin: an efficient partition-based algorithm for edit similarity join
A1 - Ji-zhou Luo
A1 - Sheng-fei Shi
A1 - Hong-zhi Wang
A1 - Jian-zhong Li
J0 - Frontiers of Information Technology & Electronic Engineering
VL - 18
IS - 10
SP - 1499
EP - 1510
%@ 2095-9184
Y1 - 2017
PB - Zhejiang University Press & Springer
ER -
DOI - 10.1631/FITEE.1601347
Abstract: string similarity join (SSJ) is essential for many applications where near-duplicate objects need to be found. This paper targets SSJ with edit distance constraints. The existing algorithms usually adopt the filter-and-refine framework. They cannot catch the dissimilarity between string subsets, and do not fully exploit the statistics such as the frequencies of characters. We investigate to develop a partition-based algorithm by using such statistics. The frequency vectors are used to partition datasets into data chunks with dissimilarity between them being caught easily. A novel algorithm is designed to accelerate SSJ via the partitioned data. A new filter is proposed to leverage the statistics to avoid computing edit distances for a noticeable proportion of candidate pairs which survive the existing filters. Our algorithm outperforms alternative methods notably on real datasets.
[1]Afrati, F.N., Sarma, A.D., Menestrina, D., et al., 2012. Fuzzy joins using MapReduce. Int. Conf. on Data Engineering, p.498-509.
[2]Arasu, A., Ganti, V., Kaushik, R., 2006. Efficient exact set-similarity joins. Int. Conf. on Very Large Data Bases, p.918-929.
[3]Bayardo, R.J., Ma, Y., Srikant, R., 2007. Scaling up all pairs similarity search. Int. World Wide Web Conf., p.131-140.
[4]Chaudhuri, S., Ganjam, K., Ganti, V., et al., 2003. Robust and efficient fuzzy match for online data cleaning. Int. SIGMOD Conf. on Management of Data, p.313-324.
[5]Chaudhuri, S., Ganti, V., Kaushik, R., 2006a. Data debugger: an operator-centric approach for data quality solutions. IEEE Data Eng. Bull., 29(2):60-66.
[6]Chaudhuri, S., Ganti, V., Kaushik, R., 2006b. A primitive operator for similarity joins in data cleaning. Int. Conf. on Data Engineering, p.687-698.
[7]Dong, X., Halevy, A.Y., Yu, C., 2007. Data integration with uncertainty. Int. Conf. on Very Large Data Bases, p.687-698.
[8]Feng, J., Wang, J., Li, G., 2012. Trie-join: a trie-based method for efficient string similarity joins. VLDB J., 21(4):437-461.
[9]Ge, T., Li, Z., 2011. Approximate substring matching over uncertain strings. Proc. VLDB Endow., {bf 4}(11):772-782.
[10]Gravano, L., Ipeirotis, P.G., Jagadish, H.V., et al., 2001. Approximate string joins in a database (almost) for free. Int. Conf. on Very Large Data Bases, p.491-500.
[11]Hadjieleftheriou, M., Srivastava, D., 2010. Weighted Set-Based String Similarity. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. AT&T Lab-Research.
[12]Henzinger, M.R., 2006. Finding near-duplicate web pages: a large-scale evaluation of algorithms. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, p.284-291.
[13]Ji, S., Li, G., Li, C., et al., 2009. Efficient interactive fuzzy keyword search. Int. World Wide Web Conf., p.371-380.
[14]Li, C., Lu, J., Lu, Y., 2008. Efficient merging and filtering algorithms for approximate string searches. Int. Conf. on Data Engineering, p.257-266.
[15]Li, G., Deng, D., Wang, J., et al., 2011. Pass-Join: a partition-based method for similarity joins. Proc. VLDB Endow., {bf 5}(3):253-264.
[16]Metwally, A., Agrawal, D., Abbadi, A.E., 2007. Detectives: detecting coalition hit inflation attacks in advertising networks streams. Int. World Wide Web Conf., p.241-250.
[17]Navarro, G., Salmela, L., 2009. Indexing variable length substrings for exact and approximate matching. Int. Symp. on String Processing and Information Retrieval, p.214-221.
[18]Qin, J., Wang, W., Xiao, C., et al., 2013. Vchunkjoin: an efficient algorithm for edit similarity joins. Trans. Knowl. Dat. Eng., {bf 25}(8):1916-1929.
[19]Sarawagi, S., Kirpal, A., 2004. Efficient set joins on similarity predicates. Int. SIGMOD Conf. on Management of Data, p.743-754.
[20]Scott, D.W., 1979. On optimal and data-based histograms. Biometrika, 66:605-610.
[21]Vernica, R., Carey, M.J., Li, C., 2010. Efficient parallel set-similarity joins using MapReduce. Int. SIGMOD Conf. on Management of Data, p.495-506.
[22]Wang, J., Li, G., Feng, J., 2011. Fast-join: an efficient method for fuzzy token matching based string similarity join. Int. Conf. on Data Engineering, p.458-469.
[23]Wang, W., Xiao, C., Lin, X., et al., 2009. Efficient approximate entity extraction with edit distance constraints. Int. SIGMOD Conf. on Management of Data, p.759-770.
[24]Xiao, C., Wang, W., Lin, X., 2008a. Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. Proc. VLDB Endow., {bf 1}(1):933-944.
[25]Xiao, C., Wang, W., Lin, X., et al., 2008b. Efficient similarity joins for near duplicate detection. Int. World Wide Web Conf., p.131-140.
Open peer comments: Debate/Discuss/Question/Opinion
<1>