|
Frontiers of Information Technology & Electronic Engineering
ISSN 2095-9184 (print), ISSN 2095-9230 (online)
2017 Vol.18 No.10 P.1499-1510
FrepJoin: an efficient partition-based algorithm for edit similarity join
Abstract: String similarity join (SSJ) is essential for many applications where near-duplicate objects need to be found. This paper targets SSJ with edit distance constraints. The existing algorithms usually adopt the filter-and-refine framework. They cannot catch the dissimilarity between string subsets, and do not fully exploit the statistics such as the frequencies of characters. We investigate to develop a partition-based algorithm by using such statistics. The frequency vectors are used to partition datasets into data chunks with dissimilarity between them being caught easily. A novel algorithm is designed to accelerate SSJ via the partitioned data. A new filter is proposed to leverage the statistics to avoid computing edit distances for a noticeable proportion of candidate pairs which survive the existing filters. Our algorithm outperforms alternative methods notably on real datasets.
Key words: String similarity join, Edit distance, Filter and refine, Data partition, Combined frequency vectors
关键词组:
References:
Open peer comments: Debate/Discuss/Question/Opinion
<1>
DOI:
10.1631/FITEE.1601347
CLC number:
TP311.13
Download Full Text:
Downloaded:
1990
Download summary:
<Click Here>Downloaded:
1644Clicked:
6718
Cited:
0
On-line Access:
2024-08-27
Received:
2023-10-17
Revision Accepted:
2024-05-08
Crosschecked:
2017-11-06