Publishing Service

Polishing & Checking

Frontiers of Information Technology & Electronic Engineering

ISSN 2095-9184 (print), ISSN 2095-9230 (online)

FrepJoin: an efficient partition-based algorithm for edit similarity join

Abstract: String similarity join (SSJ) is essential for many applications where near-duplicate objects need to be found. This paper targets SSJ with edit distance constraints. The existing algorithms usually adopt the filter-and-refine framework. They cannot catch the dissimilarity between string subsets, and do not fully exploit the statistics such as the frequencies of characters. We investigate to develop a partition-based algorithm by using such statistics. The frequency vectors are used to partition datasets into data chunks with dissimilarity between them being caught easily. A novel algorithm is designed to accelerate SSJ via the partitioned data. A new filter is proposed to leverage the statistics to avoid computing edit distances for a noticeable proportion of candidate pairs which survive the existing filters. Our algorithm outperforms alternative methods notably on real datasets.

Key words: String similarity join, Edit distance, Filter and refine, Data partition, Combined frequency vectors

Chinese Summary  <21> 频率连接:基于数据划分的一种高效字符串相似性连接算法

概要:字符串相似性连接(string similarity join, SSJ)在很多应用中,特别是在需要找出重复对象的应用中发挥着关键作用。本文关注基于编辑距离的字符串相似性连接。现有算法大多采用先过滤再细化的框架,使得它们很难发现和利用字符串子集间的不相似性,也很难利用如字符频率这样的统计信息。本研究提出了一种基于数据划分的字符串相似性连接算法,它充分利用了这种统计信息。采用频率向量将字符串集划分成一系列较小的子集,使得子集之间的不相似性很容易被发现。本文提出的新算法利用划分后的数据高效地对字符串进行相似性。此外,本文还给出了一个新的过滤器,它能利用字符频率来过滤很多能够通过现有过滤器的不相似字符串。真实数据集上的试验表明,本文提出的算法性能较现有算法有较大幅度的提升。

关键词组:字符串相似性连接;编辑距离;过滤再细化;数据划分;组合频率向量


Share this article to: More

Go to Contents

References:

<Show All>

Open peer comments: Debate/Discuss/Question/Opinion

<1>

Please provide your name, email address and a comment





DOI:

10.1631/FITEE.1601347

CLC number:

TP311.13

Download Full Text:

Click Here

Downloaded:

1734

Download summary:

<Click Here> 

Downloaded:

1489

Clicked:

5868

Cited:

0

On-line Access:

2017-12-04

Received:

2016-06-17

Revision Accepted:

2016-12-15

Crosschecked:

2017-11-06

Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952276; Fax: +86-571-87952331; E-mail: jzus@zju.edu.cn
Copyright © 2000~ Journal of Zhejiang University-SCIENCE