Frontiers of Information Technology & Electronic Engineering

ISSN 2095-9184 (print), ISSN 2095-9230 (online)

A non-group parallel frequent pattern mining algorithm based on conditional patterns

Abstract: Frequent itemset mining is the main method of association rule mining. Because computing space and performance are limited, mining associations among frequent items in large datasets takes considerable time and resources, and the cost grows as the datasets grow. In big data environments, the MapReduce programming model is typically used to partition the mining task and process the parts in parallel, which improves the execution efficiency of the algorithm. However, to ensure that association rules are not destroyed during task partitioning and parallel processing, the inner-relationship data must also be stored. Because these data are redundant, storing them significantly increases space usage relative to the original dataset. In this study, we find that the frequent pattern (FP) mining algorithm depends mainly on the conditional pattern bases. In the parallel frequent pattern (PFP) algorithm, a grouping model divides the frequent items into several groups according to their frequencies. We propose a non-group PFP (NG-PFP) mining algorithm that removes the grouping model and reduces the data redundancy between sub-tasks. We describe how NG-PFP performs task partitioning and parallel processing, and we analyze and discuss its performance in a Hadoop cluster environment. Experimental results indicate that the non-group model clearly improves both computational efficiency and the space utilization rate.
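To make the term "conditional pattern base" concrete, the following is a minimal single-process sketch in Python. It is not the NG-PFP implementation and does not use MapReduce or Hadoop. The function names, the alphabetical tie-breaking rule, and the toy transactions are all assumptions made for illustration. For each frequent item, it collects the prefixes that precede the item once every transaction is filtered to frequent items and sorted by descending support, which matches the prefix paths an FP-tree would store for that item.

```python
from collections import Counter, defaultdict


def frequent_item_order(transactions, min_support):
    """Map each frequent item to its rank (0 = highest support)."""
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_support}
    # Assumption: ties are broken alphabetically to keep the order deterministic.
    ranked = sorted(frequent, key=lambda i: (-frequent[i], i))
    return {item: rank for rank, item in enumerate(ranked)}


def conditional_pattern_bases(transactions, min_support):
    """Return {item: {prefix_path: count}} for every frequent item
    that has at least one non-empty prefix path."""
    rank = frequent_item_order(transactions, min_support)
    bases = defaultdict(Counter)
    for t in transactions:
        # Keep only frequent items, ordered by descending support.
        ordered = sorted((i for i in set(t) if i in rank), key=rank.__getitem__)
        # Everything before position pos is a prefix path of ordered[pos].
        for pos in range(1, len(ordered)):
            bases[ordered[pos]][tuple(ordered[:pos])] += 1
    return {item: dict(paths) for item, paths in bases.items()}


# Hypothetical toy dataset, not taken from the paper.
transactions = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "c"],
    ["b", "c"],
    ["a", "d"],
]
bases = conditional_pattern_bases(transactions, min_support=2)
for item, paths in sorted(bases.items()):
    print(item, paths)
```

In this sketch each transaction contributes to the bases independently of the others, which is what makes the per-item prefix counts easy to accumulate in parallel. How NG-PFP actually distributes this work across sub-tasks is described in the full paper, not here.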

Key words: Frequent pattern mining, Parallel algorithm, Conditional pattern bases, MapReduce, Big data

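Chinese Summary (translated): A non-group parallel frequent pattern mining algorithm based on conditional patterns

Abstract: Frequent itemset mining is the main method of association rule mining. Because of limits on computing space and performance, mining associations among frequent items requires a great deal of time and resources, especially when datasets grow sharply. In association data mining in big data environments, the MapReduce model is usually used for task partitioning and parallel processing, which improves the execution efficiency of the algorithm. To ensure that association rules are not destroyed during task partitioning and parallel processing, the inner-relationship data must be stored in the computer space. Compared with the original dataset, storing the redundant inner-relationship data significantly increases space usage. The study finds that the formation of the frequent pattern mining algorithm depends mainly on the conditional pattern bases. Based on the parallel frequent pattern (PFP) algorithm theory, this paper proposes a non-group PFP (NG-PFP) mining algorithm. The algorithm removes the grouping model and reduces data redundancy between sub-tasks. Experimental results show that the non-group model significantly improves both computational efficiency and the space utilization rate.

Key words: Frequent pattern mining; Parallel algorithm; Conditional pattern bases; MapReduce; Big data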







DOI: 10.1631/FITEE.1800467

CLC number: TP301


On-line Access: 2019-10-08

Received: 2018-08-05

Revision Accepted: 2018-12-18

Crosschecked: 2019-06-23

Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952276; Fax: +86-571-87952331; E-mail: jzus@zju.edu.cn
Copyright © 2000~ Journal of Zhejiang University-SCIENCE