Frontiers of Information Technology & Electronic Engineering

ISSN 2095-9184 (print), ISSN 2095-9230 (online)

Meeting deadlines for approximation processing in MapReduce environments

Abstract: To provide timely results for big data analytics, it is crucial to satisfy deadline requirements for MapReduce jobs in today's production environments. Much effort has been devoted to the problem of meeting deadlines, and two kinds of solutions are typical. The first allocates appropriate resources to complete the entire job before the specified time limit; it misses deadlines when the deadline constraints are tight or resources are lacking. The second runs a pre-constructed sample chosen according to the deadline constraints; it can satisfy the time requirement but fails to maximize the volume of processed data. In this paper, we propose a deadline-oriented task scheduling approach, named 'Dart', to address the above problem. Given a specified deadline and restricted resources, Dart uses an iterative estimation method, based on both historical data and job running status, to precisely estimate the job completion time in real time. Based on the estimated time, Dart uses an approach–revise algorithm to make dynamic scheduling decisions that meet deadlines while maximizing the amount of processed data and mitigating stragglers. Dart also handles task failures and data skew efficiently, so that they do not harm its performance. We have validated our approach using workloads from OpenCloud and Facebook on a cluster of 64 virtual machines. The results show that Dart not only meets the deadline effectively but also processes near-maximum volumes of data, even with tight deadlines and limited resources.

Key words: MapReduce, Approximation jobs, Deadline, Task scheduling, Straggler mitigation
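
The abstract does not give the details of Dart's estimation or scheduling steps, so the following is only a minimal sketch of the general idea it describes: blend a historical task duration with durations observed in the running job, then bound how many pending tasks can still finish before the deadline. All names (`JobStatus`, `estimate_task_secs`, `tasks_that_fit`) and the blending weight are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class JobStatus:
    # All fields are illustrative assumptions, not quantities defined in the paper.
    historical_task_secs: float          # mean task duration from past runs
    elapsed_secs: float                  # time already spent on this job
    slots: int                           # task slots available to the job
    pending_tasks: int                   # tasks not yet scheduled
    observed_task_secs: list = field(default_factory=list)  # finished-task durations


def estimate_task_secs(status: JobStatus, weight: float = 0.5) -> float:
    """Blend history with run-time observations (weight is a hypothetical knob)."""
    if not status.observed_task_secs:
        return status.historical_task_secs
    observed = mean(status.observed_task_secs)
    return weight * status.historical_task_secs + (1 - weight) * observed


def tasks_that_fit(status: JobStatus, deadline_secs: float) -> int:
    """Upper bound on pending tasks that can complete before the deadline."""
    remaining = deadline_secs - status.elapsed_secs
    if remaining <= 0:
        return 0
    # Number of full scheduling waves that fit in the remaining time.
    waves = int(remaining // estimate_task_secs(status))
    return min(status.pending_tasks, waves * status.slots)


status = JobStatus(historical_task_secs=10.0, elapsed_secs=40.0, slots=4,
                   pending_tasks=100, observed_task_secs=[14.0, 14.0])
print(tasks_that_fit(status, deadline_secs=100.0))
```

Calling the estimator again each time a task finishes would give an iterative re-estimate in the spirit of the abstract; straggler mitigation, task failures, and data skew are deliberately left out of this sketch.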

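Chinese Summary: Meeting deadlines for approximation processing in MapReduce environments

Summary: To provide real-time results for big data analytics, it is critical to meet the deadline requirements of MapReduce jobs in today's production environments. Much research has addressed the deadline problem, and two representative methods exist. The first allocates an appropriate amount of resources to finish the entire job before the deadline; it misses the deadline when the deadline is tight or resources are limited. The other runs a sample of pre-determined data volume under the deadline constraint; it meets the deadline but cannot maximize the amount of data processed. In this paper, we propose a deadline-oriented task scheduling method, called "Dart", to solve the above problem. Given a specific deadline and the amount of available resources, Dart uses an iterative estimation method based on historical data and job running status to accurately predict the job completion time. Based on the predicted time, Dart uses an approach-revise algorithm to make dynamic scheduling decisions that maximize the amount of data processed while meeting the deadline and eliminating stragglers. Dart also effectively avoids task failures and data skew, preventing its performance from being affected. Dart is evaluated using OpenCloud and Facebook workloads on a cluster of 64 virtual machines. The results show that Dart effectively meets the deadline and maximizes the amount of data processed under tight deadlines and limited resources.

Keywords: MapReduce; Approximation jobs; Deadline; Task scheduling; Straggler mitigation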


DOI: 10.1631/FITEE.1601056
CLC number: TP311

On-line Access: 2018-01-11
Received: 2016-03-13
Revision Accepted: 2016-06-24
Crosschecked: 2017-11-22

Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952276; Fax: +86-571-87952331; E-mail: jzus@zju.edu.cn
Copyright © 2000~ Journal of Zhejiang University-SCIENCE