Journal of Zhejiang University

Frontiers of Information Technology & Electronic Engineering 2017 Vol.18 No.11 P.1754-1772

Meeting deadlines for approximation processing in MapReduce environments

Author(s): Ming-hao Hu, Chang-jian Wang, Yu-xing Peng
Affiliation(s): 1. National Laboratory for Parallel and Distributed Processing, School of Computer, National University of Defense Technology, Changsha 410073, China
Corresponding email(s): minghao_hu@yeah.net
Key Words: MapReduce, Approximation jobs, Deadline, Task scheduling, Straggler mitigation

Share this article to： More <<< Previous Article \|Next Article >>>

Ming-hao Hu, Chang-jian Wang, Yu-xing Peng. Meeting deadlines for approximation processing in MapReduce environments[J]. Frontiers of Information Technology & Electronic Engineering, 2017, 18(11): 1754-1772.

@article{title="Meeting deadlines for approximation processing in MapReduce environments",
author="Ming-hao Hu, Chang-jian Wang, Yu-xing Peng",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="18",
number="11",
pages="1754-1772",
year="2017",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.1601056"
}

%0 Journal Article
%T Meeting deadlines for approximation processing in MapReduce environments
%A Ming-hao Hu
%A Chang-jian Wang
%A Yu-xing Peng
%J Frontiers of Information Technology & Electronic Engineering
%V 18
%N 11
%P 1754-1772
%@ 2095-9184
%D 2017
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.1601056

TY - JOUR
T1 - Meeting deadlines for approximation processing in MapReduce environments
A1 - Ming-hao Hu
A1 - Chang-jian Wang
A1 - Yu-xing Peng
J0 - Frontiers of Information Technology & Electronic Engineering
VL - 18
IS - 11
SP - 1754
EP - 1772
%@ 2095-9184
Y1 - 2017
PB - Zhejiang University Press & Springer
ER -
DOI - 10.1631/FITEE.1601056

Abstract
Chinese Summary
Academic Network
Reviewer Comment

Abstract: To provide timely results for big data analytics, it is crucial to satisfy deadline requirements for mapReduce jobs in today’s production environments. Much effort has been devoted to the problem of meeting deadlines, and typically there exist two kinds of solutions. The first is to allocate appropriate resources to complete the entire job before the specified time limit, where missed deadlines result because of tight deadline constraints or lack of resources; the second is to run a pre-constructed sample based on deadline constraints, which can satisfy the time requirement but fail to maximize the volumes of processed data. In this paper, we propose a deadline-oriented task scheduling approach, named ‘’Dart’, to address the above problem. Given a specified deadline and restricted resources, Dart uses an iterative estimation method, which is based on both historical data and job running status to precisely estimate the real-time job completion time. Based on the estimated time, Dart uses an approach–revise algorithm to make dynamic scheduling decisions for meeting deadlines while maximizing the amount of processed data and mitigating stragglers. Dart also efficiently handles task failures and data skew, protecting its performance from being harmed. We have validated our approach using workloads from OpenCloud and Facebook on a cluster of 64 virtual machines. The results show that Dart can not only effectively meet the deadline but also process near-maximum volumes of data even with tight deadlines and limited resources.

满足MapReduce环境下近似处理的时限要求

概要：为了向大数据分析提供实时结果，在现今的生产环境中满足MapReduce作业的时限要求是非常关键的。许多研究致力于解决时限要求的问题，目前存在两种代表性的方法。第一种是分配适量资源以在时限前完成整个作业，在时限紧迫或资源受限时，该方法会错过时限；另一种是在时限约束下运行预数据量的样本，该方法能满足时限但无法使数据量最大化。在本文中，我们提出一个时限-导向的任务调度方法来解决上述问题，称为"Dart"。给定具体的时限和可用资源量时，Dart使用基于历史数据和作业运行状态的迭代估计法准确预测作业完成时间。基于时间预测，Dart法采用接近-修改算法做出动态调度决策，在满足时限的情况下将可处理数据量最大化并消除掉队任务。同时Dart法可有效地避免任务失败和数据倾斜，防止其性能受影响。在包含64个虚拟机的集群上使用OpenCloud和Facebook的工作负载对Dart法进行评估。结果表明Dart法在时限紧迫和资源受限情况下能有效满足时限并将处理数据量最大化。

关键词：MapReduce；近似作业；时限；任务调度；掉队任务消除

Darkslateblue:Affiliate; Royal Blue:Author; Turquoise:Article

Reference

[1]Acharya, S., Gibbons, P., Poosala, V., 1999. Aqua: a fast decision support system using hboxapproximate query hboxanswers. Proc. 25th Int. Conf. on Very Large Data Bases, p.754-757.

[2]Agarwal, S., Mozafari, B., Panda, A., et al., 2013. Blinkdb: queries with bounded errors and bounded response times on very large data. Proc. 8th ACM European Conf. on Computer Systems, p.29-42.

[3]Ananthanarayanan, G., Kandula, S., Greenberg, A.G., et al., 2010. Reining in the outliers in Map-Reduce clusters using Mantri. Proc. 10th USENIX Symp. on Operating Systems Design and Implementation, p.24-38.

[4]Ananthanarayanan, G., Ghodsi, A., Shenker, S., et al., 2013. Effective straggler mitigation: attack of the clones. Proc. 10th USENIX Symp. on Networked Systems Design and Implementation, p.185-198.

[5]Ananthanarayanan, G., Hung, M.C.C., Ren, X., et al., 2014. Grass: trimming stragglers in approximation analytics. Proc. 11th USENIX Symp. on Networked Systems Design and Implementation, p.289-302.

[6]Apache, 2016. The Apache Hadoop Project. http://hadoop.apache.org/

[7]Bates, D.M., Watts, D.G., 1988. Nonlinear regression inference using the linear approximation. In: Jantsch, E., Waddington, C. (Eds.), Nonlinear Regression: Iterative Estimation and Linear Approximations. Wiley Online Library, p.142-167.

[8]Bell Laboratories, 2001. Approximate Query Processing: Taming the Terabytes. http://www.vldb.org/conf/2001/tut4.pdf

[9]Chen, Y., Ganapathi, A., Griggith, R., et al., 2011. The case for evaluating MapReduce performance using workload suites. Proc. IEEE 19th Int. Symp. on Modeling, Analysis ’ Simulation of Computer and Telecommunication Systems.

[10]Chen, Y., Alspaugh, S., Katz, R., 2012. Interactive analytical processing in big data systems: a cross-industry study of MapReduce workloads. Proc. VLDB Endow., 5(12):1802-1813.

[11]Chowdhury, M., Zaharia, M., Ma, J., et al., 2011. Managing data transfers in computer clusters with orchestra. SIGCOMM Comput. Commun. Rev., 41(4):98-109.

[12]Chowdhury, M., Zhong, Y., Stoica, I., 2014. Efficient coflow scheduling with varys. SIGCOMM Comput. Commun. Rev., 44(4):443-454.

[13]Cloudera, 2013. Statistical Workload Injector for MapReduce. https://github.com/SWIMProjectUCB/SWIM

[14]Dean, J., Ghemawat, S., 2008. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107-113.

[15]Ferguson, A.D., Bodik, P., Kandula, S., 2012. Jockey: guaranteed job latency in data parallel clusters. Proc. 7th ACM European Conf. on Computer Systems, p.99-112.

[16]Herodotou, H., Lim, H., Luo, G., 2011. Starfish: a self-tuning system for big data analytics. Proc. 7th Biennial Conf. on Innovative Data Systems Research, p.261-272.

[17]Hu, M., Wang, C., You, P., et al., 2015. Deadline-oriented task scheduling for mapreduce environments. LNCS, 9529:359-372.

[18]Kc, K., Anyanwu, K., 2010. Scheduling Hadoop jobs to meet deadlines. IEEE 2nd Int. Conf. on Cloud Computing Technology and Science, p.388-392.

[19]Li, S., Hu, S., Wang, S., et al., 2014. Woha: deadline-aware Map-Reduce workflow scheduling framework over Hadoop clusters. IEEE 34th Int. Conf. on Distributed Computing Systems, p.93-103.

[20]Liu, J., Shih, K., Lin, W., et al., 1994. Imprecise computations. Proc. IEEE, 82:83-94.

[21]Lohr, S., 2009. Simple probability samples. In: Sampling: Design and Analysis. Addison-Wesley, London, p.35-67.

[22]Marquardt, D.W., 1963. An algorithm for least-squares estimation of nonlinear parameters. J. Soc. Ind. Appl. Math., 11(2):431-441.

[23]Morton, K., Balazinska, M., Grossman, D., 2010a. ParaTimer: a progress indicator for MapReduce dags. Proc. ACM SIGMOD Int. Conf. on Management of Data, p.507-518.

[24]Morton, K., Friesen, A., Balazinska, M., et al., 2010b. Estimating the progress of MapReduce pipelines. Proc. IEEE 26th Int. Conf. on Data Engineering, p.681-684.

[25]Motulsky, H.J., Ransnas, L.A., 1987. Fitting curves to data using nonlinear regression: a practical and nonmathematical review. FASEB J., 1(5):365-374.

[26]OREILLY, 2013. Interactive Big Data Analysis Using Approximate Answers. https://tinyurl.com/k5favda

[27]Polo, J., Carrera, D., Becerra, Y., et al., 2010. Performance-driven task co-scheduling for MapReduce environments. Proc. IEEE Int. Congress on Network Operations and Management Symp., p.373-380.

[28]Ren, K., Kwon, Y., Balazinska, M., et al., 2013. Hadoop’s adolescence: an analysis of Hadoop usage in scientific workloads. Proc. VLDB Endow., 6(10):853-864.

[29]Vavilapalli, V.K., Murthy, A.C., Douglas, C., et al., 2013. Apache Hadoop Yarn: yet another resource negotiator. Proc. 4th Annual Symp. on Cloud Computing, p.5:1-5:16.

[30]Venkataraman, S., Panda, A., Ananthanarayanan, G., et al., 2007. The power of choice in data-aware cluster scheduling. Proc. 11th USENIX Symp. on Operating Systems Design and Implementation, p.301-316.

[31]Verma, A., Cherkasova, L., Campbell, R.H., 2011. Aria: automatic resource inference and allocation for MapReduce environments. Proc. 8th ACM Int. Conf. on Autonomic Computing, p.235-244.

[32]Verma, A., Cherkasova, L., Kumar, V.S., et al., 2012. Deadline-based workload management for MapReduce environments: pieces of the performance puzzle. Proc. IEEE Int. Congress on Network Operations and Management Symp., p.900-905.

[33]Wang, C., Peng, Y., Tang, M., et al., 2014. MapCheckReduce: an improved MapReduce computing model for imprecise applications. Proc. IEEE Int. Congress on Big Data, p.366-373.

[34]Wang, X., Shen, D., Bai, M., et al., 2015. SAMES: deadline-constraint scheduling in MapReduce. Front. Comput. Sci., 9(1):128-141.

[35]Zacheilas, N., Kalogeraki, V., 2014. Real-time scheduling of skewed MapReduce jobs in heterogeneous environments. Proc. 11th Int. Conf. on Autonomic Computing, p.189-200.

[36]Zaharia, M., Konwinski, A., Joseph, A.D., et al., 2008. Improving MapReduce performance in heterogeneous environments. Proc. 8th USENIX Symp. on Operating Systems Design and Implementation, p.7-21.

[37]Zaharia, M., Borthakur, D., Sen, S., et al., 2010. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. Proc. 5th European Conf. on Computer Systems, p.265-278.

Open peer comments: Debate/Discuss/Question/Opinion

<1>