Frontiers of Information Technology & Electronic Engineering

ISSN 2095-9184 (print), ISSN 2095-9230 (online)

Dr. Hadoop: an infinite scalable metadata management for Hadoop—How the baby elephant becomes immortal

Abstract: In this exabyte-scale era, data grows at an exponential rate, which in turn generates a massive amount of metadata in the file system. Hadoop is the most widely used framework for dealing with big data, but because of this growth in metadata, many researchers have questioned its efficiency. It is therefore essential to create efficient and scalable metadata management for Hadoop. Hash-based mapping and subtree partitioning are both suitable for distributed metadata management schemes. Subtree partitioning does not distribute workload uniformly among the metadata servers, so metadata must be migrated to keep the load roughly balanced. Hash-based mapping distributes the load uniformly among NameNodes, which are the metadata servers of Hadoop, but it is constrained in preserving metadata locality. In this paper, we present a circular metadata management mechanism named dynamic circular metadata splitting (DCMS). DCMS preserves metadata locality using consistent hashing and locality-preserving hashing, keeps replicated metadata for reliability, and dynamically distributes metadata among the NameNodes to keep the load balanced. The NameNode is the centralized heart of Hadoop: it keeps the directory tree of all files, so its failure makes it a single point of failure (SPOF). DCMS removes Hadoop’s SPOF and provides efficient and scalable metadata management. The new framework is named ‘Dr. Hadoop’ after the names of the authors.
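The abstract names consistent hashing and replicated metadata but gives no implementation detail, so the following is only a minimal sketch of the general technique, not the DCMS algorithm itself. The class `MetadataRing`, the helper `ring_position`, the NameNode names, the use of SHA-1, and the choice to place replicas on the successor nodes of the ring are all assumptions made for illustration.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a point on a 2**32 ring (SHA-1 is an arbitrary choice)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class MetadataRing:
    """Hypothetical ring of NameNodes; an illustration, not the paper's DCMS."""

    def __init__(self, namenodes, replicas=3):
        self.replicas = replicas
        # Sort the NameNodes by their position on the ring.
        self._ring = sorted((ring_position(n), n) for n in namenodes)
        self._points = [point for point, _ in self._ring]

    def owners(self, path: str):
        """Return the primary NameNode for `path`, then its successors as replicas."""
        size = len(self._ring)
        # The primary is the first node clockwise from the key's position.
        start = bisect.bisect_right(self._points, ring_position(path)) % size
        count = min(self.replicas, size)
        return [self._ring[(start + i) % size][1] for i in range(count)]


ring = MetadataRing(["nn-a", "nn-b", "nn-c", "nn-d"], replicas=3)
owners = ring.owners("/user/alice/data/part-0001")
print(owners)  # primary first, then the assumed replica holders
```

In a ring like this, adding a NameNode changes the primary only for keys that fall in the arc the new node takes over. Every other key keeps its owner, which is the property that makes consistent hashing attractive when metadata servers are added or removed.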

Key words: Hadoop, NameNode, Metadata, Locality-preserving hashing, Consistent hashing

Chinese Summary (translated): Dr. Hadoop: an infinitely scalable metadata management mechanism for Hadoop. How does the baby elephant stay forever young?

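Objective: In this exabyte era, data volume grows exponentially over time, and this surge generates a large amount of metadata in the file system. Although Hadoop is the most widely adopted software framework for processing big data, researchers have widely questioned its efficiency. An efficient and scalable metadata management mechanism for Hadoop is needed.
Innovation: Hash-based mapping and subtree partitioning are both suitable for distributed metadata management schemes. Hash-based mapping distributes load evenly among NameNodes (the servers that store metadata in Hadoop) but is constrained in preserving the spatial locality of metadata. Subtree partitioning does not distribute workload evenly among servers, so metadata must be migrated to keep the load balanced. This paper proposes a circular metadata management mechanism called DCMS (dynamic circular metadata splitting) (Fig. 3) and builds on it an improved Hadoop framework, Dr. Hadoop ("Dr." comes from the initials of the authors' names, Dipayan DEV and Ripon PATGIRI). The NameNode is the core of Hadoop: it stores the directory tree of all files, and its failure is a single point of failure (SPOF). DCMS removes this single point of failure and thereby provides an efficient and scalable metadata management mechanism.
Method: Locality-preserving hashing (LpH) preserves the spatial locality of metadata, consistent hashing keeps the load balanced among servers, and replicated metadata provides high reliability.
Conclusions: Theoretical analysis shows that the Dr. Hadoop architecture is reliably available 99.99% of the time. Measured by data throughput, fault tolerance, and NameNode load, DCMS is more effective than traditional approaches on large-scale file systems.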
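The locality constraint of plain hash-based mapping, and the idea behind a locality-preserving key, can be illustrated with a small sketch. The summary does not specify how LpH is constructed, so keying each file by its parent directory is purely an assumption here, as are the function names `bucket`, `place_by_full_path`, and `place_by_parent_dir`, the example paths, and the simple modulo placement over four servers.

```python
import hashlib
import posixpath


def bucket(key: str, num_servers: int) -> int:
    """Hash a key to a server index (SHA-1 and modulo are illustrative choices)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_servers


def place_by_full_path(path: str, num_servers: int) -> int:
    """Plain hash-based mapping: sibling files may land on different servers."""
    return bucket(path, num_servers)


def place_by_parent_dir(path: str, num_servers: int) -> int:
    """Assumed locality-preserving key: every file in a directory shares one server."""
    return bucket(posixpath.dirname(path), num_servers)


paths = [f"/logs/2015/12/part-{i:04d}" for i in range(8)]
print(sorted({place_by_full_path(p, 4) for p in paths}))   # servers used by siblings
print(sorted({place_by_parent_dir(p, 4) for p in paths}))  # always a single server
```

Keying by directory keeps a directory listing on one server, at the cost of less even load when directories differ greatly in size. That tension is presumably why the method also relies on consistent hashing and dynamic redistribution for balance.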

Key words (translated from Chinese): Hadoop; NameNode; Metadata; Locality-preserving hashing; Consistent hashing


DOI: 10.1631/FITEE.1500015
CLC number: TP311

On-line Access: 2016-01-05
Received: 2015-01-12
Revision Accepted: 2015-06-11
Crosschecked: 2015-12-25

Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952276; Fax: +86-571-87952331; E-mail: jzus@zju.edu.cn
Copyright © 2000~ Journal of Zhejiang University-SCIENCE