Full Text:   <2417>

CLC number: TP309.3

On-line Access: 2010-04-28

Received: 2009-07-22

Revision Accepted: 2009-11-01

Crosschecked: 2009-12-31

Cited: 3

Clicked: 6240

Citations:  Bibtex RefMan EndNote GB/T7714

-   Go to

Article info.
1. Reference List
Open peer comments

Journal of Zhejiang University SCIENCE C 2010 Vol.11 No.5 P.315-327


Scalable high performance de-duplication backup via hash join

Author(s):  Tian-ming Yang, Dan Feng, Zhong-ying Niu, Ya-ping Wan

Affiliation(s):  Wuhan National Laboratory for Optoelectronics, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China

Corresponding email(s):   dfeng@hust.edu.cn

Key Words:  Backup system, De-duplication, Post-processing, Fingerprint lookup, Scalability

Share this article to: More |Next Article >>>

Tian-ming Yang, Dan Feng, Zhong-ying Niu, Ya-ping Wan. Scalable high performance de-duplication backup via hash join[J]. Journal of Zhejiang University Science C, 2010, 11(5): 315-327.

@article{title="Scalable high performance de-duplication backup via hash join",
author="Tian-ming Yang, Dan Feng, Zhong-ying Niu, Ya-ping Wan",
journal="Journal of Zhejiang University Science C",
publisher="Zhejiang University Press & Springer",

%0 Journal Article
%T Scalable high performance de-duplication backup via hash join
%A Tian-ming Yang
%A Dan Feng
%A Zhong-ying Niu
%A Ya-ping Wan
%J Journal of Zhejiang University SCIENCE C
%V 11
%N 5
%P 315-327
%@ 1869-1951
%D 2010
%I Zhejiang University Press & Springer
%DOI 10.1631/jzus.C0910445

T1 - Scalable high performance de-duplication backup via hash join
A1 - Tian-ming Yang
A1 - Dan Feng
A1 - Zhong-ying Niu
A1 - Ya-ping Wan
J0 - Journal of Zhejiang University Science C
VL - 11
IS - 5
SP - 315
EP - 327
%@ 1869-1951
Y1 - 2010
PB - Zhejiang University Press & Springer
ER -
DOI - 10.1631/jzus.C0910445

Apart from high space efficiency, other demanding requirements for enterprise de-duplication backup are high performance, high scalability, and availability for large-scale distributed environments. The main challenge is reducing the significant disk input/output (I/O) overhead as a result of constantly accessing the disk to identify duplicate chunks. Existing inline de-duplication approaches mainly rely on duplicate locality to avoid disk bottleneck, thus suffering from degradation under poor duplicate locality workload. This paper presents Chunkfarm, a post-processing de-duplication backup system designed to improve capacity, throughput, and scalability for de-duplication. Chunkfarm performs de-duplication backup using the hash join algorithm, which turns the notoriously random and small disk I/Os of fingerprint lookups and updates into large sequential disk I/Os, hence achieving high write throughput not influenced by workload locality. More importantly, by decentralizing fingerprint lookup and update, Chunkfarm supports a cluster of servers to perform de-duplication backup in parallel; it hence is conducive to distributed implementation and thus applicable to large-scale and distributed storage systems.

Darkslateblue:Affiliate; Royal Blue:Author; Turquoise:Article


[1]Agrawal, N., Bolosky, W.J., Douceur, J.R., Lorch, J.R., 2007. A Five-Year Study of File-System Metadata. Proc. 5th USENIX Conf. on File and Storage Technologies, p.31-45.

[2]Bhagwat, D., Eshghi, K., Long, D.D.E., Lillibridge, M., 2009. Extreme Binning: Scalable, Parallel Deduplication for Chunk-Based File Backup. Proc. 17th IEEE Int. Symp. on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems.

[3]Blomer, J., Kalfane, M., Karpinski, M., Karp, R., Luby, M., Zuckerman, D., 1995. A XOR-Based Erasure Resilient Coding Scheme. International Computer Science Institute Technical Report, No. TR-95-048.

[4]Bloom, B., 1970. Space/Time trade-offs in hash coding with allowable errors. Commun. ACM., 13(7):422-426.

[5]Broder, A., Mitzenmacher, M., 2004. Network applications of Bloom filters: a survey. Internet Math., 1(4):485-509.

[6]Dubnicki, C., Gryz, L., Heldt, L., Kaczmarczyk, M., Kilian, W., Strzelczak, P., Szczepkowski, J., Ungureanu, C., Welnicki, M., 2009. Hydrastor: a Scalable Secondary Storage. Proc. 7th USENIX Conf. on File and Storage Technologies, p.197-210.

[7]Eshghi, K., Lillibridge, M., Wilcock, L., Belrose, G., Hawkes, R., 2007. Jumbo Store: Providing Efficient Incremental Upload and Versioning for a Utility Rendering Service. Proc. 5th USENIX Conf. on File and Storage Technologies, p.22-37.

[8]Lillibridge, M., Eshghi, K., Bhagwat, D., Deolalikar, V., Trezise, G., Camble, P., 2009. Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality. Proc. 7th USENIX Conf. on File and Storage Technologies, p.111-123.

[9]Muthitacharoen, A., Chen, B., Mazieres, D., 2001. A Low-Bandwidth Network File System. Proc. 18th ACM Symp. on Operating Systems Principles, p.174-187.

[10]Quinlan, S., Dorward, S., 2002. Venti: a New Approach to Archival Storage. Proc. USENIX Conf. on File and Storage Technologies, p.89-101.

[11]Rhea, S., Cox, R., Pesterev, A., 2008. Fast, Inexpensive Content-Addressed Storage in Foundation. Proc. USENIX Annual Technical Conf., p.143-156.

[12]Secure Hash Standard, 1995. Department of Commerce/ NIST, National Technical Information Service, Springfield, VA, USA.

[13]Shapiro, L.D., 1986. Join processing in database systems with large main memories. ACM Trans. Database Syst., 11(3):239-264.

[14]Tanenbaum, A.S., Herder, J.N., Bos, H., 2006. File size distribution on UNIX systems: then and now. ACM SIGOPS Oper. Syst. Rev., 40(1):100-104.

[15]You, L., Pollack, K., Long, D.D.E., 2005. Deep Store: an Archival Storage System Architecture. Proc. 21st Int. Conf. on Data Engineering, p.804-815.

[16]Zeng, L.F., Zhou, K., Shi, Z., Feng, D., Wang, F., Xie, C.S., Li, Z.T., Yu, Z.W., Gong, J.Y., Cao, Q., et al., 2006. HUSt: a Heterogeneous Unified Storage System for GIS Grid. Proc. ACM/IEEE Conf. on Supercomputing, p.325-338.

[17]Zhu, B.J., Li, H., Patterson, H., 2008. Avoiding the Disk Bottleneck in the Data Domain Deduplication File System. Proc. 6th USENIX Conf. on File and Storage Technologies, p.269-282.

Open peer comments: Debate/Discuss/Question/Opinion


Please provide your name, email address and a comment

Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952783; E-mail: cjzhang@zju.edu.cn
Copyright © 2000 - 2022 Journal of Zhejiang University-SCIENCE