CLC number: TP309.3
On-line Access: 2010-04-28
Received: 2009-07-22
Revision Accepted: 2009-11-01
Crosschecked: 2009-12-31
Cited: 3
Clicked: 8359
Tian-ming Yang, Dan Feng, Zhong-ying Niu, Ya-ping Wan. Scalable high performance de-duplication backup via hash join[J]. Journal of Zhejiang University Science C, 2010, 11(5): 315-327.
@article{title="Scalable high performance de-duplication backup via hash join",
author="Tian-ming Yang, Dan Feng, Zhong-ying Niu, Ya-ping Wan",
journal="Journal of Zhejiang University Science C",
volume="11",
number="5",
pages="315-327",
year="2010",
publisher="Zhejiang University Press & Springer",
doi="10.1631/jzus.C0910445"
}
%0 Journal Article
%T Scalable high performance de-duplication backup via hash join
%A Tian-ming Yang
%A Dan Feng
%A Zhong-ying Niu
%A Ya-ping Wan
%J Journal of Zhejiang University SCIENCE C
%V 11
%N 5
%P 315-327
%@ 1869-1951
%D 2010
%I Zhejiang University Press & Springer
%DOI 10.1631/jzus.C0910445
TY - JOUR
T1 - Scalable high performance de-duplication backup via hash join
A1 - Tian-ming Yang
A1 - Dan Feng
A1 - Zhong-ying Niu
A1 - Ya-ping Wan
J0 - Journal of Zhejiang University Science C
VL - 11
IS - 5
SP - 315
EP - 327
%@ 1869-1951
Y1 - 2010
PB - Zhejiang University Press & Springer
ER -
DOI - 10.1631/jzus.C0910445
Abstract: Apart from high space efficiency, other demanding requirements for enterprise de-duplication backup are high performance, high scalability, and availability for large-scale distributed environments. The main challenge is reducing the significant disk input/output (I/O) overhead as a result of constantly accessing the disk to identify duplicate chunks. Existing inline de-duplication approaches mainly rely on duplicate locality to avoid disk bottleneck, thus suffering from degradation under poor duplicate locality workload. This paper presents Chunkfarm, a post-processing de-duplication backup system designed to improve capacity, throughput, and scalability for de-duplication. Chunkfarm performs de-duplication backup using the hash join algorithm, which turns the notoriously random and small disk I/Os of fingerprint lookups and updates into large sequential disk I/Os, hence achieving high write throughput not influenced by workload locality. More importantly, by decentralizing fingerprint lookup and update, Chunkfarm supports a cluster of servers to perform de-duplication backup in parallel; it hence is conducive to distributed implementation and thus applicable to large-scale and distributed storage systems.
[1]Agrawal, N., Bolosky, W.J., Douceur, J.R., Lorch, J.R., 2007. A Five-Year Study of File-System Metadata. Proc. 5th USENIX Conf. on File and Storage Technologies, p.31-45.
[2]Bhagwat, D., Eshghi, K., Long, D.D.E., Lillibridge, M., 2009. Extreme Binning: Scalable, Parallel Deduplication for Chunk-Based File Backup. Proc. 17th IEEE Int. Symp. on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems.
[3]Blomer, J., Kalfane, M., Karpinski, M., Karp, R., Luby, M., Zuckerman, D., 1995. A XOR-Based Erasure Resilient Coding Scheme. International Computer Science Institute Technical Report, No. TR-95-048.
[4]Bloom, B., 1970. Space/Time trade-offs in hash coding with allowable errors. Commun. ACM., 13(7):422-426.
[5]Broder, A., Mitzenmacher, M., 2004. Network applications of Bloom filters: a survey. Internet Math., 1(4):485-509.
[6]Dubnicki, C., Gryz, L., Heldt, L., Kaczmarczyk, M., Kilian, W., Strzelczak, P., Szczepkowski, J., Ungureanu, C., Welnicki, M., 2009. Hydrastor: a Scalable Secondary Storage. Proc. 7th USENIX Conf. on File and Storage Technologies, p.197-210.
[7]Eshghi, K., Lillibridge, M., Wilcock, L., Belrose, G., Hawkes, R., 2007. Jumbo Store: Providing Efficient Incremental Upload and Versioning for a Utility Rendering Service. Proc. 5th USENIX Conf. on File and Storage Technologies, p.22-37.
[8]Lillibridge, M., Eshghi, K., Bhagwat, D., Deolalikar, V., Trezise, G., Camble, P., 2009. Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality. Proc. 7th USENIX Conf. on File and Storage Technologies, p.111-123.
[9]Muthitacharoen, A., Chen, B., Mazieres, D., 2001. A Low-Bandwidth Network File System. Proc. 18th ACM Symp. on Operating Systems Principles, p.174-187.
[10]Quinlan, S., Dorward, S., 2002. Venti: a New Approach to Archival Storage. Proc. USENIX Conf. on File and Storage Technologies, p.89-101.
[11]Rhea, S., Cox, R., Pesterev, A., 2008. Fast, Inexpensive Content-Addressed Storage in Foundation. Proc. USENIX Annual Technical Conf., p.143-156.
[12]Secure Hash Standard, 1995. Department of Commerce/ NIST, National Technical Information Service, Springfield, VA, USA.
[13]Shapiro, L.D., 1986. Join processing in database systems with large main memories. ACM Trans. Database Syst., 11(3):239-264.
[14]Tanenbaum, A.S., Herder, J.N., Bos, H., 2006. File size distribution on UNIX systems: then and now. ACM SIGOPS Oper. Syst. Rev., 40(1):100-104.
[15]You, L., Pollack, K., Long, D.D.E., 2005. Deep Store: an Archival Storage System Architecture. Proc. 21st Int. Conf. on Data Engineering, p.804-815.
[16]Zeng, L.F., Zhou, K., Shi, Z., Feng, D., Wang, F., Xie, C.S., Li, Z.T., Yu, Z.W., Gong, J.Y., Cao, Q., et al., 2006. HUSt: a Heterogeneous Unified Storage System for GIS Grid. Proc. ACM/IEEE Conf. on Supercomputing, p.325-338.
[17]Zhu, B.J., Li, H., Patterson, H., 2008. Avoiding the Disk Bottleneck in the Data Domain Deduplication File System. Proc. 6th USENIX Conf. on File and Storage Technologies, p.269-282.
Open peer comments: Debate/Discuss/Question/Opinion
<1>