CLC number: TP303
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2017-12-27
Xin Liu, Yu-tong Lu, Jie Yu, Peng-fei Wang, Jie-ting Wu, Ying Lu. ONFS: a hierarchical hybrid file system based on memory, SSD, and HDD for high performance computers[J]. Frontiers of Information Technology & Electronic Engineering, 2017, 18(12): 1940-1971.
@article{Liu2017ONFS,
title="ONFS: a hierarchical hybrid file system based on memory, SSD, and HDD for high performance computers",
author="Xin Liu, Yu-tong Lu, Jie Yu, Peng-fei Wang, Jie-ting Wu, Ying Lu",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="18",
number="12",
pages="1940-1971",
year="2017",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.1700626"
}
%0 Journal Article
%T ONFS: a hierarchical hybrid file system based on memory, SSD, and HDD for high performance computers
%A Xin Liu
%A Yu-tong Lu
%A Jie Yu
%A Peng-fei Wang
%A Jie-ting Wu
%A Ying Lu
%J Frontiers of Information Technology & Electronic Engineering
%V 18
%N 12
%P 1940-1971
%@ 2095-9184
%D 2017
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.1700626
TY - JOUR
T1 - ONFS: a hierarchical hybrid file system based on memory, SSD, and HDD for high performance computers
A1 - Xin Liu
A1 - Yu-tong Lu
A1 - Jie Yu
A1 - Peng-fei Wang
A1 - Jie-ting Wu
A1 - Ying Lu
JO - Frontiers of Information Technology & Electronic Engineering
VL - 18
IS - 12
SP - 1940
EP - 1971
SN - 2095-9184
Y1 - 2017
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.1700626
ER -
Abstract: As supercomputers develop towards exascale, the number of compute cores increases dramatically, making more complex and larger-scale applications possible. The input/output (I/O) requirements of large-scale applications, workflow applications, and their checkpointing include substantial bandwidth and extremely low latency, posing a serious challenge to high performance computing (HPC) storage systems. Current hard disk drive (HDD) based underlying storage systems are increasingly unable to meet the requirements of next-generation exascale supercomputers. To rise to the challenge, we propose a hierarchical hybrid storage system, the on-line and near-line file system (ONFS). It leverages dynamic random access memory (DRAM) and solid-state drives (SSDs) in compute nodes, and HDDs in storage servers, to build a three-level storage system with a unified namespace. It supports portable operating system interface (POSIX) semantics, and provides high bandwidth, low latency, and huge storage capacity. In this paper, we present the technical details of distributed metadata management, the memory borrow-and-return strategy, data consistency, parallel access control, and the mechanisms guiding downward and upward migration in ONFS. We implement an ONFS prototype on the TH-1A supercomputer, and conduct experiments to evaluate its I/O performance and scalability. The results show that the single-thread and multi-thread 'read'/'write' bandwidths are 6-fold and 5-fold those of HDD-based Lustre, respectively. The I/O bandwidth of data-intensive applications in ONFS can be 6.35 times that in Lustre.
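To make the tiering behaviour summarized above concrete, the sketch below models a three-level DRAM/SSD/HDD hierarchy in which newly written data lands in the fastest tier, cold data is evicted downward when a tier fills, and data that is read again is promoted upward. This is not ONFS code: the class names, the capacity values, and the plain LRU eviction rule are illustrative assumptions standing in for the migration mechanisms the paper describes.

from collections import OrderedDict

class Tier:
    """One storage level; files are kept in LRU order (oldest first)."""
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity          # capacity in arbitrary "blocks"
        self.files = OrderedDict()        # file_id -> size

    def used(self):
        return sum(self.files.values())

class TieredStore:
    """Toy three-level hierarchy: DRAM -> SSD -> HDD (fastest tier first)."""
    def __init__(self):
        self.tiers = [Tier("DRAM", 100), Tier("SSD", 1000), Tier("HDD", 10**6)]

    def _find(self, file_id):
        for level, tier in enumerate(self.tiers):
            if file_id in tier.files:
                return level, tier
        return None, None

    def write(self, file_id, size):
        # New or rewritten data always lands in the fastest tier.
        _, tier = self._find(file_id)
        if tier is not None:
            del tier.files[file_id]
        self._place(0, file_id, size)

    def read(self, file_id):
        # A read marks the file hot; if it sits in a slower tier, promote it.
        level, tier = self._find(file_id)
        if tier is None:
            raise FileNotFoundError(file_id)
        size = tier.files.pop(file_id)
        if level > 0:
            self._place(level - 1, file_id, size)   # upward migration
        else:
            tier.files[file_id] = size              # refresh LRU position
        return size

    def _place(self, level, file_id, size):
        # Insert into a tier; evict least-recently-used files downward on overflow.
        tier = self.tiers[level]
        while tier.files and tier.used() + size > tier.capacity:
            victim, victim_size = tier.files.popitem(last=False)   # coldest file
            if level + 1 < len(self.tiers):
                self._place(level + 1, victim, victim_size)        # downward migration
        tier.files[file_id] = size

if __name__ == "__main__":
    store = TieredStore()
    for i in range(30):
        store.write("ckpt_%d" % i, 10)   # checkpoint-like burst: fills DRAM, spills to SSD
    store.read("ckpt_0")                 # cold file is promoted back toward DRAM
    for t in store.tiers:
        print(t.name, "holds", len(t.files), "files,", t.used(), "blocks")

A real hierarchical file system would additionally handle distributed metadata, concurrent access, and data consistency, and would use richer hotness metrics than plain LRU; the sketch only illustrates the downward and upward data flow between the three tiers.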