Journal of Zhejiang University

Frontiers of Information Technology & Electronic Engineering 2016 Vol.17 No.10 P.982-993

TextGen: a realistic text data content generation method for modern storage system benchmarks

Author(s): Long-xiang Wang, Xiao-she Dong, Xing-jun Zhang, Yin-feng Wang, Tao Ju, Guo-fu Feng
Affiliation(s): 1. School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an 710049, China
Corresponding email(s): wanglongxiang@stu.xjtu.edu.cn, xsdong@mail.xjtu.edu.cn, xjzhang@mail.xjtu.edu.cn, wangyinfeng@gmail.com, jutao2011@stu.xjtu.edu.cn, jt_f@163.com
Key Words: Benchmark, Storage system, Word-based compression


Long-xiang Wang, Xiao-she Dong, Xing-jun Zhang, Yin-feng Wang, Tao Ju, Guo-fu Feng. TextGen: a realistic text data content generation method for modern storage system benchmarks[J]. Frontiers of Information Technology & Electronic Engineering, 2016, 17(10): 982-993.

@article{title="TextGen: a realistic text data content generation method for modern storage system benchmarks",
author="Long-xiang Wang, Xiao-she Dong, Xing-jun Zhang, Yin-feng Wang, Tao Ju, Guo-fu Feng",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="17",
number="10",
pages="982-993",
year="2016",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.1500332"
}

%0 Journal Article
%T TextGen: a realistic text data content generation method for modern storage system benchmarks
%A Long-xiang Wang
%A Xiao-she Dong
%A Xing-jun Zhang
%A Yin-feng Wang
%A Tao Ju
%A Guo-fu Feng
%J Frontiers of Information Technology & Electronic Engineering
%V 17
%N 10
%P 982-993
%@ 2095-9184
%D 2016
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.1500332

TY - JOUR
T1 - TextGen: a realistic text data content generation method for modern storage system benchmarks
A1 - Long-xiang Wang
A1 - Xiao-she Dong
A1 - Xing-jun Zhang
A1 - Yin-feng Wang
A1 - Tao Ju
A1 - Guo-fu Feng
JO - Frontiers of Information Technology & Electronic Engineering
VL - 17
IS - 10
SP - 982
EP - 993
SN - 2095-9184
Y1 - 2016
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.1500332
ER -


Abstract: Modern storage systems incorporate data compressors to improve their performance and capacity. As a result, data content can significantly influence the result of a storage system benchmark. Because real-world proprietary datasets are too large to be copied onto a test storage system, and most data cannot be shared due to privacy issues, a benchmark needs to generate data synthetically. To ensure that the result is accurate, it is necessary to generate data content based on the characterization of real-world data properties that influence the storage system performance during the execution of a benchmark. The existing approach, called SDGen, cannot guarantee that the benchmark result is accurate in storage systems that have built-in word-based compressors. The reason is that SDGen characterizes the properties that influence compression performance only at the byte level, and no properties are characterized at the word level. To address this problem, we present TextGen, a realistic text data content generation method for modern storage system benchmarks. TextGen builds the word corpus by segmenting real-world text datasets, and creates a word-frequency distribution by counting each word in the corpus. To improve data generation performance, the word-frequency distribution is fitted to a lognormal distribution by maximum likelihood estimation. The Monte Carlo approach is used to generate synthetic data. The running time of TextGen generation depends only on the expected data size, which means that the time complexity of TextGen is O(n). To evaluate TextGen, four real-world datasets were used to perform an experiment. The experimental results show that, compared with SDGen, the compression performance and compression ratio of the datasets generated by TextGen deviate less from real-world datasets when end-tagged dense code, a representative of word-based compressors, is evaluated.
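The pipeline described in the abstract (segment text into words, count word frequencies, fit a lognormal distribution by maximum likelihood, then generate by Monte Carlo sampling) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the whitespace segmentation, and the way the fitted distribution drives generation (one lognormal weight drawn per vocabulary word) are all assumptions made for illustration; the abstract does not specify these details.

```python
import math
import random
from collections import Counter


def word_frequencies(text):
    """Segment text on whitespace (an assumption) and count each word."""
    return Counter(text.split())


def fit_lognormal(counts):
    """Closed-form maximum likelihood estimate for a lognormal distribution:
    mu and sigma are the mean and standard deviation of the log-values."""
    logs = [math.log(c) for c in counts]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in logs) / len(logs))
    return mu, sigma


def generate_words(vocab, mu, sigma, n_words, seed=0):
    """Monte Carlo generation (hypothetical scheme): draw one frequency
    weight per vocabulary word from the fitted lognormal, then sample
    n_words words in proportion to those weights."""
    rng = random.Random(seed)
    weights = [rng.lognormvariate(mu, sigma) for _ in vocab]
    return rng.choices(vocab, weights=weights, k=n_words)


if __name__ == "__main__":
    sample = "the cat sat on the mat and the dog sat on the log"
    freqs = word_frequencies(sample)
    mu, sigma = fit_lognormal(list(freqs.values()))
    print(" ".join(generate_words(sorted(freqs), mu, sigma, 20)))
```

In this sketch the generation cost grows with the requested number of words for a fixed vocabulary, which is in the spirit of the abstract's statement that running time depends only on the expected data size.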

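TextGen: a realistic text dataset generation method for benchmarking modern storage systems

Summary: Modern storage systems use built-in data compression to improve performance and save storage space. Data content therefore significantly affects storage system benchmark results. Real-world datasets are large and hard to copy onto the target test system, and most cannot be shared for privacy reasons, so a benchmark program must generate its test datasets synthetically. To keep the results accurate, the data must be generated from the characteristics of real-world datasets that affect storage system performance. The existing method, SDGen, analyzes the content distribution of real-world datasets at the byte level and generates datasets accordingly, so it yields accurate results for storage systems with built-in byte-level compression algorithms. However, SDGen does not analyze the word-level content distribution of real-world datasets, so it cannot guarantee accurate results for storage systems with built-in word-level compression algorithms. This paper proposes TextGen, a text dataset generation method based on the lognormal probability distribution model. The method builds a corpus from the word segmentation of real-world datasets, analyzes the distribution of words in the corpus, obtains the parameters of the lognormal model of the word distribution by maximum likelihood estimation, and generates data content from the model using the Monte Carlo method. The time needed to generate a dataset depends only on the size of the generated dataset, giving linear time complexity O(n). Four datasets were collected to validate the method, which was tested with a representative word-level compression algorithm, ETDC (End-Tagged Dense Code). The experimental results show that, compared with SDGen, TextGen achieves higher performance in generating text datasets, and the generated datasets match the compression speed and compression ratio of real-world datasets more closely when used in compression tests.

Key words: Benchmark; Storage system; Word-based compression algorithm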
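The evaluation uses end-tagged dense code (ETDC), a word-based compressor. As a rough illustration of why word-level statistics matter for such a compressor, the sketch below follows the ETDC scheme as it is commonly described: words are ranked by decreasing frequency, and each rank is mapped to a byte sequence whose last byte has its high bit set while all earlier bytes do not. The function names and the whitespace-based word list are assumptions, and this is a simplified sketch rather than the implementation used in the paper.

```python
from collections import Counter


def etdc_codeword(rank):
    """Map a 0-based frequency rank to an ETDC codeword.
    The final byte is in 128..255 (the end tag); earlier bytes are in 0..127."""
    out = [128 + rank % 128]
    rank //= 128
    while rank > 0:
        rank -= 1
        out.insert(0, rank % 128)
        rank //= 128
    return bytes(out)


def etdc_compress(words):
    """Rank words by decreasing frequency and concatenate their codewords.
    Returns the encoded bytes and the ranked vocabulary needed for decoding."""
    ranked = [w for w, _ in Counter(words).most_common()]
    code = {w: etdc_codeword(i) for i, w in enumerate(ranked)}
    return b"".join(code[w] for w in words), ranked


if __name__ == "__main__":
    data, vocab = etdc_compress("a b a c a b".split())
    print(vocab, list(data))
```

Because the most frequent words receive the shortest codewords, the compressed size depends directly on the word-frequency distribution, which is the property a byte-level generator does not model.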


