节点文献

基于BWT和小波树的数据压缩算法

A Study of Data Compression Based on BWT And Wavelet Tree

【作者】 赵恒

【导师】 霍红卫; 张小宁;

【作者基本信息】 西安电子科技大学 , 计算机技术, 2014, 硕士

【摘要】 随着互联网领域以及其他新兴领域的迅猛发展,全世界每天产生的数据急剧增加。数据里面蕴含着非常有价值的信息,于是人们开始对这些数据进行分析与研究,挖掘自己感兴趣的信息。由于数据规模本身非常大,因此数据的存储是一项很大的挑战,需要耗费很大的磁带,磁盘资源;除此之外互联网上不同主机之间的数据通信量日趋增大,这占用较多的网络带宽资源,而且是很耗时的。因此对这些数据的压缩成为大数据处理的研究热点和解决大数据问题的必然趋势。首先分析了基于后缀数组计算BWT的方法,BWT变换的性质,并且提出一种BWT逆变换时的LF修正算法;其次研究了01序列上各种rank/select数据结构的原理,空间大小,并从数学角度分析了小波树上内部节点01序列的层次化对各种rank/select数据结构空间大小的影响;最后基于BWT变换和小波树,我们提出一种数据压缩及解压缩算法的框架,并且给出了该算法并发化的一些思路。我们通过定义自己的压缩文件格式,最终高效地开发了一款数据压缩软件,wzip。我们还通过实验测试了小波树树形,节点压缩方式,BWT变换块大小对wzip压缩率的影响。最终我们发现在BWT变换块大小固定的条件下,基于hu-tacker[10]编码的小波树并且内部节点采用run-length Gamma[11]的wzip压缩方案能够达到最好的压缩效果。通过将wzip和bzip2,gzip,rar,zip,7z,compress这些数据压缩软件在压缩率,压缩时间,解压时间方面的对比,我们发现wzip具有压缩率好,压缩及解压缩速度快的特点。此外wzip提供丰富的参数选项,以提供不同的压缩性能需求。我们的压缩软件wzip可从Github下载使用(https://github.com/goldbeef/wzip)。

【Abstract】 With the fast development of internet and other new scientific fields, it produces more and more data which contains much useful and important information all over the world. People have started do some research and analysis on the large scale data to find their interested information. But it has been a big challenge to store the big data because of the scale of data. Also it takes a lots of time and network bandwidth to transmit data between computers on the internet. So the ways to compress data become more and more necessary and flourishing.Firstly, we analyze the methods to compute BWT based on suffix array and the property of BWT. Then we propose a modified algorithm to compute LF at the time of BWT inverse transform. Secondly, we do some research on the different rank/select data structures of bits sequence and we analyze the hierarchical effect on space size of various rank/select data structures of wavelet tree. At last, based on BWT and wavelet tree we propose an algorithm to compress and decompress data and some ideas to parallel the algorithm. By defining our own compressed file format, we efficiently develop data compression software called wzip.By doing some experiments, we test the influence of wavelet tree shape, inner nodes compression method and block size on wzip’s compression ratio. We discover that hu-taker shaped wavelet tree with run-length γ to compression its inner nodes will get the best compression ratio under the condition of fixed block size of BWT transform. By comparing wzip with bzip2, gzip, rar, zip, 7z and compress in the aspects of compression ratio, compression time and decompression time, we find that wzip could get a good compression ratio with fast compression speed. The source code is available online(https://github.com/goldbeef/wzip).

【关键词】 数据压缩BWT变换小波树
【Key words】 data compressionBWT transformwavelet tree
节点文献中: 

本文链接的文献网络图示:

本文的引文网络