
基于内存的列存储数据集动态压缩技术的研究与应用

Research and Application on Dynamic Compression Technique on In-Memory Column Oriented Dataset

【Author】 Jiang Zhipeng (蒋志鹏)

【Supervisor】 Chen Haopeng (陈昊鹏)

【Author Information】 Shanghai Jiao Tong University, Software Engineering, 2016, Master's degree

【Abstract (translated from the Chinese)】 With the rapid development of the information industry, more and more sectors face enormous data volumes, complex data types, and demands for faster processing and higher computational accuracy. Single-machine computing falls far short of the storage and performance requirements of big data, so a family of technologies represented by Hadoop emerged, whose two core components, MapReduce and HDFS, address computing performance and storage respectively. However, as Moore's Law has slowed, disk read/write performance has seen no breakthrough in recent years, and analysis tools that read and write disks frequently struggle more and more with ever-growing data volumes. To address this, the AMPLab at the University of California, Berkeley designed a computing framework centered on in-memory computing, caching data in memory to avoid frequent disk I/O. While this greatly improves computing performance, memory remains comparatively expensive. Moreover, in current computer architectures overall system performance does not scale linearly with the amount of memory installed: the throughput of the system bus still limits how efficiently memory can be scheduled. Using memory efficiently is therefore especially important for in-memory computing.

This thesis proposes a dynamic compression strategy for in-memory data sets, aimed at solving the memory allocation problem for in-memory computing efficiently and flexibly. By thoroughly benchmarking the compression performance of different algorithms and studying Spark's resource allocation model in detail, the strategy identifies a compression algorithm suited to each type of computation and uses the system's runtime metrics to decide whether data should be compressed and persisted, thereby saving memory and optimizing overall system performance. Since most in-memory data sets are stored column-wise, applying compression is particularly convenient. Following this strategy, we designed and implemented a dynamic data compression module on the Spark framework that selects a suitable compression algorithm according to the data type and decides, from the system's computing performance, whether to compress and persist the data. To put the research into practice, we also designed and implemented a complete real-time log analysis framework. Besides integrating the dynamic compression strategy, it provides a unified SQL-like query interface through which users can query both real-time and offline data, together with a message-queue subsystem for data collection, a user-facing SQL query interface, and a back end for HTTP message forwarding.

Finally, stress tests with different data types on each module validate the system's performance: for data-intensive applications such as text statistics, dynamic compression improves performance by up to 3.6x, while for iteration-heavy applications such as image recognition and machine learning it improves performance by as much as 6x. The strategy's contributions are threefold. First, it automatically selects a suitable compression algorithm for each data scenario, so big data developers can maximize computing efficiency on the available hardware without repeated tuning and parameter adjustment. Second, for column-oriented data sets it implements column-level compression and provides a complete SQL query interface for both real-time and offline queries. Third, combining mainstream big data technologies, we apply the core compression strategy to real-time analysis of log data, effectively bridging theory and practice and validating the system's feasibility and performance.
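The core mechanism described above, trading CPU time for heap space by compressing cached data, maps directly onto Spark's serialized storage levels. The following minimal Scala sketch illustrates that mechanism using only stock Spark APIs; the input path is hypothetical, and the hard-coded `lz4` codec merely stands in for the thesis's dynamic, per-data-type codec selection, which is not reproduced here.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CompressedPersistSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("compressed-persist-sketch")
      // Compress serialized cached partitions; a dynamic strategy would
      // toggle this and pick the codec from runtime metrics instead.
      .set("spark.rdd.compress", "true")
      .set("spark.io.compression.codec", "lz4") // lz4/lzf/snappy are built in
    val sc = new SparkContext(conf)

    val lines = sc.textFile("hdfs:///logs/access.log") // hypothetical path
    // MEMORY_ONLY_SER stores partitions as serialized byte arrays, the
    // representation that spark.rdd.compress actually compresses.
    lines.persist(StorageLevel.MEMORY_ONLY_SER)

    // A data-intensive action that reuses the cached, compressed data set.
    val errors = lines.filter(_.contains("ERROR")).count()
    println(s"ERROR lines: $errors")
    sc.stop()
  }
}
```

The trade-off is exactly the one the thesis quantifies: serialized, compressed partitions cost extra CPU on each access but shrink the cache footprint and the disk I/O incurred when partitions spill.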

【Abstract】 With the increasing popularity of information technology, it has become impossible for a single machine to handle large data sets in a reasonable time. Hadoop was developed to solve this problem by bringing distributed computing into production environments, introducing MapReduce and HDFS to address computation and storage respectively. However, as Moore's Law has slowed, hard disk performance has made no significant progress in the past few years, even though the average volume of data generated on the internet has grown massively. Under these circumstances, researchers at the University of California, Berkeley proposed in-memory computing and developed Spark, which has been highly successful at implementing large-scale data-intensive applications, especially those that reuse data across multiple parallel operations. Because memory resources remain costly, however, a better way of managing them is needed. In this paper, we present an elastic data-persisting solution for column-oriented data sets in Spark, which uses data compression to free heap space in the Java Virtual Machine and to reduce disk I/O for faster data access. We tested three common compression algorithms, identified the data types each is best suited to, and then mathematically derived the criteria for selecting the optimal compression and persisting plan. Column-oriented data sets are particularly convenient to compress. Based on these hypotheses and test results, we designed and implemented a data compression module for Spark that enables dynamic compression of both RDDs and DataFrames. Our evaluation of a preliminary prototype shows that the solution can recommend resource-management plans by accounting for input data type, memory space, and CPU resources, and can consistently yield high performance, accelerating Spark by up to 6x. To test the performance of the in-memory computing system, we developed a real-time server-log analysis system that provides a message-queue service for log aggregation, a dynamic compression plan for column-oriented data sets, and a SQL query interface for in-memory data. This research has three main highlights. First, it introduces dynamic data compression, which spares application developers from tuning their applications through repeated tests and configuration changes. Second, it lets users run SQL queries on both real-time and offline data. Last but not least, by bringing other popular big data tools into the tool chain, we built a log analysis framework that not only puts our research into practice but also demonstrates the feasibility of compression while preserving performance.
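As a companion sketch of the unified SQL interface mentioned above, the fragment below caches a DataFrame in Spark SQL's compressed in-memory columnar format and queries it with plain SQL. It assumes the Spark 2.x SparkSession API; the input path, the `logs` view, and the `level` column are hypothetical stand-ins for the thesis's log schema, not its actual implementation.

```scala
import org.apache.spark.sql.SparkSession

object UnifiedSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("unified-sql-sketch")
      // Spark SQL compresses each column batch of its in-memory cache;
      // this is the column-level compression the abstract refers to.
      .config("spark.sql.inMemoryColumnarStorage.compressed", "true")
      .getOrCreate()

    val logs = spark.read.json("hdfs:///logs/parsed/") // hypothetical path
    logs.createOrReplaceTempView("logs")
    spark.catalog.cacheTable("logs") // materializes the compressed columnar cache

    // The same SQL surface serves cached (recent) and on-disk (historical) data.
    spark.sql(
      "SELECT level, COUNT(*) AS n FROM logs GROUP BY level ORDER BY n DESC"
    ).show()

    spark.stop()
  }
}
```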
