Research on Dynamic Management of Data Replicas in Heterogeneous Hadoop Clusters
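【Author】 Liu Yang (刘洋);
【Advisor】 Wu Qishi (吴奇石);
【Author information】 Northwest University (西北大学), Communication and Information Systems, 2018, Master's thesis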
【Abstract】 How best to place and manage data and its replicas is a pressing problem in HDFS. Hadoop's default data placement strategy assumes a homogeneous cluster; in a heterogeneous environment that assumption no longer holds, and the default strategy may incur additional cost and reduce MapReduce performance. In this thesis, we use a grey prediction model to predict the hotness of file data and propose a dynamic data replica placement strategy. The strategy computes the number of replicas for each data block in real time, taking into account the block's hotness and the performance characteristics of each node in the heterogeneous cluster, and adjusts the replica count dynamically as hotness changes. The main research content is as follows: (1) For data hotness prediction, we analyze the access request counts of a large body of historical file data and observe that they exhibit characteristic patterns within a given time period. To capture these patterns, we apply a grey prediction model to the historical request counts of data blocks in one time period to predict block hotness in the next time period. (2) To address the limitations of a static replication factor, we adopt a real-time hotness-based replica calculation method that combines a dynamic weight with the block's current hotness, that is, its access rate, to determine how many replicas the block should have. (3) For heterogeneous clusters, we propose a dynamic data placement strategy that accounts for differences among nodes in computing power, disk storage space, IOPS (input/output operations per second), and other parameters to decide when a new replica is placed and on which node. (4) We evaluate the strategy in a simulated Hadoop system. The results show that the proposed dynamic data replica placement strategy outperforms Hadoop's default static data placement strategy: it shortens execution time, reduces network access contention, and lowers user response time.
【Key words】 Hadoop; heterogeneous cluster; data replica management strategy; dynamic data replica placement; grey prediction;
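
The hotness prediction step in item (1) of the abstract can be illustrated with the standard GM(1,1) grey model. The abstract does not say which grey model variant, window length, or time granularity the thesis uses, so the following is only a minimal sketch under those assumptions; the function name `gm11_forecast` and the sample access counts are hypothetical and do not come from the thesis.

```python
import math


def gm11_forecast(history, steps=1):
    """Forecast the next `steps` values of a series with a textbook GM(1,1) model.

    `history` is assumed to hold per-interval access counts for one data block.
    This is an illustrative sketch, not the thesis's implementation.
    """
    n = len(history)
    if n < 4:
        raise ValueError("GM(1,1) is usually fitted on at least 4 observations")

    # Step 1: accumulated generating operation (running sum of the raw series).
    cumulative = []
    total = 0.0
    for value in history:
        total += value
        cumulative.append(total)

    # Step 2: background values, the mean of consecutive accumulated values.
    z = [0.5 * (cumulative[k] + cumulative[k - 1]) for k in range(1, n)]
    y = list(history[1:])

    # Step 3: least-squares fit of y[k] = -a * z[k] + b (simple linear regression).
    m = len(z)
    sum_z, sum_y = sum(z), sum(y)
    sum_zz = sum(v * v for v in z)
    sum_zy = sum(zv * yv for zv, yv in zip(z, y))
    denom = m * sum_zz - sum_z * sum_z
    if denom == 0:
        raise ValueError("series is degenerate; cannot fit GM(1,1)")
    slope = (m * sum_zy - sum_z * sum_y) / denom
    b = (sum_y - slope * sum_z) / m
    a = -slope  # development coefficient

    # Step 4: time response of the accumulated series, then difference it back.
    def fitted_cumulative(k):
        if abs(a) < 1e-12:
            # Limit of the exponential form as a -> 0: linear growth.
            return history[0] + b * k
        return (history[0] - b / a) * math.exp(-a * k) + b / a

    return [fitted_cumulative(k) - fitted_cumulative(k - 1)
            for k in range(n, n + steps)]


if __name__ == "__main__":
    # Hypothetical access counts for one block over five past intervals.
    access_counts = [100, 110, 121, 133.1, 146.41]
    print(gm11_forecast(access_counts, steps=1))
```

The forecast for the next interval could then feed a replica count calculation such as the one described in item (2); how the thesis maps predicted hotness to a replica number is not specified in the abstract, so that mapping is left out here.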