节点文献

基于数据去重的备份系统中垃圾数据回收技术的研究

Research on Data Garbage Collection Technique in Data Deduplication Based Backup Systems

【作者】 刘涛

【导师】 谭玉娟;

【作者基本信息】 重庆大学 , 工程硕士(计算机技术)(专业学位), 2018, 硕士

【摘要】 在信息存储技术领域,使用数据去重技术的备份系统中的垃圾回收一直是人们关注的焦点。在备份系统中,一般会为备份数据设置一个保留时间,过时的数据会被回收。但是在对数据去重之后,重复的数据块只保留一份,每个数据块很可能被同一个备份数据流版本内部的数据引用多次,也有可能被多个备份数据流版本之间的数据引用多次。这种同一个数据块的多次引用增加了过期数据块的回收难度,如何高效的清除这些无效数据块,使其所占用的存储空间得以重新利用,是应用数据去重技术的备份存储系统中亟待解决的问题。现有的基于数据去重的备份系统中垃圾数据的回收方法主要包括两种,引用计数(RC)和标记回收(MS)。引用计数的主要思想是为每一个去重系统中的数据块设置一个引用计数值,每次引用到该数据块将对其引用计数值加1,通过检查该引用计数值是否为0即可判断该数据块是否为垃圾数据块。标记回收方法与引用计数不同,它没有设置引用计数值,不在备份阶段对数据进行任何预处理,而是在垃圾回收阶段通过扫描所有的备份元数据来寻找没有被引用的垃圾数据块。这两种垃圾数据回收方法的缺点都很明显。对于引用计数方法,其主要的缺点是可靠性低,任何对引用计数值的重复更新或延迟更新都将导致该数值不正确,使系统中的所存储/引用的数据块与该计数值不一致,导致垃圾数据回收出错。而对于标记回收,其最主要缺点是备份数据的扫描时间太长,标记垃圾数据的速度太慢。针对现有引用计数和标记回收方法的缺点,本文提出了基于引用时间图的垃圾回收机制(Gc_RTM)。该机制以存储容器为单位,构建针对每个存储容器的引用时间图(RTM)和容器位表(CBT),结合引用时间图和容器位表结构快速获取可以回收的垃圾数据块和可以重新利用的存储空间。与引用计数相比,该机制采用了引用时间图,不需要对引用计数值进行简单的加/减1操作,可靠性更高;而与标记回收相比,该方法通过存储容器的引用时间图和容器位表能快速标记要回收的垃圾数据,不需要对备份数据流进行全盘扫描,垃圾数据回收的速度更快。本文采用大量的测试数据集对基于Gc_RTM的垃圾回收方法进行了性能分析和评估。测试结果表明,与引用计数和标记回收相比,该方法无论在垃圾回收的时间性能开销还是空间性能开销表现都要更好。在单个版本回收中,Gc_RTM的时间性能表现相对于RC约有20倍的提升,相对于MS约有100倍的提升,且随着备份版本的增加,性能表现更好。在批量版本回收中,Gc_RTM性能表现要更好。在空间开销表现方面,Gc_RTM开销最小,约为RC的1/2,MS的1/3,且随着备份版本增加,优势更加明显。总的来说,Gc_RTM能够有效的提升使用数据去重技术的备份系统中的垃圾回收的时间和空间性能表现,优化了存储性能。

【Abstract】 In the field of information storage technology,garbage collection in backup systems using data deduplication technology has always been the focus of attention.In the backup system,a retention time is generally set for the backup data,and outdated data should be collected.But after deduplicating the data,only one copy of the repeated data blocks is retained,and each data block is likely to be referenced multiple times by data within the same backup data flow version.It is also possible for data references between multiple backup data stream versions to be multiple times.This multiple reference of the same data block increases the difficulty of retrieving expired data blocks.How to effectively eliminate these invalid data blocks and reuse the storage space occupied by them is a problem to be solved urgently in the backup storage system of the application data deduplication technology.There are two types of garbage data collection methods in existing backup systems based on data deduplication.Respectively,reference count(RC)and mark and sweep(MS).The main idea of reference count is to set a reference count value for each data block in the deduplication system.Each reference to the block will add 1 to its reference count value.By checking whether the reference count value is 0,it can be judged whether the data block is a garbage data block.The mark and sweep method differs from the reference count in that it does not set a reference count value,does not perform any preprocessing on the data in the backup phase,and in the garbage collection phase scans all backup metadata to find unreferenced garbage data blocks.The disadvantages of these two garbage collection methods are obvious.The main disadvantage of the reference count method is its low reliability.Any repeated update or deferred update of the reference count value will cause the value to be incorrect,making the stored/referenced data block in the system inconsistent with the count value,resulting in garbage data recovery errors.The main drawback of mark and sweep is that the scan time of backup data is too long,and the speed of marking garbage data is too slow.Aiming at the shortcomings of existing reference counting and mark and sweep methods,this paper proposes a garbage collection mechanism based on reference time map(Gc_RTM).The mechanism builds a reference time map(RTM)and a container bitmap(CBT)for each storage container in units of storage containers.Combine the reference time map and the container bitmap structure to quickly obtain garbage chunks that can be reclaimed and storage space that can be reused.Compared with reference count,this mechanism uses a reference time map,and does not require a simple addition/subtraction of 1 operation on the reference count value,resulting in higher reliability.Compared with mark and sweep,the method can quickly mark the garbage data to be recovered by using the reference time map and the container bit table of the storage container,and does not need to perform a full scan on the backup data stream,and the garbage data recovery speed is faster.This paper uses a large number of test data sets to perform performance analysis and evaluation of Gc_RTM garbage collection methods.Test results show that compared to reference count and mark and sweep,this method performs better regardless of the time overhead or space performance overhead of garbage collection.The time performance of Gc_RTM is about 20 times faster than that of RC,and 100 times that of MS.With the increase of backup version,the performance is more obvious.In batch version recovery,Gc_RTM performance is even better.In terms of space overhead performance,the Gc_RTM overhead is the smallest,about 1/2 of RC,and 1/3 of MS.With the increase of backup versions,the advantages are even more obvious.In summary,Gc_RTM can effectively improve the time and space performance of garbage collection in backup systems that use data deduplication technology and optimize storage performance.

  • 【网络出版投稿人】 重庆大学
  • 【网络出版年期】2019年 04期
节点文献中: 

本文链接的文献网络图示:

本文的引文网络