Secondary Index Deduplication Algorithm Based on Similar Clustering
【Abstract】 To address the I/O bottleneck of fingerprint comparison in deduplication algorithms, this paper proposes a secondary-index deduplication algorithm based on similar clustering. First, the Simhash value of every data block is computed, and an adaptive similar-clustering algorithm is built on the Hamming distance between Simhash values; the information of all cluster centers forms the primary index, which is kept in memory. Then the MD5 value of each data block in every cluster is computed, and this information forms the secondary index, which is stored with the cluster center. To check whether a data block is a duplicate, the Hamming distance between its Simhash value and the Simhash value of every cluster center in the primary index is computed, and the secondary index of the cluster with the minimum Hamming distance is loaded into memory for MD5 fingerprint comparison. Experimental results show that the algorithm produces no false positives, considerably speeds up fingerprint comparison, and incurs only one I/O operation per detection, making it more efficient.
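The lookup flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the adaptive clustering rule, so the fixed Hamming-distance `threshold`, the byte-shingle features inside `simhash`, and the names `TwoLevelIndex`, `add`, and `loads` are all assumptions made for illustration. The secondary index is held in an in-memory list that stands in for on-disk storage, and `loads` counts the simulated I/O operations.

```python
import hashlib

SIMHASH_BITS = 64  # assumed fingerprint width


def simhash(block: bytes, shingle: int = 4) -> int:
    """Simhash over overlapping byte shingles (the feature choice is an assumption)."""
    weights = [0] * SIMHASH_BITS
    for i in range(max(1, len(block) - shingle + 1)):
        feature = block[i:i + shingle]
        h = int.from_bytes(hashlib.blake2b(feature, digest_size=8).digest(), "big")
        for bit in range(SIMHASH_BITS):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(SIMHASH_BITS) if weights[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


class TwoLevelIndex:
    """Primary index: cluster-center Simhash values. Secondary index: MD5 sets per cluster."""

    def __init__(self, threshold: int = 8):
        self.threshold = threshold  # assumed fixed radius, standing in for the adaptive rule
        self.centers = []           # primary index, kept in memory
        self.buckets = []           # secondary index, stand-in for on-disk storage
        self.loads = 0              # simulated I/O operations

    def _nearest(self, sh: int):
        """Index of the center with the minimum Hamming distance, or None if empty."""
        if not self.centers:
            return None
        return min(range(len(self.centers)), key=lambda i: hamming(sh, self.centers[i]))

    def add(self, block: bytes) -> bool:
        """Return True if the block is a duplicate; otherwise store it and return False."""
        sh = simhash(block)
        md5 = hashlib.md5(block).hexdigest()
        idx = self._nearest(sh)
        if idx is not None:
            self.loads += 1  # load exactly one secondary index per detection
            if md5 in self.buckets[idx]:
                return True
            if hamming(sh, self.centers[idx]) <= self.threshold:
                self.buckets[idx].add(md5)
                return False
        # No sufficiently close center: the block starts a new cluster.
        self.centers.append(sh)
        self.buckets.append({md5})
        return False


index = TwoLevelIndex()
block = b"example data block " * 16
print(index.add(block))  # False: first sighting, stored as a new cluster
print(index.add(block))  # True: MD5 found in the nearest cluster's secondary index
```

Because the final decision rests on MD5 equality, a block is never reported as a duplicate merely for being similar. With the fixed threshold assumed here, a duplicate can still be missed if a center created later lies closer to the block than the cluster it was stored in; how the adaptive clustering in the paper handles this is not stated in the abstract.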
【Key words】 deduplication; secondary index; similar clustering; Simhash; Hamming distance
- 【Source】 Journal of Chinese Computer Systems (小型微型计算机系统), 2017, Issue 12
- 【Classification number】 TP311.13
- 【Times cited】 1
- 【Downloads】 98