Application of Multidimensional Bloom Algorithm in Redis Fingerprint Auto-Expiration
【Abstract】 The Scrapy-Redis framework consumes a large amount of storage, and once a key expires, Redis deletes all the data in the deduplication set. To address both problems, we design an automatic fingerprint expiration algorithm based on a multidimensional Bloom filter and implement it in Python. The implementation is integrated into Scrapy-Redis by replacing the deduplication class and modifying methods inside the framework. In testing, the reconstructed framework was compared with an approach that uses a Redis hash table to set fingerprint expiration times. The results show that the reconstructed framework saves a substantial amount of space in large-scale crawls while achieving automatic fingerprint expiration with a false-positive rate below 1 in 10,000.
【Key words】 Multidimensional Bloom algorithm; Scrapy-Redis; Fingerprint expiration; Crawler; Billion scale
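The abstract does not spell out how the multidimensional filter is built, so the following is only a minimal sketch of one common way to get expiration out of Bloom filters, not the authors' implementation: keep one filter per time window and drop the oldest window on rotation. The names `bloom_parameters`, `BloomFilter`, `ExpiringDupeFilter`, and `rotate` are hypothetical, and the bits are held in memory to keep the example self-contained; in a Scrapy-Redis setting they would presumably live in Redis bitmaps instead. The sizing function uses the standard Bloom filter formulas, which give roughly 19.2 bits and 13 hash functions per fingerprint for a false-positive rate of 1 in 10,000.

```python
import hashlib
import math
from collections import deque


def bloom_parameters(capacity, error_rate):
    """Standard sizing: bit count m and hash count k for n items at rate p."""
    m = math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
    k = max(1, round(m / capacity * math.log(2)))
    return m, k


class BloomFilter:
    """Plain in-memory Bloom filter (assumption: no Redis backing here)."""

    def __init__(self, capacity, error_rate):
        self.m, self.k = bloom_parameters(capacity, error_rate)
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item):
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7))
            for pos in self._positions(item)
        )


class ExpiringDupeFilter:
    """Hypothetical ring of Bloom filters, one per time window.

    A fingerprint expires once its window has been rotated out.
    """

    def __init__(self, windows, capacity_per_window, error_rate):
        self.capacity = capacity_per_window
        # Split the error budget so the union over all windows stays
        # below error_rate (union bound).
        self.per_filter_rate = error_rate / windows
        self.filters = deque(maxlen=windows)
        self.rotate()

    def rotate(self):
        # Appending to a full deque silently discards the oldest window.
        self.filters.append(BloomFilter(self.capacity, self.per_filter_rate))

    def seen(self, fingerprint):
        """Return True if already recorded; otherwise record it."""
        if any(fingerprint in f for f in self.filters):
            return True
        self.filters[-1].add(fingerprint)
        return False
```

Calling `rotate()` on a timer (for example once per hour with 24 windows) would give each fingerprint a lifetime of between 23 and 24 hours. The trade-off of this design is coarse expiration granularity in exchange for a fixed memory bound per window.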
- 【Source】 Computer Applications and Software (计算机应用与软件), 2020, Issue 08
- 【Classification No.】 TP311.13
- 【Citations】 2
- 【Downloads】 129