Design and Research of a Distributed Crawler Scheduling Strategy Based on Double Buffering
【Abstract】 The rapid development of the Internet has made big data applications increasingly widespread, which in turn makes distributed crawlers increasingly important. Mainstream open-source crawler frameworks currently do little to optimize network communication overhead and lack an effective scheme for reducing it. Exploiting the fact that each crawler in a peer-to-peer architecture is both a consumer and a producer of tasks, this paper proposes executing tasks locally wherever possible. A dynamic load-balancing strategy for coarse-grained tasks, implemented with double buffering, effectively reduces communication frequency, and a URL deduplication scheme based on the caching principle trades space for time to improve the crawler's URL deduplication performance. Experimental results show that the strategy has good scalability and robustness and allows the performance advantages of a distributed system to be exploited more fully.
【Key words】 distributed crawler; dynamic load balancing; Scrapy-Redis; double buffering mechanism
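The two ideas named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: `CentralQueue`, `DoubleBufferedWorker`, and `CachedSeenSet` are hypothetical names, the shared task store and shared seen-set are in-memory stand-ins for whatever a real deployment (for example Scrapy-Redis) would use, and the buffer refill is synchronous here, whereas a real double buffer would refill the standby buffer in the background while the active one is being consumed.

```python
from collections import OrderedDict, deque


class CentralQueue:
    """In-memory stand-in (assumption) for the shared task store."""

    def __init__(self, urls=()):
        self._tasks = deque(urls)
        self.round_trips = 0  # counts simulated network calls

    def fetch_batch(self, size):
        self.round_trips += 1
        batch = []
        while self._tasks and len(batch) < size:
            batch.append(self._tasks.popleft())
        return batch

    def push_batch(self, urls):
        self.round_trips += 1
        self._tasks.extend(urls)


class DoubleBufferedWorker:
    """Consumes from one buffer while the other collects new tasks."""

    def __init__(self, central, batch_size=4):
        self.central = central
        self.batch_size = batch_size
        self.active = deque()   # buffer currently being consumed
        self.standby = deque()  # buffer collecting locally produced tasks

    def add_local(self, url):
        # Newly discovered URLs stay local; only a large surplus is
        # shipped out, as one coarse-grained batch.
        self.standby.append(url)
        if len(self.standby) >= 2 * self.batch_size:
            surplus = [self.standby.pop() for _ in range(self.batch_size)]
            self.central.push_batch(surplus)

    def next_task(self):
        if not self.active:
            if not self.standby:
                # One round trip fetches a whole batch, not a single URL.
                self.standby.extend(self.central.fetch_batch(self.batch_size))
            self.active, self.standby = self.standby, self.active
        return self.active.popleft() if self.active else None


class CachedSeenSet:
    """Local LRU cache in front of a shared seen-set (space for time)."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.local = OrderedDict()  # recently seen URLs, oldest first
        self.remote = set()         # stand-in for the shared seen-set
        self.remote_lookups = 0

    def seen_before(self, url):
        """Return True if url was already recorded; otherwise record it."""
        if url in self.local:
            self.local.move_to_end(url)
            return True
        self.remote_lookups += 1
        known = url in self.remote
        self.remote.add(url)
        self.local[url] = True
        if len(self.local) > self.capacity:
            self.local.popitem(last=False)  # evict least recently used
        return known
```

In this sketch, a worker with `batch_size=4` drains eight queued URLs with two calls to the central store instead of eight, and a repeated URL that is still in the local cache is rejected without any remote lookup. The batch size and cache capacity are illustrative values, not figures from the paper.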
- 【Source】 Computer & Digital Engineering (计算机与数字工程), 2022, Issue 08
- 【Classification No.】 TP301.6