Design and Research of a Distributed Crawler Scheduling Strategy Based on Double Buffering

【Authors】 LU Zhao; SHI Jun; ZHANG Yaowu; WANG Qi

【Affiliations】 School of Mathematics and Information Technology, Yuncheng University; School of Computer Science, Shaanxi Normal University

【Abstract】 The rapid growth of the Internet has broadened the use of big data and made distributed crawlers increasingly important. Mainstream open-source crawler frameworks currently do little to optimize network communication overhead and lack an effective way to reduce it. Exploiting the fact that, in a peer-to-peer architecture, each crawler is both a consumer and a producer of tasks, this paper proposes executing tasks locally wherever possible. A dynamic load-balancing strategy for coarse-grained tasks, implemented with double buffering, effectively reduces communication frequency, and a URL deduplication scheme based on the caching principle trades space for time to improve URL deduplication performance. Experimental results show that the strategy is scalable and robust and allows the performance advantages of a distributed system to be exploited more fully.
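
The two ideas named in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation, which the abstract does not describe in detail. The class names, the `fetch_batch` callable (standing in for a bulk pull from a shared queue), and the `remote_seen` set (standing in for a shared deduplication store) are all assumptions made for illustration.

```python
from collections import OrderedDict, deque


class DoubleBufferedTasks:
    """Hypothetical sketch: consume one local buffer while the other fills.

    Locally discovered URLs go into the standby buffer. The shared queue is
    contacted only when both buffers are empty, and then one whole batch is
    pulled at once (coarse-grained transfer, fewer round trips).
    """

    def __init__(self, fetch_batch, batch_size=100):
        self.fetch_batch = fetch_batch  # assumed: returns up to n URLs
        self.batch_size = batch_size
        self.active = deque()           # buffer being consumed
        self.standby = deque()          # buffer being filled locally
        self.remote_calls = 0           # counts contacts with the shared queue

    def push_local(self, url):
        # The crawler is also a producer: keep new work on this node.
        self.standby.append(url)

    def next_task(self):
        if not self.active:
            # Swap buffers so locally produced work is consumed first.
            self.active, self.standby = self.standby, self.active
        if not self.active:
            # Both buffers are empty: fetch one large batch remotely.
            self.active.extend(self.fetch_batch(self.batch_size))
            self.remote_calls += 1
        return self.active.popleft() if self.active else None


class CachedSeenSet:
    """Hypothetical sketch: an LRU cache in front of a shared 'seen' set.

    Extra local memory is spent so that repeated URLs are rejected without
    a lookup in the shared store (space traded for time).
    """

    def __init__(self, remote_seen, capacity=10000):
        self.remote_seen = remote_seen  # assumed stand-in for a shared store
        self.capacity = capacity
        self.cache = OrderedDict()
        self.remote_lookups = 0

    def is_new(self, url):
        if url in self.cache:
            self.cache.move_to_end(url)  # mark as recently used
            return False
        self.remote_lookups += 1
        new = url not in self.remote_seen
        self.remote_seen.add(url)
        self.cache[url] = True
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return new
```

In this sketch, `remote_calls` and `remote_lookups` exist only to make the saved communication visible. A real deployment would replace the in-memory stand-ins with network calls to the shared queue and store.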

【Key words】 distributed crawler; dynamic load balancing; Scrapy-Redis; double buffering mechanism

【Funding】 Supported by the Yuncheng University Applied Research Project (No. XK-2018039/CY-2019038)

【Source】 Computer & Digital Engineering (计算机与数字工程), 2022, Issue 08

【Classification No.】 TP301.6
【Downloads】 50
