Design and Implementation of Web Crawlers Based on the Coroutine Model
【Abstract】 Web crawlers are widely used in Chinese information processing: they crawl data from the domains relevant to the problem at hand, providing the basis for subsequent processing. The traditional multi-threaded model has obvious limitations when handling high concurrency and large numbers of blocking I/O operations. To address these problems, this paper proposes a solution based on the coroutine model. It discusses the basic principles and implementation methods of coroutines in detail and gives a complete implementation of a coroutine-based web crawler. Experimental results show that the scheme effectively reduces system load and improves crawling efficiency.
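The contrast the abstract draws between threads and coroutines for I/O-bound crawling can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not state their language or library, so Python's standard `asyncio` is assumed here, and `fetch`, `crawl`, `run_demo`, and the example URLs are hypothetical names in which a timed sleep stands in for a real network request.

```python
import asyncio
import time


async def fetch(url: str, delay: float = 0.1) -> str:
    """Hypothetical stand-in for a network fetch; the sleep models an I/O wait."""
    await asyncio.sleep(delay)  # the coroutine yields control to the event loop here
    return f"<html>{url}</html>"


async def crawl(urls: list[str], max_concurrency: int = 100) -> dict[str, str]:
    """Fetch all URLs concurrently on one thread, capping in-flight requests."""
    sem = asyncio.Semaphore(max_concurrency)

    async def worker(url: str) -> tuple[str, str]:
        async with sem:  # at most max_concurrency fetches run at once
            return url, await fetch(url)

    pairs = await asyncio.gather(*(worker(u) for u in urls))
    return dict(pairs)


def run_demo(n: int = 200) -> tuple[int, float]:
    """Crawl n placeholder URLs and return (pages fetched, elapsed seconds)."""
    urls = [f"http://example.com/page/{i}" for i in range(n)]
    start = time.perf_counter()
    pages = asyncio.run(crawl(urls))
    return len(pages), time.perf_counter() - start


if __name__ == "__main__":
    count, elapsed = run_demo()
    print(f"fetched {count} pages in {elapsed:.2f}s on one thread")
```

Because each simulated fetch suspends at its `await` instead of blocking a thread, the waits overlap and the total time stays far below the sum of the individual delays, without one operating-system thread per request.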
- 【Source】 Journal of Hebei Normal University (Natural Science Edition), 2018, Issue 03
- 【Classification No.】 TP391.1
- 【Citations】 2
- 【Downloads】 97