Design and Implementation of Web Crawlers Based on the Coroutine Model (基于coroutine模型的网络爬虫设计与实现)

【Authors】 QIU Jing (仇晶); DING Renshuang (丁任霜); ZHANG Guanghua (张光华); ZHANG Hongbin (张红斌)

【Affiliations】 Advanced Technology Research Institute of Cyberspace, Guangzhou University; School of Information Science and Engineering, Hebei University of Science and Technology

【Abstract】 Web crawlers are widely used in Chinese information processing: they crawl data from the domains relevant to the problem at hand, providing the basis for subsequent processing. The traditional multithreaded model has clear limitations when handling high concurrency and large numbers of blocking I/O operations. To address these problems, this paper proposes a solution based on the coroutine model. It describes the basic principles and implementation methods of coroutines in some detail and presents a complete implementation of a coroutine-based web crawler. Experiments show that the scheme effectively reduces system load and improves crawling efficiency.

【Key words】 coroutine; crawler; multithreading; blocking
【Funding】 Natural Science Foundation of Hebei Province (F2012208016)
  • 【Source】 Journal of Hebei Normal University (Natural Science Edition) (河北师范大学学报(自然科学版)), 2018, Issue 03
  • 【Classification code】 TP391.1
  • 【Times cited】 2
  • 【Downloads】 97