Research on Deep Network Crawler Based on Scrapy

【Authors】 LIU Yu; ZHENG Cheng-huan

【Affiliations】 Zhejiang University; Yanbian University, Jilin

【Abstract】 With the arrival of the big data era, web crawlers have become a widely used technology. Whether for projects, scientific research, startups, or writing papers, obtaining large amounts of data and analyzing it is indispensable. However, the deep web holds hundreds, or even more than a thousand, times as much data as the surface web. Traditional crawlers, which collect only surface web data, can no longer meet our needs. At the same time, deep web data usually lacks complex tag structures, which makes it clearer and cleaner, so an in-depth study of deep web crawlers is well worth pursuing. This paper studies deep web crawling with Python's Scrapy framework: it analyzes the characteristics of the deep web to formulate a suitable Scrapy crawling strategy, and then validates that strategy through hands-on practice.

【Key words】 Deep web; Web crawler; Scrapy; Python
  • 【Source】 软件 (Computer Engineering & Software), 2017, Issue 07
  • 【Classification No.】 TP393.092
  • 【Citations】 68
  • 【Downloads】 904