A Study of Web 2.0 Community-Oriented Crawling Techniques (面向Web2.0社区的爬虫关键技术研究)

【Author】 Gao Hui (高晖)

【Advisor】 Sun Jianling (孙建伶)

【Author information】 Zhejiang University, Computer Application Technology, 2011, Master's thesis

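【Summary】 Web 2.0 communities are currently the most popular Internet applications; SNS, microblogs, online Q&A sites, and post bars (贴吧) are typical examples. On these sites users take part in creating and editing content, which changes the earlier one-way publishing model. The sites also make heavy use of rich-client technologies such as Ajax to improve the user experience, so a page is no longer loaded in a single pass; its final view takes shape only through user interaction. Because content in Web 2.0 communities comes from more diverse channels, the timeliness of information and the uncertainty of publishing patterns are much greater than on traditional sites, information quality is uneven, and client-side dynamic content is hard to obtain automatically. All of this challenges traditional search engines: existing crawler technology must improve in real-time search and in indexing client-side dynamic content to adapt to the new wave that Web 2.0 communities bring to the Internet. For real-time crawling, this thesis focuses on a crawler scheduling strategy based on publishing-pattern prediction. It refines the local index quality criterion by introducing a content weight evaluation system for community pages and combining it with index delay as a new metric, thereby reducing the crawler scheduling problem to a local index quality optimization problem, and it mines the optimal crawl plan from a site's historical publishing data. For Ajax crawling, because a single Ajax page contains multiple states, this thesis adopts the classic state transition graph model to model Ajax sites and introduces XPath-feature-based invalid element detection and XHR-listening-based asynchronous request optimization. These address the weaknesses of the original algorithm: many irrelevant states, state explosion, difficulty recognizing duplicate states, and poor performance. Compared with traditional crawlers, the approach also achieves a large improvement in page recall. Finally, this thesis presents the design and implementation of a prototype crawler system for Web 2.0 communities; its successful application to a campus news search engine validates the correctness and effectiveness of the approach.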
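The scheduling idea, choosing crawl times so that content weight combined with index delay is minimized, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names `weighted_delay` and `best_schedule`, the toy post list standing in for mined historical publishing data, the fixed crawl budget, and the exhaustive search. The thesis's actual weight evaluation system and optimization method are not described in this record.

```python
from bisect import bisect_left
from itertools import combinations


def weighted_delay(posts, crawl_times):
    """Sum of weight * (first crawl at or after publish time - publish time).

    posts: iterable of (publish_time, weight) pairs (hypothetical data).
    """
    crawl_times = sorted(crawl_times)
    total = 0.0
    for publish_time, weight in posts:
        i = bisect_left(crawl_times, publish_time)
        if i == len(crawl_times):
            raise ValueError("post published after the last crawl")
        total += weight * (crawl_times[i] - publish_time)
    return total


def best_schedule(posts, candidate_times, budget, horizon):
    """Exhaustively choose `budget` crawl times, the last one fixed at `horizon`.

    Brute force is only workable for small inputs; it is used here because
    it is easy to verify, not because the thesis uses it.
    """
    free = [t for t in candidate_times if t < horizon]
    best = None
    for combo in combinations(free, budget - 1):
        plan = combo + (horizon,)
        cost = weighted_delay(posts, plan)
        if best is None or cost < best[0]:
            best = (cost, plan)
    return best


# Toy history: (publish hour, content weight). Values are invented.
posts = [(1, 1.0), (2, 5.0), (9, 1.0), (10, 2.0)]
cost, plan = best_schedule(posts, range(0, 13), budget=3, horizon=12)
print(cost, plan)
```

With this toy history the search places crawls right after the heavily weighted posts, which is the behaviour a weighted-delay metric is meant to encourage.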

【Abstract】 Web 2.0 communities are currently the most popular Internet applications; social networking sites, microblogs, online Q&A sites, and post bars are typical examples. These websites are characterized by user participation in content creation and editing. In addition, they apply Ajax and other rich-client technologies extensively to enhance the user experience. Because the information sources of Web 2.0 communities are diverse, their uncertain publishing patterns, time-sensitive information, varied content quality, and abundance of dynamic scripting all become prominent problems. These issues prevent traditional search engines from performing effective information retrieval, so existing crawling techniques need to improve in both real-time search and the indexing of client-side dynamic content to adapt to the new wave of Internet evolution. For real-time crawling, we focus on the crawl scheduling optimization problem based on publishing-pattern prediction. We refine the local index quality metric by introducing a new community content weight evaluation system and combining it with a delay metric. We schedule the crawler to minimize the weighted delay and derive an optimized crawl plan from the historical publishing data of a specific community. For Ajax crawling, since a single Ajax page contains multiple states, we use the classic state transition graph to model Ajax sites. By introducing heuristic invalid-element detection and XMLHttpRequest monitoring, we improve both crawling performance and recall. Finally, we present a Web 2.0 community-oriented crawler prototype and apply it successfully to a campus news search engine, which demonstrates the effectiveness of our approach in a practical application.
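The state transition graph model for Ajax pages can be sketched in Python as a breadth-first exploration with duplicate-state detection. This is only an illustration under stated assumptions: `state_id`, `crawl`, the callbacks `clickables` and `fire`, the `skip_prefixes` XPath filter, and the `SITE` dictionary are all hypothetical stand-ins for a real browser driver and for the thesis's own heuristics, which this record does not specify.

```python
import hashlib
from collections import deque


def state_id(dom: str) -> str:
    """Fingerprint a DOM snapshot, collapsing whitespace runs first."""
    normalized = " ".join(dom.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()[:12]


def crawl(initial_dom, clickables, fire, skip_prefixes=()):
    """Breadth-first exploration of states reachable from initial_dom.

    clickables(dom) -> iterable of XPath strings for candidate elements
    fire(dom, xpath) -> DOM string after triggering that element
    Both callbacks are hypothetical stand-ins for a browser driver.
    """
    start = state_id(initial_dom)
    states = {start: initial_dom}
    edges = []  # (from_state, xpath, to_state)
    queue = deque([start])
    while queue:
        sid = queue.popleft()
        dom = states[sid]
        for xpath in clickables(dom):
            # Assumed filter: skip elements under XPath prefixes known to be useless.
            if xpath.startswith(tuple(skip_prefixes)):
                continue
            new_dom = fire(dom, xpath)
            nid = state_id(new_dom)
            edges.append((sid, xpath, nid))
            if nid not in states:  # duplicate states are not explored again
                states[nid] = new_dom
                queue.append(nid)
    return states, edges


# Toy site: each DOM maps clickable XPaths to the DOM they produce.
SITE = {
    "<div>home</div>": {
        "/html/body/ul/li[1]/a": "<div>page 2</div>",
        "/html/body/div[@id='ad']/a": "<div>ad</div>",
    },
    "<div>page 2</div>": {
        "/html/body/ul/li[1]/a": "<div>home</div>",
        "/html/body/ul/li[2]/a": "<div>page  2</div>\n",
    },
}
states, edges = crawl(
    "<div>home</div>",
    clickables=lambda dom: SITE.get(dom, {}).keys(),
    fire=lambda dom, xp: SITE[dom][xp],
    skip_prefixes=("/html/body/div[@id='ad']",),
)
print(len(states), len(edges))
```

Hashing a normalized DOM is one simple way to recognize duplicate states, and the prefix filter keeps the graph from growing through elements that never lead to useful content. A real crawler would need a more tolerant comparison, since pages often differ in timestamps or counters.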

  • 【Online publication submitter】 Zhejiang University
  • 【Online publication issue】 2011, Issue 07