A Focused Crawler Based on the CRN Classification Algorithm

【Authors】 JING Xuan; LU Hong-Ying

【Affiliation】 College of Information Science & Technology, Chengdu University of Technology

【Abstract】 The keys to focused crawling are assessing the similarity between a link's anchor text and the given topic, estimating the priority of links, and filtering out irrelevant links. This paper proposes a focused crawler that estimates link priority with the CRN (Classification by Rule-based Neighbors) classification algorithm, which is also introduced in the paper. Each link is assigned a category according to whether the page containing it is relevant to the topic and whether that page is a navigation page; its priority is then estimated from this category together with the similarity between its anchor text and the topic. The crawler consists of a relevant-page classifier, a navigation-page classifier, and a link priority evaluator based on the CRN classification algorithm. Experimental results show that the CRN-based focused crawler performs better than a standard focused crawler that estimates link priority from anchor text alone, making it better suited to information retrieval.
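The abstract does not give the details of the CRN evaluator, so the following is only a rough sketch of the surrounding pipeline, not the authors' method. It assumes a bag-of-words cosine similarity for anchor text, takes the two page-level decisions (topic-relevant, navigation) as precomputed booleans, and replaces the CRN evaluator with a hand-picked weighted score. The names `cosine_similarity`, `link_priority`, `Frontier`, `CATEGORY_SCORE`, `alpha`, and `threshold`, as well as all numeric values, are hypothetical.

```python
import heapq
import math
from collections import Counter


def cosine_similarity(text, topic):
    """Cosine similarity between the bag-of-words vectors of two strings."""
    a, b = Counter(text.lower().split()), Counter(topic.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical scores keyed by (page is relevant, page is navigation).
# The paper uses a CRN-based evaluator here; these constants are stand-ins.
CATEGORY_SCORE = {
    (True, True): 1.0,
    (True, False): 0.8,
    (False, True): 0.5,
    (False, False): 0.2,
}


def link_priority(anchor, topic, page_relevant, page_navigation, alpha=0.5):
    """Blend the link's page category with its anchor-text similarity."""
    category = CATEGORY_SCORE[(page_relevant, page_navigation)]
    return alpha * category + (1 - alpha) * cosine_similarity(anchor, topic)


class Frontier:
    """Crawl frontier that drops low-priority links and pops the best first."""

    def __init__(self, threshold=0.2):
        self._heap = []
        self._seen = set()
        self.threshold = threshold

    def push(self, url, priority):
        # Filter links at or below the threshold and URLs already queued.
        if url in self._seen or priority <= self.threshold:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (-priority, url))  # negate for a max-heap
        return True

    def pop(self):
        neg_priority, url = heapq.heappop(self._heap)
        return url, -neg_priority


topic = "focused web crawler"
frontier = Frontier()
frontier.push("http://example.com/a",
              link_priority("focused crawler tutorial", topic, True, True))
frontier.push("http://example.com/b",
              link_priority("sports news", topic, False, False))
print(frontier.pop())
```

Setting `alpha=0` in this sketch reduces it to an anchor-text-only ranking, which corresponds to the baseline the abstract compares against.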

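【Funding】 Key Project of the Sichuan Provincial Department of Education: Research on Constructing a Poverty Evaluation Index System and Precisely Identifying Poverty in Western China from a Big Data Perspective (17ZA0029)
  • 【Source】 电脑知识与技术 (Computer Knowledge and Technology), 2017, Issue 14
  • 【Classification Number】 TP391.3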