Research and Implementation of Dynamic Adaptive Topical Crawler
【Abstract】 Traditional topical crawlers cope poorly with a dynamically changing Internet: their topic knowledge is incomplete, domain knowledge goes out of date, and topical resource centers shift over time. This paper proposes a topical crawler that adapts dynamically to Internet information. Its TopicHub algorithm selects seed URLs dynamically; compared with traditional topical crawlers that use static seed URLs, it improves crawling efficiency by more than 7% and recall by more than 5%. In addition, to address the incomplete topic coverage and changing domain knowledge of a static ontology library, the paper proposes the SDTP algorithm, a topic algorithm that combines a static ontology library with dynamic semantics and can dynamically expand domain semantic information. Its precision is 13% higher than that of the traditional algorithm based on a static ontology library, and 4% higher than that of the algorithm based on the vector space model (VSM).
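For readers unfamiliar with the VSM baseline named in the abstract, the following is a minimal sketch of how a vector space model is commonly used to score a page against a topic: both texts become term-frequency vectors and are compared by cosine similarity. It is not the paper's SDTP or TopicHub algorithm, whose details the abstract does not give. The function names, the simple tokenizer, and the 0.3 threshold are illustrative assumptions.

```python
import math
import re
from collections import Counter


def term_vector(text: str) -> Counter:
    """Build a lowercased bag-of-words term-frequency vector (assumed tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as unrelated to everything
    return dot / (norm_a * norm_b)


def is_relevant(page_text: str, topic_text: str, threshold: float = 0.3) -> bool:
    """Keep a page when its similarity to the topic reaches a hypothetical threshold."""
    score = cosine_similarity(term_vector(page_text), term_vector(topic_text))
    return score >= threshold
```

A crawler built on this baseline would call `is_relevant` on each fetched page and follow links only from pages that pass. The fixed topic vector is the limitation the abstract attributes to static approaches.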
- 【Source】 Computer & Digital Engineering (计算机与数字工程), 2019, Issue 05
- 【Classification No.】 TP391.3
- 【Citations】 1
- 【Downloads】 166