A Mixed-Strategy Topic Crawler Based on Network Log Analysis
【Abstract】 To cope with the dynamic nature of topics and to cover them more completely, this paper proposes a mixed-strategy topic crawler based on network log analysis. First, the crawler analyzes network logs to discover new seed pages, which extends the topic's web community, and to mine user interest, which describes the topic more precisely. Then, starting from the new seed set, the crawler applies a mixed strategy that takes user interest into account to filter pages. Experimental results show that the crawler can effectively collect more topic pages.
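The abstract gives no implementation details, so the following is only a minimal sketch of the two log-analysis steps and one possible scoring rule, under stated assumptions. It assumes the logs are in Common/Combined Log Format, that user interest can be approximated by search terms in a referrer's `q` parameter, and that the "mixed strategy" is a weighted blend of topic overlap and interest weight. The names `analyze_log`, `mixed_score`, `topic_terms`, and `alpha` are hypothetical and do not come from the paper.

```python
import re
from collections import Counter
from urllib.parse import urlparse, parse_qs

# Assumption: Common/Combined Log Format (the abstract does not state the format).
LOG_RE = re.compile(
    r'"(?:GET|POST) (?P<url>\S+) [^"]*" (?P<status>\d{3}) \S+'
    r'(?: "(?P<referrer>[^"]*)")?'
)


def analyze_log(lines, topic_terms, top_n=2):
    """Return (seed_urls, interest_weights) mined from access-log lines."""
    hits = Counter()      # successful requests per on-topic URL
    interest = Counter()  # search terms that led users to on-topic pages
    for line in lines:
        m = LOG_RE.search(line)
        if not m or m.group("status") != "200":
            continue
        url = m.group("url")
        # Crude topic test (assumption): a topic term appears in the URL.
        if not any(t in url.lower() for t in topic_terms):
            continue
        hits[url] += 1
        ref = m.group("referrer") or ""
        # Assumption: the referring search engine passes the query as `q`.
        for query in parse_qs(urlparse(ref).query).get("q", []):
            interest.update(query.lower().split())
    seeds = [u for u, _ in hits.most_common(top_n)]
    total = sum(interest.values()) or 1
    return seeds, {t: c / total for t, c in interest.items()}


def mixed_score(page_terms, topic_terms, interest_weights, alpha=0.5):
    """Hypothetical mixed score: topic overlap blended with user-interest weight."""
    page = set(page_terms)
    topic_part = len(page & set(topic_terms)) / max(len(topic_terms), 1)
    interest_part = sum(w for t, w in interest_weights.items() if t in page)
    return alpha * topic_part + (1 - alpha) * interest_part
```

In this sketch, a crawler would seed its frontier with the returned URLs and keep a fetched page only if `mixed_score` exceeds a chosen threshold. Both the threshold and `alpha` would need tuning and are not taken from the paper.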
【Key words】 topic crawler; network log; web community; user interest; mixed strategy
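【Funding】 Awarding body: National Natural Science Foundation of China, Information Sciences Division II (formerly the Computer Science discipline) (90612016); project title: Computational Chemistry e-Science Research and Demonstration Application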
- 【Source】 Microcomputer Information (微计算机信息), 2009, No. 03
- 【Classification number】 TP393.092