Topic-Focused Web Crawler System Based on an Improved Vector Space Model
Topic-Focused Web Crawler System
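【Summary】 This paper describes in detail the key techniques for implementing a topic-focused web crawler. The traditional vector space model is improved into a self-adaptive vector space model, and page relevance is computed from both page content and links. On this basis, a topic-focused web crawler system is designed and implemented. To address the problem of incomplete page capture during topic-focused crawling, an improved page search strategy that combines manual control with a genetic factor is also proposed. Finally, experimental results are presented that demonstrate the feasibility and advantages of the system.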
【Abstract】 This paper first examines the key techniques of topic-focused web crawlers, then designs and implements a crawler system using an improved self-adaptive vector space model. The system analyzes documents in terms of both text and links. The paper also proposes a web search strategy based on a gene factor combined with manual control, which can solve the problem of blocked search paths. Finally, we provide experimental results that demonstrate the feasibility and advantages of our system in terms of recall and precision.
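The abstract does not give the relevance formula, so the following is only a minimal sketch of the general idea of scoring a page from both its content and its links with a plain vector space model. It is not the paper's self-adaptive model: the function names, the whitespace tokenization, the use of anchor text as the link signal, and the weight `alpha` are all assumptions made for illustration.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a raw term-frequency vector from whitespace-separated tokens."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as unrelated
    return dot / (norm_u * norm_v)


def page_relevance(topic, page_text, anchor_texts, alpha=0.7):
    """Weighted mix of content relevance and link (anchor text) relevance.

    alpha is a hypothetical weight, not a value taken from the paper.
    """
    topic_vec = term_vector(topic)
    content_score = cosine(topic_vec, term_vector(page_text))
    link_score = cosine(topic_vec, term_vector(" ".join(anchor_texts)))
    return alpha * content_score + (1 - alpha) * link_score
```

A crawler built along these lines might call `page_relevance("web crawler", page_text, anchors)` for each fetched page and follow only the links of pages whose score exceeds a chosen threshold.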
【Keywords】 topic crawler; relevance calculation; search strategy; genetic factor;
【Key words】 topic-focused web crawler; relevance calculation; search strategy; gene factor;
- 【Source】 计算机系统应用 (Computer Systems & Applications), 2013, Issue 07
- 【Classification No.】 TP391.3
- 【Times cited】 23
- 【Downloads】 279