主题Web信息采集与分析技术研究

Study on Topic-Specific Web Information Collection and Analysis Technology

【Author】 Tang Zhi (唐志)

【Advisor】 Wang Chengliang (王成良)

【Author information】 Chongqing University, Computer Software and Theory, 2006, Master's thesis

【Abstract】 Search engines have become the main tool users rely on to find information on the Web. A traditional general-purpose search engine uses a Crawler program to collect information from the entire Web; its drawbacks are untargeted collection, a high rate of invalid (dead) pages, and an inability to meet the needs of specific professional groups. What is needed instead is a topic-oriented (focused) search engine with fine-grained, accurate classification, comprehensive and in-depth data, and timely updates. This thesis designs a focused search engine and studies existing topic-specific Web information collection and analysis techniques in depth. Crawler search strategies are classified according to the method used to evaluate the value of a link, and the characteristics, advantages, and disadvantages of each class are analyzed and compared. An analysis of several common Web community structures shows that existing topic-specific collection techniques based on local information suffer from two tensions: at the technical level, between "local optimum" and "topic drift"; and in the collected results, between recall and precision. The thesis therefore applies a genetic algorithm, which is based on probabilistic selection and is general-purpose, highly adaptable, and global in its search, to address this problem. The main work is as follows:
① Based on the differences in purpose and implementation between traditional general-purpose search engines and focused search engines, a focused search engine is proposed, and the function and implementation of each part of the system are described.
② Information collection and analysis techniques and information retrieval techniques are studied, with the main emphasis on topic-specific Web information collection and analysis. Comparison and analysis identify the strengths and weaknesses of the existing techniques.
③ The concepts, characteristics, implementation methods, and mathematical principles of genetic algorithms are studied, and their application to topic-specific Web information collection is proposed in order to improve the performance of the collection system.
④ Through a detailed analysis of the similarities and differences between genetic algorithms and topic-specific Web information collection techniques, the feasibility of applying a genetic algorithm in an information collection system is discussed, together with the issues that require attention. A framework and implementation steps for the algorithm are proposed; its behavior is analyzed in depth and verified experimentally, indicating that the algorithm performs well and can effectively address the problems currently facing topic-specific Web information collection.
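The contrast the abstract draws between strategies driven purely by local information and selection "based on probability" can be illustrated with a small sketch. This is not the algorithm of the thesis, whose details are not given in this record; it only compares a greedy best-first pick with fitness-proportionate (roulette-wheel) selection, a standard genetic-algorithm selection operator. The names `frontier`, `greedy_pick`, and `roulette_pick`, the example URLs, and the relevance scores are all hypothetical.

```python
import random


def greedy_pick(frontier):
    """Best-first choice: always return the link with the highest score.

    Relying only on the local score can trap a crawler in a local optimum.
    """
    return max(frontier, key=frontier.get)


def roulette_pick(frontier, rng):
    """Fitness-proportionate selection: pick a link with probability
    proportional to its score, so low-scoring links still get explored."""
    urls = list(frontier)
    weights = [frontier[u] for u in urls]
    if sum(weights) == 0:
        # Assumption for this sketch: fall back to a uniform choice
        # when no link has a positive score.
        return rng.choice(urls)
    return rng.choices(urls, weights=weights, k=1)[0]


# Hypothetical crawl frontier: URL -> estimated topic relevance in [0, 1].
frontier = {
    "http://example.com/a": 0.70,
    "http://example.com/b": 0.25,
    "http://example.com/c": 0.05,
}

rng = random.Random(0)  # fixed seed so repeated runs behave the same
counts = {url: 0 for url in frontier}
for _ in range(10000):
    counts[roulette_pick(frontier, rng)] += 1

print("greedy always picks:", greedy_pick(frontier))
print("roulette pick counts:", counts)
```

Under these assumptions the greedy strategy would visit only the top-scoring link, while the probabilistic strategy spreads its visits roughly in proportion to the scores, which is one way a crawler can trade some precision for broader coverage.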

  • 【Online publication contributor】 Chongqing University
  • 【Online publication issue】 2007, Issue 01
  • 【Classification number】 TP391.3
  • 【Citations】 7
  • 【Downloads】 433