中文网页定题采集及分类研究
Research on Topic-specific Gathering and Classification of Chinese WebPages
【Author】 宗校军;
【Author information】 Huazhong University of Science and Technology, Systems Engineering, 2006, PhD
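【Abstract (Chinese version, translated)】 The network is profoundly changing our lives. The Internet has grown into the largest information repository in the world today, and finding the required information quickly and accurately in this vast body of resources has become a major difficulty for network users. Web-based information gathering and the related information processing have therefore drawn increasing attention. Traditional Web information gathering collects far too many pages, the content of those pages is too disordered, and the process consumes very large amounts of system and network resources. The dispersed state and dynamic change of Internet information also trouble information gathering. A topic-specific search engine restricts information retrieval to a particular topic domain and provides retrieval services for topic-relevant information, so the number of pages to be gathered for a given topic is greatly reduced and the topic is uniform. Compared with a general-purpose search engine, a topic-specific search engine searches a smaller scope, so precision and recall are easier to guarantee. This thesis studies the key preliminary technologies for building a topic-specific search engine: topic-specific Web information gathering and classification. The main content is as follows. Through a study of Web structure and Web link characteristics, some regularities that are useful in topic-specific Web information gathering are analyzed. Metadata is defined, and several basic kinds of hyperlinks and their metadata types are discussed. Web page information extraction is studied, common types of metadata are analyzed, and the metadata types suitable as a basis for topic-specific gathering are determined. The thesis discusses how to obtain a set of topic-related terms through metadata-based topic expansion, including stop-word filtering, extraction of candidate topic terms, and relevance-based filtering. It focuses on using an improved Apriori algorithm to perform association mining on the metadata database in order to extract candidate topic terms, and gives an iterative algorithm for topic-term association mining and filtering, that is, topic expansion. Experiments show that the metadata processing strategies proposed in this thesis extract and expand topics well and provide a good precondition for more effective topic-specific Web information gathering. A topic-specific information gathering system based on Web metadata is presented and described. The classic relevance-judgment algorithms based on hyperlink analysis, HITS and PageRank, are described and analyzed; several relevance-judgment algorithms based on Web metadata are given; and HITS and PageRank are improved using Web metadata, yielding the M-PageRank and M-HITS algorithms. The performance of the algorithms is tested and compared, and the experiments verify that the proposed algorithms provide a good precondition for topic-specific retrieval. The foundations of text classification, and the characteristics and special handling of Web pages in text classification, are discussed. HTML documents are represented with TFE, the influence of the structure in which a term occurs in a semi-structured document is taken into account, the weighting function that reflects the weight of a feature term in a page is revised, and extended text is introduced as a content supplement for page classification. Improved naïve Bayes and support vector machine methods for Web page classification, which consider both document structure and document content, are studied, and experiments verify that both methods perform well. Through this study of topic-specific Web gathering and classification, the thesis makes some enhancements and improvements in techniques and methods; the proposed methods and improved algorithms achieve good experimental results, and several of the conclusions offer guidance for both theory and practice.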
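The abstract names Apriori-based association mining over the metadata database as the way candidate topic terms are extracted. The abstract does not give the thesis's improved variant, so the following is only a minimal sketch of the textbook Apriori procedure, under the assumption that each page's metadata has already been reduced to a set of terms with stop words removed. The function names `frequent_itemsets` and `expand_topic`, the sample data, and the support threshold are all hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support count} for every itemset found in at
    least `min_support` transactions (plain Apriori, no refinements)."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single terms.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(current)

    k = 2
    while current:
        candidates = set()
        # Join step: merge frequent (k-1)-itemsets into k-itemsets.
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset must itself be frequent.
            if all(frozenset(sub) in current
                   for sub in combinations(union, k - 1)):
                candidates.add(union)
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        result.update(current)
        k += 1
    return result


def expand_topic(seed, transactions, min_support):
    """Hypothetical helper: terms that share a frequent itemset with `seed`."""
    related = set()
    for itemset in frequent_itemsets(transactions, min_support):
        if seed in itemset:
            related |= itemset
    related.discard(seed)
    return related


# Toy metadata term sets, one per page (illustrative data only).
pages = [
    {"search", "engine", "crawler"},
    {"search", "engine", "index"},
    {"search", "crawler", "engine"},
    {"football", "score"},
]
print(expand_topic("search", pages, min_support=2))
```

In an iterative topic-expansion loop of the kind the abstract describes, the related terms returned here would presumably be filtered for relevance and then used as seeds for the next round; that filtering step is not shown.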
【Abstract】 The network is profoundly changing our lives, and the Internet has become the largest information repository in the world. However, it is difficult for users to find what they need in this expanding body of information quickly and precisely, so gathering and processing information from the World Wide Web is attracting more and more attention.

Because Web pages are enormous in number and disordered in content, traditional large-scale Web crawling consumes excessive system and network resources. The decentralization and dynamic change of Web information pose further problems for information gathering. Topic-specific Web search engines are a new direction in information retrieval. Rather than collecting and indexing all accessible Web documents, a topic-specific Web search system restricts its crawl boundary to the links that are likely to be most relevant to the given topic, so the precision and recall of search are easier to guarantee. This thesis addresses the key technologies that precede the construction of such an engine, namely topic-specific information gathering and the classification of Chinese Web pages, as follows.

Based on an analysis of Web structure and Web links, some useful rules are summarized for more effective topic-specific information gathering. Web metadata is defined, and several kinds of hyperlinks and their metadata are discussed. Information extraction from Web pages is studied, and the kinds of Web metadata appropriate for topic-specific information gathering are identified.

Topic expansion is discussed as a way to obtain the set of terms relevant to a topic, including stop-word filtering, extraction of candidate topic terms, and filtering by relevance metrics. Candidate topic terms are extracted by association mining on the metadata database, and an iterative algorithm for mining and filtering relevant topic terms (topic expansion) is proposed. Experimental results indicate that the algorithm and strategies extract and expand topics well and provide a good precondition for topic-specific information gathering.

Based on Web metadata, a topic-specific information gathering system is designed and its overall process is described. Two classic relevance-judgment algorithms based on hyperlink analysis, Hypertext Induced Topic Search (HITS) and PageRank, are discussed and analyzed. A set of algorithms that exploit hyperlink metadata to keep the crawler focused on the topic is presented. The usefulness of hyperlink metadata for improving HITS and PageRank is demonstrated, and two improved algorithms, M-PageRank and M-HITS, are proposed. The performance of the various algorithms is tested and compared, and experimental results indicate that the proposed approach provides a good precondition for topic-specific search.

Text classification is reviewed, together with the particular characteristics of Web pages as classification input. To reflect the semi-structured format of Web documents, a document representation method called TFE is adopted. The weighting functions for feature terms are revised to account for the structural position of a term, and extended anchor text is introduced as a supplement to page content when classifying Web pages. Taking both the structure and the content of Web documents into account, improved naïve Bayes and Support Vector Machine (SVM) classifiers are put forward. The experimental results show that both approaches perform well.

This thesis proposes improvements in the techniques for topic-specific information gathering and the classification of Chinese Web pages. The proposed approaches and algorithms achieve good experimental results, and some of the conclusions drawn provide guidance for both theory and practice.
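For readers unfamiliar with the baseline that M-PageRank builds on, the sketch below shows standard PageRank computed by power iteration. It is the textbook algorithm only; the abstract does not specify how Web metadata enters M-PageRank, so no metadata weighting is attempted here. The function name `pagerank`, the damping factor of 0.85, the iteration count, and the three-page sample graph are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Standard PageRank by power iteration.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict of page -> rank; the ranks sum to 1.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the uniform "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's damped rank among its out-links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Illustrative link graph (hypothetical pages a, b, c).
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

A topic-specific variant would typically bias either the jump distribution or the link weights toward topic-relevant pages; which of these the thesis uses is not stated in the abstract.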
【Key words】 Web Information Gathering; Topic-specific Information Gathering; Web Metadata; Topic Expansion; Relevance Judgment; Web Page Classification;