Design of a News Vertical Search Engine (新闻垂直搜索引擎的设计)
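【Author】 Wen Bin (文斌)
【Advisor】 Hu Wenqiang (胡雯蔷)
【Author information】 Huazhong University of Science and Technology, Software Engineering, 2007, Master's degree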
【Abstract】 We live in an era of information explosion. While people enjoy the convenience the Internet brings, they face the problem of finding the information they need, accurately and quickly, in a vast amount of content; Internet search engines emerged in response. How well a search engine's spider adapts to the wide variety of web pages, and how efficient its word segmentation is, are among the important criteria of a search engine's quality. Building a spider that handles diverse pages, an efficient segmentation program, and a fast index is therefore important for the rapid development and performance optimization of a search engine. This thesis first studies and analyzes the concept, principles, components, and workflow of vertical search engines, compares their strengths and weaknesses with those of other kinds of search engines, and points out the market and development direction of vertical search. It then analyzes and designs a vertical search engine from three angles (the web spider, the word segmentation module, and the indexing module) and makes appropriate improvements based on practical requirements. The spider downloads, parses, and normalizes news, blogs, and forums from the mainstream portal sites. The segmentation module segments text only against specific keywords, which reduces system overhead in this respect and prepares the input for indexing. The indexing module builds the index and performs retrieval based on the segmentation output. The spider module was built with a stepwise optimization approach, so that in later development and maintenance, adding parsing support for new pages becomes progressively more efficient, upgrading the keyword lexicon is relatively simple, and segmentation efficiency and index build speed improve accordingly. This provides a solid foundation for extending functionality, shortening the development cycle, and adapting to new requirements.
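The keyword-restricted segmentation and the index built from it can be illustrated with a minimal sketch. The abstract does not say which algorithms the thesis uses, so everything below is an assumption made for illustration: forward maximum matching stands in for the segmentation step, a plain in-memory inverted index stands in for the indexing module, and the names `segment_keywords`, `build_index`, and `search` are hypothetical.

```python
from collections import defaultdict


def segment_keywords(text, lexicon):
    """Return the lexicon keywords found in text, scanning left to right.

    Assumed technique: forward maximum matching. At each position the
    longest lexicon entry is preferred; characters that start no keyword
    are skipped, so only the specific keywords are ever emitted.
    """
    max_len = max((len(word) for word in lexicon), default=0)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest possible candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in lexicon:
                match = candidate
                break
        if match:
            tokens.append(match)
            i += len(match)
        else:
            i += 1  # not part of any keyword: skip this character
    return tokens


def build_index(docs, lexicon):
    """Build an inverted index: keyword -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in segment_keywords(text, lexicon):
            index[token].add(doc_id)
    return index


def search(index, keyword):
    """Return the sorted ids of documents that contain the keyword."""
    return sorted(index.get(keyword, ()))


# Hypothetical lexicon and documents, for illustration only.
lexicon = {"搜索", "搜索引擎", "新闻"}
docs = {1: "新闻搜索引擎的设计", 2: "搜索技术"}
index = build_index(docs, lexicon)
```

With this sample data, `search(index, "搜索引擎")` returns `[1]` and `search(index, "搜索")` returns `[2]`, because the longest match wins at each position. Restricting segmentation to a lexicon keeps the index small, and under this sketch upgrading the keyword lexicon only means changing the `lexicon` set and rebuilding the index.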
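- 【Online publication contributor】 Huazhong University of Science and Technology 【Online publication issue】 2009, No. 05
- 【Classification number】 TP391.3
- 【Times cited】 10
- 【Downloads】 386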