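Research and Application of General-Purpose Chinese-English Professional Search Engine Technology
【Author】 Liu Feng;
【Advisor】 Wang Xiukun;
【Author information】 Dalian University of Technology, Computer Application Technology, 2004, Master's degree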
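【Abstract (translated from the Chinese)】 As Internet use has spread, the information resources on the Internet have grown geometrically. This abundance of information also poses an important research problem: how to retrieve the information people need quickly and accurately from a vast body of resources. Web search engines emerged in response. In recent years, broad but imprecise general-purpose search engines have been unable to meet the need for professional information, and small, specialized search engines are becoming a development trend with broad application prospects. This thesis introduces the basic structure and principles of general-purpose search engines and analyzes the key technologies, working principles, implementation methods, and design principles of each component. It focuses on web robot (Robot) technology, Chinese word segmentation, the Vector Space Model (VSM), automatic text classification, Web data indexing, and Web data retrieval. On this basis, it studies in depth how each key technology is implemented. The implementation uses multithreading, feature extraction and weighting, and relevance ranking, which effectively improve the efficiency and quality of Web data collection, classification, and retrieval. Building on general-purpose search engine technology and targeting the characteristics of professional information search, the thesis designs a Chinese-English professional search engine using a specialization method that combines restricting the range of sites searched with filtering professional information through automatic classification. To make the engine widely applicable, the thesis adopts a generalized design so that the engine can easily be built into a professional search engine for any field. To improve the efficiency and quality of classification and segmentation, the engine uses the following key techniques: analyzing user logs to dynamically revise the lexicon, and periodically adding classified professional documents to dynamically expand the training document set. Compared with conventional segmentation and indexing techniques, the engine implements segmentation and counting of professional vocabulary simply and effectively by building a first-character view and a term view; by building a bidirectional index between documents and terms, it solves the difficulty of building and maintaining an inverted index and saves a large amount of storage space. Using Java as the development tool and Oracle8i as the database, the thesis implements a practical general-purpose Chinese-English professional search engine. After fairly thorough testing, the engine has been applied in the Human Brain Project and neuroinformatics research, a 973 preliminary research project of the national Ministry of Science and Technology.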
【Abstract】 With the gradual popularization and development of the Internet, its information resources are growing geometrically. This brings a wealth of information, but it also raises an important research problem: how to retrieve useful information from a tremendous volume of resources quickly and accurately. Web search engines emerged in response. Recently, general search engines have been unable to satisfy the need for professional information. Small, professional search engines are a development trend and have wide application prospects. The paper introduces the basic structure and principles of general search engines and analyzes the key technologies, working principles, implementation methods, and design principles of each component of a search engine. It places strong emphasis on web robot techniques, Chinese word segmentation, the vector space model (VSM), automatic text categorization, web information indexing, and web information retrieval. On the basis of these techniques, the paper studies in depth how each key technology is implemented. The implementation adopts multithreading, feature extraction and weighting, and similarity ranking. These techniques effectively increase the efficiency and quality of the collection, classification, and retrieval of web information. On the basis of general search engine techniques and according to the characteristics of professional searching, the paper designs a Chinese-English professional search engine. It mainly specializes a general search engine by limiting the range of sites searched and filtering professional information through automatic classification. At the same time, to make the design more widely applicable, the paper takes a generalized design approach, so that professional search engines for all kinds of fields can be constructed easily. To improve the efficiency and quality of classification and segmentation, the professional search engine uses several key techniques, such as dynamically revising the word database by analyzing retrieval logs and dynamically extending the training document set by adding classified professional documents. Compared with conventional Chinese segmentation and indexing technology, the paper uses simpler and more effective methods for each: Chinese segmentation based on database views, and a bidirectional index based on database tables. Following this design, a general professional search engine is implemented, using Java as the programming language and Oracle8i as the DBMS. After sufficient testing, the Chinese-English professional search engine has been applied to research on the Human Brain Project and neuroinformatics, one of the 973 preliminary research projects of the national Ministry of Science and Technology.
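The vector space model and similarity ranking named in the abstract can be illustrated with a minimal Java sketch. It is not the thesis's implementation: the class `VsmSketch`, its method names, and the use of raw term frequency as the weight are assumptions made for illustration, since the abstract does not give the weighting scheme. Tokens are assumed to be already segmented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class VsmSketch {
    // Build a raw term-frequency vector from already-segmented tokens.
    // (Assumption: the thesis's actual feature weighting is not specified here.)
    public static Map<String, Double> termFrequencies(List<String> tokens) {
        Map<String, Double> tf = new HashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1.0, Double::sum);
        }
        return tf;
    }

    // Cosine similarity between two sparse term vectors; 0.0 if either is empty.
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        for (double w : b.values()) {
            normB += w * w;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Return document ids ordered by descending similarity to the query.
    public static List<String> rank(List<String> query, Map<String, List<String>> docs) {
        Map<String, Double> q = termFrequencies(query);
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, List<String>> e : docs.entrySet()) {
            scores.put(e.getKey(), cosine(q, termFrequencies(e.getValue())));
        }
        List<String> ids = new ArrayList<>(docs.keySet());
        ids.sort((x, y) -> Double.compare(scores.get(y), scores.get(x)));
        return ids;
    }
}
```

Calling `VsmSketch.rank` with a segmented query and a map from document id to segmented tokens returns the ids with the most similar documents first. A production system would typically replace raw term frequency with a weighting that discounts common terms.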
【Key words】 Search Engine; Robot; Automatic Categorization; VSM; Feature Extraction;
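- 【Online publication contributor】 Dalian University of Technology 【Online publication year/issue】 2004, Issue 04
- 【Classification number】 TP393.09
- 【Times cited】 24
- 【Downloads】 554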