搜索引擎联邦算法设计与系统实现

Algorithm Design and System Implementation of Search Engine Confederation

【Author】 Liu Hui

【Advisor】 Li Xing

【Author information】 Tsinghua University, Information and Communication Engineering, 2004, Master's

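【Summary】 With the unprecedented growth of information on the Internet, large centralized search engines face a series of challenges in scalability, update speed, and users' specialized needs. Distributed search engines relieve, to some extent, the limits on database size in centralized systems, but they remain severely limited in scalability, relevance, and distributed resource control strategy. Resource management and retrieval therefore need a system architecture and organization method that is highly scalable, highly relevant, and practical. Against this background, and building on existing research on distributed algorithms, the thesis designs the architecture of a distributed resource organization and navigation system, the search engine confederation, and implements a confederation prototype based on log analysis. The prototype effectively organizes specialized search engine nodes, each serving a site or a group of sites, and provides fast, accurate, and quickly updated navigation of distributed resources. Starting from an analysis of search engine technology, the thesis proposes the design of the confederation architecture. The confederation is centrally controlled: a central server navigates the distributed resources, the nodes are small and medium-sized search engines serving a site or a group of sites, and nodes recommend one another through the center. The architecture is highly scalable and could serve as a standard framework for distributed resource retrieval systems. Because the confederation is built on distributed search engine nodes, the thesis designs and implements the key technologies of a centralized search engine for small and medium-sized sites, chiefly crawling and preprocessing, the indexing algorithm, and the webpage ranking algorithm. It adopts a novel block-based optimization of the index structure and a webpage ranking algorithm aimed at small and medium-sized sites. Substantial engineering work turned the software into an integrated system, which was deployed at five nodes within CERNET and established an application platform for the confederation. Given the importance of user logs in current information retrieval, and the advantages of log data in predicting results accurately and updating quickly, the thesis proposes a confederation system design based on log analysis. It mainly comprises a log-based confederation architecture, a log protocol format, and a log-based node ranking algorithm. This design is likewise scalable and practical, and its use of log information is novel. Finally, the thesis implements a log-based confederation prototype, whose key technologies include the log protocol implementation and the gathering, merging, indexing, and querying of node information. An experimental analysis of the confederation system, based on data from the five existing nodes, shows that the design is sound and has promising applications. In summary, the contributions of the thesis to the search engine confederation are the study of distributed algorithms, the architecture design, the implementation of key technologies for the node search software, the log-based system design, and the prototype implementation, which together lay a solid foundation for the wider deployment and further development of the confederation.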

【Abstract】 With the explosion of information on the Internet, centralized search engines face challenges in scalability, freshness, and specialized user requirements. Distributed search engines have to some degree solved the scalability problem of centralized systems, but they are limited in scalability, precision, and distributed resource organization. A highly scalable, relevant, and practical method of organizing and retrieving resources is therefore needed. This thesis designs the architecture of such a distributed resource navigation system, the search engine confederation, and implements a log-based confederation prototype that provides rapid, accurate, and up-to-date resource navigation. After analyzing the key search engine technologies, the thesis presents the design of the confederation architecture. It is a centrally controlled system: the center provides resource navigation, and the nodes are site search engines that can recommend each other through the center. The design is highly scalable and could become a standard framework for distributed information retrieval systems. Since the confederation is built on nodes, we first developed high-quality search software for them. Key issues include webpage crawling, preprocessing, indexing, and ranking. A novel block-based indexing optimization and a webpage ranking algorithm were adopted, and substantial engineering work was done before the software was deployed at five sites in CERNET, which became the experimental platform for the confederation. Considering the importance of search logs in information retrieval, and their accurate result prediction and fast adaptability, the thesis puts forward a system design for a log-based confederation. Key issues include the system architecture, the log protocol format, and the log-based resource ranking algorithm. This design is highly scalable and practical, and its use of search logs is novel. Finally, the detailed implementation of the log-based confederation is introduced, including log protocol generation and the gathering, merging, indexing, and retrieval of node information. The confederation prototype was set up on the five existing nodes, and the experimental data demonstrate the soundness of the design and its promising applications. In summary, the main contributions of the thesis are the study of distributed algorithms, the architecture design, the implementation of the site search software, and the design and implementation of the log-based confederation, which establish a solid foundation for the future development of the confederation.
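The abstract does not give the log protocol format or the node ranking formula, so the following Python sketch is only an illustration of the general idea of a center that ranks nodes from their reported search logs. Everything in it is an assumption: the class name `ConfederationCenter`, the log entries as `(query, click_count)` pairs, the whitespace tokenization, and the scoring rule (sum of logged clicks per matching query term) are hypothetical and are not taken from the thesis.

```python
from collections import defaultdict


class ConfederationCenter:
    """Hypothetical central server that ranks nodes from their search logs."""

    def __init__(self):
        # term -> node name -> total logged clicks for queries containing the term
        self.term_clicks = defaultdict(lambda: defaultdict(int))

    def ingest_log(self, node, entries):
        """Merge one node's log; entries are assumed (query, click_count) pairs."""
        for query, clicks in entries:
            # Whitespace tokenization is a simplification for this sketch.
            for term in query.lower().split():
                self.term_clicks[term][node] += clicks

    def rank_nodes(self, query, top_k=3):
        """Return up to top_k node names, highest summed click count first."""
        scores = defaultdict(int)
        for term in query.lower().split():
            for node, clicks in self.term_clicks.get(term, {}).items():
                scores[node] += clicks
        # Ties are broken alphabetically so the ordering is deterministic.
        return sorted(scores, key=lambda n: (-scores[n], n))[:top_k]


center = ConfederationCenter()
center.ingest_log("node-a", [("network routing", 5), ("campus map", 1)])
center.ingest_log("node-b", [("routing protocol", 2), ("campus library", 7)])
print(center.rank_nodes("campus routing"))  # ['node-b', 'node-a']
```

Because the center stores only aggregated log counts rather than page indexes, adding a node in this sketch costs one more `ingest_log` call, which is one plausible reading of why a log-based design scales with the number of nodes.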

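  • 【Online publication contributor】 Tsinghua University
  • 【Online publication year/issue】 2005, Issue 03
  • 【Classification number】 TP391.3
  • 【Times cited】 4
  • 【Downloads】 413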