节点文献

面向XML文档的二级索引技术及其在XML关键词检索中的应用研究

Two-Layer Based Index Strategy for XML Data and Its Usage in XML Keyword Retrieval

  • 推荐 CAJ下载
  • PDF下载
  • 不支持迅雷等下载工具,请取消加速工具后下载。

【作者】 向永清邓志鸿于航高宁

【Author】 Xiang Yongqing,Deng Zhihong,Yu hang,and Gao ning (Key Laboratory of Machine Perception of Ministry of Education,School of Electronics Engineering and Computer Science,Peking University,Beijing 100871)

【机构】 北京大学信息科学技术学院机器感知与智能教育部重点实验室

【摘要】 随着互联网上XML文档的大量增加,如何高效地索引、存储和检索这些XML数据成为一个非常值得深入研究的课题.目前,在XML关键词检索方面,主流的检索系统都是建立在一级索引的基础上.一级索引存在两个明显的缺点:1)索引的冗余度比较高;2)索引的可扩展性和灵活性较差.通过结合传统倒排索引和基于杜威编码的XML节点索引的优点,提出面向XML文档的二级索引模型,并把该模型应用于求解XML关键词检索中的SLCA,实现了基于二级索引的求解SLCA的栈算法.实验表明,二级索引模型能够节省约30%的空间开销,在时间效率方面,基于二级索引的栈算法在效率上比基于一级索引的栈算法要高1个数量级左右,并且随着关键词数目的增加,这种效率优势会越加明显.

【Abstract】 With the rapid increasing of XML documents on the Web,how to index,store and retrieve these documents has become a very popular and valuable problem.At present,there are two normal ways of retrieving XML documents.One is structure-based retrieval,such as XPath and XQuery;the other is keyword-based retrieval.In the aspect of keyword-based XML retrieval,a majority of systems and algorithms are built based on one-layer index.However,one-layer index has two disadvantages: firstly,it may cause redundancy;secondly,it is not easy to be updated.In this paper,a new XML index model called two-layer index model is proposed,which considers both the advantages of traditional inverted index and dewey-code based inverted index.Moreover,a new stack algorithm based on two-layer index is proposed in order to rapidly get SLCA results from XML document sets. At last,the results of building two-layer index on Wiki document set and applying the stack algorithm to get SLCA results from two-layer index are presented,which show the efficiency of the proposed index model and algorithm.

【关键词】 XML二级索引关键词检索SLCA栈算法
【Key words】 XMLtwo-layer indexkeyword retrievalSLCAstack algorithm
【基金】 国家“八六三”高技术研究发展计划基金项目(2009AA01Z136);国家自然科学基金项目(90812001)
  • 【会议录名称】 第26届中国数据库学术会议论文集(B辑)
  • 【会议名称】第26届中国数据库学术会议
  • 【会议时间】2009-10-15
  • 【会议地点】中国江西南昌
  • 【分类号】TP391.3
  • 【主办单位】中国计算机学会数据库专业委员会
节点文献中: 

本文链接的文献网络图示:

本文的引文网络