节点文献
面向XML文档的二级索引技术及其在XML关键词检索中的应用研究
Two-Layer Based Index Strategy for XML Data and Its Usage in XML Keyword Retrieval
【Author】 Xiang Yongqing,Deng Zhihong,Yu hang,and Gao ning (Key Laboratory of Machine Perception of Ministry of Education,School of Electronics Engineering and Computer Science,Peking University,Beijing 100871)
【机构】 北京大学信息科学技术学院机器感知与智能教育部重点实验室;
【摘要】 随着互联网上XML文档的大量增加,如何高效地索引、存储和检索这些XML数据成为一个非常值得深入研究的课题.目前,在XML关键词检索方面,主流的检索系统都是建立在一级索引的基础上.一级索引存在两个明显的缺点:1)索引的冗余度比较高;2)索引的可扩展性和灵活性较差.通过结合传统倒排索引和基于杜威编码的XML节点索引的优点,提出面向XML文档的二级索引模型,并把该模型应用于求解XML关键词检索中的SLCA,实现了基于二级索引的求解SLCA的栈算法.实验表明,二级索引模型能够节省约30%的空间开销,在时间效率方面,基于二级索引的栈算法在效率上比基于一级索引的栈算法要高1个数量级左右,并且随着关键词数目的增加,这种效率优势会越加明显.
【Abstract】 With the rapid increasing of XML documents on the Web,how to index,store and retrieve these documents has become a very popular and valuable problem.At present,there are two normal ways of retrieving XML documents.One is structure-based retrieval,such as XPath and XQuery;the other is keyword-based retrieval.In the aspect of keyword-based XML retrieval,a majority of systems and algorithms are built based on one-layer index.However,one-layer index has two disadvantages: firstly,it may cause redundancy;secondly,it is not easy to be updated.In this paper,a new XML index model called two-layer index model is proposed,which considers both the advantages of traditional inverted index and dewey-code based inverted index.Moreover,a new stack algorithm based on two-layer index is proposed in order to rapidly get SLCA results from XML document sets. At last,the results of building two-layer index on Wiki document set and applying the stack algorithm to get SLCA results from two-layer index are presented,which show the efficiency of the proposed index model and algorithm.
【Key words】 XML; two-layer index; keyword retrieval; SLCA; stack algorithm;
- 【会议录名称】 第26届中国数据库学术会议论文集(B辑)
- 【会议名称】第26届中国数据库学术会议
- 【会议时间】2009-10-15
- 【会议地点】中国江西南昌
- 【分类号】TP391.3
- 【主办单位】中国计算机学会数据库专业委员会