节点文献

基于查询日志的数据库关键字查询研究

Research on Query Log Based Keyword Search Over the Database

【作者】 高磊

【导师】 禹晓辉;

【作者基本信息】 山东大学 , 计算机软件与理论, 2012, 硕士

【摘要】 数据库系统是用来组织、存储和管理数据的仓库,它在企业、部门乃至个人的日常生产生活等诸多领域都得到了广泛的应用。随着现代互联网的迅速发展,信息呈现爆炸式增长,数据库系统存储的信息不断增多,用户搜索这些信息的需求也日益激增。传统的数据库访问方式需要用户掌握结构化的查询语言,了解并且.熟悉底层的数据模式,而这对于大多数的普通用户来说是非常复杂的。受到关键字查询在Web搜索引擎上获得巨大成功的影响,近些年来在数据库支持关键字查询得到了来自数据库领域和信息检索领域研究人员的广泛关注并且成为新兴的研究热点。不同于传统的数据库查询方式,数据库上的关键字查询简单易用,查询没有固定的格式限制,极大地减轻了用户学习和记忆的压力。但是这也给如何开发一个高效健壮的关键字查询系统带来了巨大挑战。传统的数据库查询的结果是一组孤立的元组,而关键字杏询则需要从数据库的不同表中组合与关键字匹配的元组来形成最终结果,这会导致查询的搜索空间急剧膨胀。一般来讲,关键字查询的搜索空间与查询中的关键字数目成指数型关系。还有关键字查询经常是脏的,用户的查询中经常包含一些不相关或者不正确的词,而通常这些脏查询会对随后的查询处理的效率和准确性产生负面的影响。为了解决查询的搜索空间指数性爆炸这一问题,一个被称作查询清理的预处理步骤被引入进来,它被用来清理用户提交的原始查询并抽取出高质量的查询项。这个新增的预处理步骤不仅改进后续查询结果的质量,而且还大大地降低了后续的查询搜索算法的搜索空间。但是它仍然存在一些问题,即引入的查询清理算法并没有考虑到用户偏好,而这样的偏好可以用来进一步改进查询清理的质量。基于模式图的关键字查询方法在它的执行过程中会生成大量的候选网络,其中有些候选网络所表示的关系没有实际意义的或者极少被用户访问,而有些候选网络所表示的关系则史为用户所偏好,即用户经常访问这类关系。而传统的基于模式图的方法通常按照候选网络大小递增的顺序依次求解候选网络,而不是按照用户对候选网络的偏好程度对候选网络求解,这样做也会影响到整个查询执行的效率和查询结果的质量。本文主要针对数据库关键字查询中存在的上述问题,借助于记录用户行为的查询日志,提出两种基于查询日志方法扩展原有的查询清理方法以进一步改进查询清理的质量。我们还使用树数据挖掘算法来对用户的查询日志进行挖掘来获取用户偏好,并通过它来改进基于模式图的关键字查询方法。本文的主要工作及成果如下:(1)针对提出的查询清理算法中的原始得分函数,在它的基础上提出了两种利用查询日志进行扩展的方法。原始的得分函数仅仅根据数据库来对产生的项进行评分,没有考虑到该项在日志中的使用行为。我们基于查询日志使用两种不同方式来对产生的项进行评分,从而获得一个项的日志评分。最后再将得到的日志评分和原始的评分按照某种方式结合起来形成项的最终评分。我们给出的实验证明提出的两种改进方法都在一定程度上改进查询清理的质量,获得了小错的效果。(2)通过使用查询日志来进一步改进传统的基于模式图的查询方法。一般地,基于模式图的查询算法通常按两个步骤处理查询:候选网络生成和候选网络求解。我们引入查询日志来记录用户提交的查询和他们选择的候选网络。然后将数据挖掘算法引入到关键字查询中来,使用已有的树挖掘算法来对用户的查询日志进行挖掘,以获取用户偏好的频繁模式树。然后又引入树编辑距离来定义生成的候选网络与挖掘得到的频繁模式树的相似度,基于此对生成的候选网络进行排序并优先求解排位靠前的候选网络,以此来改进查询的质量和效率。

【Abstract】 The database system is usually treated as the warehouse which is used to organize, store and mange the data. It has been widely used in many fields such as enterprise, departments and even in the personal lives. With the development of the modern Web, the information of the modern world significantly explodes and the data stored in the database system also greatly increases and users search the information more frequently. In order to access the traditional databases, users have to master the structured query language and be familiar with the underlying data schemas, which is very complicated for average users. Inspired by the great success of information retrieval style keyword search on the Web, the keyword search on the databases has been extensively studied by researchers from the database and information retrieval field and emerged as the hot area.Compared to the traditional database access, the keyword search is simple and easy to use and has no fixed format which relives the users from the steep learning curve. Despite its advantage the keyword search brings the great challenges to the developers on how to develop the efficient and robust system. Generally the result returned by the traditional database is a set of tuples. Compared with this, keyword search usually assembles different pieces of data matched with keywords locating in different tables to form the final result, which causes the explosion of the search space. In general the search space is exponential in the number of keywords in the query. Moreover the queries posed by users are often dirty with irrelevant or incorrect words which will affect the accuracy and efficiency of query processing.Lots of candidate network are generated in the query processing of the schema based approach. The relationship represented by some of these candidate networks are less meaningful or rarely accessed by users while the relationship represented by some of these candidate networks are preferred by users. Traditional schema based approach usually evaluates all candidate networks following the order of the size of the candidate network rather than the preference to the candidate network by users, which also affect the quality and efficiency of query processing.In this paper, motivated by the mentioned problem, we propose two query log based approaches to enhance the original query cleaning approach in order to further improve the quality of the query cleaning. We also make use of the tree mining algorithm to mine the query log in order to obtain the user preference which is applied to improve the schema based approach. Our contribution is as follows:1. Based on the original scoring function in the query cleaning algorithm, we propose two query based approaches to enhance the original approach. The original scoring function scores the generated terms based on the database without considering the behavior of terms in the query log. We score the generated terms based on the query log in two different ways to obtain the term score in the query log. Then we combine the original score together with the term score in the query log to form the final score. Our experiments shows that the two enhanced approaches proposed indeed improve the quality of query cleaning.2. We further improve the traditional schema based approach using the query log. Generally, the schema based approach processes queries in two consecutive steps:candidate network generation and candidate network evaluation. The query log used here records the candidate networks selected by users. We introduce the existing tree mining algorithm to mine the query log to obtain the frequent patterns preferred by users. Then we introduce the tree edit distance to define the similarity between the generated candidate networks and the mined frequent patterns. Based on the similarity the candidate networks are sorted and the candidate networks with higher rank are prioritized to evaluate.

  • 【网络出版投稿人】 山东大学
  • 【网络出版年期】2013年 02期
节点文献中: 

本文链接的文献网络图示:

本文的引文网络