A Query Expansion Technique Based on HTML Position Information (一种基于HTML位置信息的查询扩展技术)

Query Expansion Using Tags in the HTML File

【Authors】 Chen Zhiwei (陈志玮), Xiao Shibin (肖诗斌), Shi Shuicai (施水才), Wang Xin (王昕)

【Affiliations】 Chinese Information Processing Research Center, Beijing Information Science & Technology University, Beijing 100101; Zhong Chuan Architecture Design & Research Institute, Beijing 100101

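【Abstract (Chinese version)】 Query expansion means extending the user's description of a query about entity attributes with semantically synonymous or near-synonymous terms. To address the word mismatch problem between documents and queries in information retrieval, this paper proposes a query expansion method based on HTML position information. Because HTML files carry position information (that is, tag information), performing query expansion on HTML files works better than performing it on plain-text files. The training corpus is built from the results returned by major existing search engines, and the quality of a candidate expansion term is assessed by its degree of co-occurrence with all query terms in the local document set. Finally, with the standard vector space model (VSM) as the retrieval algorithm, query expansion that uses position information is compared against retrieval without query expansion and against ordinary query expansion. The technique should be especially suitable when queries are short and the content of the document collection is fairly scattered, where it can greatly improve retrieval results. In addition, exploiting the position information in HTML allows queries to be expanded more effectively.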

【Abstract】 Query expansion adds related words and phrases to the original query supplied by the user. Techniques for automatic query expansion have been studied extensively in information retrieval research as a solution to the word mismatch problem between queries and documents. This paper proposes an expansion method that uses the tags in HTML files. Because HTML files carry tags, using them for query expansion works better than using plain text. The training collection is made up of the search results from several search engines. The method uses local co-occurrence information in the top-ranked documents and global information in the training collection to select the most appropriate expansion terms. It then uses the vector space model (VSM) to index the documents of the test collection and compares the query expansion model that uses location information with a model that uses no query expansion and with a plain query expansion model. Query expansion is applicable when queries are short and the content of the documents is dispersed, and it can effectively improve the query results. Moreover, query expansion that uses the location information in HTML files can be more effective.
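
The abstract does not give the scoring formula, so the following is only a minimal sketch of the general idea: candidate terms are scored by how strongly they co-occur with every query term in a local set of HTML documents, and occurrences inside prominent tags count more. The tag weights in `TAG_WEIGHTS`, the log-sum scoring rule, and the names `WeightedTermParser`, `term_weights`, and `expansion_scores` are all assumptions made for illustration, not the authors' method.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag weights (an assumption, not taken from the paper).
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "b": 1.5}
DEFAULT_WEIGHT = 1.0


class WeightedTermParser(HTMLParser):
    """Accumulates a weight per term; text inside heavier tags counts more."""

    def __init__(self):
        super().__init__()
        self.stack = []           # currently open tags
        self.weights = Counter()  # term -> accumulated weight

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Use the heaviest enclosing tag as the weight of this text run.
        weight = max((TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.stack),
                     default=DEFAULT_WEIGHT)
        for term in re.findall(r"\w+", data.lower()):
            self.weights[term] += weight


def term_weights(html):
    """Return the tag-weighted term counts of one HTML document."""
    parser = WeightedTermParser()
    parser.feed(html)
    parser.close()
    return parser.weights


def expansion_scores(query_terms, html_docs):
    """Score each non-query term by its co-occurrence with all query terms.

    Assumed rule: co(t, q) = sum over documents of w(t) * w(q), and
    score(t) = sum over query terms q of log(1 + co(t, q)).
    """
    query = [q.lower() for q in query_terms]
    cooc = {}  # candidate term -> Counter(query term -> co-occurrence)
    for html in html_docs:
        weights = term_weights(html)
        for term, w_t in weights.items():
            if term in query:
                continue
            per_query = cooc.setdefault(term, Counter())
            for q in query:
                per_query[q] += w_t * weights.get(q, 0.0)
    return {t: sum(math.log1p(c[q]) for q in query) for t, c in cooc.items()}


docs = [
    "<html><h1>query expansion</h1><p>retrieval of documents</p></html>",
    "<p>query retrieval</p><p>weather</p>",
]
scores = expansion_scores(["query"], docs)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy collection `retrieval` co-occurs with `query` in both documents, while `expansion` does so only once but inside a heading. Under the assumed weights the heading occurrence is enough to rank `expansion` first, which is the kind of effect that tag information is meant to contribute.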

【Key words】 Information Retrieval; Query Expansion; Co-Occurrence
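【Funding】 National Natural Science Foundation of China (60272084); Key Project of the Science and Technology Development Program of the Beijing Municipal Commission of Education (KZ200310772013); Beijing Municipal Commission of Education projects (KM200510772008, KM200610772008)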
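  • 【Proceedings】 Proceedings of the 3rd Student Workshop on Computational Linguistics (第三届学生计算语言学研讨会论文集)
  • 【Conference】 3rd Student Workshop on Computational Linguistics
  • 【Conference date】 2006-08
  • 【Conference location】 Shenyang, Liaoning, China
  • 【Classification code】 TP312.2
  • 【Organizer】 Chinese Information Processing Society of China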