Web key resource page selection based on non-content information
【Abstract】 The explosive growth of information on the Web means that any search engine can crawl and index only a small fraction of Web data, and the indexed page set is itself filled with low-quality information and spam. Selecting high-quality Web pages (key resource pages) query-independently is therefore a challenge for Web information retrieval. Statistical analysis of a large-scale page collection shows that several non-content features of pages can be used to locate key resource pages. A decision tree was constructed to combine these query-independent features, which include in-degree, document length, URL type, and two newly proposed features based on analysis of a site's self-link structure, and was applied as a pre-selection step in topic distillation research. Experiments on the Text REtrieval Conference (TREC) standard evaluation platform, with more than 19 GB of text data, show that the selected page set contains only about 20% of the pages in the collection yet covers more than 70% of the key resources. Furthermore, retrieval on a key resource set comprising only 24% of all pages improved performance by more than 60% relative to retrieval on the whole collection. This shows that better retrieval performance in topic distillation can be obtained with a smaller index.
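The selection step and the coverage figures described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Page` fields, the URL-type categories (`root`, `subroot`, `path`, `file`), the thresholds in `is_key_candidate`, and the toy data are all hypothetical. The hand-written rules only stand in for a learned decision tree, and the two site self-link features are omitted because the abstract does not define them.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    in_degree: int    # number of in-links pointing at the page
    doc_length: int   # document length in words
    url_type: str     # hypothetical categories: "root", "subroot", "path", "file"
    is_key: bool      # ground-truth label, used only for evaluation


def is_key_candidate(page: Page) -> bool:
    """Hand-written stand-in for a learned decision tree.

    The thresholds are invented for this example and are not taken
    from the paper. No query is involved: only non-content features.
    """
    if page.url_type in ("root", "subroot"):
        return True
    if page.in_degree >= 10:
        return page.doc_length >= 200
    return False


def evaluate(pages):
    """Return (fraction of pages selected, fraction of key resources covered)."""
    selected = [p for p in pages if is_key_candidate(p)]
    total_key = sum(p.is_key for p in pages)
    covered = sum(p.is_key for p in selected)
    return len(selected) / len(pages), covered / total_key


# Toy collection; example.org URLs and all values are made up.
PAGES = [
    Page("http://example.org/", 50, 300, "root", True),
    Page("http://example.org/a/guide.html", 12, 500, "file", True),
    Page("http://example.org/a/b/note1.html", 1, 800, "file", False),
    Page("http://example.org/a/b/note2.html", 0, 50, "file", False),
    Page("http://example.org/a/b/", 3, 400, "path", True),
    Page("http://example.org/c/misc.html", 2, 120, "file", False),
    Page("http://example.org/c/banner.html", 15, 40, "file", False),
    Page("http://example.org/c/dump.html", 0, 900, "file", False),
    Page("http://example.org/c/d/", 1, 300, "path", False),
    Page("http://example.org/c/d/item.html", 4, 250, "file", False),
]

selected_fraction, key_coverage = evaluate(PAGES)
print(f"selected {selected_fraction:.0%} of pages, "
      f"covering {key_coverage:.0%} of key resources")
```

In practice the tree would be learned from labelled pages rather than written by hand. The point of the sketch is only that the selection uses no query and no page content beyond length, and that its quality is measured by how much of the key resource set survives in the reduced page set.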
【Key words】 Web information retrieval; key resource page; topic distillation; link structure analysis
- 【Source】 CAAI Transactions on Intelligent Systems (智能系统学报), 2007, Issue 01
- 【Classification codes】 TP18; TP391.3
- 【Times cited】 4
- 【Downloads】 98