Data Source Two-layer Selection Model for Deep Web Localized Data Integration

【Author】 XIAN Xuefeng; CUI Zhiming; FANG Ligang; GU Caidong; SUN Xun

【Affiliation】 Jiangsu Province Support Software Engineering R & D Center for Modern Information Technology Application in Enterprise; Institute of Intelligent Information Processing and Application, Soochow University

【Abstract】 Data sources chosen by quality-based selection methods are costly to crawl and yield a high proportion of duplicate records. To address this, the paper proposes a Deep Web data source selection and integration method built on a two-layer selection model. The model is constructed from the intrinsic quality of each data source and from its utility, and a recursive incremental data source selection and integration strategy is given on top of it. In the first layer, a quality-based selector filters out the large number of low-quality Deep Web data sources and passes only a few high-quality ones to the second-layer selector. The second layer, a utility-based selector, recursively chooses from this candidate set the k data sources that the integration system will finally crawl and integrate, so that the system obtains as much high-quality data as possible while avoiding a high degree of mutual coverage (overlap) among the chosen sources. Experimental results show that the method combines the advantages of the two selectors: it shrinks the candidate data source space while preserving the quality of the integrated data, avoids processing large amounts of duplicate data, and effectively reduces the cost of Deep Web data crawling and integration.
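The two-layer idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not define the quality score, the utility measure, or the recursion, so the sketch assumes a single numeric quality score per source, a fixed quality threshold, and a utility equal to the number of records a source adds beyond those already covered. The function name `select_sources` and all parameters are hypothetical, and a simple greedy loop stands in for the recursive incremental strategy.

```python
from typing import Dict, List, Set


def select_sources(
    quality: Dict[str, float],
    records: Dict[str, Set[int]],
    quality_threshold: float,
    k: int,
) -> List[str]:
    """Pick at most k sources: filter by quality, then by marginal gain.

    All inputs are illustrative assumptions:
    quality -- one score per source (higher is better)
    records -- record identifiers each source is believed to hold;
               a real system would have to estimate these, e.g. by sampling
    """
    # Layer 1: keep only sources whose quality reaches the threshold.
    candidates = [s for s in quality if quality[s] >= quality_threshold]

    selected: List[str] = []
    covered: Set[int] = set()

    # Layer 2: repeatedly add the candidate contributing the most new records,
    # breaking ties by quality, so heavily overlapping sources are skipped.
    while candidates and len(selected) < k:
        best = max(
            candidates,
            key=lambda s: (len(records[s] - covered), quality[s]),
        )
        if not records[best] - covered:
            break  # every remaining candidate only duplicates covered data
        selected.append(best)
        covered |= records[best]
        candidates.remove(best)

    return selected
```

For example, with sources A = {1, 2, 3, 4}, B = {1, 2, 3}, C = {5, 6} all above the threshold and a low-quality source D below it, the sketch drops D in the first layer and skips B in the second because B contributes nothing beyond A.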

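【Funding】 National Natural Science Foundation of China (61440053, 61472268, 41201338); Suzhou Science and Technology Plan Research Project (SYG201342, SYG201343, SS201344)
  • 【Source】 Computer Engineering (计算机工程), 2017, Issue 03
  • 【Classification No.】 TP393.09; TP301.6
  • 【Times cited】 3
  • 【Downloads】 121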