Data cleaning technology based on semantics

【Author】 Cao Zhongsheng Wan Jinwei Associate Prof.; College of Computer Sci. & Tech., Huazhong Univ. of Sci. & Tech., Wuhan 430074, China.

【Affiliation】 College of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074

【Abstract】 To remedy the shortcomings of traditional textual similarity functions (such as edit distance and phonetic distance) in identifying duplicate records, the semantics of individual fields within a record and the semantics among fields were studied. Field names combined with statistical analysis were used to identify the semantics within a field, and a semantic rule base was used to identify the hierarchical semantics and dependencies among fields. Semantics were introduced into the priority queue algorithm, yielding the Improved Prior Queue Method (IPQM): when computing the similarity between records, the hierarchical semantic relations among fields are considered explicitly, and a different similarity computation method is called for each field type. The semantic rule base was also introduced into a data cleaning framework, in which semantics are used to handle equivalence-type errors at the pre-processing stage and IPQM is used to compute the similarity between records at the processing stage. The experimental results show that the framework improves the quality of data cleaning: the miss rate is below 7% (recall above 93%) and the false-positive rate does not exceed 3%.
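The abstract's idea of calling a different similarity method per field type while respecting hierarchical relations among fields can be illustrated with a minimal sketch. This is not the authors' IPQM: the field types, the schema, the similarity functions, and the rule that a child field may not score higher than its parent are all assumptions made for illustration, and the priority queue itself is not shown.

```python
from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for an edit-distance measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def numeric_similarity(a: float, b: float) -> float:
    """1.0 for equal values, decreasing with the relative difference, floored at 0."""
    if a == b:
        return 1.0
    return max(0.0, 1.0 - abs(a - b) / max(abs(a), abs(b)))


def exact_similarity(a, b) -> float:
    """Categorical fields either match or they do not."""
    return 1.0 if a == b else 0.0


# Hypothetical field types; the abstract does not enumerate the real ones.
SIMILARITY_BY_TYPE = {
    "text": text_similarity,
    "numeric": numeric_similarity,
    "category": exact_similarity,
}

# Hypothetical schema: field name -> (field type, parent field or None).
# "city" is declared a child of "province" to model a hierarchy rule.
SCHEMA = {
    "name": ("text", None),
    "province": ("category", None),
    "city": ("category", "province"),
    "age": ("numeric", None),
}


def field_score(field: str, r1: dict, r2: dict, schema: dict) -> float:
    """Score one field with the method for its type, capped by its parent's score."""
    ftype, parent = schema[field]
    score = SIMILARITY_BY_TYPE[ftype](r1[field], r2[field])
    if parent is not None:
        # Assumed hierarchy rule: a child cannot match better than its parent.
        score = min(score, field_score(parent, r1, r2, schema))
    return score


def record_similarity(r1: dict, r2: dict, schema: dict = SCHEMA) -> float:
    """Unweighted mean of the per-field scores."""
    return sum(field_score(f, r1, r2, schema) for f in schema) / len(schema)
```

Under these assumptions, two records that share a city name but sit in different provinces get no credit for the city match, which a purely textual comparison of the city field would miss.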

【Key words】 data cleaning; duplicate elimination; textual similarity function; semantics

【Funding】 Supported by the National Key Technologies R&D Program of China (2001BA110B01).

【Source】 Journal of Huazhong University of Science and Technology (Natural Science Edition), 2005, Issue 02

【Classification number】 TP311.13
【Times cited】 6
【Downloads】 209
