Data cleaning technology based on semantics
【Abstract】 Traditional methods based on textual similarity functions (such as edit distance and phonetic distance) fall short in identifying duplicate records. To remedy this, the semantics of individual fields within a record and the semantics among fields were studied. Field names combined with statistical analysis are used to identify the semantics within a field, and a semantic rule base is used to identify the hierarchical semantics and dependencies among fields. Semantics are introduced into the priority queue algorithm, yielding the Improved Priority Queue Method (IPQM): when computing the similarity between two records, the hierarchical semantic relationships among fields are considered explicitly, and a different similarity computation method is called for each field type. The semantic rule base is also introduced into a data cleaning framework, in which semantics are used to handle equivalence-type errors at the pre-processing stage and IPQM is used to compute the similarity between records at the processing stage. The experimental results show that the framework improves the quality of data cleaning, with a miss rate below 7% (recall above 93%) and a false-positive rate of no more than 3%.
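The similarity step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' IPQM implementation: the equivalence table `EQUIVALENTS`, the field names, the 0.8 threshold, the unweighted averaging, and the use of `difflib.SequenceMatcher` as the text similarity function are all assumptions made for illustration, and the priority-queue clustering itself is not shown. The sketch only shows the three ideas named above: rewriting equivalent values during pre-processing, calling a different similarity function per field type, and treating hierarchical fields (for example province, then city) so that a mismatch at a higher level makes lower levels irrelevant.

```python
from difflib import SequenceMatcher

# Hypothetical equivalence rules, standing in for the semantic rule base.
EQUIVALENTS = {"st.": "street", "rd.": "road", "univ.": "university"}


def normalize(value: str) -> str:
    """Pre-processing: lower-case the value and rewrite known equivalent tokens."""
    return " ".join(EQUIVALENTS.get(tok, tok) for tok in value.lower().split())


def text_similarity(a: str, b: str) -> float:
    """Similarity for text fields, computed on the normalized values."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def numeric_similarity(a: float, b: float) -> float:
    """Similarity for numeric fields: 1 minus the relative difference, floored at 0."""
    if a == b:
        return 1.0
    return max(0.0, 1.0 - abs(a - b) / max(abs(a), abs(b)))


def hierarchy_similarity(r1: dict, r2: dict, levels: list, threshold: float = 0.8) -> float:
    """Compare hierarchical fields from the most general level down.

    Assumption: once a level differs, lower levels contribute nothing.
    """
    total = 0.0
    for name in levels:
        score = text_similarity(r1[name], r2[name])
        if score < threshold:
            break
        total += score
    return total / len(levels)


def record_similarity(r1: dict, r2: dict, text_fields: list,
                      numeric_fields: list, levels: list) -> float:
    """Unweighted average of the per-field scores and the hierarchy score."""
    scores = [text_similarity(r1[f], r2[f]) for f in text_fields]
    scores += [numeric_similarity(r1[f], r2[f]) for f in numeric_fields]
    scores.append(hierarchy_similarity(r1, r2, levels))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    a = {"name": "Wuhan Univ.", "age": 30, "province": "Hubei", "city": "Wuhan"}
    b = {"name": "wuhan university", "age": 30, "province": "Hubei", "city": "Wuhan"}
    c = {"name": "wuhan university", "age": 30, "province": "Hunan", "city": "Wuhan"}
    print(record_similarity(a, b, ["name"], ["age"], ["province", "city"]))
    print(record_similarity(a, c, ["name"], ["age"], ["province", "city"]))
```

In this sketch, records `a` and `b` differ only by an abbreviation that the equivalence table rewrites, so they score as identical, while `c` shares the same city name under a different province and is penalized by the hierarchy rule. A real system would likely weight fields and tune the threshold on labeled data.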
【Key words】 data cleaning; duplicate elimination; textual similarity function; semantics
- 【Source】 Journal of Huazhong University of Science and Technology (Natural Science Edition), 2005, Issue 02
- 【Classification No.】 TP311.13
- 【Times cited】 6
- 【Downloads】 209