Detection Method of Approximately Duplicated Chinese Address Information
【Abstract】 Identifying and eliminating approximately duplicated records is a prominent problem in data cleansing for data warehouses, and address information plays an important role in recognizing records that refer to the same entity. To process Chinese address information, a meta-database containing segmentation rules is established and an approximate-duplicate detection model is proposed. On this basis, a segmentation algorithm based on feature characters and an algorithm that computes record similarity with a variable-weight strategy are described. Experimental results indicate that the method effectively detects duplicated Chinese address information and improves both the execution efficiency and the detection precision of the algorithm.
【Key words】 approximately duplicated records; Chinese address information; tagged word; segmentation; variable weight
- 【Source】 Journal of Chinese Computer Systems (小型微型计算机系统), 2008, No. 04
- 【Classification No.】 TP311.13
- 【Citation count】 10
- 【Download count】 203
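
The abstract names two steps, segmentation by feature characters and record similarity under a variable-weight strategy, without giving their details. The following is only a minimal sketch of how such a pipeline might look, not the authors' algorithm. The feature-character table `FEATURE_CHARS`, the base weights `BASE_WEIGHTS`, and the functions `segment` and `similarity` are all hypothetical; the paper keeps its segmentation rules in a meta-database, which a small in-memory table stands in for here. "Variable weight" is interpreted, as an assumption, as renormalizing the weights over the address levels present in both records.

```python
from difflib import SequenceMatcher

# Hypothetical feature characters that close an address level.
# Stand-in for the paper's meta-database of segmentation rules.
FEATURE_CHARS = {
    "省": "province",
    "市": "city",
    "区": "district",
    "县": "district",
    "路": "road",
    "街": "road",
    "号": "number",
}

# Hypothetical base weights per address level (assumed, not from the paper).
BASE_WEIGHTS = {
    "province": 0.10,
    "city": 0.15,
    "district": 0.20,
    "road": 0.30,
    "number": 0.25,
}


def segment(address):
    """Split an address at feature characters into {level: text}."""
    fields = {}
    start = 0
    for i, ch in enumerate(address):
        level = FEATURE_CHARS.get(ch)
        # Take the text before the feature character as that level's value;
        # only the first occurrence of each level is kept.
        if level is not None and i > start and level not in fields:
            fields[level] = address[start:i]
            start = i + 1
    return fields


def similarity(addr_a, addr_b):
    """Weighted field similarity; weights are renormalized over the
    levels that both addresses contain (assumed variable-weight rule)."""
    fa, fb = segment(addr_a), segment(addr_b)
    shared = [lvl for lvl in BASE_WEIGHTS if lvl in fa and lvl in fb]
    if not shared:
        return 0.0
    total = sum(BASE_WEIGHTS[lvl] for lvl in shared)
    score = 0.0
    for lvl in shared:
        field_sim = SequenceMatcher(None, fa[lvl], fb[lvl]).ratio()
        score += BASE_WEIGHTS[lvl] / total * field_sim
    return score
```

Under this sketch, an address that omits the province is still compared on the remaining levels, so a missing field does not by itself lower the score. Place names that happen to contain a feature character would be split incorrectly; a real rule base would need to handle such cases.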