Data Cleaning Algorithm Based on Unsupervised Learning
【Abstract】 To address the problem of similar duplicate records in data warehouses, a data cleaning algorithm based on unsupervised learning is proposed. The algorithm uses an adaptive learning method based on the Hebbian postulate, in which the degree of similarity determines the reward and penalty rates. To avoid the dead-neuron (dead cluster) problem, a new cluster is created during learning whenever no existing cluster is similar to a pattern. After learning, the clusters are analyzed to detect wrong ones; a wrong cluster is deleted and merged into the cluster most similar to it, which makes the clustering more accurate. In the experiments, the algorithm is applied to a clustering task, and the results show that it performs entity identification accurately.
【Key words】 data warehouse; data extract; data transform; data cleaning; data loading
- 【Source】 Journal of Jilin University (Information Science Edition), 2008, Issue 06
- 【Classification number】 TP311.13
- 【Times cited】 7
- 【Downloads】 394
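
The clustering procedure outlined in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract gives no formulas, so the cosine similarity measure, the numeric feature-vector representation of records, the rival-penalty rule, and every name and parameter below (`cluster_records`, `new_cluster_threshold`, `reward`, `penalty`, `min_size`) are assumptions made for illustration only. A "wrong" cluster is assumed here to be one with fewer than `min_size` members.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def cluster_records(patterns, new_cluster_threshold=0.9, reward=0.5,
                    penalty=0.05, epochs=5, min_size=2):
    """Return one cluster label per pattern (all parameters are assumed)."""
    centers = []
    for _ in range(epochs):
        for p in patterns:
            sims = [cosine(p, c) for c in centers]
            # No existing cluster is similar enough: create a new cluster
            # instead of leaving units that never win (dead neurons).
            if not sims or max(sims) < new_cluster_threshold:
                centers.append(list(p))
                continue
            order = sorted(range(len(centers)), key=lambda i: sims[i],
                           reverse=True)
            # Reward: move the winner toward the pattern, scaled by similarity.
            win = order[0]
            rate = reward * sims[win]
            centers[win] = [c + rate * (x - c)
                            for c, x in zip(centers[win], p)]
            # Penalty (assumed rule): push the runner-up away from the
            # pattern, also scaled by its similarity.
            if len(order) > 1:
                rival = order[1]
                rate = penalty * max(sims[rival], 0.0)
                centers[rival] = [c - rate * (x - c)
                                  for c, x in zip(centers[rival], p)]

    # Assign every pattern to its most similar cluster.
    assign = [max(range(len(centers)), key=lambda i: cosine(p, centers[i]))
              for p in patterns]
    counts = [assign.count(i) for i in range(len(centers))]

    # After learning: delete clusters assumed to be wrong (too few members)
    # and merge each one into the most similar surviving cluster.
    keep = [i for i in range(len(centers)) if counts[i] >= min_size]
    if not keep:
        keep = list(range(len(centers)))
    target = {}
    for i in range(len(centers)):
        if i in keep:
            target[i] = i
        else:
            target[i] = max(keep, key=lambda k: cosine(centers[i], centers[k]))
    relabel = {k: n for n, k in enumerate(keep)}
    return [relabel[target[a]] for a in assign]
```

Calling `cluster_records` on a list of numeric vectors returns one integer label per record; records that share a label are candidates for the same real-world entity. In practice the similarity function would need to be replaced by one suited to the record fields being compared, such as a string similarity for names and addresses.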