Data Cleaning Algorithm Based on Unsupervised Learning (基于无监督学习的数据清洗算法)

【Authors】 SUN Tie-min (孙铁民) (a), YU Jie (于杰) (b), SHANG Cheng (尚程) (c), TIAN Da-xin (田大新) (c), ZHANG Li-hua (张丽华) (a). (a) Department of Science and Technology; (b) College of Communication Engineering; (c) College of Computer Science and Technology, Jilin University, Changchun 130012, China

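【Affiliations】 Department of Science and Technology, Jilin University; College of Communication Engineering, Jilin University; College of Computer Science and Technology, Jilin University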

【Abstract】 To address the problem of similar duplicate records in a data warehouse, a data cleaning algorithm based on unsupervised learning is proposed. The algorithm uses an adaptive learning method based on the Hebbian postulate, in which the similarity between a pattern and a cluster determines the level of reward and penalty. During learning, a new cluster is created whenever no existing cluster is similar to the current pattern, which avoids the dead-neuron problem. After learning, the clusters are analysed to detect wrong clusters; each wrong cluster is deleted and merged into the cluster most similar to it, which makes the clustering more accurate. In experiments on a clustering task, the algorithm performed entity identification accurately.
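The abstract gives no update rules, so the following is only an illustrative sketch of the three ideas it names: similarity-scaled reward and penalty, creation of a new cluster when nothing is similar enough, and post-hoc deletion and merging of wrong clusters. Everything concrete here is an assumption rather than the authors' method: records are taken to be numeric feature vectors, similarity is cosine similarity, the penalty is applied to the runner-up cluster, a "wrong" cluster is taken to be one with fewer than `min_size` members, and the names `cluster`, `new_threshold`, `reward`, `penalty` and `min_size` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def ranked(x, centers):
    """Indices of centers, most similar to x first."""
    return sorted(range(len(centers)),
                  key=lambda i: cosine(x, centers[i]), reverse=True)

def cluster(patterns, new_threshold=0.9, reward=0.5, penalty=0.05,
            min_size=2, epochs=3):
    centers = []
    for _ in range(epochs):
        for x in patterns:
            order = ranked(x, centers)
            # No sufficiently similar cluster: create a new one (assumed rule).
            if not order or cosine(x, centers[order[0]]) < new_threshold:
                centers.append(list(x))
                continue
            # Reward: pull the winner toward x, scaled by its similarity.
            win = order[0]
            s = cosine(x, centers[win])
            centers[win] = [c + reward * s * (a - c)
                            for c, a in zip(centers[win], x)]
            # Penalty: push the runner-up away from x, scaled by its similarity.
            if len(order) > 1:
                rival = order[1]
                r = cosine(x, centers[rival])
                centers[rival] = [c - penalty * r * (a - c)
                                  for c, a in zip(centers[rival], x)]

    labels = [ranked(x, centers)[0] for x in patterns]

    # Assumed criterion: clusters with too few members are "wrong".
    keep = [i for i in range(len(centers)) if labels.count(i) >= min_size]
    if not keep:
        keep = list(range(len(centers)))
    kept = [centers[i] for i in keep]

    # Merge each deleted cluster into the most similar surviving cluster.
    merge_into = {}
    for i in range(len(centers)):
        if i in keep:
            merge_into[i] = keep.index(i)
        else:
            merge_into[i] = max(range(len(kept)),
                                key=lambda j: cosine(centers[i], kept[j]))
    return kept, [merge_into[l] for l in labels]

if __name__ == "__main__":
    records = [[1, 0], [0.99, 0.05], [1, 0.02],
               [0, 1], [0.05, 0.99], [0.02, 1],
               [1, 1]]  # the last record is an isolated outlier
    kept, labels = cluster(records)
    print(len(kept), labels)
```

In this toy run the isolated record first forms its own cluster, which is then deleted and folded into one of the two surviving clusters. For real duplicate-record detection, the vectors would have to come from some encoding of the record fields, which the abstract does not describe.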

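【Funding】 Supported by a fund of the Science and Technology Department of Jilin Province (20071103)
  • 【Source】 Journal of Jilin University (Information Science Edition) (吉林大学学报(信息科学版)), 2008, Issue 06
  • 【Classification No.】 TP311.13
  • 【Times cited】 7
  • 【Downloads】 394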