考虑边界稀疏样本的非平衡数据处理方法
Unbalanced data processing method considering boundary sparse samples
【Abstract】 To address the limitations of existing methods for processing unbalanced data, a hybrid sampling method that considers boundary sparse samples (considering boundary sparse samples-hybrid sampling, CBSS-HS) is proposed. A boundary factor is computed for each sample to identify boundary points, and the sample space is divided into a boundary domain and a non-boundary domain. Negative-class samples in the non-boundary domain are undersampled. Because samples in the boundary domain are sparse, the positive-class samples there are oversampled with a max-distance-based synthetic minority oversampling technique (max distance-synthetic minority oversampling technique, MD-SMOTE), which retains as much positive-class information as possible, so that the two classes end up roughly balanced. Recall, F1-value, G-mean and AUC (area under the curve) are used as evaluation indicators. CBSS-HS combined with a support vector machine (SVM) is validated on five datasets with different degrees of imbalance and compared with four other combined models. The results show that CBSS-HS performs well on every evaluation indicator across the datasets, with an average improvement of 4.6%. The method can therefore serve as an effective means of processing unbalanced data.
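As an illustration of the evaluation indicators named in the abstract, the following is a minimal Python sketch of how Recall, F1-value, G-mean and AUC are commonly computed for a binary classifier. It is not the paper's evaluation code, which this record does not show. The function names (`confusion_counts`, `imbalance_metrics`, `auc`) are hypothetical, the positive class is assumed to be labelled `1`, and G-mean is taken in its usual form as the geometric mean of positive-class recall and negative-class recall (specificity).

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for a binary problem (hypothetical helper)."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def imbalance_metrics(y_true, y_pred, positive=1):
    """Return Recall, F1-value and G-mean for the positive class."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # G-mean rewards classifiers that do well on both classes at once.
    g_mean = math.sqrt(recall * specificity)
    return {"recall": recall, "f1": f1, "g_mean": g_mean}


def auc(y_true, scores, positive=1):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample from each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

On unbalanced data these indicators are more informative than plain accuracy: a classifier that predicts only the negative class can score high accuracy yet has a Recall and G-mean of 0.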
【Key words】 unbalanced data; mixed sampling; boundary factor; SMOTE algorithm
- 【Source】 重庆邮电大学学报(自然科学版), Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition), 2020, Issue 03
- 【Classification No.】 TP311.13
- 【Citations】 8
- 【Downloads】 142