Research on Static and Dynamic Imbalanced Classification Problems

【Author】 Zhang Jianjun (张建军)

【Supervisor】 Wu Yongxian (吴永贤)

【Author Information】 South China University of Technology, Computer Science and Technology, 2020, PhD

【Abstract】 The accelerated integration of machine learning with other industries has promoted the formation of automated and intelligent industrial clusters represented by the Internet of Things, big data, and robotics, which have become a new driving force for production, daily life, and economic development. Machine learning requires large amounts of data for model training, and class imbalance is a common issue when machine learning techniques are applied to a specific scenario, i.e., some classes contain far more data than others. The classes with fewer samples (the minority classes) are often the more important ones in practical applications, and ignoring or misclassifying their samples can lead to serious consequences; for example, misclassifying a patient as healthy may cause the patient to miss treatment. However, when faced with class imbalance, many existing algorithms easily overfit the majority class, which reduces the model's recognition rate on minority samples. In practice, the class imbalance problem must therefore be taken into account so that minority samples are not misclassified while high accuracy on majority samples is maintained. To this end, this thesis focuses on the class imbalance problem in machine learning and investigates three key issues: the lack of a theory that can efficiently quantify the negative impact of class imbalance on learning tasks in static imbalanced problems, the concept drift phenomenon in dynamic imbalanced problems, and the emergence of new classes in dynamic imbalanced problems. The aim is to understand the nature and difficulties of imbalanced learning, to improve the applicability, accuracy, and robustness of machine learning algorithms in real-world environments, and to help build an intelligent society. The main research content and contributions of this thesis are summarized as follows.

(1) In static imbalanced problems there is a lack of theory that can efficiently quantify the negative impact of class imbalance on learning tasks. This thesis proposes POSENS (Perturbation-based Over-Sampling ENSemble), which systematically analyzes and quantifies the negative effects of static class imbalance on different samples and provides a simple yet effective tool for understanding the nature of imbalanced learning. In addition, based on the quantified information, a new oversampling ensemble method is proposed that generates informative minority samples while reducing the introduction of new noise, further improving the effectiveness of oversampling and the generalization performance of the classifier. The proposed method is evaluated on thirty-five datasets against nine popular reference methods, and extensive experiments show that it achieves statistically significantly better results on three performance metrics.

(2) A new dynamic imbalanced learning algorithm, CWIB (Cost-sensitive Weighting and Imbalance-reversed Bagging), is proposed to address two challenges in dynamic data streams: frequent changes of class prior probabilities and the degradation of model performance caused by concept drift. The algorithm contains two modules: an imbalance-reversed bagging algorithm and a cost-sensitive dynamic weighting mechanism. The imbalance-reversed bagging algorithm maintains a high true positive rate and a low false positive rate even when the class prior probabilities change, while the cost-sensitive dynamic weighting mechanism maintains the accuracy and stability of the model under concept drift. On a real electricity pricing dataset, the proposed algorithm achieves statistically significantly better results on four performance metrics than six reference methods.

(3) The emergence of new classes in a dynamic data stream can easily lead to class imbalance. Existing new class detection algorithms suffer from several drawbacks: low recognition rates for known classes, an inability to handle dynamically changing imbalanced data environments, high maintenance costs, and long running times. To alleviate these issues, this thesis proposes KNNEND (K-Nearest Neighbors-based Ensemble for New class Detection). It uses nearest-neighbor ensembles to mitigate the impact of class imbalance and to improve the recognition rate for known classes. In addition, it updates new-class sub-models quickly and maintains a fixed number of them, which lowers maintenance costs and running time and improves the applicability of the detection model in real-world scenarios.
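The oversampling idea in contribution (1) can be illustrated with a small sketch. The code below is not the thesis's POSENS implementation; it only shows, for assumed binary 0/1 labels and with invented function names and parameters, the generic pattern of rebalancing a dataset by adding perturbed copies of minority samples and training a small ensemble on the rebalanced data.

```python
# Hypothetical sketch only -- not the thesis's POSENS code.
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def perturbation_oversample(X_min, n_new, scale=0.05, seed=None):
    """Create n_new synthetic minority samples by perturbing real ones."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X_min), size=n_new)
    # Feature-wise Gaussian noise, scaled by each feature's spread.
    noise = rng.normal(0.0, scale, size=(n_new, X_min.shape[1]))
    return X_min[idx] + noise * X_min.std(axis=0, keepdims=True)


def fit_oversampled_ensemble(X, y, n_members=10, seed=0):
    """Train a small ensemble; y is assumed to be 0 (majority) / 1 (minority)."""
    rng = np.random.default_rng(seed)
    X_maj, X_min = X[y == 0], X[y == 1]
    members = []
    for _ in range(n_members):
        # Each member sees a different set of perturbed minority samples,
        # which rebalances the classes and adds diversity to the ensemble.
        X_syn = perturbation_oversample(X_min, max(0, len(X_maj) - len(X_min)),
                                        seed=int(rng.integers(1 << 31)))
        Xb = np.vstack([X_maj, X_min, X_syn])
        yb = np.r_[np.zeros(len(X_maj)), np.ones(len(X_min) + len(X_syn))]
        members.append(DecisionTreeClassifier(max_depth=5).fit(Xb, yb))
    return members


def predict_majority_vote(members, X):
    """Average the members' 0/1 predictions and threshold at 0.5."""
    return (np.mean([m.predict(X) for m in members], axis=0) >= 0.5).astype(int)
```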
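Contribution (2) combines two stream-learning ideas that can be sketched generically: a bootstrap whose sampling probabilities invert the observed class priors, and ensemble-member weights that decay according to a cost-sensitive error on each new chunk. The sketch below illustrates those two ideas under stated assumptions (integer 0/1 labels, chunk-wise processing, invented names and default parameters); it is not the CWIB algorithm itself.

```python
# Hypothetical sketch only -- not the thesis's CWIB code.
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def reversed_bootstrap(X, y, rng):
    """Bootstrap whose per-sample probabilities invert the class priors
    (labels assumed to be integers 0/1), so the rarer class is drawn more often."""
    counts = np.maximum(np.bincount(y, minlength=2), 1).astype(float)
    p = 1.0 / counts[y]
    p /= p.sum()
    idx = rng.choice(len(y), size=len(y), replace=True, p=p)
    return X[idx], y[idx]


class StreamEnsembleSketch:
    def __init__(self, n_members=5, fn_cost=5.0, decay=0.9, seed=0):
        self.n_members, self.fn_cost, self.decay = n_members, fn_cost, decay
        self.rng = np.random.default_rng(seed)
        self.members, self.weights = [], []

    def partial_fit(self, X_chunk, y_chunk):
        # Re-weight existing members by a cost-sensitive error on the new
        # chunk, so members hurt by concept drift lose influence quickly.
        for i, m in enumerate(self.members):
            pred = m.predict(X_chunk)
            cost = np.where(y_chunk == 1, self.fn_cost, 1.0)  # FN costlier than FP
            err = np.sum(cost * (pred != y_chunk)) / np.sum(cost)
            self.weights[i] = self.decay * self.weights[i] + (1 - self.decay) * (1 - err)
        # Train one new member on an imbalance-reversed bootstrap of the chunk.
        Xb, yb = reversed_bootstrap(X_chunk, y_chunk, self.rng)
        self.members.append(DecisionTreeClassifier(max_depth=5).fit(Xb, yb))
        self.weights.append(1.0)
        if len(self.members) > self.n_members:  # keep a fixed-size ensemble
            drop = int(np.argmin(self.weights))
            self.members.pop(drop)
            self.weights.pop(drop)

    def predict(self, X):
        w = np.asarray(self.weights)
        votes = np.array([m.predict(X) for m in self.members])
        return (w @ votes / w.sum() >= 0.5).astype(int)
```

The fixed-size pool with the lowest-weight member evicted on each chunk is one simple way to keep memory bounded while letting the weighting mechanism phase out members invalidated by drift.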
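Contribution (3) rests on a nearest-neighbour decision over known classes plus a bounded amount of state for potential new classes. A minimal sketch of that general pattern is shown below; the class name, the plain distance-threshold rule, and the candidate buffer are assumptions made for illustration and do not come from the thesis's KNNEND algorithm.

```python
# Hypothetical sketch only -- names and the distance-threshold rule are assumptions.
import numpy as np
from collections import deque


class KNNNoveltySketch:
    """k-NN classifier over known classes plus a bounded buffer of candidate
    new-class samples, so maintenance cost stays fixed as the stream grows."""

    def __init__(self, k=5, threshold=2.0, buffer_size=50):
        self.k = k
        self.threshold = threshold
        self.candidate_buffer = deque(maxlen=buffer_size)  # fixed-size pool
        self.known_X = None
        self.known_y = None

    def fit(self, X, y):
        self.known_X = np.asarray(X, dtype=float)
        self.known_y = np.asarray(y)
        return self

    def predict_one(self, x):
        """Return a known-class label, or -1 to flag a possible new class."""
        x = np.asarray(x, dtype=float)
        dists = np.linalg.norm(self.known_X - x, axis=1)
        nearest = np.argsort(dists)[: self.k]
        if dists[nearest].mean() > self.threshold:
            # Far from every known class: remember it cheaply and flag it.
            self.candidate_buffer.append(x)
            return -1
        # Otherwise vote among the k nearest known-class neighbours.
        labels, counts = np.unique(self.known_y[nearest], return_counts=True)
        return labels[np.argmax(counts)]
```

Bounding the candidate pool is what keeps the update cheap: flagging a sample is a constant-time append, and the oldest candidates are discarded automatically once the buffer is full.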
