节点文献

结构驱动的单类分类器设计及拓展研究

Structure Driven for One-Class Classifiers Design and Extended Research

【作者】 冯爱民

【导师】 陈松灿;

【作者基本信息】 南京航空航天大学 , 计算机应用技术, 2011, 博士

【摘要】 分类器设计中的一个根本问题是如何提高其泛化能力,即根据从已有样本中获取的知识对未知样本进行判别的能力。因此,尽可能从样本中获取足够多的先验知识是提高泛化能力的重要途径之一。在单类分类中,由于异常样本缺失使数据信息大为减少,从而使分类器设计更为困难,因此尽力挖掘仅有的一类目标数据中的先验知识尤为重要。本文围绕着单类分类器设计,并针对数据局部密度信息、结构信息、簇信息以及少量异常数据信息等先验知识的挖掘和利用进行了较为深入的研究,所取得的主要成果如下:(1)以密度估计和支持域方法为主线综述了单类分类器的主要研究方法,并提出了一种两者结合的混合模型。从密度估计、支持域的角度重新梳理了单类分类器的主要算法,并针对作为研究热点的支持域方法,在洞察出各算法联系的基础上深入分析了超平面和超球模型的主要改进/变形模型;通过在支持域方法中嵌入局部密度,提出了适用于非对称数据分布的混合模型——局部密度嵌入的单类支持向量机线性规划算法。(2)提出了结构驱动学习策略及相应单类模型——结构单类支持向量机(SOCSVM),并推导出对应的错分误差界。针对现有超平面单类分类器只注重局部或全局学习之不足,提出了全局与局部学习兼备的结构驱动学习策略,并以单类支持向量机为原型,通过嵌入数据结构分布信息设计出相应模型——结构单类支持向量机。作为本文后续工作的研究基础,理论证明了其最优解具有唯一性和鲁棒性,并推导出相应的错分误差界。人工和UCI数据上的实验结果表明:考虑了数据分布趋势的SOCSVM具有更强的数据描述能力,从而验证了结构驱动学习策略的有效性;(3)发展出了适用于单类多簇数据分布的结构驱动模型——结构大间隔单类分类器(SLMOCC)。基于数据分布的结构驱动学习使多簇目标数据的处理不同于单簇数据,并因此衍生出相应的分类器——结构大间隔单类分类器(SLMOCC)。通过分别约束各簇数据到最优超平面的马氏距离,SLMOCC最大程度地利用了数据的结构信息并因此具有了更精细的数据描述。为捕捉数据的多簇分布,采用了可自动确定聚类数目的凝聚型层次聚类算法。人工和UCI数据上的实验结果表明SLMOCC的性能有显著提高。(4)构建了单类和两类问题以及数据不平衡问题的统一框架。通过在SOCSVM最小化正半空间的同时最大化正负类间隔或将结构驱动学习策略作用于ν-SVM算法并引入超平面阈值,发展出了两等价模型并统称为偏结构数据描述与判别机(BSD3M)框架,放宽了经典SVM中两类支持向量平衡的限制,并因此能够根据需要控制超平面的位置。通过合理设置目标函数及判别函数中的相应参数,BSD3M不仅可用于含极少量异常的单类问题以提高数据描述能力,同时也可推广于正负类数据大致平衡的两类问题及少量数据更为重要的数据不平衡问题。部分UCI数据集上含5%异常数据的实验结果表明:充分利用负类信息的BSD3M较之单类算法和SVM算法更为准确地描述了目标数据区域。(5)推广出了一系列与上述各模型学习能力相当的线性规划快速算法以提高计算效率。通过最小化目标数据到超平面的函数距离,推广出SOCSVM的线性规划算法SlpOCSVM并因此使计算复杂度从二次规划的O ( n 3)提升至最快O ( n );进一步将该思想应用于多簇目标数据并以各簇数据协方差之和取代整体协方差矩阵,可将SLMOCC二次锥规划的多项式时间大幅降低;而将SlpOCSVM嵌入数据类间隔亦发展出了BSD3M的线性规划算法。上述各快速算法的实验结果验证了结构驱动学习及多簇信息嵌入同样适用于非间隔算法。

【Abstract】 The key problem for the classifier design is how to improve the generalization capability, that is the power to predict the unseen samples according to the knowledge obtained from the training data. One of the best ways to improve this issue is acquiring the prior knowledge mostly from the data. As for one-class classifiers design, the absent or few of the abnormality reduced the information abundantly. Under this much harder situation, it is more imperative to mine the prior knowledge from the only existed normality. This thesis concentrated on mining the prior knowledge for one-class classifiers design including local density information, structure information, cluster distribution and implicit information embodied in the few abnormality etc., the main contributions are listed as follows:(1) Review the key methods about one-class classifiers with the viewpoint of the density estimation and supporting domain, then proposed a hybrid model combined by these two methods. Summarized the main algorithms of one-class classifiers based on density estimation and supporting domain point of view, further analyzed the improve/variant algorithms of supporting domain models divided by hyperplane and hypersphere according to pinpoint the relations among each algorithm. Proposed a hybrid model adapted to nonsymmetrical data through embedding the local density into the supporting domain method.(2) Proposed the Structure-Driven Learning (SDL) strategy, designed the corresponding algorithm named Structured One-Class Support Vector Machine (SOCSVM), and derived its error bound. Perceived the drawbacks of the present one-class classifiers which merely emphasize on one side of local or global leaning but neglect the other, proposed the integrated learning strategy called Structure-Driven Learning. Through embedding the structure distribution information into the prototype of One-Class Support Vector Machine, a novel algorithm, SOCSVM, which satisfied SDL was proposed with more competence on data description and generalization capability. The uniqueness, resistance and the derived error bound for its optimal solution have been theoretically proved. As the foundation of the following research, the results of toy and UCI benchmark dataset illustrated the validity of SOCSVM and its leaning strategy.(3) Proposed a new Structure-Driven Learning model for multi-cluster data. Structure-Driven Learning strategy implemented by structure information embedded makes multi-cluster data’s processing much different from the single ones. So, it is more reasonable to consider the structure of each cluster respectively than just simply treat all of them as a whole. Structure Large Margin One-Class Classifier (SLMOCC), an algorithm adapted to multi-cluster data, fulfills the above strategy by restricting each data’s Mahalanobis distance to the hyperplane. Through maximizing the minimum Mahalanobis distance margin, SLMOCC can find a more reasonable optimal hyperplane attributed to its finer cluster granularity description. As for extracting the underlying data structure, SLMOCC uses the Ward’s agglomerative hierarchical clustering on input data or mapping data in kernel space. The experiment results on toy and UCI benchmark data demonstrate the improved generalization capability of SLMOCC.(4) Developped a unified framework for one-class, binary-class and imbalance data problems. On the basis of the SDL model, an integrated framework named Biased Structure Data Description & Discrimination Machine (BSD3M) is developed either by further embedding the class margin into SOCSVM or also optimizing the threshold of the hyperplane with the target function ofν-SVM. Thus, the broken constraint of the equal positive and negative support vectors in classical SVM makes it possible to control the position of hyperplane on demanding. Through reasonable predefined the parameters of the object function and discrimination function, BSD3M not only can be used for one-class problem with few outliers to improve its description capability, but also can be extended to binary-class with almost balance data or imbalance problem with more important minority data for improving their discrimination ability. The primary experiment results of the majority normal data with 5 percent outliers indicate the more competitive boundary for BSD3M than the above one-class classifiers ascribe to the consideration of the outliers.(5) Generalized a series of linear programming algorithms for the above models with the comparative power with the advantage of efficiency computation. Through minimizing the average function distance from the target data to the hyperplane, a linear programming algorithm called SlpOCSVM can be obtained with the reduced computation complexity from O ( n 3)of SOCSVM’s quadratic programming even to O ( n ). Further apply this idea to multi-cluster data and then replace the whole covariance matrix with summing up all cluster’s covariance matrix, the simple version of SLMOCC sharply decreases the polynomial complexity of Second-order Cone Programming. Same embedded the class margin to SlpOCSVM also can get the linear programming of BSD3M. The experiment results of SlpOCSVM and simple version of SLMOCC showed that the validity of the Structure-Drive Learning and multi-cluster information embedding for non-margin linear programming algorithms.

节点文献中: 

本文链接的文献网络图示:

本文的引文网络