模式识别中的样本选择研究及其应用

Research on Sample Selection and Its Applications in Pattern Recognition

【Author】 姜文瀚 (Jiang Wenhan)

【Supervisor】 杨静宇 (Yang Jingyu)

【Author's Basic Information】 Nanjing University of Science and Technology, Computer Application, 2008, PhD

【Abstract (Chinese)】 With the development of information technology, the data people need to process are increasingly high-dimensional and massive, and the accompanying difficulty is how to analyze and exploit these data effectively. This is a problem shared by pattern recognition, data mining, neural networks, and machine learning. In statistical pattern recognition, the computational complexity of many classification methods grows rapidly with the number of training samples, so handling larger data sets often becomes difficult. A direct and effective remedy is to reduce the training set through sample selection while preserving the classification performance of the learning algorithm. Sample selection can lower the computational cost of an algorithm and speed up learning, and may also avoid "overfitting" and thereby improve the generalization ability of the classifier. For the family of classifiers whose decisions depend on the convex hulls of the training samples, including the linear support vector machine, the nonlinear support vector machine, the nearest neighbor convex hull classifier, and the kernel nearest neighbor convex hull classifier, this dissertation proposes several sample selection methods and verifies their effectiveness experimentally.

The dissertation first proposes the subclass convex hull sample selection method. Within one class of training samples, it iteratively selects, one at a time, the sample farthest from the convex hull of the selected set, so that the hull of the selected set approximates the original hull as closely as possible. It is proved that the selected samples are edge points of the original training set's convex hull. Combined with the linear support vector machine and the nearest neighbor convex hull classifier, the method achieves good experimental results.

Combining the kernel trick with the subclass convex hull algorithm yields the kernel subclass convex hull sample selection method: kernel functions replace the inner products of the original algorithm, so the selection is neatly carried out in the feature space. The method is used to reduce the training sets of the nonlinear support vector machine and the kernel nearest neighbor convex hull classifier; experiments show that the small number of selected samples effectively supports both classifiers.

The dissertation then proposes the subspace sample selection method, another within-class method, which iteratively selects the sample farthest from the subspace spanned by the already-selected samples. It is proved that the selected samples are not only edge points of the original training set's convex hull but also linearly independent of one another. Applied to the linear support vector machine, the method keeps recognition performance high while clearly shortening the training and testing time of the classifier.

Finally, introducing kernel functions into the subspace method gives the kernel subspace sample selection method: the training samples of each class are first mapped into the feature space by a nonlinear mapping, and the same selection procedure as the subspace method is performed there. In the validation experiments the method selects samples for the nonlinear support vector machine; while preserving the classifier's generalization ability, it selects few samples and selects them quickly, showing a clear comparative advantage.
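As a concrete illustration of the subclass convex hull selection loop just described, here is a minimal Python sketch. It approximates the point-to-hull distance with Frank-Wolfe iterations over the simplex rather than an exact quadratic program, and the farthest-from-mean seed and all names are illustrative assumptions, not the dissertation's implementation:

```python
import numpy as np

def dist_to_hull(x, S, iters=200):
    """Approximate squared distance from x to the convex hull of the
    rows of S, via Frank-Wolfe iterations over the simplex."""
    alpha = np.full(len(S), 1.0 / len(S))   # start at the barycenter
    for t in range(iters):
        r = S.T @ alpha - x                 # current hull point minus x
        grad = 2.0 * (S @ r)                # gradient w.r.t. the weights
        j = int(np.argmin(grad))            # best single vertex to move toward
        step = 2.0 / (t + 2.0)              # classic Frank-Wolfe step size
        alpha *= 1.0 - step
        alpha[j] += step
    return float(np.sum((S.T @ alpha - x) ** 2))

def subclass_convex_hull_select(X, m):
    """Greedy within-class selection: repeatedly add the sample farthest
    from the convex hull of the already-selected set."""
    mean = X.mean(axis=0)
    chosen = [int(np.argmax(np.sum((X - mean) ** 2, axis=1)))]  # assumed seed
    while len(chosen) < m:
        d = [dist_to_hull(x, X[chosen]) for x in X]
        for i in chosen:                    # never re-pick a selected sample
            d[i] = -1.0
        chosen.append(int(np.argmax(d)))
    return chosen
```

For the kernelized variant described next, the squared distance expands into Gram-matrix terms (αᵀKα − 2kₓᵀα + k(x,x)), so the same greedy loop can run in feature space with every inner product replaced by a kernel evaluation.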

【Abstract (English)】 Advances in digital technology and computing have led to an information explosion and massive data collection. The ensuing puzzle is how to analyze and utilize so much data, a challenge common to pattern recognition, data mining, neural networks, and machine learning. In statistical pattern recognition, the computational complexity of many classifiers increases quickly with the number of training samples, so those classifiers often become computationally intractable on large data sets. One available way to circumvent this difficulty is to reduce the training set in advance via sample selection. Sample selection not only decreases computational costs and speeds up learning, but may also avoid the "overfitting" phenomenon and improve the prediction performance of the classifiers.

This dissertation studies sample selection strategies for classifiers that involve solving convex quadratic programming problems, namely the linear support vector machine, the nonlinear support vector machine, the nearest neighbor convex hull classifier, and the kernel nearest neighbor convex hull classifier. Such problems require large memory and enormous time for large-scale data, so it is desirable to remove data from the training set without degrading prediction accuracy. By the geometric interpretations and principles of these classifiers, the convex hull of each class plays an important role in their decisions; samples on the edge of a hull are therefore expected to be the most informative and are taken as candidates. Based on this idea, several selection methods are presented and demonstrated on the above classifiers using face recognition databases and an object image library. The proposed strategies are also compared with other methods in terms of their ability to shrink the training set while maintaining the generalization performance of the classifiers.

First, a novel method named the subclass convex hull sample selection algorithm is presented. It is iterative: within one class, at each step the single sample farthest from the convex hull of the chosen set is picked out, so that the hull of the chosen set approximates the hull of the original set as closely as possible. It can be proved that the selected samples are edge points of the original training set's convex hull. The algorithm serves as a preprocessing step for the linear support vector machine and the nearest neighbor convex hull classifier; experiments show that a significant amount of training data can be removed without degrading the resulting classifiers.

The second method is the kernel subclass convex hull sample selection algorithm, constructed by replacing the dot products in the subclass convex hull algorithm with kernel functions. The experimental results provide promising evidence that it can be employed successfully ahead of the nonlinear support vector machine and the kernel nearest neighbor convex hull classifier.

Another proposed method is the subspace sample selection algorithm, which reduces each class independently. It is likewise iterative, but at each step the sample farthest from the subspace spanned by the chosen samples is selected. The chosen samples are linearly independent edge points of the original training set's convex hull, and the distance computation has an analytic solution, so the method selects samples faster than the subclass convex hull algorithm, as the experiments verify. With the reduced set as training data, the linear support vector machine is applied to the face sets and the object image library; the few samples chosen by the proposed mechanism significantly reduce the computation time without any decrease in prediction accuracy.

Finally, the kernel trick is introduced into the subspace sample selection algorithm, yielding the kernel subspace sample selection algorithm. Through the kernel mapping, samples in the input space are mapped into a high-dimensional feature space, where a selection process similar to the subspace algorithm is performed. In the experiments, the computation times of nonlinear support vector machines trained with the proposed reduction are much smaller than those of the other compared methods; without sacrificing classification accuracy, the algorithm requires less time and selects fewer training samples.
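The subspace selection rule described above admits a closed-form distance: the residual of projecting a sample onto the span of the chosen set. A minimal sketch, with plain least squares standing in for the analytic projection (the longest-sample seed, the tolerance, and all names are illustrative assumptions):

```python
import numpy as np

def dist_to_span(x, S):
    """Squared distance from x to the linear span of the rows of S,
    via the least-squares residual (the analytic solution)."""
    coef, *_ = np.linalg.lstsq(S.T, x, rcond=None)
    r = x - S.T @ coef
    return float(r @ r)

def subspace_select(X, m, tol=1e-10):
    """Greedy within-class selection: repeatedly add the sample farthest
    from the subspace spanned by the already-selected set."""
    chosen = [int(np.argmax(np.sum(X ** 2, axis=1)))]  # assumed seed
    while len(chosen) < m:
        d = np.array([dist_to_span(x, X[chosen]) for x in X])
        d[chosen] = -1.0
        j = int(np.argmax(d))
        if d[j] <= tol:          # remaining samples already lie in the span,
            break                # so the chosen set stays linearly independent
        chosen.append(j)
    return chosen
```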
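The kernel subspace variant evaluates the same point-to-span distance in feature space using only Gram-matrix entries, d² = k(x,x) − kₓᵀK⁻¹kₓ. A sketch assuming an RBF kernel and a small ridge term for numerical stability (both are assumptions, not choices stated in the abstract):

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    """RBF kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

def kernel_dist_to_span(x, S, gamma=0.5, ridge=1e-8):
    """Squared feature-space distance from phi(x) to span{phi(s): s in S},
    i.e. k(x,x) - k_x^T K^{-1} k_x, with a ridge for numerical stability."""
    K = rbf(S, S, gamma) + ridge * np.eye(len(S))
    k_x = rbf(S, x[None, :], gamma)[:, 0]
    return float(1.0 - k_x @ np.linalg.solve(K, k_x))  # k(x,x) = 1 for RBF
```

Substituting kernel_dist_to_span for dist_to_span in the greedy loop above reproduces the kernel subspace selection procedure; a linear kernel recovers the input-space version.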

  • 【CLC Number】 TP18
  • 【Cited by】 21
  • 【Downloads】 1671