Research on Feature Extraction and Feature Selection Algorithms Based on Effective Distance
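【Author】 Zhang Dan (张丹);
【Advisor】 Zhang Daoqiang (张道强);
【Author information】 Nanjing University of Aeronautics and Astronautics, Computer Science and Technology, 2017, Master's thesis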
【Abstract】 In machine learning and pattern recognition, feature extraction and feature selection are important techniques for handling high-dimensional data and have been widely applied in information retrieval, text classification, and disease diagnosis. Most feature extraction and feature selection algorithms rely on similarity to describe the relationships between samples, and this similarity is usually computed with the conventional Euclidean distance. Because Euclidean distance is static, it tends to ignore both the influence of surrounding samples on the target sample and the latent dynamic structure among samples. To capture this latent dynamic structure, this thesis computes the distance between samples on top of the global topological structure, taking into account the relationships between the target sample and the other samples; we call this the effective distance. We then use the effective distance to compute the similarity between samples and propose improved feature extraction and feature selection algorithms based on it. The main contributions and research work of this thesis are as follows.

On the one hand, we propose two ways to compute the effective distance between samples: KNN (k Nearest Neighborhood)-based effective distance and sparse representation-based effective distance. Both depend on the topological structure of the samples, so we first construct a bidirectional (bilateral) topological network from either the sparse reconstruction relationships or the neighborhood relationships among samples, and then compute the effective distance between two samples on this network. Next, we introduce the similarity matrix derived from the effective distance into feature extraction algorithms, yielding effective distance-based feature extraction algorithms. Experimental results show that these algorithms effectively capture the global and local structure of the samples and achieve better classification performance than conventional methods using Euclidean distance.

On the other hand, we first obtain the sparse reconstruction relationships among samples through sparse representation and use them to construct the global topological structure, from which the effective distance between samples can be computed. The similarity based on effective distance is then used during feature selection to evaluate the importance of each feature. In addition, we incorporate an iterative scheme into feature selection so that the optimal feature subset is selected gradually. The result is an iterative feature selection algorithm based on effective distance. Experiments on a series of UCI data sets show that, compared with feature selection algorithms that use Euclidean distance, the proposed effective distance-based feature selection algorithm selects better features and thereby improves classification performance.
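The abstract does not state the formula for the effective distance, so the following is only a minimal sketch of the KNN-based variant under stated assumptions. It assumes the definition commonly used in network science, d(i, j) = 1 - ln P(i, j), where P(i, j) is the share of sample i's outgoing connection weight that goes to sample j. The Gaussian edge weights, the default bandwidth `sigma`, and the function name `knn_effective_distance` are illustrative choices, not taken from the thesis.

```python
import numpy as np

def knn_effective_distance(X, k=3, sigma=None):
    """Illustrative effective distance on a directed kNN graph.

    Assumed definition (not given in the abstract): D[i, j] = 1 - ln P[i, j],
    where P is the row-normalised weight matrix of the kNN graph.
    Non-neighbours get an infinite distance.
    """
    n = X.shape[0]
    # Pairwise Euclidean distances; the diagonal is excluded from neighbour search.
    diff = X[:, None, :] - X[None, :, :]
    euclid = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(euclid, np.inf)

    # Indices of the k nearest neighbours of every sample.
    nbrs = np.argsort(euclid, axis=1)[:, :k]
    rows = np.repeat(np.arange(n), k)
    cols = nbrs.ravel()

    # Assumed bandwidth: mean distance to the selected neighbours.
    if sigma is None:
        sigma = euclid[rows, cols].mean()

    # Directed graph with Gaussian weights on kNN edges only (an assumption).
    W = np.zeros((n, n))
    W[rows, cols] = np.exp(-euclid[rows, cols] ** 2 / (2.0 * sigma ** 2))

    # P[i, j]: fraction of i's total outgoing weight that goes to j.
    P = W / W.sum(axis=1, keepdims=True)

    # Effective distance; it is asymmetric because the kNN graph is directed.
    D = np.full((n, n), np.inf)
    mask = P > 0
    D[mask] = 1.0 - np.log(P[mask])
    np.fill_diagonal(D, 0.0)
    return D

# Four points on a line; the last one is an outlier.
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [10.0, 0.0]])
D = knn_effective_distance(X, k=2)
S = np.exp(-D)  # one possible similarity; non-neighbours get similarity 0
```

Unlike the Euclidean distance, `D[i, j]` depends on how close `j` is relative to the other neighbours of `i`, so `D[i, j]` and `D[j, i]` generally differ. In this toy data the outlier sees sample 2 as a neighbour, but sample 2 does not see the outlier as one.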
【Key words】 Feature extraction; feature selection; dynamic structure; topological structure; effective distance; similarity; classification;