Research on Classification Method and Its Application in Risk Decision-Making with Feature Space Heterogeneity
【Author】 Wang Yu (王昱);
【Author information】 University of Science and Technology of China, Management Science and Engineering, 2009, PhD
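【Abstract (translated from the Chinese)】 Risk decision-making includes a broad class of problems with the following structure: the decision maker first establishes the dependence between historical data samples and the states of nature, then uses that dependence to estimate the probability of each state of nature for a new data sample, and finally builds a risk decision model that selects the optimal action by maximizing a revenue function (or minimizing a risk loss function). Because they require establishing the dependence between historical data samples and states of nature, such problems can be cast as classification problems in data mining, so the various classification methods of data mining can be applied to them. Since the efficiency and accuracy of classification have a critical influence on risk decisions, research on classification methods and their application in risk decision-making has both theoretical and practical importance. Existing research has mainly approached risk decision problems from the perspective of classification methods and their applications. In fact, exploring and understanding the characteristics of the data before applying any data mining technique has an important, even decisive, influence on the mining results. In classification problems, feature space heterogeneity is an important data characteristic that significantly affects the results of classification methods. This thesis therefore studies classification methods that account for feature space heterogeneity and their application to risk decision problems. The aim is to explore this data characteristic and to propose corresponding solutions that improve classification accuracy, so that classification methods can better support risk decisions.
Chapters 1 through 6 are organized as follows. Chapter 1 outlines the research background, reviews related work on classification methods and their use in risk decision problems as well as the state of research on feature space heterogeneity in classification, and states the content and significance of the thesis.
Chapter 2 first gives a brief introduction to classification problems and then outlines feature relevance and feature selection in classification. On this basis, it describes the concept of feature space heterogeneity with reference to a range of theoretical and applied studies. Because feature space heterogeneity cannot be observed or measured directly from a data set, the chapter draws on the basic idea of meta-analysis and proposes a measure of heterogeneity that combines global feature selection with random partitioning of the data set. Experimental results on a series of benchmark data sets and artificially mixed data sets show that the measure is effective.
Chapter 3 examines how feature space heterogeneity affects classification performance. It first briefly analyzes the effect of heterogeneity and then shows empirically that heterogeneity present in a classification problem has a fairly significant effect on classification accuracy. The classifier used in this chapter integrates logistic regression with support vector machines; its main idea is to use the output probabilities of logistic regression as supporting information for the support vector machine in order to improve discrimination accuracy. The empirical study is set in corporate financial distress prediction, a risk management and decision problem. By comparing prediction accuracy before and after heterogeneity is taken into account, it shows that accounting for heterogeneity improves accuracy in classification problems where heterogeneity is present.
Chapter 4 proposes a classification strategy based on factor analysis and cluster analysis. The original features are first transformed into new features that reflect the heterogeneity of the original features across the sample space. Cluster analysis then yields sample subsets that are each homogeneous in feature space, and a separate classification model is built in each subset, which reduces the effect of heterogeneity on classification accuracy. For a sample with an unknown class label, the strategy first converts the sample into a factor score vector, then assigns that vector to the nearest sample subset by the nearest-neighbor rule, and finally classifies it with the model of that subset. Experimental results on a series of benchmark data sets show that the strategy is effective.
Chapter 5 proposes a classification method that accounts for feature space heterogeneity and supports incremental learning. It suits problems that exhibit feature space heterogeneity and require the decision maker to use classification for online, real-time risk decisions. The method first partitions the data set with grid-based supervised clustering, producing a number of clusters in each of which all sample points share the same class label. After outliers are removed, the method computes the feature relevance within each cluster, uses it as the feature weights in the distance measure, and classifies with the nearest-neighbor method. The chapter closes by applying the method to a series of benchmark data sets and to a customer identification problem in marketing; the empirical results show that the method is effective.
Chapter 6 summarizes the work, lists the main innovations, points out the current limitations of the research, and considers questions for further study based on the results obtained.
The main innovations of this thesis are as follows: (1) An effective measure of feature space heterogeneity is proposed. It can be used to explore the heterogeneity present in a classification problem and provides strategic information for solving the problem. (2) A classification method integrating logistic regression and support vector machines is proposed. It uses the posterior probabilities obtained from logistic regression to correct the output of the support vector machine and can effectively improve on the accuracy of conventional support vector machines. (3) An effective classification strategy that accounts for feature space heterogeneity is proposed. It partitions a heterogeneous data set into several homogeneous subsets and improves accuracy by building a separate classification model in each subset. (4) A classification method with incremental learning is proposed. It can handle, incrementally, the changes in heterogeneity patterns caused by frequent updates to the data, and can be applied effectively to risk decision problems that exhibit feature space heterogeneity and require real-time online decisions.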
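The decision step described at the start of the abstract (estimate the probability of each state of nature, then choose the action with the best expected outcome) can be illustrated with a minimal sketch. The state names, action names, probabilities, and payoff values below are hypothetical and do not come from the thesis; in practice the probabilities would be produced by a classifier.

```python
def expected_payoffs(state_probs, payoff):
    """Return the expected payoff of every action.

    state_probs: dict mapping state of nature -> estimated probability
    payoff:      dict mapping action -> dict mapping state -> payoff
    """
    return {
        action: sum(state_probs[s] * by_state[s] for s in state_probs)
        for action, by_state in payoff.items()
    }


def best_action(state_probs, payoff):
    """Pick the action that maximizes expected payoff."""
    scores = expected_payoffs(state_probs, payoff)
    return max(scores, key=scores.get)


# Hypothetical lending decision: the classifier estimates the probability
# that a firm ends up in financial distress.
state_probs = {"distress": 0.3, "healthy": 0.7}
payoff = {
    "grant_loan": {"distress": -100.0, "healthy": 20.0},
    "reject": {"distress": 0.0, "healthy": 0.0},
}

print(best_action(state_probs, payoff))  # chooses "reject" for these numbers
```

Because the chosen action depends directly on the estimated probabilities, a more accurate classifier can change the decision; with a distress probability of 0.1 the same payoff table favors granting the loan.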
【Abstract】 In risk decision-making, there is a class of problems in which the decision maker needs to establish a relationship between historical data samples and the states of nature and then, for a new data sample, estimate the probability of each state of nature. Based on this information, the decision maker uses a risk decision-making model to choose the action that maximizes the expected revenue function (or minimizes the risk loss function). From the perspective of establishing a relationship between historical data samples and states of nature, these problems reduce to classification problems in data mining, so various classification techniques can be applied to them. Since the accuracy and efficiency of the classification techniques used are critically important, research on classification methods and their applications in risk decision-making plays an important role in both theory and practice. Most related studies have focused on classification techniques and their applications in different kinds of risk decision-making problems. In fact, exploring and understanding the characteristics of the data before any data mining technique is applied is important for the results. In classification, feature space heterogeneity is an important data characteristic and has a significant impact on classification performance. This thesis focuses on classification methods that consider feature space heterogeneity and on their applications in risk decision-making. The main aim of the research is to explore the existence of feature space heterogeneity in classification problems and to develop novel classification approaches that deal with it and improve classification accuracy, which in turn supports risk decision-making.
The thesis is organized as follows. In Chapter 1, we first explain the background of the thesis and then review the literature on classification approaches and their applications in risk decision-making, as well as the research on feature space heterogeneity in classification problems. The content and significance of the thesis are addressed at the end of Chapter 1.
In Chapter 2, the basic idea of classification in data mining is first introduced, followed by a brief description of feature relevance and feature selection in classification problems. We then introduce the concept of feature space heterogeneity addressed in this thesis. Since feature space heterogeneity is not directly observable from the data set, we propose a measure for detecting and evaluating it in a classification problem, based on the main idea of meta-analysis. The main steps of the proposed measure are global feature selection and random sample partitioning. Experimental results on a series of benchmark data sets and artificially mixed data sets verify the effectiveness of the proposed measure.
In Chapter 3, the impact of feature space heterogeneity on classification performance is investigated. We first briefly analyze the characteristics of feature space heterogeneity in classification and then demonstrate that it degrades classification performance if it is not considered. In this chapter, we propose a novel classification approach based on the integration of logistic regression and support vector machines (SVMs). The main idea of this approach is to use the posterior probabilities obtained by logistic regression to modify the outputs of the SVMs. In the experimental study, we demonstrate that for a classification problem with feature space heterogeneity, it is advantageous to partition the sample data set into homogeneous subsets and construct a specific classifier in each subset.
In Chapters 4 and 5, two different classification approaches for dealing with feature space heterogeneity are presented. Chapter 4 proposes a Classification Algorithm based on Factor Analysis and Clustering (CAFAC) to eliminate feature space heterogeneity and improve classification performance. In CAFAC, an orthogonal factor analysis model is first applied to transform the original features into new features without irrelevance and redundancy. Heterogeneity in the original feature space is reflected in the differences among the new features and is captured by the clustering method adopted in our approach. We therefore obtain a number of subsets, in each of which the feature space is homogeneous. A component classifier is then constructed in each subset for classification. Experimental results on a series of benchmark data sets and artificially mixed data sets verify the effectiveness of CAFAC.
In Chapter 5, we develop a novel classification algorithm, Supervised Clustering for Classification with Feature Space Heterogeneity (SCCFSH), which can be applied to online risk decision-making problems with hard time and resource constraints. The approach consists of four main steps: grid-based supervised clustering, supervised hierarchical grouping of clusters, feature relevance evaluation in each cluster, and weighted distance calculation for classification. The main advantage of SCCFSH is that it can deal with feature space heterogeneity in classification problems in a scalable and incremental way. Computational results in the experiments verify the efficiency and effectiveness of the proposed approach.
Chapter 6 concludes the thesis and gives some directions for further research.
The innovations and contributions of this thesis are briefly summarized as follows: (1) An effective measure for identifying and evaluating feature space heterogeneity in a classification problem is proposed. The measure can be used to explore the data characteristics and provides information for improving classification performance. (2) A novel classification approach based on the integration of logistic regression and support vector machines is proposed. The new approach uses the posterior probabilities obtained by logistic regression to modify the output of the SVMs and improves classification accuracy in comparison with conventional SVMs. (3) For classification problems with significant feature space heterogeneity, a new classification algorithm based on factor analysis and clustering is proposed. The algorithm eliminates feature space heterogeneity by partitioning the sample data set into homogeneous subsets and thus improves classification performance. (4) A new classification approach capable of solving a classification problem with feature space heterogeneity in an incremental way is developed. This method is well suited to online classification tasks with continuously changing data and hard constraints on time and resources.
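The last two steps listed for SCCFSH, feature relevance evaluation in each cluster and weighted distance calculation for classification, can be illustrated with a small sketch. The abstract does not give the exact relevance measure or distance formula, so the code below assumes a weighted Euclidean distance and takes the per-cluster weights as given; the cluster data, labels, and weights are hypothetical.

```python
import math


def weighted_distance(x, y, weights):
    """Weighted Euclidean distance (an assumed form, not taken from the thesis)."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))


def classify(x, clusters):
    """Nearest-neighbor classification over label-pure clusters.

    Each cluster is a dict with:
      'label'   - the single class label shared by its points
      'points'  - list of feature tuples
      'weights' - per-feature relevance weights for this cluster
    """
    best_label, best_dist = None, math.inf
    for cluster in clusters:
        for point in cluster["points"]:
            d = weighted_distance(x, point, cluster["weights"])
            if d < best_dist:
                best_label, best_dist = cluster["label"], d
    return best_label


# Hypothetical clusters: only the first feature is relevant in the
# "buyer" cluster, while both features matter equally in the other one.
clusters = [
    {"label": "buyer", "points": [(1.0, 5.0)], "weights": (1.0, 0.0)},
    {"label": "non-buyer", "points": [(3.0, 1.0)], "weights": (0.5, 0.5)},
]

print(classify((1.2, 1.5), clusters))  # "buyer" with the cluster-specific weights
```

With equal weights in every cluster the same query point would be assigned to "non-buyer", which shows how cluster-specific feature relevance can change the nearest neighbor.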
【Key words】 risk decision-making; classification; feature space heterogeneity; factor analysis; clustering; incremental learning;