Research of Multiply Sectioned Integration Bayesian Classifier Model
【Author】 孙铭会
【Supervisor】 董立岩
【Author Information】 Jilin University, Computer Software and Theory, 2007, Master's thesis
【Chinese Abstract】 Bayesian network classification is an important approach to the classification task in data mining, but learning an unrestricted Bayesian network has very high time complexity, so Bayesian networks with restricted structures have gradually become an active research area. The naive Bayesian classifier is a simple and effective restricted-structure Bayesian model, but its conditional independence assumption over attributes prevents it from correctly expressing the dependencies that exist among attribute variables, which limits its classification performance. To address this shortcoming, this thesis analyzes a variant form of Bayes' theorem and applies the divide-and-conquer idea to propose a new classification model based on Bayes' theorem: the Multiply Sectioned Integration Bayesian classifier model (MSIB). The model uses a proposed information-entropy-based attribute partitioning algorithm (FDBE) to divide the feature attribute set into several mutually independent subsets; according to the characteristics of these subsets, a Mixed Naive Bayesian classifier (MNB) is proposed to obtain the classification conditional probability table of each sub-module. The sub-module tables are then integrated through a transformed formula of Bayes' theorem to produce the overall classification result. Theoretical analysis and experimental comparison with the naive Bayesian classifier and the tree-augmented naive Bayesian (TAN) model show that MSIB achieves good classification performance.
【Abstract】 Data mining, a multidisciplinary subject involving databases, statistics, artificial intelligence, machine learning, and related fields, has developed rapidly in recent years. Its major task is to extract valuable knowledge and obtain usable information from data. Within data mining, classification is one of the most important techniques: it analyzes large volumes of related data and builds classification models for problems in many application areas. The Bayesian network classifier is an important model in knowledge discovery and an active research topic in many fields, but because constructing its network structure is difficult and its time complexity is very high, it was not widely adopted as a classification algorithm until the emergence of the naive Bayesian classifier. A naive Bayesian classifier can be viewed as a strongly restricted Bayesian network classifier, but its attribute independence assumption prevents it from expressing the dependencies among attributes that exist in the real world, which hurts its classification performance. Much research has therefore focused on relaxing the independence assumption. TAN relaxes the conditional independence hypothesis by allowing the attribute variables to form a tree that represents their pairwise correlation, and it has proved highly effective and accurate. However, when many attribute variables have complex correlations with one another, the tree structure cannot reflect the real relations between the attributes, and its accuracy drops. Divide and conquer is one of the best methods for handling large problems: it splits a big problem into several sub-problems, each of lower complexity than the original. We introduce divide and conquer into Bayesian classification and divide the classification task into several sub-modules.
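The naive Bayesian baseline that the thesis builds on scores each class as P(C) · ∏ᵢ P(xᵢ | C), estimating every factor from frequency counts. A minimal sketch (not the thesis's implementation; function names and the Laplace smoothing parameter `alpha` are our own choices):

```python
from collections import Counter, defaultdict

def train_nb(X, y, alpha=1.0):
    """Estimate P(C) and P(x_i | C) from categorical data, with Laplace smoothing."""
    n = len(y)
    class_counts = Counter(y)
    priors = {c: class_counts[c] / n for c in class_counts}
    cond = defaultdict(Counter)   # cond[(i, c)][v] = count of attribute i == v in class c
    values = defaultdict(set)     # distinct values seen per attribute
    for row, c in zip(X, y):
        for i, v in enumerate(row):
            cond[(i, c)][v] += 1
            values[i].add(v)
    def likelihood(i, v, c):
        return (cond[(i, c)][v] + alpha) / (class_counts[c] + alpha * len(values[i]))
    return priors, likelihood

def predict_nb(priors, likelihood, row):
    """argmax over classes of P(c) * prod_i P(x_i | c) -- the independence assumption."""
    best, best_p = None, -1.0
    for c, p in priors.items():
        for i, v in enumerate(row):
            p *= likelihood(i, v, c)
        if p > best_p:
            best, best_p = c, p
    return best
```

The product over all attributes is exactly the independence assumption that MSIB sets out to relax.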
Finally, the model combines the sub-modules' conditional probability tables (CPTs) and obtains the overall conclusion. Chapter 1 introduces data mining technology, its background and current state of development, and surveys common classification models: decision trees, rough sets, genetic algorithms, neural networks, Bayesian learning, and so on. Chapter 2 gives the basic Bayesian background and analyzes the naive Bayesian classifier, the tree-augmented naive Bayesian classifier (TAN), and the Bayesian-network-augmented naive Bayesian classifier (BAN). Chapter 3 first introduces data preprocessing, including data cleaning, sampling, transformation, and reduction, and discusses the concepts and formulas of information entropy and mutual information. It then presents a new algorithm named Feature Divide Based on Entropy (FDBE). Based on the physical meaning of information entropy and mutual information, we define the concepts of strong dependency, general dependency, and weak dependency, analyze the correlation among attributes qualitatively and quantitatively, and divide the original attribute set into several conditionally independent subsets. The purpose of this method is to prepare the data for the new Bayesian classification model presented in Chapter 4. Chapter 4 addresses the conditional independence assumption of the naive Bayesian classifier with a new classifier: the Multiply Sectioned Integration Bayesian classifier model (MSIB). We first review earlier modifications of the naive Bayesian classifier and introduce the idea of divide and conquer and the conditions for applying it, then present MSIB. The model is built as follows: sub-modules are formed from the decision attribute and the subsets produced by FDBE; each sub-module learns its own CPT; the CPTs are then integrated by a transformed Bayes formula to obtain the final result.
We also present a new classifier named Mixed Naive Bayesian (MNB). The reference attribute, which selected the subset during data preprocessing, is treated as a parent node: the sub-module structure follows TAN, with the reference attribute added as a parent of the other attributes and a child of the decision attribute. Through this process the MNB model is built. Finally, we compare the new model with the naive Bayesian classifier and TAN. The experiments show that MSIB achieves higher accuracy than TAN when a data set has many attributes with strong relations among them, while the two classifiers are comparable in other situations; on most data sets MSIB is more accurate than NBC. Compared with NBC, MSIB relaxes the independence assumption: it assumes the attribute subsets are independent of one another while the attributes within a subset may be correlated, with the within-subset relations expressed by MNB, whereas NBC assumes all attributes are mutually independent. Compared with TAN, MSIB is integrated from several sub-classification modules and can describe the relationships between attributes in more detail, whereas TAN allows most attributes at most two parent nodes. Summing up the experiments and theoretical analysis, MSIB yields better classification results. As an important method for data mining, learning Bayesian classifiers still faces many open problems and technical difficulties spanning several domains (probability theory, information theory, machine learning, and so on). In studying the Multiply Sectioned Integration Bayesian classifier model we found several areas that deserve further investigation, and at the end of this thesis we present several future directions for our research.
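The integration step combines each sub-module's posterior into an overall decision. The thesis's transformed Bayes formula is only described loosely in the abstract; under the stated assumption that the subsets are conditionally independent given the class, a standard rearrangement of Bayes' theorem gives P(C | x) ∝ ∏ₖ P(C | xₖ) / P(C)^(K−1), which is what this sketch implements (the exact formula in the thesis may differ):

```python
def combine_submodules(prior, sub_posteriors):
    """Integrate per-subset posteriors P(C | x_k) into an overall P(C | x),
    assuming the K attribute subsets are conditionally independent given C:
        P(C | x)  ∝  prod_k P(C | x_k)  /  P(C)^(K-1)
    Normalization at the end restores the proportionality constant."""
    K = len(sub_posteriors)
    scores = {}
    for c, p in prior.items():
        s = p ** (1 - K)              # divide by P(c)^(K-1)
        for post in sub_posteriors:
            s *= post[c]
        scores[c] = s
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}
```

With a uniform prior and two sub-modules leaning toward the same class, the combined posterior leans toward that class more strongly than either sub-module alone, which is the intended pooling effect.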
The algorithm currently handles only data sets of discrete attributes; data sets with continuous attributes must first be preprocessed with discretization tools, which loses information. In future work we will study how MSIB can handle continuous attributes directly. In addition, the algorithm treats all sub-modules as equally important, although they influence the decision attribute to different degrees, so we are also interested in how to weight the sub-modules' CPTs. More research will be done on these points in the future.
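The discretization step mentioned above can be done in several ways; equal-width binning is one common choice (shown here as a generic illustration, not necessarily the tool the thesis used), and it makes the information loss concrete: values inside the same bin become indistinguishable.

```python
def equal_width_discretize(values, bins=4):
    """Map continuous values to bin indices 0..bins-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0   # guard against a constant column
    return [min(int((v - lo) / width), bins - 1) for v in values]
```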
- 【Online Publisher】 Jilin University 【Online Publication Year/Issue】 2007, Issue 03
- 【CLC Number】 TP18
- 【Cited by】 1
- 【Downloads】 214