Application and Research of Large Database Mining Based on Rough Set and Genetic Algorithm (基于粗糙集和遗传算法的大数据集数据挖掘应用研究)
【Author】 Zhang Yijun (张亦军);
【Advisor】 Hu Yu (胡彧);
【Author information】 Taiyuan University of Technology, Computer Software and Theory, 2007, Master's
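【Abstract, translated from the Chinese】 Data mining (DM) is the process of extracting implicit, previously unknown, but potentially useful information and knowledge from large volumes of incomplete, noisy, fuzzy, and random data stored in databases, data warehouses, or other information repositories. Rough set theory was proposed by Z. Pawlak and has developed over some 20 years. As a new mathematical concept, it has produced substantial results in both theory and application. It does not depend on additional information outside the data set and is a powerful tool for handling noisy, imprecise, and incomplete data; it is widely applied in medical diagnosis, pattern recognition, expert systems, machine learning, and data mining. The genetic algorithm, first proposed by Holland in 1975, is an efficient search algorithm based on the genetic evolution mechanism of natural populations. It abandons traditional search strategies, simulates the process of biological evolution in nature, and searches the target space in a randomized way through artificial evolution. It treats each candidate solution in the problem domain as an individual (chromosome) of a population and encodes each individual as a symbol string. Simulating Darwinian selection and natural elimination, it repeatedly applies genetic operations (selection, crossover, and mutation) to the population, evaluates each individual with a predefined fitness function, and, following the rule of survival of the fittest, obtains successively better populations. This thesis uses rough set theory's ability to classify knowledge, combined with the evolutionary approach of genetic algorithms, to study in depth the extraction of optimal rules from large decision tables, and proposes a new data mining model. A system applying the model covers the basic stages of data mining: data preprocessing, data discretization, knowledge reduction, and rule extraction. Because large data tables have many fields and heavy information redundancy, the thesis processes them with rough set methods: after data preprocessing and discretization, the condition attributes are reduced. Attribute reduction is the core mining step; here a coarsening algorithm performs the reduction by checking the consistency of the table. For decision tables with large data volumes, attribute reduction alone is not enough, and the large number of resulting rules must also be filtered. A genetic algorithm performs this optimizing selection, obtaining a better rule set from the many rules through selection, crossover, and mutation. The system was implemented with the VC++ development tools and a SQL Server database, with rough set theory and the genetic algorithm as its core algorithm modules. Finally, the thesis describes the model's application to the PHS (Xiaolingtong) short message system of Taiyuan Netcom, where it extracts rule patterns for whether users' short messages are sent and received successfully. Validation and analysis show that the system is sound and effective, and the experimental results help maintenance staff analyze the causes of faults. The query and analysis module for bulk short messages has run on the monitoring equipment for one year, detected a number of equipment faults in time, and saved the company substantial economic losses. The results show a marked effect on the operating efficiency of the short message system and on network running quality. Applying the model is also a useful exploration of data mining through the fusion of multiple methods.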
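The attribute-reduction step described in this abstract (dropping condition attributes while the decision table stays consistent) can be illustrated with a minimal Python sketch. The abstract does not give the details of the thesis's coarsening algorithm, so this is a generic greedy backward elimination, not the author's method; the function names and the toy table are hypothetical.

```python
from typing import Hashable, List, Sequence

Row = Sequence[Hashable]


def is_consistent(rows: Sequence[Row], attrs: Sequence[int], decision: int) -> bool:
    """Return True if rows that agree on `attrs` always share one decision value."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in attrs)
        # setdefault stores the first decision seen for this key and returns it.
        if seen.setdefault(key, row[decision]) != row[decision]:
            return False
    return True


def reduce_attributes(rows: Sequence[Row], attrs: Sequence[int], decision: int) -> List[int]:
    """Greedy backward elimination: drop an attribute if the table stays consistent."""
    kept = list(attrs)
    for a in list(attrs):
        trial = [x for x in kept if x != a]
        if is_consistent(rows, trial, decision):
            kept = trial
    return kept


# Hypothetical toy table: columns 0-2 are condition attributes, column 3 is the decision.
table = [
    ("high", "busy", "day", "fail"),
    ("high", "idle", "day", "ok"),
    ("low", "busy", "night", "fail"),
    ("low", "idle", "night", "ok"),
]
reduct = reduce_attributes(table, [0, 1, 2], 3)
print(reduct)
```

The result depends on the order in which attributes are tried, so a sketch like this finds one reduct rather than a guaranteed minimal one. If the input table is already inconsistent, no attribute is dropped.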
【Abstract】 Data mining is the process of extracting previously unknown but useful information and knowledge from vast, incomplete, fuzzy, and random data stored in databases, data warehouses, or other information repositories. Rough set (RS) theory was put forward by Zdzisław Pawlak in 1982. After about twenty years of development, it has produced substantial results in both theory and application. RS does not depend on additional information beyond the data set; it is a potent tool for dealing with vague, imprecise, incomplete, and uncertain data, and it is also a new technology in data mining. RS theory is mostly used in knowledge reduction and the analysis of knowledge dependency, and it is widely used in medical diagnosis, pattern recognition, expert systems, machine learning, and data mining. The genetic algorithm (GA) is a search method based on randomization. Its search begins from a group of initial points rather than from a single point. This mechanism lets the search escape local extrema: it can refine solutions near an extremum and also explore the whole problem space, so the probability of finding the optimal value is greatly improved. This thesis applies rough set theory's ability to classify knowledge, together with the evolutionary approach of genetic algorithms, to extract the best rules from large tables, and introduces a new data mining model. The system covers the basic stages of data mining: data preprocessing, data discretization, knowledge reduction, and rule extraction. Because large tables have many fields and much redundant information, the thesis processes them with rough sets: after data preprocessing and discretization, the condition attributes are reduced. Attribute reduction is a core step in data mining; here it is carried out by a coarsening algorithm that checks whether the table remains consistent. Reduction alone is not enough for mining large tables, and the large number of resulting rules must also be filtered. A genetic algorithm performs this filtering: through selection, crossover, and mutation, the best rules are obtained from the large table. The data mining system, with rough set theory and the genetic algorithm as its core algorithms, was built with VC++ tools and a SQL Server database. Finally, the thesis presents an application of the model to the PHS short message system of the Taiyuan network communications corporation, extracting rules for whether short messages are sent and received successfully. Validation shows that the system is reliable, and the results help administrators analyze the causes of faults. The module for querying and analyzing short messages has been installed on monitoring equipment and has run for over a year; it has found many problems and saved a lot of money. The results show that it helps improve system efficiency and network running quality, and it is also a useful exploration of multi-method data mining.
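The rule-filtering step (selection, crossover, and mutation over a pool of candidate rules) might look roughly like the Python sketch below. The abstract does not specify the encoding or fitness function, so everything here is an assumption for illustration: each chromosome is a bit list marking which rules are kept, `coverage` is a hypothetical list giving the set of sample indices each rule classifies correctly, and the fitness is covered samples minus a per-rule penalty.

```python
import random
from typing import List, Sequence, Set


def fitness(chrom: Sequence[int], coverage: Sequence[Set[int]], penalty: float = 0.5) -> float:
    """Number of samples covered by the selected rules, minus a penalty per rule."""
    covered: Set[int] = set()
    for bit, samples in zip(chrom, coverage):
        if bit:
            covered |= samples
    return len(covered) - penalty * sum(chrom)


def evolve(coverage: Sequence[Set[int]], pop_size: int = 20, generations: int = 40,
           p_mut: float = 0.05, seed: int = 0) -> List[int]:
    """Evolve a rule subset; assumes at least two candidate rules."""
    rng = random.Random(seed)
    n = len(coverage)
    score = lambda c: fitness(c, coverage)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        nxt = [max(pop, key=score)[:]]  # elitism: carry the best individual over
        while len(nxt) < pop_size:
            # Selection: two binary tournaments pick the parents.
            p1 = max(rng.sample(pop, 2), key=score)
            p2 = max(rng.sample(pop, 2), key=score)
            # Crossover: one cut point, head from p1 and tail from p2.
            cut = rng.randint(1, n - 1)
            child = p1[:cut] + p2[cut:]
            # Mutation: flip each bit with probability p_mut.
            child = [1 - g if rng.random() < p_mut else g for g in child]
            nxt.append(child)
        pop = nxt
    return max(pop, key=score)


# Hypothetical candidate rules over six samples (indices 0-5).
coverage = [{0, 1, 2}, {2, 3}, {3, 4, 5}, {0}, {5}]
best = evolve(coverage)
print(best, fitness(best, coverage))
```

In this toy pool, keeping only the first and third rules covers all six samples with two rules, which is the highest score the assumed fitness allows (5.0); the sketch is not guaranteed to reach it on every seed.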
【Key words】 Rough Sets; Data Mining; Attribute Reduction; Genetic Algorithms;
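- 【Online publication contributor】 Taiyuan University of Technology 【Online publication issue】 2008, Issue 04
- 【Classification code】 TP18; TP311.13
- 【Citation count】 6
- 【Download count】 1254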