Application and Research of Fuzzy C-Means Clustering in Gene Expression Data Analysis
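【Author】 张敬华 (Zhang Jinghua);
【Supervisor】 李金铭 (Li Jinming);
【Author information】 Fujian Agriculture and Forestry University (福建农林大学), Bioinformatics Science and Technology, 2012, Master's thesis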
【Abstract】 With the rapid development of microarray technology, the volume of microarray data has grown exponentially. Unless such data are processed with effective methods, these vast data resources risk becoming a "data disaster" or useless "data garbage". Microarray data are massive, high-dimensional, highly variable, heavily contaminated, noisy, and drawn from few samples, so extracting meaningful biological information from them for human benefit is extremely challenging. Meeting this challenge and improving the overall utilization of the information, particularly when problems must be studied and analyzed with little or no prior information, has made the theory and application of fuzzy clustering a research focus in bioinformatics in recent years. This thesis studies in depth the objective-function-based fuzzy C-means clustering algorithm, currently the most widely used and most studied fuzzy clustering method, improves it to address its shortcomings in light of the characteristics of gene expression data, and applies it to gene expression data analysis. The main work and contributions are as follows:

First, in describing data screening as part of gene expression data preprocessing, the characteristics of gene expression data are taken fully into account. The experimental conditions under which the data were acquired are combined with the biological meaning and statistical significance of the data indicators DETECTION P_VALUE and ABS_CALL to propose a new coarse data screening method, and a "three-step" data screening procedure is put forward on the basis of previous research.

Second, the theory of the fuzzy C-means clustering algorithm and the state of research on it are examined closely. To address the algorithm's shortcomings, and in view of the characteristics of gene expression data, the weighted fuzzy C-means clustering algorithm proposed by earlier researchers is adopted. Drawing on the dimensionality reduction property of principal component analysis, a new method of determining the weights is proposed, based on compensating for the information lost in the reduction.

Third, fuzzy C-means clustering is sensitive to its initial parameters, especially the number of clusters C and the initial cluster centers, which makes its results unstable. Building on previous research, a new way of determining the number of clusters is given, which avoids the blindness of choosing the number of clusters without justification. A new method of determining the initial cluster centers, based on hierarchical (system) clustering, is then proposed. Finally, both with randomly selected initial cluster centers and with initialized cluster centers, the standard fuzzy C-means algorithm and the improved algorithm are used to classify bronchial epithelial cell samples from different time points and from smoke environments of different cigarette brands. The results confirm that the improved algorithm obtains better clustering results and also converges faster.

Fourth, a reasonable biological interpretation of the clustering results for the gene expression data is given.
【Key words】 gene chip; gene expression data; fuzzy C-means clustering;
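
To illustrate the kind of procedure the abstract describes, the following is a minimal pure-Python sketch of standard fuzzy C-means, with initial centers taken from a simple agglomerative clustering. It is not the thesis's implementation: the fuzzifier m = 2, the convergence tolerance, the centroid-linkage merge rule, the toy data, and the function names `fuzzy_c_means` and `agglomerative_centers` are all assumptions made for illustration. The thesis's weighting scheme and its method for choosing the number of clusters are not reproduced here.

```python
import math
import random


def agglomerative_centers(points, c):
    """Merge the two clusters with the closest centroids until c remain.

    A simple centroid-linkage stand-in for hierarchical ("system") clustering;
    the abstract does not say which linkage the thesis uses.
    """
    clusters = [[list(p)] for p in points]

    def centroid(cluster):
        return [sum(col) / len(cluster) for col in zip(*cluster)]

    while len(clusters) > c:
        cents = [centroid(cl) for cl in clusters]
        pairs = [(math.dist(cents[i], cents[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return [centroid(cl) for cl in clusters]


def fuzzy_c_means(points, c, m=2.0, centers=None, max_iter=100, tol=1e-6, seed=0):
    """Standard fuzzy C-means with alternating membership/center updates.

    points  : list of equal-length lists of floats (one row per sample)
    m       : fuzzifier (> 1); m = 2 is a common default, assumed here
    centers : optional initial centers; random samples are used if None
    Returns (centers, memberships from the last update, iterations used).
    """
    rng = random.Random(seed)
    if centers is None:
        centers = [list(p) for p in rng.sample(points, c)]
    else:
        centers = [list(v) for v in centers]
    dim = len(points[0])
    power = 2.0 / (m - 1.0)
    u = []

    for iteration in range(1, max_iter + 1):
        # Membership update: u_ik = d_ik^(-2/(m-1)) / sum_j d_jk^(-2/(m-1))
        u = []
        for p in points:
            dists = [math.dist(p, v) for v in centers]
            if any(d == 0.0 for d in dists):
                # A point lying exactly on a center belongs fully to it.
                hits = [1.0 if d == 0.0 else 0.0 for d in dists]
                u.append([h / sum(hits) for h in hits])
            else:
                inv = [d ** -power for d in dists]
                total = sum(inv)
                u.append([x / total for x in inv])

        # Center update: v_i = sum_k u_ik^m x_k / sum_k u_ik^m
        new_centers = []
        for i in range(c):
            w = [row[i] ** m for row in u]
            total = sum(w)
            new_centers.append(
                [sum(wk * p[d] for wk, p in zip(w, points)) / total
                 for d in range(dim)]
            )

        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # stop once no center moves more than tol
            break
    return centers, u, iteration


# Toy data: two well-separated groups of three 2-D points (made up for the demo).
data = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
init = agglomerative_centers(data, 2)
centers, u, n_iter = fuzzy_c_means(data, c=2, centers=init)
print("initial centers:", init)
print("final centers:", centers, "after", n_iter, "iterations")
```

Passing `centers=None` instead falls back to randomly chosen samples as initial centers, which gives a simple way to compare the two initialization strategies on the same data.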