
Research on Biclustering Algorithms and Their Application to Gene Expression Data

【Author】 王翔

【Supervisor】 李卫平

【Author information】 University of Science and Technology of China, Information and Communication Engineering, 2021, Master's

【Abstract】 Advances in gene sampling technology have made it possible to obtain large amounts of genetic information from many organisms at low cost; such data are called gene expression data. They are usually stored as a matrix, the gene expression matrix, which has very high dimensionality but few samples. Traditional clustering algorithms perform poorly on this kind of data. Biclustering algorithms emerged against this background as an efficient way to analyze gene expression matrices: by considering the row and column relationships of the matrix simultaneously, they can extract more complex information from it. Researchers have proposed different biclustering algorithms based on different assumptions about the implicit structure of the gene expression matrix. However, current mainstream biclustering algorithms share three problems: (1) high computational complexity; (2) sensitivity to noise; (3) inability to make explicit use of the results of the previous iteration. This thesis addresses these three problems. For problems 1 and 2, it proposes a preprocessing method based on singular value decomposition (SVD). The method exploits two properties of SVD: its ability to separate row and column clustering information, and its ability to suppress noise. On the one hand, separating the row and column information reduces the biclustering problem to an ordinary one-dimensional clustering problem, which avoids iterative computation and lowers the computational complexity. On the other hand, SVD-based low-rank reconstruction reduces the noise in the matrix and thereby improves clustering performance. Experimental results show that the preprocessing is compatible with a wide range of algorithms and consistently improves the clustering accuracy of various biclustering algorithms under different noise conditions. For problem 3, the thesis proposes multiple sample clustering and builds on it to design an iterative biclustering algorithm based on spectral clustering. This algorithm explicitly uses the clustering information from the previous iteration to improve clustering accuracy. Moreover, because the multiple sample data structure is widespread in practice, multiple sample clustering is useful beyond biclustering and can also be applied to recommendation systems, production management, server cluster construction, weather forecasting, and other fields. Finally, the proposed methods are evaluated on synthetic datasets and a lung cancer gene dataset. The results show that, compared with previous mainstream biclustering algorithms, the proposed algorithms significantly improve clustering accuracy while reducing computational complexity. This contributes to more efficient and accurate identification of information in gene expression matrices.
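The two roles the abstract assigns to SVD, low-rank reconstruction for noise suppression and separation of row and column clustering information, can be illustrated with a minimal sketch. This is not the thesis's algorithm: the function names, the synthetic checkerboard matrix, the chosen rank, and the sign-based two-way split (a stand-in for the one-dimensional clustering step) are all assumptions made for illustration.

```python
import numpy as np

def low_rank_reconstruct(X, rank):
    """Truncated-SVD reconstruction keeping the `rank` largest singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]

def split_by_singular_vectors(X):
    """Hypothetical helper: two-way row and column split from the signs of the
    leading singular vectors of the double-centred matrix. The sign test stands
    in for a one-dimensional clustering of each vector."""
    Xc = X - X.mean(axis=0, keepdims=True) - X.mean(axis=1, keepdims=True) + X.mean()
    U, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return (U[:, 0] > 0).astype(int), (Vt[0, :] > 0).astype(int)

# Synthetic checkerboard "expression matrix" (assumed data, not from the thesis):
# an entry is 3 when the row group equals the column group, otherwise 0.
rng = np.random.default_rng(0)
row_labels = np.repeat([0, 1], 20)   # 40 rows in two groups
col_labels = np.repeat([0, 1], 15)   # 30 columns in two groups
clean = 3.0 * (row_labels[:, None] == col_labels[None, :])
noisy = clean + rng.normal(scale=0.5, size=clean.shape)

# The clean checkerboard has rank 2, so a rank-2 reconstruction is used here.
denoised = low_rank_reconstruct(noisy, rank=2)
err_noisy = np.linalg.norm(noisy - clean)
err_denoised = np.linalg.norm(denoised - clean)

# Rows and columns are labelled separately, each from a single vector.
rows, cols = split_by_singular_vectors(denoised)
print(f"error before: {err_noisy:.2f}, after: {err_denoised:.2f}")
```

The recovered labels are defined only up to swapping 0 and 1, since the sign of a singular vector is arbitrary. In practice the rank would have to be estimated, and a real pipeline would replace the sign test with a proper clustering method.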

  • 【Classification No.】 TP311.13; Q811.4
  • 【Citations】 1
  • 【Downloads】 223