Research and System Implementation on Diagnosis Model for Schizophrenia Using SNP Data (基于SNP数据的精神分裂症的诊断模型研究与系统实现)
【Author】 Zhang Bo (张波)
【Supervisor】 Zhou Conghua (周从华)
【Author information】 Jiangsu University, Software Engineering, 2019, Master's
【Abstract】 Schizophrenia is a chronic hereditary disease. Its high incidence and long course have had a considerable impact on society, and its pathogenesis, which is still not fully understood, remains a major challenge for medicine. Genome-wide association studies (GWAS) based on single nucleotide polymorphisms (SNPs) have produced significant results in schizophrenia diagnosis research, but their progress is hampered by long study cycles and dependence on large sample sizes. With the arrival of the big-data era and the rapid development of data mining, researchers can use machine learning and deep learning to mine disease mechanisms from large volumes of data and to design diagnostic models. This thesis takes schizophrenia as its main research object and studies SNP selection methods and SNP-based diagnostic models. First, an improved fuzzy clustering algorithm is used to cluster the SNP data and select features; then the proposed deep learning model classifies the SNP data; finally, an intelligent diagnosis prototype system for schizophrenia is designed and implemented. The specific work is as follows:

(1) SNP loci number in the tens of thousands, yet most of them do not reflect the pathogenic mechanism, and redundant features cause the "curse of dimensionality", which seriously degrades the later diagnostic model. To address this, a new clustering method, GN-FCM, is proposed and applied to SNP selection. On the one hand, an SNP weight factor is introduced into fuzzy C-means clustering, because existing SNP clustering algorithms do not account for differences in the importance of SNP loci. On the other hand, a key-SNP neighborhood regularization term is proposed and added to the loss function of the fuzzy clustering, to capture the association between highly important SNPs and the other SNPs in their neighborhood. Experimental results show that the proposed method converges better than the other clustering algorithms compared, and that the SNP subsets built from it improve the results of several classifiers. The support vector machine gives the best accuracy, on average 5.83% higher than with the SNP subset built by MRMR, the second-best selection method; the decision tree gives the best F1 score, on average 5.51% higher than with MRMR.

(2) Existing classification methods do not handle overly long SNP sequences and cannot make full use of information such as the spatial distance between SNPs, which increases model complexity and lowers classification performance. To address this, a new neural network model for SNP sequence classification, Bi-SNP, is proposed. The model has a two-path (bi-stream) design. In one path, the proposed "sliding window sampling" method reconstructs several shorter sub-sequences from the original long sequence, and an attention-based LSTM learns features from each sub-sequence, which addresses the loss of classification accuracy caused by excessively long SNP sequences. In the other path, a new data transformation method combines SNP weights, gene distance, and chromosome influence to convert each sample into a dense SNP-chromosome mapping matrix, on which a CNN learns local spatial features. The features learned by the two paths are merged and passed to an LSTM model for further learning, and a random forest classifier makes the final decision. Experimental results show that Bi-SNP with the attention mechanism clearly outperforms the other models in the comparison: relative to Bi-Stream-CNN, the best of the other models, it improves classification accuracy and F1 score by an average of 3.25% and 4.36%, respectively.

(3) Building on the work above, an intelligent diagnosis prototype system for schizophrenia based on SNP data is designed and implemented.
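As a rough illustration of the "sliding window sampling" idea in (2), the sketch below cuts one long encoded SNP sequence into shorter, overlapping sub-sequences. The abstract does not give the window length, the stride, the genotype encoding, or how the tail of the sequence is handled, so the function name `sliding_windows`, the parameters `window` and `stride`, the 0/1/2 encoding, and the end-aligned last window are all assumptions made for illustration, not the settings used in the thesis.

```python
from typing import List, Sequence


def sliding_windows(seq: Sequence[int], window: int, stride: int) -> List[List[int]]:
    """Split one long encoded SNP sequence into shorter sub-sequences.

    Hypothetical helper: `window` is the sub-sequence length and `stride`
    is the step between consecutive window starts.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    # A sequence no longer than one window is returned as a single piece.
    if len(seq) <= window:
        return [list(seq)]
    last_start = len(seq) - window
    starts = list(range(0, last_start + 1, stride))
    # Assumption: add one end-aligned window so trailing SNPs are not dropped.
    if starts[-1] != last_start:
        starts.append(last_start)
    return [list(seq[s:s + window]) for s in starts]


# Toy genotype vector; 0/1/2 = number of minor alleles (an assumed encoding).
genotypes = [0, 1, 2, 0, 0, 1, 2, 2, 1, 0]
subsequences = sliding_windows(genotypes, window=4, stride=3)
for sub in subsequences:
    print(sub)
```

With a stride smaller than the window, neighboring sub-sequences overlap, so adjacent SNPs are not always separated at a window boundary. In a Bi-SNP-style design, each sub-sequence would then be passed to the attention-based LSTM path.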
【Key words】 Schizophrenia; Single nucleotide polymorphism; Random forest; Support vector machine; CNN; LSTM; Deep learning