Research on Epistasis Detection Based on Ant Colony Optimization (ACO) and Random Forest

【Author】 Wu Di (吴迪)

【Advisor】 Liu Guixia (刘桂霞)

【Author information】 Jilin University, Computer Application Technology, 2018, Master's thesis

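【Summary】 Since the 1990s, advances in biotechnology, and in gene sequencing in particular, have carried biology into a new stage. The landmark event was the launch of the Human Genome Project, whose sequencing work was fully completed by 2003. The rapid development of computer science and biology together gave rise to the interdisciplinary field of bioinformatics, which uses computing to process biological data and investigate biological questions. A genome-wide association study (GWAS) usually refers to searching the human genome for single nucleotide polymorphisms (SNPs) associated with a disease or trait. Although GWAS has produced many results, for most diseases it explains only part of the heritability. The genetic factors behind many human diseases and traits have not been found, a phenomenon known as "missing heritability". One possible cause is that the additive model used in standard GWAS analysis does not fit interactions between genotypes well. The additive model assumes that each genetic variant acts on disease independently of other variants. In practice this assumption may not hold, because epistatic effects may exist between genes. Building on a summary of earlier methods, this thesis proposes a SNP epistasis detection method based on the ant colony algorithm and random forest. The random forest out-of-bag (OOB) score is used as the evaluation criterion for a SNP set, on the rationale that a SNP set with epistasis should classify the diseased and normal populations better than a set without it. Random forests have been fairly widely used in epistasis detection with some success, but some researchers have questioned their limitations. This thesis integrates random forest into the ant colony framework, aiming to exploit its strengths while avoiding its limitations. With the ant colony algorithm as the overall framework, choices and innovations suited to the characteristics of the data are made in ant path selection, in the evaluation of selected paths, and in pheromone updating. Good heuristic information also matters considerably for improving the ant colony algorithm, so this thesis proposes a heuristic-information generation algorithm, SNPRANK. Its output can be used for feature selection and to guide ant path selection. Experiments show that the proposed combined algorithm performs well and that the fused algorithm outperforms the individual algorithms. To obtain better time efficiency and higher detection accuracy, SNPRANK is first used to filter the SNPs, removing some noise SNPs so that they do not interfere with later processing. The heuristic information generated by SNPRANK is then fused into the ant colony framework. Experiments show that the proposed algorithm performs better than earlier ant-colony-based methods and is more robust across different models. In future work, we will study better heuristic information and local search algorithms to strengthen the algorithm's performance and robustness.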
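The evaluation criterion described above can be illustrated with a minimal sketch. It assumes scikit-learn's `RandomForestClassifier` with `oob_score=True` as the random forest and uses synthetic genotypes with an XOR-style interaction between two SNPs. The function name `oob_score_for_snps`, the data, and all parameter values are assumptions made for illustration, not the thesis's implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def oob_score_for_snps(genotypes, labels, snp_indices, n_trees=200, seed=0):
    """Out-of-bag accuracy of a forest trained only on the chosen SNP columns.

    Hypothetical helper: a higher score means the SNP set separates
    cases from controls better.
    """
    forest = RandomForestClassifier(
        n_estimators=n_trees,
        oob_score=True,   # estimate accuracy on samples left out of each bootstrap
        bootstrap=True,
        random_state=seed,
    )
    forest.fit(genotypes[:, snp_indices], labels)
    return forest.oob_score_


# Synthetic data (assumption): 600 individuals, 10 SNPs coded 0/1/2.
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(600, 10))

# Purely epistatic toy phenotype: status depends on SNP 0 and SNP 1 jointly.
labels = ((genotypes[:, 0] > 0) ^ (genotypes[:, 1] > 0)).astype(int)

interacting_score = oob_score_for_snps(genotypes, labels, [0, 1])
noise_score = oob_score_for_snps(genotypes, labels, [5, 6])
print(f"interacting pair: {interacting_score:.3f}, noise pair: {noise_score:.3f}")
```

In this toy setting the interacting pair should score well above the noise pair, which is the property that lets an OOB score rank candidate SNP sets.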

【Abstract】 With the rapid development of modern biotechnology, especially gene-sequencing technology, biology has been in a new stage of vigorous development since the 1990s. One of the landmark events is that the international Human Genome Project, known as the "moon shot" of the life sciences, was launched in 1990. By the end of 2003, the sequencing of the human genome was formally completed. With the rapid development of computer science and biology, bioinformatics has boomed in the past 10 years. Bioinformatics uses computers to process data and study biological problems. A genome-wide association study (GWAS) generally means searching for single nucleotide polymorphisms (SNPs) related to complex disease. GWAS has already borne fruit; however, for most diseases it accounts for only a small part of the heritability. The genetic factors of many diseases and traits have not been found, which is called missing heritability. One possible reason for missing heritability is that the additive model in standard GWAS cannot fit gene interactions. The additive model assumes that each genetic variant contributes to complex disease independently. In practice, this assumption may not hold, because epistasis may exist between genes. Building on a summary of previous algorithms, we propose a method, ANTRF, to detect epistasis based on ACO and Random Forest. ANTRF takes the out-of-bag (OOB) score of a random forest as the evaluation criterion for a SNP set, on the rationale that a SNP set containing disease-associated SNPs distinguishes the diseased population from the normal population. Random forests have been widely used to detect epistasis with some success. However, some researchers have raised questions about their limitations. ANTRF integrates random forests into ACO to avoid these limitations and exploit their advantages. We adopt ACO as the algorithmic framework, within which we use appropriate ant path selection and path evaluation. In addition, good heuristic information plays a role in improving ACO. Therefore, we put forward a method named SNPRANK to generate heuristic information. On the one hand, the result of this algorithm can be used to select features; on the other hand, it can also be used in ant path selection. Experiments show that SNPRANK is effective. To obtain better time efficiency and higher detection precision, we first use SNPRANK to filter out part of the noisy SNPs so that they do not interfere, and then merge the heuristic information generated by SNPRANK into the ACO framework. Experiments show that the algorithm is effective. In future work, we will enhance the robustness and performance of the algorithm by studying better heuristic information and local search algorithms.
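The ACO framework outlined above (path selection guided by pheromone and heuristic information, path evaluation, pheromone update) can be sketched in a few lines of standard-library Python. This is a generic ant colony loop under stated assumptions, not the ANTRF algorithm itself. The function names, the selection rule (weight = pheromone^alpha × heuristic^beta), the evaporation rate `rho`, and the `toy_score` stand-in for a random-forest OOB score are all hypothetical, and the uniform heuristic stands in for SNPRANK output.

```python
import random


def select_snp_set(pheromone, heuristic, set_size, alpha, beta, rng):
    """Sample distinct SNP indices with weight pheromone^alpha * heuristic^beta."""
    candidates = list(range(len(pheromone)))
    chosen = []
    for _ in range(set_size):
        weights = [pheromone[i] ** alpha * heuristic[i] ** beta for i in candidates]
        pick = rng.choices(candidates, weights=weights, k=1)[0]
        chosen.append(pick)
        candidates.remove(pick)  # an ant never picks the same SNP twice
    return sorted(chosen)


def run_aco(n_snps, score_fn, heuristic, n_ants=20, n_iters=30,
            set_size=2, alpha=1.0, beta=1.0, rho=0.1, seed=0):
    """Generic ACO loop: construct SNP sets, score them, update pheromone."""
    rng = random.Random(seed)
    pheromone = [1.0] * n_snps
    best_set, best_score = None, float("-inf")
    for _ in range(n_iters):
        # Each ant builds one candidate SNP set and has it evaluated.
        trails = []
        for _ in range(n_ants):
            snps = select_snp_set(pheromone, heuristic, set_size, alpha, beta, rng)
            trails.append((snps, score_fn(snps)))
        # Evaporate, then deposit pheromone in proportion to each set's score.
        pheromone = [(1.0 - rho) * level for level in pheromone]
        for snps, score in trails:
            for i in snps:
                pheromone[i] += score
            if score > best_score:
                best_set, best_score = snps, score
    return best_set, best_score


def toy_score(snps):
    """Stand-in for an OOB score (assumption): only the pair [3, 7] scores high."""
    return 1.0 if snps == [3, 7] else 0.1


# Uniform heuristic (assumption); a filter such as SNPRANK would supply real weights.
best_set, best_score = run_aco(n_snps=10, score_fn=toy_score, heuristic=[1.0] * 10)
print(best_set, best_score)
```

Passing the scoring function as a parameter keeps the search loop separate from the evaluator, so a classifier-based score can be swapped in for `toy_score` without touching the loop.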

【Key words】 Epistasis; Ant Colony Optimization (ACO); Random Forest; ReliefF; complex disease
  • 【Online publication contributor】 Jilin University
  • 【Online publication issue】 2019, Issue 01
  • 【Classification number】 TP18