Integrated Detection of Genomic Insertion Variants Based on Sequence Assembly
Integrated Sequence Assembly Based Approach for Calling Genomic Long Insertion
【Author】 Ye Lu (叶露)
【Advisor】 Gao Jingyang (高敬阳)
【Author Information】 Beijing University of Chemical Technology, Software Engineering, 2017, Master's
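【Summary】 With the rapid development of high-throughput sequencing, many structural variant detection methods that use high-throughput sequencing data have emerged. Because of limitations inherent to high-throughput sequencing, such as short reads and relatively high sequencing error rates, conventional detection methods fall short in both precision and sensitivity. To address this problem, this thesis focuses on insertion variants and proposes an integrated, sequence-assembly-based method for detecting genomic insertions, named ISALins. The main contributions are as follows: (1) A detection pipeline for genomic insertions is designed, and the experimental data it uses are analyzed. Because the 1000 Genomes Project has released too few insertion variants, a benchmark set of insertions for the experiments was generated programmatically. To validate detection performance thoroughly, real sequencing data and variant benchmark data for the NA12878 individual were also prepared; the simulated and real data together form the data foundation of this work. (2) The characteristics of reads that support insertions are analyzed, and a clustering algorithm is proposed that groups insertion-supporting reads, which ensures the validity of the subsequent assembly and variant detection. An effective strategy is also proposed for handling repetitive sequences in De Bruijn graph based assembly. (3) An integrated insertion detection strategy based on sequence assembly is proposed and evaluated on both simulated and real data. The strategy has four stages. In the first stage, to improve the precision of long-insertion detection while preserving sensitivity, the results of multiple tools are merged into an initial set of suspect insertion breakpoints. In the second stage, one-end-anchored (OEA) reads are clustered near each suspect breakpoint, and soft-clipped reads are analyzed to obtain high-quality soft-clipped reads. In the third stage, local assembly is performed with a De Bruijn graph based method, using dynamic k-mer and k-mer frequency analysis strategies to eliminate misassemblies caused by repetitive genomic sequence. In the fourth stage, the contigs are aligned to the reference genome with the alignment tools bwa and blat, and insertions are called from the alignments. The experimental results show that, compared with traditional insertion detection methods, the proposed strategy performs well on both high-coverage and low-coverage sequencing data and improves the precision of structural variant detection to some extent.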
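The first stage, merging the results of several tools into one suspect breakpoint set, can be illustrated with a minimal sketch. The thesis does not specify the merging rule, so everything here is an assumption: the tool names, the `window` tolerance, and the choice of the median position as the representative breakpoint are hypothetical. The sketch keeps the union of all calls (to preserve sensitivity) and records which tools support each merged breakpoint.

```python
from collections import defaultdict


def _summarize(chrom, group):
    """Collapse one group of nearby calls into (chrom, median position, tools)."""
    positions = sorted(pos for pos, _ in group)
    median = positions[len(positions) // 2]
    tools = sorted({tool for _, tool in group})
    return (chrom, median, tools)


def merge_breakpoints(calls_by_tool, window=50):
    """Merge per-tool candidate breakpoints into one suspect set.

    calls_by_tool: dict mapping a tool name to a list of (chrom, pos) tuples.
    window: hypothetical tolerance in bp; a call joins the current group when
            it lies within `window` of the previous call on the same chromosome.
    """
    by_chrom = defaultdict(list)
    for tool, calls in calls_by_tool.items():
        for chrom, pos in calls:
            by_chrom[chrom].append((pos, tool))

    merged = []
    for chrom in sorted(by_chrom):
        group = []
        for pos, tool in sorted(by_chrom[chrom]):
            # A gap larger than the window closes the current group.
            if group and pos - group[-1][0] > window:
                merged.append(_summarize(chrom, group))
                group = []
            group.append((pos, tool))
        if group:
            merged.append(_summarize(chrom, group))
    return merged


if __name__ == "__main__":
    # Hypothetical calls from two unnamed tools.
    calls = {
        "toolA": [("chr1", 1000), ("chr1", 5000)],
        "toolB": [("chr1", 1020), ("chr2", 300)],
    }
    for breakpoint in merge_breakpoints(calls):
        print(breakpoint)
```

Keeping the list of supporting tools with each breakpoint lets a later stage weigh single-tool calls differently from calls that several tools agree on, without discarding anything at this point.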
【Abstract】 With the rapid development of high-throughput sequencing technology, many structural variation detection methods using high-throughput sequencing data have emerged. Due to the limitations of high-throughput sequencing itself, such as short reads and sequencing errors, conventional detection methods are limited, and their accuracy and sensitivity are insufficient. To address this problem, this thesis proposes ISALins, an integrated method for detecting genomic insertions based on sequence assembly. The main contents are as follows: (1) A detection pipeline for genomic insertions is designed, and the experimental data used in the pipeline are analyzed. Because the number of insertions released by the 1000 Genomes Project is too small, a benchmark set of insertions was generated programmatically. To verify the results fully, real sequencing data and variant benchmark data for the NA12878 individual were prepared according to the experimental requirements. The simulated and real data form the data foundation of this work. (2) The features of insertion-supporting reads are analyzed, and a clustering algorithm is proposed to cluster the reads that support an insertion, ensuring the validity of the subsequent sequence assembly and variant detection. In addition, an effective strategy is proposed for handling repetitive sequences in De Bruijn graph based sequence assembly. (3) An integrated insertion detection strategy based on sequence assembly is proposed, and experiments are carried out on simulated and real data. The strategy is divided into four stages. In the first stage, to improve the accuracy of long-insertion detection while maintaining sensitivity, the results of multiple tools are merged to obtain an initial set of suspect insertion breakpoints. In the second stage, one-end-anchored (OEA) reads are clustered near each suspect breakpoint, and high-quality soft-clipped reads are obtained by analyzing the soft-clipped reads. In the third stage, local assembly is performed with a De Bruijn graph based method, using dynamic k-mer and k-mer frequency analysis strategies to eliminate misassemblies caused by repetitive genomic sequence. In the fourth stage, insertions are detected by mapping the contigs to the reference genome with bwa and blat. The experimental results show that, compared with traditional insertion detection methods, the proposed method performs well on both high-coverage and low-coverage sequencing data and improves the accuracy of structural variation detection to a certain extent.
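To make the third stage more concrete, here is a minimal sketch of local De Bruijn graph assembly with a dynamic k and a k-mer frequency filter. It is not the ISALins implementation, whose details the abstract does not give. The function names, the starting and maximum k, the step of 2, and the `min_count` threshold are all assumptions, and the sketch ignores reverse complements. The idea shown: k-mers seen fewer than `min_count` times are dropped as likely sequencing errors, and when the graph still branches (as it does when a repeat is at least k-1 bases long), k is increased until one unbranched path remains.

```python
from collections import Counter, defaultdict


def build_graph(reads, k, min_count):
    """Build a De Bruijn graph: nodes are (k-1)-mers, edges are frequent k-mers."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    succ, pred = defaultdict(set), defaultdict(set)
    for kmer, n in counts.items():
        if n >= min_count:  # frequency filter: drop likely sequencing errors
            succ[kmer[:-1]].add(kmer[1:])
            pred[kmer[1:]].add(kmer[:-1])
    return succ, pred


def assemble(reads, k_start=5, k_max=31, min_count=2):
    """Return (k, contig) for the smallest k that gives one unbranched path.

    Returns None when no k in the range yields an unambiguous graph.
    All default parameter values are hypothetical.
    """
    for k in range(k_start, k_max + 1, 2):
        succ, pred = build_graph(reads, k, min_count)
        nodes = set(succ) | set(pred)
        if not nodes:
            break  # reads too short or coverage too low at this k
        branching = any(len(succ[n]) > 1 or len(pred[n]) > 1 for n in nodes)
        starts = [n for n in nodes if not pred[n]]
        if branching or len(starts) != 1:
            continue  # ambiguous graph, e.g. caused by a repeat: try a larger k
        # Walk the single path from its start node, appending one base per edge.
        node = contig = starts[0]
        while succ[node]:
            node = next(iter(succ[node]))
            contig += node[-1]
        return k, contig
    return None


if __name__ == "__main__":
    # Toy sequence containing the 4-mer "ACGT" twice, which makes k=5 ambiguous.
    target = "TTACGTGGACGTCC"
    reads = [target[i:i + 10] for i in range(len(target) - 9)] * 2
    reads.append("TTACGAGGAC")  # a single read carrying one sequencing error
    print(assemble(reads))
```

In this toy input the repeated 4-mer creates a branch at k=5, so the loop moves on to k=7, where every node is unique and the walk recovers the full sequence; the erroneous read contributes only k-mers with count 1, which the frequency filter removes.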