节点文献

生物序列比对算法的并行优化设计与实现

Parallel Optimization Design And Implementation of Biological Sequence Alignment Algorithm

【作者】 李研

【导师】 季振洲;

【作者基本信息】 哈尔滨工业大学 , 计算机科学与技术, 2015, 硕士

【摘要】 生物序列比对是生物信息学的基础和核心,随着生命科学的迅猛发展,需要研究的蛋白质和核酸序列的信息显著增加。常见的双序列比对串行算法时间复杂度为O(N2),多序列比对时间复杂度更高,随着序列长度的增加和比对规模的扩大,时间开销难以接受。在此情况下如果能够挖掘算法的潜在并行性,通过数据划分、流水处理等方式将算法并行化,使其在多个处理器上并行执行,可以大大提升算法效率。本文首先对生物序列比对和并行算法的相关概念进行了介绍。采取流水处理方式,将Needleman-Wunsch算法进行并行化设计。在此基础上提出了多个序列比对算法的并行化方法。通过找到多个checkpoint将矩阵划分若干部分,为每个处理器分配矩阵的一个部分,使各处理器并行执行子矩阵的Hirschberg算法并对完成后的结果进行整合,将Hirschberg算法进行并行化;通过提出使用活动结点和活动结点链表创建后缀树的方法,使按分支并行构建后缀树和按分支获取最大唯一匹配可以并行执行并且摆脱了对后缀链表的依赖,之后并行对锚点间的空位进行合并,将MUMmer算法进行并行化;通过对角线并行方式并行获取和扩展热点区域,采取将距离矩阵进行分块方式使多处理器并行获取不同对角线上热点区域的距离长度,将FASTA算法进行并行化;通过距离矩阵分块方式并行获取序列两两比对的距离长度,以及并行从进化树的叶子向根进行合并的方式,将Clustal W算法进行了并行化。最后对上述4种算法及并行化算法分别进行了实现,将串行算法运行时间与双核及四核条件下并行算法的运行时间进行了比对。实验结果证明了并行化算法比串行化算法速度得到了很大提升,并且得到了不错的加速比,更适用于多核结构。

【Abstract】 Biological sequence alignment is the foundation and core of bioinformatics. With the rapid development of biological science, the information of the sequence of protein and nucleic needed to study is dramatically increasing. The time complexity of common double sequence alignment algorithm is O(N2). The multiple sequence alignment time complexity is higher. With the increase of the length of the sequence and the enlargement of the comparison scale, the time cost is unbearable. So, if the potential parallelism of algorithm can be mined, and algorithm can be parallelized via data partitioning and pipelining processing, so as to make it executed in parallel on multiple processors, the time efficiency can be greatly increased.Firstly, this paper introduces the concepts of biological sequence alignment and parallel algorithm, and conducts parallel design of Needleman-Wunsch algorithm adopting the mode of pipeline processing. Based on it, several parallel sequence alignment algorithms are put forward. By finding multiple checkpoints, dividing a matrix into several parts, and making processors executing the Hirschberg algorithm in parallel, Hirschberg algorithm is parallelized. By building suffix tree and acquiring MUMs(Maximum Unique Matches) in parallel using the mechanism of active node and active node link, MUMmer algorithm is parallelized. FASTA algorithm is parallelized via acquiring and enlarging hot spots by means of diagonal parallel, and acquiring in parallel distances between the two hot spots on different diagonals by dividing the distance matrix into parts. The Clustal W algorithm is parallelized via acquiring the distances of all double sequence alignments through dividing distance into parts, as well as merging from the leaves of evolutionary tree to the root in parallel.Finally, the above four algorithms and their parallel algorithms are implemented respectively. The running time of serial algorithms and that of parallel algorithms under dual-core and quad-core are compared, and the results show that compared with serial algorithms, the proposed parallel algorithms is superior in terms of speed, and a sound speedup is thus obtained. So it is more suitable for multi-core architecture.

  • 【分类号】Q811.4;TP301.6
  • 【被引频次】2
  • 【下载频次】343
  • 攻读期成果
节点文献中: 

本文链接的文献网络图示:

本文的引文网络