面向全基因组关联分析的大数据存储架构设计与实现
Design and Implementation of Big Data Storage Architecture for Genome-wide Association Study
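【Author】 Wang Bo (王博);
【Advisor】 Dong Shoubin (董守斌);
【Author information】 South China University of Technology (华南理工大学), Computer Science and Technology, 2018, Master's thesis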
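【Abstract (translated from Chinese)】 In recent years, with the rapid development of genetic technology, gene data has grown explosively, and genomics has entered the era of big data. Abundant gene data has opened new research directions in biomedicine, but it has also brought the challenge of big data storage. Genome-wide association study (GWAS) is one of the most significant research areas today and an important method for realizing precision medicine. It relies on gene variation data as the basis of analysis; however, because GWAS places diverse access requirements on gene variation data, no existing big data storage architecture suits this scenario. Designing and implementing an easy-to-use, highly scalable storage architecture for gene variation data according to the data characteristics and access requirements of this scenario, and thereby removing the storage bottleneck, is of great significance for advancing precision medicine. Building on existing big data storage research, this thesis proposes an architecture model for GWAS that combines the new columnar storage engine Kudu, the big data query engine Impala, and distributed bitmap indexes. Gene variation data files are split according to the population (ethnicity) information of the samples and stored in Kudu to provide a loosely coupled basic data access service. Kudu's storage characteristics satisfy the scenario's requirements for low-latency random access and efficient range analysis of variation data, while ensuring high availability and high scalability. To address the problem that Kudu offers only a single indexing method and degrades to full-table scans during complex analysis, a Kudu-based distributed bitmap index scheme is proposed according to the data characteristics, and parallel algorithms for building and processing compressed bitmap indexes are implemented with the big data processing framework Spark; the resulting distributed bitmap index analyzes data efficiently and scales well in large-scale data scenarios. To improve usability, the big data query engine Impala is used to provide SQL-like syntax, giving a simple and efficient unified platform for query and analysis. Experiments comparing this architecture with other big data storage solutions in the GWAS scenario show that the proposed model is simple, performs uniformly well across the various data queries in the scenario, and improves performance by orders of magnitude over the other solutions. The overall architecture model is general and extensible, suits typical cloud computing platforms, and lays a solid foundation for genome-wide association analysis.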
【Abstract】 In recent years, with the rapid development of biological technology, gene data has grown explosively and biogenetics has entered the era of big data. Rich genetic data has opened new research directions in biomedicine, and at the same time it has brought data storage problems. Genome-wide association study (GWAS), a key method for precision medicine, is one of the most significant research areas today. Gene variation data is the standard starting point of GWAS. However, because GWAS places varied requirements on access to gene variation data, no existing big data storage system suits this scenario. Designing and implementing an easy-to-use, highly scalable storage architecture for gene variation data, based on the data characteristics and access requirements, is therefore significant. Building on recent big data storage research, an architecture model is proposed that combines the columnar storage engine Kudu, the big data query engine Impala, and distributed bitmap indexes. The gene variation data is split according to the population (ethnicity) information of the samples and then stored in Kudu to provide a basic data access service. Thanks to Kudu's storage characteristics, data can be accessed randomly at low latency and scanned efficiently, with high availability and scalability. When dealing with complex analysis, however, Kudu falls back to a full-table scan, which is time-consuming. To address this problem, a distributed bitmap index based on Kudu is proposed, and parallel algorithms for generating and processing compressed bitmap indexes are implemented. The designed distributed bitmap index analyzes data efficiently and scales well in large-scale data scenarios. To improve system usability, the big data query engine Impala is used to provide SQL-like syntax. A performance comparison between our design and other solutions on GWAS workloads shows an improvement of several orders of magnitude over the other solutions. Our solution is also simple, while meeting all data access requirements in this scenario. The architecture model has good generality and extensibility and is suitable for cloud platforms. It is a good storage option for genome-wide association analysis.
【Key words】 GWAS; big data storage; Kudu; distributed bitmap index; Spark;
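
The bitmap-index idea named in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it is an uncompressed, single-process toy in plain Python, whereas the abstract describes compressed bitmaps built in parallel with Spark on top of Kudu. The record fields (`chrom`, `pos`, `genotype`, `population`) and the helper names `build_bitmap_index` and `matching_rows` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to an int used as a bitset:
    bit i is set when row i holds that value."""
    index = defaultdict(int)
    for i, row in enumerate(rows):
        index[row[column]] |= 1 << i
    return dict(index)


def matching_rows(bitmap):
    """Return the row positions whose bit is set in `bitmap`."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical variant records (illustrative values, not data from the thesis).
variants = [
    {"chrom": "1", "pos": 1001, "genotype": "0/1", "population": "EAS"},
    {"chrom": "1", "pos": 1002, "genotype": "1/1", "population": "EUR"},
    {"chrom": "2", "pos": 2001, "genotype": "0/1", "population": "EAS"},
    {"chrom": "2", "pos": 2002, "genotype": "0/0", "population": "EAS"},
]

genotype_index = build_bitmap_index(variants, "genotype")
population_index = build_bitmap_index(variants, "population")

# A multi-condition filter (genotype = '0/1' AND population = 'EAS')
# becomes one bitwise AND over two bitmaps instead of a scan of every row.
hits = matching_rows(genotype_index["0/1"] & population_index["EAS"])
print(hits)
```

The point of the sketch is only that multi-predicate filters reduce to bitwise operations on precomputed bitmaps, which is why such an index can avoid the full-table scans the abstract mentions.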