基于机器学习的siRNA沉默效率预测方法研究
The Study of siRNA Silencing Activity Prediction Method Based on Machine Learning Methods
【Author】 Han Ye (韩烨);
【Supervisor】 Liu Yuanning (刘元宁);
【Author information】 Jilin University, Computer Science and Technology (Bioinformatics), 2017, PhD
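【Summary】 RNA interference (RNAi) is a post-transcriptional gene silencing phenomenon in which double-stranded RNA (dsRNA) acts through complementary base pairing. It occurs in eukaryotes including plants, fungi, invertebrates and mammals. In mammalian cells, dsRNA is cleaved into shorter 21-23 nt double-stranded RNAs, known as small interfering RNAs (siRNAs), which induce degradation of the target mRNA. In recent years RNAi has been widely applied in gene function studies, gene therapy and drug development, and siRNA, which plays the key role in the RNAi process, has drawn particular attention from researchers. Because a series of siRNAs targeting different positions of the same mRNA produce different silencing efficiencies, and most siRNAs silence poorly, how to design efficient siRNAs that maximize silencing of the target mRNA has become the most critical problem in RNAi research. siRNA design is an important prerequisite for applying RNAi to gene function research and drug development, and has become a hot topic in RNAi research. Current siRNA design methods fall into two categories: those based on statistical rules and those based on machine learning. Studies show that machine-learning-based methods give more accurate quantitative predictions of the silencing efficiency of an siRNA against its target mRNA. However, although a series of machine-learning-based siRNA design algorithms exist, prediction performance still needs improvement, latent sequence features related to silencing efficiency remain to be discovered, and many novel high-performance machine learning models have yet to be tried for siRNA efficiency prediction. This thesis mines siRNA sequences for latent features that affect the RNAi process and, on that basis, proposes a random-forest method for quantitatively predicting siRNA silencing efficiency. In addition, to probe the effect of motifs of different lengths on silencing efficiency, it proposes an siRNA efficiency prediction model based on a convolutional neural network. The main research content is as follows:

1. Position-encoded 2-mer and 3-mer motifs are proposed as new features for siRNA silencing efficiency prediction, and a random forest model is built to predict silencing efficiency quantitatively. Because the siRNA sequence is an important factor in RNAi efficiency, mining more latent features from the sequence has long been a research focus. Studies have shown that RNAi efficiency changes when the 2-3 bp of RNA at each position of an siRNA sequence are replaced by DNA. This indicates that RNAi efficiency is related not only to the position and composition of single bases but also to 2-mer and 3-mer motifs at specific positions of the siRNA sequence. The thesis first uses known siRNA samples to verify that 2-mer and 3-mer motifs at different positions show significant preferences between highly efficient and inefficient siRNAs. It then proposes position-encoded 2-mer and 3-mer motifs as new predictive features. Next, a z-score-based search for the optimal feature set selects the feature subset most relevant to silencing efficiency, a random-forest prediction model is built on that subset, and the online prediction platform siRNApred is developed from it. Validation experiments on the Huesken dataset show that siRNApred reaches a Pearson correlation coefficient (PCC) of 0.722, which is 9.39%, 10.39%, 9.56% and 7.76% higher than the existing methods Biopredsi, i-score, Thermo Composition-21 and DSIR, respectively. Prediction experiments on several independent datasets, run to examine the generalization ability of siRNApred, all show more stable performance than the other methods. The siRNApred tool is available online at http://www.jlucomputer.com:8080/RNA/.

2. A convolutional neural network is designed to predict siRNA silencing efficiency. The influence of the siRNA sequence on RNAi efficiency is not limited to 2-mer and 3-mer motifs; longer multi-mer motifs may also be closely related to silencing efficiency. Existing siRNA feature extraction methods, however, do not capture the contribution of multi-mer motifs. To explore their effect, the thesis proposes an siRNA efficiency prediction model based on a convolutional neural network. In the convolutional layer, kernels of suitable sizes serve as motif detectors that learn, in a data-driven way, latent multi-mer motif patterns that are more abstract, closer to the underlying biology and better suited to classification, and the model combines the effects of these motifs to predict silencing efficiency. After experimental tuning of its hyperparameters, the network consists of one convolutional layer, one pooling layer and one output layer. The convolutional layer uses kernels of 14 sizes, from 6×4 to 19×4, to detect latent motif patterns; the pooling layer uses max and mean operators to select the most representative neurons as the feature representation; and the output layer uses logistic regression to produce the prediction. In comparative experiments on a large sample pooled from several siRNA datasets, the method reaches a PCC of 0.717 and an AUC of 0.894, both higher than Biopredsi, DSIR and siRNApred. This shows that the method can mine the contribution of motifs of different lengths to silencing efficiency in depth, and that its feature patterns capture valuable cues such as local sequence properties, base and motif composition, and positional arrangement more fully. This data-driven mode of feature learning outperforms feature extraction that relies on presets derived from expert knowledge.

The main innovations of the thesis are: (1) it proposes position-encoded 2-mer and 3-mer motifs as new features for siRNA silencing efficiency prediction, proposes a z-score-based feature selection algorithm and applies it to siRNA single-base encoding, siRNA and mRNA sequence composition, position-encoded 2-mer and 3-mer motifs, and thermodynamic parameters, and develops the online prediction platform siRNApred; (2) it designs convolution kernels for detecting multi-mer motif patterns in siRNA sequences, and proposes and validates an siRNA efficiency prediction model based on a convolutional neural network.

In summary, the thesis aims to further mine features related to siRNA silencing efficiency, to combine multiple siRNA feature representations with feature selection in order to build an optimal feature set grounded in biological properties, and to improve prediction with a random forest model. It also designs a suitable convolutional neural network structure that learns latent multi-mer motif patterns in a data-driven way, so that more efficient siRNAs can be designed. The thesis proposes two siRNA efficiency prediction models, describes each in detail, and verifies their accuracy in comparative experiments. The results show that both methods improve on the current mainstream siRNA silencing efficiency prediction methods.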
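The position encoding of 2-mer and 3-mer motifs mentioned in the summary can be illustrated with a minimal Python sketch. The summary does not give the exact encoding, so this assumes the simplest reading: one 0/1 indicator per (position, motif) pair. The function names `motif_position_encoding` and `encode_sirna`, the ACGU alphabet order, and the example sequence are hypothetical and are not taken from the thesis.

```python
from itertools import product

ALPHABET = "ACGU"  # assumed base order; the thesis may use a different one


def motif_position_encoding(seq, k):
    """Return a flat 0/1 list with one slot per (position, k-mer) pair.

    Assumption: a plain indicator encoding. For each start position i,
    exactly one of the 4**k slots is set to 1, namely the slot of the
    k-mer that begins at i.
    """
    seq = seq.upper().replace("T", "U")
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    index = {kmer: j for j, kmer in enumerate(kmers)}
    n_positions = len(seq) - k + 1
    vec = [0] * (n_positions * len(kmers))
    for i in range(n_positions):
        vec[i * len(kmers) + index[seq[i:i + k]]] = 1
    return vec


def encode_sirna(seq):
    """Concatenate the 2-mer and 3-mer position encodings of one siRNA."""
    return motif_position_encoding(seq, 2) + motif_position_encoding(seq, 3)


if __name__ == "__main__":
    example = "UUCGAAGUACUCAGCGUAAGU"  # made-up 21-nt sequence
    features = encode_sirna(example)
    print(len(features), sum(features))
```

Under these assumptions a 21-nt siRNA yields 20 × 16 + 19 × 64 features, which is why a feature selection step such as the z-score-based search described above is needed before a random forest is trained on them.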
【Abstract】 RNA interference (RNAi) is a cellular process in which double-stranded RNA (dsRNA) causes post-transcriptional gene silencing through base-pairing interactions. It is found in many eukaryotic systems, including plants, fungi, invertebrates and mammals. In mammalian cells, long dsRNA is processed into short 21-23 nucleotide (nt) dsRNAs known as small interfering RNAs (siRNAs), which induce immediate knockdown of the target mRNA. In recent years, RNAi has been widely applied to the study of gene function, gene therapy and drug development, and siRNA, which plays a critical role in RNAi, has attracted growing attention from researchers. Because siRNAs targeting different positions of a single mRNA produce different silencing efficiencies, and most of those efficiencies are far from ideal, how to design active siRNAs that achieve the highest silencing efficiency has become the most important issue in RNAi. siRNA design is an important prerequisite for applying RNAi to gene function research and drug development, and has become a hotspot in RNAi research. At present, siRNA design methods fall into two categories: methods based on statistical rules and methods based on machine learning algorithms. Studies show that machine-learning-based methods predict siRNA silencing efficiency more accurately. However, although many machine-learning-based siRNA design algorithms have been produced, their prediction performance is still unsatisfactory. More potential features in the siRNA sequence associated with silencing efficiency need to be explored, and more novel high-performance machine learning algorithms can be applied to siRNA efficiency prediction. This thesis extracts potential features associated with silencing efficiency from the siRNA sequence and develops a Random Forest model to predict siRNA silencing efficiency. In addition, to detect the effect of motifs of different lengths on silencing efficiency, a convolutional neural network model for predicting siRNA silencing efficiency is proposed. The main contents of this thesis are as follows:

1. New features were extracted from position-encoded 2-mer and 3-mer motifs, and a Random Forest model was developed for silencing efficiency prediction. Since the siRNA sequence is an important factor in the RNAi process, mining more potential features from it has always been a research focus. Studies have shown that RNAi activity changes when the 2-3 bp of RNA at each position of an siRNA sequence are substituted by DNA. Thus, not only the position and composition of single nucleotides in the siRNA sequence are related to RNAi efficiency; the 2-mer and 3-mer motifs at specific positions of the sequence are also associated with it. We first demonstrated that the 2-mer and 3-mer motifs at different positions of the siRNA sequence differ significantly between active and inactive siRNAs. The position-encoded 2-mer and 3-mer motifs were then extracted as new features. A feature selection algorithm based on RF-Variable importance was used to select the feature subset most relevant to siRNA silencing efficiency, and a Random Forest model for predicting silencing efficiency was constructed. Validation experiments on the Huesken dataset showed that the PCC of siRNApred is 0.722, which is 9.39%, 10.39%, 9.56% and 7.76% higher than Biopredsi, i-score, Thermo Composition-21 and DSIR, respectively. In addition, prediction experiments were performed on multiple independent datasets to examine the generalization of siRNApred, and our model showed more stable performance than the other methods. The siRNApred tool is available online at http://www.jlucomputer.com:8080/RNA/.

2. A prediction method for siRNA silencing efficiency based on a convolutional neural network is proposed. The effect of the siRNA sequence on RNAi efficiency is not limited to 2-mer and 3-mer motifs; longer multi-mer motifs may also be closely related to silencing efficiency. However, existing siRNA feature extraction methods do not reflect the contribution of multi-mer motifs. To explore their effect, this thesis proposes an siRNA efficiency prediction model based on a convolutional neural network. In the convolutional layer, we designed convolution kernels of suitable sizes as motif detectors that automatically learn the potential feature patterns of multi-mer motifs, and combined multiple motifs to build the silencing efficacy prediction model. After experimental tuning of its hyperparameters, the model consists of a convolutional layer, a pooling layer and an output layer. The convolutional layer uses 14 sizes of convolution kernel, from 5 × 4 to 18 × 4, to detect potential motif feature patterns. Max pooling and mean pooling are used in the pooling layer to select the most representative neurons for the feature representation. The output layer uses logistic regression to compute the prediction. The PCC and AUC of the method were 0.717 and 0.894, higher than those of Biopredsi, DSIR and siRNApred. The method can extract in depth the contribution of multi-mer motifs to siRNA silencing efficiency, and its feature patterns capture valuable cues such as local characteristics, base and motif composition, and positional arrangement of the siRNA sequence more fully. This data-driven feature learning is superior to feature extraction that relies on presets derived from expert knowledge.

The main innovations of this thesis are: (1) new features were extracted from position-encoded 2-mer and 3-mer motifs; a feature selection algorithm based on z-score was proposed; an siRNA silencing efficacy prediction model was proposed that combines single-nucleotide representation, nucleotide composition, position-encoded 2-mer and 3-mer motif features and thermodynamic features; and an online platform for siRNA silencing efficiency prediction was developed. (2) Suitable convolution kernels were designed to detect motif feature patterns, and an siRNA efficacy prediction model based on a convolutional neural network was developed and validated.

In summary, this thesis explores more features associated with siRNA silencing efficacy and brings together various siRNA features and feature selection methods to build an optimal feature set according to biological properties and to improve prediction with a Random Forest predictor. At the same time, a suitable convolutional neural network structure is designed to learn the potential feature patterns of multi-mer motifs, so that more active siRNAs can be designed for the target mRNA. Two siRNA efficiency prediction models were developed. We explained both models in detail and verified their prediction accuracy through comparative experiments. The results showed that the proposed methods perform better than existing siRNA silencing efficiency prediction methods.
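The network layout described in the abstract (one convolutional layer, max and mean pooling, a logistic-regression output) can be sketched as a NumPy forward pass. This is an illustration under stated assumptions, not the thesis implementation: the weights are random and untrained, no nonlinearity is applied after the convolution because the abstract does not name one, the input is assumed to be a 21-nt sequence, and the names `one_hot`, `init_params`, `forward` and `filters_per_size` are hypothetical. Kernel heights default to 5 through 18 as stated in this abstract; the summary above gives 6×4 to 19×4, which corresponds to `heights=range(6, 20)`.

```python
import numpy as np

BASES = "ACGU"  # assumed column order of the one-hot encoding


def one_hot(seq):
    """Encode an RNA sequence as a (length, 4) matrix of 0/1 values."""
    seq = seq.upper().replace("T", "U")
    mat = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        mat[i, BASES.index(base)] = 1.0
    return mat


def init_params(heights=range(5, 19), filters_per_size=2, seed=0):
    """Create random (untrained) weights: one bank of h x 4 kernels per height."""
    rng = np.random.default_rng(seed)
    kernels = {h: rng.normal(0.0, 0.1, size=(filters_per_size, h, 4))
               for h in heights}
    n_features = 2 * filters_per_size * len(kernels)  # max + mean per filter
    return {"kernels": kernels,
            "w": rng.normal(0.0, 0.1, size=n_features),
            "b": 0.0}


def forward(seq, params):
    """Convolution -> max and mean pooling -> logistic-regression output."""
    x = one_hot(seq)
    features = []
    for h, bank in params["kernels"].items():
        # Every window of h consecutive nucleotides is one candidate motif site.
        windows = np.stack([x[i:i + h] for i in range(x.shape[0] - h + 1)])
        acts = np.einsum("nhc,fhc->fn", windows, bank)  # (filters, windows)
        features.append(acts.max(axis=1))   # strongest motif match
        features.append(acts.mean(axis=1))  # average match over positions
    z = np.concatenate(features) @ params["w"] + params["b"]
    return 1.0 / (1.0 + np.exp(-z))  # logistic output in (0, 1)


if __name__ == "__main__":
    params = init_params()
    print(forward("UUCGAAGUACUCAGCGUAAGU", params))  # made-up 21-nt sequence
```

Because each kernel spans all four one-hot columns, a kernel of height h responds to one h-nt motif wherever it occurs, and pooling over positions turns that response into a fixed-length feature vector regardless of kernel size.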
【Key words】 siRNA design; RNA interference; Random Forest; Feature Selection; Convolutional Neural Networks;