Research on miRNA Target Gene Prediction Based on SVM Ensemble Learning
MicroRNA Target Gene Prediction Using Support Vector Machine with Ensemble Learning
【Author】 Chen Zhiru (陈志茹);
【Supervisor】 Hong Wenxue (洪文学);
【Author information】 Yanshan University, Instrument Science and Technology, 2015, PhD
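【Abstract, translated from the Chinese】 In recent years, researchers have discovered a class of endogenous non-coding RNA molecules with important regulatory functions in living organisms: microRNAs (miRNAs). miRNAs are non-coding, single-stranded small RNA molecules about 20~25 nucleotides long, with a phosphate group at the 5' end and a hydroxyl group at the 3' end. Through complementary base pairing and interaction with the 3' UTR (untranslated region) of target gene mRNAs, they perform important gene-regulatory functions at the post-transcriptional stage. miRNAs are widely present in eukaryotic cells and, by regulating the expression of their target genes, play important roles in life activities such as cell growth, development, differentiation, and metabolism. miRNA target gene prediction is an important part of studying and analyzing the molecular biological functions of miRNAs, and it is key to in-depth study of their mechanisms of action.

The miRNA target gene sample data are imbalanced, which leads to low prediction accuracy for positive samples and poor overall classification. To address this, and building on Support Vector Machine (SVM) theory, the thesis proposes an ensemble learning algorithm based on under-sampling in order to improve the classification accuracy and generalization ability of the miRNA target gene prediction model. The thesis studies three main problems: feature selection for the dataset; construction of an ensemble learning model combined with under-sampling; and optimization of the penalty parameter and kernel function parameters of the miRNA target gene prediction model.

(1) Based on the characteristics of the miRNA:target binding structure and on quantitative criteria for the recognition region, the feature selection algorithm SVM-FSCI, based on the classification margin, is proposed. The performance of the resulting miRNA target gene prediction model is analyzed. A feature effectiveness rate is defined according to each feature's contribution to the SVM classification margin; the originally extracted feature vector set is ranked by this rate, and redundant and ineffective features are removed to obtain an optimized feature subset.

(2) To address the imbalance of the miRNA target gene sample dataset, the ensemble learning algorithm SVM-IUSW, based on under-sampling, is proposed. The algorithm uses SVM as the base learner and AdaBoost as the ensemble framework. Clustering-based under-sampling is embedded in the iterations to reduce the imbalance between the negative and positive sample distributions. During adaptive sample-weight adjustment, a sample-weight smoothing mechanism removes outliers among the negative samples. Finally, a weighted voting mechanism combines the predictions of the weak classifiers into the ensemble classifier for miRNA target gene prediction.

(3) SVMs with different penalty parameters perform differently when classifying the imbalanced target gene dataset. To address this, the SVM-DODN algorithm is proposed, which derives the penalty parameter from the average density of the dataset distribution. On this basis, an adaptive hybrid genetic algorithm is used to optimize the kernel function and penalty parameters of the miRNA target gene SVM model; together these compensate for the sample skew caused by the imbalanced sample space.

The margin-based feature selection algorithm, the under-sampling-based ensemble learning algorithm, and the adaptive hybrid genetic algorithm together address three problems that arise in miRNA target gene prediction: feature extraction and selection for the dataset, construction of the prediction model, and optimization of the model parameters. Simulation experiments show that, compared with other algorithms, the proposed SVM-based ensemble learning algorithm for miRNA target gene prediction has good learning and generalization ability when dealing with imbalanced miRNA target gene samples.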
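The AdaBoost framework and weighted voting mentioned in the abstract can be illustrated with the textbook discrete AdaBoost update. This is only a minimal sketch under stated assumptions: the function names `adaboost_round` and `weighted_vote` are hypothetical, labels are assumed to be -1/+1, and the thesis's own sample-weight smoothing step is not described in the abstract and is therefore not reproduced here.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One round of standard discrete AdaBoost: return (alpha, new_weights).

    weights: current sample weights, assumed to sum to 1
    y_true, y_pred: true and predicted labels in {-1, +1}
    NOTE: textbook update only; the thesis's weight smoothing is not modelled.
    """
    # Weighted error of this weak classifier.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified samples gain weight, correct ones lose it; then renormalise.
    raw = [w * math.exp(-alpha * t * p)
           for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return alpha, [r / total for r in raw]

def weighted_vote(alphas, predictions):
    """Combine weak-classifier predictions (lists of -1/+1) by weighted vote."""
    n = len(predictions[0])
    scores = [sum(a * preds[i] for a, preds in zip(alphas, predictions))
              for i in range(n)]
    return [1 if s >= 0 else -1 for s in scores]
```

In an SVM-based ensemble, each `y_pred` would come from an SVM trained on the current (re-weighted or re-sampled) data; the vote then weights each SVM by its `alpha`.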
【Abstract】 MicroRNAs (miRNAs) are a family of single-stranded non-coding RNAs about 22~25 nucleotides in length, with a phosphate group at the 5' end and a hydroxyl group at the 3' end. They play important roles in post-transcriptional regulation through complementary base pairing with the 3' UTR of messenger RNA (mRNA). Experimental investigation shows that miRNAs are widely present in plants and animals and are involved in cell growth, development, differentiation, metabolism, and other important life activities. miRNA target recognition is a key part of researching and analyzing the molecular biological function of miRNAs, and it is also key to the study of miRNA mechanisms.

Because the sample data of miRNA targets are imbalanced, which leads to low prediction accuracy for positive samples and poor overall classification results, this thesis proposes a target prediction algorithm based on Support Vector Machines (SVM) in which under-sampling is embedded into ensemble learning. The algorithm can effectively improve the classification accuracy and generalization ability of the miRNA target prediction model. The thesis studies three issues: feature selection on the dataset, an ensemble learning model combined with under-sampling, and a miRNA target prediction model based on kernel parameter optimization.

First, the structural characteristics and regions of miRNA:target binding are studied. Nine miRNA target identification rules and quantitative criteria for the features are proposed. Based on these rules, 90 features are extracted from the dataset using Perl.

Second, the performance of the miRNA target prediction model built on the 90-dimensional feature vector set is analyzed, and the feature selection algorithm SVM-FSCI, based on the classification margin, is proposed. The algorithm defines a feature effectiveness rate based on SVM classification. It ranks the original 90-dimensional feature vector set by this rate and removes redundant and ineffective features in order to find the best feature subset. Experiments show that miRNA target prediction models built on the optimal feature set achieve good results.

Finally, the thesis proposes the target prediction algorithm SVM-IUSW, in which under-sampling is embedded into ensemble learning. The algorithm uses SVM as the base learning algorithm and AdaBoost as the ensemble framework; clustering-based under-sampling is embedded in the iterative process to reduce the imbalance between positive and negative samples. To avoid over-fitting, the algorithm also incorporates a robust sample-weight smoothing mechanism that eliminates abnormal samples among the negative samples. The predictions of the sub-classifiers are then combined by weighted voting to form the ensemble miRNA target classifier. The experiments show that SVM-IUSW obtains better classification and generalization performance than currently popular machine learning algorithms.
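The abstract does not say how the clustering-based under-sampling is carried out. One common approach, sketched here purely as an assumption, clusters the negative samples with k-means and keeps the sample nearest each centroid, so that the retained negatives still cover the original distribution. The helper names `kmeans` and `cluster_undersample` are hypothetical, and the deterministic initialisation is chosen only to keep the example reproducible.

```python
def _dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; centroids start at the first k points."""
    k = min(k, len(points))
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: _dist2(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids

def cluster_undersample(negatives, n_keep):
    """Keep at most n_keep negatives: the sample nearest each centroid.

    ASSUMPTION: illustrative scheme only, not the thesis's exact procedure.
    """
    kept = []
    for centroid in kmeans(negatives, n_keep):
        best = min(negatives, key=lambda p: _dist2(p, centroid))
        if best not in kept:  # two centroids may share a nearest sample
            kept.append(best)
    return kept
```

Setting `n_keep` to the number of positive samples gives a roughly balanced training set for each base SVM; inside a boosting loop the clustering could be rerun on each iteration.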
【Key words】 miRNA target genes; SVM; imbalanced data; ensemble learning; feature selection;