
Research on Sequence Data Mining Based on Margin Theory

Time Series Data Mining Based on Large Margin Theory

【Author】 Yu Xiao (于霄)

【Supervisor】 Yu Daren (于达仁)

【Author Information】 Harbin Institute of Technology, Control Science and Engineering, 2012, Doctoral dissertation

【Abstract (Chinese)】 Time series data are ubiquitous across many fields, and the analysis and mining of sequence data have become a sustained focus of scientific research. Knowledge discovery from sequence data is hampered by high dimensionality and the non-independence of information along the time dimension, which make the information difficult to exploit effectively; many traditional machine learning algorithms therefore struggle to achieve satisfactory results. Targeting these peculiarities of time series data, this dissertation applies the large margin theory from machine learning to several problems in time series data mining, as follows:

A margin-based similarity measure for time series is designed. Similarity measurement, a core problem in machine learning, directly determines how well an algorithm performs in time series data mining. Phase shifts of various forms are common across time series problems, so this dissertation designs a constraint-learning method for dynamic time warping similarity based on margin theory. Compared with existing similarity measures such as the Euclidean or dynamic time warping distances, it improves the matching strategy for sequence warping. To counter the distance-concentration problem, margin-based norm learning is used to strengthen the effectiveness of the metric in high-dimensional spaces.

A method for extracting characteristic fragments from time series and a fragment-based classification algorithm are designed. One difficulty of time series data mining is that the discriminative information is often hidden in local fragments rather than in the whole sequence, a phenomenon common in sequences such as image-edge contours and motion trajectories. This dissertation designs a feature-fragment extraction method that compares the useful information in the individual fragments and selects the most discriminative fragments to represent the whole sequence. Compared with traditional methods, this fragment-based feature extraction / data re-expression approach is especially suited to sequence data derived from image edges or motion-trajectory curves, and it improves classification accuracy, efficiency, and interpretability. It is also compared with the well-known shapelet algorithm, and experiments verify its classification performance.

A margin-based coarse-graining representation algorithm for sequences is proposed. The trade-off between useful and useless information in the conversion of sequence data from numerical values to symbols is studied. Although the transformation loses some useful information, it also reduces useless data. A margin-based supervised coarse-graining method for sequence data is proposed, which improves classification accuracy and efficiency, as verified by experiments.

A time series classification model based on large-margin weighting of critical cases is designed. Giving lower weights to outliers and redundant samples improves the generalization ability of the classification model, and reducing redundant training samples also improves its computational efficiency. When constructing the critical sample set, large margin theory is used to evaluate the utility of each sample: the weights of samples that yield the largest hypothesis margin are increased, while the weights of outliers and redundant samples are decreased, improving the model's generalization ability. Experiments confirm the effectiveness of this approach.
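The dynamic time warping distance that the first contribution builds on can be sketched in a few lines. This is a minimal, generic DTW implementation with an optional Sakoe-Chiba-style warping window; the margin-based constraint learning and norm learning described in the abstract are not shown, and the function name and `window` parameter are illustrative, not taken from the dissertation.

```python
def dtw_distance(a, b, window=None):
    """Dynamic time warping distance between two numeric sequences.

    `window` optionally restricts how far the warping path may stray
    from the diagonal (a common Sakoe-Chiba-style constraint).
    """
    n, m = len(a), len(b)
    w = max(window, abs(n - m)) if window is not None else max(n, m)
    INF = float("inf")
    # dp[i][j] = cost of the best warping path aligning a[:i] with b[:j]
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            dp[i][j] = cost + min(dp[i - 1][j],      # stretch a
                                  dp[i][j - 1],      # stretch b
                                  dp[i - 1][j - 1])  # one-to-one match
    return dp[n][m] ** 0.5
```

Unlike the Euclidean distance, DTW aligns phase-shifted sequences: `dtw_distance([0, 0, 1, 2], [0, 1, 2])` is 0 because the repeated leading sample can be warped onto a single point.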

【Abstract】 Time series data are widely used in many fields, and the analysis and mining of sequence data have become hot spots that receive continuous attention in the scientific community. High dimensionality and features such as the non-independence of information along the time dimension make it difficult to use information effectively in knowledge discovery from sequence data, so many traditional machine learning algorithms cannot readily obtain satisfactory results. Aiming at the particularity of time series data, the large margin theory from machine learning is adopted in this dissertation to study time series data mining. The main problems addressed are as follows:

A sequential similarity measure is designed based on large margin theory. As a core problem in machine learning, the similarity measure directly determines the effectiveness of an algorithm in time series data mining. To handle the various phase-shift phenomena commonly found in sequential samples, a dynamic time warping similarity measure is designed based on large margin theory. Compared with the Euclidean or dynamic time warping distances, the matching strategy for sequence distortion is improved. As for the distance-instability phenomenon of high-dimensional distance measures, the effectiveness of the distance measurement is optimized through norm learning.

A supervised feature extraction / data re-expression algorithm is designed based on characteristic sequence fragments. One of the difficulties in time series data mining is that the effective discriminative information is often hidden in local sequence fragments rather than in the entire sequence, a phenomenon common in sequence problems such as trajectories obtained from image edges. By contrasting the useful information in the various fragments, several fragments with the largest discriminative capacity are selected to represent the entire sequence.
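The fragment-selection idea can be illustrated with a toy shapelet-style search: score every candidate subsequence by how well its nearest-window distance separates the two classes, and keep the best one. This is a deliberately crude sketch, not the dissertation's algorithm; the separation criterion (gap between class-mean distances) and all names here are illustrative assumptions.

```python
def subseq_dist(series, frag):
    """Minimum Euclidean distance from `frag` to any same-length window of `series`."""
    L = len(frag)
    return min(
        sum((series[i + k] - frag[k]) ** 2 for k in range(L)) ** 0.5
        for i in range(len(series) - L + 1)
    )

def best_fragment(dataset, labels, frag_len):
    """Exhaustively score every length-`frag_len` subsequence in `dataset`.

    A fragment's score is the gap between the mean nearest-window
    distances of the two classes (labels 0 and 1): a crude stand-in for
    the discriminative-power criterion described in the text.
    """
    best, best_score = None, -1.0
    for s in dataset:
        for i in range(len(s) - frag_len + 1):
            frag = s[i:i + frag_len]
            d = [subseq_dist(x, frag) for x in dataset]
            d0 = [di for di, y in zip(d, labels) if y == 0]
            d1 = [di for di, y in zip(d, labels) if y == 1]
            score = abs(sum(d0) / len(d0) - sum(d1) / len(d1))
            if score > best_score:
                best, best_score = frag, score
    return best, best_score
```

On a toy set where class 0 contains a spike and class 1 is flat, e.g. `best_fragment([[0, 0, 5, 0], [0, 5, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]], [0, 0, 1, 1], 1)`, the search returns the spike `[5]` as the most discriminative fragment.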
Compared with traditional methods, this fragment-based feature extraction / data re-expression method is especially suited to sequence data obtained from image-edge or motion-trajectory curves, and it improves classification accuracy, efficiency, and interpretability. Besides that, the method is compared with the well-known shapelet algorithm, and its classification performance is verified experimentally.

A sequence coarse-graining algorithm is proposed based on large margin theory. The changing relationship between useful and useless information during the transformation of sequence data from values to symbols is studied. Although some useful information is lost during the transformation, useless information is also reduced significantly. A supervised discretization method for sequence data is proposed to improve classification accuracy and efficiency, which is also verified experimentally.

A sequential classification model is designed based on critical cases. During the construction of the critical sample set, the utility of each sample is evaluated using large margin theory. The weights of samples that produce the largest hypothesis margin are increased, while the weights of outliers and redundant samples are decreased. This improves the generalization ability of the classification model; in addition, the computational efficiency of the model is improved by reducing redundant training samples. The validity of this method is confirmed experimentally.
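The hypothesis margin used to weight critical cases can be sketched with the Relief-style definition: half the distance from a sample to its nearest neighbour of a different class (near-miss) minus the distance to its nearest neighbour of the same class (near-hit). This is a generic illustration under that standard definition, not the dissertation's weighting scheme; the clamping-and-normalising step is an assumption for the sketch.

```python
def hypothesis_margin(i, X, y, dist):
    """Relief-style hypothesis margin of sample i:
    0.5 * (distance to nearest miss - distance to nearest hit)."""
    hits = [dist(X[i], X[j]) for j in range(len(X)) if j != i and y[j] == y[i]]
    misses = [dist(X[i], X[j]) for j in range(len(X)) if y[j] != y[i]]
    return 0.5 * (min(misses) - min(hits))

def sample_weights(X, y, dist):
    """Weight each sample by its (clamped) hypothesis margin.

    Outliers, whose nearest miss is closer than their nearest hit, get
    a negative margin and hence zero weight; samples that enlarge the
    hypothesis margin get large weights. Weights are normalised to sum to 1.
    """
    m = [max(0.0, hypothesis_margin(i, X, y, dist)) for i in range(len(X))]
    total = sum(m) or 1.0
    return [w / total for w in m]
```

For instance, with one-dimensional samples `X = [0.0, 0.1, 5.0, 5.1, 2.6]`, labels `y = [0, 0, 1, 1, 0]`, and `dist = lambda a, b: abs(a - b)`, the mislabelled-looking point `2.6` sits closer to the opposite class than to its own, receives a negative margin, and is weighted zero.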
