Research on Key Techniques of Similarity Detection for Massive Text Data Sets
【Author】 Wang Haitao (王海涛)
【Supervisor】 Liu Shufen (刘淑芬)
【Author Information】 Jilin University, Computer System Architecture, 2016, PhD
【Abstract】 With the rapid development of the Internet and related industries, data is growing at an unprecedented scale. Data has become a strategic resource as important as natural or human resources, and the ability to master data resources reflects a nation's digital initiative. The collection, storage, processing, and analysis of data, together with the information services built on them, are therefore becoming the mainstream of global information-technology development, and big-data research and application have become an important driving force behind industrial upgrading and the rise of new industries. As commercial capital and a strategic resource, big data brings challenges along with momentum: discovering valuable resources within massive data is the foremost task facing researchers. Massive information, however, also contains a great deal of duplicate or similar content, which wastes storage, slows network transmission, directly degrades the overall performance of search engines, and increases the burden on users searching for valuable resources. The goal of big-data processing is to mine deep value from data with effective information technologies and computational methods and to provide high-value-added applications and services. How to manage and exploit massive information effectively, filter out useless or irrelevant content with suitable techniques, quickly discover potentially valuable knowledge, and classify and locate it accurately is thus an urgent problem in current big-data processing.

Focusing on similarity detection over large-scale text data and its application to de-duplication, this thesis studies the theory and key techniques of data classification and mining, feature extraction, similarity detection, and the MapReduce computing model. It designs a multi-confidence-threshold classification method based on association rules and Naive Bayes, proposes a mutual-information-based word-frequency feature-extraction scheme, and builds a parallelized platform for large-scale text similarity detection. The work is innovative in theory and feasible in practice. The main work and contributions are as follows:

1. As background for similarity detection, the thesis studies the theory and techniques of text classification. Given a classification scheme, the task is to build category-decision formulas and rules from the sample data of each class and to summarize the classification regularities, so that a text awaiting classification can be assigned to the appropriate category. The classification process covers text preprocessing, feature selection, feature weighting, text representation, and classification algorithms (the overall pipeline is sketched below); after examining these stages in depth, the thesis focuses on classifier design and implementation and on evaluation criteria for classification. This study lays the theoretical foundation for text similarity detection.

2. To address the low classification precision encountered during similarity detection, a multi-threshold classification method combining Naive Bayes classification with association-rule mining is proposed; applied to large-scale text data sets, it effectively improves document-classification precision. Naive Bayes is computationally simple but ignores the relationships among terms, so association-rule mining is used to set a suitable confidence threshold for each related text class, allowing the classifier to reach higher precision and compensating for this weakness. The method first converts the preprocessed text database into association rules with the CBA-RG algorithm. Over the sorted rule set it evaluates the training data starting from the first rule: if a rule's classification precision exceeds its specified confidence threshold, the data that rule classifies is removed from the training set and the rule is stored in the rule set of the associative classifier; otherwise the rule is discarded. This repeats until every sorted rule has been examined, yielding all association rules whose support exceeds the minimum support (the selection loop is sketched below). Experiments show that, compared with a single classifier, the method achieves higher precision and recall.

3. To address the low precision of extracted feature vectors and the large size of feature subsets in similarity detection, a word-frequency text feature-selection method based on mutual information (FSTM) is proposed. Its inputs are the category set, the texts in each category, and the number of times each term occurs in each category. The method first segments and indexes the input texts, then iterates over the terms: for each category in the training set it counts the texts in which a feature word occurs at least a given number of times, computes the word's feature frequency relative to each category and its average occurrences per text, and finally computes each term's mutual-information value over the categories, adding the highest-valued term to the feature set until the number of feature words reaches a threshold (sketched below). Tests on the SogouT corpus show that the method obtains a small feature subset while maintaining high classification accuracy.

4. To address the large data volumes and the complex, inefficient parallel designs in large-scale text similarity detection, a cloud-platform similarity-detection method based on MapReduce is proposed. Drawing on the SimHash algorithm, it obtains paragraph fingerprints with a paragraph-weighted long-sentence scheme and then computes similarity with the MapReduce model (sketched below). Specifically, it first derives paragraph fingerprints via the feature-extraction method of the previous chapter and sorts and indexes them as keys; it then queries the existing text repository with the fingerprints of the text under test to retrieve potentially duplicate or similar texts; finally, it computes the pairwise similarity against the retrieved candidates and decides whether the text under test is a near-duplicate of an existing one. On a Hadoop experimental platform, three web-page data sets of different scales were used to validate the design, measuring running time and speedup. The results show that MapReduce parallelization markedly improves the running time and efficiency of similarity detection; the gains grow with data scale and with the number of machines in the Hadoop cluster, making the advantage for large-scale data sets even more pronounced.
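To make contribution 1 concrete, here is a minimal sketch of the classification pipeline the abstract names (preprocessing, feature weighting, representation, classifier, precision/recall evaluation). The use of scikit-learn and the toy corpus are illustrative assumptions; the thesis does not specify an implementation.

```python
# Minimal text-classification pipeline sketch: TF-IDF weighting +
# Naive Bayes classifier + precision/recall evaluation.
# scikit-learn and this toy corpus are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import precision_recall_fscore_support

train_texts = [
    "stock market rises on strong earnings",
    "team wins the championship final",
    "central bank cuts the interest rate",
    "striker scores twice in the derby",
]
train_labels = ["finance", "sports", "finance", "sports"]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)

test_texts = ["bank raises the rate", "player scores a late goal"]
pred = clf.predict(test_texts)
p, r, _, _ = precision_recall_fscore_support(
    ["finance", "sports"], pred, average="macro", zero_division=0)
print(pred, p, r)
```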
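The following is a minimal sketch of contribution 2's multi-threshold selection loop, assuming rules arrive as (term set, class, confidence) triples sorted by descending confidence and that a trained Naive Bayes classifier serves as the fallback. The data structures, threshold values, and names are illustrative, not the thesis's actual code.

```python
# Sketch of the multi-threshold associative classifier with a Naive Bayes
# fallback. All names and structures here are illustrative assumptions.

def build_rule_classifier(rules, train_docs, thresholds, nb_classify):
    """rules: (term_set, cls, confidence) triples, sorted by confidence desc.
    train_docs: (term_set, cls) pairs.  thresholds: cls -> minimum precision.
    nb_classify: fallback classifier, term_set -> cls."""
    kept, remaining = [], list(train_docs)
    for terms, cls, _conf in rules:
        covered = [(t, c) for t, c in remaining if terms <= t]
        if not covered:
            continue  # rule matches nothing that is still unclassified
        precision = sum(c == cls for _, c in covered) / len(covered)
        if precision >= thresholds[cls]:
            kept.append((terms, cls))  # store the rule in the classifier
            # remove the documents this rule classifies from the training set
            remaining = [(t, c) for t, c in remaining if not terms <= t]
        # otherwise the rule is discarded

    def classify(doc_terms):
        for terms, cls in kept:  # first matching kept rule wins
            if terms <= doc_terms:
                return cls
        return nb_classify(doc_terms)  # fall back to Naive Bayes
    return classify

# Toy usage (hypothetical data):
rules = [(frozenset({"goal"}), "sports", 0.9)]
docs = [(frozenset({"goal", "team"}), "sports"),
        (frozenset({"rate", "bank"}), "finance")]
clf = build_rule_classifier(rules, docs, {"sports": 0.8},
                            nb_classify=lambda d: "finance")
print(clf(frozenset({"goal", "match"})))  # "sports" via the kept rule
```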
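A minimal sketch of the FSTM idea from contribution 3, under the assumption that mutual information is computed as log(P(t|c)/P(t)) from document frequencies; the exact MI formulation, the names, and the `min_count` parameter are assumptions for illustration.

```python
# Sketch of mutual-information word-frequency feature selection: count,
# per class, the documents where a term occurs >= min_count times, score
# terms by log(P(t|c)/P(t)), and keep the k best.
import math
from collections import Counter, defaultdict

def select_features(docs_by_class, k, min_count=1):
    """docs_by_class: {cls: [token_list, ...]}.  Returns the top-k terms."""
    n_docs = sum(len(docs) for docs in docs_by_class.values())
    df_total = Counter()              # documents containing the term, overall
    df_class = defaultdict(Counter)   # documents containing the term, per class
    for cls, docs in docs_by_class.items():
        for tokens in docs:
            for term, n in Counter(tokens).items():
                if n >= min_count:    # the frequency filter from the method
                    df_total[term] += 1
                    df_class[cls][term] += 1
    scores = {}
    for cls, docs in docs_by_class.items():
        for term, df_tc in df_class[cls].items():
            p_t = df_total[term] / n_docs          # P(t)
            p_t_c = df_tc / len(docs)              # P(t | c)
            scores[term] = max(scores.get(term, float("-inf")),
                               math.log(p_t_c / p_t))
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:k]]

# Toy usage (hypothetical data):
print(select_features({
    "finance": [["bank", "rate"], ["bank", "stock"]],
    "sports":  [["goal", "team"], ["goal", "cup"]],
}, k=2))
```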
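Finally, a minimal sketch of contribution 4's fingerprinting and candidate matching: a 64-bit SimHash over sentence features with long sentences weighted more heavily, a Hamming-distance test, and a toy map/reduce pair keyed on a fingerprint prefix. The hash function (MD5), the 64-bit width, the weighting rule, the 16-bit prefix, and the distance cutoff of 3 are all illustrative assumptions, and grouping by a single prefix is a simplification of real candidate generation.

```python
# Sketch of paragraph fingerprinting (SimHash) plus a local simulation of
# the map -> shuffle -> reduce flow. Parameters are illustrative assumptions.
import hashlib
from collections import defaultdict

def simhash64(weighted_features):
    """weighted_features: iterable of (token, weight) pairs -> 64-bit int."""
    v = [0] * 64
    for token, weight in weighted_features:
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) & (2**64 - 1)
        for i in range(64):
            v[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(64) if v[i] > 0)

def paragraph_fingerprint(paragraph, long_len=20):
    # Weight long sentences more heavily, echoing the paragraph-weighted
    # long-sentence idea; the sentence split and weights are assumptions.
    sentences = [s.strip() for s in paragraph.split(".") if s.strip()]
    return simhash64((s, 2 if len(s.split()) >= long_len else 1)
                     for s in sentences)

def hamming(a, b):
    return bin(a ^ b).count("1")

def mapper(doc_id, paragraph):
    # Map: emit (fingerprint prefix, (doc_id, fingerprint)) so candidate
    # near-duplicates land in the same reduce group.
    fp = paragraph_fingerprint(paragraph)
    yield fp >> 48, (doc_id, fp)

def reducer(prefix, entries):
    # Reduce: pairwise Hamming check within each candidate group; exact
    # duplicates share a fingerprint, near-duplicates differ in a few bits.
    entries = list(entries)
    for i in range(len(entries)):
        for j in range(i + 1, len(entries)):
            (a_id, a_fp), (b_id, b_fp) = entries[i], entries[j]
            if hamming(a_fp, b_fp) <= 3:
                yield a_id, b_id

# Local simulation of the flow on toy documents:
docs = {"d1": "Alpha beta gamma. Delta epsilon zeta.",
        "d2": "Alpha beta gamma. Delta epsilon zeta.",
        "d3": "Completely different text about other things."}
groups = defaultdict(list)
for doc_id, text in docs.items():
    for key, value in mapper(doc_id, text):
        groups[key].append(value)
for key, values in groups.items():
    for pair in reducer(key, values):
        print("near-duplicate:", pair)
```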
【Key words】 Big Data; Similarity Detection; Categorization; Feature Selection; Cloud Computing; Text Set