Research on Duplicate Removal and Similarity Evaluation of Chinese Agricultural Web Pages
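【Author】 Zhao Tao (赵涛);
【Advisor】 Zhang Taihong (张太红);
【Author information】 Xinjiang Agricultural University, Agricultural Mechanization Engineering, 2014, Master's thesis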
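【Abstract (translated from the Chinese)】 With the rapid development of network information technology, both the construction and the service level of agricultural informatization have improved greatly. The massive volume of repetitive agricultural information on the Internet is convenient for people working in agriculture, but it also makes useful information harder to obtain quickly and accurately. Managing duplicate and near-duplicate agricultural web pages effectively has therefore become an important research topic for agricultural vertical search engines. The main work of this thesis is as follows: 1) The key technologies of text deduplication and similarity judgment were studied in depth: web page preprocessing, extraction of main body text, Chinese word segmentation, feature weighting algorithms, web page deduplication methods, text similarity calculation methods, and similarity evaluation criteria. Based on a corpus of agricultural web pages, the study focuses on web page deduplication, feature weighting algorithms, and similarity calculation methods. 2) The criteria for defining duplicate and near-duplicate Chinese agricultural web pages were studied, and a Chinese agricultural web page corpus was constructed. A manually identified collection of web pages was built as the test set; it contains 225 groups, each with 2 to 14 near-duplicate pages, for a total of 1110 pages. 3) The pages were first preprocessed, and MD5 was used to remove pages that were exactly identical to another page in the collection. Main body text was then extracted from the remaining pages, segmented with the Paoding (庖丁解牛) word segmenter, and stripped of stop words, after which the feature words were weighted using three schemes: Boolean weights, term frequency weights, and term frequency-inverse document frequency (TF-IDF) weights. Finally, three similarity algorithms (the vector space model, HowNet-based semantic similarity, and latent semantic analysis) were applied to the feature vectors under each of the three weightings, giving 9 sets of similarity judgment results for Chinese agricultural web pages. 4) The precision, recall, and F1 measure of the 9 experiments were analyzed and compared. The results show that no feature weighting algorithm has an absolute advantage for similarity judgment; each of the three has strengths and weaknesses under different similarity methods. Comparing the similarity methods shows that latent semantic analysis gives the best results. MD5 removed 41 pages that were exact duplicates of other pages, and the remaining 1069 pages were studied using the different similarity methods combined with the weighting schemes. Analysis and comparison of the experimental results show that latent semantic analysis combined with Boolean weights gives the best similarity judgment for agricultural web pages, with an F1 measure of 90.1% and a precision of 93.7%.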
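The exact-duplicate removal and the three weighting schemes described in step 3 can be sketched roughly as follows. This is a minimal illustration and not the thesis's implementation: whitespace splitting stands in for the Paoding segmenter, the sample pages are invented, the function names are hypothetical, and the TF-IDF formula (raw count times log(N/df)) is one common variant that the abstract does not specify.

```python
import hashlib
import math
from collections import Counter


def drop_exact_duplicates(pages):
    """Keep the first page seen for each distinct MD5 digest of the page text."""
    seen, unique = set(), []
    for page in pages:
        digest = hashlib.md5(page.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique


def boolean_weights(tokens):
    """Boolean weighting: 1.0 for every term that occurs at least once."""
    return {term: 1.0 for term in set(tokens)}


def tf_weights(tokens):
    """Term-frequency weighting: raw occurrence count of each term."""
    return {term: float(count) for term, count in Counter(tokens).items()}


def tfidf_weights(tokens, doc_freq, n_docs):
    """TF-IDF weighting (assumed variant): count * log(N / document frequency)."""
    return {
        term: count * math.log(n_docs / doc_freq[term])
        for term, count in Counter(tokens).items()
    }


# Invented sample pages; the second is an exact copy of the first.
pages = [
    "wheat rust control",
    "wheat rust control",
    "cotton drip irrigation",
    "wheat rust prevention",
]
unique_pages = drop_exact_duplicates(pages)

# Whitespace split is a placeholder for real Chinese word segmentation.
docs = [page.split() for page in unique_pages]
doc_freq = Counter(term for doc in docs for term in set(doc))

first_boolean = boolean_weights(docs[0])
first_tf = tf_weights(docs[0])
first_tfidf = tfidf_weights(docs[0], doc_freq, len(docs))
```

MD5 only catches byte-identical text, which is why the near-duplicate pages still need the similarity step that follows. Under the assumed TF-IDF variant, a term that appears in fewer pages (here `control`) receives a larger weight than one shared by several pages (here `wheat`).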
【Abstract】 With the rapid development of network information technology, the construction and service level of agricultural informatization have been greatly improved. The massive and repetitive agricultural information on the Internet is convenient for people engaged in agriculture, but it also increases the difficulty of obtaining useful information quickly and accurately. How to manage duplicate and near-duplicate agricultural web pages effectively has become one of the important topics in research on agricultural vertical search engines. The main work of this thesis includes the following aspects: 1) An in-depth study of the key technologies of text deduplication and similarity judgment: web page preprocessing, extraction of web page body text, Chinese word segmentation, feature weighting algorithms, web page deduplication methods, text similarity calculation methods, and similarity evaluation criteria. Based on an agricultural web corpus, the thesis focuses on web page deduplication, feature weighting algorithms, and similarity calculation methods. 2) The thesis studies the criteria for defining duplicate and near-duplicate Chinese agricultural web pages and builds a Chinese agricultural web corpus. A manually identified collection of web pages was built as the test set. The collection contains 225 groups of pages, each with 2 to 14 near-duplicate pages, for a total of 1110 web pages. 3) The web pages were first preprocessed, and pages exactly identical to others in the collection were removed using the MD5 method. Body text was then extracted from the remaining pages, segmented with the Paoding word segmenter, and stripped of stop words, and the feature words were weighted using three methods: Boolean weighting, term frequency weighting, and term frequency-inverse document frequency weighting. Finally, three similarity algorithms (the vector space model, HowNet-based semantic similarity, and latent semantic analysis) were applied to the feature vectors under the three different weightings, yielding 9 groups of similarity judgment results for Chinese agricultural web pages. 4) The precision, recall, and F1 measure of the 9 experiments were analyzed and compared. The results show that no single feature weighting algorithm has an absolute advantage in similarity judgment; all three have advantages and disadvantages under different similarity methods. The comparison of similarity methods shows that latent semantic analysis gives the best results. The MD5 method removed 41 pages that were exact duplicates of other pages, and the remaining 1069 pages were studied using the different similarity methods combined with the weighting schemes. Analysis and comparison of the experimental results show that latent semantic analysis combined with Boolean weighting gives the best similarity judgment for agricultural web pages, with an F1 measure of 90.1% and a precision of 93.7%.
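As a rough illustration of the vector space model comparison and the evaluation measures named above, the sketch below computes cosine similarity between two weighted term vectors and then precision, recall, and F1 over sets of page pairs. The vectors, the pair sets, and the function names are hypothetical. The last function rearranges the standard definition F1 = 2PR/(P+R) to show the recall implied by the two reported figures; that recall value is derived here and is not stated in the abstract.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def precision_recall_f1(predicted, actual):
    """Score predicted near-duplicate pairs against manually labelled pairs."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def recall_from(precision, f1):
    """Solve F1 = 2PR / (P + R) for R, given P and F1."""
    return f1 * precision / (2 * precision - f1)


# Hypothetical Boolean-weighted vectors for two pages sharing two of three terms.
page_a = {"wheat": 1.0, "rust": 1.0, "control": 1.0}
page_b = {"wheat": 1.0, "rust": 1.0, "prevention": 1.0}
similarity = cosine_similarity(page_a, page_b)

# Hypothetical page-id pairs: system output versus manual labels.
predicted_pairs = {(1, 2), (1, 3), (4, 5)}
labelled_pairs = {(1, 2), (4, 5), (6, 7), (8, 9)}
precision, recall, f1 = precision_recall_f1(predicted_pairs, labelled_pairs)

# Recall implied by the reported precision (93.7%) and F1 (90.1%).
implied_recall = recall_from(0.937, 0.901)
```

In practice a pair of pages would be judged near-duplicate when its similarity exceeds some threshold; the abstract does not give the threshold used, so none is assumed here. With the reported figures, the implied recall comes out at roughly 87%.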
【Key words】 Chinese Agricultural Webpage; MD5; Vector Space Model; HowNet; Latent Semantic Analysis;