The Design of a Similar Document Detection System Based on Text Segmentation (基于分块策略的近似文档检测系统的设计)
【Author】 李兵 (Li Bing)
【Supervisor】 赫枫龄 (He Fengling)
【Author Information】 Jilin University, Software Engineering, 2010, Master's degree
【Abstract (Chinese)】 This thesis presents the design of a near-duplicate document detection system based on a chunking strategy. In practical applications, near-duplicate detection is no longer limited to checking a fixed set of existing documents, so our system is designed around a realistic workflow: it starts from a web search, then collects pages, removes noise, and extracts the body text. Once the target documents are obtained, each document is chunked using the shingle as the smallest unit of granularity; each chunk is converted to a hash value and mapped into a hash table. In this way, every document under comparison yields its own hash table. The hash values in the tables are then compared one by one, the number of identical values is counted, and a threshold chosen in advance is used to decide: if the number of identical hash values between the compared documents reaches this threshold, we judge them to be similar documents. Chapter 1 of this thesis introduces the research background, purpose, and significance of near-duplicate document detection. Chapter 2 briefly surveys the main existing detection methods. Chapter 3, the core of the thesis, describes the design of the chunk-based detection system and the choice of algorithms. Chapter 4 presents the implementation and experimental results. Chapter 5 briefly summarizes the work and its outcomes.
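As a rough illustration of the chunk-and-compare scheme this abstract describes, the sketch below builds word-level shingles, hashes each shingle with Java's built-in String.hashCode(), and counts the hash values two documents share. The shingle width of 3, the threshold of 4, and the hash function are illustrative assumptions; the thesis does not fix these values here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ShingleDetector {

    /** Cut a document into overlapping word shingles and hash each shingle. */
    static Set<Integer> shingleHashes(String text, int width) {
        String[] words = text.trim().split("\\s+");
        Set<Integer> hashes = new HashSet<>();
        for (int i = 0; i + width <= words.length; i++) {
            String shingle = String.join(" ", Arrays.copyOfRange(words, i, i + width));
            hashes.add(shingle.hashCode()); // one hash-table entry per shingle (assumed hash)
        }
        return hashes;
    }

    /** Count hash values the two documents have in common. */
    static int sharedHashes(Set<Integer> a, Set<Integer> b) {
        Set<Integer> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    public static void main(String[] args) {
        Set<Integer> d1 = shingleHashes("the quick brown fox jumps over the lazy dog", 3);
        Set<Integer> d2 = shingleHashes("the quick brown fox leaps over the lazy dog", 3);
        int threshold = 4; // illustrative; the thesis chooses its threshold before testing
        System.out.println(sharedHashes(d1, d2) >= threshold ? "similar" : "not similar");
    }
}
```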
【Abstract】 With the rapid development of Internet technology, online resources are diversifying and growing at a remarkable rate. Against this background, search engines have become the main way to look for information: users can easily retrieve whatever they need to know. But as the Internet keeps growing, so does the information on it, and the content of web pages inevitably overlaps. Some sites copy most of their information from others for their own purposes, and some even reprint other sites' original material directly. As a result, the information a search engine returns is full of similar and even duplicate content. First, this wastes cyberspace; second, it greatly inconveniences users trying to extract useful information, reducing efficiency and harming the reliability of the results. Removing duplicate information has therefore become an essential step in obtaining valid information from the Internet. This thesis accordingly focuses on extracting the text of web pages and performing similarity detection on a practical platform, with the help of my teachers and senior fellow students. By detecting documents in this way we can learn which information is similar or repetitive, and removing the duplicates brings great convenience to daily work and improves the efficiency of information processing.

The thesis first summarizes the main steps and methods of duplicate removal, such as: removal of similar information based on approximate feature vectors, removal of duplicates based on fingerprint algorithms, detection of similar documents based on keywords, sub-signature algorithms, and random-mapping algorithms. These strategies and algorithms each have their own characteristics and applications. In the end, we adopt a chunk-based strategy for similarity detection. To keep the problem close to real-life application, we do not merely compare a fixed set of prepared documents; instead we assume a user who, needing information in daily life, obtains it through the Internet. The system compares documents with each other through their hash values and decides whether the prepared documents are similar by comparing the number of matching values against a threshold fixed before testing, using the following steps: information search, web page collection, page segmentation, noise removal, text extraction, hash computation, and hash-table mapping. This process provides a relatively complete solution to problems of this kind.

Searching the web for information and collecting web pages are techniques in everyday use, so we say no more about them here; we use the Google search engine to do the searching and collecting. Next come the steps of the page-segmentation stage. We first parse the collected web pages to obtain their DOM trees, from which a great deal of information can be read, and we then cut off parts such as advertisements, page navigation, and other links (a sketch of this cleaning step, under assumed tooling, appears below).
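The thesis does not name the DOM library it uses; purely as an illustration of the page-cleaning step just described, the following minimal sketch assumes jsoup to parse a page into a DOM tree and drop the noisy node types the abstract goes on to list.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PageCleaner {

    /** Parse a page into a DOM tree, drop noisy subtrees, and return the body text. */
    static String cleanText(String html) {
        // jsoup is an assumption for this sketch; the thesis only says a DOM tree is built.
        Document doc = Jsoup.parse(html);
        doc.select("script, style, img").remove(); // the tag types the abstract filters out
        return doc.body().text();
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Main article text.</p>"
                + "<script>track();</script><img src='ad.gif'></body></html>";
        System.out.println(cleanText(html)); // prints: Main article text.
    }
}
```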
On the one hand, a filter removes <img>, <script>, and <style> elements from the pages; on the other, the size of each DOM-tree node is computed, with a threshold that depends on the block's size and position. For each node we divide the number of link characters by the number of non-link characters; when this ratio exceeds the threshold, we treat the node as a link block and remove it. After the web-page noise has been deleted, we simply make use of the open java.util.regex regular-expression package included in Sun's JDK 1.4: a Java program fetches the source code of the web page, applies regular expressions from java.util.regex, and finally saves the remaining words to a document.

The next step is to divide the documents into chunks. Ordered from small to large, the candidate block sizes are the word, the shingle, and the whole document. A trade-off must be kept in mind: when the chunk size is small, the result is accurate, but the computation takes too long to be practical; conversely, when the block size is too large, even a small difference changes the hash code, so many documents that are not copies but are nonetheless similar would be missed. Weighing computation speed, efficiency, and quality of results, we choose the shingle method to divide the documents into chunks, use the Java language to convert the chunks into hash codes, and then map those hash codes into a hash table. By traversing the table's elements, documents are compared through their hash-table mappings. In the end we can easily obtain the number of identical hash codes, and we compare that number with the threshold; if the number is greater than the threshold, we consider the two documents similar.
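For the text-extraction step described above, the thesis names only the java.util.regex package shipped with Sun's JDK 1.4. The sketch below shows one way such regex-based stripping could look; the concrete patterns are illustrative assumptions, not the thesis's actual expressions.

```java
import java.util.regex.Pattern;

public class RegexTextExtractor {

    // Illustrative patterns (assumptions): whole script/style blocks, then any remaining tag.
    private static final Pattern SCRIPT_STYLE = Pattern.compile(
            "<(script|style)[^>]*>.*?</\\1>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern SPACE = Pattern.compile("\\s+");

    /** Strip markup from raw page source, keeping only the visible words. */
    static String toPlainText(String html) {
        String s = SCRIPT_STYLE.matcher(html).replaceAll(" "); // drop script/style blocks wholesale
        s = TAG.matcher(s).replaceAll(" ");                    // drop remaining tags, including <img>
        return SPACE.matcher(s).replaceAll(" ").trim();        // normalize whitespace
    }

    public static void main(String[] args) {
        String html = "<p>Body text<script>x()</script> continues.</p><img src='a.png'>";
        System.out.println(toPlainText(html)); // prints: Body text continues.
    }
}
```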
- 【Online Publication Submitter】 Jilin University 【Online Publication Issue】 2010, No. 09
- 【CLC Number】 TP393.092
- 【Cited By】 2
- 【Downloads】 73