
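Research and Implementation of Tightness Post-Processing Based on Hadoop and Support Vector Machines (基于Hadoop和支持向量机的紧密度后处理的研究与实现)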

A Post-Process Method to Tightness Based on Hadoop and Support Vector Machine

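【Author】 Yang Guang (杨光)

【Advisor】 Chen Xudong (陈旭东)

【Author information】 Beijing Jiaotong University, Software Development Technology (professional degree), 2015, Master's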

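【Abstract (translated from the Chinese)】 Accurately extracting and presenting the results a user is looking for has become the main goal of today's search engines. Search engines rely on many technologies, of which natural language processing is one of the most important and the foundation on which improvements to the others are built. Tightness is one of the key techniques applied after word segmentation and stop-word removal. It describes the relationship between the smallest units (terms) produced by segmentation and is an important indicator in the relevance ranking of web search: it plays a decisive role in the ranking results and matters greatly for improving the precision and recall of search results. Because the segmentation strategy is minimal cutting, sentences are split at the finest possible granularity, so some long phrases are cut into several terms. Subsequent searches then recall pages that do not match the user's need, which lowers precision and degrades the user experience. Against the background of a real project on the Sogou search engine, this thesis studies algorithmic strategies for new-word discovery in Chinese word segmentation, designs strategy-based algorithms for extracting term relations, assembles those relations into features, classifies the features with a Support Vector Machine (SVM), and thereby improves the practical effect of tightness. The main work is as follows: (1) Data preprocessing. The raw search logs are segmented and initial statistics are computed, yielding the base data for the later strategies. (2) Initial post-processing based on search session logs. A query difference value is computed over the search session data to select a subset of sessions, which is used for an initial post-processing of tightness. (3) Second-stage post-processing based on web page body text. For tightness results at the proper-noun level, new-word discovery methods such as information entropy and mutual information are used to derive feature relations between pairs of terms, and the feature values are classified with the SVM. (4) Experimental validation and analysis. The final offline data are validated against the training set; the tightness post-processing strategies improve relevance ranking and make Sogou's search results more accurate. (5) Strategy effect. Adjusting tightness values through the post-processing strategies makes relevance ranking more accurate, moving high-quality results up and poor results down.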

【Abstract】 Returning exactly the results users want has become the main goal of modern search engines. Search engines rest on several techniques; natural language processing is a significant one and underpins improvements to the others. Tightness, applied after segmentation and stop-word removal, is an important indicator in the relevance ranking of web search: it is a dominant factor in the ranked results and matters greatly for the precision and recall of search results. The segmenter cuts a sentence into the smallest possible parts, which splits long phrases into several terms. This leads to recalling many web pages that do not satisfy the user's query, lowering the precision of search results and degrading the user experience. Based on a real project at the Sogou search engine, the author studies strategies and algorithms for new-phrase discovery in Chinese segmentation, designs a strategy-based method for extracting the relations between terms, forms those relations into features, classifies them with a Support Vector Machine (SVM), and so improves the tightness results. The thesis completes the following work: (1) Data preprocessing: segmentation of and statistics over the original query logs, producing the base data for the later algorithms. (2) Post-processing based on session logs: the query distance is calculated within the query session logs to select part of the session data. (3) Post-processing based on web pages: to improve the results for proper nouns, base statistics are computed with new-phrase discovery methods such as information entropy and mutual information, the relations and features between terms are derived, and those features are classified with the SVM. (4) Validation and analysis: the final offline data are checked against the training set, showing that the post-processing strategies improve relevance ranking and the precision of search results. (5) Strategy results: after tightness post-processing, relevance ranking becomes more accurate, with good pages ranked nearer the top and poor ones lower.
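The abstract names information entropy and mutual information as the sources of the pairwise term features but does not give the formulas. As a rough illustration only, the sketch below computes pointwise mutual information and left/right boundary entropy for adjacent term pairs, two statistics commonly used in new-word discovery. The function names, the choice of these exact statistics, and the romanized toy queries are assumptions made for this example, not the thesis's implementation.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (in bits) of the distribution given by a Counter."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def pair_features(sentences):
    """Map each adjacent term pair to (pmi, left_entropy, right_entropy).

    `sentences` is a list of already-segmented term lists (hypothetical input).
    """
    unigrams = Counter()
    bigrams = Counter()
    left_ctx = defaultdict(Counter)   # terms seen immediately before the pair
    right_ctx = defaultdict(Counter)  # terms seen immediately after the pair
    for terms in sentences:
        unigrams.update(terms)
        for i in range(len(terms) - 1):
            pair = (terms[i], terms[i + 1])
            bigrams[pair] += 1
            if i > 0:
                left_ctx[pair][terms[i - 1]] += 1
            if i + 2 < len(terms):
                right_ctx[pair][terms[i + 2]] += 1

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    features = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi = math.log2(p_ab / (p_a * p_b))
        features[(a, b)] = (pmi, entropy(left_ctx[(a, b)]), entropy(right_ctx[(a, b)]))
    return features


# Toy segmented queries (romanized, invented for illustration).
sentences = [
    ["beijing", "jiaotong", "daxue", "zhaosheng"],
    ["beijing", "jiaotong", "daxue", "dizhi"],
    ["chaxun", "jiaotong", "daxue", "paiming"],
    ["jiaotong", "shigu", "chuli"],
]
features = pair_features(sentences)
pmi, left_h, right_h = features[("jiaotong", "daxue")]
```

A pair that co-occurs often (high mutual information) and is surrounded by varied neighbors (high boundary entropy) behaves like a single phrase, so it is a candidate for a high tightness value. Feature tuples of this kind could then be fed to an SVM classifier, which is the step the abstract describes next.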

  • 【Classification number】TP391.3; TP18
  • 【Downloads】112