
面向话题追踪的特征选取与文本表示技术的研究

Study on Feature Extraction and Text Representation Technology in Topic Tracking

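【Author】 Wang Huizhen (王会珍)

【Advisor】 Zhu Jingbo (朱靖波)

【Author Information】 Northeastern University (东北大学), Computer Application Technology, 2005, Master's thesis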

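【Abstract, translated from Chinese】 With the emergence and spread of the Internet, the amount of information available to people has grown explosively. Under these conditions it is hard to find the information one is interested in quickly and accurately. Moreover, information related to a single topic is often scattered in isolation across different time periods and different places, so with existing technology alone it is difficult to get a comprehensive grasp of an event. Topic Detection and Tracking (TDT) addresses this need: it is an intelligent information-acquisition technology that studies how to detect newly occurring events and track their subsequent development. It helps people gather and organize scattered information so that they can understand, as a whole, all the details of an event and its relationship to other events. Topic tracking is a subtask of TDT; its goal is to monitor a stream of news stories and identify the follow-up stories related to a topic that is described by a few news stories given in advance.

Based on the characteristics of the topic tracking task, this thesis studies feature selection and text representation for topic tracking. It examines feature selection methods at different levels and proposes two feature representations: word pairs and word clusters. Much of the work on topic tracking represents text as a "bag of words". Taking part-of-speech information into account, this thesis proposes a representation that uses word pairs as features (BOP), and performs topic tracking with a unigram model and a vector space model. With the TDT3 corpus as the test corpus, the experiments show that, in the tracking system used here, word pairs as text features do not improve tracking performance. The thesis also introduces k-means clustering and uses word clusters as the features that represent text (BOC). The experiments show that word clusters as text features greatly reduce the feature dimensionality and considerably improve the efficiency of the tracking system.

From observation of news stories, this thesis proposes a double-vector model. Using named entity recognition, each text is represented as two vectors. When a story is tracked, the similarity is computed separately for each pair of corresponding vectors, and the two similarities are combined by a weighted sum into a final score on which the tracking system bases its decision. To remove noisy data more effectively, the thesis uses not only a stop-word list but also a constructed set of stop parts of speech. With the TDT4 corpus as the test corpus, the experiments show that the double-vector model improves topic tracking performance, and that the stop part-of-speech set also improves the tracking system considerably.

The tracking system is built on the vector space model and the unigram model. Experiments analyze two factors that affect Chinese topic tracking performance: the smoothing parameter and the number of features. With the TDT3 and TDT4 corpora as test corpora, the experiments show that choosing an appropriate number of features, using a good word segmentation technique, and …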
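The abstract names a smoothing parameter for the unigram model but does not say which smoothing method is used. As a minimal sketch only, assuming linear interpolation between a topic model and a background model, the role of that parameter could look like the following. The function name, the parameter `lam`, and the add-one estimate for the background model are illustrative assumptions, not details taken from the thesis.

```python
import math
from collections import Counter


def unigram_score(story_tokens, topic_counts, background_counts, lam=0.5):
    """Average log-probability of a story under a smoothed topic unigram model.

    Assumption: linear interpolation smoothing, where `lam` weights the
    topic model and (1 - lam) weights the background model.
    """
    topic_total = sum(topic_counts.values())
    bg_total = sum(background_counts.values())
    vocab_size = len(background_counts)

    total = 0.0
    for tok in story_tokens:
        # Maximum-likelihood estimate from the topic's training stories.
        p_topic = topic_counts.get(tok, 0) / topic_total if topic_total else 0.0
        # Add-one estimate so tokens unseen in the background keep some mass.
        p_bg = (background_counts.get(tok, 0) + 1) / (bg_total + vocab_size + 1)
        total += math.log(lam * p_topic + (1 - lam) * p_bg)
    # Normalize by length so long and short stories are comparable.
    return total / max(len(story_tokens), 1)


# Toy data, invented for illustration.
topic = Counter("earthquake rescue earthquake relief".split())
background = Counter(
    "earthquake earthquake rescue relief election election vote vote market".split()
)
on_topic = unigram_score(["earthquake", "rescue"], topic, background)
off_topic = unigram_score(["election", "vote"], topic, background)
```

In this sketch a story would be judged on topic when its score exceeds a tuned threshold. Raising `lam` trusts the few training stories more, while lowering it leans on the background statistics.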

【Abstract】 With the appearance and popularization of the Internet, the amount of information available has grown explosively. Under these circumstances, people can hardly find the information they are interested in quickly and accurately. Moreover, information relevant to a topic is often scattered across different times and places. With existing technology alone, it is hard to gain a comprehensive understanding of some events. Topic Detection and Tracking (TDT) technology is meant to meet this need. The initial motivation for research in TDT is to provide a core technology for an envisioned system that would monitor broadcast news and alert an analyst to new and interesting events happening in the world. Topic tracking is a subtask of TDT. It aims at monitoring the stream of news stories to find additional stories on a topic that is identified using several sample stories.

Based on the characteristics of the topic tracking task, we study feature extraction and text representation technology for it. We study feature extraction methods at different levels and present two of them: word pairs and word clusters. In most of the research on topic tracking, texts are represented as a "bag of words". In this thesis, we take part of speech into consideration and propose a representation method that uses word pairs as features (BOP). We use a unigram model and a vector space model to perform topic tracking, with the TDT3 corpus as the test corpus. Experimental results show that, in the tracking system we selected, using word pairs as text features does not improve performance. We also introduce the k-means clustering technique and use word clusters as text features (BOC). Experimental results show that using word clusters as text features greatly reduces the feature dimension and thus greatly improves the efficiency of the tracking system.

Through observation of stories, we propose a double-vector model. Text is represented with two vectors using named entity recognition technology. While tracking stories, we compute the similarity for each pair of corresponding vectors and obtain the final score as a weighted sum of the two similarities. The tracking system makes its judgment according to this score. In order to better remove noisy data, we use not only a stop-word list but also a constructed set of stop parts of speech. We choose the TDT4 corpus as the test corpus. Experimental results show that the double-vector model improves the performance of topic tracking, and that the use of the stop part-of-speech set also improves system performance considerably.
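The double-vector scoring described above can be sketched as follows. This is an illustration under stated assumptions, not the thesis's implementation: it assumes one vector holds named entities and the other holds the remaining words, that cosine similarity is used for both, that raw term counts serve as weights, and that the stop-word list and stop part-of-speech set are applied before the split. All function names, the weight `alpha`, and the `threshold` value are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def to_double_vector(tokens, stop_words=frozenset(), stop_pos=frozenset()):
    """Split (word, pos_tag, is_named_entity) triples into two count vectors.

    Tokens whose word is in the stop-word list or whose tag is in the
    stop part-of-speech set are dropped first (assumed order of operations).
    """
    entity_vec, content_vec = Counter(), Counter()
    for word, pos, is_entity in tokens:
        if word in stop_words or pos in stop_pos:
            continue
        (entity_vec if is_entity else content_vec)[word] += 1
    return entity_vec, content_vec


def track_score(topic, story, alpha=0.5):
    """Weighted sum of the entity similarity and the content similarity."""
    return alpha * cosine(topic[0], story[0]) + (1 - alpha) * cosine(topic[1], story[1])


def is_on_topic(topic, story, alpha=0.5, threshold=0.3):
    """Decision rule: accept the story when the combined score reaches the threshold."""
    return track_score(topic, story, alpha) >= threshold


# Toy stories with invented tags, for illustration only.
topic = to_double_vector(
    [("Kobe", "NR", True), ("earthquake", "NN", False),
     ("rescue", "NN", False), ("the", "DT", False)],
    stop_pos={"DT"},
)
story_on = to_double_vector(
    [("Kobe", "NR", True), ("earthquake", "NN", False), ("damage", "NN", False)]
)
story_off = to_double_vector([("Paris", "NR", True), ("election", "NN", False)])
```

Keeping the two similarities separate lets `alpha` control how much a shared person, place, or organization name counts relative to shared ordinary vocabulary. In practice both `alpha` and `threshold` would be tuned on held-out data.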

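  • 【Online Publication Contributor】 Northeastern University (东北大学)
  • 【Online Publication Issue】 2005, Issue 07
  • 【Classification Code】 TP391.1
  • 【Citations】 9
  • 【Downloads】 473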