Study on Text Representation Model Based on Concept

【Author】 Zhang Jian (张剑)

【Advisor】 Li Chunping (李春平)

【Author information】 Tsinghua University, Software Engineering, 2006, Master's thesis

【Abstract】 This thesis proposes a concept-based text representation model. The model uses the WordNet lexical ontology as its main source of conceptual knowledge: each synonym set (synset) in WordNet is treated as a concept with a definite meaning, and each term in a text is replaced by the concept it corresponds to, that is, the synset to which the term belongs. The resulting concept vector space serves as the feature vector space of the text. The weights of its dimensions are then adjusted according to the hypernym-hyponym relations between concepts, so that the representation reflects more abstract semantic information in the text.

We present two text representation models based on concept (TRMC): one for text categorization (TRMC-TCA) and one for text clustering (TRMC-TCL). In TRMC-TCA, the category labels of the training texts are used to correct the weights of the concept vectors that represent them; specifically, the inverse category frequency of a concept is taken as one of the factors that influence its weight.

We conducted two groups of experiments to evaluate TRMC-TCA and TRMC-TCL. In the first group, using documents from the Reuters Corpus Volume I (RCV1) news collection, we compared TRMC-TCA with a term-based vector space representation under the same text categorization algorithm. The results show that TRMC-TCA achieves satisfactory classification precision when the training set is small and, when the training set is large, keeps the dimensionality of the feature vector space within a manageable range without degrading classification performance. In the second group, using the 20Newsgroups collection, we compared TRMC-TCL with the term-based vector space representation under the same text clustering algorithm. The results show that TRMC-TCL effectively improves clustering performance when an agglomerative hierarchical clustering algorithm is used.
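The representation described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The lexicon `TERM_TO_CONCEPT` and the links in `HYPERNYM` are hypothetical toy stand-ins for WordNet. Mapping each term to a single concept, the `decay` factor applied to hypernyms, and the logarithmic form of the inverse category frequency are all assumptions made here for illustration; the abstract states only that hypernym-hyponym relations and inverse category frequency influence the weights.

```python
import math
from collections import defaultdict

# Hypothetical toy lexicon standing in for WordNet: term -> concept (synset) id.
# Assumption: each term maps to exactly one concept (no word-sense disambiguation).
TERM_TO_CONCEPT = {
    "car": "car.n.01",
    "automobile": "car.n.01",
    "truck": "truck.n.01",
    "lorry": "truck.n.01",
    "apple": "apple.n.01",
    "banana": "banana.n.01",
}

# Hypothetical hypernym links: concept -> its more general parent concept.
HYPERNYM = {
    "car.n.01": "motor_vehicle.n.01",
    "truck.n.01": "motor_vehicle.n.01",
    "apple.n.01": "fruit.n.01",
    "banana.n.01": "fruit.n.01",
}


def concept_vector(tokens, decay=0.5):
    """Count concepts instead of terms, crediting each hypernym ancestor
    with a share of the weight that shrinks by `decay` per level (assumed)."""
    vec = defaultdict(float)
    for tok in tokens:
        concept = TERM_TO_CONCEPT.get(tok.lower())  # unknown terms are skipped
        weight = 1.0
        while concept is not None:
            vec[concept] += weight
            concept = HYPERNYM.get(concept)
            weight *= decay
    return dict(vec)


def inverse_category_frequency(labelled_docs):
    """Assumed form: icf(c) = log(#categories / #categories containing c)."""
    categories = set()
    categories_with = defaultdict(set)
    for tokens, category in labelled_docs:
        categories.add(category)
        for concept in concept_vector(tokens):
            categories_with[concept].add(category)
    n = len(categories)
    return {c: math.log(n / len(s)) for c, s in categories_with.items()}


def weighted_vector(tokens, icf):
    """Scale each concept weight by its inverse category frequency."""
    return {c: w * icf.get(c, 0.0) for c, w in concept_vector(tokens).items()}


if __name__ == "__main__":
    train = [
        (["car", "truck"], "autos"),
        (["apple", "banana"], "food"),
        (["car", "apple"], "mixed"),
    ]
    icf = inverse_category_frequency(train)
    print(concept_vector(["car", "automobile", "truck"]))
    print(weighted_vector(["lorry", "automobile"], icf))
```

In this sketch, synonyms such as "car" and "automobile" fall on the same dimension, and the shared hypernym dimension lets texts about cars and trucks overlap even when they share no terms. A concept that occurs in every category receives an inverse category frequency of zero and therefore contributes nothing to separating categories.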

【Key words】 Text Representation; WordNet; Concept Vector Space
  • 【Online publication contributor】 Tsinghua University
  • 【Online publication issue】 2007, Issue 02
  • 【Classification number】 TP391.1
  • 【Times cited】 20
  • 【Downloads】 770