Study of Automatic Keyword Labeling for Scientific Literature
【Abstract】 Missing or unlabeled keywords make it harder to classify and navigate scientific literature. To address this problem, this paper proposes an algorithm that automatically labels keywords based on the content of a paper's abstract. Abstracts whose keywords were supplied by their authors serve as the training set. Four text modeling methods are applied to the abstracts and keywords in the training set: a language model (LM), latent Dirichlet allocation (LDA), the probabilistic author-topic model, and a combination of LM and LDA. Each model captures the relationship between keywords and the terms that make up an abstract. The trained models are then used to predict keywords for abstracts of papers that have none. Experimental results on both Chinese and English data sets show that the predicted keywords reflect the content of the scientific literature well, and that the combination of LM and LDA performs best among the four models.
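The language-model component described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formulation, so everything here is an assumption made for illustration: the function names (`train_keyword_lm`, `score`, `predict_keywords`), the unigram model, the interpolation weight `lam`, and the add-one smoothing of the background distribution. The LDA, author-topic, and combined models are not shown.

```python
import math
from collections import Counter, defaultdict


def train_keyword_lm(docs):
    """Collect term counts per keyword from labeled abstracts.

    docs: list of (tokens, keywords) pairs (hypothetical input format).
    """
    per_keyword = defaultdict(Counter)  # keyword -> term counts
    background = Counter()              # term counts over the whole training set
    for tokens, keywords in docs:
        background.update(tokens)
        for kw in keywords:
            per_keyword[kw].update(tokens)
    return per_keyword, background


def score(tokens, kw_counts, background, lam=0.5):
    """Log-likelihood of an abstract under one keyword's smoothed unigram model."""
    kw_total = sum(kw_counts.values())
    bg_total = sum(background.values())
    vocab = len(background) + 1  # +1 reserves probability mass for unseen terms
    total = 0.0
    for t in tokens:
        p_kw = kw_counts[t] / kw_total if kw_total else 0.0
        p_bg = (background[t] + 1) / (bg_total + vocab)  # add-one smoothing
        # Interpolate keyword-specific and background estimates (assumed scheme).
        total += math.log(lam * p_kw + (1 - lam) * p_bg)
    return total


def predict_keywords(tokens, model, top_n=3):
    """Rank all training keywords by how well their model explains the abstract."""
    per_keyword, background = model
    ranked = sorted(
        per_keyword,
        key=lambda kw: score(tokens, per_keyword[kw], background),
        reverse=True,
    )
    return ranked[:top_n]
```

Under these assumptions, a caller would tokenize the labeled abstracts, pass them to `train_keyword_lm`, and then call `predict_keywords` on the tokens of an unlabeled abstract. A topic-model score could be mixed into `score` to approximate a combined model, but how the paper weights the two components is not stated in the abstract.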
【Key words】 Language model; Tag prediction; Latent Dirichlet allocation; Probabilistic author-topic model
- 【Source】 计算机科学 (Computer Science), 2012, Issue 09
- 【Classification code】 TP391.1
- 【Citation count】 2
- 【Download count】 152