Characteristic representation method of document based on Word2vector
【Abstract】 Document representations based on term statistics cannot effectively extract the lexical semantics of a text. To address this, the paper proposes a document representation method based on context. The method uses Word2vector to extract lexical semantics in the form of word vectors, clusters the word vectors with an "optimized fitness value partition", and then represents each word by the centroid of its cluster. From the centroids and the frequencies of the words they represent, it builds a cluster centroid frequency model, semantic frequency-inverse document frequency (SF-IDF), which is used to represent documents. Without relying on semantic rules, text classification experiments were run with the neural network language model (NNLM) algorithm on two text data sets, Reuter-21578 and Wikipedia extensible markup language (XML) data. Classification results were evaluated with the F1-measure, comparing SF-IDF vectors against the existing term frequency-inverse document frequency (TF-IDF) vectors. Average accuracy rose from 57.1% to 63.3% on the Reuter-21578 data set and from 48.7% to 59.2% on the Wikipedia XML data set. SF-IDF can be used with existing feature-vector-based information retrieval algorithms, analyzes text similarity more efficiently than TF-IDF, and can improve text classification accuracy.
【Key words】 Word2vector; context; characteristic representation; text classification
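The pipeline described in the abstract (word vectors, clustering, centroid substitution, frequency weighting) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the 2-D `WORD_VECTORS` are toy values rather than trained Word2vector output, plain k-means stands in for the paper's "optimized fitness value partition" (which the abstract does not specify), and the weighting in `sf_idf` is simply TF-IDF applied to cluster counts, which may differ from the paper's exact SF-IDF formula.

```python
import math
from collections import Counter

# Toy 2-D "word vectors" (hypothetical values); real ones would come from
# a trained Word2vector model.
WORD_VECTORS = {
    "bank": (0.9, 0.1),
    "loan": (0.8, 0.2),
    "credit": (0.85, 0.15),
    "match": (0.1, 0.9),
    "goal": (0.2, 0.8),
    "team": (0.15, 0.85),
}


def nearest(vec, centroids):
    """Index of the centroid closest to vec (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, centroids[i])),
    )


def kmeans(vectors, centroids, iterations=10):
    """Plain k-means, used here as a stand-in for the paper's clustering step."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for vec in vectors:
            groups[nearest(vec, centroids)].append(vec)
        # Recompute each centroid as the mean of its group; keep it if empty.
        centroids = [
            tuple(sum(dim) / len(g) for dim in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids


def sf_idf(documents, word_to_cluster, n_clusters):
    """One vector per document: cluster frequency times inverse document
    frequency of the cluster (assumed analogue of TF-IDF)."""
    counts = [
        Counter(word_to_cluster[w] for w in doc if w in word_to_cluster)
        for doc in documents
    ]
    n_docs = len(documents)
    doc_freq = [sum(1 for c in counts if c[k] > 0) for k in range(n_clusters)]
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1
        vectors.append([
            (c[k] / total) * math.log(n_docs / doc_freq[k]) if doc_freq[k] else 0.0
            for k in range(n_clusters)
        ])
    return vectors


centroids = kmeans(list(WORD_VECTORS.values()), [(1.0, 0.0), (0.0, 1.0)])
word_to_cluster = {w: nearest(v, centroids) for w, v in WORD_VECTORS.items()}
docs = [
    ["bank", "loan", "credit"],
    ["goal", "team", "match"],
    ["bank", "team", "goal"],
]
vectors = sf_idf(docs, word_to_cluster, len(centroids))
```

In this sketch, words with similar vectors share a cluster, so "bank" and "loan" contribute to the same dimension of the document vector even though they are different terms. That is the property the abstract contrasts with purely term-based TF-IDF vectors.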
- 【Source】 Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition), 2018, Issue 02
- 【Classification No.】 TP391.1
- 【Citations】 48
- 【Downloads】 701