Research on line text classification based on TF-IDF and word2vec
【Abstract】 To improve the accuracy of text classification, a classification method based on the word2vec averaging algorithm and an improved term frequency-inverse document frequency (TF-IDF) algorithm is proposed. The method targets two kinds of imbalance in the line (dialogue) text of health TV programs: the number of samples differs across categories, and the number of words differs across samples. By introducing information entropy and correction factors, the method alleviates the adverse effect of data imbalance on classification accuracy and recall. Experimental results show that, compared with the word2vec averaging model, the proposed method improves accuracy and recall by 7.3% and 10.5%, respectively.
【Key words】 term frequency-inverse document frequency (TF-IDF); word2vec; information entropy; text classification; machine learning; weight
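
The general idea of combining TF-IDF weights with averaged word vectors can be illustrated with a minimal sketch. It is not the paper's method: the abstract does not give the formulas for the information-entropy term or the correction factors, so the sketch assumes a plain smoothed TF-IDF weight instead. The function names `idf_table` and `weighted_doc_vector`, the toy corpus, and the two-dimensional `TOY_VECTORS` (standing in for trained word2vec embeddings) are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical 2-D vectors standing in for trained word2vec embeddings.
TOY_VECTORS = {
    "sleep": [1.0, 0.0],
    "diet": [0.0, 1.0],
    "health": [0.5, 0.5],
    "doctor": [0.8, 0.2],
}

# Hypothetical tokenized corpus; each inner list is one line of dialogue.
TOY_DOCS = [
    ["sleep", "health", "doctor"],
    ["diet", "health", "sugar"],
    ["sleep", "sleep", "diet"],
]


def idf_table(docs):
    """Smoothed inverse document frequency for every token in the corpus."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}


def weighted_doc_vector(doc, vectors, idf):
    """TF-IDF-weighted mean of the vectors of the known tokens in `doc`.

    Assumption: standard TF-IDF weights. The paper's improved weights
    (information entropy, correction factors) are not reproduced here.
    """
    dim = len(next(iter(vectors.values())))
    tf = Counter(tok for tok in doc if tok in vectors)
    total = sum(tf.values())
    acc = [0.0] * dim
    weight_sum = 0.0
    for tok, count in tf.items():
        w = (count / total) * idf.get(tok, 1.0)
        weight_sum += w
        for i, x in enumerate(vectors[tok]):
            acc[i] += w * x
    if weight_sum == 0.0:
        return acc  # no known tokens: return the zero vector
    return [a / weight_sum for a in acc]


idf = idf_table(TOY_DOCS)
doc_vectors = [weighted_doc_vector(d, TOY_VECTORS, idf) for d in TOY_DOCS]
```

With uniform weights this reduces to the plain word2vec averaging baseline that the abstract compares against. The resulting document vectors would then be passed to any standard classifier.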
- 【Source】 Journal of Shanghai Normal University (Natural Sciences), 2020, Issue 01
- 【Classification No.】 TP391.1