Research on line text classification based on TF-IDF and word2vec

【Authors】 DAN Yuhao; HUANG Jifeng; YANG Lin; GAO Hai

【Corresponding author】 HUANG Jifeng

【Affiliations】 College of Information, Mechanical and Electrical Engineering, Shanghai Normal University; Shanghai Development Center of Computer Software Technology; Shanghai Gaochuang Computer Technology Co., Ltd.

【Abstract】 To improve the accuracy of text classification, this paper proposes a method that combines the word2vec averaging algorithm with an improved term frequency-inverse document frequency (TF-IDF) algorithm. The method targets two imbalances in line (script) text from health TV programs: the number of samples differs across categories, and the number of words differs across samples. By introducing information entropy and correction factors, it alleviates the adverse effect of data imbalance on classification accuracy and recall. Experimental results show that, compared with the word2vec averaging model, the proposed method improves accuracy and recall by 7.3% and 10.5%, respectively.
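
The abstract names the ingredients of the method (word2vec averaging, TF-IDF weighting, an information-entropy term, and correction factors) but does not give the formulas. The sketch below only illustrates the general idea of a weighted word-vector average and is not the authors' algorithm. Everything specific in it is an assumption: the entropy factor (one minus the normalized entropy of a term's class distribution), the class-size correction, the smoothed IDF, the function names, and the random toy embeddings that stand in for trained word2vec vectors.

```python
import math
from collections import Counter

import numpy as np


def idf_weights(docs):
    """Smoothed inverse document frequency for every term in a tokenized corpus."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}


def entropy_weights(docs, labels):
    """Assumed entropy factor: 1 - normalized entropy of a term's class distribution.

    Term counts are divided by class size first, so that large classes do not
    dominate (an assumed stand-in for the paper's correction factors).
    """
    classes = sorted(set(labels))
    class_sizes = Counter(labels)
    per_term = {}
    for doc, label in zip(docs, labels):
        for term in doc:
            per_term.setdefault(term, Counter())[label] += 1
    max_entropy = math.log(len(classes)) if len(classes) > 1 else 1.0
    weights = {}
    for term, per_class in per_term.items():
        rates = np.array([per_class[c] / class_sizes[c] for c in classes])
        probs = rates / rates.sum()
        nonzero = probs[probs > 0]
        entropy = float(-(nonzero * np.log(nonzero)).sum())
        # Clamp to guard against tiny negative values from rounding.
        weights[term] = max(0.0, 1.0 - entropy / max_entropy)
    return weights


def doc_vector(doc, embeddings, idf, ent, dim):
    """Weighted average of word vectors; weight = tf * idf * entropy factor."""
    vec = np.zeros(dim)
    total = 0.0
    for term, count in Counter(doc).items():
        if term not in embeddings:
            continue
        w = (count / len(doc)) * idf.get(term, 1.0) * ent.get(term, 1.0)
        vec += w * embeddings[term]
        total += w
    if total > 0:
        # Dividing by the total weight keeps long and short documents comparable.
        return vec / total
    # Fall back to the plain mean when every weight is zero.
    known = [embeddings[t] for t in doc if t in embeddings]
    return np.mean(known, axis=0) if known else vec


# Toy, imbalanced corpus: two "illness" samples and one "lifestyle" sample.
docs = [
    ["fever", "cough", "doctor"],
    ["cough", "medicine", "doctor"],
    ["diet", "exercise", "doctor"],
]
labels = ["illness", "illness", "lifestyle"]

DIM = 4
rng = np.random.default_rng(0)
vocab = sorted({t for d in docs for t in d})
# Random vectors stand in for trained word2vec embeddings (assumption).
embeddings = {t: rng.normal(size=DIM) for t in vocab}

idf = idf_weights(docs)
ent = entropy_weights(docs, labels)
vectors = [doc_vector(d, embeddings, idf, ent, DIM) for d in docs]
```

In this toy corpus the term "doctor" occurs at the same per-sample rate in both classes, so its entropy factor is zero and it drops out of the average, while class-specific terms such as "fever" keep full weight. The resulting `vectors` could then be passed to any standard classifier.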

【Funding】 Shanghai Municipal Scientific Research Program Project (17DZ2292100)
  • 【Source】 Journal of Shanghai Normal University (Natural Sciences), 2020, Issue 01
  • 【Classification code】 TP391.1
  • 【Times cited】 10
  • 【Downloads】 369