基于特征降维和语义拓展的短文本分类方法研究
Research on Short Text Classification Method Based on Feature Reduction and Semantic Extension
【Author】 Zhou Ming (周明);
【Author information】 Hefei University of Technology, Computer Application Technology, 2020, Master's thesis
【Abstract】 With the growth of the Internet, and of online social networking in particular, short texts have become a mainstream form of text data. Compared with traditional text, short texts are individually brief but very large in volume, so high dimensionality and sparsity are the first challenges to face when mining short text data. Short texts also carry little semantic information and are often ambiguous, which makes it difficult for traditional text mining methods to complete classification tasks efficiently and accurately. How to further compress the feature dimension of the text, extend its original semantic information, and thereby improve short text representation and classification performance has therefore become a research focus in short text mining. This thesis studies short text classification in view of these problems, and its main work is as follows:

(1) To address the high dimensionality and sparsity of short text data, a short text classification method based on signed hash feature reduction is proposed. The method first preprocesses the short texts, using an improved multi-threaded jieba-fast segmenter to split the text into words and removing stop words to improve text representation. Next, to reduce the dimensionality of massive short text collections, a signed hash mapping projects the high-dimensional short texts into a vector space of fixed dimension, stores the text content as a sparse matrix, and distinguishes texts that could otherwise be ambiguous. Finally, a random forest is used as the classification model. Experimental results show that the proposed method performs well in short text classification accuracy while achieving a good balance between hardware consumption and model accuracy.

(2) To address the poor text representation caused by the limited semantic information in short texts, a fuzzy semantic extension classification model based on hierarchical clustering and LSTM is proposed. First, Skip-Gram is used to train word vectors on the data set, and hierarchical clustering is performed in the word embedding space. The cluster center vectors are then fuzzily matched against the word vectors of an external corpus according to semantic similarity, yielding a text representation that contains semantic information. Next, an LSTM (Long Short-Term Memory) network performs high-level feature extraction, a stochastic pooling layer extracts global features and further reduces dimensionality, and a softmax layer outputs the classification result. Experimental results show that this method effectively supplements the semantic information of short texts and produces classification results of relatively high accuracy.
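The signed hash mapping described in the first contribution can be illustrated with a minimal sketch. It assumes the common "hashing trick" with a sign bit: each token is hashed to a bucket in a fixed-width vector, and a second part of the hash decides whether the token adds +1 or -1, so that colliding tokens tend to cancel rather than pile up. The abstract does not give the thesis's exact hash function, dimension, or collision handling, so the function name `signed_hash_vector`, the use of MD5, and the parameter `n_features` are all illustrative assumptions. The segmentation and random forest steps are omitted.

```python
import hashlib
from collections import defaultdict


def signed_hash_vector(tokens, n_features=16):
    """Map a list of tokens to a sparse {index: value} vector of fixed width.

    Hypothetical helper for illustration only; not the thesis's implementation.
    """
    if n_features <= 0:
        raise ValueError("n_features must be positive")
    vec = defaultdict(int)
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        # First 4 bytes choose the bucket in the fixed-dimension space.
        index = int.from_bytes(digest[:4], "big") % n_features
        # One bit of a different byte chooses the sign (+1 or -1).
        sign = 1 if digest[4] & 1 == 0 else -1
        vec[index] += sign
    # Keep only non-zero entries, i.e. a sparse representation.
    return {i: v for i, v in vec.items() if v != 0}


if __name__ == "__main__":
    # Tokens as they might look after segmentation and stop-word removal.
    print(signed_hash_vector(["短文本", "分类", "哈希"], n_features=8))
```

The output width depends only on `n_features`, not on vocabulary size, which is what keeps memory use bounded for large short text collections. Stacking such sparse rows gives the kind of sparse matrix that a classifier such as a random forest could then consume.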
【Key words】 short text classification; hash map; random forest; hierarchical clustering; semantic extension
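The stochastic pooling layer named in the second contribution can be sketched in one dimension as follows. This follows the usual formulation of stochastic pooling: within each pooling region, one activation is sampled with probability proportional to its value during training, and the probability-weighted average is used at inference. It assumes non-negative activations, which the abstract does not state, and the function name `stochastic_pool_1d` and its parameters are hypothetical. How the thesis connects this layer to the LSTM outputs is not specified in the abstract and is not modelled here.

```python
import random


def stochastic_pool_1d(activations, pool_size, rng=None, training=True):
    """Pool a 1-D sequence of non-negative activations region by region.

    Illustrative sketch only; region layout and inputs are assumptions.
    """
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    if any(a < 0 for a in activations):
        raise ValueError("stochastic pooling assumes non-negative activations")
    rng = rng or random.Random()
    pooled = []
    for start in range(0, len(activations), pool_size):
        region = activations[start:start + pool_size]
        total = sum(region)
        if total == 0:
            # All-zero region: nothing to sample from.
            pooled.append(0.0)
            continue
        probs = [a / total for a in region]
        if training:
            # Sample one activation with probability proportional to its value.
            pooled.append(rng.choices(region, weights=probs, k=1)[0])
        else:
            # Inference: probability-weighted average of the region.
            pooled.append(sum(p * a for p, a in zip(probs, region)))
    return pooled


if __name__ == "__main__":
    features = [1.0, 3.0, 0.0, 0.0, 2.0, 2.0]
    print(stochastic_pool_1d(features, pool_size=2, rng=random.Random(0)))
    print(stochastic_pool_1d(features, pool_size=2, training=False))
```

Each region of `pool_size` values collapses to one value, so the output is shorter than the input by roughly that factor, which is the dimensionality reduction the abstract refers to.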