Research on Short Text Classification Based on Generalization and Memorization
【Author】 Zhang Shuai
【Supervisor】 Zhou Guoqiang
【Author Information】 Nanjing University of Posts and Telecommunications, Computer Technology (professional degree), Master's thesis, 2019
【Abstract】 With the spread of the Internet and the rapid advance of its hardware, the number of short texts is growing explosively, especially on social networking platforms with huge user bases such as Twitter, Facebook, and Weibo. These services now count billions of users, and the daily comments of active users in particular keep driving up the volume of short text. There is therefore an urgent need for automatic language understanding techniques to process and analyze these texts. Among such techniques, text classification has proven to be a fundamental and critical natural language processing task that is useful in many scenarios; for short texts with few characters, however, how fully their information is exploited largely determines classification accuracy. At present, the mainstream approaches to short text classification fall into two groups: traditional machine learning methods and deep learning methods. Traditional machine learning methods suffer from high-dimensional sparse text representations, complex feature engineering, and difficult classifier selection, which leads to unsatisfactory results on short texts. Deep learning methods alleviate these three problems to some extent, but they still do not make full use of the local correlation information in text. Motivated by these problems, this thesis exploits the strengths of memorized information, which records the correlations and co-occurrences of known features, and of generalized information, which is low-dimensional, dense, and able to represent unseen new features, and proposes a short text classification technique based on generalization and memorization. By integrating generalized and memorized information into a deep CNN model, the GM-CNN model is proposed; GM-CNN makes fuller use of the text information, and its experimental results are better than those of several existing baseline models. After proposing GM-CNN, the thesis studies issues of the model that remain to be optimized and improves it with batch normalization and one-dimensional chunk-max pooling, yielding the IGM-CNN model. Experimental results show that IGM-CNN achieves better classification performance than GM-CNN. Experiments on the number of chunks used in chunk-max pooling are also carried out, so that the number of parameters and the complexity of the model can be minimized while maintaining good classification performance.
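The core idea of combining memorized and generalized information can be illustrated with a minimal sketch. This is not the thesis's actual GM-CNN architecture; all names, dimensions, and the averaged-embedding simplification are hypothetical. A sparse indicator vector over known tokens/n-grams (memorization: exact co-occurrence of known features) is concatenated with a dense, low-dimensional embedding vector (generalization: can represent unseen combinations) before being fed to a classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = 1000   # hypothetical n-gram vocabulary size (memorization side)
EMB_DIM = 50   # hypothetical embedding dimension (generalization side)

# Dense embedding table for the generalization view.
embeddings = rng.normal(size=(VOCAB, EMB_DIM))

def memorization_features(token_ids, vocab=VOCAB):
    """Sparse, high-dimensional indicator vector: which known
    tokens/n-grams occur in the text (records co-occurrence exactly)."""
    v = np.zeros(vocab)
    v[token_ids] = 1.0
    return v

def generalization_features(token_ids):
    """Dense, low-dimensional vector: here simply the average of the
    token embeddings (a CNN over embeddings plays this role in GM-CNN)."""
    return embeddings[token_ids].mean(axis=0)

def gm_features(token_ids):
    """Concatenate both views so the classifier sees memorized and
    generalized information jointly."""
    return np.concatenate([memorization_features(token_ids),
                           generalization_features(token_ids)])

text = [3, 17, 17, 250]        # toy tokenized short text
feats = gm_features(text)
print(feats.shape)             # (1050,) = VOCAB + EMB_DIM
```

The design point is that the two views fail differently: the sparse view cannot score feature combinations it has never seen, while the dense view can, at the cost of over-smoothing exact co-occurrences, so concatenating them lets the classifier use whichever signal is available.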
【Key words】 Short Text Classification (STC); Generalization; Memorization; Deep Learning; Batch Normalization; Chunk-Max Pooling
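The one-dimensional chunk-max pooling mentioned above can be sketched as follows (a minimal illustration, not the thesis's exact implementation): the feature map produced by a convolution filter is split into k contiguous chunks and only the maximum of each chunk is kept, so the output length is fixed at k regardless of input length, which bounds the parameter count of the following fully connected layer.

```python
import numpy as np

def chunk_max_pool(feature_map, k):
    """Split a 1-D feature map into k contiguous chunks and keep each
    chunk's maximum.

    Unlike global max pooling (k = 1), this preserves coarse positional
    information; unlike keeping the whole map, the output size depends
    only on k, which controls downstream model complexity.
    """
    chunks = np.array_split(feature_map, k)   # near-equal contiguous chunks
    return np.array([c.max() for c in chunks])

fm = np.array([0.1, 0.9, 0.3, 0.2, 0.8, 0.4])  # toy conv-filter output
print(chunk_max_pool(fm, 3))                   # [0.9 0.3 0.8]
print(chunk_max_pool(fm, 1))                   # [0.9]  (global max pooling)
```

This also makes concrete the abstract's trade-off experiment: a larger k keeps more positional detail but enlarges the next layer, while a smaller k shrinks the model at the risk of losing information.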