Research on Multi-Level Topic Clustering Based on Cross Degree (基于交叉度的多级话题聚类研究)
【Author】 Liu Bing (刘兵);
【Advisor】 Kong Bing (孔兵);
【Author information】 Yunnan University, Computer System Architecture, 2017, Master's
【Abstract】 Identifying and detecting hot topics has long been a focus of research and is a principal method of public opinion monitoring. The development of the Internet has made daily life more convenient, but some offenders exploit its convenience and rapid propagation to spread false and harmful news, which damages social stability. This thesis studies hot topic discovery: clustering the news data scattered across the network in order to find current hot events, monitor how they develop and change, and respond in time. An earthquake, for example, is usually accompanied by work on several fronts, such as rescuing the affected people, preventing epidemics, delivering relief supplies, and restoring infrastructure. Each of these is a hot topic in its own right, and only together do they give a complete description of the earthquake event. A traditional topic clustering algorithm may place the news on all of these aspects into a single topic, yielding only a general report on the earthquake, which is not an ideal result. Topic clustering should reflect both the specific branch topics and the fact that these branch topics belong to one event. This thesis therefore proposes multi-level topic clustering, that is, re-clustering on the basis of the original (first-level) topics. First, to address the dimension explosion to which the topic model is prone, a dynamic weight method is proposed that dynamically changes the weight of each feature word and removes the word once its weight falls below a threshold; the method effectively reduces the dimension of the topic model while maintaining accuracy. Second, an improved single-pass algorithm performs first-level clustering on the data set to obtain the sub-topics of each event. Third, cross degree is introduced to compute the similarity between topics: the cross degree algorithm can be applied to any two topic classes to obtain a similarity value, which determines whether the two are similar enough to be merged. Finally, the multi-level topic clustering algorithm based on cross degree clusters similar sub-topics together again, revealing the relationships between them. Experimental results show that the proposed algorithm is effective: the vector dimension drops markedly after the dynamic weight algorithm is applied, the similarity calculation based on topic cross degree is more accurate, and the clustering results agree better with the actual situation.
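The two-stage pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions: the first stage is the classic single-pass algorithm with cosine similarity over raw term counts, not the thesis's improved variant, and the dynamic weight method is omitted. The abstract does not define cross degree, so the `overlap` function below is a hypothetical stand-in (Jaccard overlap of each topic's top terms). All function names, parameters, and thresholds are illustrative only.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def single_pass(docs, threshold):
    """Classic single-pass clustering (not the thesis's improved version).

    Each document joins the most similar existing cluster, or starts a
    new cluster when no similarity reaches the threshold.
    """
    clusters = []  # each: {"centroid": Counter, "members": [doc indices]}
    for i, tokens in enumerate(docs):
        vec = Counter(tokens)
        best, best_sim = None, 0.0
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= threshold:
            best["centroid"].update(vec)  # accumulate term counts
            best["members"].append(i)
        else:
            clusters.append({"centroid": vec, "members": [i]})
    return clusters

def overlap(c1, c2, top_k=5):
    """ASSUMED stand-in for cross degree: Jaccard overlap of top-k terms."""
    t1 = {t for t, _ in c1["centroid"].most_common(top_k)}
    t2 = {t for t, _ in c2["centroid"].most_common(top_k)}
    union = t1 | t2
    return len(t1 & t2) / len(union) if union else 0.0

def second_level(clusters, threshold, top_k=5):
    """Group first-level topics whose overlap reaches the threshold.

    Returns lists of first-level cluster indices.
    """
    groups = []
    for i, c in enumerate(clusters):
        for g in groups:
            if any(overlap(c, clusters[j], top_k) >= threshold for j in g):
                g.append(i)
                break
        else:
            groups.append([i])
    return groups

docs = [
    ["earthquake", "rescue", "survivors", "team"],
    ["earthquake", "rescue", "team", "trapped"],
    ["earthquake", "epidemic", "prevention", "water"],
    ["earthquake", "epidemic", "prevention", "disinfection"],
    ["football", "league", "final", "goal"],
]
topics = single_pass(docs, threshold=0.6)
print([t["members"] for t in topics])        # [[0, 1], [2, 3], [4]]
print(second_level(topics, threshold=0.1))   # [[0, 1], [2]]
```

In this toy run the rescue and epidemic-prevention reports form separate first-level topics, and the second pass links them as parts of the same earthquake event while leaving the unrelated football topic on its own.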
【Key words】 topic discovery; text clustering; single-pass algorithm; vector space model;
- 【Online publication contributor】 Yunnan University 【Online publication year/issue】 2019, Issue 05
- 【Classification number】 TP391.1