冬奥会新闻文本采集及分类分析系统的设计与实现
Design and Implementation of the Winter Olympics News Text Collection and Classification Analysis System
【Author】 Liu Na (刘娜);
【Author information】 Hebei University of Engineering (河北工程大学), Computer Technology (professional degree), 2020, Master's
【Abstract】 As internet technology develops, the volume of online information keeps growing. Most of this data is text, but it is widely scattered, complex in content, and only coarsely categorized, which makes online information hard to collect and analyze. To address the difficulty of data collection and the coarseness of text classification, this thesis designs and implements, in Python, a Winter Olympics news text collection and classification analysis system based on focused crawling and text classification. The system has three functional modules: data collection, data classification, and data visualization. In the data collection module, a focused crawler is customized to collect news text related to the Winter Olympics. The collected data supports the classification and analysis of Winter Olympics information and provides a preliminary integration of Winter Olympics information on the web. The data classification module has two parts: data filtering and text classification. To filter out irrelevant information, local density and similarity are introduced into the nearest-neighbor algorithm SNN, yielding an adaptive SNN algorithm based on local density and similarity (AK-SNN). To evaluate AK-SNN, comparative experiments were run on UCI datasets and on the Winter Olympics news text dataset; the results show that AK-SNN achieves better robustness and prediction accuracy. To subdivide the text data into finer categories, an extreme learning machine (ELM) is used as the text classifier for multi-class classification, and the results show that ELM achieves good accuracy on multi-class text classification. In the data visualization module, a web interface built with the Django framework displays the collection and classification results. To uncover the potential value in the information, the classification results, news sources, news release dates, and other aspects are analyzed, and the analysis results are visualized. This work provides data and technical support for collecting and analyzing online information about the 2022 Winter Olympics, and offers a reference approach for mining potentially valuable information from online news texts about large-scale sports events.
【Key words】 2022 Winter Olympics; text data; focused crawlers; text classification; data analysis; data visualization;
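
The abstract names an extreme learning machine (ELM) as the multi-class text classifier but gives no implementation details. The following is a minimal sketch of a generic ELM, not the thesis's code: the hidden-layer weights are drawn at random and left untrained, and only the output weights are solved in closed form. The class name `ELMClassifier`, the helper `_solve`, the tanh activation, the ridge term, the hidden-layer size, and the toy 2-D vectors (standing in for whatever text features the system actually uses) are all assumptions made for illustration. It is written in dependency-free Python; a practical version would likely use a numerical library.

```python
import math
import random


def _solve(a, b):
    """Solve a @ x = b (a: n x n, b: n x m) by Gaussian elimination
    with partial pivoting. Hypothetical helper for this sketch only."""
    n = len(a)
    m = len(b[0])
    # Work on an augmented copy so the caller's matrices are untouched.
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + m):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution, one right-hand-side column at a time.
    x = [[0.0] * m for _ in range(n)]
    for row in range(n - 1, -1, -1):
        for j in range(m):
            tail = sum(aug[row][k] * x[k][j] for k in range(row + 1, n))
            x[row][j] = (aug[row][n + j] - tail) / aug[row][row]
    return x


class ELMClassifier:
    """Single-hidden-layer ELM: random hidden weights, solved output weights."""

    def __init__(self, n_hidden=20, ridge=1e-3, seed=0):
        self.n_hidden = n_hidden  # assumed size, not taken from the thesis
        self.ridge = ridge        # small regulariser keeps the system solvable
        self.rng = random.Random(seed)

    def _hidden(self, X):
        # Hidden activations tanh(w . x + b) for every sample and hidden unit.
        return [
            [math.tanh(sum(w * v for w, v in zip(ws, x)) + b)
             for ws, b in zip(self.W, self.b)]
            for x in X
        ]

    def fit(self, X, y):
        n_features = len(X[0])
        size = self.n_hidden
        self.classes = sorted(set(y))
        # Random input weights and biases; these are never trained.
        self.W = [[self.rng.uniform(-1, 1) for _ in range(n_features)]
                  for _ in range(size)]
        self.b = [self.rng.uniform(-1, 1) for _ in range(size)]
        H = self._hidden(X)
        # One-hot targets, one column per class.
        T = [[1.0 if label == c else 0.0 for c in self.classes] for label in y]
        # Ridge-regularised normal equations: (H^T H + ridge * I) beta = H^T T.
        HtH = [[sum(h[i] * h[j] for h in H) + (self.ridge if i == j else 0.0)
                for j in range(size)] for i in range(size)]
        HtT = [[sum(h[i] * t[c] for h, t in zip(H, T))
                for c in range(len(self.classes))] for i in range(size)]
        self.beta = _solve(HtH, HtT)
        return self

    def predict(self, X):
        preds = []
        for h in self._hidden(X):
            scores = [sum(hk * bk[c] for hk, bk in zip(h, self.beta))
                      for c in range(len(self.classes))]
            preds.append(self.classes[scores.index(max(scores))])
        return preds


# Toy data: three well-separated clusters with made-up class ids 0, 1, 2.
X_train = [[0.0, 0.2], [0.3, -0.1], [5.0, 5.2],
           [5.3, 4.8], [-5.0, 5.1], [-4.7, 4.9]]
y_train = [0, 0, 1, 1, 2, 2]
model = ELMClassifier().fit(X_train, y_train)
print(model.predict(X_train))
```

Because only the output weights are fitted, training reduces to a single linear solve, which is the usual reason given for choosing an ELM over an iteratively trained network.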