Content Organization and Topic Monitoring on the News Platform (新闻类信息的组织和话题监控)
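【Author】 Zhang Yong (张勇);
【Advisor】 Liu Ruifang (刘瑞芳);
【Author information】 Beijing University of Posts and Telecommunications, Signal and Information Processing, 2014, Master's degree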
【Abstract】 This thesis addresses online news platforms. It applies natural language processing and machine learning algorithms to content organization and topic monitoring, so that users can conveniently locate the information that interests them. With this text processing system, users can collect real-time news, customize the news they like, and drill down by category to find the articles they want to read. They can also discover current hot topics and track how topics of interest develop. The work first uses traditional text processing methods for news organization, user channel customization, and topic detection: a text classifier automatically selects the news that interests a user, and text processing algorithms such as Single-pass, NMF, and LDA detect topics in historical news. The thesis then proposes a series of new solutions for the news platform. News organization is based on the HFTC algorithm, which automatically builds a hierarchical clustering structure of news and helps users find news by categories that carry semantic descriptions. Topic detection is based on WBN-FTC, which overcomes a shortcoming of the FTC algorithm, namely that its support threshold is difficult to choose. WBN-FTC not only detects topics effectively, as LDA does, but is also free of the limitations of the VSM model, which gives it better time performance on massive data; in addition, the granularity of topic detection can be set by adjusting parameters. On the engineering side, the thesis implements the mining algorithms on top of search engine technology, which both improves system efficiency and reduces programming cost. Finally, the thesis proposes two topic tracking schemes, based on query expansion and on combined classifiers respectively, and proposes using time series features for topic prediction and pattern recognition. Together, these contributions open up broader application prospects for the topic monitoring field.
【Key words】 news organization; topic detection and tracking; HFTC; WBN-FTC; topic dynamics;
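
The abstract names Single-pass as one of the traditional algorithms used for topic detection on historical news. The record gives no implementation details, so the following is only a minimal sketch of the general Single-pass clustering idea, not the thesis's own code. The function names (`tokenize`, `cosine`, `single_pass`), the whitespace-style tokenizer, the summed-count centroid, and the threshold value of 0.3 are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def single_pass(docs, threshold=0.3):
    """Assign each document, in arrival order, to a topic cluster.

    Returns one cluster id per document. Each cluster is represented by
    the summed term counts of its members (an assumption; other centroid
    definitions are possible).
    """
    centroids = []  # one Counter per cluster
    labels = []
    for doc in docs:
        vec = Counter(tokenize(doc))
        best_id, best_sim = -1, 0.0
        for cid, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best_id, best_sim = cid, sim
        if best_id >= 0 and best_sim >= threshold:
            centroids[best_id].update(vec)  # fold the document into the topic
            labels.append(best_id)
        else:
            centroids.append(vec)  # no topic is close enough: start a new one
            labels.append(len(centroids) - 1)
    return labels
```

Each document is read exactly once and compared only with the existing topic centroids, which is why the method suits streaming news. The threshold controls granularity: a higher value splits the stream into more, narrower topics.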