News Thread and Theme Detection (新闻线索与主题探测)

【Author】 Li Feng (李峰)

【Advisor】 Li Fang (李芳)

【Author information】 Shanghai Jiao Tong University, Computer Application Technology, 2008, Master's thesis

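【Abstract (translated from Chinese)】 Online news has become the main channel through which people obtain information and follow world events in the internet age. Like other online resources, however, it suffers from information overload. Search engines, a successful application of information retrieval, have gone some way toward helping users quickly find the information they want: with powerful search functions, users can quickly locate news related to the events they care about. But for events with wide impact and complex backgrounds, simply listing and organizing news (news classification) no longer meets people's need to absorb and understand information quickly. Automatically discovering the structure within a topic, so that users can grasp the overall picture and the course of an event and sort out its internal logic, has become an active research area. This thesis first sets out what topic structuring involves and gives its own definition, under which the automatic organization of a topic covers four points: a topic has multiple development threads; each thread forms multiple event themes; themes are linked by causal and elaboration relations; and different themes have different influence (importance). Following this definition, we first obtain the threads of a topic using single-pass incremental clustering based on named entities, for which we propose a hybrid-linkage algorithm distinct from single linkage and complete linkage; we then apply NMF clustering within each thread to obtain the themes it contains; next we compute the relevance between themes from their similarity and whether they co-occur; finally we derive the importance of each theme from the number of news items it contains and the number of themes related to it. We implemented a prototype system combining RSS, the publishing technology commonly used for online news, with the open-source full-text search engine Lucene, and designed a series of experiments that verify the effectiveness of the two core clustering algorithms.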

【Abstract】 The internet is now the main source from which people gather information and learn about the world. Like other internet resources, it suffers from "information overload", sometimes called the "information sea". Search engines, which are successful applications of information retrieval, have taken the first step in helping people find the information they need. With the aid of many news web sites, people can get news rapidly. However, when hot events or breaking news appear, merely listing the related news by time is not enough for people to understand the whole event. How to organize the news and find the structure of topics is an active research area; the aim is to help people understand the sequence, causes, and effects of events. In this thesis we first define the topic structure: a topic consists of several developing threads; every thread contains several themes; event themes are linked by two relations, causation and elaboration; and different themes have different importance. According to this definition, we propose a method to obtain the threads and themes of a topic from the results of two-level clustering. The first clustering uses a single-pass method based on named entity recognition, for which we propose a new hybrid linkage method that combines single linkage and complete linkage. The second clustering is based on NMF (nonnegative matrix factorization). We then calculate the relevance between themes by comparing their content and checking their co-occurrence. Finally, we calculate the importance of each theme from the number of news items belonging to it and the number of related themes. We implement a prototype that combines RSS technology with the open-source full-text search project Lucene. Our experiments show that the proposed clustering methods are well suited to our system.
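
The first clustering stage named in the abstract (single-pass incremental clustering over named entities, with a linkage that combines single and complete linkage) can be illustrated with a minimal Python sketch. The abstract does not give the thesis's actual formulas, so everything specific here is an assumption made for illustration: Jaccard overlap of entity sets as the similarity measure, a weighted average of the single-linkage (maximum) and complete-linkage (minimum) scores as the hybrid linkage, and the names `jaccard`, `hybrid_similarity`, `single_pass_cluster`, `alpha`, and `threshold`.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap of two named-entity sets (assumed similarity measure)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def hybrid_similarity(entities: Set[str],
                      members: List[Set[str]],
                      alpha: float = 0.5) -> float:
    """Assumed hybrid linkage: blend of single and complete linkage.

    Single linkage scores a cluster by its most similar member (max),
    complete linkage by its least similar member (min). The weighting
    by `alpha` is an illustrative choice, not the thesis's definition.
    """
    sims = [jaccard(entities, m) for m in members]
    return alpha * max(sims) + (1.0 - alpha) * min(sims)


def single_pass_cluster(docs: List[Set[str]],
                        threshold: float = 0.3,
                        alpha: float = 0.5) -> List[List[int]]:
    """Assign each document, in arrival order, to one cluster.

    Each document is compared once against the existing clusters and
    joins the best-scoring one if the score reaches `threshold`;
    otherwise it starts a new cluster. Returns lists of document indices.
    """
    clusters: List[List[int]] = []
    for idx, entities in enumerate(docs):
        best_cluster, best_sim = -1, 0.0
        for c, member_ids in enumerate(clusters):
            members = [docs[j] for j in member_ids]
            sim = hybrid_similarity(entities, members, alpha)
            if sim > best_sim:
                best_cluster, best_sim = c, sim
        if best_cluster >= 0 and best_sim >= threshold:
            clusters[best_cluster].append(idx)
        else:
            clusters.append([idx])
    return clusters
```

Under these assumptions, five news items whose entity sets are {Beijing, Olympics, IOC}, {Beijing, Olympics, torch}, {Wenchuan, earthquake, Sichuan}, {Sichuan, earthquake, rescue}, and {Olympics, IOC, torch} fall into two threads with the default parameters: items 0, 1, and 4 in one, items 2 and 3 in the other. Setting `alpha` to 1 or 0 reduces the sketch to pure single or pure complete linkage.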

  • 【Classification number】TP391.3
  • 【Citation count】1
  • 【Download count】248