
Research on Web Document Clustering Based on Knowledge Granularity

Web Document Clustering Based on Knowledge Granularity

【Author】 黄发良 (Huang Faliang)

【Supervisors】 张师超 (Zhang Shichao); 严小卫 (Yan Xiaowei)

【Author Information】 Guangxi Normal University, Computer Software and Theory, 2005, Master's degree

【Abstract】 The rapidly growing Internet (WWW) has profoundly changed people's lives; it has become a mainstream channel for exchanging ideas and acquiring information. A vast amount of valuable knowledge lies hidden in this boundless ocean of Web data, and retrieving useful knowledge from this massive data source quickly and efficiently is a problem that all users, enterprises and individuals alike, must face and solve. Consequently, applying data mining techniques to the Web (Web data mining) has become an important research focus in data analysis and has attracted wide attention from experts and scholars. After roughly a decade of growth, Web data mining has produced rich results; many related techniques have matured and been applied successfully in practice. For example, search engines bring great convenience to information seekers, and e-commerce offers industry a completely new mode of business.

Compared with traditional data, Web data has complex structure, diverse forms, and wide-ranging content, and users' functional requirements for Web data mining vary greatly, which poses greater challenges to the data analysis field. Web data mining can be roughly divided into three parts: content mining, usage mining, and structure mining. The main techniques employed include association analysis, time sequence analysis, and clustering analysis. Among these, Web data clustering is a core foundational research topic in Web data mining. Clustering analysis compresses the search space and speeds up retrieval; it helps knowledge workers find the documents most similar to a given document efficiently and accurately, improves the recall and precision of information retrieval systems, and substantially improves the personalization of search engines. The most common and most important form of data on the Web is the Web document expressed in a markup language, so clustering Web documents is important and valuable work.

Building on a thorough understanding of existing Web data mining techniques, especially Web document clustering, this thesis analyzes traditional text representation models and text clustering algorithms, including the strengths and weaknesses of both. To overcome the shortcomings of existing clustering algorithms, the thesis introduces knowledge granularity theory and proposes a Web document clustering method based on knowledge granularity. The main contributions are as follows:

(1) Traditional Web clustering methods are mainly based on the two-level "document-term" knowledge granularity, which can lead to "falsely relevant" clustering results. This thesis therefore proposes a multi-level-granularity representation mechanism and theory for Web documents, and gives a concrete multi-level model: the three-level "document-paragraph-term" representation (the "D-P-T" model).

(2) Within this representation model, VSM-based similarity is usually computed term-to-term or document-to-document, which produces many "zero similarity" values. To address this problem, we introduce tolerance rough set theory and propose EVSM, a rough-set-based extension of the text representation model.

(3) In choosing a clustering algorithm, we note that traditional K-means suits massive document collections but is sensitive to outliers (and clusters non-spherical data poorly). We therefore propose NK-means, an improved K-means clustering algorithm.

(4) Finally, we design and implement WebAnalyser, a platform for Web data analysis, and on this platform implement the WCBGK algorithm for Web document clustering.
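The "zero similarity" problem the abstract refers to arises because plain VSM cosine similarity sees only exact term overlap: two documents on the same topic that happen to use disjoint vocabularies score exactly zero. A minimal illustrative sketch (not the thesis's code; function names are our own):

```python
from collections import Counter
import math

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two bag-of-words term vectors."""
    va, vb = Counter(doc_a), Counter(doc_b)
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Two documents on the same topic but with no shared terms:
d1 = ["car", "engine", "wheel"]
d2 = ["automobile", "motor", "tyre"]
print(cosine_similarity(d1, d2))  # → 0.0, the "zero similarity" problem
```

No matter how semantically close the two term sets are, the dot product over exact term matches is zero, which is what motivates enriching the representation (EVSM) before measuring similarity.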

【Abstract】 The rapid advance of the Internet (WWW) has enormously changed how people live; the Web has become a main channel for communication and information acquisition. A multitude of valuable knowledge lies latent in this large, distributed data repository, and all users, individuals and enterprises alike, face a challenging issue: how to acquire potentially useful knowledge from the Internet efficiently and effectively. Web data mining, derived from data mining, has therefore become an important and active topic in data analysis, attracting many experts and researchers. Over the last ten years it has been widely studied and has achieved great progress; many Web mining technologies have matured and been successfully applied to real-world applications. For example, search engines greatly ease information acquisition from the Internet, and e-business provides enterprises with a novel business mode. Unlike traditional data, Web data is characterized by complicated structure, various forms, and rich content, and users' requirements are diverse, which makes Web data mining much more challenging. The most common yet most important form of data is the Web page represented in a markup language. Existing Web data mining can be roughly classified into three categories: content mining, usage mining, and structure mining. The dominant technologies used are association analysis, time sequence analysis, and clustering analysis. Web data clustering is a key task in Web data mining: clustering analysis reduces the search space and information retrieval time, helps discover the documents most similar to a given one efficiently, improves the recall and precision of IR systems, and personalizes search engines effectively.

In this thesis, based on a deep understanding of existing data mining and Web document clustering methods, we first analyze traditional text representation models and text clustering algorithms, together with their limitations. We then adopt knowledge granularity to build a theory and algorithm for Web document clustering. The main contributions are as follows.

(1) Traditional Web clustering algorithms are based on two-level knowledge granularity, i.e., document and term, which can lead to "falsely relevant" clustering results. This thesis proposes a new multi-level-granularity method for Web document representation, realized as a concrete model: the "Document-Paragraph-Term" (abbreviated "D-P-T") model.

(2) As is well known, traditional VSM similarity measures can produce many "zero similarity" values. To solve this problem, we use tolerance rough set theory to design an extended VSM (EVSM) similarity.

(3) We use K-means as our clustering algorithm. Although K-means handles huge document collections well, it is outlier-sensitive and can produce poor output on non-spherical data. We therefore improve on K-means with an algorithm named NK-means.

(4) Finally, we develop WebAnalyser, a platform for Web data analysis, whose core is the Web clustering algorithm WCBGK. Experimental evaluation demonstrates that, compared with traditional Web document clustering algorithms, WCBGK achieves both higher accuracy and better understandability.
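In the tolerance rough set model that EVSM builds on, a term's tolerance class is typically the set of terms that co-occur with it in at least θ documents, and a document's upper approximation is the union of the tolerance classes of its terms. The sketch below illustrates that general idea under those assumptions; it is not the thesis's actual EVSM implementation, and the names and threshold are ours:

```python
from collections import defaultdict

def tolerance_classes(docs, theta):
    """I(t): terms co-occurring with t in >= theta docs, plus t itself."""
    cooc = defaultdict(int)
    for doc in docs:
        terms = set(doc)
        for a in terms:
            for b in terms:
                if a != b:
                    cooc[(a, b)] += 1
    vocab = {t for doc in docs for t in doc}
    return {t: {t} | {u for u in vocab if cooc[(t, u)] >= theta}
            for t in vocab}

def upper_approximation(doc, classes):
    """Enrich a document with the tolerance classes of its terms."""
    enriched = set()
    for t in set(doc):
        enriched |= classes.get(t, {t})
    return enriched

# Toy corpus: "car" and "engine" co-occur twice, so with theta=2
# they land in each other's tolerance class.
docs = [["car", "engine"], ["car", "motor"],
        ["engine", "motor"], ["car", "engine"]]
classes = tolerance_classes(docs, theta=2)
print(upper_approximation(["motor", "car"], classes))
```

Measuring similarity on these enriched term sets lets two documents overlap through co-occurrence links even when their literal vocabularies are disjoint, which is how the rough-set extension reduces "zero similarity" cases.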
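The outlier sensitivity that motivates NK-means is easy to see in plain Lloyd's K-means: because centroids are arithmetic means, a single distant point can capture a centroid of its own while the genuine groups are merged. A minimal one-dimensional sketch (illustrative only, not the thesis's NK-means):

```python
def kmeans_1d(points, centroids, iters=20):
    """Plain Lloyd's K-means on 1-D data; returns sorted final centroids."""
    k = len(centroids)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[i].append(p)
        # Recompute each centroid as the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two tight groups around 0 and 10, plus one outlier at 100.
data = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2, 100.0]
print(kmeans_1d(data, centroids=[0.0, 10.0]))
# The outlier ends up owning one centroid (100.0) and the two real
# groups collapse into the other (about 5.1).
```

Even with centroids seeded exactly on the two real groups, the mean update drags one centroid toward the outlier until it serves the outlier alone, so the two genuine clusters are never separated.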

  • 【CLC Number】TP393.092
  • 【Cited by】3
  • 【Downloads】471