Research on an Electronic Document Information Mining System

【Author】 Cai Lijun (蔡立军)

【Advisors】 Zhang Dafang (张大方); Guo Kejun (郭克俊);

【Author information】 Hunan University, Control Engineering, 2003, Master's

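【Abstract (translated from the Chinese)】 With the explosive growth of the Internet and its information services, and following the successful application of data mining to traditional databases, researchers have begun to study network information mining, and Web data mining in particular. This thesis first introduces the definition, functions, models, and algorithms of data mining, and reviews its background, technical evolution, and current state. It then describes a prototype framework for a data mining system and analyzes the three most commonly used Web data mining techniques. The model used in Web log mining has serious shortcomings: low accuracy, excessive model cost, and low efficiency, which make it unsuitable for mining electronic documents. The vector space model (VSM) method and the document filtering method based on learning from examples are both, in essence, document comparison and filtering models; their main shortcoming is that the vector dimensionality and computational cost are very large, so mining efficiency is low. They also perform poorly on items with fuzzy characteristics, and applying fuzzy measures to central words introduces considerable deviation. Finally, the thesis presents a practical solution for an electronic document information mining system. Documents on the Internet come in many types and languages, so building a uniformly formatted database for them would be very complicated. The thesis therefore builds mirror sites of the file holdings of Internet servers and reverses the traditional data mining process: electronic documents are mined first, and only those useful to users are then loaded into a database, which improves users' capacity and speed in processing information. The system uses the I2DEF method to build structural, dynamic, and functional models; designs a non-backtracking search algorithm with a double scanning buffer, together with a double-stack structure for the search process; designs a Bayes classifier, using an enhanced method, based on the characteristics of e-mail monitoring systems and electronic document mining, and proposes an e-mail monitoring system that applies electronic document mining; builds a dual C/S and B/S architecture; and presents some of the function call relationships in the mining process, the system's mining workflow, and part of the processing programs. The system supports electronic document mining, publishing, and management, e-mail monitoring, and system maintenance.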

【Abstract】 With the explosive growth of the Internet and its information services, and after data mining (DM) technology was successfully applied to traditional databases, people have begun to study network information mining, especially Web data mining. Beginning with an introduction to the definition, functions, models, and algorithms of DM, the thesis also studies its background, technological evolution, and present situation. It then describes the framework of a DM system, focusing on an analysis of the three most common Web DM technologies. Because the model used in Web log mining has serious deficiencies, such as low accuracy, high cost, and inefficiency, it is unfit for mining electronic documents. The vector space model (VSM) method, like document filtering based on learning from examples, is in essence a way of comparing documents and filtering models; with these methods the vector dimensionality and computational cost are very large, so mining is inefficient. They are also ineffective when handling things with fuzzy characteristics, since considerable deviation may appear when fuzzy measures are applied to key words. Finally, the thesis proposes a practical solution for an electronic document information mining system. Because documents on the Internet vary widely in type and language, it is very complicated to set up a database with a uniform format for them. The thesis therefore establishes mirror sites of the file holdings of Internet servers and inverts the traditional data mining process: electronic documents are mined first, and a database is then built from the documents useful to users, in order to increase their ability and speed in handling information. The system employs IDEF to establish structural, dynamic, and functional models, and designs a non-backtracking search algorithm with a double scanning buffer and a double-stack structure for the search process. According to the characteristics of e-mail monitoring systems and electronic document mining technology, a Bayes classifier is designed, using an enhanced method, and an e-mail monitoring system that applies electronic document mining technology is proposed. Moreover, a dual C/S and B/S architecture is constructed, and some of the function call relationships in the mining process are given, along with the system's mining workflow and part of the processing programs. The system provides electronic document mining, publishing, and management, e-mail monitoring, and system maintenance.
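
The abstract names a Bayes classifier as the core of the e-mail monitoring component but gives no details of it. As a rough illustration only, the sketch below shows a generic multinomial naive Bayes text classifier with add-one (Laplace) smoothing. It is not the thesis's classifier and does not reproduce its enhanced method; the function names `train` and `classify`, the labels `flagged` and `normal`, and the sample tokens are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """Build a naive Bayes model from (tokens, label) pairs.

    Hypothetical helper for illustration; not taken from the thesis.
    """
    class_counts = Counter()            # number of documents per label
    word_counts = defaultdict(Counter)  # token frequencies per label
    vocab = set()                       # all tokens seen in training
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def classify(model, tokens):
    """Return the label with the highest log-posterior score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Log prior: share of training documents carrying this label.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocab:
                continue  # ignore tokens never seen in training
            # Add-one smoothing keeps unseen (token, label) pairs nonzero.
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Assumed toy training data: pre-tokenized messages with hypothetical labels.
training = [
    (["cheap", "pills", "offer"], "flagged"),
    (["meeting", "agenda", "notes"], "normal"),
    (["cheap", "offer", "now"], "flagged"),
    (["project", "meeting", "notes"], "normal"),
]
model = train(training)
print(classify(model, ["cheap", "offer"]))
```

Working in log space avoids numeric underflow when many token probabilities are multiplied, and the smoothing term prevents a single unseen token from zeroing out a class.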

  • 【Online publication contributor】 Hunan University
  • 【Online publication year/issue】 2003, Issue 03
  • 【Classification number】 TP311.13
  • 【Download count】 208