Implementation of IDGS, a Network Information Mining System

THE DESIGN AND IMPLEMENTATION OF AN INFORMATION MINING SYSTEM

【Author】 ZOU Tao, QI Guangzhi, CAI Lijuan, ZHANG Fuyan (Department of Computer Science and Technology, Nanjing University, Nanjing 210093, China)

【Affiliation】 Institute of Multimedia Computing, Nanjing University; State Key Laboratory for Novel Software Technology, Nanjing, Jiangsu 210093

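【Summary】 Network information mining is a new topic in the field of network information processing. This paper presents the design and implementation of IDGS, a WWW-based information mining system. It discusses the application of statistics-based text feature extraction and the BP neural network model to network information mining, as well as the methods and strategies required for information mining on the WWW.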

【Abstract】 Information mining on the Internet is a new technology in network information processing and an important application of data mining to the Internet. This paper describes the design and implementation of an information mining system called IDGS, which gathers HTML documents from the World Wide Web and mines out the documents users want by means of a BP neural network model trained with the backpropagation algorithm. Data Mining (DM) and Knowledge Discovery in Databases (KDD) are defined as the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. Data mining arose in response to the problem of "rich data, poor information". Network information mining applies data mining to the Internet: it extracts a potential pattern from target learning samples and then uses that pattern to extract useful information from Internet resources. The IDGS system consists of four modules: the Pattern Extraction and Feature Selection Module, the Raw Document Collection Module, the Pattern Matching Module, and the Document Database Module. It adopts a BP neural network model with the BP algorithm to match information content. The neural network that IDGS adopts has 20 input neurons, one output neuron, and 2 hidden layers. Each input neuron corresponds to one feature extracted from the learning samples, and the output neuron corresponds to the relevance to the mining target. The feature selection strategy is based on statistics: a word or phrase is selected as a feature if it appears more frequently in relevant documents than in irrelevant documents. To segment Chinese sentences and compute word frequencies, we set up three dictionaries: the Main dictionary, the Thesaurus dictionary, and the Implini dictionary. All the words in these three dictionaries are involved when word frequencies are computed, which addresses the problem of word diversity.
Meanwhile, we set several weight coefficients, such as CofTitle, CofLinkText, CofH1, and CofH2, to make use of the HTML markup text. Collecting raw documents is an important step in network information mining. To improve collection efficiency, we first submit queries to WWW search engines, such as Yahoo, Altavista, and Infoseek, to obtain the starting URLs for collection, and then adopt WWW robot technology to traverse each Web site under several heuristic policies. Finally, we compare the results of the IDGS system with those of the Inquery system of the University of Massachusetts. The comparison shows that the IDGS system works effectively.
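
The pipeline the abstract describes (tag-weighted term frequencies, frequency-based feature selection, and a BP network with 20 inputs, two hidden layers, and one output) can be illustrated with a minimal Python sketch. It is not the IDGS implementation. The tag weight values, the hidden layer sizes, the learning rate, and every name below (`TAG_WEIGHTS`, `weighted_frequency`, `select_features`, `BPNetwork`) are assumptions made for illustration; the abstract names the coefficients CofTitle, CofLinkText, CofH1, and CofH2 but gives no values, and it does not state the hidden layer sizes. Dictionary-based Chinese word segmentation is omitted, so documents are assumed to be already tokenized.

```python
import math
import random

# Hypothetical tag weights: the abstract names CofTitle, CofLinkText, CofH1 and
# CofH2 but gives no values, so these numbers are placeholders.
TAG_WEIGHTS = {"title": 3.0, "link_text": 2.0, "h1": 2.0, "h2": 1.5, "body": 1.0}

def weighted_frequency(doc, term):
    """Tag-weighted count of `term`; `doc` maps a tag name to its token list."""
    return sum(TAG_WEIGHTS.get(tag, 1.0) * tokens.count(term)
               for tag, tokens in doc.items())

def select_features(relevant, irrelevant, k=20):
    """Keep up to k terms that are more frequent in relevant documents."""
    def mean(docs, term):
        return sum(weighted_frequency(d, term) for d in docs) / max(len(docs), 1)
    vocab = {t for d in relevant + irrelevant for toks in d.values() for t in toks}
    gap = {t: mean(relevant, t) - mean(irrelevant, t) for t in vocab}
    return sorted((t for t in vocab if gap[t] > 0), key=lambda t: (-gap[t], t))[:k]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class BPNetwork:
    """Fully connected sigmoid network trained with plain backpropagation."""

    def __init__(self, sizes, seed=0):
        rng = random.Random(seed)
        # One weight matrix per layer; the last column of each row is the bias.
        self.layers = [[[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                        for _ in range(n_out)]
                       for n_in, n_out in zip(sizes, sizes[1:])]

    def forward(self, x):
        """Return the activations of every layer, input included."""
        acts = [list(x)]
        for layer in self.layers:
            prev = acts[-1] + [1.0]  # append the constant bias input
            acts.append([sigmoid(sum(w * a for w, a in zip(row, prev)))
                         for row in layer])
        return acts

    def train_step(self, x, target, lr=0.5):
        """One gradient step on squared error for a single example."""
        acts = self.forward(x)
        # Output delta: (y - t) * sigmoid'(net)
        deltas = [(y - t) * y * (1 - y) for y, t in zip(acts[-1], target)]
        for i in range(len(self.layers) - 1, -1, -1):
            layer, prev = self.layers[i], acts[i] + [1.0]
            # Deltas for the layer below, computed before the weights change.
            below = [a * (1 - a) * sum(d * row[j] for d, row in zip(deltas, layer))
                     for j, a in enumerate(acts[i])]
            for d, row in zip(deltas, layer):
                for j, a in enumerate(prev):
                    row[j] -= lr * d * a
            deltas = below

    def predict(self, x):
        """Relevance score in (0, 1) from the single output neuron."""
        return self.forward(x)[-1][0]
```

A network matching the stated shape could be built as `BPNetwork([20, 8, 8, 1])`, where the two hidden sizes of 8 are arbitrary. Each input would be the tag-weighted frequency of one selected feature in a document, and `predict` would return the document's estimated relevance to the mining target.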

【Key words】 Information Mining; Neural Network; BP Algorithm; WWW
【Funding】 Supported by a Jiangsu Provincial Science and Technology Commission "95" Key Science and Technology Project (No. BE96017)
  • 【Source】 Journal of Nanjing University (Natural Sciences), 2000, Issue 02
  • 【Classification No.】 TP393
  • 【Times cited】 22
  • 【Downloads】 111