Implementation of IDGS, a Network Information Mining System
THE DESIGN AND IMPLEMENTATION OF AN INFORMATION MINING SYSTEM
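【Summary】 Network information mining is a new topic in the field of network information processing. This paper presents the design and implementation of IDGS, a WWW-based information mining system, and discusses the application of statistics-based text feature extraction and the BP neural network model to network information mining, as well as the methods and strategies needed for information mining on the WWW.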
【Abstract】 Information mining on the Internet is a new technology in network information processing and an important application of data mining in the Internet domain. This paper describes the design and implementation of an information mining system, called IDGS, which gathers HTML documents from the World Wide Web and mines out the documents users want by using a BP neural network model trained with the backpropagation algorithm. Data Mining (DM) and Knowledge Discovery in Databases (KDD) are defined as the non-trivial extraction of implicit, previously unknown and potentially useful information from data. Data mining is a new technology that arose in response to the problem of "rich data, poor information". Network information mining is an application of data mining to the Internet: it extracts a potential pattern from target learning samples and then uses that pattern to extract useful information from Internet resources. The IDGS system consists of four modules: the Pattern Extraction and Feature Selection Module, the Raw Document Collection Module, the Pattern Matching Module and the Document Database Module. It adopts a BP neural network model with the BP algorithm to match information content. The neural network has 20 input neurons, one output neuron and 2 hidden layers. Each input neuron corresponds to one feature extracted from the learning samples, and the output neuron corresponds to the relevance to the mining target. The feature selection strategy is based on statistics: words or phrases are selected as features if they appear more frequently in relevant documents than in irrelevant documents. To segment Chinese sentences and compute word frequencies, we set up three dictionaries: the Main dictionary, the Thesaurus dictionary and the Implini dictionary. All the words in these three dictionaries are included when we compute word frequency, so that we can handle the problem of word diversity.
Meanwhile, we set several weight coefficients, such as CofTitle, CofLinkText, CofH1 and CofH2, to make use of the text inside HTML markup. Collecting raw documents is an important step in network information mining. To improve collection efficiency, we first submit queries to WWW search engines, such as Yahoo, Altavista and Infoseek, to obtain the starting URLs, and then adopt WWW robot technology to traverse the Web sites with several heuristic policies. Finally, we compare the results of the IDGS system with those of the Inquery system of the University of Massachusetts. The comparison shows that the IDGS system works effectively.
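The statistical feature selection and HTML tag weighting described in the abstract can be illustrated with a minimal Python sketch. It is not the IDGS implementation. The weight values, the tag names used as keys, the document representation (a list of tag and term-list pairs, already segmented) and the ranking by frequency margin are all assumptions made for illustration; the abstract only states that coefficients such as CofTitle, CofLinkText, CofH1 and CofH2 exist and that a term is kept when it is more frequent in relevant than in irrelevant documents. The limit of 20 features mirrors the 20 input neurons.

```python
from collections import Counter

# Hypothetical weights standing in for CofTitle, CofH1, CofH2 and CofLinkText.
# The abstract names these coefficients but does not give their values.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "h2": 1.5, "a": 1.5, "body": 1.0}


def weighted_counts(doc):
    """Count terms in one document, scaling each hit by its tag weight.

    `doc` is a list of (tag, terms) pairs, where `terms` is a list of
    already segmented words or phrases (an assumed representation).
    """
    counts = Counter()
    for tag, terms in doc:
        weight = TAG_WEIGHTS.get(tag, 1.0)
        for term in terms:
            counts[term] += weight
    return counts


def mean_frequency(docs):
    """Average weighted count of each term over a set of documents."""
    total = Counter()
    for doc in docs:
        total.update(weighted_counts(doc))
    n = max(len(docs), 1)
    return {term: count / n for term, count in total.items()}


def select_features(relevant, irrelevant, k=20):
    """Keep terms that are more frequent in relevant than irrelevant documents.

    Terms are ranked by the frequency margin (an assumed tie-breaking rule)
    and at most `k` are returned, one per input neuron.
    """
    rel = mean_frequency(relevant)
    irr = mean_frequency(irrelevant)
    margins = {term: freq - irr.get(term, 0.0) for term, freq in rel.items()}
    kept = [term for term, margin in margins.items() if margin > 0]
    kept.sort(key=lambda term: (-margins[term], term))
    return kept[:k]


relevant_docs = [
    [("title", ["mining"]), ("body", ["web", "the"])],
    [("body", ["mining", "the"])],
]
irrelevant_docs = [[("body", ["the", "the", "sports"])]]
features = select_features(relevant_docs, irrelevant_docs)
```

In this toy input, "the" is dropped because it is more frequent in the irrelevant set, while "mining" ranks first because its title occurrence is weighted more heavily. A real pipeline would also need the word segmentation and dictionary lookup steps, which are omitted here.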
- 【Source】 Journal of Nanjing University (Natural Sciences), 2000, Issue 02
- 【Classification number】 TP393
- 【Times cited】 22
- 【Downloads】 111