
ç½‘ç»œä¿¡æ¯æŒ–æŽ˜ç³»ç»ŸIDGSçš„å®žçŽ°

THE DESIGN AND IMPLEMENTATION OF AN INFORMATION MINING SYSTEM


【Authors】 ZOU Tao (邹涛), QI Guangzhi (戚广智), CAI Lijuan (蔡丽娟), ZHANG Fuyan (张福炎) (Department of Computer Science and Technology, Nanjing University, Nanjing 210093, China)

【Institution】 State Key Laboratory for Novel Software Technology, Institute of Multimedia Computing, Nanjing University, Nanjing, Jiangsu 210093

【Abstract, translated from the Chinese】 Network information mining is a new topic in the field of network information processing. This paper introduces the design and implementation of IDGS, a WWW-based information mining system, and discusses the application of statistics-based text feature extraction and the BP neural network model to network information mining, as well as the methods and strategies required for mining information on the WWW.

【Abstract】 Information mining on the Internet is a new technology for network information processing and an important application of data mining to the Internet. This paper describes the design and implementation of an information mining system, called IDGS, which gathers HTML documents from the World Wide Web and mines out the documents users want by using a BP neural network model trained with the backpropagation algorithm. Data Mining (DM), or Knowledge Discovery in Databases (KDD), is defined as the non-trivial extraction of implicit, previously unknown, and potentially useful information from data. Data mining is a new technology that arose in response to the problem of "rich data, poor information". Network information mining is an application of data mining to the Internet: it extracts a potential pattern from target learning samples and then uses that pattern to extract useful information from Internet resources. The IDGS system consists of four modules: the Pattern Extraction and Feature Selection Module, the Raw Document Collection Module, the Pattern Matching Module, and the Document Database Module. It adopts a BP neural network model with the BP algorithm to match information content. The neural network that IDGS adopts has 20 input neurons, one output neuron, and 2 hidden layers. Each input neuron corresponds to one feature extracted from the learning samples, and the output neuron corresponds to the relevance to the mining target. The feature selection strategy is based on statistics: we select a word or phrase as a feature if it appears more frequently in relevant documents than in irrelevant documents. To segment Chinese sentences and compute word frequencies, we set up three dictionaries: the Main dictionary, the Thesaurus dictionary, and the Implini dictionary. We involve all the words in these three dictionaries when computing word frequencies, so that we can address the problem of word diversity.
Meanwhile, we set several weight coefficients, such as CofTitle, CofLinkText, CofH1, and CofH2, to make use of the HTML markup of the text. Collecting raw documents is an important step in network information mining. To improve collection efficiency, we first submit queries to WWW search engines, such as Yahoo, AltaVista, and Infoseek, to obtain the starting URLs for collection, and then we adopt WWW robot technology to traverse the Web sites with several heuristic policies. Finally, we compare the results of the IDGS system with those of the Inquery system of the University of Massachusetts. The comparison shows that the IDGS system works effectively.
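
The statistics-based feature selection and the HTML weight coefficients described in the abstract can be illustrated with a minimal Python sketch. It is not the IDGS implementation. The numeric values in TAG_WEIGHTS are hypothetical stand-ins for CofTitle, CofLinkText, CofH1, and CofH2, whose values the abstract does not give. A simple regular-expression tokenizer replaces the dictionary-based Chinese word segmentation. Ranking candidates by frequency margin and keeping the top 20 (to match the 20 input neurons) is also an assumption; the abstract states only that a term is selected when it is more frequent in relevant documents than in irrelevant ones.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical values: the abstract names these coefficients but not their values.
TAG_WEIGHTS = {
    "title": 5.0,  # stands in for CofTitle
    "a": 2.0,      # stands in for CofLinkText
    "h1": 4.0,     # stands in for CofH1
    "h2": 3.0,     # stands in for CofH2
}
DEFAULT_WEIGHT = 1.0


class WeightedTermCounter(HTMLParser):
    """Counts terms, scaling each occurrence by the weight of its enclosing tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []    # currently open tags that carry a weight
        self.freq = Counter()  # term -> weighted frequency

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the innermost open occurrence of this tag, if any.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break

    def handle_data(self, data):
        # Use the largest weight among the enclosing weighted tags.
        weight = max((TAG_WEIGHTS[t] for t in self.open_tags),
                     default=DEFAULT_WEIGHT)
        # Assumption: a plain regex tokenizer instead of dictionary segmentation.
        for term in re.findall(r"[a-z]+", data.lower()):
            self.freq[term] += weight


def weighted_frequencies(html_docs):
    """Return the average tag-weighted frequency of each term per document."""
    total = Counter()
    for html in html_docs:
        parser = WeightedTermCounter()
        parser.feed(html)
        parser.close()  # flush any buffered trailing text
        total.update(parser.freq)
    n = max(len(html_docs), 1)
    return {term: count / n for term, count in total.items()}


def select_features(relevant_docs, irrelevant_docs, k=20):
    """Keep terms that are more frequent in relevant than in irrelevant documents,
    ranked by the size of that margin, and return at most k of them."""
    rel = weighted_frequencies(relevant_docs)
    irr = weighted_frequencies(irrelevant_docs)
    margins = {t: f - irr.get(t, 0.0)
               for t, f in rel.items() if f > irr.get(t, 0.0)}
    return sorted(margins, key=lambda t: (-margins[t], t))[:k]
```

Calling select_features with a list of relevant HTML strings and a list of irrelevant ones returns up to 20 terms. In a system like the one described, each selected term would then feed one input neuron of the network.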