Research on Text Clustering Based on HowNet

【Author】 Zhang Long

【Author information】 Hebei University of Technology, Computer Technology, 2012, Master's degree

【Abstract】 K-Means is a classical data mining algorithm. It is simple in form, has low time and space complexity, and is widely used in text mining. This thesis studies the key techniques and algorithms of text clustering, in particular how to exploit the semantic and positional information of words. It clusters a text collection using improved text similarity measures and adapts the K-Means algorithm accordingly. The main work is an exploration of how three text similarity measures affect the clustering quality of K-Means. K-Means is implemented with each of three similarity measures (one based on the traditional vector space model, one based on HowNet, and one that incorporates positional information), and the clustering results are compared. For the HowNet-based measure, the thesis implements a new way of generating the vector space in order to improve efficiency and accuracy. Instead of building one fixed-dimension vector space from every word in the document collection, it generates a separate vector for each document, whose dimension equals the number of words in that document rather than in the whole collection. This reduces the high dimensionality and sparsity of the data. The thesis also discusses how this vector space relates to Euclidean space. For the measure that incorporates positional information, the thesis gathers statistics on word positions in the text by means of dependency analysis and proposes that the similarity of two words should be determined jointly by their semantic similarity and their positional similarity. It then explores a method for correcting the similarity of two words with their positional information, so that the two are combined. K-Means implemented with the text similarity measure improved in these two respects achieves relatively good clustering results.

【Key words】 text clustering; vector space model; text similarity; HowNet

【Online publication contributor】 Hebei University of Technology

【Classification number】 TP391.3
【Times cited】 4
【Downloads】 155
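
The per-document representation described in the abstract can be illustrated with a small sketch. This is not the thesis's implementation: the word-similarity table `TOY_WORD_SIM` is a made-up stand-in for a HowNet-based word similarity, the best-match averaging in `doc_sim` is one assumed way to compare documents of different lengths, and the clustering loop uses medoids (a swapped-in technique) rather than true K-Means centroids, because documents held as separate word lists have no shared coordinate mean. Positional information is not modeled here.

```python
from typing import Callable, Dict, FrozenSet, List

# Hypothetical word-similarity table standing in for a HowNet-based measure.
# The values are invented for illustration only.
TOY_WORD_SIM: Dict[FrozenSet[str], float] = {
    frozenset({"car", "automobile"}): 0.9,
    frozenset({"car", "engine"}): 0.6,
    frozenset({"automobile", "engine"}): 0.6,
    frozenset({"apple", "fruit"}): 0.8,
    frozenset({"apple", "banana"}): 0.7,
    frozenset({"banana", "fruit"}): 0.8,
}


def word_sim(a: str, b: str) -> float:
    """Semantic similarity of two words in [0, 1] (toy lookup)."""
    if a == b:
        return 1.0
    return TOY_WORD_SIM.get(frozenset({a, b}), 0.0)


def doc_sim(doc_a: List[str], doc_b: List[str]) -> float:
    """Symmetric similarity of two documents, each kept as its own word list."""
    words_a, words_b = set(doc_a), set(doc_b)
    if not words_a or not words_b:
        return 0.0
    # Average best match of each word in one document against the other.
    a_to_b = sum(max(word_sim(w, v) for v in words_b) for w in words_a) / len(words_a)
    b_to_a = sum(max(word_sim(v, w) for w in words_a) for v in words_b) / len(words_b)
    return (a_to_b + b_to_a) / 2


def cluster(
    docs: List[List[str]],
    seeds: List[int],
    sim: Callable[[List[str], List[str]], float] = doc_sim,
    max_iter: int = 10,
) -> List[int]:
    """K-Means-style loop with medoids: assign to nearest medoid, then re-pick medoids."""
    medoids = list(seeds)
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each document joins its most similar medoid.
        labels = [
            max(range(len(medoids)), key=lambda c: sim(d, docs[medoids[c]]))
            for d in docs
        ]
        # Update step: the new medoid is the member most similar to its cluster.
        new_medoids = []
        for c in range(len(medoids)):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(
                max(members, key=lambda i: sum(sim(docs[i], docs[j]) for j in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels


docs = [
    ["car", "engine"],
    ["automobile", "engine"],
    ["apple", "fruit"],
    ["banana", "fruit"],
]
labels = cluster(docs, seeds=[0, 2])  # labels -> [0, 0, 1, 1]
```

Note that "car" and "automobile" never co-occur in a document here, so a plain bag-of-words cosine would see the first two documents as sharing only "engine"; the word-level similarity is what pulls them together.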
