Research on Short Text Classification Method Based on Feature Reduction and Semantic Extension

【Author】 Zhou Ming

【Author information】 Hefei University of Technology, Computer Application Technology, 2020, Master's thesis

【Abstract】 With the growth of the Internet, and of online social networking in particular, short text has become a mainstream form of text. Compared with traditional text, short texts are brief but arrive in very large volumes, so high dimensionality and sparsity are the first challenges in short text mining. In addition, short texts carry little semantic information and are often ambiguous, which makes it difficult for traditional text mining methods to complete classification tasks efficiently and accurately. How to further compress the feature dimensions of short texts, extend their original semantic information, and improve representation and classification performance has therefore become a research hotspot in short text mining. This thesis studies classification methods aimed at these problems. The main contributions are as follows:

(1) To address the high dimensionality and sparsity of short text data, a classification method based on signed hash feature reduction is proposed. The method first preprocesses the short texts, segmenting words with an improved multi-threaded jieba-fast segmenter and removing stop words to improve text representation. Next, to reduce the dimensionality of massive short text collections, a signed hash mapping projects the high-dimensional texts into a vector space of fixed dimension, stores the text content as a sparse matrix, and distinguishes texts that may be ambiguous. Finally, a random forest is used as the classification model. Experimental results show that the proposed method achieves high classification accuracy on short texts while striking a good balance between hardware consumption and model accuracy.

(2) To address the poor text representation caused by the limited semantic information in short texts, a fuzzy semantic extension classification model based on hierarchical clustering and LSTM is proposed. First, Skip-Gram is used to train word vectors on the data set, and hierarchical clustering is performed in the word embedding space. The cluster center vectors are then fuzzy-matched against word vectors from an external corpus according to semantic similarity, yielding a text representation enriched with semantic information. Next, an LSTM (Long Short-Term Memory) network performs high-level feature extraction, a Stochastic-pooling layer extracts global features and further reduces dimensionality, and a softmax layer outputs the classification result. Experimental results show that the method effectively supplements the semantic information of short texts and produces classification results with relatively high accuracy.
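The signed hash mapping described in the abstract can be illustrated with a minimal sketch of the general "signed hashing trick". This is not the thesis implementation: the function name `signed_hash_features`, the use of MD5, and the dictionary-based sparse vector are all assumptions made for illustration, and the input is assumed to be already segmented and stop-word filtered.

```python
import hashlib
from collections import defaultdict


def signed_hash_features(tokens, n_features=1024):
    """Map a token list to a sparse {index: value} vector of fixed width.

    Hypothetical helper: part of the digest selects the bucket and another
    bit selects the sign, so tokens that collide in one bucket tend to
    cancel out instead of accumulating.
    """
    vec = defaultdict(float)
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        # First 8 bytes choose the bucket in [0, n_features).
        index = int.from_bytes(digest[:8], "big") % n_features
        # Lowest bit of the ninth byte chooses the sign (+1 or -1).
        sign = 1.0 if digest[8] & 1 else -1.0
        vec[index] += sign
    # Drop buckets that cancelled to zero so the vector stays sparse.
    return {i: v for i, v in vec.items() if v != 0.0}


# Example: a pre-segmented short text mapped into 16 dimensions.
doc = ["short", "text", "classification", "short"]
vec = signed_hash_features(doc, n_features=16)
```

The output width depends only on `n_features`, not on vocabulary size, which is what keeps memory use bounded for large collections. Vectors of this kind could be stacked into a sparse matrix and passed to a classifier such as a random forest, as the abstract outlines.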

【Keywords】 short text classification; hash mapping; random forest; hierarchical clustering; semantic extension

【Online publication contributor】 Hefei University of Technology

【Classification number】 TP391.1
【Times cited】 1
【Downloads】 81