Research and Implementation of Chinese SMS Content Filtering Technology on Mobile Platforms

【Author】 Chen Xin

【Author information】 University of Electronic Science and Technology of China, Information and Communication Engineering, 2008, Master's thesis

【Abstract】 Chinese-oriented SMS filtering is urgently needed in the Chinese mobile market. Mature SMS filtering technologies exist for English, but Chinese SMS filtering on mobile platforms still relies mainly on blacklist filtering and keyword filtering. This thesis presents a content filtering technique that differs from these mainstream approaches: it is built on rule-base filtering, uses the content features of Chinese SMS, and is designed to be easy to implement on mobile devices. It improves filtering precision and junk-SMS recall while reducing the false alarm rate for normal SMS.

SMS content filtering is a form of text categorization. Among the many text categorization techniques in use, Maximum Entropy and Decision Tree are widely applied to content filtering as representative statistics-based and rule-based algorithms, respectively. This thesis compares both experimentally against the proposed lightweight rule base (Simple Rules Filtering) technique, to check whether the proposed technique meets practical requirements.

The proposed technique has two parts. The first part, rule matching, is the first phase of SMS content filtering. Its core is keyword rule matching, which requires a Chinese multi-pattern string matching algorithm. The classic string matching algorithms, including multi-pattern algorithms such as AC (Aho-Corasick) and WM (Wu-Manber), were designed for English strings, so this thesis proposes UIAC, a multi-pattern matching algorithm for Chinese. Alongside UIAC, other rules extract further content features of Chinese SMS: message length, punctuation, phone numbers, URLs, and so on. This phase also handles Chinese encoding conversion on the handset platform. Its output is an intermediate vector file.

The second part, filtering, is the second phase of SMS filtering. This thesis proposes the Simple Rules Filtering algorithm, which is better suited than Maximum Entropy and Decision Tree to implementation on resource-limited mobile devices.

For comparison, the rule matching phase of the experiment produces three intermediate vector files: a Simple Rules Filtering vector file, a Maximum Entropy vector file, and a Decision Tree vector file. The latter two are processed by the Maximum Entropy model and the Decision Tree model. The precision, recall, and normal-SMS false alarm rate of the three algorithms are then compared. The test set contains 1000 short messages: 500 normal and 500 junk. The results show that the Simple Rules algorithm performs close to the other two algorithms, improves considerably on the normal-SMS false alarm rate, and is easier to implement on handset platforms.
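The two-phase flow described in the abstract (rule matching that produces an intermediate vector, then filtering on that vector) can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the keyword list, the regular expressions, the `FeatureVector` fields, and the decision rule in `is_junk` are all assumptions made for the example, and the keyword scan is a naive per-keyword count rather than UIAC or any other multi-pattern matcher.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword rules; the thesis's actual rule base is not given
# in the abstract.
KEYWORDS = ("发票", "中奖", "贷款")

# Rough patterns, assumed for illustration only.
PHONE_RE = re.compile(r"(?<!\d)(?:1\d{10}|0\d{2,3}-?\d{7,8})(?!\d)")
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
PUNCTUATION = set("，。！？；：、,.!?;:")


@dataclass
class FeatureVector:
    """Intermediate vector produced by the rule-matching phase."""
    length: int
    punctuation_count: int
    has_phone: bool
    has_url: bool
    keyword_hits: int


def extract_features(message: str) -> FeatureVector:
    """Rule-matching phase: turn one message into an intermediate vector."""
    return FeatureVector(
        length=len(message),
        punctuation_count=sum(ch in PUNCTUATION for ch in message),
        has_phone=PHONE_RE.search(message) is not None,
        has_url=URL_RE.search(message) is not None,
        # Naive scan, one pass per keyword; a multi-pattern matcher such as
        # AC or WM would find all keywords in a single pass over the text.
        keyword_hits=sum(message.count(k) for k in KEYWORDS),
    )


def is_junk(v: FeatureVector) -> bool:
    """Filtering phase: a toy hand-written rule, not the thesis's rule base."""
    has_contact = v.has_phone or v.has_url
    return v.keyword_hits >= 2 or (v.keyword_hits >= 1 and has_contact)
```

Calling `is_junk(extract_features(text))` on each message gives a junk/normal decision. Splitting the work this way keeps the second phase a cheap comparison over a handful of numbers, which is in line with the abstract's stated goal of running on resource-limited handsets.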

【Keywords】 SMS Content Filtering; Simple Rules Set; Multiple Pattern Matching; Maximum Entropy; Decision Tree

【Online publication submitter】 University of Electronic Science and Technology of China

【Classification number】 TN915.09
【Citation count】 2
【Download count】 231