Research and Implementation of Tightness Post-Processing Based on Hadoop and Support Vector Machine

A Post-Process Method to Tightness Based on Hadoop and Support Vector Machine

【Author】 Yang Guang

【Author information】 Beijing Jiaotong University, Software Development Technology (professional degree), 2015, Master's

【Abstract】 Accurately retrieving and presenting the results that users are looking for has become the main goal of modern search engines. Search engines rely on many techniques; natural language processing is one of the most important and underpins improvements in the others. Tightness is one of the key techniques applied after word segmentation and stop-word removal. It describes the relationship between the smallest units produced by segmentation (terms), serves as an important indicator for relevance ranking in web search, has a decisive effect on the ranked results, and matters greatly for improving the precision and recall of search results. Because the segmentation strategy is minimal cutting, the segmenter splits sentences as finely as possible, which breaks some long phrases into several terms. The subsequent search then recalls web pages that do not match the user's need, which lowers precision and degrades the user experience. Set against an actual project on the Sogou search engine, this thesis studies the algorithms and strategies for new-word discovery in Chinese word segmentation, designs a strategy-based algorithm for extracting the relations between terms, assembles these relations into features, classifies the features with a Support Vector Machine (SVM), and thereby improves the practical effect of tightness. The main work is as follows: (1) Data preprocessing. The raw search logs are segmented and initial statistics are computed, producing the base data for the later strategies. (2) Initial post-processing based on search session logs. A query difference value is computed over the search session data to select a subset of sessions, which is then used for an initial post-processing pass on tightness. (3) Second-stage post-processing based on web page body text. For tightness results at the proper-noun level, pairwise feature relations between terms are derived using new-word discovery methods such as information entropy and mutual information, and the feature values are classified with the SVM. (4) Validation and analysis of experimental results. The final offline data is validated against the training set; the tightness post-processing strategies improve relevance ranking and make Sogou search results more accurate. (5) Strategy effect. Adjusting tightness values through the post-processing strategies makes relevance ranking more accurate, placing high-quality results higher and poor results lower.
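Item (3) of the abstract names information entropy and mutual information as sources of pairwise term features. As a minimal sketch only, and not the thesis's implementation, the Python below computes pointwise mutual information and left/right context entropy for adjacent term pairs in already-segmented queries. The function name `pair_features`, the boundary markers `<s>` and `</s>`, and the toy English queries are illustrative assumptions; the abstract does not specify the exact formulas or the feature set.

```python
import math
from collections import Counter, defaultdict


def pair_features(segmented_queries):
    """Map each adjacent term pair (a, b) to (pmi, left_entropy, right_entropy).

    Hypothetical helper for illustration; input is a list of term lists.
    """
    unigram = Counter()
    bigram = Counter()
    left_ctx = defaultdict(Counter)   # terms seen immediately before the pair
    right_ctx = defaultdict(Counter)  # terms seen immediately after the pair

    for terms in segmented_queries:
        unigram.update(terms)
        for i in range(len(terms) - 1):
            pair = (terms[i], terms[i + 1])
            bigram[pair] += 1
            # Assumed boundary markers for pairs at the start or end of a query.
            left = terms[i - 1] if i > 0 else "<s>"
            right = terms[i + 2] if i + 2 < len(terms) else "</s>"
            left_ctx[pair][left] += 1
            right_ctx[pair][right] += 1

    n_uni = sum(unigram.values())
    n_bi = sum(bigram.values())

    def entropy(counter):
        # Shannon entropy (in bits) of the context distribution.
        total = sum(counter.values())
        return sum(-(c / total) * math.log2(c / total) for c in counter.values())

    features = {}
    for (a, b), count in bigram.items():
        p_ab = count / n_bi
        p_a = unigram[a] / n_uni
        p_b = unigram[b] / n_uni
        pmi = math.log2(p_ab / (p_a * p_b))  # pointwise mutual information
        features[(a, b)] = (pmi, entropy(left_ctx[(a, b)]), entropy(right_ctx[(a, b)]))
    return features


# Toy usage with made-up, pre-segmented queries.
queries = [
    ["new", "york", "hotel"],
    ["new", "york", "weather"],
    ["cheap", "new", "york", "flights"],
    ["new", "phone", "price"],
]
feats = pair_features(queries)
```

A pair that co-occurs often and is surrounded by varied contexts tends to behave like a single unit, which is the usual intuition behind these measures in new-word discovery. In a pipeline like the one described, such tuples could serve as feature vectors for an SVM classifier, and the counting step would presumably be distributed (the title mentions Hadoop); this single-process sketch attempts neither.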

【Key words】 Natural Language Processing; Tightness; Support Vector Machine; Chinese Segmentation

【Online publication submitter】 Beijing Jiaotong University

【Classification number】 TP391.3; TP18
【Downloads】 112
