Automatic Extraction of Spoken Words by the Spoken Language Measurement

【Authors】 Hou Min, Zhang Yuqiang, He Wei, Zou Yu, Teng Yonglin; Broadcasting Media Language Research Center, Communication University of China, Beijing 100024

【Abstract】 The biggest obstacles to the automatic extraction of spoken words are the scarcity of spoken-language corpora and the fuzzy definition of what counts as a spoken word. This paper exploits the fact that broadcast (radio and television) corpora contain both written-style and spoken-style language, and proposes a model for computing a word's degree of spokenness. The model is based on logistic regression, with the generality of each word's spatial distribution as the covariate. By measuring the difference between a word's spatial distribution in the written-style corpus and in the spoken-style corpus, it effectively quantifies the word's degree of spokenness, which enables spoken words to be extracted automatically. Experiments on a corpus of about 11 million characters show a precision of 85% for words that occur in both spoken and written language and 76.5% for words that occur only in spoken language, with an overall precision of 79.3%.
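
The abstract names logistic regression over distribution-based covariates but gives no formula or coefficients, so the following is only a minimal sketch of how such a score could be computed. The function names, the two-covariate form (one generality value per corpus), the coefficient values, and the 0.5 threshold are all illustrative assumptions, not the authors' fitted model.

```python
import math


def spokenness(written_generality, spoken_generality,
               b0=0.0, b_written=-4.0, b_spoken=4.0):
    """Logistic score in (0, 1); higher means more spoken-like.

    The coefficients are hypothetical placeholders. In practice they
    would be estimated from labelled words.
    """
    z = b0 + b_written * written_generality + b_spoken * spoken_generality
    return 1.0 / (1.0 + math.exp(-z))


def extract_spoken_words(features, threshold=0.5):
    """Return the words whose score exceeds the (assumed) threshold."""
    return [word for word, (wg, sg) in features.items()
            if spokenness(wg, sg) > threshold]


# Toy input: word -> (generality in written corpus, generality in spoken corpus).
# These numbers are invented purely for illustration.
toy_features = {
    "word_a": (0.9, 0.2),  # widely distributed in written texts
    "word_b": (0.1, 0.8),  # widely distributed in spoken texts
    "word_c": (0.0, 0.6),  # occurs only in spoken texts
}
spoken_words = extract_spoken_words(toy_features)
```

With these placeholder coefficients, a word distributed equally widely in both corpora scores exactly 0.5, so the decision depends only on the difference between the two distributions, which is the intuition the abstract describes.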