基于目标逼近特征和双向联想贮存器的情感语音基频转换

F0 Transformation for Emotional Speech Synthesis Using Target Approximation Features and Bidirectional Associative Memories

【Authors】 Li Gao; Zhen-Hua Ling; Li-Rong Dai

【Affiliation】 Department of Electronic Engineering and Information Science, University of Science and Technology of China

【Abstract】 This paper proposes an F0 transformation method for emotional speech synthesis. Quantitative target approximation (qTA) features represent the F0 contour at the syllable level, and a Gaussian bidirectional associative memory (GBAM) converts the syllable-level qTA parameters of synthesized neutral speech into those of the target emotional speech. In the training stage, a neutral speech synthesis system is first built with HMM-based statistical parametric speech synthesis, using a neutral corpus as the training set. Then, with a small amount of emotional recordings, the GBAM transformation model is trained: the qTA parameters extracted from neutral speech synthesized for the text of the emotional recordings serve as the source features, and the qTA parameters extracted from the emotional recordings themselves serve as the target features. At synthesis time, the trained GBAM model transforms the syllable-level F0 features of synthesized neutral speech toward the target emotion. Experimental results indicate that, when little emotional recording data is available, the proposed method achieves better emotional expressiveness than model adaptation using maximum likelihood linear regression (MLLR).
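
As an illustration of the syllable-level qTA description mentioned in the abstract, the following minimal Python sketch evaluates an F0 contour from one set of qTA parameters. The abstract does not give the parameterization, so the sketch assumes the commonly published qTA form, in which F0 approaches a linear pitch target m*t + b as a third-order critically damped response with rate lam. The function name `qta_contour`, its arguments, and the sample values are hypothetical, and the GBAM mapping itself is not sketched because the abstract gives no details of it.

```python
import math


def qta_contour(m, b, lam, f0_init, df0_init, ddf0_init, duration, step=0.005):
    """Sample one syllable's F0 contour under the assumed qTA form.

    m, b      : slope and height of the linear pitch target m*t + b
    lam       : rate at which F0 approaches the target
    f0_init, df0_init, ddf0_init : F0 and its first two derivatives at the
                syllable onset (normally carried over from the previous syllable)
    duration  : syllable duration in seconds
    step      : sampling interval in seconds (hypothetical default)
    """
    # Coefficients of the transient term, fixed by the onset conditions.
    c1 = f0_init - b
    c2 = df0_init + c1 * lam - m
    c3 = (ddf0_init + 2.0 * c2 * lam - c1 * lam ** 2) / 2.0

    n = int(round(duration / step)) + 1
    times = [i * step for i in range(n)]
    # F0 = target line + decaying transient.
    f0 = [
        (m * t + b) + (c1 + c2 * t + c3 * t * t) * math.exp(-lam * t)
        for t in times
    ]
    return times, f0


# Hypothetical example: a static target at 100 (semitones or Hz, depending on
# the chosen F0 scale), starting from 90 with zero initial velocity and
# acceleration.
times, f0 = qta_contour(m=0.0, b=100.0, lam=40.0,
                        f0_init=90.0, df0_init=0.0, ddf0_init=0.0,
                        duration=0.5)
print(f0[0], f0[-1])
```

In a conversion setting of the kind the abstract describes, the per-syllable parameters (here m, b, and lam) would be the quantities mapped from neutral to emotional speech, and the onset conditions would be passed from one syllable to the next so that the contour stays continuous.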