
Research on Personalized Speech Generation

【Author】 Shuang Zhiwei (双志伟)

【Advisors】 Dai Lirong; Wang Renhua

【Author Information】 University of Science and Technology of China, Signal and Information Processing, 2011, PhD

【摘要 (Chinese Abstract, translated)】 Personalized speech generation refers to generating speech that carries the characteristics of a specific speaker. Its applications are broad: changing the voice of a speech synthesis system to provide personalized synthetic speech, hiding a speaker's true identity in voice chat and online games, or imitating another person's voice in multimedia messages for entertainment. Two methods are currently most widely used: voice conversion and speech synthesis model adaptation. Each has its own strengths and weaknesses and suits different application scenarios. This thesis examines the characteristics of, and connections between, the two methods, improves them in light of their known problems and the practical requirements of real applications, and verifies the improvements through system evaluations. The thesis consists of five parts.

The first part summarizes and analyzes speaker characteristics, the practical requirements of personalized speech generation, and the properties and usage scenarios of the different generation methods. It briefly introduces the acoustic mechanism and mathematical models of human speech production, summarizes the various speaker characteristic parameters on that basis, then analyzes the practical requirements of personalized speech generation and discusses the advantages, disadvantages, and appropriate scenarios of each method.

The second part systematically analyzes the two most widely used families of voice conversion methods: GMM-based methods and codebook-mapping methods. It first introduces the GMM-based method and several of its most important variants, then presents Abe's classic codebook-mapping method and the STASC codebook-mapping method proposed by Arslan. The two families are then compared systematically, and their respective strengths and weaknesses are identified. Finally, two problems common to both families, observed in practice, are discussed: 1. the mismatch between the aligned data of the source and target speakers; 2. the over-smoothing of the converted spectrum. These analyses and discussions guide this thesis toward a new voice conversion method.

The third part, addressing the problems of existing voice conversion methods, proposes a method based on frequency warping, in which the warping function is generated from mapped formant parameters of the source and target speakers. The method has two advantages: it requires very little training data, and the converted speech has high quality. To further improve similarity to the target speaker, the thesis proposes a method combining frequency warping with unit selection to improve similarity in spectral detail. Frequency warping is performed first, and the warped spectrum serves as the target for unit selection; part of the warped spectrum is then replaced with the selected real spectra of the target speaker, and finally the converted speech is reconstructed. Evaluation results show that the proposed frequency-warping method yields converted speech of much higher quality than other methods and strikes a good balance between quality and similarity, and that combining frequency warping with unit selection significantly improves similarity over frequency warping alone.

The fourth part, addressing a practical problem in multilingual speech synthesis, innovatively combines speech synthesis model adaptation with voice conversion to build a multilingual synthesis system. Mixed Chinese-English text is increasingly common, and to keep the synthetic speech natural and coherent, such text is usually required to be rendered in a single voice. However, because many Chinese speakers' English pronunciation is not professional, English synthesized from models trained directly on such data sounds unnatural. We therefore propose using personalized speech generation to leverage a native English speaker's model and obtain more natural English synthesis in the Chinese speaker's voice. While maximum-likelihood model adaptation modifies the spectrum models, the prosody adjustment of voice conversion modifies the prosody models for more natural synthesized prosody. Evaluation results show that this method produces more natural synthetic speech and more consistent Chinese and English timbre than other methods. Notably, the system has been deployed on the official website of Shanghai EXPO 2010 to help visually impaired users listen to the site's content.

The fifth part concludes the thesis and discusses future work.

【Abstract】 Personalized speech generation is the generation of speech with the characteristics of a target speaker. It has many applications. An important one is building customized text-to-speech systems for different companies: a TTS system with a company's preferred voice can be created quickly and inexpensively by modifying the original speaker's speech corpus. Personalized speech generation can also be used to hide a speaker's identity during chatting and online gaming, or to mimic another person's voice in multimedia messages for entertainment. Currently, two personalized speech generation methods are widely used: 1. voice conversion; 2. speech synthesis model adaptation. Each has its own advantages and disadvantages and suits different applications. In this thesis, we analyze the characteristics and connections of these two methods, improve them according to their existing problems and the practical requirements of real applications, and verify the effectiveness of the improvements through evaluations.

In the first chapter, we summarize speaker characteristics, the requirements of personalized speech generation, and the merits and appropriate usage scenarios of the different personalized speech generation methods. We first introduce pronunciation models, on the basis of which we summarize the different speaker characteristic features. We then analyze the practical requirements of different personalized speech generation applications and discuss the characteristics and appropriate applications of each method.

In the second chapter, we give a detailed introduction and analysis of the two most popular families of voice conversion methods: GMM-based methods and codebook-mapping methods. We first introduce the GMM-based method and several of its most important variations, and then the traditional codebook-mapping method proposed by Abe and the STASC method proposed by Arslan.
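As a rough illustration of the GMM-based conversion family discussed above, the sketch below shows the classic posterior-weighted linear regression that maps a source feature toward the target speaker. All parameters (a two-component joint GMM over scalar features) are invented for illustration; in practice the joint model is trained on aligned source/target spectral vectors.

```python
import numpy as np

# Hypothetical 2-component joint GMM over (source, target) scalar features.
# These numbers are made up for illustration, not trained values.
weights = np.array([0.5, 0.5])
mu_x = np.array([0.0, 4.0])      # per-component source means
mu_y = np.array([1.0, 6.0])      # per-component target means
var_xx = np.array([1.0, 1.0])    # per-component source variances
cov_yx = np.array([0.8, 0.9])    # per-component cross-covariances

def posterior(x):
    """P(component i | source feature x) under the joint GMM."""
    lik = weights * np.exp(-0.5 * (x - mu_x) ** 2 / var_xx) / np.sqrt(2 * np.pi * var_xx)
    return lik / lik.sum()

def convert(x):
    """Classic GMM regression: E[y | x], a posterior-weighted sum of
    per-component linear predictors mu_y + cov_yx / var_xx * (x - mu_x)."""
    p = posterior(x)
    per_comp = mu_y + cov_yx / var_xx * (x - mu_x)
    return float(np.dot(p, per_comp))
```

Averaging over components in this way is also what produces the over-smoothing problem the thesis discusses: the converted spectrum is pulled toward component means.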
We then compare and analyze the advantages and disadvantages of these two families. Finally, we discuss two problems common to both methods that we observed in practical applications: 1. the mismatch between the aligned training data of the source and target speakers; 2. the over-smoothing of the converted spectrum. These comparisons and discussions guide us toward a new voice conversion method.

In Chapter 3, we propose a novel voice conversion method using frequency warping, motivated by the problems of current methods. The frequency-warping function is generated by mapping the formants of the source and target speakers. With this method, only a very small amount of training data is required to generate the warping function, which greatly facilitates its application. To further improve similarity to the target speaker, we propose a method that combines frequency warping with unit selection of the target speaker's real spectra. Frequency warping first generates the warped source spectrum, which serves as an estimated target for the subsequent unit selection of the target speaker's spectra. Part of the warped source spectrum is then replaced by the selected real spectra of the target speaker before the converted speech is reconstructed. Formal evaluation results show that the proposed frequency-warping method achieves much better converted-speech quality than other methods while striking a good balance between quality and similarity. The results also show that the combined method significantly improves the similarity score compared with frequency warping alone.

In Chapter 4, to solve a practical problem encountered in speech synthesis for mixed-language text, we implement a mixed-language speech synthesis system based on a novel personalized speech generation method that combines speech synthesis model adaptation with voice conversion.
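The formant-based frequency warping described for Chapter 3 can be pictured as a piecewise-linear remapping of the frequency axis anchored at corresponding source/target formant frequencies. The sketch below is a minimal illustration under that assumption; the anchor frequencies in Hz are invented values, not data from the thesis.

```python
import numpy as np

# Invented anchor points: band edges plus three mapped formant frequencies.
src_anchors = [0.0, 730.0, 1090.0, 2440.0, 8000.0]   # source formants (Hz)
tgt_anchors = [0.0, 600.0, 1200.0, 2600.0, 8000.0]   # corresponding target formants

def warp(freq_hz):
    """Map one source-axis frequency to the target axis by linear
    interpolation between corresponding formant anchors."""
    return float(np.interp(freq_hz, src_anchors, tgt_anchors))

# Warping the frequency axis of one magnitude-spectrum frame (257 FFT bins):
fft_bins = np.linspace(0.0, 8000.0, 257)
warped_axis = np.array([warp(f) for f in fft_bins])
```

Because only a handful of formant pairs define the function, very little aligned data is needed, which matches the low-data advantage claimed for the method.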
When synthesizing Chinese text mixed with English, it is usually preferred to render the mixed-language content with a single voice. However, the synthesized English of an HMM-based TTS system may sound unnatural if the models are built directly from a Chinese speaker's unprofessional English data. We therefore propose using personalized speech generation to leverage a native English speaker's model and generate more natural English in the Chinese speaker's voice. MLLR speaker adaptation is used to adapt the spectrum models of the native speaker, while the prosody adjustment of voice conversion is applied to the prosody models for better prosody. At synthesis time, the mixed-language content shares a unified prosody tree to improve continuity between the Chinese and English parts. Evaluation results show that the proposed method significantly improves speaker consistency and the naturalness of the synthesized speech for mixed-language text compared with directly built models. It is worth mentioning that this system has been used on the official website of Shanghai EXPO 2010 to help visually impaired people listen to the web content.

Chapter 5 summarizes this thesis and discusses future work.
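The MLLR adaptation step mentioned above can be sketched as a shared affine transform applied to the Gaussian mean vectors of the native speaker's spectrum models. This is a minimal illustration of the transform's form only; the matrix, bias, and means below are invented placeholders, not estimated values.

```python
import numpy as np

# MLLR-style mean adaptation: mu' = A @ mu + b, with A and b shared across
# a class of Gaussians and estimated from the target speaker's adaptation
# data. The values here are invented for illustration.
A = np.array([[1.1, 0.0],
              [0.05, 0.95]])     # rotation/scaling part of the transform
b = np.array([0.2, -0.1])        # bias part of the transform

def adapt_mean(mu):
    """Apply the affine transform to one Gaussian mean vector."""
    return A @ mu + b

# Two hypothetical 2-dimensional model means, adapted in place of retraining:
means = np.array([[1.0, 2.0], [0.0, 1.0]])
adapted = np.array([adapt_mean(m) for m in means])
```

Because one transform covers many Gaussians, the native speaker's full model can be shifted toward the Chinese speaker's timbre from a small amount of adaptation data, while the prosody models are handled separately as described above.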
