Prerequisites
style token
Key conclusions
- End-to-end models need at least 10 hours of data.
- According to [1], around 10 hours of speech-transcript pairs from one speaker are needed for a neural end-to-end TTS model such as Tacotron to reach high quality.
- For multi-speaker TTS, tens of minutes of data per speaker are needed.
- In order to support multiple speakers, we usually have to use tens of minutes of training data for every speaker, which makes collecting high-quality data laborious.
- Multi-speaker TTS mainly relies on learning the speakers' latent structure; most of these methods rely on a speaker embedding.
- Speaker adaptation method: fine-tuning a pre-trained multi-speaker model entirely, or only its speaker embedding.
- Speaker encoding method: training a separate model to predict the new speaker's embedding from a small amount of data.
- The speaker embedding learned by the speaker encoder can represent the relation between pronunciations across the two languages.
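The difference between the two adaptation strategies above comes down to which parameters are left trainable. A minimal Python sketch; the parameter names are invented for illustration:

```python
# Hypothetical parameter table of a pre-trained multi-speaker model;
# the names are invented for this illustration.
params = {"encoder.w": None, "decoder.w": None, "speaker_embedding": None}

def trainable_params(strategy):
    """Speaker adaptation: fine-tune the whole model, or only the
    speaker embedding. (Speaker encoding instead trains a separate
    network to predict the embedding, so nothing here is updated.)"""
    if strategy == "whole-model":
        return list(params)
    if strategy == "embedding-only":
        return ["speaker_embedding"]
    raise ValueError(f"unknown strategy: {strategy}")
```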
Questions
- What does "initialize the decoder with the speaker embedding" mean?
Paper
Contributions
- Multi-speaker, cross-lingual (cross-lingual TTS aims to build a system which can synthesize speech in a specific language not spoken by the target speaker), and end-to-end.
- A separately trained neural speaker embedding network, used to represent different speakers as well as the latent structure of their pronunciations.
- Bilingual training: making Chinese speakers speak English, and vice versa.
- Adapting to new speakers with a small amount of data.
Work
- One is that we further discuss how to use a limited amount of data to achieve multi-speaker TTS.
- Secondly, we analyze end-to-end models in a cross-lingual setting.
System architecture
Comprises a speaker encoder, Tacotron 2, and a vocoder.
speaker encoder
- Follows the ResCNN in [2]. Illustrated by Figure 1 in this paper and Figures 2 and 3 in [2].
- The numbers of filters in the convolution layers are 64, 128, 256, and 512, respectively.
- Trained separately, then fine-tuned: we first train the speaker embedding network separately on a speaker verification task with softmax loss, and then fine-tune the whole model with a triplet loss, which maps speech into a feature space where distances correspond to speaker similarity.
- First, we pre-train the speaker encoder for 10 epochs with softmax loss, using a minibatch size of 32, as it converges to an approximate local optimum.
- Then the model is fine-tuned with triplet loss for 10 epochs using a minibatch size of 64.
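The second training stage hinges on the triplet loss. A minimal numpy sketch of a cosine-similarity triplet loss in the spirit of Deep Speaker [2]; the margin value is an assumption, not taken from the paper:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Triplet loss on speaker embeddings: pull same-speaker pairs
    together, push different-speaker pairs apart.
    `margin` is a hypothetical value for this sketch."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    # Positive when the negative pair is not at least `margin` less
    # similar to the anchor than the positive pair.
    return max(0.0, cos(anchor, negative) - cos(anchor, positive) + margin)
```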
Mel-Spectrogram Generation Network
- The framework uses Tacotron 2.
- The encoder input is a phoneme sequence.
- Four ways of using the speaker embedding are tried:
- (1). Concatenate speaker embedding to each time step of the encoder;
- (2). Apply an affine transformation to the speaker embedding, then concatenate it to each time step of the encoder;
- (3). Initialize the encoder with the speaker embedding;
- (4). Initialize the decoder with speaker embedding.
- Conclusions:
- (1) performs poorly; the timbre is inconsistent.
- (3) produces obvious noise.
- (2) is more stable and fluent.
- (4) performs best.
- Finally, (2)+(4) are combined: the speaker embedding first passes through an affine layer and is concatenated to the encoder time steps, and it also initializes the decoder.
- We normalize the mel-spectrograms to [-4, 4] in preprocessing in order to reduce blurriness in the synthesized audio.
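The chosen combination (2)+(4) and the mel normalization can be sketched in numpy as follows; the dimensions and the min-max normalization scheme are assumptions for illustration, not values from the paper:

```python
import numpy as np

# Hypothetical dimensions; the paper does not state them here.
ENC_DIM, SPK_DIM, DEC_DIM = 512, 256, 1024

rng = np.random.default_rng(0)
W_aff = rng.normal(size=(SPK_DIM, ENC_DIM)) * 0.01   # affine for option (2)
b_aff = np.zeros(ENC_DIM)
W_init = rng.normal(size=(SPK_DIM, DEC_DIM)) * 0.01  # projection for option (4)

def condition(encoder_out, spk_emb):
    """Option (2): affine-transform the speaker embedding and concatenate
    it to every encoder time step. Option (4): project it to the
    decoder's initial hidden state."""
    e = spk_emb @ W_aff + b_aff                      # (ENC_DIM,)
    tiled = np.tile(e, (encoder_out.shape[0], 1))    # (T, ENC_DIM)
    conditioned = np.concatenate([encoder_out, tiled], axis=-1)
    dec_h0 = np.tanh(spk_emb @ W_init)               # decoder init state
    return conditioned, dec_h0

def normalize_mel(mel, lo=-4.0, hi=4.0):
    """Scale a mel-spectrogram into [-4, 4]; simple min-max scaling is
    assumed here, the paper does not give the exact scheme."""
    m_min, m_max = mel.min(), mel.max()
    return (mel - m_min) / (m_max - m_min) * (hi - lo) + lo
```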
vocoder
Experiments
Data
Corpora
The model is trained with two monolingual corpora:
| Language | Corpus | Duration | Speakers | Sentences | Various accents | For |
| --- | --- | --- | --- | --- | --- | --- |
| EN | VCTK corpus | 44h | 109 | 400 | YES | train |
| ZH | subset of [4] | - | 855 | 120 | - | train |
| EN | [5] | - | 7 | - | YES | test |
| ZH | internal | - | 7 | 120 | - | test |
Data parameters
- 16kHz
- Trim leading and trailing silence.
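Silence trimming can be sketched with a simple frame-energy threshold; the frame length and amplitude threshold below are hypothetical (toolkits such as librosa's `librosa.effects.trim` use a dB threshold instead):

```python
import numpy as np

def trim_silence(wav, frame=256, threshold=1e-3):
    """Trim leading and trailing silence by per-frame peak amplitude.
    `frame` and `threshold` are hypothetical values for this sketch."""
    frames = np.abs(wav[: len(wav) // frame * frame]).reshape(-1, frame)
    voiced = np.where(frames.max(axis=1) > threshold)[0]
    if len(voiced) == 0:
        return wav[:0]  # all silence
    return wav[voiced[0] * frame : (voiced[-1] + 1) * frame]
```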
Data usage
| For | Data |
| --- | --- |
| training multi-speaker model | 337 Chinese speakers and 109 English speakers |
| validation | 8 Chinese speakers and 8 English speakers |
| testing (Seen) | 2 Chinese, 2 English speakers |
| new speaker adaptation (Unseen) | 3 Chinese, 3 English speakers |
- All have a distribution similar to the training set in terms of gender and accent.
- IPA (International Phonetic Alphabet) is used to convert Chinese and English into the same representation.
- Improves pronunciation accuracy.
- Unifies the phonetic transcriptions of the different languages.
- Both languages are converted to phoneme sequences (using a grapheme-to-phoneme (G2P) library).
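The shared-representation idea can be illustrated with a toy lexicon-based G2P lookup; the entries below are invented for the example, whereas the paper uses an actual G2P library and an IPA inventory:

```python
# Toy shared lexicon mapping words from both languages into one
# IPA-like phoneme inventory; the transcriptions are invented
# for this illustration.
LEXICON = {
    "hello": ["h", "ə", "l", "oʊ"],   # English entry
    "你好":   ["n", "i", "x", "aʊ"],   # Mandarin entry (toy transcription)
}

def to_phonemes(text):
    """Look each word up in the shared lexicon, producing one phoneme
    sequence regardless of the source language."""
    phonemes = []
    for word in text.split():
        phonemes.extend(LEXICON.get(word, ["<unk>"]))
    return phonemes
```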
Experiment parameters
- batch size = 32
- one Nvidia V100 GPU
- L2 regularization
Method
Conclusions
- The different training sets have a significant impact on the speaker embedding.
- The learned speaker embedding can represent the relation between pronunciations across the two languages.
- For the bilingual speaker voice generated by our model, it carries the effect of his/her mother tongue while speaking another language.
- We observe that phonemes with similar pronunciations tend to stay closer to each other than the others across the two languages.
- Our results show that the multi-speaker TTS model can extract speaker characteristics as well as language pronunciations with the speaker embedding from the latent space.
- Judging from the MOS scores, the results are not ideal.
- When the MOS falls below 3.6, the quality no longer satisfies practical needs.
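The observation that similar phonemes lie close in the embedding space can be checked with a nearest-neighbour query under cosine similarity, sketched here in numpy (the phoneme labels are placeholders):

```python
import numpy as np

def nearest(query, bank):
    """Return the key in `bank` whose embedding has the highest cosine
    similarity to `query`; e.g. which English phoneme embedding lies
    closest to a given Mandarin phoneme embedding."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(bank, key=lambda k: cos(query, bank[k]))
```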
idea
- The speaker ID should be fed to the decoder (as initialization).
reference
- [1] Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani, “VoiceLoop: Voice fitting and synthesis via a phonological loop,” arXiv preprint arXiv:1707.06588, 2017.
- [2] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, “Deep Speaker: An end-to-end neural speaker embedding system,” arXiv preprint arXiv:1705.02304, 2017.
- [3] 一文读懂卷积神经网络 (Chinese blog post: “Understanding convolutional neural networks in one article”).
- [4] surfing.ai, “ST-CMDS-20170001_1, Free ST Chinese Mandarin Corpus,” 2017.
- [5] J. Kominek and A. W. Black, “The CMU Arctic speech databases,” in Fifth ISCA Workshop on Speech Synthesis, 2004.