Draft notes

Prerequisites

style token

Established findings

  • End-to-end TTS needs at least 10 hours of data.
  • According to [1], around 10 hours of speech-transcript pairs from a single speaker are needed for a neural end-to-end TTS model such as Tacotron to reach high quality.
  • For multi-speaker TTS, tens of minutes per speaker are needed.
    • To support multiple speakers, we usually need tens of minutes of training data for every speaker, which makes collecting high-quality data laborious.
  • Multi-speaker modeling mainly relies on learning the latent structure of the speaker; most of these methods rely on a speaker embedding. Two common approaches:
    • speaker adaptation: fine-tuning a pre-trained multi-speaker model, either entirely or only its speaker embedding (a minimal sketch follows this list).
    • speaker encoding: training a separate model that predicts a new speaker's embedding from a small amount of data.
  • The speaker embedding learned by the speaker encoder can represent the relation between pronunciations across the two languages.
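
A minimal PyTorch sketch of the speaker adaptation variant that fine-tunes only the speaker embedding; the stand-in model, its sizes, and the learning rate are assumptions for illustration, not the paper's setup.

```python
import torch
import torch.nn as nn

# Sketch of "speaker adaptation": freeze a pre-trained multi-speaker
# model and fine-tune only its speaker embedding table on the new
# speaker's data. The tiny model below is a stand-in, not the paper's
# architecture.
model = nn.ModuleDict({
    "speaker_embedding": nn.Embedding(num_embeddings=100, embedding_dim=64),
    "decoder": nn.Linear(64, 80),  # placeholder for the acoustic model
})
for p in model.parameters():
    p.requires_grad = False                              # freeze everything...
model["speaker_embedding"].weight.requires_grad = True   # ...except the embedding
optimizer = torch.optim.Adam([model["speaker_embedding"].weight], lr=1e-4)
```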

Questions

  • What exactly does "initialize the decoder with the speaker embedding" mean?

Paper

Contributions

  • Multi-speaker, cross-lingual (cross-lingual TTS aims to build a system that can synthesize speech in a language not spoken by the target speaker), end-to-end.
  • Implements a separately trained neural speaker embedding network that represents different speakers as well as the latent structure of pronunciation.
  • Bilingual training: making Chinese speakers speak English, and vice versa.
  • Adapting to new speakers with a small amount of data.

Work

  • First, we further discuss how to use a limited amount of data to achieve multi-speaker TTS.
    • To be answered.
  • Second, we analyze end-to-end models in a cross-lingual setting.
    • To be answered.

System architecture

Consists of a speaker encoder, Tacotron 2 (T2), and a vocoder.

[Figure: overall system architecture]

speaker encoder

  • Follows the ResCNN in [2]. The first figure below is Fig. 1 of this paper; Figs. 2 and 3 are from [2].
  • The numbers of filters in the convolution layers are 64, 128, 256, and 512, respectively.
  • Trained separately, then fine-tuned: the speaker embedding network is first trained on a speaker verification task with a softmax loss, and then the whole model is fine-tuned with a triplet loss, which maps speech into a feature space where distances correspond to speaker similarity.
  • The speaker encoder is first pre-trained for 10 epochs with the softmax loss and a minibatch size of 32, until it converges to an approximate local optimum.
  • The model is then fine-tuned with the triplet loss for 10 epochs using a minibatch size of 64 (a minimal sketch of the loss follows).
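
A minimal sketch of the triplet-loss fine-tuning objective; the embedding size and the margin value are illustrative assumptions, and the cosine-similarity formulation follows Deep Speaker [2].

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss over speaker embeddings: anchor/positive are
    utterances from the same speaker, negative from a different one.
    The margin value here is an illustrative assumption."""
    sim_ap = F.cosine_similarity(anchor, positive, dim=-1)
    sim_an = F.cosine_similarity(anchor, negative, dim=-1)
    # Same-speaker similarity must exceed different-speaker similarity
    # by at least the margin, otherwise a penalty is incurred.
    return F.relu(sim_an - sim_ap + margin).mean()

# Example with random 512-d embeddings for a batch of 8 utterances:
loss = triplet_loss(torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512))
```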

[Figures: speaker encoder architecture — Fig. 1 from this paper, Figs. 2 and 3 from [2]]

Mel-Spectrogram Generation Network

  • The framework is Tacotron 2.
  • The encoder input is a phoneme sequence.
  • Four ways of using the speaker embedding are tried:
  • (1) Concatenate the speaker embedding to each time step of the encoder output;
  • (2) Apply an affine transformation to the speaker embedding, then concatenate it to each time step of the encoder output;
  • (3) Initialize the encoder with the speaker embedding;
  • (4) Initialize the decoder with the speaker embedding.
  • Findings:
    • (1) performs poorly; the timbre is inconsistent.
    • (3) produces obvious noise.
    • (2) is more stable and fluent.
    • (4) works best.
    • The final system combines (2) + (4): the speaker embedding first passes through an affine layer and is concatenated to the encoder output at each time step, and it is also used to initialize the decoder (see the sketch after this list).
  • The mel-spectrograms are normalized to [-4, 4] during preprocessing in order to reduce blurriness in the synthesized audio.
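
A minimal PyTorch sketch of the final (2) + (4) conditioning scheme; the module names and all dimensions are assumptions for illustration, not taken from the paper.

```python
import torch
import torch.nn as nn

class SpeakerConditioning(nn.Module):
    """Sketch of the chosen conditioning scheme, (2) + (4): an affine
    layer on the speaker embedding concatenated to every encoder time
    step, plus decoder initial states derived from the embedding."""

    def __init__(self, spk_dim=512, enc_dim=512, dec_dim=1024):
        super().__init__()
        self.affine = nn.Linear(spk_dim, spk_dim)   # method (2)
        self.to_h0 = nn.Linear(spk_dim, dec_dim)    # method (4)
        self.to_c0 = nn.Linear(spk_dim, dec_dim)

    def forward(self, enc_out, spk_emb):
        # enc_out: (B, T, enc_dim); spk_emb: (B, spk_dim)
        s = self.affine(spk_emb)                          # (2) affine transform
        s = s.unsqueeze(1).expand(-1, enc_out.size(1), -1)
        cond_enc = torch.cat([enc_out, s], dim=-1)        # concat per time step
        # (4) derive the decoder LSTM's initial hidden/cell states.
        h0 = torch.tanh(self.to_h0(spk_emb)).unsqueeze(0)
        c0 = torch.tanh(self.to_c0(spk_emb)).unsqueeze(0)
        return cond_enc, (h0, c0)
```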

vocoder

  • Griffin-Lim with 60 iterations.
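
A minimal sketch of the vocoding step using librosa's Griffin-Lim implementation; only the 60 iterations come from the notes, while the STFT parameters and the assumption that the mel-spectrogram has already been mapped back to a linear magnitude spectrogram are illustrative.

```python
import librosa

# Invert a linear-frequency magnitude spectrogram `mag` with 60
# Griffin-Lim iterations. Hop/window lengths are assumed values for
# 16 kHz audio, not the paper's exact settings.
def vocode(mag, n_iter=60, hop_length=200, win_length=800):
    return librosa.griffinlim(mag, n_iter=n_iter,
                              hop_length=hop_length,
                              win_length=win_length)
```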

Experiments

Data

Corpora

The model is trained on two monolingual corpora:

Language | Corpus | Duration | Speakers | Sentences/speaker | Various accents | Use
EN | VCTK corpus | 44 h | 109 | 400 | yes | train
ZH | subset of [4] | - | 855 | 120 | - | train
EN | [5] | - | 7 | - | yes | test
ZH | internal | - | 7 | 120 | - | test

Audio preprocessing

  • 16 kHz sampling rate.
  • Trim leading and trailing silence (a preprocessing sketch follows).
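
A minimal preprocessing sketch with librosa; the file name and the top_db silence threshold are assumptions, as the notes do not give the exact settings.

```python
import librosa

# Load at 16 kHz and trim leading/trailing silence.
y, sr = librosa.load("utterance.wav", sr=16000)
y_trimmed, _ = librosa.effects.trim(y, top_db=30)
```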

Data usage

Purpose | Data
training the multi-speaker model | 337 Chinese and 109 English speakers
validation | 8 Chinese and 8 English speakers
testing (seen) | 2 Chinese and 2 English speakers
new speaker adaptation (unseen) | 3 Chinese and 3 English speakers
  • All splits have a distribution similar to the training set in terms of gender and accent.
  • IPA (the International Phonetic Alphabet) converts Chinese and English into the same representation:
    • improves pronunciation accuracy;
    • unifies the phonetic transcriptions of the different languages.
  • All text is converted into phoneme sequences (using a grapheme-to-phoneme (G2P) library; see the sketch below).
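
A minimal front-end sketch using the open-source g2p_en library; the notes do not name the exact G2P tool, so this is one plausible choice, and the paper additionally maps phonemes to IPA afterwards.

```python
# Grapheme-to-phoneme conversion for English text.
from g2p_en import G2p

g2p = G2p()
print(g2p("speaker embedding"))
# e.g. ['S', 'P', 'IY1', 'K', 'ER0', ' ', 'EH0', 'M', 'B', 'EH1', 'D', 'IH0', 'NG']
```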

Experimental settings

  • batch size = 32
  • one NVIDIA V100 GPU
  • L2 regularization (an optimizer sketch follows this list)
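
A minimal optimizer sketch with L2 regularization implemented as weight decay; the coefficient, learning rate, and stand-in model are assumptions, since the notes only state that L2 regularization is used.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # stand-in for the TTS model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-6)
```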

Method

  • Pre-train the decoder.

Conclusions

  • Different training sets have a significant impact on the speaker embedding.
  • The learned speaker embedding can represent the relation between pronunciations across the two languages.
  • For the bilingual voices generated by the model, the speaker's mother tongue imposes its effect while he/she speaks the other language.
  • Phonemes with similar pronunciations tend to lie closer together than the others across the two languages.
  • The results show that the multi-speaker TTS model can extract speaker characteristics as well as language pronunciations from the latent space via the speaker embedding.
  • Judging from the MOS scores, the results are not yet ideal.
  • When MOS falls below 3.6, quality no longer satisfies practical requirements.

idea

  • Phoneme input.
  • The generated voice is influenced by the speaker's mother tongue.

  • The speaker ID should be fed to the decoder (as initialization).

references

  • [1] Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani, "VoiceLoop: Voice fitting and synthesis via a phonological loop," arXiv preprint arXiv:1707.06588, 2017.
  • [2] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, "Deep Speaker: an end-to-end neural speaker embedding system," arXiv preprint arXiv:1705.02304, 2017.
  • [3] 一文读懂卷积神经网络 ("Understanding convolutional neural networks in one article," Chinese blog post).
  • [4] surfing.ai, "ST-CMDS-20170001_1, Free ST Chinese Mandarin corpus," 2017.
  • [5] J. Kominek and A. W. Black, "The CMU Arctic speech databases," in Fifth ISCA Workshop on Speech Synthesis, 2004.