Author: zqh_zy
Link: http://www.jianshu.com/p/c5fb943afaba
Source: Jianshu (简书)
Copyright belongs to the author. For commercial reproduction, please contact the author for authorization; for non-commercial reproduction, please credit the source.
This article walks through the Kaldi source code to analyze the input and output of the neural network when training a DNN acoustic model. DNN training requires the GMM-HMM model trained beforehand. Taking the trained monophone model as an example, that model is used to perform Viterbi alignment, a step which maps every frame of each speech file to a transition-id.
Let's inspect the alignment result:
$ copy-int-vector "ark:gunzip -c ali.1.gz|" ark,t:- | head -n 1
speaker001_00003 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 16 15 15 15 18 890 889 889 889 889 889 889 892 894 893 893 893 86 88 87 90 89 89 89 89 89 89 89 89 89 89 89 89 89 89 194 193 196 195 195 198 197 386 385 385 385 385 385 385 385 385 388 387 387 390 902 901 901 904 903 906 905 905 905 905 905 905 905 905 905 905 905 914 913 913 916 918 917 917 917 917 917 917 752 751 751 751 751 751 754 753 753 753 753 753 753 753 753 756 755 755 926 925 928 927 927 927 927 927 927 927 930 929 929 929 929 929 929 929 929 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 16 18
For the training utterance speaker001_00003, each of the numbers that follow denotes a transition-id, and each number corresponds to one feature vector (the vectors themselves can be inspected with copy-matrix; see the earlier notes on feature extraction). Likewise, inspect the transition-ids:
$ show-transitions phones.txt final.mdl
Transition-state 1: phone = sil hmm-state = 0 pdf = 0
 Transition-id = 1 p = 0.966816 [self-loop]
 Transition-id = 2 p = 0.01 [0 -> 1]
 Transition-id = 3 p = 0.01 [0 -> 2]
 Transition-id = 4 p = 0.013189 [0 -> 3]
Transition-state 2: phone = sil hmm-state = 1 pdf = 1
 Transition-id = 5 p = 0.970016 [self-loop]
 Transition-id = 6 p = 0.01 [1 -> 2]
 Transition-id = 7 p = 0.01 [1 -> 3]
 Transition-id = 8 p = 0.01 [1 -> 4]
Transition-state 3: phone = sil hmm-state = 2 pdf = 2
 Transition-id = 9 p = 0.01 [2 -> 1]
 Transition-id = 10 p = 0.968144 [self-loop]
 Transition-id = 11 p = 0.01 [2 -> 3]
 Transition-id = 12 p = 0.0118632 [2 -> 4]
Transition-state 4: phone = sil hmm-state = 3 pdf = 3
 Transition-id = 13 p = 0.01 [3 -> 1]
 Transition-id = 14 p = 0.01 [3 -> 2]
 Transition-id = 15 p = 0.932347 [self-loop]
 Transition-id = 16 p = 0.0476583 [3 -> 4]
Transition-state 5: phone = sil hmm-state = 4 pdf = 4
 Transition-id = 17 p = 0.923332 [self-loop]
 Transition-id = 18 p = 0.0766682 [4 -> 5]
Transition-state 6: phone = a1 hmm-state = 0 pdf = 5
 Transition-id = 19 p = 0.889764 [self-loop]
 Transition-id = 20 p = 0.110236 [0 -> 1]
...
Each transition-state corresponds to a unique pdf, and under each transition-state there are several transition-ids.
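To make the mapping concrete, here is a small illustrative sketch (hand-built from the table above; the names TRANSITIONS and pdf_of are ours, not Kaldi's) of how a transition-id resolves to its pdf, which is what ali-to-pdf does for every frame of the alignment:

# Illustrative only: a hand-built fragment of the transition table shown above,
# mapping transition-id -> (transition-state, pdf-id) for the first two states of "sil".
TRANSITIONS = {
    1: (1, 0), 2: (1, 0), 3: (1, 0), 4: (1, 0),  # transition-state 1, pdf 0
    5: (2, 1), 6: (2, 1), 7: (2, 1), 8: (2, 1),  # transition-state 2, pdf 1
}

def pdf_of(transition_id):
    """Resolve a transition-id to its pdf-id, as ali-to-pdf does internally."""
    return TRANSITIONS[transition_id][1]

# The first two alignment entries above are transition-ids 4 and 1; both map to pdf 0.
assert pdf_of(4) == 0 and pdf_of(1) == 0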
Next, let's see what the input and output of the neural network actually are, using steps/nnet as the example. Tracing into the script steps/nnet/train.sh, we find the relevant commands:
...
labels_tr="ark:ali-to-pdf $alidir/final.mdl \"ark:gunzip -c $alidir/ali.*.gz |\" ark:- | ali-to-post ark:- ark:- |"
...
feats_tr="ark:copy-feats scp:$dir/train.scp ark:- |"
...
# input-dim,
get_dim_from=$feature_transform
num_fea=$(feat-to-dim "$feats_tr nnet-forward \"$get_dim_from\" ark:- ark:- |" -)
# output-dim,
num_tgt=$(hmm-info --print-args=false $alidir/final.mdl | grep pdfs | awk '{ print $NF }')
...
dnn)
  utils/nnet/make_nnet_proto.py $proto_opts \
    ${bn_dim:+ --bottleneck-dim=$bn_dim} \
    $num_fea $num_tgt $hid_layers $hid_dim >$nnet_proto
  ;;
From this preparation stage of the training it is clear that the network's input is the transformed feature vectors (feats_tr) and its output is labels_tr. Below, we run the commands above by hand to see what the network's output (targets) looks like. labels_tr is generated in two steps:
$ ali-to-pdf final.mdl "ark:gunzip -c ali.1.gz|" ark,t:- | head -n 1
speaker001_00003 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 3 3 3 4 440 440 440 440 440 440 440 441 442 442 442 442 38 39 39 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 92 92 93 93 93 94 94 188 188 188 188 188 188 188 188 188 189 189 189 190 446 446 446 447 447 448 448 448 448 448 448 448 448 448 448 448 448 452 452 452 453 454 454 454 454 454 454 454 371 371 371 371 371 371 372 372 372 372 372 372 372 372 372 373 373 373 458 458 459 459 459 459 459 459 459 459 460 460 460 460 460 460 460 460 460 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 4
Looking at the first two frames and comparing with the alignment at the start of this article: the transition-ids are 4 and 1 respectively, and both map to pdf 0. Now pipe this result through ali-to-post:
$ ali-to-pdf final.mdl "ark:gunzip -c ali.1.gz|" ark,t:- | head -n 1 | ali-to-post ark,t:- ark,t:-
speaker001_00003 [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] ...... [ 3 1 ] [ 3 1 ] [ 3 1 ] [ 3 1 ] [ 4 1 ] [ 440 1 ] [ 440 1 ] [ 440 1 ] [ 440 1 ] [ 440 1 ] [ 440 1 ] [ 440 1 ] [ 441 1 ] [ 442 1 ] [ 442 1 ] [ 442 1 ] [ 442 1 ] [ 38 1 ] [ 39 1 ] [ 39 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 92 1 ] [ 92 1 ]...... [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 3 1 ] [ 4 1 ]
This yields the pdf-ids together with their posterior probabilities, all of which are 1 here.
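For readers who want to post-process such output outside Kaldi, a minimal parsing sketch for this text format (a hypothetical helper, not part of Kaldi, assuming one [ pdf posterior ] pair per frame):

import re

def parse_post_line(line):
    """Parse one text-format posterior line: 'utt [ pdf p ] [ pdf p ] ...'."""
    utt, rest = line.split(None, 1)
    pairs = re.findall(r"\[\s*(\d+)\s+([\d.eE+-]+)\s*\]", rest)
    return utt, [(int(pdf), float(p)) for pdf, p in pairs]

utt, post = parse_post_line("speaker001_00003 [ 0 1 ] [ 0 1 ] [ 3 1 ] [ 4 1 ]")
# Each frame contributes one (pdf-id, posterior) pair; here all posteriors are 1.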
We now have the training data and the corresponding target labels. Next, consider the dimensions of the network's input and output. The network structure is written to the nnet_proto file by utils/nnet/make_nnet_proto.py; this Python script's two important arguments, num_fea and num_tgt, are the input and output dimensions of the network. num_fea is obtained with feat-to-dim:
$ feat-to-dim scp:../tri4b_dnn/train.scp ark,t:- | grep speaker001_00003
speaker001_00003 40
The features here are 40-dimensional fbank features. But before actually being used as the network's input, the feature vectors undergo a further transform. In the source of steps/nnet/train.sh, the splice parameter (default 5) specifies this transform: each frame is spliced together with the 5 frames before and after it, producing one large vector of 11 frames (dimension 440). The topology of this feature transform is also saved, in final.feature_transform:
$ more final.feature_transform
<Nnet>
<Splice> 440 40
[ -5 -4 -3 -2 -1 0 1 2 3 4 5 ]
<!EndOfComponent>
...
...
This topology is applied to the feature vectors during neural network training, so the final network input dimension is 440.
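What the <Splice> component computes can be sketched in a few lines of numpy (our illustration, assuming out-of-range context frames are clamped to the first/last frame; the function name splice is ours):

import numpy as np

def splice(feats, context=5):
    """Splice each frame with `context` frames on each side.

    feats: (T, 40) fbank matrix -> returns (T, 40 * (2*context + 1)) = (T, 440).
    Out-of-range frame indices are clamped to the first/last frame.
    """
    T = feats.shape[0]
    spliced = []
    for offset in range(-context, context + 1):
        idx = np.clip(np.arange(T) + offset, 0, T - 1)
        spliced.append(feats[idx])
    return np.hstack(spliced)

feats = np.random.randn(300, 40)   # 300 frames of 40-dim fbank
print(splice(feats).shape)         # (300, 440)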
num_tgt, in turn, is obtained via hmm-info:
$ hmm-info final.mdl
number of phones 218
number of pdfs 1026
number of transition-ids 2834
number of transition-states 1413
$ hmm-info final.mdl | grep pdfs | awk '{ print $NF }'
1026
So the output dimension of the network is 1026. Now look at nnet_proto:
<AffineTransform> <InputDim> 440 <OutputDim> 1024 <BiasMean> -2.000000 <BiasRange> 4.000000 <ParamStddev> 0.037344 <MaxNorm> 0.000000
<Sigmoid> <InputDim> 1024 <OutputDim> 1024
<AffineTransform> <InputDim> 1024 <OutputDim> 1024 <BiasMean> -2.000000 <BiasRange> 4.000000 <ParamStddev> 0.109375 <MaxNorm> 0.000000
<Sigmoid> <InputDim> 1024 <OutputDim> 1024
<AffineTransform> <InputDim> 1024 <OutputDim> 1024 <BiasMean> -2.000000 <BiasRange> 4.000000 <ParamStddev> 0.109375 <MaxNorm> 0.000000
<Sigmoid> <InputDim> 1024 <OutputDim> 1024
<AffineTransform> <InputDim> 1024 <OutputDim> 1024 <BiasMean> -2.000000 <BiasRange> 4.000000 <ParamStddev> 0.109375 <MaxNorm> 0.000000
<Sigmoid> <InputDim> 1024 <OutputDim> 1024
<AffineTransform> <InputDim> 1024 <OutputDim> 1026 <BiasMean> 0.000000 <BiasRange> 0.000000 <ParamStddev> 0.109322 <LearnRateCoef> 1.000000 <BiasLearnRateCoef> 0.100000
<Softmax> <InputDim> 1026 <OutputDim> 1026
Here we can see that the network's input dimension has gone from 40 to 440, and that the output dimension equals the number of pdfs (corresponding to the number of HMM states).
Tracing the code further, we arrive at the implementation of a single training pass of the neural network, kaldi/src/nnetbin/nnet-train-frmshuff.cc:
Perform one iteration (epoch) of Neural Network training with mini-batch Stochastic Gradient Descent. The training targets are usually pdf-posteriors, prepared by ali-to-post.
Reading further into the code, a few key steps stand out:
// get feature / target pair,
Matrix<BaseFloat> mat = feature_reader.Value();
Posterior targets = targets_reader.Value(utt);
const CuMatrixBase<BaseFloat>& nnet_in = feature_randomizer.Value();
const Posterior& nnet_tgt = targets_randomizer.Value();
const Vector<BaseFloat>& frm_weights = weights_randomizer.Value();
// forward pass,
nnet.Propagate(nnet_in, &nnet_out);
// evaluate objective function we've chosen,
if (objective_function == "xent") {
  // gradients re-scaled by weights in Eval,
  xent.Eval(frm_weights, nnet_out, nnet_tgt, &obj_diff);
} else if (objective_function == "mse") {
  // gradients re-scaled by weights in Eval,
  mse.Eval(frm_weights, nnet_out, nnet_tgt, &obj_diff);
}
...
if (!crossvalidate) {
  // back-propagate, and do the update,
  nnet.Backpropagate(obj_diff, NULL);
}
total_frames += nnet_in.NumRows();
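Condensed, the Propagate / Eval / Backpropagate sequence above amounts to ordinary mini-batch cross-entropy SGD. A self-contained numpy sketch (illustrative only: the network is reduced to a single softmax layer, and the learning rate is our assumption; the 0.037 init echoes the ParamStddev seen in nnet_proto):

import numpy as np

def train_step(W, b, nnet_in, targets, lr=0.008):
    """One SGD step on a softmax output layer with cross-entropy loss.

    nnet_in: (N, 440) mini-batch; targets: (N,) pdf-ids from ali-to-pdf.
    """
    # forward pass (Propagate): affine layer + softmax over the 1026 pdfs,
    logits = nnet_in @ W + b
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    nnet_out = np.exp(logits)
    nnet_out /= nnet_out.sum(axis=1, keepdims=True)

    # evaluate objective (xent.Eval): obj_diff = output - one-hot target,
    obj_diff = nnet_out.copy()
    obj_diff[np.arange(len(targets)), targets] -= 1.0

    # back-propagate and update (Backpropagate),
    W -= lr * nnet_in.T @ obj_diff / len(targets)
    b -= lr * obj_diff.mean(axis=0)
    return W, b

W = 0.037 * np.random.randn(440, 1026)
b = np.zeros(1026)
batch = np.random.randn(256, 440)                 # 256 spliced frames
labels = np.random.randint(0, 1026, size=256)     # their pdf-id targets
W, b = train_step(W, b, batch, labels)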
Finally, steps/nnet/train_scheduler.sh, which invokes this code, controls the maximum number of iterations (max_iters) and decides whether to accept a trained model:
accepting: the loss was better, or we had fixed learn-rate, or we had fixed epoch-number
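The accept/reject decision quoted above sits inside a learning-rate schedule. A simplified Python sketch of the loop in train_scheduler.sh (our paraphrase; option names such as start_halving_impr, end_halving_impr and halving_factor mirror the script's options, but the body is a simplification):

def train_scheduler(run_epoch, learn_rate=0.008, max_iters=20,
                    start_halving_impr=0.01, end_halving_impr=0.001,
                    halving_factor=0.5):
    """run_epoch(lr) trains one epoch and returns the cross-validation loss."""
    loss_prev = run_epoch(learn_rate)   # initial cross-validation pass
    halving = False
    for it in range(max_iters):
        loss = run_epoch(learn_rate)
        rel_impr = (loss_prev - loss) / loss_prev
        # accept the new model only if the loss improved,
        if loss < loss_prev:
            loss_prev = loss
        # once improvement becomes small, start halving the learning rate,
        if rel_impr < start_halving_impr:
            halving = True
        if halving:
            learn_rate *= halving_factor
        # stop once halving no longer brings meaningful improvement,
        if halving and rel_impr < end_halving_impr:
            break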
At decoding time, the trained DNN-HMM model takes each frame's feature vector as input and outputs, for that frame, the probability of every state (pdf), i.e. the posterior p(q_t = s_i | x_t). HMM decoding, however, needs the likelihood p(x_t | q_t = s_i), so the posterior is converted with Bayes' rule.
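In the usual hybrid-system notation, this conversion reads (reconstructed here from the symbol definitions that follow):

p(x_t \mid q_t = s_i) = \frac{p(q_t = s_i \mid x_t)\, p(x_t)}{p(s_i)}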
Here x_t is the observation (input) at time t, and q_t = s_i means that the state at time t is s_i. p(x_t) is the probability of the observation itself; since it is the same for every state of a given frame, it has little effect on the result. p(s_i) is the prior probability of s_i, which can be estimated by counting over the training corpus. The end result serves the same purpose as the GMM did: an emission probability from HMM states to the observed frame's feature vector.