"Duration Prediction" existed in early TTS system as HMM-based models and even in Deep Voice 2 (2017).
Seq-to-seq models w/ attention mechanisms removed the need for an explicit duration predictor.
Tacotron 2 (2018)
seq-to-seq model
auto-regressive decoder
attention mechanism (location-sensitive)
Problems with auto-regressive attention-based models
early cutoff
repetition
skipping
Efforts to improve the robustness of auto-regressive attention-based models
adversarial training (Guo et al., 2019)
regularization to encourage the forward and backward attention to be consistent (Zheng et al., 2019)
Gaussian mixture model attention (Graves, 2013; Skerry-Ryan et al., 2018)
forward attention (Zhang et al., 2018)
stepwise monotonic attention (He et al., 2019)
dynamic convolution attention (Battenberg et al., 2020)
Resurrection of duration predictor
FastSpeech (Ren et al., 2019)
AlignTTS (Zeng et al., 2020)
TalkNet (Beliaev et al., 2020)
JDI-T (Lim et al., 2020)
Classification of some TTS models
Non-Attentive Tacotron and the contributions of the paper
Replace the attention mechanism with duration prediction and upsampling modules
Gaussian upsampling
Global and fine-grained controlling of duration
Semi-supervised and unsupervised duration modeling, allowing the model to be trained with few to no duration annotations
More reliable evaluation metrics
Model structure
The duration predictor is trained to minimize a duration loss, but its predictions are not used during training; target durations drive the upsampling, and predicted durations are used only at inference.
Gaussian upsampling
h: token-level encodings to be upsampled
d: duration vector (frames per token)
sigma: range parameter (one per token)
u: upsampled, frame-level sequence
akin to GMM attention
fully differentiable (see the sketch below)
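A minimal PyTorch sketch of Gaussian upsampling as I read it from the paper's description; the function name, shapes, and the exact frame-position convention are my assumptions, not the paper's code:

```python
import torch

def gaussian_upsample(h, d, sigma):
    # h:     (N, D) token-level encodings to be upsampled
    # d:     (N,)   durations in frames (targets at training, predictions at inference)
    # sigma: (N,)   predicted range parameters
    T = int(torch.round(d.sum()).item())          # number of output frames
    c = torch.cumsum(d, dim=0) - 0.5 * d          # center of token i: sum_{j<=i} d_j - d_i/2
    t = torch.arange(1, T + 1, dtype=h.dtype)     # output frame positions
    # log of the Gaussian density N(t; c_i, sigma_i^2) up to a shared constant;
    # softmax over tokens then yields w_{t,i} = N(t; c_i, .) / sum_j N(t; c_j, .)
    logp = -0.5 * ((t[:, None] - c[None, :]) / sigma[None, :]) ** 2 - torch.log(sigma)[None, :]
    w = torch.softmax(logp, dim=1)                # (T, N) soft alignment weights
    return w @ h                                  # (T, D): u_t = sum_i w_{t,i} h_i
```

Because every step is smooth in d and sigma, gradients flow back into the duration and range predictors, which is what "fully differentiable" means above; small sigma approaches hard repeat-upsampling, large sigma blurs across neighboring tokens.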
Target Duration
Extracted by an external, flat-start trained, speaker-dependent HMM-based aligner with a lexicon (Talkin & Wightman, 1994)
Semi-supervised and unsupervised duration modeling
Neural TTS models using duration prediction need external alignments.
FastSpeech uses target durations extracted from a pre-trained autoregressive teacher model running in teacher-forcing mode.
In Non-Attentive Tacotron, target durations are extracted by an external, flat-start trained, speaker-dependent HMM-based aligner with a lexicon.
If a reliable aligner is not available, semi-supervised or unsupervised duration modeling can be used instead.
A naive approach
Simply train the model using the predicted durations (instead of the target durations) for upsampling, with only the mel-spectrogram reconstruction loss.
Rescale the predicted durations so that their sum matches the length of the target utterance (see the sketch after this list).
An utterance-level duration loss can be added (arguably making this semi-supervised).
This approach does not work well. Poor naturalness!
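A sketch of the rescaling and the utterance-level duration loss; the exact forms are my assumptions, only the idea (match the total length, supervise only the sum) is from the notes above:

```python
def rescale_durations(d_pred, T):
    # Scale predicted per-token durations (a 1-D torch tensor) so that
    # they sum to T, the known frame length of the target mel-spectrogram.
    return d_pred * (T / d_pred.sum())

def utterance_duration_loss(d_pred, T):
    # Only the utterance's total duration is supervised, not per-token
    # durations, since no per-token alignment is available.
    return (d_pred.sum() - T) ** 2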
Fine-grained VAE (FVAE)
Use an FVAE to model the alignment during training, extracting per-token latent features.
At training time, these latent features are fed into the duration predictor.
At inference time, they are either sampled from the prior (a normal distribution) or set to the prior's mode (i.e., zero).
The overall loss function combines the reconstruction, duration, and FVAE KL terms (sketched below).
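The note cuts off here. From the pieces above (mel reconstruction loss, duration loss, and the FVAE's KL term), a plausible form is the following; the symbols and weights λ, β are my guesses rather than the paper's notation:

$$\mathcal{L} = \mathcal{L}_{\text{spec}} + \lambda\,\mathcal{L}_{\text{dur}} + \beta\,D_{\mathrm{KL}}\big(q(z \mid x, y)\,\|\,p(z)\big)$$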
Robustness Evaluation
Unaligned duration ratio (UDR)
any long audio segment (> 1 second) not aligned to any input token
catches over-generation
long pauses, babbling, word repetitions, or failure to stop after finishing the utterance (see the sketch after this list)
ASR word deletion rate (WDR)
catches under-generation
early cutoff or word skipping
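One plausible way to compute UDR from aligner output (the metric's definition is the paper's, this implementation is mine); WDR would instead count deletions when edit-distance-aligning the ASR transcript against the reference text:

```python
def unaligned_duration_ratio(token_intervals, total_dur, min_gap=1.0):
    # token_intervals: sorted (start, end) times in seconds of audio
    #                  segments attributed to some input token
    # total_dur:       utterance duration in seconds
    # Returns the fraction of audio covered by unaligned gaps > min_gap s.
    unaligned, prev_end = 0.0, 0.0
    for start, end in token_intervals:
        if start - prev_end > min_gap:       # long gap aligned to no token
            unaligned += start - prev_end
        prev_end = max(prev_end, end)
    if total_dur - prev_end > min_gap:       # trailing babble / failure to stop
        unaligned += total_dur - prev_end
    return unaligned / total_dur
```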
Experiments
66 speakers w/ 4 English accents (US, British, Australian, and Nigerian)
354 hours
Comparison to Tacotron 2 w/ GMMA and LSA (location-sensitive attention)
A preliminary experiment showed that GMMA performed better than monotonic, stepwise monotonic, and dynamic convolution attention, so GMMA was chosen as the attention baseline.
Gaussian Mixture Model Attention (GMMA)
Naturalness
Robustness
Robustness (UDR and WDR) of Non-Attentive Tacotron should also be compared against other duration-predictor models such as FastSpeech.
Pace control
Utterance-wide control: with 0.8x–1.25x pace, WER was hardly affected
MOS drops, but mostly because raters complained that the pace was "too slow or too fast to be natural"
Single-word pace control is also possible (see the sketch below). ....
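A sketch of how pace control can work in a duration-based model: scale the predicted durations before Gaussian upsampling. The names and the pace-to-duration mapping are my assumptions:

```python
def paced_durations(d_pred, pace=1.0, token_pace=None):
    # d_pred:     (N,) predicted per-token durations (torch tensor)
    # pace:       utterance-wide factor; here pace > 1 means faster speech,
    #             so durations shrink by 1 / pace
    # token_pace: optional (N,) per-token factors for fine-grained control,
    #             e.g. slowing down a single word
    d = d_pred / pace
    if token_pace is not None:
        d = d / token_pace
    return d
```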