Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling

2021/07/04 07:13
"Duration prediction" existed in early TTS systems such as HMM-based models, and even in Deep Voice 2 (2017).
Seq-to-seq models with an attention mechanism removed the need for explicit duration prediction.

Tacotron 2 (2018)

seq-to-seq model
auto-regressive decoder
attention mechanism (location-sensitive)

Problems with auto-regressive attention based models

early cutoff

Efforts to improve the robustness of auto-regressive attention based models

adversarial training (Guo et al., 2019)
regularization to encourage the forward and backward attention to be consistent (Zheng et al., 2019)
Gaussian mixture model attention (Graves, 2013; Skerry-Ryan et al., 2018)
forward attention (Zhang et al., 2018)
stepwise monotonic attention (He et al., 2019)
dynamic convolution attention (Battenberg et al., 2020)

Resurrection of duration predictor

FastSpeech (Ren et al., 2019)
AlignTTS (Zeng et al., 2020)
TalkNet (Beliaev et al., 2020)
JDI-T (Lim et al., 2020)

Classification of some TTS models

Non-Attentive Tacotron and the contribution of the paper

Replace the attention mechanism with duration prediction and upsampling modules
Gaussian upsampling
Global and fine-grained controlling of duration
Semi-supervised and unsupervised duration modeling, allowing the model to be trained with few to no duration annotation
More reliable evaluation metrics

Model structure

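The duration predictor is trained through the duration loss, but its output is not used for upsampling during training. The target durations are used for upsampling instead, and the predicted durations are used only at inference.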

Gaussian upsampling

h: to be upsampled
d: duration vector
sigma: range parameter
u: upsampled vector
akin to GMM attention
fully differentiable
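
The upsampling step can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: each token i gets a Gaussian centered at the middle of its duration span, c_i = d_i/2 + sum of the preceding durations, with standard deviation sigma_i, and every output frame is a normalized, Gaussian-weighted sum over all token vectors. The function name `gaussian_upsample`, the 1-based frame index, and the rounding of the total length are assumptions made for this sketch.

```python
import numpy as np

def gaussian_upsample(h, d, sigma):
    """Sketch of Gaussian upsampling.

    h:     (N, D) token vectors to be upsampled
    d:     (N,)   durations in frames
    sigma: (N,)   range parameters (standard deviations, in frames)
    Returns (u, w): u is (T, D) upsampled vectors, w is (T, N) weights.
    """
    h = np.asarray(h, dtype=float)
    d = np.asarray(d, dtype=float)
    sigma = np.asarray(sigma, dtype=float)

    ends = np.cumsum(d)
    centers = ends - d / 2.0              # c_i = d_i / 2 + sum_{j < i} d_j
    T = int(round(ends[-1]))              # assumed: total frames = rounded sum of durations
    t = np.arange(1, T + 1, dtype=float)[:, None]   # frame indices, shape (T, 1)

    # Log of N(t; c_i, sigma_i^2) up to a constant that cancels in the normalization.
    log_pdf = -0.5 * ((t - centers) / sigma) ** 2 - np.log(sigma)
    log_pdf -= log_pdf.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(log_pdf)
    w /= w.sum(axis=1, keepdims=True)     # each frame's weights sum to 1

    return w @ h, w

# Two tokens lasting 2 and 3 frames.
u, w = gaussian_upsample([[1.0, 0.0], [0.0, 1.0]], [2, 3], [0.5, 0.5])
```

Because the weights are smooth functions of `d` and `sigma`, gradients can flow back into both, which is what makes the operation fully differentiable, unlike repeating each token vector an integer number of times.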

Target Duration

Extracted by an external, flat-start trained, speaker-dependent HMM-based aligner with a lexicon (Talkin & Wightman, 1994)

Semi-supervised and unsupervised duration modeling

Neural TTS models using duration prediction need external alignments.
FastSpeech uses target durations extracted from a pre-trained autoregressive model in teacher forcing mode.
In Non-Attentive Tacotron, target durations are extracted by an external, flat-start trained, speaker-dependent HMM-based aligner with a lexicon.
If a reliable aligner is not available, semi-supervised or unsupervised duration modeling is used.

A naive approach

Simply train the model using the predicted duration (instead of the target duration) for upsampling, and use only mel-spectrogram reconstruction loss.
Rescale the predicted durations so that they sum to the length of the target utterance.
Utterance-level duration loss can be added (semi-supervised?)
This approach does not work well. Poor naturalness!
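
The rescaling step of the naive approach can be illustrated with a few lines of NumPy. This is only a sketch under an assumption: the durations are scaled by one common factor so that they sum to the number of target mel frames. The exact formula used in the paper is not reproduced in these notes, and the name `rescale_durations` is hypothetical.

```python
import numpy as np

def rescale_durations(pred_d, target_frames):
    """Scale predicted per-token durations so they sum to target_frames.

    Assumption for this sketch: a single multiplicative factor is applied
    to all tokens; the relative proportions between tokens are unchanged.
    """
    pred_d = np.asarray(pred_d, dtype=float)
    return pred_d * (target_frames / pred_d.sum())

scaled = rescale_durations([1.0, 2.0, 3.0], target_frames=12)
```

The rescaled durations can then be used for upsampling, so that the upsampled sequence has the same number of frames as the target mel-spectrogram and the reconstruction loss can be computed frame by frame.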

Fine-grained VAE (FVAE)

Use FVAE to model the alignment during training, to extract per-token latent features.
At training time, these latent features are fed into the duration predictor
At inference time, they are either sampled from the prior (a normal distribution) or set to the mode of the prior, which is zero for a standard normal prior.
The overall loss function is

Robustness Evaluation

Unaligned duration ratio (UDR)

any long audio segments (> 1 second) not aligned to any input token
long-pauses, babbling, word repetitions or failure to stop after finishing the utterance
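
A minimal sketch of how such a ratio could be computed for one utterance is shown below. It assumes that a forced aligner has already produced `(start, end)` intervals in seconds for the audio aligned to input tokens; the function name, the input format, and the per-utterance aggregation are assumptions for illustration, and the paper's exact procedure may differ.

```python
def unaligned_duration_ratio(aligned, total, min_gap=1.0):
    """Fraction of the audio covered by long unaligned segments.

    aligned: list of (start, end) intervals in seconds aligned to input tokens
    total:   total audio duration in seconds
    min_gap: only unaligned segments longer than this are counted (1 second here)
    """
    gaps = []
    cursor = 0.0
    for start, end in sorted(aligned):
        if start > cursor:
            gaps.append(start - cursor)   # unaligned stretch before this interval
        cursor = max(cursor, end)
    if total > cursor:
        gaps.append(total - cursor)       # trailing audio, e.g. failure to stop
    long_gaps = sum(g for g in gaps if g > min_gap)
    return long_gaps / total if total > 0 else 0.0

# 0.2 s pause between tokens (ignored) and 3 s of trailing audio (counted).
udr = unaligned_duration_ratio([(0.0, 1.0), (1.2, 2.0)], total=5.0)
```

In this example only the 3-second tail exceeds the threshold, so the ratio is 3.0 / 5.0.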

ASR word deletion rate (WDR)

early cutoff or word skipping
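
The deletion rate can be sketched as follows, assuming the ASR transcript of the synthesized audio is already available as a string. The sketch aligns the input text with the transcript using a standard word-level edit distance and counts only the deletions; the function name and the simple lowercase whitespace tokenization are assumptions for illustration, not the paper's evaluation pipeline.

```python
def word_deletion_rate(reference, hypothesis):
    """Deleted reference words divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    n, m = len(ref), len(hyp)

    # cost[i][j]: edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Backtrace one optimal alignment and count deletions.
    i, j, deletions = n, m, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            i, j = i - 1, j - 1           # match or substitution
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            deletions += 1                # reference word missing from the transcript
            i -= 1
        else:
            j -= 1                        # extra word in the transcript (insertion)
    return deletions / n if n else 0.0

# Early cutoff: the last two words were never spoken.
wdr = word_deletion_rate("the quick brown fox jumps", "the quick brown")
```

Counting only deletions, rather than the full word error rate, isolates the failure modes where the model drops input text.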


66 speakers with 4 English accents (US, British, Australian and Nigerian)
354 hours

Comparison to Tacotron 2 w/ GMMA and LSA

A preliminary experiment showed that GMMA performed better than monotonic, stepwise monotonic and dynamic convolution attention, so GMMA is used as the attention-based baseline for comparison.

Gaussian Mixture Model Attention (GMMA)



The robustness of Non-Attentive Tacotron w.r.t. UDR and WDR should also be compared to other models with a duration predictor, such as FastSpeech.

Pace control

Utterance-wide pace control shows that WER was hardly affected at paces from 0.8x to 1.25x.
MOS drops, but most raters complained about the pace itself: "too slow or too fast to be natural".
Single-word pace control is also possible. ....
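
Because durations are explicit, both kinds of pace control amount to scaling the predicted durations before upsampling. The sketch below assumes that a pace factor p divides the durations (so 1.25x pace shortens them and 0.8x lengthens them) and that single-word control is expressed as a boolean mask over tokens. The function name `apply_pace` and the mask interface are hypothetical.

```python
import numpy as np

def apply_pace(durations, pace=1.0, token_mask=None):
    """Scale predicted durations to change speaking pace.

    durations:  (N,) predicted per-token durations
    pace:       pace factor; values above 1 speed speech up (assumed convention)
    token_mask: optional (N,) booleans; if given, only masked tokens are scaled
    """
    durations = np.asarray(durations, dtype=float)
    scale = np.ones_like(durations)
    if token_mask is None:
        scale[:] = 1.0 / pace                       # utterance-wide control
    else:
        scale[np.asarray(token_mask, dtype=bool)] = 1.0 / pace   # selected tokens only
    return durations * scale

faster = apply_pace([4.0, 8.0], pace=2.0)
slow_word = apply_pace([4.0, 8.0, 6.0], pace=0.5, token_mask=[False, True, False])
```

Here `faster` halves every duration, while `slow_word` doubles only the duration of the second token and leaves the others unchanged.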

Semi-Supervised and Unsupervised Duration Modeling