Diff-TTS: A Denoising Diffusion Model for Text-to-Speech

Myeonghun Jeong, Hyeongju Kim, Sung Jun Cheon, Byoung Jin Choi, Nam Soo Kim

Abstract

Although neural text-to-speech (TTS) models have attracted considerable attention and succeeded in generating human-like speech, there is still room to improve their naturalness and architectural efficiency. In this work, we propose a novel non-autoregressive TTS model, Diff-TTS, which achieves highly natural and efficient speech synthesis. Given the text, Diff-TTS uses a denoising diffusion framework to transform a noise signal into a mel-spectrogram over a sequence of diffusion time steps. To learn the mel-spectrogram distribution conditioned on the text, we present a likelihood-based optimization method for TTS. Furthermore, to boost inference speed, we leverage an accelerated sampling method that allows Diff-TTS to generate raw waveforms much faster without significantly degrading perceptual quality. Through experiments, we verified that Diff-TTS generates speech 28 times faster than real time on a single NVIDIA 2080Ti GPU.
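To make the sample labels below easier to read, here is a minimal sketch of how accelerated sampling might shorten the reverse process. It assumes that T is the total number of diffusion time steps and that γ is a decimation factor that keeps every γ-th step; the function name `decimated_steps` and the exact indexing (starting at T and counting down) are illustrative assumptions, not details taken from this page.

```python
def decimated_steps(num_steps, gamma):
    """Return the reverse-process time steps that are kept, in descending order.

    Assumption: sampling starts at step `num_steps` and keeps every
    `gamma`-th step on the way down to step 1.
    """
    if num_steps < 1 or gamma < 1:
        raise ValueError("num_steps and gamma must be positive integers")
    return list(range(num_steps, 0, -gamma))


if __name__ == "__main__":
    # The settings used in the sample labels: T = 400 with four values of gamma.
    for gamma in (1, 7, 21, 57):
        steps = decimated_steps(400, gamma)
        print(f"gamma={gamma:>2}: {len(steps)} denoising steps, last step = {steps[-1]}")
```

Under these assumptions, γ = 1 runs all 400 steps, while γ = 7, 21, and 57 run 58, 20, and 8 steps respectively, so a larger γ trades some quality for speed.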


Audio Examples

Example 1

Ground Truth

Ground Truth (Mel + HiFiGAN)

Tacotron2

Glow-TTS

Diff-TTS (T=400, γ=1)

Diff-TTS (T=400, γ=7)

Diff-TTS (T=400, γ=21)

Diff-TTS (T=400, γ=57)

Example 2

Ground Truth

Ground Truth (Mel + HiFiGAN)

Tacotron2

Glow-TTS

Diff-TTS (T=400, γ=1)

Diff-TTS (T=400, γ=7)

Diff-TTS (T=400, γ=21)

Diff-TTS (T=400, γ=57)

Example 3

Ground Truth

Ground Truth (Mel + HiFiGAN)

Tacotron2

Glow-TTS

Diff-TTS (T=400, γ=1)

Diff-TTS (T=400, γ=7)

Diff-TTS (T=400, γ=21)

Diff-TTS (T=400, γ=57)

Example 4

Ground Truth

Ground Truth (Mel + HiFiGAN)

Tacotron2

Glow-TTS

Diff-TTS (T=400, γ=1)

Diff-TTS (T=400, γ=7)

Diff-TTS (T=400, γ=21)

Diff-TTS (T=400, γ=57)

Example 5

Ground Truth

Ground Truth (Mel + HiFiGAN)

Tacotron2

Glow-TTS

Diff-TTS (T=400, γ=1)

Diff-TTS (T=400, γ=7)

Diff-TTS (T=400, γ=21)

Diff-TTS (T=400, γ=57)

Example 6

Ground Truth

Ground Truth (Mel + HiFiGAN)

Tacotron2

Glow-TTS

Diff-TTS (T=400, γ=1)

Diff-TTS (T=400, γ=7)

Diff-TTS (T=400, γ=21)

Diff-TTS (T=400, γ=57)

Example 7

Ground Truth

Ground Truth (Mel + HiFiGAN)

Tacotron2

Glow-TTS

Diff-TTS (T=400, γ=1)

Diff-TTS (T=400, γ=7)

Diff-TTS (T=400, γ=21)

Diff-TTS (T=400, γ=57)


Ablation Studies

1. Diversity


Example 1


temperature T = 0.1

temperature T = 0.333

temperature T = 0.667

Example 2


temperature T = 0.1

temperature T = 0.333

temperature T = 0.667

Example 3


temperature T = 0.1

temperature T = 0.333

temperature T = 0.667

Example 4


temperature T = 0.1

temperature T = 0.333

temperature T = 0.667
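As a rough illustration of what the temperature settings above could mean, the sketch below scales Gaussian noise by a temperature factor. It assumes the temperature multiplies the standard deviation of the noise fed to the sampler; this page does not state the exact formulation, and `sample_noise` is a hypothetical helper, not part of Diff-TTS.

```python
import random


def sample_noise(num_values, temperature, seed=None):
    """Draw Gaussian noise whose standard deviation is scaled by `temperature`.

    Assumption: a lower temperature shrinks the noise toward zero, which
    should make repeated syntheses of the same text less varied.
    """
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    rng = random.Random(seed)
    return [temperature * rng.gauss(0.0, 1.0) for _ in range(num_values)]


if __name__ == "__main__":
    # The three temperature values listed in the diversity samples.
    for temperature in (0.1, 0.333, 0.667):
        noise = sample_noise(5, temperature, seed=0)
        print(temperature, [round(v, 3) for v in noise])
```

With a fixed seed, the three outputs differ only by the scale factor, which isolates the effect of the temperature from the randomness itself.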

2. Length control


Example 1


0.75x

1.00x

1.25x

Example 2


0.75x

1.00x

1.25x

Example 3


0.75x

1.00x

1.25x
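The length-control factors above can be pictured with a small sketch. It assumes that each input token has a predicted duration in mel frames and that the factor (0.75x, 1.00x, 1.25x) multiplies those durations before the tokens are expanded to frame level; `scale_durations` and `expand` are hypothetical helpers written for illustration only.

```python
def scale_durations(durations, factor):
    """Scale per-token durations (in frames), keeping at least one frame per token."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return [max(1, round(d * factor)) for d in durations]


def expand(tokens, durations):
    """Repeat each token according to its duration to get a frame-level sequence."""
    frames = []
    for token, duration in zip(tokens, durations):
        frames.extend([token] * duration)
    return frames


if __name__ == "__main__":
    tokens = ["h", "e", "y"]
    durations = [4, 8, 4]  # hypothetical predicted durations, in frames
    for factor in (0.75, 1.00, 1.25):
        scaled = scale_durations(durations, factor)
        print(f"{factor:.2f}x -> durations {scaled}, total frames {len(expand(tokens, scaled))}")
```

Under this reading, a factor below 1 shortens the utterance and a factor above 1 lengthens it, without changing the text.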