Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus

Authors: Minchan Kim*, Myeonghun Jeong*, Byoung Jin Choi, Sunghwan Ahn, Joun Yeop Lee, and Nam Soo Kim
(*: equal contribution)

Abstract

Training a text-to-speech (TTS) model requires a large-scale, text-labeled speech corpus, which is troublesome to collect. In this paper, we propose a transfer learning framework for TTS that uses a large amount of unlabeled speech for pre-training. By leveraging wav2vec 2.0 representations, unlabeled speech can substantially improve performance, especially when labeled speech is scarce. We also extend the proposed method to zero-shot multi-speaker TTS (ZS-TTS). The experimental results verify the effectiveness of the proposed method in terms of naturalness, intelligibility, and speaker generalization. We highlight that the single-speaker TTS model fine-tuned on only 10 minutes of labeled data outperforms the other baselines, and that the ZS-TTS model fine-tuned on only 30 minutes of single-speaker data can generate the voice of an arbitrary speaker, thanks to pre-training on an unlabeled multi-speaker speech corpus.
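
To make the role of the wav2vec 2.0 representations concrete, here is a minimal sketch (not the authors' code) of extracting frame-level features from unlabeled audio with the HuggingFace Transformers API; the checkpoint name and file path are illustrative assumptions.

    import torch
    import torchaudio
    from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

    extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
    model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

    wav, sr = torchaudio.load("unlabeled_utterance.wav")  # hypothetical file
    if sr != 16000:  # wav2vec 2.0 expects 16 kHz input
        wav = torchaudio.functional.resample(wav, sr, 16000)

    inputs = extractor(wav.squeeze(0).numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        # frame-level representations of shape (1, frames, 768), one vector per ~20 ms
        reps = model(inputs.input_values).last_hidden_state

Representations like these, computed from speech alone, can serve as the pre-training signal, so that only a small labeled set is needed at fine-tuning time.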


Single Speaker TTS

The proposed model was pre-trained on the speech-only LJSpeech dataset (22 hours) and fine-tuned on labeled LJSpeech subsets of the indicated durations. All baselines were trained on the same labeled subsets.
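
For reference, here is a minimal sketch (an assumption, not the paper's released code) of drawing a duration-limited labeled split from an LJSpeech-style metadata file of id|transcript|normalized-transcript lines; the paths are hypothetical.

    import torchaudio

    def duration_limited_subset(metadata_path, wav_dir, budget_sec):
        """Collect (wav_id, text) pairs until the audio budget is exhausted."""
        subset, total = [], 0.0
        with open(metadata_path, encoding="utf-8") as f:
            for line in f:
                wav_id, _, text = line.rstrip("\n").split("|", 2)
                info = torchaudio.info(f"{wav_dir}/{wav_id}.wav")
                total += info.num_frames / info.sample_rate
                if total > budget_sec:
                    break
                subset.append((wav_id, text))
        return subset

    # e.g., the 10-minute fine-tuning condition
    ten_min = duration_limited_subset("metadata.csv", "wavs", budget_sec=600)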

Audio samples

Text

(a) But they proceeded in all seriousness, and would have shrunk from no outrage or atrocity in furtherance of their foolhardy enterprise.

(b) The proceeds of the robbery were lodged in a Boston bank,

(c) On the other hand, he could have traveled some distance with the money he did have and he did return to his room where he obtained his revolver.

(d) might have been more alert in the Dallas motorcade if they had retired promptly in Fort Worth.

[Audio samples: for each text, a Ground Truth recording and outputs from GlowTTS, FastSpeech2, VITS-baseline, and the Proposed model under three data conditions: (1) LJSpeech, 10 min; (2) LJSpeech, 1 hour; (3) LJSpeech, 10 hours.]


Zero-Shot Multi-Speaker TTS

The proposed model was pre-trained on the speech-only LibriTTS dataset (245 hours, 1,151 speakers) and fine-tuned on labeled subsets of the indicated sizes and speaker counts (a 30-minute single-speaker LJSpeech split and VCTK subsets). All baselines were trained on the same labeled data.
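
As an illustration, here is a minimal sketch (an assumption, not the paper's released code) of drawing a fixed-size speaker subset from a VCTK-style layout (wav48/<speaker_id>/<utt>.wav); the root path and seed are hypothetical.

    import random
    from pathlib import Path

    def speaker_subset(vctk_root, n_speakers, seed=0):
        """Pick n_speakers at random and list their wav files."""
        speakers = sorted(p.name for p in Path(vctk_root, "wav48").iterdir())
        chosen = random.Random(seed).sample(speakers, n_speakers)
        return {spk: sorted(Path(vctk_root, "wav48", spk).glob("*.wav"))
                for spk in chosen}

    # e.g., the 1-hour, 4-speaker condition (duration capping as in the
    # single-speaker sketch above)
    subset = speaker_subset("VCTK-Corpus", n_speakers=4)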

Audio samples

Text

(a) I think a move would create a lot of interest.

(b) Ask her to bring these things with her from the store.

(c) This represents a tough game for us.

(d) It can't just be a cynical marketing exercise.

[Audio samples: for each text, a Reference recording of the target speaker, the Ground Truth, the Proposed model under (1) LJSpeech 30 min, 1 speaker, and SC-GlowTTS, Meta-StyleSpeech, VITS-baseline, and the Proposed model under (2) VCTK 1 hour, 4 speakers; (3) VCTK 5 hours, 20 speakers; (4) VCTK 20 hours, 80 speakers.]