Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus
Abstract
Training a text-to-speech (TTS) model requires a large-scale text-labeled speech corpus, which is troublesome to collect. In this paper, we propose a transfer learning framework for TTS that utilizes a large amount of unlabeled speech for pre-training. By leveraging wav2vec2.0 representations, unlabeled speech can substantially improve performance, especially when labeled speech is scarce. We also extend the proposed method to zero-shot multi-speaker TTS (ZS-TTS). The experimental results verify the effectiveness of the proposed method in terms of naturalness, intelligibility, and speaker generalization. We highlight that the single-speaker TTS model fine-tuned on only 10 minutes of labeled data outperforms the other baselines, and that the ZS-TTS model fine-tuned on only 30 minutes of single-speaker data can generate the voice of an arbitrary speaker, thanks to pre-training on an unlabeled multi-speaker speech corpus.
Single Speaker TTS
The proposed model was pre-trained on the speech-only LJSpeech dataset (22 hours) and fine-tuned on subsets of labeled LJSpeech of the lengths denoted below. All baselines were trained on the same subsets.
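As an illustration of how such duration-limited fine-tuning subsets can be assembled, the sketch below greedily picks a random set of utterances whose total length approaches a target duration (e.g. 10 minutes). The greedy strategy, function names, and metadata format are our own assumptions, not taken from the paper.

```python
import random

def select_subset(utterances, target_seconds, seed=0):
    """Pick a random subset of (utt_id, duration_s) pairs whose total
    duration approaches target_seconds without exceeding it."""
    rng = random.Random(seed)
    pool = list(utterances)
    rng.shuffle(pool)
    subset, total = [], 0.0
    for utt_id, dur in pool:
        if total + dur > target_seconds:
            continue  # skip utterances that would overshoot the budget
        subset.append(utt_id)
        total += dur
    return subset, total

# Hypothetical metadata: (utterance id, duration in seconds)
utts = [(f"LJ001-{i:04d}", 3.0 + (i % 5)) for i in range(1000)]
ids, total = select_subset(utts, target_seconds=600)  # ~10-minute subset
print(len(ids), round(total, 1))
```

In practice the durations would be read from the corpus metadata; larger budgets (1 hour, 10 hours) follow by changing `target_seconds`.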
Audio samples
Text
But they proceeded in all seriousness, and would have shrunk from no outrage or atrocity in furtherance of their foolhardy enterprise.
Ground Truth
(1) LJSpeech, 10 min
GlowTTS
FastSpeech2
VITS-baseline
Proposed
(2) LJSpeech, 1 hour
GlowTTS
FastSpeech2
VITS-baseline
Proposed
(3) LJSpeech, 10 hours
GlowTTS
FastSpeech2
VITS-baseline
Proposed
The proceeds of the robbery were lodged in a Boston bank,
On the other hand, he could have traveled some distance with the money he did have and he did return to his room where he obtained his revolver.
might have been more alert in the Dallas motorcade if they had retired promptly in Fort Worth.
Zero-Shot Multi-Speaker TTS
The proposed model was pre-trained on the speech-only LibriTTS dataset (245 hours, 1,151 speakers) and fine-tuned on subsets of labeled VCTK with the lengths and speaker counts denoted below. All baselines were trained on the same subsets.
Audio samples
Text
I think a move would create a lot of interest.
Reference
Ground Truth
(1) LJSpeech 30 min, 1 speaker
Proposed
(2) VCTK 1 hour, 4 speakers
SC-GlowTTS
Meta-StyleSpeech
VITS-baseline
Proposed
(3) VCTK 5 hours, 20 speakers
SC-GlowTTS
Meta-StyleSpeech
VITS-baseline
Proposed
(4) VCTK 20 hours, 80 speakers
SC-GlowTTS
Meta-StyleSpeech
VITS-baseline
Proposed
Ask her to bring these things with her from the store.
This represents a tough game for us.
It can't just be a cynical marketing exercise.