Efficient Parallel Audio Generation using Group Masked Language Modeling

Abstract

We present a fast and high-quality codec language model for parallel audio generation. While SoundStorm, a state-of-the-art parallel audio generation model, accelerates inference compared to autoregressive models, it still suffers from slow inference due to iterative sampling. To resolve this problem, we propose Group-Masked Language Modeling (G-MLM) and Group Iterative Parallel Decoding (G-IPD) for efficient parallel audio generation. Together, the training and sampling schemes enable the model to synthesize high-quality audio in a small number of iterations by effectively modeling group-wise conditional dependencies. In addition, our model employs a cross-attention-based architecture to capture the speaker style of the prompt voice and to improve computational efficiency. Experimental results demonstrate that our proposed model outperforms the baselines in prompt-based audio generation.

Sentence 1: Supposing that it was his sister coming back from one of her farms, he kept on with his work.
GT	Prompt
SoundStorm (N=12)	SoundStorm (N=27)
Proposed (N=2)	Proposed (N=6)	Proposed (N=12)	Proposed (N=27)

Sentence 2: After proceeding a few miles, the progress of Hawkeye, who led the advance, became more deliberate and watchful.
GT	Prompt
SoundStorm (N=12)	SoundStorm (N=27)
Proposed (N=2)	Proposed (N=6)	Proposed (N=12)	Proposed (N=27)

Sentence 3: She was a melancholy, middle aged woman, without visible attractions of any sort-one of those persons who appear to accept the obligation of living under protest, as a burden which they would never have consented to bear if they had only been consulted first.
GT	Prompt
SoundStorm (N=12)	SoundStorm (N=27)
Proposed (N=2)	Proposed (N=6)	Proposed (N=12)	Proposed (N=27)

Sentence 1: I forget all the other good things he did; but he ended by shooting himself through the head in his bed room, and that was not the worst thing ever he did.
Semantic Source	Prompt
VITS + Ref.	SoundStorm (N=12)	SoundStorm (N=27)
Proposed (N=2)	Proposed (N=6)	Proposed (N=12)	Proposed (N=27)
