Audio Samples: Parallel vs Iterative Sampling

This page contains audio samples for the paper "Frame-Stacked Local Transformers For Efficient Multi-Codebook Speech Generation". The texts and speakers were chosen at random from the evaluation sets, and samples are included for both seen and unseen speakers.

For each text/context pair, we include samples from:

A non-frame-stacked model (parallel sampling)
A 2x-frame-stacked model
A 4x-frame-stacked model
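
As a rough illustration of what the stacking factor means, the sketch below groups consecutive codec frames so that one model step covers several frames at once. It assumes each frame is a list of codebook indices. The helper name `stack_frames`, the toy sizes, and the choice to drop leftover frames are all hypothetical and not taken from the paper.

```python
def stack_frames(codes, factor):
    """Group `factor` consecutive frames into one stacked step.

    `codes` is a list of frames; each frame is a list of codebook indices.
    Trailing frames that do not fill a whole group are dropped here for
    simplicity (an assumption; a real system might pad instead).
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    usable = len(codes) - len(codes) % factor
    return [
        [idx for frame in codes[start:start + factor] for idx in frame]
        for start in range(0, usable, factor)
    ]


# Toy input: 8 frames with 4 codebooks each (values are placeholders).
frames = [[t * 10 + c for c in range(4)] for t in range(8)]

for factor in (1, 2, 4):
    stacked = stack_frames(frames, factor)
    # Number of steps shrinks by `factor`; tokens per step grow by `factor`.
    print(factor, len(stacked), len(stacked[0]))
```

With a stacking factor of 2 the toy sequence needs half as many steps, and each step carries twice as many codebook tokens.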

For each stacking factor, three model variants are included:

Without a Local Transformer (parallel sampling of all codebooks)
With a MaskGit-based Local Transformer (iterative sampling)
With an autoregressive Local Transformer (sequential codebook sampling)
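
The three variants differ mainly in how the codebook tokens of one step are drawn. The sketch below contrasts them on a toy problem. It is not the paper's implementation: `toy_probs` is a hypothetical stand-in for a model that returns random distributions, the sizes are arbitrary, and the MaskGit-style loop uses a simple even unmasking schedule as a simplifying assumption.

```python
import random

NUM_CODEBOOKS = 4  # toy value, not from the paper
VOCAB = 8          # toy codebook size, not from the paper


def toy_probs(known, rng):
    """Hypothetical model stand-in: one distribution per codebook.

    `known` holds tokens sampled so far (None = not yet sampled). This toy
    ignores it; a real model would condition its predictions on it.
    """
    rows = []
    for _ in range(NUM_CODEBOOKS):
        weights = [rng.random() for _ in range(VOCAB)]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


def draw(probs, rng):
    """Sample one token index from a probability list."""
    return rng.choices(range(len(probs)), weights=probs)[0]


def sample_parallel(rng):
    # One pass; every codebook is sampled independently of the others.
    probs = toy_probs([None] * NUM_CODEBOOKS, rng)
    return [draw(p, rng) for p in probs]


def sample_autoregressive(rng):
    # One pass per codebook; each pass is given the tokens chosen so far.
    tokens = [None] * NUM_CODEBOOKS
    for c in range(NUM_CODEBOOKS):
        probs = toy_probs(tokens, rng)
        tokens[c] = draw(probs[c], rng)
    return tokens


def sample_maskgit(rng, steps=2):
    # Start fully masked; each step proposes tokens for the masked slots
    # and keeps only the most confident proposals.
    tokens = [None] * NUM_CODEBOOKS
    for step in range(steps):
        masked = [c for c in range(NUM_CODEBOOKS) if tokens[c] is None]
        if not masked:
            break
        probs = toy_probs(tokens, rng)
        proposals = {c: draw(probs[c], rng) for c in masked}
        # Even schedule: ceil(masked / remaining steps) slots per step,
        # so the final step fills whatever is left.
        keep = -(-len(masked) // (steps - step))
        ranked = sorted(masked, key=lambda c: probs[c][proposals[c]], reverse=True)
        for c in ranked[:keep]:
            tokens[c] = proposals[c]
    return tokens


rng = random.Random(0)
for sampler in (sample_parallel, sample_maskgit, sample_autoregressive):
    print(sampler.__name__, sampler(rng))
```

In this toy setup, parallel sampling needs one pass, the MaskGit-style loop needs `steps` passes, and the autoregressive loop needs one pass per codebook.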

The diagram below shows the speedup benefits of frame stacking.

Speedup Analysis

LibriTTS - Seen Speakers

"Then Mary Taylor, whose conscience was uncomfortable, said:"

Stacking Factor	Context Audio	No Local Transformer	Local Transformer: MaskGit	Local Transformer: Autoregressive
1 (no frame stacking)
2 frames
4 frames

"Its depth remained invariable, still four, or at most five, fathoms; and although its bottom was assiduously dredged, it was only to prove it barren of marine production of any type."

Stacking Factor	Context Audio	No Local Transformer	Local Transformer: MaskGit	Local Transformer: Autoregressive
1 (no frame stacking)
2 frames
4 frames

"All the fingers and thumbs of the girl's hands had been carefully formed and stuffed and stitched at the edges, with gold plates at the ends to serve as finger nails."

Stacking Factor	Context Audio	No Local Transformer	Local Transformer: MaskGit	Local Transformer: Autoregressive
1 (no frame stacking)
2 frames
4 frames

"He allowed them to glow and fade, hue after hue: sunrise gold, the russet and green of apple orchards, azure of waves, the grey fringed fleece of clouds."

Stacking Factor	Context Audio	No Local Transformer	Local Transformer: MaskGit	Local Transformer: Autoregressive
1 (no frame stacking)
2 frames
4 frames

"Also, a draft on futurity, sometimes honored, but generally extended."

Stacking Factor	Context Audio	No Local Transformer	Local Transformer: MaskGit	Local Transformer: Autoregressive
1 (no frame stacking)
2 frames
4 frames

"The instinct of workmanship is present in all men, and asserts itself even under very adverse circumstances."

Stacking Factor	Context Audio	No Local Transformer	Local Transformer: MaskGit	Local Transformer: Autoregressive
1 (no frame stacking)
2 frames
4 frames

"You observe that the scratch on that table is slight at one side, but deepens in the direction of the bedroom door."

Stacking Factor	Context Audio	No Local Transformer	Local Transformer: MaskGit	Local Transformer: Autoregressive
1 (no frame stacking)
2 frames
4 frames

"Time was, sir, when I was butler to old Sir Jabez Gilchrist, this young gentleman's father. When he was ruined I came to the college as servant, but I never forgot my old employer because he was down in the world."

Stacking Factor	Context Audio	No Local Transformer	Local Transformer: MaskGit	Local Transformer: Autoregressive
1 (no frame stacking)
2 frames
4 frames

"Had the telegraph been invented in the days of ancient Rome, would the Romans have accepted it, or have stoned Wheatstone? So thinking, I resolved that I was before my age, and that I must pay the allotted penalty."

Stacking Factor	Context Audio	No Local Transformer	Local Transformer: MaskGit	Local Transformer: Autoregressive
1 (no frame stacking)
2 frames
4 frames

"This shocking news reached the ears of her parents, whom Dona Estafania had concealed in another room that they might make their appearance at the right moment."

Stacking Factor	Context Audio	No Local Transformer	Local Transformer: MaskGit	Local Transformer: Autoregressive
1 (no frame stacking)
2 frames
4 frames

LibriTTS - Unseen Speakers

"Little by little, however, the latter became hemmed and bound in the meshes of the various devices and proceedings which the territorial officials evolved from the bogus laws."

Stacking Factor	Context Audio	No Local Transformer	Local Transformer: MaskGit	Local Transformer: Autoregressive
1 (no frame stacking)
2 frames
4 frames

"With forms which must be ranked as undoubted species, a perfect series exists from those which are absolutely sterile when crossed, to those which are almost or completely fertile."

Stacking Factor	Context Audio	No Local Transformer	Local Transformer: MaskGit	Local Transformer: Autoregressive
1 (no frame stacking)
2 frames
4 frames

"He took no notice of her; he looked at me, but as if, instead of me, he saw what he spoke of."

Stacking Factor	Context Audio	No Local Transformer	Local Transformer: MaskGit	Local Transformer: Autoregressive
1 (no frame stacking)
2 frames
4 frames

"After his visit I told Esprit to take me to the Palais Royal, and I left him at the gates."

Stacking Factor	Context Audio	No Local Transformer	Local Transformer: MaskGit	Local Transformer: Autoregressive
1 (no frame stacking)
2 frames
4 frames

"I've managed to save something every year, and that with helping my three sisters now and then, and tiding poor Cousin Mike over bad seasons."

Stacking Factor	Context Audio	No Local Transformer	Local Transformer: MaskGit	Local Transformer: Autoregressive
1 (no frame stacking)
2 frames
4 frames

"For our part, we reserve to the word its ancient and precise, circumscribed and determined significance, and we restrict slang to slang."

Stacking Factor	Context Audio	No Local Transformer	Local Transformer: MaskGit	Local Transformer: Autoregressive
1 (no frame stacking)
2 frames
4 frames

"Already a North and a South were talked of-why not set up also a West?"

Stacking Factor	Context Audio	No Local Transformer	Local Transformer: MaskGit	Local Transformer: Autoregressive
1 (no frame stacking)
2 frames
4 frames

"He refused at first to listen to the careful advice; it was repugnant to his liberal nature."

Stacking Factor	Context Audio	No Local Transformer	Local Transformer: MaskGit	Local Transformer: Autoregressive
1 (no frame stacking)
2 frames
4 frames

"It is not logically necessary to the existence of a memory belief that the event remembered should have occurred, or even that the past should have existed at all."

Stacking Factor	Context Audio	No Local Transformer	Local Transformer: MaskGit	Local Transformer: Autoregressive
1 (no frame stacking)
2 frames
4 frames

"Saint Paul wrote this epistle because, after his departure from the Galatian churches, Jewish Christian fanatics moved in, who perverted Paul's Gospel of man's free justification by faith in Christ Jesus."

Stacking Factor	Context Audio	No Local Transformer	Local Transformer: MaskGit	Local Transformer: Autoregressive
1 (no frame stacking)
2 frames
4 frames