This page contains audio samples for the paper "Frame-Stacked Local Transformers For Efficient Multi-Codebook Speech Generation". The texts and speakers were chosen at random from the evaluation sets, and the samples cover both seen and unseen speakers.
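For each text/context pair, we include samples from three stacking factors: 1 (no frame stacking), 2 frames, and 4 frames.
For each stacking factor, three model variants are included: no local transformer, a MaskGit local transformer, and an autoregressive local transformer.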
The diagram below shows the speedup benefits of frame stacking.
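As a rough illustration of where the speedup comes from, the sketch below groups consecutive codec frames into stacked steps and counts how many steps the primary model would need for each stacking factor. It is a minimal sketch under stated assumptions, not the paper's implementation: the helper names `stack_frames` and `num_primary_steps`, the padding token `PAD_TOKEN`, and the choice of 8 codebooks and 10 frames are all hypothetical.

```python
from typing import List

PAD_TOKEN = 0  # hypothetical padding id, not taken from the paper


def stack_frames(codes: List[List[int]], factor: int) -> List[List[int]]:
    """Group `factor` consecutive frames into one stacked step.

    `codes` holds one list of codebook tokens per frame. Each returned
    step concatenates the tokens of `factor` frames, so it has
    factor * num_codebooks entries.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    num_codebooks = len(codes[0]) if codes else 0
    padded = list(codes)
    # Pad the tail so the frame count is divisible by the stacking factor.
    while len(padded) % factor:
        padded.append([PAD_TOKEN] * num_codebooks)
    stacked = []
    for start in range(0, len(padded), factor):
        step: List[int] = []
        for frame in padded[start:start + factor]:
            step.extend(frame)
        stacked.append(step)
    return stacked


def num_primary_steps(num_frames: int, factor: int) -> int:
    """Number of stacked steps for `num_frames` frames (ceiling division)."""
    return -(-num_frames // factor)


if __name__ == "__main__":
    # Toy input: 10 frames with 8 codebooks each (both values are arbitrary).
    codes = [[frame * 8 + cb for cb in range(8)] for frame in range(10)]
    for factor in (1, 2, 4):
        steps = stack_frames(codes, factor)
        print(f"factor={factor}: {len(steps)} steps, {len(steps[0])} tokens per step")
```

Fewer primary steps does not by itself imply a proportional wall-clock speedup, since each stacked step carries more tokens to decode.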
| Stacking Factor | Context Audio | No Local Transformer | Local Transformer: MaskGit | Local Transformer: Autoregressive |
|---|---|---|---|---|
| 1 (no frame stacking) | | | | |
| 2 frames | | | | |
| 4 frames | | | | |