In this work, we introduce Metis, a foundation model for unified speech generation. Unlike previous task-specific or multi-task models, Metis follows a pre-training and fine-tuning paradigm. It is pre-trained on large-scale unlabeled speech data using masked generative modeling and then fine-tuned to adapt to diverse speech generation tasks. Specifically, (1) Metis utilizes two discrete speech representations: SSL tokens derived from speech self-supervised learning features, and acoustic tokens directly quantized from waveforms. (2) Metis performs masked generative pre-training on SSL tokens, utilizing 300K hours of diverse speech data, without any additional condition. (3) Through fine-tuning with task-specific conditions, Metis achieves efficient adaptation to various speech generation tasks while supporting multimodal input, even with limited data and few trainable parameters. Experiments demonstrate that Metis can serve as a foundation model for unified speech generation: Metis outperforms state-of-the-art task-specific or multi-task systems across five speech generation tasks, including zero-shot text-to-speech, voice conversion, target speaker extraction, speech enhancement, and lip-to-speech, even with fewer than 20M trainable parameters or 300 times less training data. We will release code and model checkpoints soon.
Figure: Overview of pre-training (left) and fine-tuning (right). During pre-training, the model predicts masked SSL tokens without any condition. During fine-tuning, it is trained with a task-specific condition to adapt to each task.
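The unconditional pre-training step described above can be illustrated with a minimal sketch of masked token modeling. It is not the authors' implementation: the cosine mask schedule, the `MASK_ID` value, and the helper names `cosine_mask_ratio`, `mask_tokens`, and `masked_nll` are assumptions made for illustration, and the "model" is represented only by a table of per-position probabilities.

```python
import math
import random

MASK_ID = -1  # hypothetical id standing in for a [MASK] token


def cosine_mask_ratio(u: float) -> float:
    """Map u in [0, 1] to a mask ratio in [0, 1] (assumed cosine schedule)."""
    return math.cos(0.5 * math.pi * u)


def mask_tokens(tokens, ratio, rng):
    """Replace round(ratio * len(tokens)) positions (at least one) with MASK_ID."""
    n = len(tokens)
    k = max(1, min(n, round(ratio * n)))
    masked_positions = sorted(rng.sample(range(n), k))
    corrupted = list(tokens)
    for i in masked_positions:
        corrupted[i] = MASK_ID
    return corrupted, masked_positions


def masked_nll(probs, targets, masked_positions):
    """Average negative log-likelihood, computed over masked positions only."""
    total = 0.0
    for i in masked_positions:
        total -= math.log(probs[i][targets[i]])
    return total / len(masked_positions)


if __name__ == "__main__":
    rng = random.Random(0)
    ssl_tokens = [5, 3, 7, 1, 4, 2, 6, 0]  # toy SSL token sequence
    ratio = cosine_mask_ratio(rng.random())
    corrupted, positions = mask_tokens(ssl_tokens, ratio, rng)
    # A real model would predict these distributions from `corrupted`;
    # a uniform distribution over 8 tokens is used here as a placeholder.
    uniform = [[1.0 / 8] * 8 for _ in ssl_tokens]
    print(corrupted, masked_nll(uniform, ssl_tokens, positions))
```

In this sketch the loss is taken only at masked positions and no conditioning input is used, which mirrors the unconditional pre-training setting. Fine-tuning would, under the same assumptions, feed a task-specific condition to the model alongside the corrupted tokens.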
Metis-TTS can generate high-fidelity speech with strong speaker similarity after fine-tuning on only 1K or 10K hours of data, without requiring text-speech alignment information or a duration predictor.
Domain | Prompt Speech | Text | Metis-TTS |
---|---|---|---|
Celebrities & Characters | | Listen, folks, let me tell you about language models. it's tremendous, really. It's like having the smartest, most powerful brain in the world, doing all the thinking for you, making everything faster, better, and stronger. | |
Celebrities & Characters | | Life's too short to get stuck in the same loop, selling pieces of fabric to people who don't even care. You see, Morty, the universe is infinite, full of chaos and possibilities. | |
Celebrities & Characters | | Truly, it feels like Zhen's smile carries a certain charm. | |
Celebrities & Characters | | The Empress resides in Jingren Palace, while Her Majesty lives in Yikun Palace. In this harem, only our Yikun Palace shares the same "Kun" character as Kunning Palace, where the Emperor and Empress were wed. | |
Normal | | This staggering figure highlights the immense financial burden faced by individuals during that period, underscoring the economic challenges and legal pressures of the time. | |
Normal | | This shift in audience demand is pushing the entertainment industry to embrace more diverse storytelling, with superheroes and protagonists from various cultural, racial, and gender backgrounds. | |
Normal | | I want to share some of some of the traditions and customs that make it so special. Whether you're celebrating or just curious, I hope this gives you a deeper appreciation for what Thanksgiving means. | |
Normal | | 汉堡是一种源自德国的流行食品,如今已成为全球广受欢迎的快餐之一。它通常由两片圆形的面包夹着一块煎烤的肉饼,如牛肉、鸡肉或植物基替代品组成,并搭配各种配料。 | |
Accent | | He had a vague recollection of hearing a strange noise, a sharp, almost metallic clang, but he couldn't place where it had come from. | |
Accent | | It is a very tenable hypothesis and will bear looking into. | |
Accent | | Men like Joe Goose dated existence from drunk to drunk. | |
Accent | | It is very plausible to such people a most convincing hypothesis. | |
Emotion | | Please invite Tom if there is not requires. | |
Emotion | | Born once every one hundred years, dies in flames! | |
Emotion | | On the twenty second of last march. | |
Emotion | | Perhaps you think is a queer title for this chapter. | |
Hard Cases | | I thought a thought. But the thought I thought wasn't the thought I thought I thought. If the thought I thought I thought had been the thought I thought, I wouldn't have thought so much. | |
Hard Cases | | Whether the weather be fine or whether the weather be not, whether the weather be cold or whether the weather be hot. Well weather the weather whether we like it or not. | |
Hard Cases | | 墙上画凤凰,凤凰画在粉红墙。红凤凰、粉凤凰,红粉凤凰、花凤凰。红凤凰,黄凤凰,红粉凤凰,粉红凤凰,花粉花凤凰。 | |
Hard Cases | | 南南和兰兰南南帮阿姨摆盘,兰兰帮阿姨摆碗。碗对着盘,盘对着碗。阿姨夸南南和兰兰,把碗盘摆的真好看。 | |
Source Speech | Reference Speech | Metis-VC |
---|---|---|
Metis-SE supports general speech restoration, including denoising, dereverberation, declipping, and super-resolution up to 24 kHz.
Noisy Speech | Noisy Spectrogram | Metis-SE | Enhanced Spectrogram |
---|---|---|---|
Metis-TSE supports target speaker extraction for diverse speakers across multilingual and in-the-wild scenarios.
Mixed Speech | Reference Speech | Target Speech | Metis-TSE |
---|---|---|---|
In-the-wild. | | | |
In-the-wild. | | | |
In-the-wild, zh. | | | |
In-the-wild, zh. | | | |
In-the-wild, en and zh. | | | |
In-the-wild, en and zh. | | | |
LibriMix. | | | |
LibriMix. | | | |
LibriMix. | | | |
Video | Prompt Speech | Metis-L2S | Ground Truth |
---|---|---|---|
Given that our model is capable of synthesizing speech with high speaker similarity, it presents potential risks of misuse, including voice identification spoofing and speaker impersonation. To prevent potential misuse, it is crucial to develop robust synthesized speech detection models and implement a comprehensive system for reporting suspected misuse cases.