AI innovation by Microsoft: Create 90-minute podcasts in English or Mandarin from text, open for public experimentation

Microsoft has unveiled VibeVoice, an open-source project designed to revolutionise text-to-speech (TTS) conversion. The project offers a set of features aimed at challenges that traditional TTS systems struggle with.

VibeVoice can convey emotion and speak in English or Mandarin, although its singing capabilities are currently limited. The project is particularly useful as an accessibility tool, and its potential applications extend beyond simple text-to-speech generation.

The project is designed for generating expressive, long-form, multi-speaker conversational audio, such as podcasts, from text. Advanced examples of VibeVoice's capabilities, including multiple speakers and language demonstrations, can be found on its project page.

VibeVoice can produce decent text-to-speech output. The model can synthesize up to 90 minutes of speech with up to four distinct speakers, surpassing the one-or-two-speaker limit of many prior models.

Two versions of VibeVoice are currently available to test: a 1.5 billion parameter model and a 7 billion parameter model. The larger model has a smaller 32k context window and can produce 45 minutes of audio, while the 1.5 billion parameter model can generate up to 90 minutes of audio with a 64k context window.
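The two context-window figures can be sanity-checked with a little arithmetic. The sketch below is only illustrative: it assumes "64k" and "32k" mean 64 × 1024 and 32 × 1024 tokens, and that the whole context window is spent on audio, which ignores the tokens taken up by the script itself. The names `VARIANTS` and `implied_tokens_per_second` are made up for this example and are not part of VibeVoice.

```python
# Figures quoted in the article; "64k" and "32k" are assumed to mean
# 64 * 1024 and 32 * 1024 tokens.
VARIANTS = {
    "1.5B": {"context_tokens": 64 * 1024, "audio_minutes": 90},
    "7B": {"context_tokens": 32 * 1024, "audio_minutes": 45},
}


def implied_tokens_per_second(context_tokens: int, audio_minutes: float) -> float:
    """Upper bound on context tokens spent per second of audio (illustrative only)."""
    return context_tokens / (audio_minutes * 60)


# Hypothetical helper output: one implied rate per model variant.
rates = {name: implied_tokens_per_second(**spec) for name, spec in VARIANTS.items()}

for name, rate in rates.items():
    print(f"{name}: about {rate:.1f} context tokens per second of audio")
```

Under these assumptions both variants work out to roughly 12 context tokens per second of audio, which suggests the 7 billion parameter model's shorter output follows from its smaller context window rather than from a different audio encoding.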

A lighter version of VibeVoice, with 0.5 billion parameters, is planned for release and is designed for real-time audio generation. Installed locally, VibeVoice uses around 7GB of VRAM for the 1.5 billion parameter model and up to 18GB for the 7 billion parameter model.
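For anyone deciding which version to try locally, the VRAM figures above translate into a simple rule of thumb. This is a minimal sketch, not an official tool: `VRAM_NEEDED_GB` and `choose_variant` are hypothetical names, and the thresholds are just the approximate numbers quoted in this article, so real usage may differ with audio length and settings.

```python
from typing import Optional

# Approximate VRAM needs quoted in the article, in GB (not official requirements).
VRAM_NEEDED_GB = {"7B": 18.0, "1.5B": 7.0}


def choose_variant(available_vram_gb: float) -> Optional[str]:
    """Return the largest variant whose quoted VRAM need fits, or None if neither does."""
    # Check the most demanding variant first so the largest fitting model wins.
    ranked = sorted(VRAM_NEEDED_GB.items(), key=lambda item: item[1], reverse=True)
    for name, needed_gb in ranked:
        if available_vram_gb >= needed_gb:
            return name
    return None


# Example: a card with 12GB of VRAM only clears the smaller model's threshold.
print(choose_variant(12))
```

A result of `None` means neither quoted threshold is met, in which case the online version is the more practical way to experiment.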

In addition to local installation, an online version of VibeVoice is available. The project's GitHub repository and Hugging Face page are the places to learn more about the project and set it up locally.

Microsoft's VibeVoice addresses challenges in traditional TTS systems, particularly scalability, speaker consistency, and natural turn-taking. Potential uses include improved chat assistants and reduced reliance on external servers for streaming audio.

A video by Bijan Bowen dives deeper into VibeVoice and its capabilities.

With support for other languages planned in future refinements, VibeVoice is set to become a versatile text-to-speech tool. A streaming audio version is also planned for a future release, further expanding its potential applications.
