Earlier this year, Meta introduced Voicebox, a state-of-the-art AI model that can perform speech generation tasks like editing, sampling, and stylizing. It was a breakthrough in generative AI in that it could generalize to speech-generation tasks it wasn’t specifically trained to accomplish — and execute these tasks with state-of-the-art performance.
Now, Audiobox, the successor to Voicebox, is advancing generative AI for audio even further by unifying generation and editing capabilities for speech, sound effects (short, discrete sounds like a dog bark, a car horn, or a crack of thunder), and soundscapes, with a variety of input mechanisms to maximize controllability for each use case.
Most notably, Audiobox lets people use natural language prompts to describe a sound or type of speech they want to generate. If someone wants to generate a soundscape, for example, they can give the model a text prompt like, “A running river and birds chirping.”
Similarly, to generate a voice, a user might input, “A young woman speaks with a high pitch and fast pace.”
The model also allows users to combine an audio voice input with a text style prompt to synthesize speech of that voice in any environment (e.g., “in a cathedral”) or any emotion (e.g., “speaks sadly and slowly”). To our knowledge, Audiobox is the first model to enable dual input (voice prompts and text description prompts) for freeform voice restyling.
Audiobox demonstrates state-of-the-art controllability on speech and sound effects generation. Our own tests show it significantly surpasses prior best models (AudioLDM2, VoiceLDM, and TANGO) on quality and relevance (faithfulness to text description) in subjective evaluations. Audiobox outperforms Voicebox on style similarity by over 30 percent on a variety of speech styles.
Why we created Audiobox
Audio plays a fundamental role in many forms of media, from movies to podcasts, audiobooks, and video games. But producing quality audio is often a challenging process, one that requires access to extensive sound libraries as well as deep domain expertise (sound engineering, Foley, voice acting, etc.) to yield optimal results, expertise that hobbyists, let alone the general public, may not possess.
We’re releasing Audiobox to a hand-selected group of researchers and academic institutions with a track record in speech research to help further the state of the art in this research area and ensure we have a diverse set of partners to tackle the responsible AI aspects of this work. In the future, we believe research breakthroughs like Audiobox will lower the barrier of accessibility for audio creation and make it easy for anyone to become an audio content creator. Creators could use models like Audiobox to generate soundscapes for videos or podcasts, or custom sound effects for games, among many other use cases.
While Audiobox is built on top of the Voicebox framework, it can generate a larger variety of sounds, including speech in various environments and styles, non-speech sound effects, and soundscapes.
Being able to use text and voice inputs also greatly enhances Audiobox’s controllability compared to Voicebox. Audiobox users can use text description prompts to specify the style of speech and sound effects, a feature that was not supported in Voicebox. When a voice input and text prompt are used together, the voice input anchors the timbre, and the text prompt can be used to change other aspects.
Audiobox inherits Voicebox’s guided audio generation training objective and flow-matching modeling method to allow for audio infilling. With infilling, users can also use the model to polish sound effects (adding different thunder sounds into a raining soundscape, for example).
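Flow matching trains a model to predict a velocity field that transports noise toward data along a simple probability path; infilling falls out naturally because the model conditions on the unmasked audio context and the loss is computed only on the masked frames. The NumPy sketch below illustrates one training step of a conditional flow-matching objective with masking. It is a loose illustration only: the fixed linear "model," the feature shapes, and the mask handling are stand-ins for demonstration, not Audiobox's actual architecture or training code.

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_training_step(x1, mask, sigma_min=1e-5):
    """One conditional flow-matching step on masked audio features.

    x1   : (T, D) target audio features (e.g., a mel spectrogram)
    mask : (T,) True for frames to be generated (infilled);
           False frames are visible context the model conditions on.
    """
    T, D = x1.shape
    x0 = rng.standard_normal((T, D))  # noise sample
    t = rng.uniform()                 # random flow time in (0, 1)

    # Interpolate noise -> data along the probability path
    xt = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    # Target velocity field for this path
    target = x1 - (1.0 - sigma_min) * x0

    # Stand-in "model": a fixed linear map over [xt, visible context]
    context = np.where(mask[:, None], 0.0, x1)  # hide masked frames
    inp = np.concatenate([xt, context], axis=-1)
    W = rng.standard_normal((2 * D, D)) * 0.01
    pred = inp @ W

    # Loss is computed only on the frames being infilled
    err = (pred - target) ** 2
    return float(err[mask].mean())

features = rng.standard_normal((50, 8))  # fake 50-frame clip
mask = np.zeros(50, dtype=bool)
mask[20:30] = True                       # infill frames 20-29
loss = cfm_training_step(features, mask)
```

At inference time, the learned velocity field is integrated from noise to produce audio whose masked regions are filled in consistently with the surrounding context, which is what lets users drop new sounds (such as thunder) into an existing soundscape.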
Our invitation to collaborate on responsible research
AI for audio generation has made significant progress over the past year. But, as with all AI innovations, we must work to help ensure responsible use. The risks that come with this technology cannot be addressed by any individual or single organization alone. That’s why collaboration with the research community on state-of-the-art models is more important now than ever.
For these tools to be better and safer for everyone, the AI community must be empowered to build on top of our work and continue to develop these innovations responsibly. But access must be shared in the right way. To honor this and our ongoing commitment to open science, we’re releasing Audiobox under a research-only license to a limited number of hand-selected researchers and institutions.
We've also released an interactive demo that showcases Audiobox’s capabilities.
Implementing Audiobox responsibly
Tools like Audiobox can raise concerns about voice impersonation or other abuses. As part of our commitment to building generative AI features responsibly, we’ve implemented new technologies to help address these issues.
Both the Audiobox model and our interactive demo feature automatic audio watermarking so any audio created with Audiobox can be accurately traced to its origin. Our watermarking method embeds a signal into the audio that’s imperceptible to the human ear but can be detected all the way down to the frame level using a model capable of finding AI-generated segments in audio.
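The post doesn't disclose how the watermark itself works, but a classic way to get imperceptible, frame-level detectability is spread-spectrum watermarking: add a key-derived pseudorandom signal at low amplitude, then have the detector correlate each frame against that same keyed signal. The sketch below is a toy stand-in for illustration only (the amplitude is exaggerated so the toy detector is reliable), not Audiobox's actual watermarking method.

```python
import numpy as np

FRAME = 1024
KEY = 42  # shared secret between embedder and detector

def _prn(key):
    # Key-derived pseudorandom +/-1 sequence, reused for every frame
    return np.random.default_rng(key).choice([-1.0, 1.0], size=FRAME)

def embed(audio, key=KEY, alpha=0.3):
    """Add a pseudorandom watermark to every full frame of the signal."""
    out = audio.copy()
    w = _prn(key)
    for start in range(0, len(out) - FRAME + 1, FRAME):
        out[start:start + FRAME] += alpha * w
    return out

def detect_frames(audio, key=KEY, threshold=0.15):
    """Per-frame detection: correlate each frame with the keyed sequence.

    Unwatermarked frames correlate near zero; watermarked frames
    correlate near alpha, so a simple threshold separates them.
    """
    w = _prn(key)
    flags = []
    for start in range(0, len(audio) - FRAME + 1, FRAME):
        stat = np.dot(audio[start:start + FRAME], w) / FRAME
        flags.append(stat > threshold)
    return flags

rng = np.random.default_rng(7)
clean = rng.standard_normal(FRAME * 4)  # stand-in for real audio
marked = embed(clean)
# First two frames "AI-generated," last two untouched
mixed = np.concatenate([marked[:FRAME * 2], clean[FRAME * 2:]])
flags = detect_frames(mixed)  # flags frames 0 and 1, not 2 and 3
```

Because detection is per frame, this style of scheme can localize which segments of a longer recording are AI-generated, matching the frame-level tracing described above.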
We’ve tested this method against a broad range of attacks and found it to be more robust than current state-of-the-art solutions, making it extremely difficult for bad actors to bypass detection by modifying the AI-generated audio.
Additionally, similar to how websites use CAPTCHAs to deter bots and spam, our interactive demo includes a voice authentication feature to safeguard against impersonation. Anyone who wants to add a voice to the Audiobox demo will have to speak a voice prompt using their own voice. The prompt changes at regular, rapid intervals, making it extremely difficult to add someone else’s voice with pre-recorded audio.
To help ensure robustness across different groups of speakers, we tested the performance of Audiobox on speakers of different genders and with different native languages and verified that the performance is close across all groups of speakers.
Future use cases for Audiobox
In the long term, it will be crucial to move from building specialized audio generative models that can only generate one type of audio (such as speech or sound) to building generalized audio generative models that can generate any audio. With such models we can perform any generative audio task that requires understanding beyond a single modality. This will make it simpler for developers to build toward a wider, more dynamic range of use cases.
Audiobox is an important step toward democratizing audio generation. We envision a future where everyone can more easily and efficiently create audio that is tailored to their use cases. Our hope is that we can see the same creativity sparked by advancements in text and image generation happen for audio as well, for both professionals and hobbyists. Content creation, narration, sound editing, game development, and even AI chatbots can all benefit from the capabilities of audio generation models.
This blog post was made possible by the work of Akinniyi Akinyemi, Alice Rakotoarison, Andros Tjandra, Apoorv Vyas, Baishan Guo, Bapi Akula, Bowen Shi, Brian Ellis, Carleigh Wood, Chris Summers, Ivan Cruz, Joshua Lane, Jeff Wang, Jiemin Zhang, Liang Tan, Mary Williamson, Matt Le, Rashel Moritz, Robbie Adkins, Wei-Ning Hsu, William Ngan, Xinyue Zhang, Yael Yungster, and Yi-Chiao Wu.