Generative AI for audio

AudioCraft: generating high-quality audio and music from text

AudioCraft is a one-stop code base for all your generative audio needs: music, sound effects, and compression, with models trained on raw audio signals. We have released controllable, high-quality models that generate music and audio from text inputs. This release represents significant progress toward interactive AI systems that let people co-create easily and naturally with AI models.

The models

Open source, single-stage models, built with simplicity in mind

MusicGen paper

AudioGen paper

EnCodec paper

Multi-Band Diffusion paper

AudioCraft powers our audio compression and generation research and consists of three models: MusicGen, AudioGen, and EnCodec. MusicGen, which was trained with Meta-owned and specifically licensed music, generates music from text-based user inputs, while AudioGen, trained on public sound effects, generates audio from text-based user inputs. EnCodec, which serves as the foundation for both MusicGen and AudioGen, is a state-of-the-art, real-time, high-fidelity audio codec that uses neural networks to compress any kind of audio and reconstruct the original signal with high fidelity. We further propose a diffusion-based decoding approach for EnCodec, Multi-Band Diffusion, that reconstructs the audio from the compressed representation with fewer artifacts.


The code

Open source code for the training and inference of generative audio models

Beyond a collection of models, AudioCraft is a single code base for developing generative audio models. It provides a unified framework for building autoregressive models with arbitrary conditioning and datasets. We hope to see it foster research and innovation across a number of applications.


MusicGen and AudioGen overview

With AudioCraft, we simplify the overall design of generative models for audio compared to prior work.

Both MusicGen and AudioGen consist of a single autoregressive language model (LM) that operates over streams of compressed, discrete audio representations, i.e., tokens. We introduce a simple approach that exploits the internal structure of the parallel token streams. With a single model and an elegant token interleaving pattern, this approach models audio sequences efficiently, capturing the long-term dependencies in the audio while still generating high-quality output.
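One way to picture such an interleaving is a delay-style pattern of the kind described in the MusicGen paper, in which stream k is shifted right by k steps so that a single model step covers one token from every stream. The sketch below is a minimal illustration under that assumption. The function names, the use of None as padding, and the toy token values are hypothetical and do not come from the AudioCraft code base.

```python
from typing import List, Optional

Token = Optional[int]
PAD: Token = None  # hypothetical padding marker for positions with no token


def delay_interleave(streams: List[List[int]]) -> List[List[Token]]:
    """Shift stream k right by k steps; all streams must share one length."""
    length = len(streams[0])
    if any(len(s) != length for s in streams):
        raise ValueError("all token streams must have the same length")
    total = length + len(streams) - 1
    delayed: List[List[Token]] = [[PAD] * total for _ in streams]
    for k, stream in enumerate(streams):
        for t, token in enumerate(stream):
            delayed[k][t + k] = token
    return delayed


def undo_delay(delayed: List[List[Token]]) -> List[List[Token]]:
    """Invert delay_interleave by cutting each stream's own offset window."""
    length = len(delayed[0]) - len(delayed) + 1
    return [row[k:k + length] for k, row in enumerate(delayed)]


# Three toy streams of four tokens each (values are arbitrary).
streams = [[11, 12, 13, 14], [21, 22, 23, 24], [31, 32, 33, 34]]
delayed = delay_interleave(streams)
restored = undo_delay(delayed)

for row in delayed:
    print(row)
```

With K streams of length T, this layout needs T + K - 1 model steps instead of the K * T steps that fully flattening the streams into one sequence would require.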

Our models use the EnCodec neural audio codec to learn discrete audio tokens from the raw waveform. EnCodec maps the audio signal to one or several parallel streams of discrete tokens. A single autoregressive language model then models these tokens, predicting each step from the ones before it. The generated tokens are fed to the EnCodec decoder, which maps them back to the audio space to produce the output waveform. Finally, different types of conditioning models can be used to control the generation, such as a pretrained text encoder for text-to-audio applications.
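To make the idea of several parallel token streams concrete, the toy sketch below quantizes a signal in stages, with each stage encoding the residual left by the previous one. It is a scalar stand-in for the residual quantization used in neural codecs, not EnCodec itself: there is no neural encoder or decoder, and the codebook sizes, scales, and function names are assumptions chosen for illustration.

```python
import math
from typing import List


def make_codebook(levels: int, scale: float) -> List[float]:
    """Evenly spaced values covering [-scale, scale]."""
    step = 2 * scale / (levels - 1)
    return [-scale + i * step for i in range(levels)]


def encode(signal: List[float], codebooks: List[List[float]]) -> List[List[int]]:
    """Return one token stream per codebook by quantizing successive residuals."""
    residual = list(signal)
    streams = []
    for cb in codebooks:
        # Token = index of the codebook value nearest to the current residual.
        tokens = [min(range(len(cb)), key=lambda i: abs(cb[i] - x)) for x in residual]
        residual = [x - cb[i] for x, i in zip(residual, tokens)]
        streams.append(tokens)
    return streams


def decode(streams: List[List[int]], codebooks: List[List[float]]) -> List[float]:
    """Sum the codebook values selected by each stream at every time step."""
    length = len(streams[0])
    return [sum(cb[s[t]] for s, cb in zip(streams, codebooks)) for t in range(length)]


# Toy "waveform": two periods of a sine wave, values in [-1, 1].
signal = [math.sin(2 * math.pi * t / 16) for t in range(32)]

# Three codebooks of 8 levels; each covers the worst-case residual of the last.
codebooks = []
scale = 1.0
for _ in range(3):
    codebooks.append(make_codebook(8, scale))
    scale = scale / 7  # half of the step 2 * scale / (8 - 1)

streams = encode(signal, codebooks)


def reconstruction_error(num_streams: int) -> float:
    """Largest absolute error when decoding only the first num_streams streams."""
    approx = decode(streams[:num_streams], codebooks[:num_streams])
    return max(abs(a - b) for a, b in zip(signal, approx))


errors = [reconstruction_error(n) for n in (1, 2, 3)]
print(errors)
```

Each extra stream refines the reconstruction, which is why a codec of this kind can trade bitrate for quality by keeping more or fewer streams. In the pipeline described above, the language model would generate these token streams and the decoder would turn them back into audio.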


Audio generative model tasks

AudioGen and MusicGen represent two variants of audio generative models. AudioGen focuses on text-to-sound generation and learned to produce audio from environmental sounds, while MusicGen is an audio generative model tailored specifically for music. MusicGen can generate music conditioned on textual or melodic features, allowing better control over the generated output.

Text-to-sound generation

Text-to-music generation


AudioCraft Demos

Text-to-Sound

This AudioGen demo performs text-to-audio generation. Given a textual description of an acoustic scene, the model generates the corresponding environmental sound with realistic recording conditions and complex scene context.

Listen to the demo

Whistling with wind blowing

Sirens and a humming engine approach and pass

Text-to-Music

This demo showcases the capabilities of MusicGen, the audio generation model tailored specifically for music. MusicGen was trained on roughly 400,000 recordings along with text descriptions and metadata, amounting to 20,000 hours of music owned by Meta or licensed specifically for this purpose.

Listen to the demo

Pop dance track with catchy melodies, tropical percussion, and upbeat rhythms, perfect for the beach

Earthy tones, environmentally conscious, ukulele-infused, harmonic, breezy, easygoing, organic instrumentation, gentle grooves


Resources

Explore more on AudioCraft

Discover more about AudioCraft in our resources, which include our research papers, model cards, and more.

MusicGen paper
AudioGen paper
EnCodec paper
Multi-Band Diffusion paper
AI at Meta blog
Demo
Code
Model cards
