Generative AI for audio
AudioCraft: generating high-quality audio and music from text
AudioCraft is a one-stop code base for all your generative audio needs: music, sound effects, and compression, all trained on raw audio signals. We have released controllable, high-quality models for music and audio generation from text inputs. This represents significant progress in the development of interactive AI systems, enabling people to easily and naturally co-create with AI models.
Open source, single-stage models, built with simplicity in mind
AudioCraft powers our audio compression and generation research and consists of three models: MusicGen, AudioGen, and EnCodec. MusicGen, which was trained on Meta-owned and specifically licensed music, generates music from text-based user inputs, while AudioGen, trained on public sound effects, generates audio from text-based user inputs. EnCodec, the foundation on which MusicGen and AudioGen are built, is a state-of-the-art, real-time, high-fidelity audio codec that leverages neural networks to compress any kind of audio and reconstruct the original signal with high fidelity. We further propose a diffusion-based approach on top of EnCodec to reconstruct the audio from the compressed representation with fewer artifacts.
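EnCodec's compressed representation is built with residual vector quantization: a cascade of small codebooks where each stage encodes what the previous stage missed, yielding several parallel token streams per audio frame. The sketch below illustrates that idea only; the codebooks, frame dimensions, and values are toy stand-ins, not EnCodec's learned parameters.

```python
# Minimal sketch of residual vector quantization (RVQ), the idea behind
# EnCodec's parallel token streams. All codebooks and vectors here are
# toy values, not EnCodec's learned parameters.

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(codebooks, vec):
    """Quantize vec with a cascade of codebooks; each stage encodes the
    residual left over by the previous one. Returns one token per stage."""
    tokens, residual = [], list(vec)
    for cb in codebooks:
        idx = nearest(cb, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return tokens

def rvq_decode(codebooks, tokens):
    """Reconstruct by summing the chosen entry from every codebook."""
    out = [0.0] * len(codebooks[0][0])
    for cb, idx in zip(codebooks, tokens):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out

# Two tiny 4-entry codebooks over 2-d frames (coarse stage, then fine stage).
codebooks = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [0.25, 0.25]],
]
frame = [1.2, 0.3]
tokens = rvq_encode(codebooks, frame)   # one token per parallel stream
approx = rvq_decode(codebooks, tokens)  # coarse + fine ≈ original frame
```

Each extra codebook stage tightens the reconstruction at the cost of one more token stream, which is why EnCodec exposes a quality/bitrate trade-off.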
Open source code for the training and inference of generative audio models
Beyond a collection of models, AudioCraft is a single codebase for developing audio generative models. It provides a unified framework for building autoregressive models with arbitrary conditioning and datasets. We hope to see it foster research and innovation across a number of applications.
MusicGen and AudioGen overview
With AudioCraft, we simplify the overall design of generative models for audio compared to prior work.
Both MusicGen and AudioGen consist of a single autoregressive language model (LM) that operates over streams of compressed discrete audio representations, i.e., tokens. We introduce a simple approach to leverage the internal structure of the parallel streams of tokens and show that, with a single model and an elegant token interleaving pattern, our approach efficiently models audio sequences, simultaneously capturing the long-term dependencies in the audio and allowing us to generate high-quality audio.
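One such interleaving is a delay pattern: stream k is offset by k time steps, so a single autoregressive model can emit every stream while each step still conditions on the coarser codebooks of the same frame. The sketch below shows the layout only, with made-up token values; it is an illustration of the idea, not the models' implementation.

```python
# Sketch of a "delay" interleaving pattern over K parallel codebook streams:
# stream k is shifted right by k steps. PAD marks positions with no token yet.
# Token values are arbitrary; the layout is the point.

PAD = None

def delay_interleave(streams):
    """streams: list of K equal-length token lists. Returns a list of time
    steps, each holding K entries, with stream k delayed by k steps."""
    k, t = len(streams), len(streams[0])
    total = t + k - 1  # extra steps so the last, most-delayed token fits
    out = []
    for step in range(total):
        frame = []
        for s in range(k):
            pos = step - s  # undo stream s's delay
            frame.append(streams[s][pos] if 0 <= pos < t else PAD)
        out.append(frame)
    return out

streams = [
    [10, 11, 12],  # codebook 0 (coarsest)
    [20, 21, 22],  # codebook 1
    [30, 31, 32],  # codebook 2 (finest)
]
pattern = delay_interleave(streams)
# Step 0 carries only the first coarse token; later steps fill in finer ones.
```

The flattened sequence grows by only K - 1 steps rather than by a factor of K, which is what lets a single-stage model stay efficient over several parallel streams.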
Our models leverage the EnCodec neural audio codec to learn discrete audio tokens from the raw waveform. EnCodec maps the audio signal to one or several parallel streams of discrete tokens. We then use a single autoregressive language model to recursively model the audio tokens from EnCodec. The generated tokens are fed to the EnCodec decoder, which maps them back to the audio space to obtain the output waveform. Finally, different types of conditioning models can be used to control the generation, such as a pretrained text encoder for text-to-audio applications.
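The data flow just described, condition on text, autoregressively sample tokens, then decode tokens back to a waveform, can be sketched end to end. Every component below is a deliberately trivial placeholder (simple arithmetic instead of a text encoder, LM, and EnCodec decoder); only the plumbing between the stages mirrors the pipeline above.

```python
# Toy end-to-end sketch of the generation pipeline. Stubs stand in for the
# real text encoder, transformer LM, and EnCodec decoder; only the data
# flow between stages reflects the actual system.

import math

def toy_decode(tokens, frame_size=4):
    """Stand-in for the EnCodec decoder: map each discrete token back to a
    short waveform frame (here, a snippet of a token-pitched sine wave)."""
    wav = []
    for tok in tokens:
        freq = 0.1 * (tok + 1)
        wav.extend(math.sin(freq * i) for i in range(frame_size))
    return wav

def toy_lm_step(history, conditioning):
    """Stand-in for one autoregressive LM step: derive the next token from
    the previous tokens and the conditioning (here, trivial arithmetic)."""
    return sum(history, conditioning) % 8  # 8-token toy vocabulary

def generate(text, n_steps=6):
    """Condition on text, sample tokens one by one, then decode to audio."""
    conditioning = len(text) % 8  # stub for a pretrained text encoder
    tokens = []
    for _ in range(n_steps):
        tokens.append(toy_lm_step(tokens, conditioning))
    return tokens, toy_decode(tokens)

tokens, waveform = generate("warm arpeggio")
```

In the real system the LM step samples from a transformer's predicted distribution over the codebook vocabulary, and the decoder is EnCodec's neural decoder; the loop structure is the same.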
Audio generative model tasks
AudioGen and MusicGen represent two variants of audio generative models. While AudioGen focuses on text-to-sound generation and was trained to produce environmental sounds, MusicGen is an audio generative model tailored specifically for music. MusicGen can generate music conditioned on textual or melodic features, allowing better control over the generated output.
This AudioGen demo performs the task of text-to-audio generation. Given a textual description of an acoustic scene, the model generates the corresponding environmental sound with realistic recording conditions and complex scene context.
Listen to the demo
Whistling with wind blowing
Sirens and a humming engine approach and pass
This demo demonstrates the capabilities of MusicGen, the audio generation model tailored specifically for music. MusicGen was trained on roughly 400,000 recordings along with text descriptions and metadata, amounting to 20,000 hours of music owned by Meta or licensed specifically for this purpose.
Listen to the demo
Pop dance track with catchy melodies, tropical percussion, and upbeat rhythms, perfect for the beach
Earthy tones, environmentally conscious, ukulele-infused, harmonic, breezy, easygoing, organic instrumentation, gentle grooves
Explore more on AudioCraft
Discover more about AudioCraft: scroll through our resources, ranging from our research papers to model cards and more.