May 9, 2023
When humans absorb information from the world, we innately use multiple senses, such as seeing a busy street and hearing the sounds of car engines. Today, we’re introducing an approach that brings machines one step closer to humans’ ability to learn simultaneously, holistically, and directly from many different forms of information — without the need for explicit supervision (the process of organizing and labeling raw data). We have built and are open-sourcing ImageBind, the first AI model capable of binding information from six modalities. The model learns a single embedding, or shared representation space, not just for text, image/video, and audio, but also for sensors that record depth (3D), thermal (infrared radiation), and inertial measurement units (IMU), which calculate motion and position. ImageBind equips machines with a holistic understanding that connects objects in a photo with how they will sound, their 3D shape, how warm or cold they are, and how they move.
ImageBind can outperform prior specialist models trained individually for one particular modality, as described in our paper. Most importantly, it helps advance AI by enabling machines to better analyze many different forms of information together. For example, using ImageBind, Meta’s Make-A-Scene could create images from audio, such as generating an image from the sounds of a rain forest or a bustling market. Other future possibilities include more accurate ways to recognize, connect, and moderate content, as well as ways to boost creative design, such as generating richer media more seamlessly and enabling broader multimodal search.
ImageBind is part of Meta’s efforts to create multimodal AI systems that learn from all possible types of data around them. As the number of modalities increases, ImageBind opens the floodgates for researchers to try to develop new, holistic systems, such as combining 3D and IMU sensors to design or experience immersive, virtual worlds. ImageBind could also provide a rich way to explore memories — searching for pictures, videos, audio files or text messages using a combination of text, audio, and image.
In typical AI systems, there is a specific embedding (that is, vectors of numbers that can represent data and their relationships in machine learning) for each respective modality. ImageBind shows that it’s possible to create a joint embedding space across multiple modalities without needing to train on data with every different combination of modalities. This is important because it’s not feasible for researchers to create datasets with samples that contain, for example, audio data and thermal data from a busy city street, or depth data and a text description of a seaside cliff.
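To make the idea of a joint embedding space concrete, here is a minimal sketch of how inputs from different modalities can be embedded and compared once they share one space. It follows the general shape of the open-sourced ImageBind code (the `imagebind_huge` checkpoint, `ModalityType` keys, and `load_and_transform_*` helpers), but exact module paths can vary between releases, and the file paths below are placeholders.

```python
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the pretrained model; every modality is mapped into one embedding space.
model = imagebind_model.imagebind_huge(pretrained=True).eval().to(device)

text_list = ["a dog barking", "rain falling in a forest"]
image_paths = ["dog.jpg", "rain_forest.jpg"]      # placeholder image files
audio_paths = ["dog_bark.wav", "rain.wav"]        # placeholder audio files

inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
}

with torch.no_grad():
    embeddings = model(inputs)  # dict of [batch, dim] tensors, one per modality

# Because the embeddings share one space, cross-modal similarity is a dot product,
# e.g. matching audio clips against text descriptions they were never paired with.
audio_to_text = torch.softmax(
    embeddings[ModalityType.AUDIO] @ embeddings[ModalityType.TEXT].T, dim=-1
)
print(audio_to_text)
```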
Just as there have been exciting recent advances in generating images, videos, and audio from text (such as Make-A-Scene and Meta’s Make-A-Video), ImageBind’s multimodal capabilities could allow researchers to use other modalities as input queries and retrieve outputs in other formats. ImageBind is also an important step toward building machines that can analyze different kinds of data holistically, as humans do.
ImageBind is a multimodal model that joins a recent series of Meta's open source AI tools. These include computer vision models like DINOv2, a new method for training high-performance computer vision models without fine-tuning, and Segment Anything (SAM), a universal segmentation model that can segment any object in any image based on any user prompt. ImageBind complements these models because it focuses on multimodal representation learning: it tries to learn a single aligned feature space for multiple modalities, including, but not limited to, images and videos. In the future, ImageBind could leverage the powerful visual features from DINOv2 to further improve its capabilities.
Humans have the ability to learn new concepts from only a few examples. We can typically read a description of an animal and then recognize it in real life. We can also look at a photo of an unfamiliar model of a car and anticipate how its engine might sound. This is partly because a single image can “bind” together an entire sensory experience. In the field of AI, however, as the number of modalities increases, the scarcity of paired sensory data limits standard multimodal learning, which depends on such pairings. Ideally, a single joint embedding space, in which many different kinds of data are distributed, could allow a model to learn visual features along with other modalities.
Previously, learning such a joint embedding space for all modalities would require collecting all possible combinations of paired data — an infeasible feat.
ImageBind circumvents this challenge by leveraging recent large-scale vision-language models and extending their zero-shot capabilities to new modalities simply through those modalities’ natural pairing with images, such as video-audio and image-depth data, to learn a single joint embedding space. For the four additional modalities (audio, depth, thermal, and IMU readings), we use naturally paired self-supervised data.
Training image-text models has been extensively studied because of the abundance of images and co-occurring text on the internet. ImageBind uses the binding property of images, meaning they co-occur with a variety of modalities and can serve as a bridge to connect them, such as linking text to image using web data or linking motion to video using video data captured from wearable cameras with IMU sensors.
The visual representations learned from large-scale web data can be used as targets to learn features for different modalities. This allows ImageBind to align any modality that co-occurs with images, naturally aligning those modalities among themselves. Modalities with a strong correlation to images, such as thermal and depth, are easier to align. Modalities that are not visual, such as audio and IMU, have a weaker correlation. Consider that there are particular sounds, like a baby’s cries, that could accompany any number of visual contexts.
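The alignment itself can be thought of as a contrastive objective that treats naturally co-occurring image-and-other-modality pairs as positives and everything else in a batch as negatives. The snippet below is a minimal, illustrative sketch of such an image-anchored InfoNCE-style loss; the function name and temperature value are placeholders rather than the paper’s exact implementation.

```python
import torch
import torch.nn.functional as F

def image_anchored_contrastive_loss(image_emb, other_emb, temperature=0.07):
    """InfoNCE-style loss aligning another modality (audio, depth, thermal, IMU)
    with image embeddings from naturally paired data (e.g. a video frame and
    its audio track). Matching pairs sit on the diagonal of the logit matrix.
    image_emb, other_emb: [batch, dim] embeddings of paired samples.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    other_emb = F.normalize(other_emb, dim=-1)

    logits = image_emb @ other_emb.T / temperature   # [batch, batch] similarities
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    # Symmetric cross-entropy: each image should match its own paired sample
    # and vice versa; all unpaired combinations in the batch act as negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```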
ImageBind shows that image-paired data is sufficient to bind together these six modalities. The model can interpret content more holistically, allowing the different modalities to “talk” to each other and find links without observing them together. For example, ImageBind can associate audio and text without seeing them together. This enables other models to “understand” new modalities without any resource-intensive training. ImageBind’s strong scaling behavior allows the model to substitute or enhance many AI models by enabling them to use other modalities. For instance, while Make-A-Scene can generate images by using text prompts, ImageBind could upgrade it to generate images using audio sounds, such as laughter or rain.
Image-aligned, self-supervised learning shows that the performance of our model can actually improve by using very few training examples. Our model has new emergent capabilities, or scaling behavior — that is, abilities that didn’t exist in smaller models but appear in larger versions. This might include recognizing which audio fits with a certain image or predicting the depth of a scene from a photo.
Our analysis shows that ImageBind’s scaling behavior improves with the strength of the image encoder. In other words, ImageBind’s ability to align modalities increases with the strength and size of the vision model. This suggests that larger vision models benefit nonvision tasks, such as audio classification, and the benefits of training such models go beyond computer vision tasks.
Among our experiments, we used the audio and depth encoders from ImageBind and compared them with prior work in zero-shot retrieval as well as audio and depth classification tasks.
We discovered that ImageBind features can be used for few-shot audio and depth classification tasks and can outperform prior methods tailored for those modalities. For example, ImageBind significantly outperforms Meta’s self-supervised AudioMAE model trained on Audioset, as well as a supervised AudioMAE model fine-tuned for audio classification, with gains of approximately 40 percent in top-1 accuracy on ≤4-shot classification.
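One simple way to use frozen ImageBind features for few-shot classification is a nearest-class-mean (prototype) classifier, sketched below. This is an illustrative recipe that assumes precomputed embeddings, not necessarily the exact evaluation protocol used in the paper.

```python
import torch
import torch.nn.functional as F

def few_shot_prototype_classify(support_emb, support_labels, query_emb):
    """Classify query embeddings using class prototypes built from a handful of
    labeled support examples (e.g. <=4 audio clips or depth maps per class).
    support_emb: [n_support, dim] frozen ImageBind features
    support_labels: [n_support] integer class ids
    query_emb: [n_query, dim] frozen ImageBind features
    Returns a [n_query] tensor of predicted class ids.
    """
    support_emb = F.normalize(support_emb, dim=-1)
    query_emb = F.normalize(query_emb, dim=-1)

    classes = support_labels.unique()
    # Average the few support embeddings of each class into a single prototype.
    prototypes = F.normalize(
        torch.stack([support_emb[support_labels == c].mean(dim=0) for c in classes]),
        dim=-1,
    )

    # Cosine similarity to each prototype; the most similar prototype wins.
    sims = query_emb @ prototypes.T
    return classes[sims.argmax(dim=-1)]
```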
ImageBind also achieved new state-of-the-art performance on emergent zero-shot recognition tasks across modalities, even outperforming recent models that were trained to recognize concepts for that modality.
With the capability to use several modalities for input queries and retrieve outputs across other modalities, ImageBind shows new possibilities for creators. Imagine that someone could take a video recording of an ocean sunset and instantly add the perfect audio clip to enhance it, while an image of a brindle Shih Tzu could yield essays or depth models of similar dogs. Or when a model like Make-A-Video produces a video of a carnival, ImageBind can suggest background noise to accompany it, creating an immersive experience.
People could even segment and identify the objects in an image based on audio. This creates distinctive opportunities to create animations from static images by combining them with audio prompts. For example, given an image containing an alarm clock and a rooster, a creator could use a crowing audio prompt to segment the rooster or an alarm sound to segment the clock, and then animate both into a video sequence.
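As a rough illustration of the retrieval scenario above, pairing a video or image with sound could be as simple as ranking a precomputed bank of audio embeddings against the visual query in the shared space. The function below is a hypothetical sketch; the sound library and its embeddings are assumed to exist already.

```python
import torch
import torch.nn.functional as F

def retrieve_audio_for_image(image_emb, audio_bank_emb, audio_paths, k=3):
    """Return the k audio clips whose embeddings best match a visual query,
    e.g. scoring ambient sound for an ocean-sunset video.
    image_emb: [dim] query embedding in the shared space
    audio_bank_emb: [n_clips, dim] precomputed embeddings of a sound library
    audio_paths: list of n_clips file names (placeholders)
    """
    query = F.normalize(image_emb, dim=-1)
    bank = F.normalize(audio_bank_emb, dim=-1)

    scores = bank @ query                            # cosine similarity per clip
    top = torch.topk(scores, k=min(k, len(audio_paths)))
    return [(audio_paths[int(i)], float(scores[i])) for i in top.indices]
```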
While we explored six modalities in our current research, we believe that introducing new modalities that link as many senses as possible — like touch, speech, smell, and brain fMRI signals — will enable richer human-centric AI models.
There’s still a lot to uncover about multimodal learning. The AI research community has yet to effectively quantify scaling behaviors that appear only in larger models and understand their applications. ImageBind is a step toward evaluating them in a rigorous way and showing novel applications in image generation and retrieval.
We hope the research community will explore ImageBind and our accompanying published paper to find new ways to evaluate vision models and lead to novel applications.