May 20, 2022
We’ve built CommerceMM, a new approach to multimodal understanding for online shopping. Because so many product posts rely on both text and images, comprehension of multimodal data is crucial to make products more discoverable and help shoppers more easily find what they’re looking for.
CommerceMM relies on a novel set of pretraining tasks, called omni retrieval, to fuse the model’s characterizations of a post’s text and image, creating a representation of the post as a whole.
We achieved state-of-the-art performance on seven downstream tasks in research settings, such as product categorization and retrieval. An early version of CommerceMM has been deployed to provide more relevant search results and recommendations across Instagram Shops, Facebook Shops, and Marketplace listings.
When you shop online, how do you find what you’re looking for? If you’re browsing a traditional e-commerce site, designers may have spent months slotting products into the right categories and appending the most descriptive tags. The decisions are not always straightforward: Do blenders belong with pots and pans in kitchen supplies or in appliances alongside portable dishwashers? Gray areas like these demand nuanced interpretation and a keen understanding of how shoppers think.
At an online marketplace with a vast number of independent vendors — and an even vaster array of merchandise — the task becomes significantly more complex. Given the sheer scale of Meta’s shopping experiences across Marketplace and Shops on Facebook and Instagram, we leverage AI tools to help categorize and label products. To address this need, we’ve developed a powerful new approach to pretraining and a versatile new model for Commerce MultiModal Representation (CommerceMM).
We’ve taught CommerceMM to analyze a post as a whole rather than as merely isolated blocks of images and words. That’s a vital skill because so many commerce posts are multimodal, with pictures, captions, and other text working in concert to express a wealth of information. Often, a product’s description identifies the features important to a particular shopper — that the wood in a chair was responsibly harvested or a pair of pants is machine washable — but in other cases, those attributes are instead folded into a series of hashtags. Sometimes, the photo, not the text, reveals the crucial detail, such as the profile of the perfect sofa or cut of a flattering dress. Fully comprehending a product post means parsing the subtleties of this sort of multimodal content.
CommerceMM achieves a richer understanding of multimodal data by integrating its characterizations of a post’s text and image.
Previous researchers have pretrained Transformer-based models to associate an image with its accompanying text description, using medium-scale image-text pairs as the default training data. But online shopping is conducive to more diverse text and image data, which we can use to teach AI systems to find new relationships between modalities. For example, when a shopper searches for “planter” and then clicks on a porcelain flowerpot, the shopper confirms the affinity between the query text and the multimodal product page. That page might show multiple views of the flowerpot, which suggests a connection both among the images and between each image and the entire page (which would include varying multimodal content). Influencers can tag the flowerpot in their own photos, which builds a multimodal-to-multimodal mapping between their posts and the product page.
We began to realize that these connections had another use: to help develop a generalized multimodal representation for various commerce-related applications. As with Meta AI’s recent breakthroughs in multimodal training, including our open source MMF framework for vision and language research, the Situated and Interactive Multimodal Conversations (SIMMC) data set, the Halo annotation platform, and the Learning from Videos project, we wanted to use our resources to help deepen AI’s understanding of complex data. We found that we could employ those multimodal mappings as weak labels in our novel approach to pretraining, which teaches the model how to correlate an image, its corresponding text, and the post as a whole.
CommerceMM is composed of an image encoder, a text encoder, and a multimodal fusion encoder. The encoders translate data into embeddings, which are sets of mathematical vectors. In our text encoder, the embeddings represent the continuums along which a sentence could be similar to or different from another: structure, tense, subject matter, sentiment, formality, perspective, and a host of other shades of meaning. For an image, the embeddings might quantify, for example, patterns in edges, the sharpness of angles, or the contours implied by shadows. These embeddings consolidate an abundance of information, encapsulating the unique qualities that characterize each passage in a text or object in a photograph.
What makes CommerceMM unique is that for each photo-and-text input, in addition to separate text and image representations, the system creates a dedicated multimodal embedding that represents the post as a whole. First, the image encoder analyzes each picture, and a Transformer-based text encoder handles the accompanying text. Both pass the resulting embeddings to a Transformer-based multimodal fusion encoder, where the two modalities learn to interact to create a joint representation.
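To make the data flow concrete, here is a minimal sketch of how one post yields three embeddings. The names `PostEmbeddings` and `encode_post` are hypothetical, and the three encoders are toy stand-ins passed in as plain functions, not the Transformer-based encoders CommerceMM actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class PostEmbeddings:
    """The three representations produced for a single post (hypothetical container)."""
    image: Vector
    text: Vector
    multimodal: Vector


def encode_post(
    image,
    text: str,
    image_encoder: Callable[[object], Vector],
    text_encoder: Callable[[str], Vector],
    fusion_encoder: Callable[[Vector, Vector], Vector],
) -> PostEmbeddings:
    # Each modality is first encoded on its own.
    image_emb = image_encoder(image)
    text_emb = text_encoder(text)
    # The fusion encoder sees both unimodal embeddings and returns a joint one.
    multimodal_emb = fusion_encoder(image_emb, text_emb)
    return PostEmbeddings(image_emb, text_emb, multimodal_emb)


# Toy stand-ins (assumptions for illustration only).
def toy_image_encoder(pixels) -> Vector:
    return [sum(pixels) / len(pixels)]


def toy_text_encoder(caption: str) -> Vector:
    return [float(len(caption.split()))]


def toy_fusion_encoder(image_emb: Vector, text_emb: Vector) -> Vector:
    return image_emb + text_emb  # simple concatenation in place of learned fusion


post = encode_post(
    [0.2, 0.4, 0.6],
    "porcelain flowerpot with saucer",
    toy_image_encoder,
    toy_text_encoder,
    toy_fusion_encoder,
)
print(post)
```

The point of the sketch is only the shape of the output: every post ends up with an image, a text, and a multimodal embedding, all of which the later training tasks can use.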
The goal of training is to teach the model to bunch similar inputs together in the multidimensional embedding space and to push unrelated data apart. We train all three encoders simultaneously on a suite of tasks combining masking and contrastive learning. In the masking tasks — masked language modeling, masked image modeling KL-divergence, and masked image modeling feature regression — part of the image or text is blacked out, and the model learns to reconstruct the missing section based on its surroundings. Contrastive learning approaches — in our case, image-text contrastive learning and image-text matching — teach a model to cluster the representations of nearly identical inputs close together in embedding space while pushing them away from dissimilar examples.
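The contrastive objective can be illustrated with a small, self-contained sketch. It assumes a common InfoNCE-style formulation in which row `i` of the image batch matches row `i` of the text batch and every other row serves as a negative. The function names and the temperature value are assumptions chosen for illustration, not details taken from CommerceMM.

```python
import math


def unit(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def contrastive_loss(queries, keys, temperature=0.1):
    """Mean cross-entropy of matching queries[i] to keys[i] against all keys."""
    q = [unit(v) for v in queries]
    k = [unit(v) for v in keys]
    total = 0.0
    for i, q_i in enumerate(q):
        logits = [sum(a * b for a, b in zip(q_i, k_j)) / temperature for k_j in k]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_sum - logits[i]
    return total / len(q)


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Average the image-to-text and text-to-image directions."""
    return 0.5 * (
        contrastive_loss(image_embs, text_embs, temperature)
        + contrastive_loss(text_embs, image_embs, temperature)
    )


image_embs = [[1.0, 0.0], [0.0, 1.0]]
matching_text = [[1.0, 0.1], [0.1, 1.0]]    # each caption points toward its own image
mismatched_text = [[0.1, 1.0], [1.0, 0.1]]  # captions swapped

print(symmetric_contrastive_loss(image_embs, matching_text))
print(symmetric_contrastive_loss(image_embs, mismatched_text))
```

The loss is small when each image sits closest to its own caption and large when the captions are swapped, which is the "pull together, push apart" behavior described above.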
Once we have our three embeddings (image, text, and multimodal), we continue to calibrate them with a novel set of tasks called omni retrieval. This step refines the relationships between all the embedding modalities: image and image, image and text, image and multimodal, text and multimodal, and so on. To begin, we feed two related text-image pairs into two identical copies of our model. Each copy returns three embeddings, and our goal is to teach the system how to associate the two sets. Pairing each of the three embeddings from one copy with each of the three from the other gives nine relationships in total, and every such pair should be highly correlated. With contrastive learning, we can coach the model to learn all these relationships. Omni retrieval has been shown to learn a more discriminative and generalized representation.
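The nine relationships can be enumerated mechanically. The sketch below assumes that all three embeddings share one dimensionality, that each pair uses the same InfoNCE-style contrastive loss, and that the nine losses are averaged with equal weight. These are illustrative assumptions, and `omni_retrieval_loss` is a hypothetical name rather than the CommerceMM implementation.

```python
import itertools
import math

MODALITIES = ("image", "text", "multimodal")


def unit(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def contrastive_loss(queries, keys, temperature=0.1):
    """Mean cross-entropy of matching queries[i] to keys[i] against all keys."""
    q = [unit(v) for v in queries]
    k = [unit(v) for v in keys]
    total = 0.0
    for i, q_i in enumerate(q):
        logits = [sum(a * b for a, b in zip(q_i, k_j)) / temperature for k_j in k]
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_sum - logits[i]
    return total / len(q)


def omni_retrieval_loss(side_a, side_b, temperature=0.1):
    """side_a and side_b map each modality name to a batch of embeddings.

    Row i on side A is assumed to be related to row i on side B.
    Returns the mean loss and the per-pair breakdown.
    """
    per_pair = {
        (mod_a, mod_b): contrastive_loss(side_a[mod_a], side_b[mod_b], temperature)
        for mod_a, mod_b in itertools.product(MODALITIES, repeat=2)
    }
    return sum(per_pair.values()) / len(per_pair), per_pair


# Toy batch of two related inputs per side, three embeddings each.
side_a = {m: [[1.0, 0.0], [0.0, 1.0]] for m in MODALITIES}
side_b = {m: [[0.9, 0.1], [0.1, 0.9]] for m in MODALITIES}

mean_loss, per_pair = omni_retrieval_loss(side_a, side_b)
for pair in per_pair:
    print(pair)
print(mean_loss)
```

Because every embedding from one side is contrasted with every embedding from the other, a single pair of inputs supplies nine training signals instead of one.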
We iterate between the image-text tasks and omni retrieval. The two phases are complementary, which makes training more efficient; the system runs only one forward and backward pass for each set of tasks.
We introduced another novel idea, modality randomization, to deepen CommerceMM’s multimodal comprehension. Before each training step, we randomly shuffle layers between the text and multimodal encoders. As the layers move from Transformer to Transformer, they learn from both modalities and share their knowledge with the rest of the system. We found that modality-randomized pretraining improves performance on downstream tasks over any fixed architecture.
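One possible reading of modality randomization is a shared stack of Transformer layers that is re-partitioned at random before every step, with the first part acting as the text encoder and the rest as the multimodal fusion encoder. The sketch below follows that reading. The six-layer stack, the function name `assign_layers`, and the use of strings in place of real layers are all assumptions made for illustration.

```python
import random


def assign_layers(shared_layers, rng, min_per_encoder=0):
    """Randomly split a shared stack into (text_layers, fusion_layers), keeping order."""
    split = rng.randint(min_per_encoder, len(shared_layers) - min_per_encoder)
    return shared_layers[:split], shared_layers[split:]


rng = random.Random(0)  # seeded only so the toy run is repeatable
shared_layers = [f"layer_{i}" for i in range(6)]  # placeholders for Transformer layers

for step in range(3):
    text_layers, fusion_layers = assign_layers(shared_layers, rng)
    # A real training step would run text through text_layers, then fuse the
    # text and image embeddings through fusion_layers, before the next re-split.
    print(step, len(text_layers), len(fusion_layers))
```

Under this reading, the same layer may process text alone at one step and fused text-image input at the next, which is how it would come to learn from both modalities.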
Once our system has been pretrained to learn these representations, researchers can easily fine-tune the model for any number of specialized duties. We used CommerceMM to achieve state-of-the-art performance on seven tasks in research settings: catalog product categorization for Instagram Shops and Facebook Shops, product categorization for Marketplace, image-to-text retrieval, text-to-image retrieval, query-to-product retrieval, image-to-product retrieval, and image-to-image retrieval. Our model outperforms all the existing systems dedicated to these individual use cases. The detailed experiments are in our arXiv paper.
Last year, Meta deployed an early version of CommerceMM to improve category filters, such as beauty and home — providing more relevant results and recommendations on Instagram Shops, Facebook Shops, and Marketplace listings. We also used it to improve attribute filters, like color and material, in English on Instagram Shops and Facebook Shops.
CommerceMM is becoming a standard model to inform ranking and product recommendation across the company. We hope to deploy CommerceMM to support more of Meta’s applications, such as product search in Marketplace and visual search on Instagram. Our ultimate goal is for CommerceMM to help shoppers find exactly what they’re looking for.