Open Source

Sharing new breakthroughs and artifacts supporting molecular property prediction, language processing, and neuroscience

May 14, 2025
12 minute read
  

Takeaways

 
  • Meta FAIR is sharing new research artifacts that highlight our commitment to achieving advanced machine intelligence (AMI) through focused scientific and academic progress.
  • The work we’re sharing includes Open Molecules 2025, a dataset for advancing molecular discovery, and Meta’s Universal Model for Atoms.
  • We’re also sharing Adjoint Sampling, a new method for training generative models from reward signals without any training data, and a joint study between Meta and the Rothschild Foundation Hospital that seeks to decode how humans learn language.
  • By making our research widely available, we aim to provide easy access for the AI community and help foster an open ecosystem that accelerates progress, drives innovation, and benefits society as a whole, including our partners at national research labs.
 

As we work toward our goal of advanced machine intelligence (AMI), we’re excited to announce new releases from the Meta Fundamental AI Research (FAIR) team. Today, we’re releasing several new models, benchmarks, and datasets spanning molecular property prediction, language processing, and neuroscience. These advancements are the result of focused scientific and academic progress and represent a significant step toward AMI. By sharing this work with the research community, we aim to accelerate progress, foster collaboration, and drive innovation across these fields.

 

Open Molecules 2025 (OMol25) and Meta’s Universal Model for Atoms (UMA): Revolutionizing design at the atomic scale

Many important technological challenges, including developing new molecules to accelerate industrial progress and discovering new materials for energy storage and climate change mitigation, require scientists and engineers to design at the atomic scale. Traditional experimental discovery and design processes are extremely time consuming and often take decades from ideation to scaled manufacturing. Building on collaborative work with the Department of Energy’s Lawrence Berkeley National Laboratory (Berkeley Lab), Princeton University, Genentech (a member of the Roche Group), Stanford University, the University of Cambridge, Carnegie Mellon University, New York University, Los Alamos National Laboratory, and UC Berkeley, Meta FAIR is drastically accelerating this process by developing accurate and generalizable machine learning models. These models predict motion and behavior at the atomic scale, ultimately reducing the development cycle in molecular and materials discovery and unlocking new possibilities for innovation and impact.

We’re excited to release a new Density Functional Theory (DFT) dataset, Open Molecules 2025 (OMol25), that extends the family of Meta’s open science simulation datasets—which include Open Catalyst 2020-2022, Open DAC 2023, and Open Materials 2024—to molecular chemistry. Foundational quantum chemistry methods like DFT can be used to predict properties of molecules and materials at the atomic scale, especially in complex scenarios where chemical bonds are breaking and forming.

As the largest and most diverse dataset of high-accuracy quantum chemistry calculations for biomolecules, metal complexes, and electrolytes, OMol25 enables unprecedented accuracy in atomic-scale design for healthcare and energy storage technologies. Built with the high-performance quantum chemistry program package ORCA (Version 6.0.1), OMol25 contains simulations of large atomic systems that, until now, have been out of reach. Previous molecular datasets were much smaller, with simulations that included only 20 to 30 atoms and a limited set of elements. Requiring 6 billion core hours of compute to generate, the OMol25 dataset is a major leap forward, with configurations up to 10 times larger and complex interactions between many different elements.

We’re also sharing Meta’s Universal Model for Atoms (UMA), a machine learning interatomic potential that sets new standards for modeling the interaction of atoms across a wide range of materials and molecules. UMA is trained on over 30 billion atoms contained in all of the datasets released by Meta in the past five years, including those with both molecules and materials. UMA offers researchers a foundational model that provides more accurate predictions and improved understanding of molecular behavior and serves as a versatile base for downstream use cases and fine-tuning applications.
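Conceptually, an interatomic potential like UMA is a function that maps an atomic configuration to a total energy and per-atom forces, which downstream simulations (relaxations, molecular dynamics) then consume. The sketch below illustrates that interface only; a toy Lennard-Jones pair potential stands in for the learned model, and `ToyPotential` and its `predict` method are illustrative inventions, not the UMA API.

```python
import numpy as np

class ToyPotential:
    """Stand-in for a learned interatomic potential such as UMA.

    A real model predicts energy and forces from atomic species and
    positions; here a simple Lennard-Jones pair potential plays that
    role so the sketch runs without any trained weights.
    """

    def __init__(self, epsilon=1.0, sigma=1.0):
        self.epsilon = epsilon
        self.sigma = sigma

    def predict(self, positions):
        positions = np.asarray(positions, dtype=float)
        n = len(positions)
        energy = 0.0
        forces = np.zeros_like(positions)
        for i in range(n):
            for j in range(i + 1, n):
                rij = positions[i] - positions[j]
                r = np.linalg.norm(rij)
                sr6 = (self.sigma / r) ** 6
                energy += 4 * self.epsilon * (sr6**2 - sr6)
                # dE/dr projected along the bond gives the pair force
                dEdr = 4 * self.epsilon * (-12 * sr6**2 + 6 * sr6) / r
                f = -dEdr * rij / r
                forces[i] += f
                forces[j] -= f
        return energy, forces

pot = ToyPotential()
# A dimer at the Lennard-Jones minimum separation, 2**(1/6) * sigma:
# the energy is -epsilon and the forces vanish.
energy, forces = pot.predict([[0.0, 0.0, 0.0], [2 ** (1 / 6), 0.0, 0.0]])
```

A model like UMA replaces the hand-written pair sum with a neural network trained on billions of DFT-labeled atoms, but the contract—positions in, energy and forces out—is the same, which is what makes such models drop-in engines for simulation workflows.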

Together, OMol25 and UMA have the potential to unlock new capabilities in molecular and materials research. At Meta, we see this as the next step in a long journey of open science releases to accelerate atomic-scale materials design. We’re also working with partners like the Lawrence Livermore National Laboratory to extend these datasets and models to new classes of molecules, such as polymers.

We’re excited to see how the community uses OMol25 and UMA to help advance molecular and materials science research that could ultimately lead to new breakthroughs across industries. Additionally, we’re pleased to continue our collaboration with national research labs as we work to advance scientific discovery and leadership.

Download the OMol25 dataset and models

Read the OMol25 paper

Download the UMA model

Read the UMA paper


Adjoint Sampling: A breakthrough in highly scalable, reward-driven generative modeling

Most well-known generative models take in data and produce new samples that mimic the patterns found in that data. In specialized applications, however, training data may be extremely limited or entirely unavailable, making these models impractical to train. Instead, only a scalar reward signal is provided that verifies whether the model is producing good samples. Example applications include fine-tuning image and video generative models or learning to sample from physics or chemistry foundation models.

Adjoint Sampling presents a way forward using a highly scalable, reward-based objective to train generative models without any data. Rather than finding patterns in existing data, Adjoint Sampling finds patterns by iteratively refining its own samples according to a provided reward model. Based on theoretical foundations developed at FAIR, Adjoint Sampling leads to a highly scalable practical algorithm and can become the foundation for further research into highly scalable reward-driven generative modeling.
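The setting Adjoint Sampling targets—improving a sampler from a scalar reward alone, with no dataset—can be illustrated with a much simpler reward-driven method. The sketch below trains a one-parameter Gaussian sampler by score-function (REINFORCE) gradient ascent on a toy reward; this is not the Adjoint Sampling algorithm itself, just a minimal instance of data-free, reward-driven training, with all names and numbers chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(x):
    # Scalar reward that peaks at x = 3. In Adjoint Sampling's setting
    # this role is played by, e.g., an energy model scoring a sample.
    return -(x - 3.0) ** 2

# Parametric sampler: a 1-D Gaussian with a learnable mean. No data is
# ever observed; learning is driven entirely by the reward signal.
mu, sigma, lr = 0.0, 1.0, 0.05
for step in range(500):
    x = rng.normal(mu, sigma, size=64)   # draw samples from the model
    r = reward(x)
    baseline = r.mean()                  # baseline for variance reduction
    # Score-function (REINFORCE) estimate of d E[reward] / d mu
    grad_mu = ((r - baseline) * (x - mu) / sigma**2).mean()
    mu += lr * grad_mu                   # ascend the expected reward
```

After training, the sampler's mean sits near the reward's peak at 3. Score-function estimators like this one scale poorly to high-dimensional diffusion samplers, which is precisely the gap that more scalable objectives such as Adjoint Sampling aim to close.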

 

Adjoint Sampling demonstrates exceptional performance in generating diverse molecules from large-scale energy models such as the Universal Model for Atoms (UMA). We want to encourage more research into highly scalable methods, so we’re releasing our algorithm and a new large-scale benchmark, which we hope will help drive further progress in computational chemistry.

Download the model

Download the code and benchmark

Read the paper

 

Unlocking how the human brain develops language

 

For years, understanding how the brain acquires language has been one of the greatest challenges in AI and neuroscience. Today, Meta FAIR and the Rothschild Foundation Hospital present the first large-scale study that uses extensive neural recordings to systematically map how the representations of language emerge in the brain during development—revealing striking parallels with large language models (LLMs).

 

Until recently, language was a trait unique to humans—no machine or species could understand or generate sentences they had never encountered before. Today, however, LLMs are showing remarkable language abilities. What’s even more striking is that the internal activations of these models spontaneously resemble those of the human brain—offering a powerful tool for decoding neural activity in real time.

Yet this similarity between AI and the human brain hides a major gap: Humans acquire language with extreme efficiency, from roughly 1,000 times fewer words than LLMs require. How our brain achieves this feat remains one of the greatest mysteries in cognitive science and a major challenge in making AI systems learn and reason like humans.

As part of an ongoing collaboration, Meta and the Rothschild Foundation Hospital present an important step forward in this scientific pursuit. As part of their treatment, over 40 patients with epilepsy were fitted with cortical recording devices. By analyzing the neural signals recorded from over 7,000 of these electrodes while the individuals listened to an audiobook, researchers revealed an unprecedented view of how the neural representations of language evolve throughout childhood. This large-scale dataset offers rare insights into the timing and development of language processing in the brain.

Using AI algorithms to decode these neural signals, the team shows that even children as young as 2 years old exhibit a broad range of speech representations in the cortex. As the brain matures, these neural representations support increasingly rich and complex processing. Remarkably, this developmental trajectory is spontaneously mirrored in AI models like wav2vec 2.0 and Llama 3.1, whose internal activations become, with training, increasingly similar to those of the adult brain.
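Comparisons between model activations and neural recordings are commonly made with an encoding model: fit a linear map from a model's activations to each electrode's signal, then score how well it predicts held-out recordings. The sketch below shows that general recipe on synthetic data; it is not the study's actual pipeline, and all array names and sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: in a real analysis, `acts` would hold a model's
# (e.g., wav2vec 2.0) activations per time window of the audiobook, and
# `neural` the electrode signals recorded over the same windows.
n_windows, n_features, n_electrodes = 400, 32, 8
acts = rng.normal(size=(n_windows, n_features))
true_map = rng.normal(size=(n_features, n_electrodes))
neural = acts @ true_map + 0.5 * rng.normal(size=(n_windows, n_electrodes))

# Split in time, fit a ridge-regularized linear encoding model on the
# first half, and evaluate on the held-out second half.
half = n_windows // 2
Xtr, Xte = acts[:half], acts[half:]
Ytr, Yte = neural[:half], neural[half:]
lam = 1.0
W = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(n_features), Xtr.T @ Ytr)
pred = Xte @ W

# Per-electrode Pearson correlation between predicted and actual
# signals; their mean is a simple "brain score" for the model.
scores = [np.corrcoef(pred[:, e], Yte[:, e])[0, 1]
          for e in range(n_electrodes)]
brain_score = float(np.mean(scores))
```

A higher score for a more trained model, or for deeper layers, is the kind of signal that supports statements like "internal activations become increasingly similar to those of the adult brain."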

These findings show how AI models, once inspired by the brain, can now help unveil the brain’s inner workings—offering not just a path toward new clinical tools for supporting language development, but a new framework for understanding human intelligence.

Download the paper
