



We advance AI capabilities in expressive communication, social interaction and use of language. Through foundational research in natural language processing and multimodal AI, we develop systems that enable more natural, meaningful interactions between humans and machines.
We advance the fundamental capabilities needed for AI to understand and act within the physical and digital world. Through our research, we hope to unlock a wide variety of future agents that help humans do more throughout all aspects of their lives. These range from robots that can move around and interact with objects to help accomplish household tasks, to wearable glasses that understand the real and digital world and support people throughout their daily tasks.
Our research focuses on aligning models and decisions with human intent and societal interests through deeper fundamental understanding and enhanced steerability and efficiency of AI models. The pillar is at the forefront of research on AI for science and AI for society.
We conduct fundamental research in pre-training methods and new architectural paradigms that enable foundation models to learn and reason with agility and efficiency across novel downstream challenges. Our work expands the frontier of approaches such as world models, non-autoregressive architectures, and memory-augmented models to unlock new capabilities in adaptive intelligence.
We develop code world models as foundational models for code and agents, and advance methods for reinforcement learning with execution feedback. We research much more efficient architectures for code world models, latent space reasoning, and grounded reasoning and planning with world models. We develop various agents, e.g. AI research agents to help our own research, and feed our agents’ needs back upstream into our foundational models.
The north star goal of our Perception research teams is to enable general AI systems to perceive the visual world to inform action, communication, and generation. To achieve this goal, we're developing next-generation perception models capable of understanding images and videos not as pixels, but as a capture of visual entities like people, objects, and activities, along with their spatial and temporal relationships.
