
Every day, billions of people on Meta’s platforms enjoy an array of AI-powered experiences ranging from personalized recommendations to AI assistants. Meanwhile, the AI models that will define the next era of computing are evolving faster than any single hardware generation can anticipate. Serving a wide range of AI models on a global scale, while maintaining the lowest possible costs, is one of the most demanding infrastructure challenges in the industry. Our response is to define the path forward — delivering flexible solutions today and improving them continuously as needs evolve.
While we remain committed to a diverse silicon portfolio and to leveraging the best solutions available — both internally and externally — the Meta Training and Inference Accelerator (MTIA), our family of homegrown AI chips developed in close partnership with Broadcom, remains, and will continue to be, an important part of Meta’s AI infrastructure strategy. MTIA is central to cost-effectively powering AI experiences for the billions of people who use Meta’s products.
The Past and Future of MTIA
We have published research papers at ISCA’23 and ISCA’25 detailing the first two generations of MTIA chips: MTIA 100 and MTIA 200 (formerly known as MTIA 1 and MTIA 2i). More importantly, we have deployed hundreds of thousands of MTIA chips in production, onboarded numerous internal production models, and tested MTIA with large language models (LLMs) like Llama.
Since introducing MTIA 100 and 200, we have accelerated MTIA development across four successive generations: MTIA 300, 400, 450, and 500. These new chips have either already been deployed or are scheduled for deployment in 2026 or 2027, expanding workload coverage from ranking and recommendation (R&R) inference to R&R training, general GenAI workloads, and GenAI inference with targeted optimizations.
AI models are evolving faster than traditional chip development cycles. Chip designs are based on projected workloads, but by the time the hardware reaches production — often two years later — those workloads may have shifted substantially. Rather than placing a single long-range bet and waiting years for it to pay off, we deliberately take an iterative approach: Each MTIA generation builds on the last, using modular chiplets, incorporating the latest AI workload insights and hardware technologies, and deploying on a shorter cadence. This tighter loop keeps our hardware better aligned with evolving models while enabling faster adoption of new technology.
The MTIA family now spans six generations: MTIA 100, 200, 300, 400, 450, and 500.
The Evolution of MTIA Chips
From MTIA 300 to MTIA 500, HBM bandwidth increases by 4.5x and compute FLOPS increase by 25x (comparing MTIA 300’s MX8 to MTIA 500’s MX4), as shown in the chip specifications below. This rapid advancement in less than two years highlights the benefits of our velocity strategy.
[Chip specification table: MTIA 300, MTIA 400, MTIA 450, MTIA 500]
*Some vendors report bidirectional bandwidth. Multiply the value in the table by two to obtain the corresponding bidirectional bandwidth.
**MTIA 300 is configured with a scale-out network with higher bandwidth (200 GB/s) due to its relatively small scale-up domain size and the target R&R workloads.
MTIA 300: A Cost-Effective Foundation
Compared with earlier generations, MTIA 300’s distinguishing features include built-in NIC chiplets, dedicated message engines for offloading communication collectives, and near-memory compute for reduction-based collectives. Although initially optimized for R&R training, these low-latency, high-bandwidth communication components have provided the foundation for efficient GenAI inference and training in subsequent MTIA chips.
MTIA 300 comprises one compute chiplet, two network chiplets, and several HBM stacks. The compute chiplet contains a grid of processing elements (PEs), including some redundant PEs to improve yield.
Please refer to our ISCA’25 paper for more details on the components within each PE.
MTIA 400: Competitive Raw Performance
As GenAI took off, we evolved MTIA 300 into MTIA 400 to better support GenAI workloads in addition to R&R workloads. MTIA 400 is a major improvement over MTIA 300, with 400% higher FP8 FLOPS and 51% higher HBM bandwidth. While MTIA 300 is a cost-effective product, MTIA 400 is the first MTIA chip designed to deliver not only cost savings but also raw performance competitive with leading commercial products. It combines two compute chiplets to double compute density, and also supports enhanced versions of MX8 and MX4, which are important low-precision formats for efficient GenAI inference. A rack with 72 MTIA 400 devices, connected via a switched backplane, forms a single scale-up domain.
MTIA 450: A Leap Forward for GenAI Inference
Anticipating the rapid growth in GenAI inference demand, we evolved MTIA 400 into MTIA 450, optimizing it for GenAI inference across four key areas.
MTIA 450 goes beyond FP8/MX8, delivering MX4 FLOPS that are 6x its FP16/BF16 FLOPS, reflecting the importance of low-precision FLOPS for inference. MTIA 450 also supports mixed low-precision computation without incurring the software overhead associated with data-type conversion. Finally, it introduces our custom data-type innovations that preserve model quality and boost FLOPS, with minimal impact on chip area.
MTIA 500: Delivering More with Less for GenAI Inference
As GenAI inference demand continued to grow, we advanced MTIA 450 into MTIA 500 to power GenAI inference even more cost-effectively, with 50% higher HBM bandwidth, up to 80% higher HBM capacity, and 43% higher MX4 FLOPS. MTIA 500 pushes the modular philosophy further by using a 2x2 configuration of smaller compute chiplets surrounded by several HBM stacks and two network chiplets, along with an SoC chiplet that provides PCIe connectivity to the host CPU and scale-out NICs. Like MTIA 450, MTIA 500 also introduces additional hardware acceleration and data-type innovation to address bottlenecks observed in GenAI inference.
Our Strategy: High Velocity, Inference First, and PyTorch Native
In the highly competitive AI chip landscape, our MTIA strategy rests on three pillars for success:
High Velocity
Given the rapid pace of AI innovation, we have built the capability to ship a new chip roughly every six months. This fast pace offers two advantages: our hardware stays closely aligned with rapidly evolving models, and we can adopt the latest hardware technologies sooner.
We achieve high velocity through a reusable and modular design across all levels: chiplets, chassis, racks, and network infrastructure. We architect our accelerators as systems of chiplets — discrete, reusable building blocks for compute, I/O, and networking. Because each chiplet can be upgraded separately, we can implement improvements in months rather than years. Moreover, each chiplet can be manufactured at whichever process node is most cost-effective while still meeting its performance and power requirements.
At the system level, MTIA 400, 450, and 500 all utilize the same chassis, rack, and network infrastructure. Each new chip generation can therefore be dropped into the same physical footprint, accelerating the transition from silicon to production deployment. Our modular, reusable designs also minimize the engineering resources needed to develop and deploy multiple chip generations, so the efficiency gains from each highly optimized chip can offset the cost of developing and deploying it.
Inference First
Mainstream GPUs are typically built for the most demanding workload — large-scale GenAI pre-training — and then applied, often less cost-effectively, to other workloads such as GenAI inference. We take a different approach: MTIA 450 and 500 are optimized first for GenAI inference, and can then be used to support other workloads as needed, including R&R training and inference, as well as GenAI training. This keeps MTIA well-tuned to the anticipated growth in GenAI inference demand.
Frictionless Adoption
MTIA is built natively on industry‑standard software and hardware ecosystems — PyTorch, vLLM, Triton, and the Open Compute Project (OCP) — from the outset rather than treating adoption and compatibility as an afterthought. Since PyTorch originated at Meta and has become the most widely used ML framework, MTIA naturally takes a PyTorch-native approach. Together, PyTorch, vLLM, and Triton provide developers with a familiar software stack, enable reuse of assets from the open source community, and simplify model migration. Beyond industry-standard software, MTIA’s system and rack solutions align with OCP standards, enabling MTIA to be seamlessly deployed in data centers.
The MTIA Software Stack: A PyTorch-Native Approach
Across all chip generations, the MTIA software stack delivers a consistent programming experience. It takes a PyTorch-native approach, giving developers a familiar and complete ecosystem.
Key attributes of the software stack include:
Seamless model onboarding: MTIA supports both eager and graph modes. In graph mode, it integrates directly with PyTorch 2.0’s compilation pipeline. Developers use familiar tools — torch.compile and torch.export — to capture and optimize model graphs. No MTIA-specific rewrites are required to enable models. This portability enables our production models to be deployed simultaneously on both GPUs and MTIA.
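To make this concrete, here is a minimal sketch of the onboarding flow from the developer’s perspective, using only standard PyTorch APIs; treating "mtia" as a device string and the torch.mtia availability check are assumptions for illustration, and the sketch falls back to CPU so it remains runnable anywhere:

```python
import torch
import torch.nn as nn

# An ordinary PyTorch module -- no MTIA-specific rewrites.
class MLP(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Assumption: "mtia" device string and torch.mtia availability check; fall back to CPU.
device = "mtia" if getattr(torch, "mtia", None) and torch.mtia.is_available() else "cpu"

model = MLP().to(device)
example = torch.randn(8, 256, device=device)

# Eager mode: ordinary PyTorch dispatch.
eager_out = model(example)

# Graph mode: torch.compile captures and optimizes the model graph.
compiled_out = torch.compile(model)(example)

# torch.export produces a portable, ahead-of-time graph artifact for deployment.
exported_program = torch.export.export(model, (example,))
print(exported_program)
```

Because the same module runs unmodified in eager mode, under torch.compile, and through torch.export, an identical model definition can target GPUs and MTIA side by side.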
Compilers: Beneath the PyTorch frontend, MTIA-specific compilers translate high-level graph representations into highly optimized device code. The graph compiler is built on Torch FX IR and TorchInductor. The kernel compiler and lower-level backends are based on Triton, MLIR, and LLVM, enhanced and optimized for MTIA. We improved and tailored TorchInductor’s Triton code generation and kernel fusion for MTIA, and introduced MTIA-aware MLIR dialects and Triton DSL extensions, which can be used optionally for performance-critical kernels. The compiler stack has autotuning capabilities that automatically optimize workloads using multiple compilation strategies.
Kernel authoring: MTIA supports compiler-driven kernel generation and fusion, enables both auto-generated and user-driven manual kernel authoring using Triton and C++, and provides kernel auto-tuning and optimizations. Furthermore, we have built agentic AI systems to automate kernel generation; see our papers on TritorX and KernelEvolve for details.
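As a concrete illustration of the user-driven path, below is a small fused elementwise kernel written in plain, portable Triton; it is generic Triton rather than an MTIA-specific kernel, and the optional MTIA-aware DSL extensions mentioned above are not used:

```python
import torch
import triton
import triton.language as tl

# A minimal fused add + ReLU kernel in the Triton DSL. Running it requires a
# Triton backend for the local device; the kernel itself is backend-agnostic.
@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)

def fused_add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    fused_add_relu_kernel[grid](x, y, out, n, BLOCK=1024)
    return out
```

A hand-written kernel like this can sit alongside compiler-generated kernels and be auto-tuned over launch parameters such as the block size.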
Communication and transport: MTIA’s communication library, Hoot Collective Communications Library (HCCL), is similar to GPU communication libraries but offers several differentiators. It leverages the MTIA chips’ built-in network chiplets for efficient communication, offloads collective operations to dedicated message engines, and uses near-memory compute to accelerate reduction-heavy collectives. HCCL also supports fusing compute and collective kernels to minimize latency. Finally, its transport stack is optimized for low-latency transactions and offloads the entire data path to reduce host-stack runtime overhead.
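For a sense of the call pattern, here is a hedged, minimal sketch using the standard torch.distributed API; the name under which HCCL registers as a process-group backend is not stated above, so the sketch uses gloo to stay runnable, with the understanding that HCCL plugs in behind the same calls on MTIA hosts:

```python
import os
import torch
import torch.distributed as dist

def allreduce_example(rank: int = 0, world_size: int = 1) -> torch.Tensor:
    # Single-process settings so the sketch runs anywhere; a real job would
    # pass the backend name registered by HCCL instead of "gloo".
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)

    # Each rank contributes a tensor; all_reduce sums them across ranks. On
    # MTIA, HCCL offloads this to the built-in network chiplets, message
    # engines, and near-memory compute described above.
    t = torch.ones(4) * (rank + 1)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)

    dist.destroy_process_group()
    return t

if __name__ == "__main__":
    print(allreduce_example())  # single-process run: tensor([1., 1., 1., 1.])
```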
Runtime and firmware: The MTIA runtime manages device memory, kernel scheduling, and execution coordination across multiple devices. It supports both eager and graph execution modes. Additionally, it orchestrates compute and collective operations in an Inductor-native, eager-style graph mode. This approach enables compute and communication to be captured and scheduled together, providing a GPU-like experience with minimal overhead. The runtime interfaces with a Rust-based user-space driver, rather than a traditional in-kernel Linux driver. The firmware is written in bare-metal Rust, delivering low latency and high performance, with built-in memory and thread safety.
vLLM support: vLLM’s plugin architecture allows easy integration with MTIA. Our MTIA plugin replaces important operators, such as FlashAttention and fused LayerNorm, with MTIA-specific kernels. Graph-mode execution is supported via a custom torch.compile backend. MTIA inherits and benefits from vLLM’s features such as prefill-decode disaggregation and continuous batching.
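Here is a hedged sketch of the serving path using vLLM’s standard offline-inference API; the model name is purely illustrative, and device selection is handled by whichever vLLM platform plugin is installed (the MTIA plugin on MTIA hosts, a GPU backend elsewhere):

```python
from vllm import LLM, SamplingParams

# Standard vLLM usage; the installed platform plugin supplies device-specific
# kernels and the torch.compile backend behind this same API.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # illustrative model name
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(
    ["Explain chiplet-based accelerator design in one sentence."], params
)
for out in outputs:
    print(out.outputs[0].text)
```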
Production tools: To reliably operate hundreds of thousands of MTIA chips in production, MTIA offers production-grade monitoring, profiling, and debugging tools comparable to those available for mainstream GPUs, while providing unique capabilities such as full-stack, at-scale observability across both host and device, spanning software, firmware, and hardware. Its debugger enables fine-grained control, including breakpoints and coordinated stepping at the PE level.
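As a small illustration of the PyTorch-side entry point for profiling (MTIA’s full-stack observability and PE-level debugger go well beyond this), the sketch below uses the standard torch.profiler API; whether an MTIA-specific profiler activity is exposed is treated as an assumption and guarded accordingly:

```python
import torch
from torch.profiler import ProfilerActivity, profile

# Fall back to CPU-only profiling when no MTIA activity type is available.
activities = [ProfilerActivity.CPU]
if hasattr(ProfilerActivity, "MTIA"):  # assumption: device-side activity, if exposed
    activities.append(ProfilerActivity.MTIA)

model = torch.nn.Linear(512, 512)
x = torch.randn(32, 512)

with profile(activities=activities, record_shapes=True) as prof:
    for _ in range(10):
        model(x)

# Summarize the hottest operators, similar to profiling on mainstream GPUs.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```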
MTIA: Advancing With Each Generation
While our large-scale production deployments of MTIA chips have demonstrated strong R&R inference capabilities, we expect the latest four generations — either recently launched or planned for launch in 2026 or 2027 — to push the boundaries of GenAI inference, enable R&R training, and lay the groundwork for future GenAI training. Each generation of MTIA has built on the lessons of the one before, is co-designed with our software stack, and is guided by the trajectory of future AI models. Their modular, multi-chiplet design and vertically integrated co-design approach can deliver rapid, compounding performance gains while maintaining system-level compatibility. Together, they bring us closer to our goal of delivering today’s and tomorrow’s most powerful AI experiences to everyone on our platforms.
Written by:
Yee Jiun Song, Andrew Tulloch, Harikrishna Reddy, CQ Tang, Vijay Thakkar