
Four MTIA Chips in Two Years: Scaling AI Experiences for Billions

March 11, 2026

Every day, billions of people on Meta’s platforms enjoy an array of AI-powered experiences ranging from personalized recommendations to AI assistants. Meanwhile, the AI models that will define the next era of computing are evolving faster than any single hardware generation can anticipate. Serving a wide range of AI models on a global scale, while maintaining the lowest possible costs, is one of the most demanding infrastructure challenges in the industry. Our response is to define the path forward — delivering flexible solutions today and improving them continuously as needs evolve.

While we remain committed to a diverse silicon portfolio and to leveraging the best solutions available — both internally and externally — the Meta Training and Inference Accelerator (MTIA), our family of homegrown AI chips developed in close partnership with Broadcom, has remained and will continue to be an important part of Meta’s AI infrastructure strategy. MTIA plays an important role in cost-effectively powering AI experiences for the billions of people who use Meta’s products.

The Past and Future of MTIA

We have published research papers at ISCA’23 and ISCA’25 detailing the first two generations of MTIA chips: MTIA 100 and MTIA 200 (formerly known as MTIA 1 and MTIA 2i). More importantly, we have deployed hundreds of thousands of MTIA chips in production, onboarded numerous internal production models, and tested MTIA with large language models (LLMs) like Llama.

Since introducing MTIA 100 and 200, we have accelerated MTIA development across four successive generations: MTIA 300, 400, 450, and 500. These new chips have either already been deployed or are scheduled for deployment in 2026 or 2027, expanding workload coverage from ranking and recommendation (R&R) inference to R&R training, general GenAI workloads, and GenAI inference with targeted optimizations.

AI models are evolving faster than traditional chip development cycles. Chip designs are based on projected workloads, but by the time the hardware reaches production — often two years later — those workloads may have shifted substantially. Rather than placing a single long-term bet, we deliberately take an iterative approach: Each MTIA generation builds on the last, using modular chiplets, incorporating the latest AI workload insights and hardware technologies, and deploying on a shorter cadence. This tighter loop keeps our hardware better aligned with evolving models while enabling faster adoption of new technology.

The MTIA family now includes:

  • MTIA 300: Initially optimized for R&R models — the dominant Meta workload before GenAI took off — its building blocks established a strong foundation for subsequent chips optimized for GenAI models. It is in production for R&R training.
  • MTIA 400: As GenAI surged, MTIA 300 evolved into MTIA 400 to better support GenAI models, while maintaining the capabilities for supporting R&R workloads. Featuring a 72-accelerator scale-up domain, MTIA 400 delivers high performance that is competitive with leading commercial products. We have finished testing MTIA 400 in our labs and are on the path to deploying it in our data centers.
  • MTIA 450: Anticipating the rise in GenAI inference demand, MTIA 400 transitioned into MTIA 450, with specific optimizations for GenAI inference. Since the bandwidth of high-bandwidth memory (HBM) is the most important factor affecting GenAI inference performance, we doubled HBM bandwidth from MTIA 400 to 450, making it much higher than that of existing leading commercial products. Additionally, we introduced low-precision data types co-designed for inference workloads. MTIA 450 is scheduled for mass deployment in early 2027.
  • MTIA 500: Continuing the focus on GenAI inference, MTIA 500 increased HBM bandwidth by an additional 50% compared to MTIA 450 and introduced further innovations in low-precision data types. MTIA 500 is scheduled for mass deployment in 2027.

The Evolution of MTIA Chips

From MTIA 300 to MTIA 500, HBM bandwidth increases by 4.5x and compute FLOPS increase by 25x (from MTIA 300’s MX8 to MTIA 500’s MX4), as shown in the chip specifications below. This rapid advancement in less than two years highlights the benefits of our velocity strategy.

Chip specifications for MTIA 300, MTIA 400, MTIA 450, and MTIA 500

*Some vendors report bidirectional bandwidth. Multiply the value in the table by two to obtain the corresponding bidirectional bandwidth.

**MTIA 300 is configured with a scale-out network with higher bandwidth (200 GB/s) due to its relatively small scale-up domain size and the target R&R workloads.

MTIA 300: A Cost-Effective Foundation

Compared with earlier generations, MTIA 300’s distinguishing features include built-in NIC chiplets, dedicated message engines for offloading communication collectives, and near-memory compute for reduction-based collectives. Although initially optimized for R&R training, these low-latency, high-bandwidth communication components have provided the foundation for efficient GenAI inference and training in subsequent MTIA chips.

MTIA 300 comprises one compute chiplet, two network chiplets, and several HBM stacks. Each compute chiplet comprises a grid of processing elements (PEs), with some redundant PEs to improve yield.

Each PE contains:

  • Two RISC-V vector cores.
  • Dot Product Engine for matrix multiplication.
  • Special Function Unit for activations and elementwise operations.
  • Reduction Engine for accumulation and inter-PE communication.
  • DMA engine for data movement in and out of local scratch memory.

Please refer to our ISCA’25 paper for more details on the aforementioned PE components.
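
As a rough, pure-Python sketch (illustrative only, not MTIA’s actual microarchitecture or programming model), the division of labor among these components looks like this: each PE computes a partial dot product over its local slice of data, and a reduction step accumulates the partials across PEs.

```python
# Illustrative sketch: splitting a dot product across PEs, with partial
# results accumulated by a reduction step. All names are hypothetical.

def pe_dot_product(a_slice, b_slice):
    """One PE's Dot Product Engine: partial dot product over its local slice."""
    return sum(x * y for x, y in zip(a_slice, b_slice))

def reduce_partials(partials):
    """Reduction Engine: accumulate partial results across PEs."""
    total = 0.0
    for p in partials:
        total += p
    return total

def distributed_dot(a, b, num_pes):
    """Shard the vectors across PEs, compute partials locally, then reduce."""
    chunk = (len(a) + num_pes - 1) // num_pes
    partials = [
        pe_dot_product(a[i * chunk:(i + 1) * chunk], b[i * chunk:(i + 1) * chunk])
        for i in range(num_pes)
    ]
    return reduce_partials(partials)

result = distributed_dot([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], num_pes=2)
# result == 70.0, the same as the undistributed dot product
```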

MTIA 400: Competitive Raw Performance

As GenAI took off, we evolved MTIA 300 into MTIA 400 to better support GenAI workloads in addition to R&R workloads. MTIA 400 is a major improvement over MTIA 300, with 400% higher FP8 FLOPS and 51% higher HBM bandwidth. While MTIA 300 is a cost-effective product, MTIA 400 is the first MTIA chip designed to deliver not only cost savings but also raw performance competitive with leading commercial products. It combines two compute chiplets to double compute density, and also supports enhanced versions of MX8 and MX4, which are important low-precision formats for efficient GenAI inference. A rack with 72 MTIA 400 devices, connected via a switched backplane, forms a single scale-up domain.

A rack-scale system comprising 72 MTIA 400 chips in a single scale-up domain, along with associated networking devices and air-assisted liquid cooling (AALC) racks. While MTIA 400 can also support facility liquid cooling, AALC enables rapid deployment in legacy data centers.

MTIA 450: A Leap Forward for GenAI Inference

Anticipating the rapid growth in GenAI inference demand, we evolved MTIA 400 into MTIA 450, optimizing it for GenAI inference by advancing four areas:

  1. Doubling HBM bandwidth from the prior version to accelerate decode.
  2. Increasing MX4 FLOPS by 75% to speed up mixture-of-experts (MoE) feed-forward network (FFN) computation.
  3. Introducing hardware acceleration that makes both attention and FFN computation more efficient (e.g., by alleviating Softmax and FlashAttention bottlenecks).
  4. Innovating in low-precision data types.

MTIA 450 goes beyond FP8/MX8 and delivers 6x the MX4 FLOPS of FP16/BF16, reflecting the importance of low-precision FLOPS for inference. MTIA 450 also supports mixed low-precision computation without incurring the software overhead associated with data type conversion. Finally, it introduces our custom data-type innovations that preserve model quality and boost FLOPS, with minimal impact on chip area.
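
To make the low-precision discussion concrete, the sketch below illustrates the general idea behind block-scaled microscaling (MX-style) formats: a block of values shares a single power-of-two scale, and each element is stored in only a few bits. This is a simplified illustration with hypothetical function names, not MTIA’s actual MX4 encoding (real MX formats use floating-point element encodings and a block size of 32).

```python
import math

def quantize_block(block, element_bits=4):
    """Quantize one block with a shared power-of-two scale and low-bit
    signed-integer elements (simplified stand-in for an MX-style format)."""
    max_abs = max(abs(x) for x in block) or 1.0
    qmax = 2 ** (element_bits - 1) - 1          # e.g. 7 for 4-bit elements
    # Shared scale: smallest power of two such that the largest element fits.
    scale = 2.0 ** math.ceil(math.log2(max_abs / qmax))
    quantized = [max(-qmax, min(qmax, round(x / scale))) for x in block]
    return scale, quantized

def dequantize_block(scale, quantized):
    return [q * scale for q in quantized]

block = [0.11, -0.52, 0.94, 0.03]
scale, q = quantize_block(block)
approx = dequantize_block(scale, q)
# Each reconstructed value is within half a scale step of the original.
```

Storing one shared scale per block rather than full-precision values is what lets low-precision formats multiply effective FLOPS while keeping quantization error bounded within each block.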

MTIA 500: Delivering More with Less for GenAI Inference

As GenAI inference demand continued to grow, we advanced MTIA 450 into MTIA 500 to power GenAI inference even more cost-effectively, with 50% higher HBM bandwidth, up to 80% higher HBM capacity, and 43% higher MX4 FLOPS. MTIA 500 pushes the modular philosophy further by using a 2x2 configuration of smaller compute chiplets surrounded by several HBM stacks and two network chiplets, along with an SoC chiplet that provides PCIe connectivity to the host CPU and scale-out NICs. Like MTIA 450, MTIA 500 also introduces additional hardware acceleration and data-type innovation to address bottlenecks observed in GenAI inference.

Our Strategy: High Velocity, Inference First, and PyTorch Native

In the highly competitive AI chip landscape, our MTIA strategy rests on three pillars for success:

  • High-velocity iterative chip development.
  • Inference‑first focus.
  • Frictionless adoption by building natively on industry standards like PyTorch.

High Velocity

Given the rapid pace of AI innovation, we have built the capability to ship a new chip roughly every six months. This fast pace offers two advantages:

  • Fast adaptation to evolving AI techniques: As new model architectures, low-precision data types, and serving techniques emerge, we can optimize our latest chips for these advancements, introduce hardware acceleration for important operators, and address bottleneck shifts among compute, memory, and I/O.
  • Fast adoption of the latest hardware technologies: Examples include the latest process nodes, HBM, and packaging technologies.

We achieve high velocity through a reusable and modular design across all levels: chiplets, chassis, racks, and network infrastructure. We architect our accelerators as systems of chiplets — discrete, reusable building blocks for compute, I/O, and networking. Because each chiplet can be upgraded separately, we can implement improvements in months rather than years. Moreover, different chiplets can be manufactured at different process nodes that are most cost-effective while meeting performance and power requirements.

At the system level, MTIA 400, 450, and 500 all utilize the same chassis, rack, and network infrastructure. Therefore, each new chip generation can be dropped into the same physical footprint, accelerating the transition from silicon to production deployment. Our modular, reusable designs also minimize the resources needed to develop and deploy multiple chip generations, and the benefits of these highly optimized chips can offset the resources used for development and deployment.

Inference First

Mainstream GPUs are typically built for the most demanding workload — large-scale GenAI pre-training — and then applied, often less cost-effectively, to other workloads such as GenAI inference. We take a different approach: MTIA 450 and 500 are optimized first for GenAI inference, and can then be used to support other workloads as needed, including R&R training and inference, as well as GenAI training. This keeps MTIA well-tuned to the anticipated growth in GenAI inference demand.

Frictionless Adoption

MTIA is built natively on industry‑standard software and hardware ecosystems — PyTorch, vLLM, Triton, and the Open Compute Project (OCP) — from the outset rather than treating adoption and compatibility as an afterthought. Since PyTorch originated at Meta and has become the most widely used ML framework, MTIA naturally takes a PyTorch-native approach. Together, PyTorch, vLLM, and Triton provide developers with a familiar software stack, enable reuse of assets from the open source community, and simplify model migration. Beyond industry-standard software, MTIA’s system and rack solutions align with OCP standards, enabling MTIA to be seamlessly deployed in data centers.

The MTIA Software Stack: A PyTorch-Native Approach

Across all chip generations, the MTIA software stack delivers a consistent programming experience. It takes a PyTorch-native approach, giving developers a familiar and complete ecosystem.

The MTIA software stack

Key attributes of the software stack include:

Seamless model onboarding: MTIA supports both eager and graph modes. In graph mode, it integrates directly with PyTorch 2.0’s compilation pipeline. Developers use familiar tools — torch.compile and torch.export — to capture and optimize model graphs. No MTIA-specific rewrites are required to enable models. This portability enables our production models to be deployed simultaneously on both GPUs and MTIA.

Compilers: Beneath the PyTorch frontend, MTIA-specific compilers translate high-level graph representations into highly optimized device code. The graph compiler is built on Torch FX IR and TorchInductor. The kernel compiler and lower-level backends are based on Triton, MLIR, and LLVM, enhanced and optimized for MTIA. We improved and tailored TorchInductor’s Triton code generation and kernel fusion for MTIA, and introduced MTIA-aware MLIR dialects and Triton DSL extensions. These extensions can be used optionally for performance-critical kernels. The compiler stack has autotuning capabilities that automatically optimize workloads using multiple compilation strategies.
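
The idea behind that autotuning step can be shown with a deliberately simple sketch (hypothetical names; not MTIA’s actual autotuner): benchmark several candidate implementations of the same computation and keep the fastest.

```python
import timeit

def sum_loop(xs):
    total = 0
    for x in xs:
        total += x
    return total

def sum_builtin(xs):
    return sum(xs)

def autotune(candidates, example_input, repeats=50):
    """Benchmark each candidate on a representative input and return
    the one with the best measured runtime."""
    timings = {
        fn: timeit.timeit(lambda: fn(example_input), number=repeats)
        for fn in candidates
    }
    return min(timings, key=timings.get)

data = list(range(10_000))
best = autotune([sum_loop, sum_builtin], data)
# Every candidate computes the same result; the tuner only picks the fastest.
```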

Kernel authoring: MTIA supports compiler-driven kernel generation and fusion, enables both auto-generated and user-driven manual kernel authoring using Triton and C++, and provides kernel auto-tuning and optimizations. Furthermore, we have built agentic AI systems to automate kernel generation; see our papers on TritorX and KernelEvolve for details.

Communication and transport: MTIA’s communication library, Hoot Collective Communications Library (HCCL), is similar to GPU communication libraries but offers several differentiators. It leverages the MTIA chips’ built-in network chiplets for efficient communication, offloads collective operations to dedicated message engines, and uses near-memory compute to accelerate reduction-heavy collectives. HCCL also supports fusing compute and collective kernels to minimize latency. Finally, its transport stack is optimized for low-latency transactions and offloads the entire data path to reduce host-stack runtime overhead.
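
As a concrete illustration of what a reduction-based collective does, here is a single-process simulation of the classic ring all-reduce algorithm (this models the algorithm only; it is not HCCL’s API, and all names are hypothetical):

```python
# Single-process simulation of a ring all-reduce, the kind of
# reduction-based collective that can be offloaded to message engines
# and near-memory compute.

def ring_all_reduce(buffers):
    """Sum-all-reduce over `buffers` (one list of scalar chunks per rank)
    using the classic reduce-scatter + all-gather ring algorithm."""
    n = len(buffers)
    chunks = [list(b) for b in buffers]
    # Reduce-scatter: after n-1 steps, rank r holds the full sum of
    # chunk (r + 1) % n.
    for step in range(n - 1):
        msgs = [(r, (r - step) % n, chunks[r][(r - step) % n]) for r in range(n)]
        for r, c, val in msgs:
            chunks[(r + 1) % n][c] += val
    # All-gather: circulate the fully reduced chunks around the ring so
    # every rank ends with every summed chunk.
    for step in range(n - 1):
        msgs = [(r, (r + 1 - step) % n, chunks[r][(r + 1 - step) % n]) for r in range(n)]
        for r, c, val in msgs:
            chunks[(r + 1) % n][c] = val
    return chunks

result = ring_all_reduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Every rank now holds the elementwise sum: [12, 15, 18]
```

In production, each step of such a ring is a message between accelerators, which is why offloading the sends and the reductions to dedicated hardware pays off.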

Runtime and firmware: The MTIA runtime manages device memory, kernel scheduling, and execution coordination across multiple devices. It supports both eager and graph execution modes. Additionally, it orchestrates compute and collective operations in an Inductor-native, eager-style graph mode. This approach enables compute and communication to be captured and scheduled together, providing a GPU-like experience with minimal overhead. The runtime interfaces with a Rust-based user-space driver, rather than a traditional in-kernel Linux driver. The firmware is written in bare-metal Rust, delivering low latency and high performance, with built-in memory and thread safety.

vLLM support: vLLM's plugin architecture allows easy integration with MTIA. Our MTIA plugin replaces important operators, such as FlashAttention and fused LayerNorm, with MTIA-specific kernels. Graph-mode execution is supported via a custom torch.compile backend. MTIA inherits and benefits from vLLM’s features such as prefill-decode disaggregation and continuous batching.
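
The custom-backend mechanism can be sketched with standard PyTorch APIs. The toy backend here simply runs the captured FX graph; it is a stand-in for a real device plugin, which would lower the graph to device-specific kernels:

```python
import torch

def toy_backend(gm: torch.fx.GraphModule, example_inputs):
    # A real backend would inspect or rewrite gm.graph here, e.g. to swap
    # in device-specific kernels; this toy version just runs the graph.
    return gm.forward

@torch.compile(backend=toy_backend)
def fused_op(x, y):
    return torch.relu(x * y + 1.0)

out = fused_op(torch.ones(4), torch.full((4,), 2.0))
# out == tensor([3., 3., 3., 3.])
```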

Production tools: To reliably operate hundreds of thousands of MTIA chips in production, MTIA offers production-grade monitoring, profiling, and debugging tools comparable to those available for mainstream GPUs, while providing unique capabilities such as full-stack, at-scale observability across both host and device, spanning software, firmware, and hardware. Its debugger enables fine-grained control, including breakpoints and coordinated stepping at the PE level.

MTIA: Advancing With Each Generation

While our large-scale production deployments of MTIA chips have demonstrated strong R&R inference capabilities, we expect the latest four generations — either recently launched or planned for launch in 2026 or 2027 — to push the boundaries of GenAI inference, enable R&R training, and lay the groundwork for future GenAI training. Each generation of MTIA has built on the lessons of the one before, is co-designed with our software stack, and is guided by the trajectory of future AI models. Their modular, multi-chiplet design and vertically integrated co-design approach can deliver rapid, compounding performance gains while maintaining system-level compatibility. Together, they bring us closer to our goal of delivering today’s and tomorrow’s most powerful AI experiences to everyone on our platforms.


Written by:
Yee Jiun Song, Andrew Tulloch, Harikrishna Reddy, CQ Tang, Vijay Thakkar