
Our next-generation Meta Training and Inference Accelerator

April 10, 2024 · 8 min read

  • We’re sharing details about the next generation of the Meta Training and Inference Accelerator (MTIA), our family of custom-made chips designed for Meta’s AI workloads.
  • This latest version shows significant performance improvements over MTIA v1 and helps power our ranking and recommendation ads models.
  • MTIA is part of our growing investment in our AI infrastructure and will complement our existing and future AI infrastructure to deliver new and better experiences across our products and services.

The next generation of Meta’s large-scale infrastructure is being built with AI in mind, including supporting new generative AI (GenAI) products and services, recommendation systems, and advanced AI research. It’s an investment we expect will grow in the years ahead as the compute requirements to support AI models increase alongside the models’ sophistication.

Last year, we unveiled the Meta Training and Inference Accelerator (MTIA) v1, our first-generation AI inference accelerator that we designed in-house with Meta’s AI workloads in mind – specifically our deep learning recommendation models that are improving a variety of experiences across our products.

MTIA is a long-term venture to provide the most efficient architecture for Meta’s unique workloads. As AI workloads become increasingly important to our products and services, this efficiency will improve our ability to provide the best experiences for our users around the world. MTIA v1 was an important step in improving the compute efficiency of our infrastructure and better supporting our software developers as they build AI models that will facilitate new and better user experiences.

Now, we’re sharing details about the next generation of MTIA.

Next-generation MTIA chip model.

This inference accelerator is part of our broader full-stack development program for custom, domain-specific silicon that addresses our unique workloads and systems. This new version of MTIA more than doubles the compute and memory bandwidth of our previous solution while maintaining our close tie-in to our workloads. It is designed to efficiently serve the ranking and recommendation models that provide high-quality recommendations to users.

Close-up photograph of a hand holding the chip



Under the hood

This chip’s architecture is fundamentally focused on providing the right balance of compute, memory bandwidth, and memory capacity for serving ranking and recommendation models. In inference, we need relatively high utilization even when batch sizes are relatively low. By providing outsized SRAM capacity relative to typical GPUs, we can sustain high utilization when batch sizes are limited, while still offering enough compute for larger amounts of concurrent work.
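The batch-size effect can be sketched with a simple roofline estimate. The peak-compute and bandwidth figures below come from the spec comparison in this post; the model size and per-sample costs are illustrative assumptions, not MTIA workload data.

```python
# Back-of-envelope roofline sketch of why large on-chip SRAM helps at
# small batch sizes. Peak compute and bandwidth figures are from the
# spec table in this post; the model numbers are illustrative only.

def utilization(batch, flops_per_sample, act_bytes_per_sample,
                weight_bytes, peak_flops, mem_bw):
    """Fraction of peak compute achievable for one inference step.

    Model weights are read once per step regardless of batch size, so
    small batches are memory-bound; holding weights in fast on-chip
    SRAM (higher mem_bw) raises utilization at the same batch size.
    """
    compute_time = batch * flops_per_sample / peak_flops
    memory_time = (weight_bytes + batch * act_bytes_per_sample) / mem_bw
    return compute_time / max(compute_time, memory_time)

# A hypothetical 256 MB recommendation model at batch size 4:
# weights streamed from LPDDR5 vs. served from on-chip SRAM.
dram_util = utilization(4, 2e9, 1e6, 256e6,
                        peak_flops=354e12, mem_bw=204.8e9)
sram_util = utilization(4, 2e9, 1e6, 256e6,
                        peak_flops=354e12, mem_bw=2.7e12)
```

Under these assumptions, serving the weights from SRAM improves achievable utilization by an order of magnitude at the same small batch size.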

This accelerator consists of an 8x8 grid of processing elements (PEs). These PEs provide significantly increased dense compute performance (3.5x over MTIA v1) and sparse compute performance (7x improvement). This comes partly from improvements in the architecture associated with pipelining of sparse compute. It also comes from how we feed the PE grid: We have tripled the size of the local PE storage, doubled the on-chip SRAM and increased its bandwidth by 3.5X, and doubled the capacity of LPDDR5.
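The stated multipliers can be sanity-checked against the spec comparison in this post. This is plain arithmetic over the listed figures:

```python
# Cross-checking the stated gains against the spec comparison in this
# post (dense/sparse compute in TFLOPS, memory in KB/MB/GB as listed).
first_gen = {"dense_int8": 102.4, "local_kb": 128,
             "sram_mb": 128, "lpddr5_gb": 64}
next_gen = {"dense_int8": 354.0, "sparse_int8": 708.0,
            "local_kb": 384, "sram_mb": 256, "lpddr5_gb": 128}

dense_gain = next_gen["dense_int8"] / first_gen["dense_int8"]    # ~3.5x
sparse_gain = next_gen["sparse_int8"] / first_gen["dense_int8"]  # ~6.9x
local_gain = next_gen["local_kb"] / first_gen["local_kb"]        # 3x
sram_gain = next_gen["sram_mb"] / first_gen["sram_mb"]           # 2x
lpddr5_gain = next_gen["lpddr5_gb"] / first_gen["lpddr5_gb"]     # 2x
```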

Our new MTIA design also features an improved network on chip (NoC) architecture that doubles the bandwidth and allows us to coordinate between different PEs at low latency. These and other new functions in the PEs form the key technologies that are vital to our long-term roadmap to scale MTIA to a wider variety of more challenging workloads.

First Gen MTIA

Technology: TSMC 7nm
Frequency: 800 MHz
Instances: 1.12B gates, 65M flops
Die size: 19.34mm x 19.1mm, 373mm²
Package: 43mm x 43mm
Voltage: 0.67V logic, 0.75V memory
Host connection: 8x PCIe Gen4 (16 GB/s)
GEMM compute: 102.4 TFLOPS (INT8); 51.2 TFLOPS (FP16/BF16)
Vector core: 3.2 TFLOPS (INT8); 1.6 TFLOPS (FP16/BF16); 0.8 TFLOPS (FP32)
SIMD: 3.2 TFLOPS (INT8/FP16/BF16); 1.6 TFLOPS (FP32)
Memory capacity: 128 KB local per PE; 128 MB on-chip; 64 GB off-chip LPDDR5
Memory bandwidth: 400 GB/s local per PE; 800 GB/s on-chip; 176 GB/s off-chip LPDDR5

Next Gen MTIA

Technology: TSMC 5nm
Frequency: 1.35 GHz
Instances: 2.35B gates, 103M flops
Die size: 25.6mm x 16.4mm, 421mm²
Package: 50mm x 40mm
Host connection: 8x PCIe Gen5 (32 GB/s)
GEMM compute: 708 TFLOPS (INT8, with sparsity); 354 TFLOPS (INT8); 354 TFLOPS (FP16/BF16, with sparsity); 177 TFLOPS (FP16/BF16)
Vector core: 11.06 TFLOPS (INT8); 5.53 TFLOPS (FP16/BF16); 2.76 TFLOPS (FP32)
SIMD: 5.53 TFLOPS (INT8/FP16/BF16); 2.76 TFLOPS (FP32)
Memory capacity: 384 KB local per PE; 256 MB on-chip; 128 GB off-chip LPDDR5
Memory bandwidth: 1 TB/s local per PE; 2.7 TB/s on-chip; 204.8 GB/s off-chip LPDDR5

The hardware

Serving our workloads effectively is not simply a silicon challenge. Co-designing the hardware system and the software stack along with the silicon is essential for the success of the overall inference solution.

To support the next-generation silicon, we developed a large rack-based system that holds up to 72 accelerators: three chassis, each containing 12 boards that house two accelerators apiece. We specifically designed the system so that we could clock the chip at 1.35 GHz (up from 800 MHz) and run it at 90 watts, compared to 25 watts for our first-generation design. The design delivers denser capability with higher compute, memory bandwidth, and memory capacity, which lets us more easily accommodate a broad range of model complexities and sizes.
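The rack-level totals follow directly from these figures (accelerator TDP only; host CPUs, NICs, and cooling are not included):

```python
# Rack-level arithmetic from the figures above: 3 chassis x 12 boards
# x 2 accelerators, at 90 W per accelerator.
chassis_per_rack = 3
boards_per_chassis = 12
accelerators_per_board = 2

accelerators = chassis_per_rack * boards_per_chassis * accelerators_per_board
accelerator_power_kw = accelerators * 90 / 1000  # accelerator TDP only
```

That is 72 accelerators and roughly 6.5 kW of accelerator power budget per rack.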

Photograph of a person placing a motherboard into a server
Photograph of a server rack with wires and chips

Beyond this, we have upgraded the fabric between the accelerators and between the host and accelerators to PCIe Gen5 to increase the bandwidth and scalability of our system. There is also the option to add an RDMA NIC if we choose to scale out beyond the rack.

The software stack

Software has been one of our key areas of focus from the start of our investment in MTIA. As the initial developers of PyTorch, we value programmability and developer efficiency. The MTIA stack is designed to integrate fully with PyTorch 2.0 and features like TorchDynamo and TorchInductor. Front-end graph-level capture, analysis, transformation, and extraction mechanisms (such as TorchDynamo and torch.export) are agnostic to MTIA and are reused as-is. The lower-level MTIA compiler takes the front end’s outputs and produces highly efficient, device-specific code; it consists of several components responsible for generating executable code for models and kernels.

Below this sits the runtime stack responsible for interfacing with the driver/firmware. The MTIA Streaming interface abstraction provides the basic and essential operations that both inference and (in the future) training software require to manage the device memory, as well as run operators and execute compiled graphs on the device. Finally, the runtime interacts with the driver, which sits in user space – a decision we made to enable us to iterate faster on the driver and firmware within our production stack.
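The shape of such a streaming abstraction can be sketched as a minimal interface. All class and method names here are hypothetical stand-ins for illustration, not Meta's actual MTIA runtime API; the fake host-side implementation exists only to make the sketch runnable.

```python
# Illustrative sketch of a streaming-style runtime abstraction of the
# kind described above. Names are hypothetical, not Meta's actual API.
from abc import ABC, abstractmethod

class Stream(ABC):
    """Minimal operations inference (and later training) software needs."""

    @abstractmethod
    def alloc(self, nbytes: int) -> int: ...           # device memory mgmt
    @abstractmethod
    def copy_in(self, handle: int, data: bytes) -> None: ...
    @abstractmethod
    def run_graph(self, graph, inputs: list) -> bytes: ...
    @abstractmethod
    def synchronize(self) -> None: ...                 # wait for completion

class FakeStream(Stream):
    """Host-side stand-in used here only to make the sketch runnable."""
    def __init__(self):
        self._mem, self._next = {}, 0
    def alloc(self, nbytes):
        self._next += 1
        self._mem[self._next] = bytearray(nbytes)
        return self._next
    def copy_in(self, handle, data):
        self._mem[handle][:len(data)] = data
    def run_graph(self, graph, inputs):
        return graph(*(bytes(self._mem[h]) for h in inputs))
    def synchronize(self):
        pass  # all work is synchronous in this stand-in

stream = FakeStream()
buf = stream.alloc(4)
stream.copy_in(buf, b"abcd")
out = stream.run_graph(lambda x: x.upper(), [buf])
stream.synchronize()
```

Keeping the driver behind an interface like this is what lets the layers above (compiler, framework integration) stay unchanged while the driver and firmware iterate underneath.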

In many ways, this new chip system runs the software stack similarly to MTIA v1, which let the team deploy much faster: most of the integration and development work needed to run our applications on this architecture was already done. Because the new MTIA is designed to be compatible with code developed for MTIA v1 and the full software stack was already integrated with the silicon, we were up and running our traffic on the new chip in a matter of days. This allowed us to land the next-generation MTIA silicon rapidly, going from first silicon to production models running in 16 regions in less than nine months.


We’ve further optimized the software stack by creating the Triton-MTIA compiler backend to generate high-performance code for the MTIA hardware. Triton is an open source language and compiler for writing highly efficient ML compute kernels. It improves developer productivity for writing GPU code and we have found that the Triton language is sufficiently hardware-agnostic to be applicable to non-GPU hardware architectures like MTIA.

The Triton-MTIA backend performs optimizations to maximize hardware utilization and support high-performance kernels. It also exposes key knobs to leverage Triton and MTIA auto-tuning infrastructures to explore the kernel configuration and optimization space.
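The flavor of such a configuration sweep can be illustrated in a few lines. The cost model and config space below are made up for illustration; real auto-tuning measures candidate kernels on hardware rather than scoring them analytically.

```python
# Toy illustration of auto-tuning over a kernel configuration space:
# sweep tile sizes and keep the cheapest under a simple analytic cost
# model (illustrative only; real tuners benchmark on the device).
from itertools import product

def cost(block_m, block_n, m=1024, n=1024):
    # Penalize tiles that pad past the problem edge, plus a small
    # per-tile launch overhead (both terms are illustrative).
    padding_waste = (-m % block_m) * n + (-n % block_n) * m
    launch_overhead = (m // block_m) * (n // block_n)
    return padding_waste + launch_overhead

configs = list(product([32, 64, 128], repeat=2))
best = min(configs, key=lambda c: cost(*c))
```

Under this model every candidate tiles 1024 evenly, so the largest tiles (fewest launches) win; a real search space also covers pipelining depth, vector widths, and similar knobs.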

We have implemented support for the features of the Triton language and integration into PyTorch 2, providing extensive coverage for PyTorch operators. Thanks to TorchInductor, for example, our developers can leverage Triton-MTIA in both ahead-of-time (AOT) and just-in-time (JIT) workflows.

We observed dramatically improved developer efficiency with Triton-MTIA, which allowed us to scale up compute kernel authoring and significantly expand the support of PyTorch operators.


Performance results

The results so far show that this MTIA chip can handle both the low complexity (LC) and high complexity (HC) ranking and recommendation models that are components of Meta’s products. Across these models, there can be a ~10x-100x difference in model size and the amount of compute per input sample. Because we control the whole stack, we can achieve greater efficiency compared to commercially available GPUs. Realizing these gains is an ongoing effort and we continue to improve performance per watt as we build up and deploy MTIA chips in our systems.

Early results show that this next-generation silicon has already improved performance by 3x over our first-generation chip across four key models we evaluated. At the platform level, with 2x the number of devices and a powerful two-socket CPU, we achieve 6x model serving throughput and a 1.5x performance-per-watt improvement over the first-generation MTIA system. To get there, we made significant progress optimizing our kernels, compiler, runtime, and host serving stack. The time to optimize models is going down as the developer ecosystem matures, yet there is still headroom to improve efficiency.
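These platform-level figures are mutually consistent in a simple way: a 6x throughput gain delivered at 1.5x performance per watt implies platform power grew by about 4x.

```python
# What the platform-level figures above imply when combined.
throughput_gain = 6.0       # model serving throughput vs. first gen
perf_per_watt_gain = 1.5    # performance per watt vs. first gen

# perf/watt = throughput / power, so power scales by their ratio.
implied_power_gain = throughput_gain / perf_per_watt_gain  # 4.0x
```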


3x improved performance over our first-generation chip

MTIA has been deployed in our data centers and is now serving models in production. We are already seeing the positive results of this program: it allows us to dedicate more compute power to our most intensive AI workloads, and it is proving highly complementary to commercially available GPUs in delivering the optimal mix of performance and efficiency on Meta-specific workloads.

Meta’s ongoing investment in custom silicon

MTIA will be an important piece of our long-term roadmap to build and scale the most powerful and efficient infrastructure possible for Meta’s unique AI workloads.

We’re designing our custom silicon to work in cooperation with our existing infrastructure as well as with new, more advanced hardware (including next-generation GPUs) that we may leverage in the future. Meeting our ambitions for our custom silicon means investing not only in compute silicon but also in memory bandwidth, networking and capacity as well as other next-generation hardware systems.

We currently have several programs underway aimed at expanding the scope of MTIA, including support for GenAI workloads.

We’re only at the beginning of this journey, and we’re inviting people who want to be a part of it to visit Meta Careers to learn about our open positions.


We would like to thank Eugene Burmako, Kaustubh Gondkar, Adam Hutchin, Olivia Wu, and everyone involved in the development and productionization of the next-generation MTIA solution.

Written by:

Eran Tal, Nicolaas Viljoen, Joel Coburn, and Roman Levenstein