Our next-generation Meta Training and Inference Accelerator
April 10, 2024 · 8 min read
The next generation of Meta’s large-scale infrastructure is being built with AI in mind, including supporting new generative AI (GenAI) products and services, recommendation systems, and advanced AI research. It’s an investment we expect will grow in the years ahead as the compute requirements to support AI models increase alongside the models’ sophistication.
Last year, we unveiled the Meta Training and Inference Accelerator (MTIA) v1, our first-generation AI inference accelerator that we designed in-house with Meta’s AI workloads in mind – specifically our deep learning recommendation models that are improving a variety of experiences across our products.
MTIA is a long-term venture to provide the most efficient architecture for Meta’s unique workloads. As AI workloads become increasingly important to our products and services, this efficiency will improve our ability to provide the best experiences for our users around the world. MTIA v1 was an important step in improving the compute efficiency of our infrastructure and better supporting our software developers as they build AI models that will facilitate new and better user experiences.
Now, we’re sharing details about the next generation of MTIA.
[3D model of the next-generation MTIA chip]
This inference accelerator is part of our broader full-stack development program for custom, domain-specific silicon that addresses our unique workloads and systems. This new version of MTIA more than doubles the compute and memory bandwidth of our previous solution while maintaining our close tie-in to our workloads. It is designed to efficiently serve the ranking and recommendation models that provide high-quality recommendations to users.
This chip’s architecture is fundamentally focused on providing the right balance of compute, memory bandwidth, and memory capacity for serving ranking and recommendation models. For inference, we need to sustain relatively high utilization even when batch sizes are low. By providing outsized SRAM capacity relative to typical GPUs, we can keep utilization high when batch sizes are limited while still providing enough compute for larger amounts of concurrent work.
This accelerator consists of an 8x8 grid of processing elements (PEs). These PEs deliver significantly higher dense compute performance (3.5x over MTIA v1) and sparse compute performance (a 7x improvement). Part of this comes from architectural improvements to the pipelining of sparse compute. It also comes from how we feed the PE grid: we have tripled the size of the local PE storage, doubled the on-chip SRAM and increased its bandwidth by 3.5x, and doubled the capacity of the LPDDR5.
Our new MTIA design also features an improved network on chip (NoC) architecture that doubles the bandwidth and allows us to coordinate between different PEs at low latency. These and other new functions in the PEs form the key technologies that are vital to our long-term roadmap to scale MTIA to a wider variety of more challenging workloads.
Here’s how the next-generation MTIA compares with MTIA v1:

| Specification | MTIA v1 | Next-gen MTIA |
|---|---|---|
| Technology | TSMC 7nm | TSMC 5nm |
| Frequency | 800 MHz | 1.35 GHz |
| Instances | 1.12B gates, 65M flip-flops | 2.35B gates, 103M flip-flops |
| Die area | 19.34 mm x 19.1 mm (373 mm²) | 25.6 mm x 16.4 mm (421 mm²) |
| Package | 43 mm x 43 mm | 50 mm x 40 mm |
| Voltage | 0.67 V logic, 0.75 V memory | 0.85 V |
| TDP | 25 W | 90 W |
| Host connection | 8x PCIe Gen4 (16 GB/s) | 8x PCIe Gen5 (32 GB/s) |
| GEMM (dense) | 102.4 TOPS (INT8), 51.2 TFLOPS (FP16/BF16) | 354 TOPS (INT8), 177 TFLOPS (FP16/BF16) |
| GEMM (with sparsity) | – | 708 TOPS (INT8), 354 TFLOPS (FP16/BF16) |
| Vector core | 3.2 TOPS (INT8), 1.6 TFLOPS (FP16/BF16), 0.8 TFLOPS (FP32) | 11.06 TOPS (INT8), 5.53 TFLOPS (FP16/BF16), 2.76 TFLOPS (FP32) |
| SIMD | 3.2 TOPS (INT8), 3.2 TFLOPS (FP16/BF16), 1.6 TFLOPS (FP32) | 5.53 TOPS (INT8), 5.53 TFLOPS (FP16/BF16), 2.76 TFLOPS (FP32) |
| Local memory | 128 KB per PE | 384 KB per PE |
| On-chip SRAM | 128 MB | 256 MB |
| Off-chip LPDDR5 | 64 GB | 128 GB |
| Local memory bandwidth | 400 GB/s per PE | 1 TB/s per PE |
| On-chip SRAM bandwidth | 800 GB/s | 2.7 TB/s |
| Off-chip LPDDR5 bandwidth | 176 GB/s | 204.8 GB/s |
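A rough way to see why the extra on-chip SRAM and bandwidth matter for low-batch-size serving is a back-of-envelope roofline calculation from the table above. The sketch below is ours, assumes a simplified INT8 GEMM cost model, and ignores activations and embedding lookups:

```python
# Back-of-envelope roofline sketch using the next-gen numbers from the
# table above. The simplified INT8 GEMM cost model (and ignoring
# activations and embedding lookups) is our assumption for illustration.

DENSE_INT8_OPS = 354e12      # dense GEMM peak, ops/s
SRAM_BW = 2.7e12             # on-chip SRAM bandwidth, bytes/s
LPDDR5_BW = 204.8e9          # off-chip LPDDR5 bandwidth, bytes/s

# Sanity check against the stated generational gain:
# 354 / 102.4 ≈ 3.46x dense compute over MTIA v1 (the post cites 3.5x).

# Ops per weight byte needed before a GEMM becomes compute-bound rather
# than bandwidth-bound, depending on where the weights are streamed from.
ridge_sram = DENSE_INT8_OPS / SRAM_BW     # ≈ 131 ops/byte
ridge_dram = DENSE_INT8_OPS / LPDDR5_BW   # ≈ 1729 ops/byte

# An INT8 GEMM with batch size B reuses each weight byte for roughly
# 2*B operations (one multiply and one add per batch row), so:
print(f"compute-bound from SRAM   at batch >= ~{ridge_sram / 2:.0f}")
print(f"compute-bound from LPDDR5 at batch >= ~{ridge_dram / 2:.0f}")
```

Under these assumptions, a model whose hot weights fit in the 256 MB of on-chip SRAM can stay compute-bound at far smaller batch sizes than one streamed from LPDDR5, which is exactly the balance the architecture targets.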
Serving our workloads effectively is not simply a silicon challenge. Co-designing the hardware system and the software stack along with the silicon is essential for the success of the overall inference solution.
To support the next-generation silicon, we have developed a large rack-based system that holds up to 72 accelerators. It consists of three chassis, each containing 12 boards that house two accelerators each. We specifically designed the system so that we could clock the chip at 1.35 GHz (up from 800 MHz) and run it at 90 W, compared with 25 W for our first-generation design. The result is a denser system with higher compute, memory bandwidth, and memory capacity, which allows us to more easily accommodate a broad range of model complexities and sizes.
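The rack-level totals follow directly from those figures; the short calculation below is illustrative, and the aggregates are per-chip nameplate maxima rather than measured system throughput:

```python
# Rack-level arithmetic from the figures in the post. These aggregates
# are products of per-chip nameplate numbers, not measured throughput.

CHASSIS_PER_RACK = 3
BOARDS_PER_CHASSIS = 12
ACCELERATORS_PER_BOARD = 2

accelerators = CHASSIS_PER_RACK * BOARDS_PER_CHASSIS * ACCELERATORS_PER_BOARD
print(accelerators, "accelerators per rack")              # 72

print(accelerators * 90 / 1000, "kW accelerator TDP")     # 6.48 kW
print(accelerators * 128, "GB of LPDDR5")                 # 9216 GB
print(accelerators * 256 / 1024, "GB of on-chip SRAM")    # 18.0 GB
print(accelerators * 354 / 1000, "peta-ops/s dense INT8") # ~25.5
```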
Beyond this, we have upgraded the fabric between the accelerators and between the host and accelerators to PCIe Gen5 to increase the bandwidth and scalability of our system. There is also the option to add an RDMA NIC if we choose to scale out beyond the rack.
Software has been one of our key areas of focus from the start of our investment in MTIA. As the initial developers of PyTorch, we value programmability and developer efficiency. Our MTIA stack is designed to integrate fully with PyTorch 2.0 and features like TorchDynamo and TorchInductor. The frontend graph-level capture, analysis, transformation, and extraction mechanisms (such as TorchDynamo and torch.export) are agnostic to MTIA and are reused as-is. The lower-level MTIA compiler takes the frontend’s output and produces highly efficient, device-specific code; it consists of several components responsible for generating executable code for models and kernels.
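From a developer’s perspective, that layering looks roughly like the sketch below: a hardware-agnostic capture step followed by backend-specific compilation. The toy model is ours, and as written the snippet compiles with the stock Inductor backend; the point is that the capture step stays the same regardless of whether the lowering targets MTIA:

```python
# Illustrative sketch only: the toy model is ours, and as written this
# compiles with the stock Inductor backend. The device-agnostic capture
# step (TorchDynamo / torch.export) is what the MTIA compiler consumes
# in Meta's stack, per the post.
import torch
import torch.nn as nn

class TinyRanker(nn.Module):
    """Stand-in for a ranking model: an embedding lookup plus a small MLP."""
    def __init__(self, num_ids: int = 1000, dim: int = 64):
        super().__init__()
        self.emb = nn.EmbeddingBag(num_ids, dim, mode="sum")
        self.mlp = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, ids, offsets):
        return self.mlp(self.emb(ids, offsets))

model = TinyRanker()
ids = torch.randint(0, 1000, (256,))
offsets = torch.arange(0, 256, 8)

# 1) Frontend, device-agnostic: capture the graph with torch.export.
exported = torch.export.export(model, (ids, offsets))
print(exported.graph_module.graph)  # the captured, backend-neutral graph

# 2) Backend-specific lowering: torch.compile hands captured graphs to a
#    backend compiler (TorchInductor here; the MTIA compiler in Meta's stack).
compiled = torch.compile(model)
out = compiled(ids, offsets)
```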
Below this sits the runtime stack, which is responsible for interfacing with the driver and firmware. The MTIA Streaming interface abstraction provides the essential operations that both inference and (in the future) training software need to manage device memory, run operators, and execute compiled graphs on the device. Finally, the runtime interacts with the driver, which sits in user space, a decision we made so we can iterate faster on the driver and firmware within our production stack.
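The post doesn’t publish the runtime API, but the responsibilities it describes for the streaming abstraction can be sketched as an interface along these lines. Every name below is hypothetical and is not the actual MTIA Streaming API:

```python
# Hypothetical sketch of the responsibilities the post attributes to the
# runtime's streaming abstraction. Every name here is invented for
# illustration; it is not the actual MTIA Streaming API.
from typing import Protocol, Sequence

class DeviceBuffer(Protocol):
    """Handle to memory allocated on the accelerator."""
    nbytes: int

class AcceleratorStream(Protocol):
    """An ordered queue of work submitted to one accelerator."""

    # Device memory management
    def allocate(self, nbytes: int) -> DeviceBuffer: ...
    def free(self, buf: DeviceBuffer) -> None: ...
    def copy_to_device(self, host_data: bytes, dst: DeviceBuffer) -> None: ...
    def copy_to_host(self, src: DeviceBuffer) -> bytes: ...

    # Running individual operators and whole compiled graphs
    def run_operator(self, kernel_binary: bytes,
                     args: Sequence[DeviceBuffer]) -> None: ...
    def run_compiled_graph(self, graph_binary: bytes,
                           inputs: Sequence[DeviceBuffer]) -> Sequence[DeviceBuffer]: ...

    # Block until all work queued on this stream has completed
    def synchronize(self) -> None: ...
```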
In many ways, this new chip system runs the same software stack as MTIA v1, which made it much faster for the team to deploy, since we had already done much of the integration and development work needed to run our applications on this architecture. The new MTIA is designed to be compatible with code developed for MTIA v1. Because the full software stack had already been integrated with the silicon, we were up and running our traffic on the new chip in a matter of days. This allowed us to land this next-generation MTIA silicon rapidly, going from first silicon to production models running in 16 regions in less than nine months.
We’ve further optimized the software stack by creating the Triton-MTIA compiler backend to generate high-performance code for the MTIA hardware. Triton is an open source language and compiler for writing highly efficient ML compute kernels. It improves developer productivity for writing GPU code and we have found that the Triton language is sufficiently hardware-agnostic to be applicable to non-GPU hardware architectures like MTIA.
The Triton-MTIA backend performs optimizations to maximize hardware utilization and support high-performance kernels. It also exposes key knobs to leverage Triton and MTIA auto-tuning infrastructures to explore the kernel configuration and optimization space.
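For a sense of what those knobs look like, here is a minimal kernel written in the hardware-agnostic Triton language with an autotuning decorator that lets the backend sweep block sizes and warp counts. The kernel and the configuration values are ours, for illustration only; they are not MTIA production settings:

```python
import torch
import triton
import triton.language as tl

# Minimal elementwise kernel in the hardware-agnostic Triton language.
# The autotune configs are the kind of "knobs" a backend can sweep; the
# specific values here are illustrative, not MTIA production settings.
@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE": 256}, num_warps=2),
        triton.Config({"BLOCK_SIZE": 1024}, num_warps=4),
        triton.Config({"BLOCK_SIZE": 4096}, num_warps=8),
    ],
    key=["n_elements"],
)
@triton.jit
def scaled_add_kernel(x_ptr, y_ptr, out_ptr, alpha, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * alpha + y, mask=mask)

def scaled_add(x: torch.Tensor, y: torch.Tensor, alpha: float) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    scaled_add_kernel[grid](x, y, out, alpha, n)
    return out

# Usage, on a device Triton supports:
#   x = torch.rand(1 << 20, device="cuda"); y = torch.rand_like(x)
#   out = scaled_add(x, y, 2.0)
```

Because the kernel speaks only in program IDs, blocks, masks, and pointers, nothing in it is GPU-specific, which is what makes a non-GPU backend like Triton-MTIA possible.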
We have implemented support for the Triton language’s features and for integration with PyTorch 2, providing extensive coverage of PyTorch operators. Thanks to TorchInductor, for example, our developers can leverage Triton-MTIA in both ahead-of-time (AOT) and just-in-time (JIT) workflows.
We observed dramatically improved developer efficiency with Triton-MTIA, which allowed us to scale up compute kernel authoring and significantly expand the support of PyTorch operators.
The results so far show that this MTIA chip can handle both the low complexity (LC) and high complexity (HC) ranking and recommendation models that are components of Meta’s products. Across these models, there can be a ~10x-100x difference in model size and the amount of compute per input sample. Because we control the whole stack, we can achieve greater efficiency compared to commercially available GPUs. Realizing these gains is an ongoing effort and we continue to improve performance per watt as we build up and deploy MTIA chips in our systems.
Early results show that this next-generation silicon has already improved performance by 3x over our first-generation chip across four key models we evaluated. At the platform level, with 2x the number of devices and a powerful 2-socket CPU, we are able to achieve 6x model serving throughput and a 1.5x improvement in performance per watt over the first-generation MTIA system. To achieve this, we have made significant progress optimizing our kernels, compiler, runtime, and host serving stack. The time to optimize models is going down as the developer ecosystem matures, and there is still headroom to improve efficiency further.
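The throughput figure lines up with simple scaling arithmetic (our reading, not a breakdown given in the post):

```python
# Our reading of how the platform figure composes; the post does not
# spell out this breakdown.
per_chip_gain = 3    # next-gen chip vs. first-gen chip, on the four models
device_scaling = 2   # 2x the number of devices per system
print(per_chip_gain * device_scaling)  # 6x model serving throughput
```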
MTIA has been deployed in our data centers and is now serving models in production. We are already seeing the positive results of this program, as it allows us to dedicate and invest more compute power in our most intensive AI workloads. It is proving to be highly complementary to commercially available GPUs in delivering the optimal mix of performance and efficiency on Meta-specific workloads.
MTIA will be an important piece of our long-term roadmap to build and scale the most powerful and efficient infrastructure possible for Meta’s unique AI workloads.
We’re designing our custom silicon to work in cooperation with our existing infrastructure as well as with new, more advanced hardware (including next-generation GPUs) that we may leverage in the future. Meeting our ambitions for custom silicon means investing not only in compute silicon but also in memory bandwidth, networking, and capacity, as well as in other next-generation hardware systems.
We currently have several programs underway aimed at expanding the scope of MTIA, including support for GenAI workloads.
We’re only at the beginning of this journey, and we’re inviting people who want to be a part of it to visit Meta Careers to learn about our open positions.
Acknowledgements
We would like to thank Eugene Burmako, Kaustubh Gondkar, Adam Hutchin, Olivia Wu, and everyone involved in the development and productionization of the next-generation MTIA solution.