May 18, 2023 · 8 min read
AI workloads are ubiquitous at Meta — forming the basis for a wide range of use cases, including content understanding, Feeds, generative AI, and ads ranking. These workloads run on PyTorch, with its first-class Python integration, eager-mode development, and simple APIs. Deep learning recommendation models (DLRMs), in particular, are important for improving experiences across Meta’s services and applications. But as these models increase in size and complexity, the underlying hardware systems need to provide exponentially more memory and compute while remaining efficient.
We found that GPUs were not always optimal for running Meta’s specific recommendation workloads at the levels of efficiency required at our scale. Our solution to this challenge was to design a family of recommendation-specific Meta Training and Inference Accelerator (MTIA) ASICs. We co-designed the first-generation ASIC with next-generation recommendation model requirements in mind and integrated it into PyTorch to create a wholly optimized ranking system. In addition, we maintained the user experience and developer efficiency offered by PyTorch eager-mode development. Maintaining that developer efficiency is an ongoing journey as we continue to support PyTorch 2.0, which supercharges how PyTorch operates under the hood, at the compiler level.
The MTIA v1 (inference) die.
In 2020, we designed the first-generation MTIA ASIC for Meta’s internal workloads. This inference accelerator is part of a co-designed full-stack solution that includes silicon, PyTorch, and the recommendation models. The accelerator is fabricated in TSMC’s 7nm process and runs at 800 MHz, providing 102.4 TOPS at INT8 precision and 51.2 TFLOPS at FP16 precision. It has a thermal design power (TDP) of 25 W.
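As a quick sanity check, the headline specs above are internally consistent. The sketch below derives per-cycle and per-PE figures from them; these derived numbers are illustrative arithmetic, not published specifications (the 64-PE count is described in the grid section below).

```python
# Back-of-the-envelope check of the MTIA v1 headline specs.
# Per-cycle and per-PE figures are derived for illustration only.
TOPS_INT8 = 102.4      # peak INT8 throughput, tera-ops/s
TFLOPS_FP16 = 51.2     # peak FP16 throughput, tera-FLOP/s
FREQ_HZ = 800e6        # clock frequency
NUM_PES = 64           # processing elements in the 8x8 grid

ops_per_cycle = TOPS_INT8 * 1e12 / FREQ_HZ   # INT8 ops per cycle, chip-wide
ops_per_pe = ops_per_cycle / NUM_PES         # INT8 ops per cycle, per PE

print(int(ops_per_cycle))        # 128000
print(int(ops_per_pe))           # 2000
print(TOPS_INT8 / TFLOPS_FP16)   # 2.0 -> FP16 runs at half the INT8 rate
```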
At a high level, the accelerator consists of a grid of processing elements (PEs), on-chip and off-chip memory resources, and interconnects.
The accelerator is equipped with a dedicated control subsystem that runs the system’s firmware. The firmware manages available compute and memory resources, communicates with the host through a dedicated host interface, and orchestrates job execution on the accelerator.
The memory subsystem uses LPDDR5 for the off-chip DRAM resources and can scale up to 128 GB.
The chip also has 128 MB of on-chip SRAM shared among all the PEs, which provides higher bandwidth and much lower latency for frequently accessed data and instructions.
The grid contains 64 PEs organized in an 8x8 configuration. The PEs are connected to one another and to the memory blocks via a mesh network. The grid can be utilized for running a job as a whole, or it can be divided into multiple subgrids that can run independent jobs.
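To illustrate the flexibility this gives the scheduler, here is a minimal Python sketch of carving the 8x8 grid into independent sub-grids. The coordinate scheme and partitioning helper are purely hypothetical and do not reflect the real firmware interface.

```python
# Sketch: partitioning an 8x8 PE grid into independent sub-grids.
# Coordinates and the subgrid() helper are illustrative, not the
# actual MTIA firmware API.
GRID_ROWS, GRID_COLS = 8, 8

def subgrid(row0, col0, rows, cols):
    """Return the PE coordinates of a rows x cols sub-grid."""
    assert row0 + rows <= GRID_ROWS and col0 + cols <= GRID_COLS
    return [(r, c) for r in range(row0, row0 + rows)
                   for c in range(col0, col0 + cols)]

whole = subgrid(0, 0, 8, 8)   # one job using all 64 PEs
job_a = subgrid(0, 0, 4, 8)   # top half: 32 PEs
job_b = subgrid(4, 0, 4, 8)   # bottom half: 32 PEs

print(len(whole))                       # 64
print(set(job_a).isdisjoint(job_b))     # True: independent jobs share no PEs
```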
Each PE is equipped with two processor cores (one of them equipped with the vector extension) and a number of fixed-function units that are optimized for performing critical operations, such as matrix multiplication, accumulation, data movement, and nonlinear function calculation. The processor cores are based on the RISC-V open instruction set architecture (ISA) and are heavily customized to perform necessary compute and control tasks.
Each PE also has 128 KB of local SRAM memory for quickly storing and operating on data. The architecture maximizes parallelism and data reuse, which are foundational for running workloads efficiently.
The chip provides both thread- and data-level parallelism (TLP and DLP), exploits instruction-level parallelism (ILP), and enables abundant memory-level parallelism (MLP) by allowing numerous memory requests to be outstanding concurrently.
The MTIA accelerators are mounted on small dual M.2 boards, which allows for easier aggregation into a server. These boards are connected to the host CPU on the server using PCIe Gen4 x8 links and consume as little as 35 W.
A sample test board with an MTIA.
The servers that host these accelerators use the Yosemite V3 server specification from the Open Compute Project. Each server contains 12 accelerators that are connected to the host CPU and to one another using a hierarchy of PCIe switches. Thus, the communication between different accelerators does not need to involve the host CPU. This topology allows workloads to be distributed over multiple accelerators and run in parallel. The number of accelerators and the server configuration parameters are carefully chosen to be optimal for executing current and future workloads.
The MTIA software (SW) stack aims to provide developer efficiency and high performance. It integrates fully with PyTorch, providing a familiar developer experience. Using PyTorch with MTIA is as easy as using PyTorch for CPUs or GPUs. The MTIA SW stack benefits from the flourishing PyTorch developer ecosystem and tooling. The compiler performs model-level transformations and optimizations using PyTorch FX IR and low-level optimizations using LLVM IR, with extensions to support the custom architecture and ISA of the MTIA accelerator.
The PyTorch runtime for MTIA manages on-device execution and features such as MTIA tensors, memory management, and the APIs for scheduling operators on the accelerator. The runtime and firmware perform communication to the accelerator device. The SW stack supports different modes of execution, such as eager mode and graph mode, and allows workloads to be partitioned across multiple accelerator cards. In the latter case, the SW stack also provides the necessary synchronization and communication between multiple accelerator boards.
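In practice, this means MTIA appears as just another device target in eager-mode PyTorch. The sketch below runs on CPU so it is portable; the `"mtia"` device string in the comment is an assumption about how the backend is exposed, not a documented interface.

```python
import torch
import torch.nn as nn

# On a host with the accelerator, this would be a device such as
# torch.device("mtia") -- the device string is an assumption. We use
# CPU here so the sketch runs anywhere.
device = torch.device("cpu")

model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1)).to(device)
batch = torch.randn(8, 64, device=device)

with torch.no_grad():
    scores = model(batch)  # eager-mode execution, operator by operator

print(scores.shape)  # torch.Size([8, 1])
```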
The MTIA software stack.
There are multiple ways to author compute kernels for the accelerator: in PyTorch; in C/C++, for hand-tuned, highly optimized kernels; or in KNYFE, a new domain-specific language that takes a short, high-level description of an ML operator as input and generates optimized, low-level C++ kernel code implementing that operator for MTIA.
Low-level code generation and optimizations leverage the open source LLVM compiler toolchain with MTIA extensions. The LLVM compiler then takes care of the next level of optimization and code generation to produce efficient executables that run on the processor cores within the PEs.
As part of the SW stack, we have also developed a library of hand-tuned and highly optimized kernels for performance-critical ML kernels, such as fully connected and embedding-bag operators. The higher levels of the SW stack can choose to instantiate and use these highly optimized kernels during the compilation and code generation process.
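At the PyTorch level, these performance-critical operators correspond to familiar modules such as `nn.EmbeddingBag` and `nn.Linear`. The sparse-plus-dense sketch below uses illustrative shapes; it only shows the operator pattern the kernel library targets, not the MTIA kernels themselves.

```python
import torch
import torch.nn as nn

# Two operator types the hand-tuned kernel library targets:
# embedding-bag (sparse lookups + pooling) and fully connected.
emb = nn.EmbeddingBag(num_embeddings=1000, embedding_dim=16, mode="sum")
fc = nn.Linear(16, 4)

ids = torch.tensor([1, 2, 4, 5, 4, 3, 2, 9])  # flattened sparse feature IDs
offsets = torch.tensor([0, 4])                # two bags: ids[0:4] and ids[4:8]

pooled = emb(ids, offsets)  # shape (2, 16): one pooled vector per bag
out = fc(pooled)            # shape (2, 4)

print(out.shape)  # torch.Size([2, 4])
```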
The MTIA SW stack continues to evolve with integration to PyTorch 2.0, which is faster and more Pythonic, yet as dynamic as ever. This will enable new features such as TorchDynamo and TorchInductor. We are also extending Triton DSL to support MTIA accelerators and using MLIR for internal representations and advanced optimizations.
While our SW stack continues to evolve, we collected some results comparing the performance of MTIA with that of other accelerators. The comparison is based on the end-to-end performance of running five different DLRMs, representing low- to high-complexity workloads.
We used five different DLRMs, ranging from low to high complexity, to evaluate MTIA with representative production workloads.
Efficiency is one of the most important factors for deploying accelerators in the data center, and TCO is a measure of efficiency. Our comparison focuses on the performance-per-watt metric (TFLOPS/W), which is a key component of TCO.
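Using the chip-level numbers from earlier, peak performance per watt works out as follows. These are peak figures at chip TDP; delivered efficiency on real workloads, and board- or server-level power, would lower them.

```python
# Peak perf-per-watt from the MTIA v1 chip-level specs.
TOPS_INT8 = 102.4    # peak INT8 throughput, tera-ops/s
TFLOPS_FP16 = 51.2   # peak FP16 throughput, tera-FLOP/s
TDP_W = 25.0         # chip thermal design power

print(TOPS_INT8 / TDP_W)     # 4.096 peak INT8 TOPS/W
print(TFLOPS_FP16 / TDP_W)   # 2.048 peak FP16 TFLOPS/W
```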
Our study compared MTIA with an NNPI accelerator and a GPU. For low-complexity models, MTIA’s advantage comes from handling small shapes and batch sizes more efficiently. For medium- and high-complexity models, MTIA runs larger shapes that are currently much better optimized on the GPU’s SW stack; this is where MTIA’s SW stack is being optimized, and we expect it to reach similar efficiency levels over time.
Performance-per-watt comparison across the low-complexity (LC1, LC2) and medium-complexity (MC1, MC2) DLRM workloads.
Our evaluation found that MTIA handled low-complexity (LC1 and LC2) and medium-complexity (MC1 and MC2) models more efficiently compared with an NNPI and a GPU. We also recognized that we have not yet optimized MTIA for high-complexity (HC) models.
Building custom silicon solutions, especially for the first time, is a significant undertaking. From this initial program, we have learned invaluable lessons that we are incorporating into our roadmap, including architectural insights and software stack enhancements that will lead to improved performance and scale of future systems.
The challenges we need to address are becoming increasingly complicated. Looking at historical industry trends in scaling compute, memory bandwidth, and interconnect bandwidth, we can see that memory and interconnect bandwidth have scaled at a much lower pace than compute over the last several generations of hardware platforms.
Scaling trends for compute, memory, and interconnect bandwidth (source).
The lagging growth of memory and interconnect bandwidth also manifests itself in the final performance of our workloads. For example, we see a significant portion of a workload’s execution time spent on networking and communication.
Moving forward, as part of building a better and more efficient solution, we are focused on striking a balance between these three axes (compute power, memory bandwidth, and interconnect bandwidth) to achieve the best performance for Meta’s workloads. This is an exciting journey, and we’re just getting started.
This project is the result of the work of many talented teams and individuals at Meta. Hence, we would like to especially thank the following teams, whose contributions were instrumental in the success of this project: Infra Silicon, AI & Systems Co-Design, MTIA SW, Emulation, ASIC Platform Software, Hardware Platforms, Release to Production (RTP), and Sourcing and Operations Engineering (SOE).
Technical Lead, Infra Silicon
Engineering Manager & Tech Lead