May 18, 2023 · 13 min read
Processing video for video on demand (VOD) and live streaming is already compute intensive. It involves encoding large video files into more manageable formats and transcoding them to deliver to audiences. Today, demand for video content is greater than ever. And emerging use cases, such as generative AI content, mean the demands on video infrastructure are only going to intensify.
GPUs and even general-purpose CPUs are capable of handling video processing. But at the scale that Meta operates (serving billions of people all over the world), and with an eye on future AI-related use cases, we believe that dedicated hardware is the best solution in terms of compute power and efficiency.
Meta Scalable Video Processor (MSVP) is Meta’s first in-house-developed ASIC solution — designed for the processing needs of the ever-growing VOD and live streaming workloads at Meta. MSVP is programmable and scalable, and can be configured to efficiently support both the high-quality transcoding needed for VOD as well as the low latency and faster processing times that live streaming requires.
In the future, MSVP will also help bring new forms of video content to every member of Meta’s family of apps — including AI-generated content as well as AR and VR content in the metaverse.
On Facebook alone, people spend 50 percent of their time on the app watching video. To serve the wide variety of devices all over the world (mobile devices, laptops, TVs, etc.), videos uploaded to Facebook or Instagram, for example, are transcoded into multiple bitstreams, with different encoding formats, resolutions, and quality.
Any of these videos can draw anywhere from a handful of views to millions of them. On Facebook alone, there are more than 4 billion video views per day. At this scale, improving the compression efficiency of our video encodings is crucial for delivering the best quality video experience (even on low-bandwidth networks), reducing data usage, and saving energy. At the same time, this scale also means we need to consider the computational complexity of the encoder as well as device support for the video codec.
Every generation of video coding standards brings in about an additional 50 percent compression efficiency, which can be used to serve high-quality video with lower bit rates, but this compression efficiency comes with a 10x computational cost. At Meta’s scale, we need a video encoding solution that can deliver the best quality video possible, with the shortest amount of encoding time — all while being energy efficient, programmable, and scalable.
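As a back-of-the-envelope illustration of that trade-off, using only the rough figures above (each generation roughly halves the bitrate needed for the same quality while costing roughly 10x the encoding compute):

```python
def generation_tradeoff(generations, base_bitrate=1.0, base_compute=1.0):
    """Relative (bitrate, compute) after N codec generations, using the
    rough rule of thumb from the text: ~50% bitrate savings and ~10x
    compute cost per generation. Purely illustrative arithmetic."""
    return base_bitrate * 0.5 ** generations, base_compute * 10 ** generations

# Two generations ahead: ~1/4 the bits at the same quality, ~100x the compute.
bitrate, compute = generation_tradeoff(2)
```

Compounding across generations is why raw energy efficiency, rather than peak compute alone, drives the design.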
This is where hardware acceleration thrives. ASICs are easy to scale and offer very high energy efficiency.
With MSVP in place for hardware acceleration, we can meet the compute and efficiency demands of these workloads at scale.
A high-level overview of MSVP.
The host (server) communicates with the on-chip CPU subsystem through the PCIe interface to schedule streams to be decoded, encoded, preprocessed, or transcoded. Bitstreams are pulled from host memory through the PCIe DMA engine and stored in the local LPDDR. This operation can be triggered over the host-firmware (FW) interface for every frame or for a group of frames. When the final transcoded bitstreams are ready, FW signals the host to transfer them to host memory.
The MSVP’s key components are:
Multiprocessor CPU subsystem
4-lane PCIe (Gen4) PHY and controller connecting the ASIC to the Host CPU
LPDDR5 PHY and controller connecting to LPDDR5 memory (8 GB)
Debug and trace connected to JTAG and USB
Peripherals (SPI, UART, SMBUS, and GPIOs)
High-speed network on chip (NoC) connecting all units to the memory controller
The transcoder IP is the core IP that implements the video transcoding operation
The primary stages of the transcoding process (decoding, preprocessing, and encoding) are implemented as memory-to-memory operations: intermediate buffers are written back to DRAM and refetched as needed by the downstream operation.
Each MSVP ASIC can offer a peak transcoding performance of 4K at 15fps at the highest quality configuration with 1-in, 5-out streams and can scale up to 4K at 60fps at the standard quality configuration. Performance scales linearly with resolution. This performance is achieved at ~10W of PCIe module power. We achieved a throughput gain of ~9x for H.264 when compared against libx264 SW encoding. For VP9, we achieved a throughput gain of ~50x when compared with libVPX speed 2 preset.
In video coding, compression efficiency is assessed and compared using the Bjontegaard delta rate (BD-Rate), which estimates the change in bits (negative meaning bits saved) needed to deliver the same objective quality for a given video relative to a baseline configuration.
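For readers unfamiliar with BD-Rate, here is a minimal sketch of the computation. The standard method fits a cubic polynomial of log-bitrate versus quality; this sketch substitutes piecewise-linear interpolation, which captures the same idea: average the log-rate gap over the overlapping quality range, then convert it to a percentage.

```python
import math

def _interp(x, xs, ys):
    """Piecewise-linear interpolation; xs must be ascending."""
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])
    raise ValueError("x outside curve range")

def bd_rate(rate_base, qual_base, rate_test, qual_test, samples=100):
    """Average bitrate difference (percent) of a test encoder vs. a
    baseline at equal quality; negative means bits saved. Inputs are
    R-D curves sorted by ascending quality. The standard BD-Rate fits
    a cubic in log-rate; this sketch integrates a piecewise-linear fit
    over the overlapping quality range instead."""
    lr_base = [math.log(r) for r in rate_base]
    lr_test = [math.log(r) for r in rate_test]
    lo = max(qual_base[0], qual_test[0])
    hi = min(qual_base[-1], qual_test[-1])
    diffs = [
        _interp(lo + (hi - lo) * i / samples, qual_test, lr_test)
        - _interp(lo + (hi - lo) * i / samples, qual_base, lr_base)
        for i in range(samples + 1)
    ]
    avg = sum(diffs) / len(diffs)
    return (math.exp(avg) - 1) * 100
```

For example, a test curve that needs 10 percent fewer bits than the baseline at every quality level yields a BD-Rate of about -10 percent.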
The MSVP encoder has two main goals: to be highly power efficient and to deliver the same or better video quality as software encoders. There are existing video encoder IPs, but most of them are targeted at mobile devices with tight area/power constraints and cannot meet the quality bar set by current software encoders.
Because software encoders offer very flexible control and fast evolution over time, it is quite challenging for ASIC video encoders to meet the same performance bar as software encoders.
Here’s a simplified version of the data flow of modern hybrid video encoders (hybrid in the sense that they combine prediction and transform coding):
Simplified video encoder modules.
These encoders use intra-coding to reduce spatial redundancy and inter-coding to remove temporal redundancy. Different stages of motion estimation are applied in inter-coding to find the best prediction among all possible block positions in the available reference frames. Entropy coding is the lossless compression stage that squeezes out the statistical redundancy of all syntax elements, including encoding modes, motion vectors, and quantized residual coefficients.
For MSVP’s algorithms to perform the way we wanted, we had to find hardware-friendly alternatives for each of the above key modules. We mainly focused on three levels: block level, frame level, and group of picture (GOP) level.
At the block level, we looked for coding tools with the highest return on investment, that were easy/economical (in terms of silicon area and power requirements) to implement in hardware, and that met our performance targets while maximizing compression efficiency. At frame level, we studied the best algorithms to make intelligent frame type decisions among I/P/B frames, and the best rate-control algorithms based on statistics collected from hardware. And at the GOP level, we had to figure out whether to use multiple-pass encoding with look-ahead, or to insert intra (key) frames at a given shot boundary.
Motion estimation is one of the most computationally intensive algorithms in video encoding. To find accurate motion vectors that closely match the block currently being encoded, a full motion estimation pipeline often includes a multistage search to balance among large search range, computing complexity, and accuracy.
MSVP’s motion search algorithm must identify which neighboring blocks contribute most to quality and search only around highly correlated neighbors within a limited cycle budget. Although we lack the flexibility of iterative software motion search algorithms, such as diamond or hexagon patterns, hardware motion estimation can search multiple blocks in parallel. This lets us search more candidates, cover a larger search range and more reference frames in both single-direction and bidirectional modes, and search all supported block partition shapes in parallel.
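The contrast with iterative software search can be sketched as follows. This is an illustrative candidate-set search, not MSVP’s actual algorithm: the candidate motion vectors (the neighbors’ vectors plus small offsets around each) are all evaluated independently, which is exactly the kind of work a hardware engine can do in parallel.

```python
def sad(block, ref, mv, x, y):
    """Sum of absolute differences between the current block at (x, y)
    and the reference block displaced by motion vector mv = (dx, dy)."""
    dx, dy = mv
    return sum(
        abs(block[r][c] - ref[y + dy + r][x + dx + c])
        for r in range(len(block))
        for c in range(len(block[0]))
    )

def candidate_motion_search(block, ref, x, y, neighbor_mvs):
    """Illustrative candidate-set motion search: gather the neighbors'
    motion vectors plus small offsets around each, score every
    candidate independently (the step hardware performs in parallel),
    and keep the best match. Callers must ensure every candidate stays
    inside the reference frame."""
    candidates = {
        (mvx + ox, mvy + oy)
        for mvx, mvy in neighbor_mvs + [(0, 0)]
        for ox in (-1, 0, 1)
        for oy in (-1, 0, 1)
    }
    return min(candidates, key=lambda mv: sad(block, ref, mv, x, y))
```

An iterative diamond search would refine one candidate at a time; here the whole set is scored at once, trading sequential refinement for breadth.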
Achieving high video encoding quality also requires RDO support. Since there are so many decisions to make in video encoding (intra/inter modes, partition block size, transform block types/sizes, etc.), RDO is one of the best practices in video compression to determine which mode is optimal given the current rate or quality target.
MSVP supports exhaustive RDO at almost all mode decision stages. Distortion calculation is intensive but both straightforward and easily parallelizable. The unique challenge is bit rate estimation: Entropy coding for the final bitstream is sequential in nature, and each context model depends on the previously encoded ones. In a hardware encoder, the rate-distortion (RD) cost of different blocks/partitions may be evaluated in parallel, so perfectly accurate bit rate estimation is impossible. We implemented a fairly accurate bit rate estimation model in MSVP that is hardware friendly, in that it makes it easy to evaluate multiple coding modes in parallel.
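The core RDO decision uses the usual Lagrangian cost J = D + λR. The key point from the paragraph above is that R is an estimate, so candidates carry no sequential entropy-coding state and can be scored independently (in parallel in hardware). The mode names and numbers below are illustrative only:

```python
def rd_cost(distortion, est_bits, lmbda):
    """Lagrangian rate-distortion cost: J = D + lambda * R."""
    return distortion + lmbda * est_bits

def best_mode(candidates, lmbda):
    """Pick the candidate (name, distortion, estimated_bits) with the
    lowest RD cost. Because the bit counts are estimates rather than
    the output of sequential entropy coding, every candidate's cost is
    independent -- which is what lets hardware evaluate them in
    parallel. Candidate tuples here are hypothetical examples."""
    return min(candidates, key=lambda m: rd_cost(m[1], m[2], lmbda))
```

Note how the winner flips with λ: at a low λ (quality matters more), a low-distortion inter mode wins; at a high λ (bits matter more), a cheap skip mode wins.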
Quantization is the only lossy part of video compression, and it is also the dominant bit rate control knob in any video coding standard. The corresponding parameter is called the quantization parameter (QP), and it is inversely related to quality: Low QP values result in small quantization errors, creating low distortion levels and, subsequently, high quality at the expense of higher bit rates. By making smart quantization choices, encoding bits can be allocated to areas that impact visual quality the most. We perform smart quantization using optimal QP selection and rounding decisions.
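To make the QP/quality relationship concrete: in H.264, for instance, the quantizer step size roughly doubles for every increase of 6 in QP (Qstep ≈ 2^((QP−4)/6)). This is standard H.264 behavior, not anything MSVP-specific; a minimal sketch:

```python
def h264_qstep(qp):
    """Approximate H.264 quantizer step size: Qstep ~ 2 ** ((QP - 4) / 6),
    i.e., the step doubles every 6 QP (standard H.264 behavior)."""
    return 2 ** ((qp - 4) / 6)

def quantize(coeff, qp):
    """Uniform quantization: larger QP -> larger step -> larger error."""
    return round(coeff / h264_qstep(qp))

def dequantize(level, qp):
    """Reconstruct the coefficient from its quantized level."""
    return level * h264_qstep(qp)
```

At QP 28 the step is 16, so a coefficient of 100 reconstructs to 96 (error 4); at QP 40 the step is 64 and the same coefficient reconstructs to 128 (error 28), fewer bits at the cost of more distortion.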
Modern video coding standards allow different QP values to be applied to different coding units. In MSVP’s hardware encoder, block-level QP values are determined adaptively based on both spatial and temporal characteristics.
In spatial adaptive QP (AQP) selection, since the human visual system is less sensitive to quality loss at high texture or high motion areas, a larger QP value can be applied to these coding blocks. In temporal AQP, coding blocks that are referenced more in the future can be quantized with a lower QP to get higher quality, such that future coding blocks that reference these blocks will benefit from it.
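A toy version of such an AQP rule might look like the following. The thresholds and step sizes are invented for illustration and are not MSVP’s actual parameters:

```python
def adaptive_qp(base_qp, texture, motion, future_refs,
                spatial_step=2, temporal_step=1, qp_min=0, qp_max=51):
    """Toy adaptive-QP rule in the spirit of the text (all thresholds
    and steps are illustrative, not MSVP's). High texture or motion
    masks quality loss, so raise QP there; blocks referenced by many
    future blocks repay extra quality, so lower their QP."""
    qp = base_qp
    if texture > 0.5 or motion > 0.5:
        qp += spatial_step              # spatial AQP: spend fewer bits
    qp -= temporal_step * future_refs   # temporal AQP: spend more bits
    return max(qp_min, min(qp_max, qp))
```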
Smart rounding jointly optimizes the rounding decisions for all coefficients in each coding block. Since the rounding choices at different coefficient positions depend on one another, we need algorithms that remove this dependency while maintaining rounding accuracy. To reduce compute cost, we apply smart rounding only in the final stage, after the coding mode for each block has been determined. This feature alone achieves a ~1 percent to 2 percent BD-Rate improvement.
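A simplified, per-coefficient version of the idea (the real optimization is joint across the coefficients of a block, as noted above) chooses between the two nearest quantization levels by rate-distortion cost rather than by plain rounding. The bit costs here are invented placeholders:

```python
def smart_round(coeffs, qstep, lmbda, bits_nonzero=4, bits_zero=1):
    """Per-coefficient rounding sketch (illustrative, not MSVP's joint
    algorithm): for each coefficient, pick the quantization level --
    rounded down or up -- with the lower distortion + lambda * bits
    cost, where a zero level is assumed cheaper to code than a nonzero
    one. bits_nonzero/bits_zero are made-up placeholder costs."""
    levels = []
    for c in coeffs:
        lo = int(abs(c) // qstep)       # level rounded toward zero
        best = None
        for level in (lo, lo + 1):
            err = (abs(c) - level * qstep) ** 2
            bits = bits_zero if level == 0 else bits_nonzero
            cost = err + lmbda * bits
            if best is None or cost < best[0]:
                best = (cost, level)
        levels.append(best[1] if c >= 0 else -best[1])
    return levels
```

The same coefficient can round differently depending on λ: when bits are cheap it rounds up to the closer level, and when bits are expensive it collapses to zero.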
The frame-level algorithm for the MSVP H.264 encoder can be configured as either two-pass or one-pass, depending on whether the use case is VOD or live streaming. In the high-quality (longer latency) VOD two-pass mode, MSVP looks ahead N frames and collects statistics, such as intra/inter cost and motion vectors, from these frames. Based on these statistics, frame-level control applies back-propagation on the reference tree in the look-ahead buffers to assign an importance to each reference frame. The accumulated reference importance of the frame to be coded is then modulated using the temporal AQP of each block. Finally, the delta QP map is passed to the final encoding pass to be used as the encoding QP, which is also captured in the output bitstream.
MSVP H.264 encoder frame level control flow.
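The back-propagation step described above can be sketched as follows. This mirrors the general reference-tree idea (similar in spirit to the mb-tree-style lookahead of software encoders) rather than MSVP’s exact algorithm: the portion of each frame that inter prediction explains well is credited back to its reference frame, accumulating importance that can later be converted into a delta QP.

```python
def propagate_importance(lookahead):
    """Back-propagate frame importance over a lookahead window
    (illustrative sketch, not MSVP's exact algorithm). Each entry is
    (intra_cost, inter_cost, ref_idx): the better inter prediction
    explains a frame relative to intra coding (low inter/intra ratio),
    the more importance is credited back to its reference frame."""
    importance = [0.0] * len(lookahead)
    for i in range(len(lookahead) - 1, -1, -1):  # newest frame first
        intra, inter, ref = lookahead[i]
        if ref is None or intra == 0:
            continue
        propagated = (intra + importance[i]) * max(0.0, 1 - inter / intra)
        importance[ref] += propagated
    return importance
```

A frame’s accumulated importance can then be mapped to a negative delta QP so that heavily referenced frames are coded at higher quality.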
In MSVP’s VP9 encoder, multiple-pass encoding is also enabled for high-quality VOD use cases. An analysis pass (the first pass) is performed up front to capture the video characteristics into a set of statistics, and the statistics are used to determine the frame level parameters for filtering and encoding. Since VP9’s frame type is different from H.264’s, the strategy for making frame level decisions is also different, as shown in the following figure:
VP9 encoder frame level algorithm flow.
Most of the overall watch time on Meta’s apps is generated by a relatively small percentage of the videos uploaded. Therefore, we have to use different encoding configurations to process videos based on their popularity and optimize the compression and compute efficiency at scale.
At the system level, a benefit-cost model predicts the watch time for every uploaded video and, based on that prediction, controls which encoding configuration is triggered.
Once a video is uploaded to one of our apps, it’s processed by the back-end production and delivery system. At Meta, we decouple the production and delivery processes. On the production side, we have multiple ABR encoding families, with different codecs and configurations. The goal of the basic ABR family is to quickly encode videos with good quality so people can share them.
Publishing latency is the key metric we need to optimize for. Once the video gets sufficiently high watch time, a full ABR encoding is triggered, which also produces a fixed ABR ladder using H.264 and VP9 high-quality presets to further improve video quality. Once the video gets even more watch time, an advanced ABR encoding is triggered, which produces a dynamic ABR ladder.
This is a very computationally complex process. But with MSVP’s support, we can significantly reduce the compute cost of the advanced ABR encoding family, use it to replace the full ABR encoding family, and extend advanced ABR encoding to more videos.
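The tiering described in the last few paragraphs can be sketched as a simple threshold policy. The thresholds and names below are illustrative; the production system uses a learned benefit-cost model rather than fixed cutoffs:

```python
def pick_encoding_family(predicted_watch_hours,
                         full_threshold=100, advanced_threshold=10000):
    """Toy benefit-cost tiering (thresholds are made up): every upload
    gets the fast 'basic' ABR encoding so it can be shared quickly; as
    watch time grows, the costlier 'full' and then 'advanced'
    (dynamic-ladder) families become worth their compute."""
    if predicted_watch_hours >= advanced_threshold:
        return "advanced_abr"
    if predicted_watch_hours >= full_threshold:
        return "full_abr"
    return "basic_abr"
```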
Getting higher compression efficiency becomes very important for popular and viral videos. RDX is a methodology that determines the best resolution and QP to encode for a given channel condition (bit rate) and client viewport. It typically involves a preprocessing stage where a fast encoding is performed to estimate distortion and rate at different resolutions and QPs. Using those statistics, a convex hull is generated using dynamic optimization to get the best encoding lanes for the particular video. The number of encodings in the preprocessing step is much higher than those typically produced for an ABR family and thus requires 4-6x the compute resources.
With MSVP’s high perf/W, RDX compute requirements are significantly reduced. MSVP provides a super-fast preset that delivers 6x the performance of the corresponding high-quality preset, HQ_VOD. A prediction model/mapping derives the R-D points of HQ_VOD from the R-D points of the super-fast preset. Experimental results show very small loss (~1 percent BD-Rate) between the predicted R-D curve and the actual R-D curve generated from HQ_VOD.
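The convex-hull step of RDX can be sketched as follows: given (bitrate, quality) points measured from fast encodings at multiple resolutions and QPs, keep only the points on the upper convex hull, i.e., the best quality per bit. This is a generic hull construction, not Meta’s exact dynamic optimization, and it omits the super-fast-to-HQ_VOD prediction mapping:

```python
def convex_hull_points(points):
    """Upper convex hull of (bitrate, quality) points (illustrative
    sketch of the RDX preprocessing step). Returns the subset of
    points, in ascending bitrate order, that dominate all others:
    no kept point is beaten on quality at equal or lower bitrate, and
    the quality-per-bit slope along the hull is decreasing."""
    hull = []
    for p in sorted(points):                 # ascending bitrate
        if hull and hull[-1][1] >= p[1]:
            continue                         # dominated: more bits, no gain
        while len(hull) >= 2:
            (r1, q1), (r2, q2) = hull[-2], hull[-1]
            # pop the middle point if it falls below the chord
            if (q2 - q1) * (p[0] - r2) <= (p[1] - q2) * (r2 - r1):
                hull.pop()
            else:
                break
        hull.append(p)
    return hull
```

The surviving points define the ladder candidates; everything below the hull costs more bits for less quality and is discarded.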
The tables below show a comparison between MSVP’s encoding efficiency and high-quality software encoders — specifically x264 for H.264 and libvpx for VP9. We get bit rate and distortion (quality) at a wide range of bit rates and resolutions to sweep different user conditions. The experiment is run on the AOM CTC dataset, and the average bit rate difference (for the same quality) between MSVP (test) and SW (baseline) is presented. MSVP HQ_VOD and FAST_VOD H.264 quality is better than that of x264 slow preset (in terms of SSIM). MSVP VP9 quality is between speed 1 and speed 2 for the AOM CTC dataset; it is closer to speed 1 on compressed videos, as tools and algorithms are more optimized for user-uploaded (typically compressed) videos, compared to pristine video content.
Baseline: x264 medium
Baseline: libvpx v1.8 speed 2
AV1 is very important for Meta, and HW acceleration for AV1 is necessary to increase the adoption across all our workloads. However, we wanted to enable HW acceleration for the bulk of our workloads as soon as possible and chose to move AV1 encode support to the next generation of MSVP.
MSVP is just the first milestone in our effort to develop new hardware-assisted video processing solutions. MSVP opens up a lot of opportunities for us, and we are at a very early stage.
Given the superior video quality MSVP offers over its software counterparts (for both H.264 and VP9 encoding), we are also offloading basic ABR encoding to MSVP to reduce publishing latency and improve quality.
Our plan is to eventually offload the majority of our stable and mature video processing workloads to MSVP and use software only for workloads that require specific customization and significantly higher quality.
We’re also continuing to work on further improving video quality with MSVP using preprocessing methods such as smart denoising and image enhancement, as well as post-processing methods such as artifact removal and super-resolution. In the future, MSVP will allow us to support even more of Meta’s most important use cases and needs, including short-form videos, enabling efficient delivery of generative AI, AR/VR, and other metaverse content.
This project has been the culmination of dedicated and enthusiastic work of many talented teams and individuals. While it is impossible to mention every team and every person here, the authors would like to specially thank all the Meta Infra teams who helped take MSVP from concept to production.
TLM, ASIC Architecture
Technical Lead Manager, Infra Silicon
Technical Lead Manager, Infra Silicon