Significantly faster Vision Transformer training

May 5, 2022

What the research is

Vision Transformers (ViTs) — adopted in a wide range of computer vision tasks, from image classification to object detection and segmentation — can achieve state-of-the-art results in visual representation and recognition. Because the performance of computer vision models tends to improve with more parameters and longer training schedules, the AI community has experimented with ever-larger ViTs. But as models begin to exceed teraflops scale, the field has come up against major bottlenecks. Training a single model can take months and require hundreds or thousands of GPUs, inflating accelerator requirements and pushing large-scale ViTs beyond the reach of many practitioners.

To broaden access to ViTs, we have developed methods to make training more efficient. For these models to become more accessible, training must be optimized to achieve the best accelerator utilization. But this process is laborious and requires considerable expertise. To set up an orderly experiment, researchers must choose from myriad possible optimizations: Any of the millions of operations conducted in a single training pass could be hampered by inefficiencies.

We found that we could improve compute and memory efficiency by applying a series of optimizations to the ViT implementation in PyCls, Meta AI’s image classification codebase. Our improvements boosted training speed and per-accelerator throughput (TFLOPS) for ViT models trained using PyCls.

The relative increase in per-chip accelerator throughput over the V100 baseline when using the optimized codebase. The optimized A100 configuration has 4.05x more accelerator throughput than the V100 baseline.

How it works

We began by profiling our codebase to identify potential sources of inefficiency, eventually zeroing in on our choice of number format. As a default, most applications represent neural network values in the 32-bit single-precision floating-point format. Converting to a 16-bit half-precision format (FP16) reduces a model’s memory footprint and execution time but often lowers its accuracy as well.

We sought a middle ground: mixed precision. With this method, the system speeds training and reduces memory use by performing computations in half precision, while the results are stored in single precision to preserve accuracy. Rather than manually casting parts of our network down to half precision, we experimented with different modes of automatic mixed precision training (AMP), which automatically toggles between number formats. Advanced modes of AMP rely primarily on half-precision operations and model weights. We found a balanced setting that significantly accelerates training without sacrificing accuracy.
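The reason for keeping results in single precision can be illustrated with a small, framework-free sketch. It is not the AMP implementation used in PyCls; it only simulates FP16 and FP32 rounding with Python’s struct module. The helper names (to_fp16, to_fp32, accumulate) and the update size are illustrative assumptions.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def accumulate(steps: int, update: float, store) -> float:
    """Apply `steps` small updates to a weight that is kept in `store` precision."""
    weight = store(1.0)
    for _ in range(steps):
        # The update itself is computed in half precision in both cases.
        delta = to_fp16(update)
        # Only the precision used to store the running result differs.
        weight = store(weight + delta)
    return weight


# Hypothetical numbers: 100 updates of 1e-4 applied to a weight of 1.0.
pure_fp16 = accumulate(steps=100, update=1e-4, store=to_fp16)
mixed = accumulate(steps=100, update=1e-4, store=to_fp32)

print(f"stored in FP16: {pure_fp16:.6f}")
print(f"stored in FP32: {mixed:.6f}")
```

In this toy setup the half-precision weight never moves from 1.0, because each update is smaller than half the gap between adjacent FP16 values near 1.0 (2^-10). The single-precision copy accumulates all 100 updates and ends near 1.01.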

To make our process even more efficient, we took advantage of FairScale’s Fully Sharded Data Parallel (FSDP) training algorithm, which shards parameters, gradients, and optimizer states across the GPUs. With FSDP, we can build models that are orders of magnitude larger using fewer GPUs. We also used multi-tensor apply (MTA) optimizers and a pooled ViT classifier, and we adopted a batch-second input tensor layout to skip redundant transpose operations.
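To give a rough sense of why sharding helps, the sketch below estimates per-GPU memory for model state alone (parameters, gradients, and optimizer states) with and without sharding. The figures are assumptions for illustration, not measurements from the experiments described here: roughly 632 million parameters for a ViT-H-sized model, and 16 bytes of state per parameter, a common estimate for mixed-precision training with an Adam-style optimizer. Activations and communication buffers are ignored, and the function name is hypothetical.

```python
def model_state_gib(num_params: int, num_gpus: int, sharded: bool,
                    bytes_per_param: int = 16) -> float:
    """Estimate per-GPU memory (GiB) for parameters, gradients, and optimizer state.

    bytes_per_param=16 is an assumed figure, not a measured one.
    """
    total_bytes = num_params * bytes_per_param
    # Without sharding every GPU holds a full replica of the model state;
    # with sharding each GPU holds only its 1/num_gpus slice.
    per_gpu_bytes = total_bytes / num_gpus if sharded else total_bytes
    return per_gpu_bytes / 2**30


NUM_PARAMS = 632_000_000  # assumed size of a ViT-H-scale model
NUM_GPUS = 8              # hypothetical cluster size

replicated = model_state_gib(NUM_PARAMS, NUM_GPUS, sharded=False)
sharded = model_state_gib(NUM_PARAMS, NUM_GPUS, sharded=True)

print(f"replicated (DDP-style): {replicated:.2f} GiB per GPU")
print(f"sharded (FSDP-style):   {sharded:.2f} GiB per GPU")
```

Under these assumptions the model state drops from about 9.4 GiB per GPU to about 1.2 GiB per GPU on eight GPUs. The freed memory is what leaves room for larger models or larger batches.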

The x-axis designates possible optimizations, and the y-axis shows the relative increase in accelerator throughput for ViT-H/16 training compared with the distributed data parallel (DDP) baseline.

We achieved 1.51x higher accelerator throughput, measured as the number of floating-point operations performed per second on each accelerator chip, using a total batch size of 560. Expanding the image size from 224 pixels to 256 pixels raised the gain to 1.86x. However, altering the image size changes the hyperparameters, which can affect the model’s accuracy. The relative throughput increases to 2.18x when training in full FP16 mode, although this sometimes reduces accuracy (the accuracy degradation was less than 10 percent in our experiments).
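One way to see why image size is not a free knob: assuming the standard ViT patching scheme, in which the "/16" in ViT-H/16 denotes 16x16-pixel patches, a larger image yields more patches per image and therefore a longer token sequence. The sketch below counts patches only. The helper name is hypothetical, and any extra tokens (such as a class token) are ignored.

```python
def num_patches(image_size: int, patch_size: int = 16) -> int:
    """Count the non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


patches_224 = num_patches(224)  # 14 patches per side
patches_256 = num_patches(256)  # 16 patches per side
growth = patches_256 / patches_224

print(f"224 px: {patches_224} patches, 256 px: {patches_256} patches")
print(f"sequence length grows by {growth:.2f}x")
```

Each image at 256 pixels produces 256 patches instead of 196, so the workload per image changes along with the resolution. This is one reason results at the two image sizes are not directly interchangeable.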

The y-axis shows epoch time — the duration of one training pass over the entire ImageNet-1K data set. We focused on the actual wall-clock training time of existing recipes that typically use an image size of 224 pixels, so we did not plot observations with larger image sizes.

Using our optimizations, we reduced epoch time from 0.65 hours to 0.43 hours.
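As a quick consistency check, the ratio of the two epoch times can be compared with the 1.51x throughput gain reported above, and the saving extrapolated to a full schedule. The 300-epoch schedule below is a hypothetical example, not a recipe taken from these experiments.

```python
BASELINE_EPOCH_HOURS = 0.65   # epoch time before the optimizations
OPTIMIZED_EPOCH_HOURS = 0.43  # epoch time after the optimizations
NUM_EPOCHS = 300              # hypothetical schedule length (assumption)

# Wall-clock speedup implied by the two epoch times.
speedup = BASELINE_EPOCH_HOURS / OPTIMIZED_EPOCH_HOURS

# Total training time for the assumed schedule, before and after.
baseline_total = BASELINE_EPOCH_HOURS * NUM_EPOCHS
optimized_total = OPTIMIZED_EPOCH_HOURS * NUM_EPOCHS
hours_saved = baseline_total - optimized_total

print(f"speedup: {speedup:.2f}x")
print(f"{NUM_EPOCHS} epochs: {baseline_total:.0f} h -> {optimized_total:.0f} h "
      f"({hours_saved:.0f} h saved)")
```

The ratio works out to about 1.51, in line with the throughput gain at an image size of 224 pixels. Over the assumed 300 epochs, total training time would fall from about 195 hours to about 129 hours.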

The x-axis specifies the number of accelerator chips in a particular configuration of A100 GPUs, and the y-axis indicates absolute throughput in TFLOPS per chip.

We also investigated the effects of different GPU configurations. In each case, our system achieved higher throughput than the distributed data parallel (DDP) baseline. As we increased the number of chips, we observed a slight drop in throughput due to the overhead of inter-device communication. However, even using 64 GPUs, our system was 1.83x faster than the DDP baseline.

Why it matters

Doubling the achievable throughput in ViT training effectively doubles the training cluster size, and improved accelerator utilization directly reduces the carbon footprint of AI models. Given the recent trend of developing larger models with longer training times, we hope our optimizations will help the research community further push the state of the art, with shorter turnaround times and enhanced productivity.

Written By

Research Engineer

Anjali Sridhar

Research Engineer

Hanzi Mao

Research Scientist

Natalia Gimelshein

Applied Research Scientist

Myle Ott

Research Engineer

Ross Girshick

Research Scientist

Rahul Iyer

Software Engineering Manager

Wan-Yen Lo

Research Engineering Manager