August 31, 2020
This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module. PyTorch is a widely adopted scientific computing package used in deep learning research and applications. Recent advances in deep learning argue for the value of large datasets and large models, which necessitates the ability to scale out model training to more computational resources. Data parallelism has emerged as a popular solution for distributed training thanks to its straightforward principle and broad applicability. In general, the technique of distributed data parallelism replicates the model on every computational resource to generate gradients independently and then communicates those gradients at each iteration to keep model replicas consistent. Despite the conceptual simplicity of the technique, the subtle dependencies between computation and communication make it non-trivial to optimize distributed training efficiency. As of v1.5, PyTorch natively provides several techniques to accelerate distributed data parallel training, including bucketing gradients, overlapping computation with communication, and skipping gradient synchronization. Evaluations show that, when configured appropriately, the PyTorch distributed data parallel module attains near-linear scalability using 256 GPUs.
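For illustration, below is a minimal sketch of how these techniques surface in PyTorch's public API: wrapping a model in DistributedDataParallel enables gradient bucketing and computation/communication overlap automatically during backward(), while the no_sync() context manager skips gradient synchronization, for example during gradient accumulation. The function name train_loop and the accumulation_steps parameter are illustrative, and the sketch assumes the process group has already been initialized (e.g., by a launcher).

import contextlib
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

def train_loop(model, data_loader, optimizer, accumulation_steps=4):
    # Illustrative sketch; assumes dist.init_process_group(...) was already
    # called, one process per GPU. DDP buckets gradients and overlaps their
    # allreduce with the remaining backward computation automatically.
    ddp_model = DDP(model)
    for step, (inputs, targets) in enumerate(data_loader):
        sync_step = (step + 1) % accumulation_steps == 0
        # no_sync() suppresses the allreduce so gradients accumulate
        # locally; on sync steps, backward() launches the bucketed,
        # overlapped gradient communication as usual.
        ctx = contextlib.nullcontext() if sync_step else ddp_model.no_sync()
        with ctx:
            loss = F.mse_loss(ddp_model(inputs), targets)
            loss.backward()
        if sync_step:
            optimizer.step()
            optimizer.zero_grad()

Under no_sync(), gradients simply accumulate in each parameter's .grad field, so one synchronized backward pass amortizes a single round of communication over several iterations.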
Written by
Shen Li
Brian Vaughan
Omkar Salpekar
Pritam Damania
Soumith Chintala
Teng Li
Yanli Zhao
Adam Paszke
Pieter Noordhuis
Publisher
VLDB (Industrial Track)