April 8, 2021
This paper proposes and optimizes CPR, a partial recovery training system for recommendation models. CPR relaxes the consistency requirement by letting non-failed nodes proceed without loading checkpoints when a node fails during training, which reduces failure-related overhead. To the best of our knowledge, this paper is the first to perform a data-driven, in-depth analysis of applying partial recovery to recommendation models, and it identifies a trade-off between accuracy and performance. Motivated by the analysis, we present CPR, a partial recovery training system that can reduce training time and maintain the desired level of model accuracy by (1) estimating the benefit of partial recovery, (2) selecting an appropriate checkpoint saving interval, and (3) prioritizing the saving of updates to more frequently accessed parameters. Two variants of CPR, CPR-MFU and CPR-SSU, reduce the checkpoint-related overhead from 8.2–8.5% under full recovery to 0.53–0.68%, on a configuration emulating the failure pattern and overhead of a production-scale cluster. While reducing overhead significantly, CPR achieves model quality on par with the more expensive full recovery scheme when training a state-of-the-art recommendation model on Criteo’s Terabyte CTR dataset. Our results also suggest that CPR can speed up training on a real production-scale cluster without notably degrading accuracy.
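Item (3) above, prioritizing frequently accessed parameters when saving, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `select_rows_to_save` and `partial_checkpoint`, the `budget` parameter, and the toy embedding table are all hypothetical, and the selection rule (keep the rows with the highest access counts) is only one plausible reading of the idea.

```python
from collections import Counter


def select_rows_to_save(access_counts, budget):
    """Return the ids of the `budget` most frequently accessed rows.

    Hypothetical helper: ties are broken by row id so the result is deterministic.
    """
    ranked = sorted(access_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [row_id for row_id, _ in ranked[:budget]]


def partial_checkpoint(table, access_counts, budget):
    """Copy only the selected rows of `table` into a partial checkpoint."""
    chosen = select_rows_to_save(access_counts, budget)
    return {row_id: list(table[row_id]) for row_id in chosen}


# Toy embedding table (assumed layout): row id -> embedding vector.
table = {0: [0.1, 0.2], 1: [0.3, 0.4], 2: [0.5, 0.6], 3: [0.7, 0.8]}

# Row ids touched since the last checkpoint (made-up access trace).
counts = Counter([2, 0, 2, 3, 2, 0])

# With a budget of two rows, only the two most accessed rows are saved.
ckpt = partial_checkpoint(table, counts, budget=2)
print(sorted(ckpt))  # [0, 2]
```

Under these assumptions, rows that were rarely or never accessed (rows 1 and 3 here) are skipped, which is how a frequency-based priority could cut the volume of data written per checkpoint.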
Written by
Kiwan Maeng
Shivam Bharuka
Isabel Gao
Mark C. Jeffrey
Vikram Saraph
Bor-Yiing Su
Caroline Trippel
Jiyan Yang
Brandon Lucia
Publisher
MLSys 2021