Ranking & Recommendations

Systems Research

CPR: Understanding and Improving Failure Tolerant Training for Deep Learning Recommendation with Partial Recovery

April 8, 2021

Abstract

This paper proposes and optimizes CPR, a partial recovery training system for recommendation models. CPR relaxes the consistency requirement: when a node fails during training, the non-failed nodes proceed without loading checkpoints, which reduces failure-related overheads. To the best of our knowledge, this paper is the first to perform a data-driven, in-depth analysis of applying partial recovery to recommendation models, and it identifies a trade-off between accuracy and performance. Motivated by the analysis, we present CPR, a partial recovery training system that can reduce training time while maintaining the desired level of model accuracy by (1) estimating the benefit of partial recovery, (2) selecting an appropriate checkpoint saving interval, and (3) prioritizing the saving of updates to more frequently accessed parameters. Two variants of CPR, CPR-MFU and CPR-SSU, reduce the checkpoint-related overhead from 8.2–8.5% under full recovery to 0.53–0.68%, on a configuration emulating the failure pattern and overhead of a production-scale cluster. While reducing overhead significantly, CPR achieves model quality on par with the more expensive full recovery scheme when training a state-of-the-art recommendation model on Criteo’s Terabyte CTR dataset. Our results also suggest that CPR can speed up training on a real production-scale cluster without notably degrading accuracy.
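To make the contrast between full and partial recovery concrete, the following is a minimal sketch, not the CPR implementation. It assumes a toy embedding table whose rows are split evenly across nodes, and it uses a simple access counter as a stand-in for the idea of prioritizing frequently accessed parameters. The class `ToyShardedTable`, its method names, and the `budget` parameter are all hypothetical and exist only for illustration.

```python
import copy
from collections import Counter


class ToyShardedTable:
    """Toy embedding table: each node owns a dict of row_id -> value.

    Hypothetical illustration only; not the CPR system from the paper.
    """

    def __init__(self, num_nodes, rows_per_node):
        self.rows_per_node = rows_per_node
        self.shards = [
            {n * rows_per_node + r: 0.0 for r in range(rows_per_node)}
            for n in range(num_nodes)
        ]
        self.checkpoint = copy.deepcopy(self.shards)
        self.access_counts = Counter()  # accesses since the last save

    def owner(self, row_id):
        # Rows are assigned to nodes in contiguous blocks (an assumption).
        return row_id // self.rows_per_node

    def update(self, row_id, delta):
        self.shards[self.owner(row_id)][row_id] += delta
        self.access_counts[row_id] += 1

    def save_most_frequent(self, budget):
        # Checkpoint only the `budget` most frequently accessed rows.
        for row_id, _ in self.access_counts.most_common(budget):
            node = self.owner(row_id)
            self.checkpoint[node][row_id] = self.shards[node][row_id]
        self.access_counts.clear()

    def full_recovery(self):
        # Every node rolls back to the checkpoint, losing recent progress.
        self.shards = copy.deepcopy(self.checkpoint)

    def partial_recovery(self, failed_node):
        # Only the failed node reloads; the other nodes keep their state.
        self.shards[failed_node] = dict(self.checkpoint[failed_node])


# Rows 0 and 1 live on node 0; rows 2 and 3 live on node 1.
table = ToyShardedTable(num_nodes=2, rows_per_node=2)
table.update(0, 1.0)
table.update(0, 1.0)
table.update(1, 1.0)
table.update(2, 1.0)
table.save_most_frequent(budget=1)  # only row 0 (accessed twice) is saved
table.update(2, 1.0)                # progress on node 1 after the save
table.partial_recovery(failed_node=0)
print(table.shards)
```

In this sketch, node 0 loses the unsaved update to row 1 while node 1 keeps all of its progress. That is a small-scale picture of the accuracy versus performance trade-off the abstract describes: the surviving nodes avoid a rollback, at the cost of the failed node's state becoming inconsistent with theirs.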

AUTHORS

Written by

Kiwan Maeng

Shivam Bharuka

Isabel Gao

Mark C. Jeffrey

Vikram Saraph

Bor-Yiing Su

Caroline Trippel

Jiyan Yang

Mike Rabbat

Brandon Lucia

Carole-Jean Wu

Publisher

MLSys 2021

Related Publications

November 30, 2020

Theory

Ranking & Recommendations

On ranking via sorting by estimated expected utility

Nicolas Usunier, Clément Calauzènes

February 01, 2021

Ranking & Recommendations

Anytime Inference with Distilled Hierarchical Neural Ensembles

Adria Ruiz, Jakob Verbeek

November 01, 2018

Ranking & Recommendations

Horizon: Facebook's Open Source Applied Reinforcement Learning Platform

Jason Gauci, Edoardo Conti, Yitao Liang, Kittipat Virochsiri, Yuchen He, Zachary Kaden, Vivek Narayanan, Xiaohui Ye

December 03, 2018

Ranking & Recommendations

NLP

Training with Low-precision Embedding Tables

Jian Zhang, Jiyan Yang, Hector Yuen

May 03, 2019

Ranking & Recommendations

Multi-Perspective Relevance Matching with Hierarchical ConvNets for Social Media Search

Jinfeng Rao, Wei Yang, Yuhao Zhang, Ferhan Ture, Jimmy Lin
