Research

ML Applications

Deep Learning Training in Facebook Data Centers: Design of Scale-up and Scale-out Systems

September 3, 2020

Abstract

Large-scale training is important to ensure high performance and accuracy of machine-learning models. At Facebook we use many different models, including computer vision, video and language models. However, in this paper we focus on the deep learning recommendation models (DLRMs), which are responsible for more than 50% of the training demand in our data centers. Recommendation models present unique challenges in training because they exercise not only compute but also memory capacity as well as memory and network bandwidth. As model size and complexity increase, efficiently scaling training becomes a challenge. To address it we design Zion - Facebook's next-generation large-memory training platform that consists of both CPUs and accelerators. Also, we discuss the design requirements of future scale-out training systems.

Download the Paper

AUTHORS

Written by

Maxim Naumov

Jiyan Yang

John Kim

Dheevatsa Mudigere

Srinivas Sridharan

Xiaodong Wang

Whitney Zhao

Serhat Yilmaz

Changkyu Kim

Hector Yuen

Mustafa Ozdal

Krishnakumar Nair

Isabel Gao

Mikhail Smelyanskiy

Publisher

arXiv

Research Topics

Artificial Intelligence

Human and Machine Learning

Systems Research

Related Publications

June 27, 2025

Human & Machine Intelligence

Conversational AI

Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset

Vasu Agrawal, Akinniyi Akinyemi, Kathryn Alvero, Morteza Behrooz, Julia Buffalini, Fabio Maria Carlucci, Joy Chen, Junming Chen, Zhang Chen, Shiyang Cheng, Praveen Chowdary, Joe Chuang, Antony D'Avirro, Jon Daly, Ning Dong, Mark Duppenthaler, Cynthia Gao, Jeff Girard, Martin Gleize, Sahir Gomez, Hongyu Gong, Srivathsan Govindarajan, Brandon Han, Sen He, Denise Hernandez, Yordan Hristov, Rongjie Huang, Hirofumi Inaguma, Somya Jain, Raj Janardhan, Qingyao Jia, Christopher Klaiber, Dejan Kovachev, Moneish Kumar, Hang Li, Yilei Li, Pavel Litvin, Wei Liu, Guangyao Ma, Jing Ma, Martin Ma, Xutai Ma, Lucas Mantovani, Sagar Miglani, Sreyas Mohan, Louis-Philippe Morency, Evonne Ng, Kam-Woh Ng, Tu Anh Nguyen, Amia Oberai, Benjamin Peloquin, Juan Pino, Jovan Popovic, Omid Poursaeed, Fabian Prada, Alice Rakotoarison, Alexander Richard, Christophe Ropers, Safiyyah Saleem, Vasu Sharma, Alex Shcherbyna, Jie Shen, Anastasis Stathopoulos, Anna Sun, Paden Tomasello, Tuan Tran, Arina Turkatenko, Bo Wan, Chao Wang, Jeff Wang, Mary Williamson, Carleigh Wood, Tao Xiang, Yilin Yang, Zhiyuan Yao, Chen Zhang, Jiemin Zhang, Xinyue Zhang, Jason Zheng, Pavlo Zhyzheria, Jan Zikes, Michael Zollhoefer

June 27, 2025

June 13, 2025

Fairness

Integrity

Measuring multi-calibration

Ido Guy, Daniel Haimovich, Fridolin Linder, Nastaran Okati, Lorenzo Perini, Niek Tax, Mark Tygert

June 13, 2025

June 11, 2025

Computer Vision

IntPhys 2: Benchmarking Intuitive Physics Understanding In Complex Synthetic Environments

Florian Bordes, Quentin Garrido, Justine Kao, Adina Williams, Mike Rabbat, Emmanuel Dupoux

June 11, 2025

June 10, 2025

Computer Vision

A Shortcut-aware Video-QA Benchmark for Physical Understanding via Minimal Video Pairs

Benno Krojer, Mojtaba Komeili, Candace Ross, Quentin Garrido, Koustuv Sinha, Nicolas Ballas, Mido Assran

June 10, 2025

April 08, 2021

Responsible AI

Integrity

Towards measuring fairness in AI: the Casual Conversations dataset

Caner Hazirbas, Joanna Bitton, Brian Dolhansky, Jacqueline Pan, Albert Gordo, Cristian Canton Ferrer

April 08, 2021

April 30, 2018

The Role of Minimal Complexity Functions in Unsupervised Learning of Semantic Mappings | Facebook AI Research

Tomer Galanti, Lior Wolf, Sagie Benaim

April 30, 2018

April 30, 2018

Computer Vision

NAM – Unsupervised Cross-Domain Image Mapping without Cycles or GANs | Facebook AI Research

Yedid Hoshen, Lior Wolf

April 30, 2018

December 11, 2019

Speech & Audio

Computer Vision

Hyper-Graph-Network Decoders for Block Codes | Facebook AI Research

Eliya Nachmani, Lior Wolf

December 11, 2019

Help Us Pioneer The Future of AI

We share our open source frameworks, tools, libraries, and models for everything from research exploration to large-scale production deployment.