NLP

SYSTEMS RESEARCH

LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding

June 14, 2024

Abstract

We present LayerSkip, an end-to-end solution to speed up inference of large language models (LLMs). First, during training we apply layer dropout, with low dropout rates for earlier layers and higher dropout rates for later layers, and an early exit loss where all transformer layers share the same exit. Second, during inference, we show that this training recipe increases the accuracy of early exit at earlier layers, without adding any auxiliary layers or modules to the model. Third, we present a novel self-speculative decoding solution where we exit at early layers and verify and correct with the remaining layers of the model. Our proposed self-speculative decoding approach has a smaller memory footprint than other speculative decoding approaches and benefits from shared compute and activations of the draft and verification stages. We run experiments on different Llama model sizes and different types of training: pretraining from scratch, continual pretraining, finetuning on a specific data domain, and finetuning on a specific task. We implement our inference solution and show speedups of up to 2.16x on summarization of CNN/DM documents, 1.82x on coding, and 2.0x on the TOPv2 semantic parsing task. We open source our code at https://github.com/facebookresearch/LayerSkip.
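
To make the training recipe concrete, here is a minimal PyTorch sketch of the two ideas described above: layer dropout whose rate grows with depth, and an early exit loss computed through a single LM head shared by every layer. It is an illustration rather than the released implementation; the class and argument names are invented for this example, causal masking is omitted, and the paper additionally applies curricula and per-layer scaling to both the dropout rates and the exit losses.

```python
# Illustrative sketch only; names and schedules are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EarlyExitLM(nn.Module):
    """Toy decoder with layer dropout and a single exit head shared by all layers."""

    def __init__(self, vocab_size=32000, d_model=512, num_layers=8, num_heads=8,
                 max_dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
            for _ in range(num_layers)
        )
        # Layer dropout schedule across depth: earlier layers get low rates,
        # later layers get higher rates (linear ramp as a simple stand-in).
        self.drop_p = [max_dropout * i / max(num_layers - 1, 1) for i in range(num_layers)]
        self.norm = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)  # single exit shared by every layer

    def forward(self, tokens, targets=None):
        h = self.embed(tokens)
        loss = h.new_zeros(())
        for i, layer in enumerate(self.layers):
            # Layer dropout: stochastically skip this layer during training.
            if self.training and torch.rand(()).item() < self.drop_p[i]:
                continue
            h = layer(h)  # causal masking omitted to keep the sketch short
            if self.training and targets is not None:
                # Early exit loss: decode every layer's hidden state with the
                # same shared LM head and supervise it on the same targets.
                logits = self.lm_head(self.norm(h))
                loss = loss + F.cross_entropy(
                    logits.view(-1, logits.size(-1)), targets.view(-1)
                )
        return self.lm_head(self.norm(h)), loss
```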
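
The self-speculative decoding loop can likewise be outlined in a few lines. This is only a sketch under stated assumptions: forward_up_to and forward_full are hypothetical helpers for running the first exit_layer layers versus the whole model, decoding is greedy, and the KV-cache reuse that lets verification share the draft stage's compute and activations is omitted.

```python
# Illustrative sketch only; forward_up_to / forward_full are hypothetical helpers.
import torch


@torch.no_grad()
def self_speculative_decode(model, tokens, exit_layer=4, num_draft=4, max_len=64):
    """Greedy self-speculative decoding sketch; `tokens` has shape (1, T)."""
    while tokens.size(1) < max_len:
        prompt_len = tokens.size(1)

        # Draft stage: propose `num_draft` tokens using only the first
        # `exit_layer` layers and the shared LM head.
        draft = tokens
        for _ in range(num_draft):
            logits = model.forward_up_to(draft, exit_layer)  # hypothetical helper
            draft = torch.cat([draft, logits[:, -1:].argmax(-1)], dim=1)

        # Verify stage: a single full-model pass over prompt + draft; the
        # remaining layers act as the verifier, so no separate draft model
        # (and no second set of weights) is needed.
        full_logits = model.forward_full(draft)              # hypothetical helper
        verified = full_logits[:, prompt_len - 1:-1].argmax(-1)
        proposed = draft[:, prompt_len:]

        # Accept the longest prefix where draft and verifier agree, then take
        # one corrected token from the verifier at the first mismatch.
        matches = (verified == proposed).long().cumprod(dim=1)
        n_accept = int(matches.sum().item())
        tokens = torch.cat(
            [tokens, proposed[:, :n_accept], verified[:, n_accept:n_accept + 1]],
            dim=1,
        )
    return tokens
```

In this scheme the speedup comes from accepted draft tokens costing only the early layers, while the full model verifies an entire draft in a single pass.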

AUTHORS

Mostafa Elhoushi

Akshat Shrivastava

Diana Liskovich

Basil Hosmer

Bram Wasti

Liangzhen Lai

Bilge Acun

Ahmed Aly

Beidi Chen

Carole-Jean Wu

Ahmed Roman

Anas Mahmoud

Saurabh Agarwal

Publisher

ACL

