CORE MACHINE LEARNING

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

March 13, 2024

Abstract

Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank adaptation (LoRA), add a trainable low-rank matrix to the frozen pre-trained weight in each layer, reducing trainable parameters and optimizer states. However, such approaches typically underperform training with full-rank weights in both pre-training and fine-tuning stages since they limit the parameter search to a low-rank subspace and alter the training dynamics, and further, may require full-rank warm start. In this work, we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA. Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies.

Download the Paper

AUTHORS

Written by

Jiawei Zhao

Zhenyu Zhang

Beidi Chen

Zhangyang Wang

Anima Anandkumar

Yuandong Tian

Publisher

arXiv

Research Topics

Core Machine Learning

Related Publications

July 21, 2024

CORE MACHINE LEARNING

From Neurons to Neutrons: A Case Study in Mechanistic Interpretability

Ouail Kitouni, Niklas Nolte, Samuel Pérez Díaz, Sokratis Trifinopoulos, Mike Williams

July 21, 2024

July 08, 2024

THEORY

CORE MACHINE LEARNING

An Adaptive Stochastic Gradient Method with Non-negative Gauss-Newton Stepsizes

Antonio Orvieto, Lin Xiao

July 08, 2024

June 17, 2024

HUMAN & MACHINE INTELLIGENCE

COMPUTER VISION

D-Flow: Differentiating through Flows for Controlled Generation

Heli Ben-Hamu, Omri Puny, Itai Gat, Brian Karrer, Uriel Singer, Yaron Lipman

June 17, 2024

June 17, 2024

COMPUTER VISION

CORE MACHINE LEARNING

Bespoke Non-Stationary Solvers for Fast Sampling of Diffusion and Flow Models

Neta Shaul, Uriel Singer, Ricky Chen, Matt Le, Ali Thabet, Albert Pumarola, Yaron Lipman

June 17, 2024

Help Us Pioneer The Future of AI

We share our open source frameworks, tools, libraries, and models for everything from research exploration to large-scale production deployment.