SYSTEMS RESEARCH

CHAI: Clustered Head Attention for Efficient LLM Inference

June 03, 2024

Abstract

Large Language Models (LLMs) with hundreds of billions of parameters have transformed the field of machine learning. However, serving these models at inference time is both compute and memory intensive, where a single request can require multiple GPUs and tens of Gigabytes of memory. Multi-Head Attention is one of the key components of LLMs, which can account for over 50% of LLMs memory and compute requirement. We observe that there is a high amount of redundancy across heads on which tokens they pay attention to. Based on this insight, we propose Clustered Head Attention ( CHAI ). CHAI combines heads with a high amount of correlation for self-attention at runtime, thus reducing both memory and compute. In our experiments, we show that CHAI is able to reduce the memory requirements for storing K,V cache by up to 21.4% and inference time latency by up to 1.73× without any fine-tuning required. CHAI achieves this with a maximum 3.2% deviation in accuracy across 3 different models (i.e. OPT-66B, LLAMA-7B, LLAMA-33B) and 5 different evaluation datasets.

Download the Paper

AUTHORS

Written by

Saurabh Agarwal

Bilge Acun

Basil Hosmer

Mostafa Elhoushi

Yejin Lee

Carole-Jean Wu

Publisher

ICML

Research Topics

Systems Research

Related Publications

November 20, 2024

SYSTEMS RESEARCH

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao

November 20, 2024

July 23, 2024

SYSTEMS RESEARCH

CYBERSECEVAL 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models

Shengye Wan, Cyrus Nikolaidis, Daniel Song, David Molnar, James Crnkovich, Jayson Grace, Manish Bhatt, Sahana Chennabasappa, Spencer Whitman, Stephanie Ding, Vlad Ionescu, Yue Li, Joshua Saxe

July 23, 2024

June 27, 2024

SYSTEMS RESEARCH

Meta Large Language Model Compiler: Foundation Models of Compiler Optimization

Chris Cummins, Volker Seeker, Dejan Grubisic, Baptiste Rozière, Jonas Gehring, Gabriel Synnaeve, Hugh Leather

June 27, 2024

June 14, 2024

NLP

SYSTEMS RESEARCH

LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding

Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Bilge Acun, Ahmed Aly, Beidi Chen, Carole-Jean Wu, Ahmed Roman, Nas Mahmoud, Saurabh Agarwal

June 14, 2024

Help Us Pioneer The Future of AI

We share our open source frameworks, tools, libraries, and models for everything from research exploration to large-scale production deployment.