Products

AI Research

Resources

About

Overview
Open Source
Careers

SYSTEMS RESEARCH

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

November 20, 2024

Abstract

Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. elaborated an approach to speed up attention on GPUs through minimizing memory reads/writes. However, it has yet to take advantage of new capabilities present in recent hardware, with FlashAttention-2 achieving only 35% utilization on the H100 GPU. We develop three main techniques to speed up attention on Hopper GPUs: exploiting asynchrony of the Tensor Cores and TMA to (1) overlap overall computation and data movement via warp-specialization and (2) interleave block-wise matmul and softmax operations, and (3) block quantization and incoherent processing that leverages hardware support for FP8 low-precision. We demonstrate that our method, FlashAttention-3, achieves speedup on H100 GPUs by 1.5-2.0X with BF16 reaching up to 840 TFLOPs/s (85% utilization), and with FP8 reaching 1.3 PFLOPs/s. We validate that FP8 FlashAttention-3 achieves 2.6X lower numerical error than a baseline FP8 attention.

Download the Paper

AUTHORS

Written by

Jay Shah

Ganesh Bikshandi

Vijay Thakkar

Pradeep Ramani

Tri Dao

Ying Zhang

Publisher

NeurIPS

Research Topics

Systems Research

Related Publications

November 11, 2025

COMPUTER VISION

SYSTEMS RESEARCH

CATransformers: Carbon Aware Transformers Through Joint Model-Hardware Optimization

Irene Wang, Mostafa Elhouishi, Divya Mahajan, Bilge Acun, Carole-Jean Wu, Daniel Jiang, Ekin Sumbul, Newsha Ardalani, Samuel Hsia

November 11, 2025

Read the Paper

February 28, 2025

SYSTEMS RESEARCH

Revisiting Reliability in Large-Scale Machine Learning Research Clusters

Apostolos Kokolis, Adithya Kumar, Carole-Jean Wu, Faye Ma, John Hoffman, Kalyan Saladi, Michael Kuchnik, Parth Malani, Shubho Sengupta, Zachary DeVito

February 28, 2025

Read the Paper

December 12, 2024

CORE MACHINE LEARNING

SYSTEMS RESEARCH

Croissant: A Metadata Format for ML-Ready Datasets

Mubashara Akhtar, Omar Benjelloun, Costanza Conforti, Luca Foschini, Pieter Gijsbers, Joan Giner-Miguelez, Sujata Goswami, Nitisha Jain, Michalis Karamousadakis, Satyapriya Krishna, Sylvain Lesage, Quentin Lhoest, Pierre Marcenac, Manil Maskey, Peter Mattson, Luis Oala, Hamidah Oderinwale, Pierre Ruyssen, Tim Santos, Rajat Shinde, Elena Simperl, Arjun Suresh, Goeffry Thomas, Slava Tykhonov, Joaquin Vanschoren, Susheel Varma, Jos van der Velde, Steffen Vogler, Luyao Zhang, Michael Kuchnik, Carole-Jean Wu

December 12, 2024

Read the Paper

July 23, 2024

SYSTEMS RESEARCH

CYBERSECEVAL 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models

Shengye Wan, Cyrus Nikolaidis, Daniel Song, David Molnar, James Crnkovich, Jayson Grace, Joshua Saxe, Manish Bhatt, Sahana Chennabasappa, Spencer Whitman, Stephanie Ding, Vlad Ionescu, Yue Li