January 09, 2024
We propose an implementation of an efficient fused matrix multiplication kernel for W4A16 quantized inference, performing dequantization and GEMM in a single fused kernel using a SplitK work decomposition. Our implementation shows improvement for the skinny matrix-matrix multiplications found in foundation model inference workloads. In particular, this paper focuses on the multiplication of a skinny activation matrix with a square weight matrix. Our results show an average 65% speed improvement on A100, and an average 124% speed improvement on H100 (with a peak of 295%), across a range of matrix dimensions including those found in a llama-style model, where m < n = k.
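The SplitK idea can be illustrated in a few lines of NumPy: the K (reduction) dimension is partitioned into chunks, each chunk produces a partial product, and the partials are summed at the end. This is only a minimal sketch of the decomposition, not the paper's Triton implementation; the function name and `split_k` parameter are illustrative, and on the GPU the chunks run in parallel with the reduction done via atomic adds, with dequantization of each weight chunk fused in before the multiply.

```python
import numpy as np

def splitk_matmul(a, b, split_k=4):
    """Illustrative SplitK work decomposition (sketch, not the paper's kernel).

    The reduction dimension K is split into `split_k` chunks; each chunk
    computes a partial GEMM and the partials are accumulated. In the fused
    W4A16 kernel, each weight chunk would be dequantized before the multiply,
    and the accumulation would use atomic adds across parallel workgroups.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and k % split_k == 0, "K must be divisible by split_k"
    chunk = k // split_k
    out = np.zeros((m, n), dtype=np.result_type(a, b))
    for s in range(split_k):  # these iterations run in parallel on the GPU
        out += a[:, s * chunk:(s + 1) * chunk] @ b[s * chunk:(s + 1) * chunk, :]
    return out
```

For the skinny shapes in question (small m, large n = k), splitting K exposes more parallelism than a standard tile decomposition, which is why it helps at inference batch sizes.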
Written by: Less Wright, Adnan Hoque
Publisher: arxiv.org
Research Topics: Core Machine Learning