January 09, 2024
We propose an implementation of an efficient fused matrix multiplication kernel for W4A16 quantized inference, where we perform dequantization and GEMM in a fused kernel using a SplitK work decomposition. Our implementation shows improvement for the type of skinny matrix-matrix multiplications found in foundation model inference workloads. In particular, this paper surveys the type of matrix multiplication between a skinny activation matrix and a square weight matrix. Our results show an average of 65\% speed improvement on A100, and an average of 124\% speed improvement on H100 (with a peak of 295\%) for a range of matrix dimensions including those found in a llama-style model, where m < n = k.
Written by
Less Wright
Adnan Hoque
Publisher
arxiv.org
Research Topics
Core Machine Learning
November 20, 2024
Igor Fedorov, Kate Plawiak, Lemeng Wu, Tarek Elgamal, Naveen Suda, Eric Smith, Hongyuan Zhan, Jianfeng Chi, Yuriy Hulovatyy, Kimish Patel, Zechun Liu, Yangyang Shi, Tijmen Blankevoort, Mahesh Pasupuleti, Bilge Soran, Zacharie Delpierre Coudert, Rachad Alao, Raghuraman Krishnamoorthi, Vikas Chandra
November 20, 2024
November 14, 2024
Zhaoyu Li, Jialiang Sun, Logan Murphy, Qidong Su, Zenan Li, Xian Zhang, Kaiyu Yang, Xujie Si
November 14, 2024
November 06, 2024
Aaron Defazio, Alice Yang, Harsh Mehta, Konstantin Mishchenko, Ahmed Khaled, Ashok Cutkosky
November 06, 2024
August 16, 2024
Zhihan Xiong, Maryam Fazel, Lin Xiao
August 16, 2024
Foundational models
Latest news
Foundational models