December 07, 2018
Reducing hardware overhead of neural networks for faster or lower power inference and training is an active area of research. Uniform quantization using integer multiply-add has been thoroughly investigated, which requires learning many quantization parameters, fine-tuning training or other prerequisites. Little effort is made to improve floating point relative to this baseline; it remains energy inefficient, and word size reduction yields drastic loss in needed dynamic range. We improve floating point to be more energy efficient than equivalent bit width integer hardware on a 28 nm ASIC process while retaining accuracy in 8 bits with a novel hybrid log multiply/linear add, Kulisch accumulation and tapered encodings from Gustafson’s posit format. With no network retraining, and drop-in replacement of all math and float32 parameters via round-to-nearest-even only, this open-sourced 8-bit log float is within 0.9% top-1 and 0.2% top-5 accuracy of the original float32 ResNet-50 CNN model on ImageNet. Unlike int8 quantization, it is still a general purpose floating point arithmetic, interpretable out-of-the-box. Our 8/38-bit log float multiply-add is synthesized and power profiled at 28 nm at 0.96× the power and 1.12× the area of 8/32-bit integer multiply-add. In 16 bits, our log float multiply-add is 0.59× the power and 0.68× the area of IEEE 754 float16 fused multiply-add, maintaining the same signficand precision and dynamic range, proving useful for training ASICs as well.
June 19, 2023
Benoit Steiner, Mostafa Elhoushi, Jacob Kahn, James Hegarty
June 19, 2023
April 26, 2023
Ashkan Yousefpour, Shen Guo, Ashish Shenoy, Sayan Ghosh, Pierre Stock, Kiwan Maeng, Schalk Krüger, Mike Rabbat, Carole-Jean Wu, Ilya Mironov
April 26, 2023
November 28, 2022
Nicolas Ballas, Bernhard Schölkopf, Chris Pal, Francesco Locatello, Li Erran, Martin Weiss, Nasim Rahaman, Yoshua Bengio
November 28, 2022
November 27, 2022
Andrea Tirinzoni, Aymen Al Marjani, Emilie Kaufmann
November 27, 2022
Who We Are
Our Actions
Newsletter