Takeaways
At Connect 2024 last month, we open sourced Llama 3.2 1B and 3B, our smallest models yet, to address the demand for on-device and edge deployments. Since their release, we’ve seen not just how the community has adopted our lightweight models, but also how grassroots developers are quantizing them to save capacity and memory, often at the cost of performance and accuracy.
As we’ve shared before, we want to make it easier for more developers to build with Llama without needing significant compute resources and expertise. Today, we’re sharing quantized versions of the Llama 3.2 1B and 3B models. These models offer a reduced memory footprint, faster on-device inference, and greater portability, all while maintaining accuracy, quality, and safety, so developers can deploy them on resource-constrained devices. Given the limited runtime memory available on mobile devices, we prioritized short-context applications of up to 8K tokens for these new quantized models.

Our results show we can achieve superior accuracy by training with quantization rather than applying quantization as a post-processing step. The models we are sharing today deliver a 2-4x speedup and an average 56% reduction in model size compared to the original BF16 format, based on testing with an Android OnePlus 12 device. They also reduce memory usage by an average of 41%. Starting today, the community can deploy our quantized models onto more mobile CPUs, giving them the opportunity to build unique experiences that are fast and offer more privacy, since interactions stay entirely on device.
We developed these state-of-the-art models using Quantization-Aware Training with LoRA adaptors (QLoRA) to optimize performance in low-precision environments. We also used SpinQuant, a post-training technique that lets us determine the combination of compression settings that retains the most quality. As a result of close collaborative work with our industry-leading partners, the QLoRA and SpinQuant Llama models are available on Qualcomm and MediaTek SoCs with Arm CPUs. The performance of the quantized models has been optimized for mobile CPUs using Kleidi AI kernels, and we’re currently collaborating with our partners to utilize NPUs for even greater performance with Llama 1B/3B.
Our quantization setup
We designed the current quantization scheme with PyTorch’s ExecuTorch inference framework and Arm CPU backend in mind, taking into account metrics including model quality, prefill/decoding speed, and memory footprint. Our quantization scheme involves three parts:

- All linear layers in all transformer blocks are quantized to a 4-bit groupwise scheme (with a group size of 32) for weights and 8-bit per-token dynamic quantization for activations.
- The classification layer is quantized to 8-bit per-channel for weights and 8-bit per-token dynamic quantization for activations.
- Similar to the classification layer, the embedding layer uses 8-bit per-channel quantization.
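To make the numeric format concrete, here is a minimal sketch of applying a comparable configuration (4-bit groupwise weights with 8-bit dynamic per-token activations) to a toy model with torchao's quantize_ API. It illustrates the general recipe rather than our exact production pipeline, and API names can shift between torchao releases.

```python
# A minimal sketch of a comparable scheme using torchao: 4-bit groupwise weights
# (group size 32) with 8-bit dynamic per-token activation quantization, applied
# to every linear layer. API locations/names can differ across torchao releases.
import torch
from torchao.quantization import quantize_, int8_dynamic_activation_int4_weight

# Toy stand-in for a stack of transformer-block linear layers.
model = torch.nn.Sequential(
    torch.nn.Linear(256, 256),
    torch.nn.SiLU(),
    torch.nn.Linear(256, 256),
)

# Rewrites the Linear modules in place with quantized weights and
# dynamic activation quantization on the forward pass.
quantize_(model, int8_dynamic_activation_int4_weight(group_size=32))

with torch.no_grad():
    out = model(torch.randn(4, 256))
print(out.shape)  # the quantized model is a drop-in replacement
```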
Quantization-Aware Training and LoRA
We employ Quantization-Aware Training (QAT) to simulate the effects of quantization during the training of Llama 3.2 models, enabling us to optimize their performance in low-precision environments. To initialize QAT, we utilize BF16 Llama 3.2 model checkpoints obtained after supervised fine-tuning (SFT) and perform an additional full round of SFT training with QAT. We then freeze the backbone of the QAT model and perform another round of SFT with low-rank adaptation (LoRA) adaptors applied to all layers within the transformer block. The LoRA adaptors’ weights and activations, meanwhile, are maintained in BF16. Because our approach is similar to QLoRA in principle (i.e., quantization followed by LoRA adaptors), we refer to it as QLoRA in this post.
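For readers who want to build a similar recipe on open tooling, the sketch below shows one way to freeze a backbone and attach BF16 LoRA adaptors with Hugging Face peft. The rank and target module names are illustrative assumptions, not our exact training configuration.

```python
# Minimal sketch: freeze a backbone and attach LoRA adaptors (kept in BF16) to the
# projection layers inside each transformer block, using Hugging Face peft.
# The rank and target module names below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct", torch_dtype=torch.bfloat16
)

# Freeze the entire backbone; only the adaptor weights will receive gradients.
for param in model.parameters():
    param.requires_grad = False

lora_config = LoraConfig(
    r=16,                # illustrative rank, not the production setting
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only LoRA adaptor parameters remain trainable
```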
Finally, we fine-tune the resulting model (both backbone and LoRA adaptors) using direct preference optimization (DPO). The result is a highly efficient model that achieves accuracy competitive with the BF16 model while maintaining a speed and memory footprint comparable to other quantization methods (see the figure below).
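As a rough illustration of this final alignment step, here is a minimal DPO sketch using Hugging Face TRL. The preference dataset and hyperparameters are placeholders, and argument names vary slightly across trl versions, so treat it as a starting point rather than our training recipe.

```python
# Rough DPO sketch with Hugging Face TRL. The preference dataset and hyperparameters
# are placeholders; argument names (e.g., processing_class vs. tokenizer) vary
# between trl versions, so check the documentation for the version you install.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "meta-llama/Llama-3.2-1B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Any dataset with "prompt", "chosen", and "rejected" columns works here.
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

args = DPOConfig(
    output_dir="llama-3.2-1b-dpo",
    beta=0.1,                        # controls how far the policy may drift from the reference model
    per_device_train_batch_size=1,
)
trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```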
We used torchao APIs to perform QAT. Developers can also use the QAT model as a foundation and fine-tune it with LoRA for their bespoke use cases, saving time and computational cost.
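Below is a minimal sketch of that QAT flow using torchao's quantizer API, assuming a recent torchao release (older versions expose the same class under torchao.quantization.prototype.qat). The toy model and loop stand in for the real SFT run on Llama 3.2.

```python
# Minimal QAT sketch using torchao's quantizer API. Recent releases expose this
# under torchao.quantization.qat; older ones under torchao.quantization.prototype.qat.
# The toy model and training loop below stand in for the real SFT run.
import torch
from torchao.quantization.qat import Int8DynActInt4WeightQATQuantizer

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.SiLU(),
    torch.nn.Linear(512, 512),
)

quantizer = Int8DynActInt4WeightQATQuantizer(groupsize=32)

# 1. Insert fake-quantization ops so training "sees" quantization error.
model = quantizer.prepare(model)

# 2. Run the usual fine-tuning loop on the prepared model.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
for _ in range(10):
    loss = model(torch.randn(8, 512)).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# 3. Swap the fake-quantized modules for actual quantized ones.
model = quantizer.convert(model)
```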
SpinQuant
Although QAT gives the best results, some developers may want to quantize their own fine-tuned 1B and 3B models, or quantize the models for different targets with different quantization settings. For this reason, we are also releasing the models and method of SpinQuant, a state-of-the-art technique for post-training quantization.
While the method is less accurate than QAT + LoRA, a key advantage of SpinQuant is its portability and ability to operate without requiring access to training datasets, which are often private. It’s an attractive solution for applications where data availability or computational resources are limited. Developers can use this method to take their own fine-tuned Llama models and quantize them for different hardware targets and use cases, using the open source repository that is fully compatible with ExecuTorch and Llama Stack.
In our experiments, we utilize WikiText, a small calibration dataset, to learn the rotation matrices in SpinQuant. These matrices smooth out outliers and facilitate more effective quantization. After this, best practices in quantization, such as range setting and generative post-training quantization (GPTQ), are applied. The SpinQuant matrices are optimized for the same quantization scheme as QAT + LoRA.
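To give a feel for why learned rotations help, the toy example below shows that multiplying activations by an orthogonal matrix (and counter-rotating the weights) leaves a linear layer's output unchanged while spreading an outlier channel across dimensions, shrinking the dynamic range a low-bit quantizer has to cover. It uses a random orthogonal matrix purely for illustration; SpinQuant learns its rotations.

```python
# Toy illustration of the rotation idea behind SpinQuant (not the released method):
# an orthogonal rotation R leaves the layer output unchanged, because
# (x @ R) @ (R.T @ W) == x @ W, but it spreads outlier channels across dimensions,
# shrinking the dynamic range a low-bit quantizer has to cover.
import torch

torch.manual_seed(0)
d = 256
x = torch.randn(1024, d)
x[:, 7] *= 50.0                     # inject an outlier channel
w = torch.randn(d, d) / d ** 0.5

rot, _ = torch.linalg.qr(torch.randn(d, d))   # random orthogonal matrix

y_ref = x @ w                       # original computation
y_rot = (x @ rot) @ (rot.T @ w)     # rotated activations, counter-rotated weights
print(torch.allclose(y_ref, y_rot, atol=1e-3))  # True: outputs match

def peak_to_rms(t):
    """Ratio of the largest magnitude to the RMS value: a proxy for outlier severity."""
    return (t.abs().max() / t.pow(2).mean().sqrt()).item()

# The rotated activations have a far smaller peak-to-RMS ratio,
# which is what makes low-bit quantization less lossy.
print(peak_to_rms(x), peak_to_rms(x @ rot))
```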
Results
In the table below, we show a comprehensive evaluation of models quantized with vanilla post-training quantization (PTQ), with SpinQuant, which produces state-of-the-art PTQ quality, and with QLoRA, which gives the best quality of all.
In the table below, we compare the performance metrics of the different quantization methods (SpinQuant and QAT + LoRA) with the BF16 baseline. The evaluation was done using the ExecuTorch framework as the inference engine, with the Arm CPU as the backend. The quantized models were optimized primarily for the Arm CPU architecture by leveraging the Kleidi AI library.
Decode latency improved by 2.5x and prefill latency improved by 4.2x on average, while model size decreased by 56% and memory usage decreased by 41% on average. The benchmarks can be reproduced today by following the ExecuTorch Llama instructions. The table above shows results using an Android OnePlus 12 device; however, we’ve also verified similar relative performance on the Samsung S24+ for 1B and 3B, and on the Samsung S22 for 1B. For iOS devices, we’ve verified that these models run with comparable accuracy, but we haven’t yet evaluated performance.
Besides CPUs, we’re currently collaborating with partners to utilize NPUs with these quantized models for even greater performance. Our partners have already integrated the foundational components needed to leverage NPUs into the ExecuTorch open source ecosystem, and work is underway to specifically enable quantization on NPUs for Llama 1B/3B.
Looking to the future
We’ve been inspired and encouraged by the excitement and progress the community has achieved with Llama in just a short span of time. This year, Llama has achieved 10x growth and become the standard for responsible innovation. Llama also continues to lead on openness, modifiability, and cost efficiency and is competitive with closed models—even leading in some areas. As always, we can’t wait to see what the community builds using Llama and the powerful experiences they’ll enable on mobile devices.
We’re making Llama 3.2 models available for download on llama.com and Hugging Face.
We’d like to acknowledge the close collaboration of our partners: Arm, Hugging Face, MediaTek, Ollama, and Qualcomm.