Large Language Model
Introducing quantized Llama models with increased speed and a reduced memory footprint
October 24, 2024
7 minute read

Takeaways


  • Today, we’re releasing our first lightweight quantized Llama models, which are small and performant enough to run on many popular mobile devices. At Meta, we’re uniquely positioned to provide quantized models because of our access to compute resources, training data, full evaluations, and safety testing.
  • As our first quantized models in this Llama category, these instruction-tuned models meet the same quality and safety requirements as the original 1B and 3B models while achieving a 2-4x speedup. We also achieve an average 56% reduction in model size and an average 41% reduction in memory usage compared to the original BF16 format.
  • We used two techniques for quantizing the Llama 3.2 1B and 3B models: Quantization-Aware Training with LoRA adaptors, which prioritizes accuracy, and SpinQuant, a state-of-the-art post-training quantization method that prioritizes portability.
  • Inference with both quantization techniques is supported in the Llama Stack reference implementation via PyTorch’s ExecuTorch framework.
  • We built these quantized models in close collaboration with our industry-leading partners and are making them available on Qualcomm and MediaTek SoCs with Arm CPUs.

(Figure: on-device performance improvements for the quantized 1B model. Similar improvements were observed for the 3B model; see the Results section for more details.)

At Connect 2024 last month, we open sourced Llama 3.2 1B and 3B, our smallest models yet, to address the demand for on-device and edge deployments. Since their release, we’ve seen not just how the community has adopted our lightweight models, but also how grassroots developers are quantizing them to reduce their size and memory footprint, often at a cost to performance and accuracy.

As we’ve shared before, we want to make it easier for more developers to build with Llama without needing significant compute resources and expertise. Today, we’re sharing quantized versions of the Llama 3.2 1B and 3B models. These models offer a reduced memory footprint, faster on-device inference, and portability, all while maintaining accuracy, quality, and safety, so developers can deploy them on resource-constrained devices. Given the limited runtime memory available on mobile devices, we prioritized short-context applications up to 8K tokens for these new quantized models. Our results show we can achieve superior accuracy by training with quantization rather than applying quantization as a post-processing step. The models we are sharing today achieve a 2-4x speedup and an average 56% reduction in model size compared to the original BF16 format, based on testing on Android OnePlus 12 devices. They also reduce memory usage by an average of 41%. Starting today, the community can deploy our quantized models onto more mobile CPUs, giving them the opportunity to build unique experiences that are fast and offer more privacy, since interactions stay entirely on device.

We developed these state-of-the-art models using Quantization-Aware Training with LoRA adaptors (QLoRA) to optimize performance in low-precision environments. We also used SpinQuant, a post-training technique that lets us find a compression configuration that retains as much quality as possible. As a result of close collaborative work with our industry-leading partners, the QLoRA and SpinQuant Llama models are available on Qualcomm and MediaTek SoCs with Arm CPUs. The performance of the quantized models has been optimized for mobile CPUs using KleidiAI kernels, and we’re currently collaborating with our partners to utilize NPUs for even greater performance for Llama 1B/3B.


Our quantization setup

We designed the current quantization scheme with PyTorch’s ExecuTorch inference framework and Arm CPU backend in mind, taking into account metrics including model quality, prefill/decoding speed, and memory footprint. Our quantization scheme involves three parts (a short sketch of the underlying arithmetic follows the list):

  • We quantize all linear layers in all transformer blocks to a 4-bit groupwise scheme (with a group size of 32) for weights and 8-bit per-token dynamic quantization for activations.
  • The classification layer is quantized to 8-bit per-channel for weights and 8-bit per-token dynamic quantization for activations.
  • We employ 8-bit per-channel quantization for the embedding layer.
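
To make the scheme concrete, here is a minimal sketch of the arithmetic it describes, written in plain PyTorch rather than the production ExecuTorch/torchao path; the layer shapes and function names are illustrative only, and real kernels pack the 4-bit codes and fuse the integer matmul on device.

```python
import torch

def quantize_weight_4bit_groupwise(w: torch.Tensor, group_size: int = 32):
    """Symmetric 4-bit groupwise weight quantization: one scale per group of 32
    weights, values clamped to the int4 range [-8, 7]."""
    out_f, in_f = w.shape
    groups = w.reshape(out_f, in_f // group_size, group_size)
    scales = (groups.abs().amax(dim=-1, keepdim=True) / 7.0).clamp_min(1e-9)
    q = torch.clamp(torch.round(groups / scales), -8, 7).to(torch.int8)
    return q, scales  # stored as int8 here; real kernels pack two codes per byte

def quantize_activation_int8_per_token(x: torch.Tensor):
    """Dynamic 8-bit per-token activation quantization: scales are computed at
    runtime, one per token (row), over the symmetric int8 range [-128, 127]."""
    scales = (x.abs().amax(dim=-1, keepdim=True) / 127.0).clamp_min(1e-9)
    q = torch.clamp(torch.round(x / scales), -128, 127).to(torch.int8)
    return q, scales

# Example: quantize one (hypothetical) linear layer and run a dequantized matmul.
w = torch.randn(2048, 2048)              # weight of a transformer linear layer
x = torch.randn(16, 2048)                # activations for 16 tokens
qw, w_scales = quantize_weight_4bit_groupwise(w)
qx, x_scales = quantize_activation_int8_per_token(x)
w_dq = (qw.float() * w_scales).reshape_as(w)
x_dq = qx.float() * x_scales
y = x_dq @ w_dq.t()                      # on device, integer kernels fuse these steps
```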

Quantization-Aware Training and LoRA

We employ Quantization-Aware Training (QAT) to simulate the effects of quantization during the training of the Llama 3.2 models, enabling us to optimize their performance in low-precision environments. To initialize QAT, we utilize BF16 Llama 3.2 model checkpoints obtained after supervised fine-tuning (SFT) and perform an additional full round of SFT training with QAT. We then freeze the backbone of the QAT model and perform another round of SFT with low-rank adaptation (LoRA) adaptors applied to all layers within the transformer block. The LoRA adaptors' weights and activations are maintained in BF16 throughout. Because our approach is similar to QLoRA in principle (i.e., quantization followed by LoRA adaptors), we refer to it as QLoRA in this post.
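
As a rough illustration of that second stage, a minimal sketch is shown below, with the rank, shapes, and class name chosen purely for illustration: the frozen backbone weight is fake-quantized on the fly with the 4-bit groupwise scheme, while a trainable LoRA branch runs in BF16. (In the first QAT round the backbone itself is trainable and the rounding step is typically handled with a straight-through estimator, which this sketch omits.)

```python
import torch
import torch.nn as nn

class FakeQuantLoRALinear(nn.Module):
    """Illustration of the frozen-backbone + LoRA stage: the base weight is
    fake-quantized (4-bit groupwise) in the forward pass, while a trainable
    LoRA branch runs in BF16."""

    def __init__(self, base: nn.Linear, rank: int = 16, group_size: int = 32):
        super().__init__()
        # Frozen QAT backbone weight.
        self.weight = nn.Parameter(base.weight.detach().clone(), requires_grad=False)
        self.group_size = group_size
        # Trainable low-rank adaptor kept in BF16.
        self.lora_a = nn.Linear(base.in_features, rank, bias=False, dtype=torch.bfloat16)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False, dtype=torch.bfloat16)
        nn.init.zeros_(self.lora_b.weight)  # adaptor starts as a no-op

    def fake_quant(self, w: torch.Tensor) -> torch.Tensor:
        # Quantize-dequantize with the 4-bit groupwise scheme described above.
        out_f, in_f = w.shape
        g = w.reshape(out_f, in_f // self.group_size, self.group_size)
        s = (g.abs().amax(dim=-1, keepdim=True) / 7.0).clamp_min(1e-9)
        return (torch.clamp(torch.round(g / s), -8, 7) * s).reshape_as(w)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = x @ self.fake_quant(self.weight).t()                            # quantized path
        y = y + self.lora_b(self.lora_a(x.to(torch.bfloat16))).to(x.dtype)  # BF16 LoRA path
        return y

# Usage: wrap each linear layer in a transformer block, then run SFT so that
# only the LoRA parameters receive gradients.
layer = FakeQuantLoRALinear(nn.Linear(2048, 2048))
out = layer(torch.randn(4, 2048))
```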

Finally, we fine-tune the resulting model (both backbone and LoRA adaptors) using direct preference optimization (DPO). The result is a highly efficient model that achieves accuracy competitive with the BF16 model while maintaining a speed and memory footprint comparable to other quantization methods (see the figure below).

We used torchao APIs to do QAT. Developers can also use the QAT model as a foundation and fine-tune it with LoRA for their bespoke use cases, saving time and computational cost.
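
For reference, the torchao QAT flow follows a prepare/train/convert pattern, roughly as sketched below; the exact module path and argument names have shifted between torchao releases, and the toy model stands in for the actual Llama backbone, so treat this as an outline rather than the precise API.

```python
import torch
import torch.nn as nn

# Tiny stand-in for the Llama backbone; in practice this is the BF16
# Llama 3.2 checkpoint obtained after SFT.
model = nn.Sequential(nn.Linear(256, 256), nn.SiLU(), nn.Linear(256, 256))

# NOTE: this module path is the one used by early torchao releases and may
# have moved in newer versions.
from torchao.quantization.prototype.qat import Int8DynActInt4WeightQATQuantizer

qat_quantizer = Int8DynActInt4WeightQATQuantizer(groupsize=32)

# 1. Swap linear layers for fake-quantized versions.
model = qat_quantizer.prepare(model)

# 2. Run SFT as usual: the forward pass now simulates 4-bit groupwise weights
#    and 8-bit per-token dynamic activations, so the model learns to tolerate
#    quantization error.

# 3. Replace the fake-quantized modules with actually quantized ones for export.
model = qat_quantizer.convert(model)
```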


SpinQuant

Although QAT gives the best results, some developers may want to quantize their own fine-tuned 1B and 3B models, or quantize the models for different targets with different quantization settings. For this reason, we are also releasing the models and method of SpinQuant, a state-of-the-art post-training quantization technique.

While the method is less accurate than QAT + LoRA, a key advantage of SpinQuant is its portability and ability to operate without requiring access to training datasets, which are often private. It’s an attractive solution for applications where data availability or computational resources are limited. Developers can use this method to take their own fine-tuned Llama models and quantize them for different hardware targets and use cases, using the open source repository that is fully compatible with ExecuTorch and Llama Stack.

In our experiments, we utilize WikiText, a small calibration dataset, to learn the rotation matrices in SpinQuant. These matrices smooth out outliers and facilitate more effective quantization. After this, quantization best practices such as range setting and generative post-training quantization (GPTQ) are applied. The SpinQuant matrices are optimized for the same quantization scheme used for QAT + LoRA.
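
The intuition behind the rotations can be seen in a few lines of PyTorch: for an orthogonal matrix R, rotating the activations and the weights together leaves a linear layer's output unchanged, but it spreads outlier energy across channels so that a fixed integer range wastes fewer quantization levels on a handful of extreme values. The toy sketch below uses a random orthogonal matrix as a stand-in; SpinQuant's contribution is learning rotations that minimize the quantization error.

```python
import torch

torch.manual_seed(0)

# Toy activations with a few strong outlier channels, and a linear layer's weight.
x = torch.randn(64, 256)
x[:, :4] *= 30.0                                   # outlier channels
w = torch.randn(256, 256)
y_ref = x @ w.t()

# A random orthogonal matrix stands in for SpinQuant's learned rotation.
rot, _ = torch.linalg.qr(torch.randn(256, 256))    # rot is the orthogonal Q factor

# Rotating activations and weights together is output-preserving:
# (x R) (W R)^T = x R R^T W^T = x W^T.
y_rot = (x @ rot) @ (w @ rot).t()
print(torch.allclose(y_ref, y_rot, atol=1e-2))     # True, up to float error

# But the rotated activations are far less outlier-dominated, so a fixed
# integer range wastes fewer quantization levels on a few extreme values.
def outlier_ratio(t: torch.Tensor) -> float:
    return (t.abs().max() / t.abs().mean()).item()

print(outlier_ratio(x), outlier_ratio(x @ rot))    # ratio drops sharply after rotation
```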


Results

In the table below, we show a comprehensive evaluation of models quantized with vanilla post-training quantization (PTQ), with SpinQuant, which produces state-of-the-art PTQ quality, and with QLoRA, which gives the best quality of all.

Percentage difference in relation to the average value for BF16.

In the table below, we compare the performance metrics of the different quantization methods (SpinQuant and QAT + LoRA) against the BF16 baseline. The evaluation was done using the ExecuTorch framework as the inference engine, with the Arm CPU as the backend. The quantized models were optimized primarily for the Arm CPU architecture by leveraging the KleidiAI library.

Performance was measured using an adb binary-based approach on an Android OnePlus 12 device. Time-to-first-token (TTFT) is measured with a prompt length of 64 tokens.

Decode latency improved by 2.5x and prefill latency improved by 4.2x on average, while model size decreased by 56% and memory usage was reduced by 41% on average. The benchmarks can be reproduced today by following the ExecuTorch Llama instructions. The table above shows results on an Android OnePlus 12 device; however, we’ve also verified similar relative performance on the Samsung S24+ for 1B and 3B and on the Samsung S22 for 1B. For iOS devices, we’ve verified that these models run with comparable accuracy but haven’t evaluated performance.

Besides CPU, we’re currently collaborating with partners to utilize NPUs for these quantized models for even greater performance. Our partners have already integrated foundational components in the ExecuTorch open source ecosystem to leverage NPUs, and work is underway to specifically enable quantization on NPU for Llama 1B/3B.


Looking to the future

We’ve been inspired and encouraged by the excitement and progress the community has achieved with Llama in just a short span of time. This year, Llama has achieved 10x growth and become the standard for responsible innovation. Llama also continues to lead on openness, modifiability, and cost efficiency and is competitive with closed models—even leading in some areas. As always, we can’t wait to see what the community builds using Llama and the powerful experiences they’ll enable on mobile devices.

We’re making Llama 3.2 models available for download on llama.com and Hugging Face.

We’d like to acknowledge the close collaboration of our partners: Arm, Hugging Face, MediaTek, Ollama, and Qualcomm.

