Large Language Model
How NVIDIA is using structured weight pruning and knowledge distillation to build new Llama models
August 14, 2024
1 minute read

Large language models like Llama can handle a variety of challenging tasks with impressive speed and precision, such as generating code, solving math problems, and helping doctors make life-saving medical decisions. Open source models are already leading to incredible breakthroughs across disciplines; however, they’re resource-intensive to deploy. It’s important that we work collaboratively across the industry to make it even easier for people to tap into the game-changing potential of LLMs.

Last month, we announced Llama 3.1, which includes our largest model yet, the 405B, as well as two smaller models with 70 billion and 8 billion parameters, respectively. Smaller models derived from a larger one are typically cheaper to deploy at scale and still perform well across many language tasks. In a new research paper, our partners at NVIDIA explore how large models can be made smaller using structured weight pruning and knowledge distillation, without having to train a new model from scratch. Working with Llama 3.1 8B, the team shares how it created Llama-Minitron 3.1 4B, its first work within the Llama 3.1 open source family.
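
For readers who want a concrete picture of the two techniques, here is a minimal PyTorch sketch: structured width pruning removes whole hidden units from a layer pair based on an importance score, and knowledge distillation then trains the smaller student to match the larger teacher’s output distribution. The toy model, the weight-norm importance heuristic, and the hyperparameters below are illustrative assumptions for this post, not NVIDIA’s actual recipe for Llama-Minitron 3.1 4B; see NVIDIA’s paper and blog post for the real method.

```python
# Toy sketch of structured pruning + knowledge distillation (not NVIDIA's recipe).
import torch
import torch.nn.functional as F
from torch import nn


def prune_linear_pair(fc_in: nn.Linear, fc_out: nn.Linear, keep: int):
    """Structured width pruning: rank the hidden units of fc_in by a simple
    importance proxy (L2 norm of each unit's weights) and keep only the top
    `keep` units, shrinking both layers consistently."""
    importance = fc_in.weight.norm(dim=1)              # one score per hidden unit
    idx = importance.topk(keep).indices.sort().values  # indices of kept units
    new_in = nn.Linear(fc_in.in_features, keep)
    new_out = nn.Linear(keep, fc_out.out_features)
    new_in.weight.data = fc_in.weight.data[idx]
    new_in.bias.data = fc_in.bias.data[idx]
    new_out.weight.data = fc_out.weight.data[:, idx]
    new_out.bias.data = fc_out.bias.data.clone()
    return new_in, new_out


def distillation_loss(student_logits, teacher_logits, T: float = 2.0):
    """Standard logit distillation: KL divergence between temperature-softened
    teacher and student distributions."""
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * T * T


# Toy "teacher": a 2-layer MLP language model over a 100-token vocabulary.
vocab, hidden = 100, 512
teacher = nn.Sequential(
    nn.Embedding(vocab, 64), nn.Flatten(1),
    nn.Linear(64, hidden), nn.ReLU(), nn.Linear(hidden, vocab),
)

# Build the student by structurally pruning the teacher's MLP width in half.
# For simplicity the student reuses the teacher's embedding layer.
pruned_in, pruned_out = prune_linear_pair(teacher[2], teacher[4], keep=hidden // 2)
student = nn.Sequential(teacher[0], nn.Flatten(1), pruned_in, nn.ReLU(), pruned_out)

# One distillation step: the pruned student learns to match the frozen teacher.
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
tokens = torch.randint(0, vocab, (8, 1))  # a dummy batch of single-token inputs
with torch.no_grad():
    teacher_logits = teacher(tokens)
loss = distillation_loss(student(tokens), teacher_logits)
loss.backward()
opt.step()
```

In practice, importance is typically estimated from activations on a small calibration set rather than from raw weight norms, and the distillation objective may be combined with the standard language-modeling loss; the key idea is that inheriting pruned weights gives the smaller model a far better starting point than training from scratch.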

Learn more about this work, including the pruning and distillation strategy and additional resources, in NVIDIA’s blog post.


