This is the second blog post in a series about adapting open source large language models (LLMs). In this post, we’ll discuss the following question: “When should we fine-tune, and when should we consider other techniques?”
Introduction
Prior to the rise of LLMs, fine-tuning was commonly used for smaller-scale models (100M – 300M parameters). State-of-the-art domain applications were built using supervised fine-tuning (SFT)—i.e., further training the pre-trained model on annotated data for your own domain and downstream task. However, with the advent of larger models (> 1B parameters), the question of fine-tuning has become more nuanced. Most importantly, larger models require significantly more compute and memory resources for fine-tuning. Table 1 below lists the peak GPU memory usage for fine-tuning Llama 2 7B and Llama 2 13B under three schemes: full fine-tuning, LoRA, and QLoRA. Algorithms such as QLoRA have made it much more accessible to fine-tune a large model with limited resources, and similar memory reductions from parameter-efficient fine-tuning (PEFT) or quantization have been reported for Llama 1. In addition to computational resources, catastrophic forgetting (see Part 1 of this series for more) is a common pitfall of full-parameter fine-tuning. PEFT techniques aim to address these shortcomings by training only a small number of parameters.
Table 1: Peak GPU memory (GB) for different fine-tuning methods (source) on Llama 2 7B. QLoRA uses 4-bit NormalFloat quantization.
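To make the resource trade-off concrete, the snippet below is a minimal sketch of a QLoRA-style setup: the base model is loaded with 4-bit NormalFloat quantization and small LoRA adapters are attached so that only a tiny fraction of the parameters are trained. It assumes the Hugging Face transformers, peft, and bitsandbytes libraries; the model name, rank, and target modules are illustrative choices, not recommendations from this post.

```python
# Minimal sketch: load Llama 2 7B in 4-bit (QLoRA-style) and attach LoRA adapters.
# Assumes transformers, peft, and bitsandbytes are installed; hyperparameters are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-2-7b-hf"  # requires access to the gated weights

# 4-bit NormalFloat quantization, as referenced for QLoRA in Table 1
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA: train a small set of low-rank adapter weights instead of all 7B parameters
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```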
Archetypes where fine-tuning might be beneficial
We’ve identified the following scenarios as common use cases that can benefit from fine-tuning:
Comparison with other techniques for domain adaptation
Fine-tuning vs. in-context (few-shot) learning
In-context learning (ICL) is a powerful way of improving the performance of an LLM-based system. Given its simplicity, ICL should be experimented with prior to any fine-tuning activities. Furthermore, ICL experiments can help you evaluate whether or not fine-tuning would improve performance on the downstream task. Common considerations when using ICL are:
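As a quick illustration of ICL, the sketch below builds a few-shot classification prompt in which the task is specified entirely through examples in the prompt, with no weight updates. The task, labels, and example messages are hypothetical.

```python
# Minimal sketch of in-context (few-shot) learning: the model is adapted purely
# through the prompt, with no fine-tuning. Task and examples are hypothetical.
few_shot_examples = [
    ("The checkout page keeps timing out.", "bug_report"),
    ("Can you add dark mode to the app?", "feature_request"),
    ("How do I reset my password?", "support_question"),
]

def build_prompt(query: str) -> str:
    """Assemble a few-shot classification prompt from labeled examples."""
    lines = ["Classify each message as bug_report, feature_request, or support_question.", ""]
    for text, label in few_shot_examples:
        lines.append(f"Message: {text}\nLabel: {label}\n")
    lines.append(f"Message: {query}\nLabel:")
    return "\n".join(lines)

prompt = build_prompt("The export button crashes the browser.")
# `prompt` is then sent to the base model as-is; no weights are updated.
```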
Fine-tuning vs. RAG
The common consensus is that when the LLM base performance isn’t satisfactory, you might “start with RAG, gauge its performance, and if found lacking, shift to fine-tuning,” or that “RAG may have an edge” over fine-tuning (source). However, we think this paradigm is too simplistic, as there are several scenarios in which RAG is not an alternative to fine-tuning but rather a complementary approach. Depending on the characteristics of the problem, one or perhaps both approaches should be tried. Adopting the framework of this article, here are some of the questions you may ask to determine whether fine-tuning or RAG (or perhaps both) is suitable for your problem:
In most cases, a hybrid solution combining fine-tuning and RAG will yield the best results—the question then becomes the cost, time, and additional independent benefit of doing both. Refer to the questions above to guide your decision on whether RAG and/or fine-tuning is needed, and use internal experiments and error analysis to understand the metric gains that are possible. Finally, fine-tuning does require a robust data gathering and data improvement strategy, which we would recommend putting in place before starting to fine-tune.
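For concreteness, the sketch below shows the retrieval half of such a hybrid setup: documents are embedded, the most relevant ones are retrieved for a query, and the result is packed into a prompt that can then be sent to either a base or a fine-tuned generator. It assumes the sentence-transformers library; the embedding model, documents, and retrieval size are illustrative, and a production system would typically use a proper vector store.

```python
# Minimal RAG retrieval sketch; the generator it feeds may itself be a fine-tuned model.
# Assumes sentence-transformers is installed; documents and model choice are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Enterprise plans include single sign-on and audit logging.",
    "The API rate limit is 100 requests per minute per key.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = embedder.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query (cosine similarity)."""
    query_embedding = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_embeddings @ query_embedding
    top_indices = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top_indices]

query = "How long do customers have to return a product?"
context = "\n".join(retrieve(query))
prompt = (
    "Answer using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
)
# `prompt` is then passed to the generator, which in a hybrid setup may be fine-tuned
# on the target domain while RAG supplies up-to-date or proprietary knowledge.
```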
Acknowledgements
We would like to thank Suraj Subramanian and Varun Vontimitta for their constructive feedback on the organization and preparation of this blog post.