To fine-tune or not to fine-tune
August 7, 2024

This is the second blog post in a series about adapting open source large language models (LLMs). In this post, we’ll discuss the following question: “When should we fine-tune, and when should we consider other techniques?”

  • In Part 1, we took a look at prevalent approaches for adapting language models to domain data.
  • In Part 3, we explore some rules of thumb for curating a good training dataset.

Introduction

Prior to the rise of LLMs, fine-tuning was commonly used for smaller-scale models (100M – 300M parameters). State-of-the-art domain applications were built using supervised fine-tuning (SFT)—i.e., further training the pre-trained model using annotated data for your own domain and downstream task. However, with the advent of larger models (> 1B parameters), the question of fine-tuning has become more nuanced. Most importantly, larger models require greater resources and commercial-grade hardware for fine-tuning. Table 1 below lists peak GPU memory usage for fine-tuning Llama 2 7B and Llama 2 13B under three scenarios: full fine-tuning, LoRA, and QLoRA. You may notice that algorithms such as QLoRA have made it much more accessible to fine-tune a large model with limited resources. Similar reductions in memory as a result of parameter-efficient fine-tuning (PEFT) or quantization are reported for Llama 1. In addition to computational resources, catastrophic forgetting (see Part 1 of this series for more) is a common pitfall of full-parameter fine-tuning. PEFT techniques aim to address these shortcomings by training only a small number of parameters.

Table 1: Peak GPU memory (GB) for different fine-tuning methods (source) on Llama 2 7B. QLoRA uses 4-bit NormalFloat quantization.
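To make the memory discussion concrete, here is a minimal sketch of a QLoRA setup using the Hugging Face transformers, peft, and bitsandbytes libraries. The model ID, LoRA rank, and target modules are illustrative assumptions, not settings prescribed in this post.

```python
# Minimal QLoRA sketch (illustrative): load Llama 2 7B with 4-bit NormalFloat
# quantization and attach low-rank adapters so only a small fraction of
# parameters is trained.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-2-7b-hf"  # assumes access to the gated weights

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # 4-bit NormalFloat, as in Table 1
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                               # illustrative rank choice
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()      # typically well under 1% of all weights
```

LoRA saves memory by keeping gradients and optimizer state only for the small adapter weights; QLoRA saves further by holding the frozen base weights in 4-bit precision, which together account for the reductions shown in Table 1.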

Archetypes where fine-tuning might be beneficial

We’ve identified the following scenarios as common use cases that can benefit from fine-tuning:

  1. Tone, style, and format customization: Use cases may seek an LLM that mirrors a specific persona or serves a particular audience. By fine-tuning LLMs on custom datasets, we can shape a chatbot’s responses to align more closely with the requirements of its audience or the intended experience. We may also want to structure the output in a specific manner—for example, JSON, YAML, or Markdown formatted outputs (a sketch of such a training record appears after this list).
  2. Increasing accuracy and handling edge cases: Fine-tuning can be used to correct hallucinations or errors that are challenging to rectify through prompt engineering and in-context learning. It can also enhance the model’s ability to perform new skills or tasks that are difficult to express in a prompt. This process can help correct the model’s failures to follow complex prompts and improve its reliability in producing the desired output. We provide two examples:
    • Phi-2’s accuracy on financial data sentiment analysis increased from 34% to 85%.
    • ChatGPT’s accuracy on Reddit comment sentiment analysis improved by 25 percentage points (from 48% to 73%) using just 100 examples.
      Typically, when initial accuracy is low (< 50%), fine-tuning with a few hundred examples has provided a significant boost.
  3. Addressing underrepresented domains: Despite LLMs being trained on vast amounts of general data, they may not always be proficient in the nuanced jargon, terminology, or specificities of every niche domain. For diverse domains such as legal, medical, or finance, fine-tuning has been shown to help in increasing accuracy in downstream tasks. We provide two examples:
    • As pointed out in this article, patients’ medical histories contain highly sensitive data that isn’t typically found in the public domain. Therefore, an LLM-based system for summarizing medical histories requires fine-tuning.
    • For underrepresented languages such as Indic languages, fine-tuning using PEFT techniques helped across all tasks in those languages.
  4. Cost reduction: Fine-tuning can distill the skills in a bigger model like Llama 2 70B/GPT-4 into a smaller model such as Llama 2 7B, reducing costs and latency without compromising quality. Additionally, fine-tuning decreases the need for lengthy or specific prompts (as used in prompt engineering), leading to token savings and further cost reduction. As an example, this article shows how to fine-tune a GPT-3.5 judge by distilling a more expensive GPT-4 model, which resulted in cost savings.
  5. New tasks/abilities: Often, a new capability can be achieved via fine-tuning that is difficult or impossible to elicit through prompting alone.
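As a concrete illustration of the format-customization use case (item 1 above), here is a minimal sketch of a single supervised fine-tuning record for teaching JSON-formatted output. The chat-message schema and field names follow a common convention but are assumptions, not a format mandated by any particular training stack.

```python
# Sketch of one SFT training example for teaching JSON-formatted answers.
# The message schema and file layout are illustrative assumptions.
import json

record = {
    "messages": [
        {"role": "system",
         "content": "You are a support assistant. Always answer in JSON "
                    "with keys 'summary' and 'sentiment'."},
        {"role": "user",
         "content": "The checkout page crashed twice but support fixed it fast."},
        {"role": "assistant",
         "content": json.dumps({
             "summary": "Checkout crashed twice; support resolved it quickly.",
             "sentiment": "mixed",
         })},
    ]
}

# One JSON object per line; a few hundred such records is often enough to
# shift tone and output structure, in line with the numbers in item 2.
with open("sft_format_examples.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```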

Comparison with other techniques for domain adaptation

Fine-tuning vs. in-context (few-shot) learning

In-context learning (ICL) is a powerful way of improving the performance of an LLM-based system. Given its simplicity, ICL should be experimented with prior to any fine-tuning activities. Furthermore, ICL experiments can help you evaluate whether or not fine-tuning would improve performance on the downstream task. Common considerations when using ICL are:

  • As the number of examples shown in the prompt increases, so do inference cost and latency.
  • With more and more examples, it’s common for the LLM to ignore some of them. This means you may need a RAG-style step that selects the most relevant examples for each input (see the sketch after this list).
  • LLMs may regurgitate the knowledge provided to them as examples. This concern also exists when fine-tuning.
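Here is a minimal sketch of that RAG-style example selection using sentence-transformers embeddings and cosine similarity; the encoder name and the labeled example pool are placeholder assumptions.

```python
# Sketch: pick the k most relevant few-shot examples for a given input,
# rather than packing every example into the prompt. The embedding model
# and example pool are illustrative placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

examples = [
    ("The movie was a waste of time.", "negative"),
    ("Absolutely loved the soundtrack!", "positive"),
    ("It was fine, nothing special.", "neutral"),
    # ... a larger labeled pool in practice
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
example_vecs = encoder.encode(
    [text for text, _ in examples], normalize_embeddings=True
)

def build_few_shot_prompt(query: str, k: int = 2) -> str:
    """Select the k nearest labeled examples and format a few-shot prompt."""
    query_vec = encoder.encode([query], normalize_embeddings=True)[0]
    scores = example_vecs @ query_vec          # cosine similarity (unit vectors)
    top = np.argsort(scores)[::-1][:k]
    shots = "\n".join(
        f"Text: {examples[i][0]}\nLabel: {examples[i][1]}" for i in top
    )
    return f"{shots}\nText: {query}\nLabel:"

print(build_few_shot_prompt("The plot dragged, but the acting was superb."))
```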

Fine-tuning vs. RAG

The common consensus is that when the LLM base performance isn’t satisfactory, you might “start with RAG, gauge its performance, and if found lacking, shift to fine-tuning,” or that “RAG may have an edge” over fine-tuning (source). However, we think this paradigm is too simplistic, as there are several scenarios in which RAG is not an alternative to fine-tuning but rather a complementary approach to it. Depending on the characteristics of the problem, one or perhaps both approaches should be tried. Adopting the framework of this article, here are some of the questions you may ask to determine whether fine-tuning or RAG (or perhaps both) is suitable for your problem:

  • Does your application require external knowledge? Fine-tuning is typically not helpful for injecting new knowledge.
  • Does your application need custom tone/behavior/vocabulary or style? For these types of requirements, fine-tuning is typically the right approach.
  • How forgiving is your application to hallucinations? In applications where suppressing falsehoods and imaginative fabrications is vital, RAG systems provide built-in mechanisms to minimize hallucinations.
  • How much labeled training data is available?
  • How static/dynamic is the data? If the problem requires access to a dynamic corpus of data, fine-tuning may not be the right approach, as the knowledge of the LLM can soon become stale.
  • How transparent/interpretable does the LLM application need to be? RAG can inherently provide references, which are useful for interpreting the LLM output.
  • Cost and complexity: Does the team have expertise building search systems or previous fine-tuning experience?
  • How diverse are the tasks in your application?

In most cases, a hybrid solution of fine-tuning and RAG will yield the best results—the question then becomes the cost, time, and additional independent benefit of doing both. Use the questions above to decide whether RAG, fine-tuning, or both are needed, and run internal experiments, analyzing errors to understand the metric gains each can provide. Finally, fine-tuning does require a robust data-gathering and data-improvement strategy, which we recommend putting in place before starting to fine-tune.
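To illustrate what such a hybrid can look like, here is a minimal sketch that pairs an arbitrary retriever with a LoRA fine-tuned model: retrieval supplies fresh, citable knowledge while the adapter supplies domain tone and format. The adapter path and the retrieve callable are hypothetical placeholders.

```python
# Sketch of a hybrid pipeline: RAG for knowledge, a fine-tuned adapter for
# tone/format. The adapter repo and retriever are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(base, "my-org/domain-lora")  # hypothetical adapter

def answer(question: str, retrieve) -> str:
    """retrieve: any callable returning a list of relevant passages (your RAG stack)."""
    context = "\n".join(retrieve(question))
    prompt = (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=200)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```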

Acknowledgements

We would like to thank Suraj Subramanian and Varun Vontimitta for their constructive feedback on the organization and preparation of this blog post.


Written by:
Aditya Jain
Applied Research Scientist
Amir Maleki
Applied Research Scientist
Nathalie Saade
Applied Research Manager