Methods for adapting large language models
August 7, 2024

This is the first blog post in a three-part series about adapting open source large language models (LLMs). In this post, we’ll take a look at the various approaches available to adapt LLMs to domain data.

  • In Part 2, we’ll discuss how to determine if fine-tuning is the right approach for your use case.
  • In Part 3, we’ll explore some rules of thumb for curating a good training dataset.

Introduction

Large language models (LLMs) have demonstrated exceptional abilities across a plethora of language tasks and natural language processing (NLP) benchmarks. Product use cases based on these “generalized” models are on the rise. In this blog post, we’ll provide guidance for small AI product teams who want to adapt and integrate LLMs into their projects. Let’s start by clarifying the (often confusing) terminology surrounding LLMs, then briefly compare the different adaptation methods available, and finally recommend a step-by-step flowchart for identifying the right approach for your use case.

Approaches to LLM adaptation

Pre-training

Pre-training is the process of training an LLM from scratch using trillions of data tokens. The model is trained using a self-supervised algorithm. Most commonly, training happens by predicting the next token autoregressively (a.k.a. causal language modeling). Pre-training typically requires thousands of GPU hours (10⁵ – 10⁷ [source1, source2]) spread across multiple GPUs. The output model from pre-training is known as a foundation model.
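
To make the self-supervised objective concrete, here is a minimal sketch of the next-token prediction loss that causal language modeling optimizes. It is written in PyTorch for illustration; a real pre-training run shards this computation across many GPUs.

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Next-token prediction: each position is trained to predict the token that follows it."""
    # logits: (batch, seq_len, vocab_size); input_ids: (batch, seq_len)
    shift_logits = logits[:, :-1, :]   # predictions for positions 0 .. n-2
    shift_labels = input_ids[:, 1:]    # targets are the tokens at positions 1 .. n-1
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```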

Continued pre-training

Continued pre-training (a.k.a. second-stage pre-training) involves further training a foundation model with new, unseen domain data. The same self-supervised algorithm from the initial pre-training is used. All model weights are typically involved, and a fraction of the original data is mixed with the new data.
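
The exact replay ratio is a design choice; as an illustrative sketch, the data mixture for continued pre-training might be assembled along these lines (the helper function and the 10% default are hypothetical, not a recommendation):

```python
import random

def build_cpt_mixture(domain_docs, original_docs, original_fraction=0.1, seed=0):
    """Blend new domain documents with a fraction of the original pre-training data.
    `original_fraction` is the (assumed) share of the final mixture drawn from the
    original corpus, mixed in to help retain previously learned knowledge."""
    rng = random.Random(seed)
    n_original = int(len(domain_docs) * original_fraction / (1.0 - original_fraction))
    mixture = list(domain_docs) + rng.sample(list(original_docs), min(n_original, len(original_docs)))
    rng.shuffle(mixture)
    return mixture
```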

Fine-tuning

Fine-tuning is the process of adapting a pre-trained language model using an annotated dataset in a supervised manner or using reinforcement learning-based techniques. There are two major differences compared to pre-training:

  1. Supervised training on an annotated dataset—that contains the correct labels/answers/preferences—instead of self-supervised training
  2. Requires far fewer tokens (thousands or millions instead of the billions or trillions needed in pre-training), with the primary aim of enhancing abilities like instruction following, human alignment, task performance, etc. (see the data-preparation sketch after this list)
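
As a concrete illustration of point 1, here is a minimal, hypothetical sketch of how a single supervised example is often prepared for fine-tuning a causal LLM: the prompt tokens are masked out of the loss so training focuses on the annotated response. The model checkpoint and helper name are illustrative.

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; any causal-LM tokenizer works the same way.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def build_sft_example(prompt: str, response: str, ignore_index: int = -100):
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]
    # Labels mask the prompt with ignore_index so the loss is computed
    # only on the annotated response tokens.
    return {
        "input_ids": prompt_ids + response_ids,
        "labels": [ignore_index] * len(prompt_ids) + response_ids,
    }

example = build_sft_example("Summarize: The quick brown fox jumps over the lazy dog.",
                            " A fox jumps over a dog.")
```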

There are two dimensions to understanding the current landscape of fine-tuning: percentage of parameters changed and new capabilities added as a result of the fine-tuning.

Percentage of parameters changed

Depending on the number of parameters changed, there are two categories of algorithms:

  1. Full fine-tuning: As the name suggests, this encompasses changing all parameters of the model and includes legacy fine-tuning as done on smallish models like XLMR and BERT (100 – 300M parameters) as well as fine-tuning on large models like Llama 2, GPT3 (1B+ parameters), etc.
  2. Parameter-efficient fine-tuning (PEFT): Instead of fine-tuning all LLM weights, PEFT algorithms only fine-tune a small number of additional parameters or update a subset of the pre-trained parameters, typically 1 – 6% of the total parameters (see the LoRA-based sketch after this list).
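
As one illustration of PEFT, the sketch below applies LoRA adapters with the Hugging Face peft library. The base model, rank, and target modules shown are assumptions for demonstration, not a prescription; the base weights stay frozen and only the small adapter matrices are trained.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices (assumed value)
    lora_alpha=32,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # reports the small fraction of weights that will be trained
```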

Capabilities added to a base model

Fine-tuning is carried out with the intention of adding capabilities to the pre-trained model—for example: instruction following, human alignment, etc. Chat-tuned Llama 2 is an example of a fine-tuned model with added instruction-following and alignment capabilities.

Retrieval augmented generation (RAG)

Enterprises can also adapt LLMs by adding a domain-specific knowledge base. RAG is quintessentially “search-powered LLM text generation.” Introduced in 2020, RAG uses a dynamic prompt context that is retrieved using the user question and injected into the LLM prompt in order to steer it to use the retrieved content instead of its pre-trained—and possibly outdated—knowledge. Chat LangChain is a popular Q/A chatbot on LangChain documentation that’s powered by RAG.
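
A minimal, self-contained sketch of the RAG flow is shown below. The toy knowledge base and keyword-overlap retriever are stand-ins for a real retrieval engine (e.g. a vector store over your domain documents); the resulting prompt would then be sent to the LLM.

```python
# Toy in-memory knowledge base; a production system would use a real retrieval engine.
KNOWLEDGE_BASE = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday through Friday, 9am to 5pm.",
]

def retrieve(question: str, top_k: int = 1) -> list[str]:
    # Toy retriever: rank documents by word overlap with the user question.
    q_words = set(question.lower().split())
    ranked = sorted(KNOWLEDGE_BASE,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:top_k]

def build_rag_prompt(question: str) -> str:
    context = "\n".join(retrieve(question))
    # The retrieved content is injected into the prompt so the model answers from
    # the knowledge base rather than its (possibly outdated) pre-trained knowledge.
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

print(build_rag_prompt("What is your refund policy?"))
```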

In-context learning (ICL)

With ICL, we adapt the LLM by placing prototype examples in the prompt. “Demonstration through examples” has been shown in multiple studies to be effective. The examples can contain different kinds of information:

  • Input and output text only—that is, few-shot learning
  • Reasoning traces: adding intermediate reasoning steps; see Chain-of-Thought (CoT) prompting
  • Planning and reflection traces: adding information that teaches the LLM to plan and reflect on its problem-solving strategy; see ReACT

Multiple other strategies to modify the prompts exist, and the Prompt Engineering Guide contains a comprehensive overview.
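
For illustration, the following (made-up) few-shot prompt combines input/output examples with short reasoning traces in the style of CoT prompting; no model weights are updated.

```python
# Made-up few-shot examples with short reasoning traces (CoT style).
FEW_SHOT_PROMPT = """\
Q: A store sells pens in packs of 12. How many pens are in 3 packs?
Reasoning: Each pack has 12 pens, and 3 packs give 3 x 12 = 36 pens.
A: 36

Q: A train travels 60 km per hour. How far does it go in 2.5 hours?
Reasoning: Distance is speed times time, so 60 x 2.5 = 150 km.
A: 150 km

Q: {question}
Reasoning:"""

# The formatted prompt is sent to the LLM as-is; no training is involved.
prompt = FEW_SHOT_PROMPT.format(question="A box holds 8 apples. How many apples are in 5 boxes?")
```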

Choosing the right adaptation method

To decide which of the above approaches are suitable for a particular application, you should consider various factors: the model capability required for the pursued task, cost of training, cost of inference, types of datasets, etc. The flowchart below summarizes our recommendations to assist you in choosing the right LLM adaptation method.

❌ Pre-training

Pre-training, a vital part of LLM training, uses a token prediction variant as a loss function. Its self-supervised nature allows for training on extensive data. As an example, Llama 2 is trained on 2 trillion tokens. This requires massive computational infrastructure: Llama 2 70B took 1,720,320 GPU hours. Therefore, for teams with limited resources, we do not recommend pre-training as a viable approach for LLM adaptation.

With pre-training being computationally prohibitive, it stands to reason that updating the weights of a model that is already pre-trained could be an effective way to adapt an LLM for particular tasks. However, any approach that updates the weights of a pre-trained model is susceptible to a phenomenon called catastrophic forgetting, in which the model loses previously learned skills and knowledge. For example, this study showed how a model fine-tuned in the medical domain degraded in performance on instruction-following and common QA tasks. Other studies have also shown that general knowledge obtained through pre-training can be forgotten in subsequent training sessions. For example, this study provided some evidence of knowledge forgetting in LLMs from the perspectives of domain knowledge, reasoning, and reading comprehension.

❌ Continued pre-training

Despite the risk of catastrophic forgetting, recent developments have shown that continued pre-training (CPT) can lead to further improvements in performance at a fraction of the compute cost that pre-training would incur. CPT can be beneficial for tasks that require the LLM to acquire a new transformation skill. As an example, continued pre-training has been reported to be successful at adding multilingual capabilities.

But CPT is still an expensive process, requiring a substantial amount of data and computational resources. For instance, FinPythia-6.9B, a model designed for financial data, was created by running a second stage of pre-training on the Pythia suite for 18 days using a dataset comprising 24 billion tokens. Furthermore, CPT is also prone to catastrophic forgetting. Therefore, for teams with limited resources, we do not recommend continued pre-training as a viable approach for LLM adaptation.

To summarize, adapting LLMs using self-supervised algorithms and unannotated datasets, as is done in pre-training and continued pre-training, is resource- and cost-intensive and is not recommended as a viable approach for teams with limited resources.

✅ Full fine-tuning and parameter-efficient fine-tuning (PEFT)

Fine-tuning with smallish annotated datasets is a more cost-effective approach than pre-training or continued pre-training with unannotated datasets. By adapting a pre-trained model to a specific task, fine-tuning has been shown to achieve state-of-the-art results in a wide range of applications and specialized domains, such as legal, medical, or finance.

Fine-tuning, specifically parameter-efficient fine-tuning (PEFT), requires only a fraction of the computational resources needed for pre-training/continued pre-training. Therefore, this is a viable approach for adapting LLMs for teams with limited resources. In Part 3 of this series, we dig into fine-tuning details, including full fine-tuning, PEFT, and practical guidelines for how to fine-tune.

✅ Retrieval-augmented generation (RAG)

RAG is another popular approach to LLM adaptation. If your application requires extracting information from a dynamic knowledge base (e.g. a QA bot), RAG could be a great solution. The complexity of a RAG-based system lies primarily in the retrieval engine implementation. Inference in such a system can also be more expensive, since the prompt includes the retrieved documents and most providers use a cost-per-token billing model. In Part 2 of this series, we discuss RAG more broadly and provide comparisons with fine-tuning.

✅ In-context learning (ICL)

ICL is the most cost-effective way of adapting LLMs, since it requires no additional training data or training compute. However, similar to RAG, the cost and latency of inference may increase as more tokens are processed at inference time.

Summary

Creating an LLM-based system is iterative. We advise starting with simple methods and gradually increasing complexity until your goals are achieved. The flowchart above outlines this iterative process and serves as a solid foundation for your LLM adaptation strategy.

Acknowledgements

We would like to thank Suraj Subramanian and Varun Vontimitta for their constructive feedback on the organization and preparation of this blog post.


Written by:
Aditya Jain
Applied Research Scientist
Amir Maleki
Applied Research Scientist
Nathalie Saade
Applied Research Manager