RAFT: Sailing Llama towards better domain-specific RAG
May 7, 2024

Retrieval-Augmented Fine-Tuning (RAFT) combines the benefits of Retrieval-Augmented Generation and Supervised Fine-Tuning for better domain adaptation

Check out the original RAFT blog post on Microsoft TechCommunity.


Because pretrained models such as Meta Llama 2 are trained on a wide range of topics, they can generate informative responses to queries across many subjects. Many use cases, however, require the model to be specialized for a domain and to draw on domain-specific information when generating responses.

Currently there are two ways to do this:

  • Domain-specific Supervised Fine-Tuning (DSF), which means training an existing base model on a set of documents that represent the domain-specific knowledge.
  • Retrieval Augmented Generation (RAG), which involves storing those documents in a vector database and, at query time, retrieving the documents that are semantically similar to the question and using their contents as context for the LLM to generate a response.
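
To make the RAG bullet concrete, here is a minimal sketch of query-time retrieval. It is illustrative only: the bag-of-words `embed` function is a toy stand-in for a learned embedding model, an in-memory list stands in for the vector database, and the function names and prompt template are assumptions rather than anything taken from the RAFT paper.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context_docs: list[str]) -> str:
    """Place the retrieved documents in front of the question (hypothetical template)."""
    context = "\n\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

Note that `retrieve` ranks purely by similarity to the query, so a document can score highly without actually containing the answer.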

In this article, we will look at the limitations of those two approaches and at how two UC Berkeley researchers, Tianjun Zhang and Shishir G. Patil, may have just discovered a better approach. The team, previously known for Gorilla LLM, presents this new approach in their RAFT (Retrieval Augmented Fine Tuning) paper, which shows how they used Meta Llama 2 on Azure AI Studio through Models-as-a-Service (MaaS) to conduct their research and implement their approach.

The Berkeley team also published a blog post about the paper explaining the advantages and disadvantages of the previous approaches and how the RAFT approach produces more effective results. The RAFT paper implementation is available in their GitHub repository.

Let’s start by giving an overview of how the RAFT approach works.

Understanding the RAFT method

In conventional RAG, when a query is posed, the system retrieves a few documents from an index that are likely to contain the answer. The model then uses these documents as context to generate an answer to the user’s query.

With fine-tuning, the model answers queries like a student writing a closed-book exam. RAG, by contrast, resembles an open-book exam, where the student has full access to a textbook to find the answers. Open-book exams are easier to solve than closed-book exams, which explains the efficacy and popularity of RAG.

Both approaches have limitations. With fine-tuning, the model is not only limited to what it has been trained on, but it is also subject to approximation and hallucination. With RAG, the model is grounded, that is, its responses are based on some reference documents in a corpus. These reference documents are retrieved based on their semantic similarity to the query; the model doesn’t actually know which documents are truly relevant and which are just red herrings. These “distractor” documents may be pulled into the model’s context even when they are not good sources for a well-reasoned answer.

Tianjun and Shishir were looking to address these deficiencies of RAG. They hypothesized that a student who studies the textbooks before the open-book exam would likely perform better than a student who references the textbook only during the exam. Translating that back to LLMs, if a model “studied” the documents beforehand, could that improve its RAG performance? Their approach, Retrieval Augmented Fine Tuning, attempts to get the model to study or adapt to a domain before it is used in a RAG setup.

Using the Meta Llama 2 7B language model, they first prepare a synthetic dataset where each data sample consists of:

  • A question
  • A set of reference documents that includes documents containing relevant information and documents that do not contain any relevant information to answer the question—and therefore can safely be ignored
  • An answer generated from the documents
  • A Chain-of-Thought (CoT) explanation that includes excerpts from the relevant documents

This dataset is used to fine-tune the Meta Llama 2 7B model using standard supervised training. The model is now better adapted to the domain; it not only aligns its tone and voice to the domain dataset but is also better at extracting the useful bits of information from the retrieved context. The addition of Chain-of-Thought reasoning helps prevent overfitting and improves training robustness.
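
For standard supervised training, each sample is typically serialized into a prompt and a target completion. The sketch below shows one possible way to do that; the `to_training_pair` function, the `[Document i]` labels, and the `Reasoning:` / `Answer:` layout are assumptions made for illustration and may differ from the template used in the RAFT repository.

```python
def to_training_pair(
    question: str,
    documents: list[str],
    cot_explanation: str,
    answer: str,
) -> dict[str, str]:
    """Serialize one sample into a prompt/completion pair (hypothetical template)."""
    # The prompt holds every document, relevant or not, followed by the question.
    context = "\n".join(
        f"[Document {i}] {doc}" for i, doc in enumerate(documents, start=1)
    )
    prompt = f"{context}\nQuestion: {question}\n"
    # The completion is what the model learns to produce: reasoning first, then the answer.
    completion = f"Reasoning: {cot_explanation}\nAnswer: {answer}"
    return {"prompt": prompt, "completion": completion}
```

Because the completion quotes only the relevant documents while the prompt also contains distractors, the training signal rewards the model for picking out the useful context.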

RAFT sits in the middle ground between RAG and DSF. It primes the LLM on domain knowledge and style (à la DSF) while also improving the quality of the answers it generates from the retrieved context. Since pretrained models such as Meta Llama 2 are trained on a diverse set of domains, techniques such as RAFT can make them better suited for niche areas such as healthcare or legal datasets.

Q&A with the RAFT researchers

We had the opportunity to ask the Berkeley team about their experience of using Meta Llama for RAFT.

Why did you choose Meta Llama 2 7B?

RAFT Researchers: We chose Meta Llama 2 7B because we focus on RAG tasks, which require a model that can reason, understand language, serve with low-latency inference, and adapt easily to diverse settings. Meta Llama 2 7B fit the bill well: It's a good base model for a lot of the general-knowledge, question-answering tasks, with encouraging math skills, and the ability to parse reasonably long documents due to its 4,096-token context length. Meta Llama 2 7B is also a perfect model for training on four A100-40G GPUs and serving on a single GPU. On the Pareto curve of performance and ease of deployment, and with the right licensing, the Meta Llama 2 model is quite apt for the RAFT task. With the help of Azure AI Studio, we are happy to explore Meta Llama 2 13B or Meta Llama 2 70B as well.

What recommendations do you have for people trying to fine-tune Meta Llama? Any best practices you learned in the field with fine-tuning LLMs?

RAFT Researchers: Fine-tuning Meta Llama is usually a complex task involving data collection, data cleaning, and actual fine-tuning. In terms of data, we recommend collecting diverse questions with respect to your domain and constructing chain-of-thought (CoT) answers (also talked about in our RAFT paper). We also recommend you store intermediate checkpoints, which would then help with early stopping. It is also critical to have the fine-tuning learning rate set to at least an order of magnitude lower than what was used for pre-training. Other than this, the usual best practices of 16-bit precision, not training for more than 3 epochs, and using large batch sizes are also recommended.
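
As a rough illustration, the recommendations in this answer can be captured in a small configuration sketch. Everything here is hypothetical: the `FinetuneConfig` class, the helper names, the checkpoint interval, and the `ASSUMED_PRETRAIN_LR` value are placeholders chosen for the example, and `bf16` is just one 16-bit option. Check your base model's documentation for the learning rate actually used in pre-training.

```python
from dataclasses import dataclass


@dataclass
class FinetuneConfig:
    learning_rate: float
    num_epochs: int = 3                 # recommendation: no more than 3 epochs
    precision: str = "bf16"             # one 16-bit option (assumption)
    checkpoint_every_steps: int = 200   # illustrative interval for intermediate checkpoints

    def __post_init__(self) -> None:
        if self.num_epochs > 3:
            raise ValueError("the recommendation is to train for at most 3 epochs")


def finetune_learning_rate(pretrain_lr: float, reduction: float = 10.0) -> float:
    """Return a learning rate at least an order of magnitude below the pre-training one."""
    if reduction < 10.0:
        raise ValueError("reduction should be at least 10x")
    return pretrain_lr / reduction


def best_checkpoint(eval_loss_by_step: dict[int, float]) -> int:
    """Simple early-stopping rule: keep the checkpoint with the lowest validation loss."""
    return min(eval_loss_by_step, key=eval_loss_by_step.get)


# Placeholder value for illustration only, not a documented figure.
ASSUMED_PRETRAIN_LR = 3e-4
config = FinetuneConfig(learning_rate=finetune_learning_rate(ASSUMED_PRETRAIN_LR))
```

Storing a validation loss for each saved checkpoint is what makes the `best_checkpoint` selection possible after training has finished.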

Should the fine-tuning be applied to each domain? Or is the fine-tuned model better at RAG on multiple domains in general?

RAFT Researchers: The fine-tuned model's performance is dependent on the domain (the documents it is trained on) for knowledge but can generalize across domains for behavior to a certain extent. There is a slight tradeoff between accuracy and generalization. Usually fine-tuning for a domain is a good practice, but fine-tuning for a limited set of enterprise docs may bring better performance since the knowledge is strictly narrower.


The RAFT method is a significant step forward in the field of language model fine-tuning. It not only improves the quality of generated answers but also enhances the model's ability to extract useful information from the retrieved context. As such, it holds great potential for future applications in various fields.

The use of the Meta Llama 2 7B language model in this research demonstrates the versatility and adaptability of this model in handling diverse tasks. The team's experience and recommendations provide valuable insights for those looking to fine-tune Meta Llama or similar models.

Azure AI Studio further democratizes access to state-of-the-art GenAI capabilities. The platform simplifies the process of fine-tuning, testing, and deploying, which enables developers and enterprises to create innovative and customized solutions without requiring extensive ML expertise.

Learn more about RAFT and Meta Llama on Azure Models-as-a-Service

Written by:
Suraj Subramanian
AI Advocate, Meta
Cedric Vidal
Principal AI Advocate, Microsoft
