How to fine-tune: Focus on effective datasets
August 7, 2024

This is the third blog post in a series about adapting open source large language models (LLMs). In this post, we explore some rules of thumb for curating a good training dataset.

  • In Part 1, we took a look at prevalent approaches for adapting language models to domain data.
  • In Part 2, we discussed how to determine if fine-tuning is the right approach for your use case.

Introduction

Fine-tuning LLMs is a mix of art and science, with best practices in the field still emerging. In this blog post, we’ll highlight design variables for fine-tuning and give directional guidance on best practices we’ve seen so far to fine-tune models with resource constraints. We recommend using the information below as a starting point to strategize your fine-tuning experiments.

Full fine-tuning vs. parameter-efficient fine-tuning (PEFT)

Both full fine-tuning and PEFT have shown improvements in downstream performance when applied to new domains in both academic and practical settings. Choosing between them comes down to the compute available (GPU hours and GPU memory), performance on tasks other than the target downstream task (the learning-forgetting tradeoff), and human annotation costs.

Full fine-tuning is more prone to two problems: model collapse and catastrophic forgetting. Model collapse occurs when the model's outputs converge to a limited set of responses and the tail of the original content distribution disappears. Catastrophic forgetting, as discussed in Part 1 of this series, causes the model to lose previously learned capabilities. Some early empirical studies show that full fine-tuning is more prone to these issues than PEFT techniques, though more research needs to be done.

PEFT techniques serve as natural regularizers for fine-tuning by design. PEFT often costs relatively less compute to train a downstream model and is much more accessible for a resource-constrained scenario with limited dataset sizes. In some cases, full fine-tuning has shown better performance at the specific task of interest, often at the cost of forgetting some of the capabilities of the original model. This “learning-forgetting” tradeoff between the specific downstream task performance and performance on other tasks is explored deeply in the comparison of LoRA and full fine-tuning in this paper.

Given resource constraints, PEFT techniques will likely give a better performance boost/cost ratio than full fine-tuning. If downstream task performance is of paramount importance, full fine-tuning will be the most effective. In either scenario, the key is to create a high-quality dataset, keeping the following principles in mind.
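To make the compute difference concrete, here is a minimal sketch of a LoRA-based PEFT setup using Hugging Face's transformers and peft libraries; the checkpoint name and hyperparameters are illustrative placeholders to adapt, not recommendations from this post.

```python
# Minimal LoRA (PEFT) setup sketch. The checkpoint name and hyperparameters
# are illustrative placeholders; adjust them for your model and task.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor applied to the updates
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(base_model, lora_config)
# Only the adapter weights are trainable; the base model stays frozen, which is
# what keeps GPU memory and compute requirements low relative to full fine-tuning.
peft_model.print_trainable_parameters()
```

Printing the trainable parameter count typically shows that the adapters make up a small fraction of the base model's parameters, which is where the compute and memory savings come from.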

Dataset curation

In fine-tuning experiments across the literature, the dataset has been critical to reaping the benefits of fine-tuning. There is more nuance than "better quality and more examples": you can intelligently invest in dataset collection for increased performance in resource-constrained fine-tuning experiments.

Data quality/quantity

  • Quality is paramount: A general trend we’ve seen is that quality is more important than quantity; it’s better to have a small set of high-quality data than a large set of low-quality data. Key markers of quality are consistent annotation; freedom from errors, mislabeled data, and noisy inputs/outputs; and a distribution representative of the population the model will see at inference time. A few thousand curated examples in the LIMA dataset yielded better fine-tuning performance than the 50K machine-generated Alpaca dataset. OpenAI’s fine-tuning documentation suggests that even a 50- to 100-example dataset can make a difference. (A small quality-check sketch follows this list.)
  • Tougher language tasks need more data: Relatively tougher tasks, such as text generation and summarization, are harder to fine-tune and require more data than easier tasks such as classification and entity extraction. “Tougher” here can mean several things: more tokens in the output, a higher order of human ability required, or multiple correct answers.
  • Effective high-quality data collection: Since data collection is expensive, the following strategies are recommended for better sample efficiency and lower cost:
    • Observe failure modes: Observe examples where the previous ML capability fails and add examples targeting those failure modes.
    • Human in the loop: This is a cheaper way to scale data labeling. We use LLM automation to generate a base response, which human annotators can then correct or approve in less time than labeling from scratch.
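To make the quality principles above actionable, here is a minimal quality-check sketch that flags empty fields, exact duplicates, and annotator disagreement; the field names (prompt, response, label) are assumptions about your schema rather than a prescribed format.

```python
# Minimal dataset quality-check sketch. Field names ("prompt", "response",
# "label") are assumptions about the schema, not a prescribed format.
from collections import Counter, defaultdict

def quality_report(examples):
    """Flag empty fields, exact duplicates, and annotator disagreement."""
    issues = []
    keys = [(ex["prompt"].strip(), ex["response"].strip()) for ex in examples]
    counts = Counter(keys)

    for i, (prompt, response) in enumerate(keys):
        if not prompt or not response:
            issues.append((i, "empty prompt or response"))
        if counts[(prompt, response)] > 1:
            issues.append((i, "exact duplicate"))

    # Agreement check for prompts labeled by more than one annotator.
    labels_by_prompt = defaultdict(set)
    for ex in examples:
        if "label" in ex:
            labels_by_prompt[ex["prompt"]].add(ex["label"])
    for prompt, labels in labels_by_prompt.items():
        if len(labels) > 1:
            issues.append((prompt, "annotators disagree on label"))
    return issues
```

Running a report like this before each fine-tuning run is cheap and tends to surface mislabeled, noisy, or duplicated examples before they reach training.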

Data diversity

If you over-train the model on a specific type of response, it will be biased toward giving that response even when it’s not the most appropriate answer. The rule of thumb here is to ensure, as best as you can, that the training data reflects how the model should behave in the real world.

  • Duplication: Duplicated examples have been found to cause model degradation in both fine-tuning and pre-training. Deduplicating the dataset to increase diversity often improves performance measures (see the sketch after this list).
  • Diversity in input: Paraphrase inputs to introduce diversity. While fine-tuning SQLCoder2, the team rephrased the plain text accompanying the SQL query to introduce syntactic and semantic diversity. Similarly, Instruction Backtranslation has been used on human-written texts to generate a Q/A dataset by asking an LLM, “What questions could this be an answer to?”
  • Diversity in dataset: When fine-tuning for a more general downstream task, for example multilingual adaptation, using diverse datasets has been shown to improve the learning-forgetting tradeoff between forgetting the model’s original capabilities and learning the new ones. Fine-tuned models for different languages such as Hindi and Odia have combined enriched language-specific datasets with other instruction fine-tuning datasets such as FLAN, Alpaca, and Dolly to induce diversity.
  • Standardized outputs: Removing white space and other formatting gimmicks from the outputs has been shown to help. SQLCoder2 strips white space from the generated SQL so the model focuses on learning important SQL concepts rather than gimmicks such as spacing and indenting. If you want a particular tone in your answers (e.g., “The helpdesk chatbot is ...”), add examples with that tone to the dataset.
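As referenced in the duplication and standardized-output bullets above, here is a minimal sketch combining exact-match deduplication with whitespace normalization of outputs; both steps are simplifications (near-duplicate detection in practice usually needs fuzzier matching, such as MinHash), and the field names are assumptions about your schema.

```python
# Sketch of two data-prep steps: exact-match deduplication and whitespace
# standardization of outputs. Near-duplicates usually need fuzzier matching
# (e.g., MinHash); field names are assumptions about the schema.
import hashlib
import re

def normalize_output(text: str) -> str:
    """Collapse runs of whitespace so the model learns content, not formatting."""
    return re.sub(r"\s+", " ", text).strip()

def deduplicate(examples):
    """Keep the first occurrence of each (prompt, normalized response) pair."""
    seen = set()
    kept = []
    for ex in examples:
        key = hashlib.sha256(
            (ex["prompt"].strip() + "\x00" + normalize_output(ex["response"])).encode("utf-8")
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append({**ex, "response": normalize_output(ex["response"])})
    return kept
```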

LLM-based data pipelines

To curate a high-quality diverse dataset, data pipelines often use LLMs to reduce the cost of annotation. The following techniques are observed in the wild:

  • Evaluation: Training a model on a high-quality dataset and using it to annotate your larger dataset, filtering for the high-quality examples (a minimal filtering sketch follows this list).
  • Generation: Seeding LLMs with high-quality examples and prompting to generate similar high-quality examples. Synthetic dataset best practices are starting to materialize.
  • Human in the loop: Using LLMs to generate an initial set of outputs and using humans to improve the quality by either editing or choosing preferences.
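As noted in the evaluation bullet above, here is a minimal sketch of LLM-assisted quality filtering; call_llm is a hypothetical stand-in for whatever inference client you use, and the rubric prompt and score threshold are placeholders to tune for your task.

```python
# Sketch of LLM-assisted quality filtering. `call_llm` is a hypothetical
# stand-in for your inference client (an API call or a local model); the
# rubric and threshold are placeholders, not a prescribed recipe.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("Plug in your LLM inference client here.")

RUBRIC = (
    "Rate the following training example from 1 (poor) to 5 (excellent) for "
    "correctness, clarity, and helpfulness. Reply with a single integer.\n\n"
    "Instruction: {prompt}\nResponse: {response}"
)

def filter_with_llm(examples, threshold=4):
    """Keep examples that an LLM judge scores at or above the threshold."""
    kept = []
    for ex in examples:
        reply = call_llm(RUBRIC.format(prompt=ex["prompt"], response=ex["response"]))
        try:
            score = int(reply.strip().split()[0])
        except (ValueError, IndexError):
            continue  # skip examples the judge could not score cleanly
        if score >= threshold:
            kept.append({**ex, "judge_score": score})
    return kept
```

The same scaffolding adapts to the generation and human-in-the-loop patterns: swap the rubric for a few-shot generation prompt, or route low-scoring examples to annotators instead of discarding them.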

Debugging your datasets

  • Evaluate your dataset for bad outputs: If the model still isn’t good at certain aspects, add training examples that directly show the model how to do these aspects correctly. If your model has grammar, logic, or style issues, check if your data has any of the same issues. For instance, if the model now says, “I will schedule this meeting for you” (when it shouldn’t), see if existing examples teach the model to say it can do new things that it can’t do.
  • Scrutinize balance of positive/negative classes: If 60% of the assistant responses in the data say, “I cannot answer this,” but at inference time only 5% of responses should say that, you will likely get an overabundance of refusals (see the sketch after this list).
  • Exhaustiveness and consistency: Make sure your training examples contain all of the information needed to produce the response. If we want the model to compliment a user based on their personal traits and a training example includes compliments for traits not found in the preceding conversation, the model may learn to hallucinate information. Make sure all of your training examples are in the same format you expect at inference time. Also look at agreement and consistency across the training examples: if multiple people created the training data, model performance will likely be limited by the level of agreement between them. For instance, in a text extraction task, if annotators only agreed on 70% of extracted snippets, the model would likely not be able to do better than that.
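To make the refusal-balance check concrete, here is a minimal sketch that compares the share of refusal-style responses in the training data with the rate expected at inference time; the refusal patterns and the expected rate are assumptions to adapt to your assistant.

```python
# Sketch of a positive/negative balance check: compare the fraction of
# refusal-style training responses with the rate expected at inference time.
# The refusal patterns and expected rate are assumptions to adapt.
REFUSAL_PATTERNS = ("i cannot answer", "i can't help with", "i'm unable to")

def refusal_rate(examples) -> float:
    refusals = sum(
        any(p in ex["response"].lower() for p in REFUSAL_PATTERNS)
        for ex in examples
    )
    return refusals / max(len(examples), 1)

def check_balance(examples, expected_rate=0.05, tolerance=0.05):
    observed = refusal_rate(examples)
    if abs(observed - expected_rate) > tolerance:
        print(
            f"Warning: {observed:.0%} of training responses are refusals, but "
            f"about {expected_rate:.0%} is expected at inference time; consider "
            "rebalancing the dataset."
        )
    return observed
```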

Conclusion

Fine-tuning is a crucial aspect of LLM development that requires a delicate balance between art and science. The quality and curation of datasets play a significant role in the success of fine-tuning, with small fine-tuned LLMs often outperforming larger models on specific tasks. The Llama fine-tuning guide provides a solid starting point once the decision to fine-tune has been made. The proprietary nature of dataset mixes for fine-tuning hinders the sharing of best practices and open source advancements. As the field continues to evolve, we anticipate the emergence of general best practices while maintaining the creative and adaptive nature of fine-tuning.

Acknowledgements

We would like to thank Suraj Subramanian and Varun Vontimitta for their constructive feedback on the organization and preparation of this blog post.


Written by:
Aditya Jain
Applied Research Scientist
Amir Maleki
Applied Research Scientist
Nathalie Saade
Applied Research Manager