This is the third blog post in a series about adapting open source large language models (LLMs). In this post, we explore some rules of thumb for curating a good training dataset.
Introduction
Fine-tuning LLMs is a mix of art and science, with best practices in the field still emerging. In this blog post, we highlight the key design variables for fine-tuning and give directional guidance on the best practices we’ve seen so far for fine-tuning models under resource constraints. We recommend using the information below as a starting point for strategizing your fine-tuning experiments.
Full fine-tuning vs. parameter-efficient fine-tuning (PEFT)
Both full fine-tuning and PEFT have shown improvements in downstream performance when applied to new domains in both academic and practical settings. Choosing between them comes down to the compute available (in GPU hours and GPU memory), performance on tasks other than the target downstream task (the learning-forgetting tradeoff), and human annotation costs.
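To make the compute comparison concrete, here is a rough back-of-envelope memory estimate, a sketch based on the common rule of thumb of roughly 16 bytes per trainable parameter for Adam-based mixed-precision training; the 7B model size and 1% adapter ratio are illustrative assumptions, not figures from this post.

```python
# Back-of-envelope GPU memory estimate (weights + gradients + optimizer
# state only; activations and KV caches are excluded). The ~16 bytes per
# trainable parameter figure is a common rule of thumb for Adam with mixed
# precision; the 7B model size and 1% adapter ratio are illustrative.

PARAMS = 7e9                  # assumed model size: 7B parameters
BYTES_PER_TRAINABLE = 16      # fp16 weights/grads + fp32 master weights + Adam moments
BYTES_PER_FROZEN = 2          # fp16 weights only, for the frozen base in PEFT

full_ft_gb = PARAMS * BYTES_PER_TRAINABLE / 1e9
adapter_params = 0.01 * PARAMS   # adapters are often ~0.1-1% of base params
peft_gb = (PARAMS * BYTES_PER_FROZEN + adapter_params * BYTES_PER_TRAINABLE) / 1e9

print(f"Full fine-tuning: ~{full_ft_gb:.0f} GB")   # ~112 GB for a 7B model
print(f"PEFT (LoRA-style): ~{peft_gb:.0f} GB")     # ~15 GB for the same model
```

Estimates like this explain why full fine-tuning of even a 7B model typically requires multiple GPUs, while PEFT can fit on a single accelerator.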
Full fine-tuning is more prone to two problems: model collapse and catastrophic forgetting. Model collapse occurs when the model’s outputs converge to a limited set of responses and the tail of the original content distribution disappears. Catastrophic forgetting, as discussed in Part 1 of this series, causes the model to lose capabilities it previously had. Some early empirical studies suggest that full fine-tuning is more prone to these issues than PEFT techniques, though more research is needed.
PEFT techniques serve as natural regularizers for fine-tuning by design. PEFT typically requires less compute to train a downstream model and is much more accessible in resource-constrained scenarios with limited dataset sizes. In some cases, full fine-tuning has shown better performance on the specific task of interest, often at the cost of forgetting some of the original model’s capabilities. This “learning-forgetting” tradeoff between performance on the specific downstream task and performance on other tasks is explored in depth in the comparison of LoRA and full fine-tuning in this paper.
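For illustration, here is a minimal LoRA sketch using Hugging Face’s peft library; the base model name, target modules, and hyperparameters are illustrative assumptions rather than recommendations from this post.

```python
# A minimal LoRA sketch with Hugging Face transformers + peft. The model
# name and hyperparameters below are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,                        # adapter scaling factor
    target_modules=["q_proj", "v_proj"],  # attach adapters to attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of base parameters
# Train `model` as usual; gradients flow only to the adapter weights,
# which is what gives PEFT its regularizing, compute-friendly character.
```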
Under resource constraints, PEFT techniques will likely give a better performance-boost-to-cost ratio than full fine-tuning. If downstream performance is of paramount importance, full fine-tuning will be the most effective. In either scenario, the key is to create a high-quality dataset, keeping the following principles in mind.
Dataset curation
Across the fine-tuning literature, the dataset has proven critical to reaping the benefits of fine-tuning. There is more nuance to it than “better quality and more examples”: you can invest intelligently in dataset collection to increase performance in resource-constrained fine-tuning experiments.
Data quality/quantity
Data diversity
Put simply, if you over-train the model with a specific type of response, it will be biased toward giving that response even when it’s not the most appropriate answer. The rule of thumb here is to ensure, as best you can, that the training data reflects how the model should behave in the real world.
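One way to catch this kind of skew before training is a quick audit of how often each response pattern appears. The sketch below assumes a JSONL training file with prompt/response fields; the file name and schema are hypothetical.

```python
# A minimal diversity audit: flag over-represented response openings in a
# fine-tuning dataset. The JSONL schema ({"prompt": ..., "response": ...})
# and the file name "train.jsonl" are assumed; adapt to your own data.
import json
from collections import Counter

def response_histogram(path, prefix_len=50):
    """Count how often each (normalized) response opening appears."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            example = json.loads(line)
            # Bucket by a lowercased prefix so trivial rephrasings still collide.
            counts[example["response"][:prefix_len].strip().lower()] += 1
    return counts

counts = response_histogram("train.jsonl")
total = sum(counts.values())
for prefix, n in counts.most_common(5):
    # Any single opening that dominates the dataset is a bias risk.
    print(f"{n / total:5.1%}  {prefix!r}")
```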
LLM-based data pipelines
To curate a high-quality, diverse dataset, data pipelines often use LLMs to reduce the cost of human annotation, for example by evaluating and filtering candidate examples, generating synthetic examples via few-shot prompting, or drafting annotations for human review.
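As a sketch of the evaluation-style technique, the snippet below uses an LLM as a judge to score and filter candidate examples; the `llm` callable, the rating prompt, and the threshold are hypothetical stand-ins for whatever judging model and rubric you actually use.

```python
# A minimal LLM-as-judge filter: score candidate training examples and keep
# only highly rated ones. `llm` is a hypothetical callable mapping a prompt
# string to a completion string; wire it to your judging model or API.

JUDGE_TEMPLATE = (
    "Rate the following instruction/response pair for correctness and "
    "helpfulness on a 1-5 scale. Reply with a single digit.\n\n"
    "Instruction: {prompt}\nResponse: {response}\nRating:"
)

def filter_examples(examples, llm, min_score=4):
    """Keep examples (dicts with 'prompt' and 'response') rated highly by the judge."""
    kept = []
    for ex in examples:
        reply = llm(JUDGE_TEMPLATE.format(prompt=ex["prompt"], response=ex["response"]))
        digits = [c for c in reply if c.isdigit()]
        score = int(digits[0]) if digits else 0  # treat unparseable replies as rejects
        if score >= min_score:
            kept.append(ex)
    return kept
```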
Debugging your datasets
Conclusion
Fine-tuning is a crucial aspect of LLM development that requires a delicate balance between art and science. The quality and curation of datasets play a significant role in the success of fine-tuning, with small fine-tuned LLMs often outperforming larger models on specific tasks. The Llama fine-tuning guide provides a solid starting point once the decision to fine-tune has been made. The proprietary nature of dataset mixes for fine-tuning hinders the sharing of best practices and open source advancements. As the field continues to evolve, we anticipate the emergence of general best practices while maintaining the creative and adaptive nature of fine-tuning.
Acknowledgements
We would like to thank Suraj Subramanian and Varun Vontimitta for their constructive feedback on the organization and preparation of this blog post.