Quick setup and how-to guide
Welcome to the getting started guide for Llama.
This guide provides information and resources to help you set up Llama including how to access the model, hosting, how-to and integration guides. Additionally, you will find supplemental materials to further assist you while building with Llama.
If you want to use Llama 2 on Windows, macOS, iOS, Android or in a Python notebook, please refer to the open source community on how they have achieved this. Here are some of the resources from open source that you can read more about: (Repo 1) (Repo 2) (Repo 3).
Getting the Models
a. Keep in mind that the links expire after 24 hours and a certain amount of downloads.
If you start seeing errors such as 403: Forbidden, you can always re-request a link.
Amazon Web Services
AWS provides multiple ways to host your Llama models. (SageMaker Jumpstart, EC2, Bedrock etc). In this document we are going to outline the steps to host your models using SageMaker Jumpstart and Bedrock. You can refer to other offerings directly on the AWS site.
Is a serverless GPU-powered inference on Cloudflare’s global network. It’s an AI inference as a service platform, empowering developers to run AI models with just a few lines of code. Learn more about Workers AI here and look at the documentation here to get started to use Llama 2 models here.
Google Cloud Platform (GCP) - Model Garden
GCP is a suite of cloud computing services that provides computing resources as well as virtual machines. Building on top of GCP services, Model Garden on Vertex AI offers infrastructure to jumpstart your ML project with a single place to discover, customize, and deploy a wide range of models. With more than 100 foundation models available to developers, you can deploy AI models with a few clicks as well as running fine-tuning tasks in Notebook in Google Colab.
You must first request a download using the same email address as your Hugging Face account. After doing so, you can request access to any of the models on Hugging Face and within 1-2 days your account will be granted access to all versions.
Kaggle is an online community of data scientists and machine learning engineers. It not only allows users to find datasets for building their AI models, but also allows users to search and discover hundreds of trained, ready-to-deploy machine learning models in one place. Moreover, community members can also publish their innovative works with these models as in Notebooks, backed by Google Cloud AI platform for computing resources and virtual machines.
We have collaborated with Kaggle to fully integrate Llama 2, offering pre-trained, chat and CodeLlama in various sizes. To download Llama 2 model artifacts from Kaggle, you must first request a download using the same email address as your Kaggle account. After doing so, you can request access to Llama 2 and Code Lama models. You get access to downloads once your request is processed.
Microsoft Azure & Windows
With Microsoft Azure you can access Llama 2 in one of two ways, either by downloading the Llama 2 model and deploying it on a virtual machine or using Azure Model Catalog.
HOW TO GUIDES
If you are looking to learn by writing code it's highly recommended to look into the Getting to Know Llama 2 - Jupyter notebook. It's a great place to start with most commonly performed operations on Llama 2.
Full parameter fine-tuning is a method that fine-tunes all the parameters of all the layers of the pre-trained model. In general, it can achieve the best performance but it is also the most resource-intensive and time consuming: it requires most GPU resources and takes the longest.
PEFT, or Parameter Efficient Fine Tuning, allows one to fine tune models with minimal resources and costs. There are two important PEFT methods: LoRA (Low Rank Adaptation) and QLoRA (Quantized LoRA), where pre-trained models are loaded to GPU as quantized 8-bit and 4-bit weights, respectively. It’s likely that you can fine-tune the Llama 2-13B model using LoRA or QLoRA fine-tuning with a single consumer GPU with 24GB of memory, and using QLoRA requires even less GPU memory and fine-tuning time than LoRA.
Typically, one should try LoRA, or if resources are extremely limited, QLoRA, first, and after the fine-tuning is done, evaluate the performance. Only consider full fine-tuning when the performance is not desirable.
Experiment tracking is crucial when evaluating various fine-tuning methods like LoRA, and QLoRA. It ensures reproducibility, maintains a structured version history, allows for easy collaboration, and aids in identifying optimal training configurations. Especially with numerous iterations, hyperparameters, and model versions at play, tools like Weights & Biases (W&B)become indispensable. With its seamless integration into multiple frameworks, W&B provides a comprehensive dashboard to visualize metrics, compare runs, and manage model checkpoints. It's often as simple as adding a single argument to your training script to realize these benefits - we’ll show an example in the Hugging Face PEFT LoRA section.
Recipes PEFT LoRA
The llama-recipes repo has details on different fine-tuning (FT) alternatives supported by the provided sample scripts. In particular, it highlights the use of PEFT as the preferred FT method, as it reduces the hardware requirements and prevents catastrophic forgetting. For specific cases, full parameter FT can still be valid, and different strategies can be used to still prevent modifying the model too much. Additionally, FT can be done in single gpu or multi-gpu with FSDP.
In order to run the recipes, follow the steps below:
Hugging Face PEFT LoRA (link)
Using Low Rank Adaption (LoRA) , Llama 2 is loaded to the GPU memory as quantized 8-bit weights.
Using the Hugging Face Fine-tuning with PEFT LoRA is super easy - an example fine-tuning run on Llama 2-7b using the OpenAssistant data set can be done in three simple steps:
This takes about 16 hours on a single GPU and uses less than 10GB GPU memory; changing batch size to 8/16/32 will use over 11/16/25 GB GPU memory. After the fine-tuning completes, you’ll see in a new directory named “output” at least adapter_config.json and adapter_model.bin - run the script below to infer with the base model and the new model, generated by merging the base model with the fined-tuned one:
QLoRA Fine Tuning
QLoRA (Q for quantized) is more memory efficient than LoRA. In QLoRA, the pretrained model is loaded to the GPU as quantized 4-bit weights. Fine-tuning using QLoRA is also very easy to run - an example of fine-tuning Llama 2-7b with the OpenAssistant can be done in four quick steps:
It takes about 6.5 hours to run on a single GPU, using 11GB memory of the GPU. After the fine-tuning completes and the output_dir specified in ./scripts/finetune_llama2_guanaco_7b.sh will have checkoutpoint-xxx subfolders, holding the fine-tuned adapter model files. To run inference, use the script below:
Axolotl is another open source library you can use to streamline the fine-tuning of Llama 2. A good example of using Axolotl to fine-tune Llama 2 with four notebooks covering the whole fine-tuning process (generate the dataset, fine-tune the model using LoRA, evaluate and benchmark) is here.
Quantization is a technique to represent the model weights which are usually in 32-bit floating numbers with lower precision data such as 16-bit float, 16-bit int, 8-bit int, or even 4/3/2-bit int. The benefits of quantization include smaller model size, faster fine-tuning and faster inference. In resource-constrained environments such as single-GPU or Mac or mobile edge devices ( e.g. https://github.com/ggerganov/llama.cpp), quantization is a must in order to fine-tune the model or run the inference. More information about quantization is available here & here.
Prompt engineering is a technique used in natural language processing (NLP) to improve the performance of the language model by providing them with more context and information about the task in hand. It involves creating prompts, which are short pieces of text that provide additional information or guidance to the model, such as the topic or genre of the text it will generate. By using prompts, the model can better understand what kind of output is expected and produce more accurate and relevant results. In Llama 2 the size of the context, in terms of number of tokens, has doubled from 2048 to 4096.
Crafting Effective Prompts
Crafting effective prompts is an important part of prompt engineering. Here are some tips for creating prompts that will help improve the performance of your language model:
Be clear and concise: Your prompt should be easy to understand and provide enough information for the model to generate relevant output. Avoid using jargon or technical terms that may confuse the model.
Use specific examples: Providing specific examples in your prompt can help the model better understand what kind of output is expected. For example, if you want the model to generate a story about a particular topic, include a few sentences about the setting, characters, and plot.
Vary the prompts: Using different prompts can help the model learn more about the task at hand and produce more diverse and creative output. Try using different styles, tones, and formats to see how the model responds.
Test and refine: Once you have created a set of prompts, test them out on the model to see how it performs. If the results are not as expected, try refining the prompts by adding more detail or adjusting the tone and style.
Use feedback: Finally, use feedback from users or other sources to continually improve your prompts. This can help you identify areas where the model needs more guidance and make adjustments accordingly.
Role Based Prompts
Creating prompts based on the role or perspective of the person or entity being addressed. This technique can be useful for generating more relevant and engaging responses from language models.
Improves relevance: Role-based prompting helps the language model understand the role or perspective of the person or entity being addressed, which can lead to more relevant and engaging responses.
Increases accuracy: Providing additional context about the role or perspective of the person or entity being addressed can help the language model avoid making mistakes or misunderstandings.
Requires effort: Requires more effort to gather and provide the necessary information about the role or perspective of the person or entity being addressed.
You are a virtual tour guide currently walking the tourists Eiffel Tower on a night tour. Describe Eiffel Tower to your audience that covers its history, number of people visiting each year, amount of time it takes to do a full tour and why do so many people visit this place each year.
Chain of Thought Technique
Involves providing the language model with a series of prompts or questions to help guide its thinking and generate a more coherent and relevant response. This technique can be useful for generating more thoughtful and well-reasoned responses from language models.
Improves coherence: Helps the language model think through a problem or question in a logical and structured way, which can lead to more coherent and relevant responses.
Increases depth: Providing a series of prompts or questions can help the language model explore a topic more deeply and thoroughly, potentially leading to more insightful and informative responses.
Requires effort: The chain of thought technique requires more effort to create and provide the necessary prompts or questions.
You are a virtual tour guide from 1901. You have tourists visiting Eiffel Tower. Describe Eiffel Tower to your audience. Begin with
1. Why it was built
2. Then by how long it took them to build
3. Where were the materials sourced to build
4. Number of people it took to build
5. End it with the number of people visiting the Eiffel tour annually in the 1900's, the amount of time it completes a full tour and why so many people visit this place each year.
Make your tour funny by including 1 or 2 funny jokes at the end of the tour.
Meta’s Responsible Use Guide is a great resource to understand how best to prompt and address input/output risks of the language model. Refer to pages (14-17).
Here are some examples of how a language model might hallucinate and some strategies for fixing the issue:
A language model is asked to generate a response to a question about a topic it has not been trained on. The language model may hallucinate information or make up facts that are not accurate or supported by evidence.
A language model is asked to generate a response to a question that requires a specific perspective or point of view. The language model may hallucinate information or make up facts that are not consistent with the desired perspective or point of view.
A language model is asked to generate a response to a question that requires a specific tone or style. The language model may hallucinate information or make up facts that are not consistent with the desired tone or style.
Overall, the key to avoiding hallucination in language models is to provide them with clear and accurate information and context, and to carefully monitor their responses to ensure that they are consistent with your expectations and requirements.
Prompting with Llama
Multi Turn User Prompt
Here are some great resources to get started with Inferencing with your LLMs.
Llama RecipesGithub Llama recipes
Page Attention vLLMLearn moreRecipe examples
Hugging Face TGILearn moreRecipe examples
Cuda GraphsLearn more
As the saying goes, if you can't measure it, you can't improve it. In this section, we are going to cover different ways to measure and ultimately validate Llama so it's possible to determine the improvements provided by different fine tuning techniques.
The focus of these techniques is to gather objective metrics that can be compared easily during and after each fine tuning run, to provide quick feedback on whether the model is performing. The main metrics collected are loss and perplexity.
This method consists in dividing the dataset into k subsets or folds, and then fine tuning the model k times. On each run, a different fold is used as a validation dataset, using the rest for training. The performance results of each run are averaged out for the final report. This provides a more accurate metric of the performance of the model across the complete dataset, as all entries serve both for validation and training. While it produces the most accurate prediction on how a model is going to generalize after fine tuning on a given dataset, it is computationally expensive and better suited for small datasets.
When using a holdout, the dataset is split into two or three subsets, training and validation with test as optional. The test and validation sets can represent 10% - 30% of the dataset each. As the name implies, the first two subsets are used for training and validating the model during fine tuning, while the third is used only after fine tuning is complete to evaluate how well the model generalizes on data it has not seen in either phase. The advantage of having three partitions is that it provides a way to evaluate the model after fine-tuning for an unbiased view into the model performance, but it requires a slightly bigger dataset to allow for a proper split. This is currently implemented in the Llama recipes fine tuning script with two subsets of the dataset, train and validation. The data is collected in a json file that can be plotted to easily interpret the results and evaluate how the model is performing.
Standard Evaluation tools
There are multiple projects that provide standard evaluation. They provide predefined tasks with commonly used metrics to evaluate the performance of LLMs, like HellaSwag and ThrouthfulQA. These tools can be used to test if the model has degraded after fine tuning. Additionally, a custom task can be created using the dataset intended to fine-tune the
model, effectively automating the manual verification of the model performance before and after fine tuning. These types of projects provide a quantitative way of looking at the models performance in simulated real world examples. Some of these projects include the LM Evaluation Harness (used to create the HF leaderboard), HELM, BIG-bench and OpenCompass.
Interpreting Loss and Perplexity
The loss value used comes from the transformer's LlamaForCausalLM, which initializes a different loss function depending on the objective required from the model. The objective of this section is to give a brief overview on how to understand the results from loss and perplexity as an initial evaluation of the model performance during fine tuning. We also calculate the perplexity as an exponentiation of the loss value. Additional information on loss functions can be found in these resources: 1, 2, 3, 4, 5, 6.
In our recipes, we use a simple holdout during fine tuning. Using the logged loss values, both for train and validation dataset, the curves for both are plotted to analyze the results of the process. Given the setup in the recipe, the expected behavior is a log graph that shows a diminishing train and validation loss value as it progresses.
If the validation curve starts going up while the train curve continues decreasing, the model is overfitting and it's not generalizing well. Some alternatives to test when this happens are early stopping, verifying the validation dataset is a statistically significant equivalent of the train dataset, data augmentation, using parameter efficient fine tuning or using k-fold cross-validation to better tune the hyperparameters.
Manually evaluating a fine tuned model will vary according to the FT objective and available resources. Here we provide general guidelines on how to accomplish it.
With a dataset prepared for fine tuning, a part of it can be separated into a manual test subset, which can be further increased with general knowledge questions that might be relevant to the specific use case. In addition to these general questions, we recommend executing standard evaluations as well, and compare the results with the baseline for the fine tuned model.
To rate the results, a clear evaluation criteria should be defined that is relevant to the dataset being used. Example criteria can be accuracy, coherence and safety. Create a rubric for each criteria and define what would be required for an output to receive a specific score.
With these guidelines in place, distribute the test questions with a diverse set of reviewers to have multiple data points for each question. With multiple data points for each question and different criteria, a final score can be calculated for each query, allowing for weighting the scores based on the preferred focus for the final model.
Code Llama is an open-source family of LLMs based on Llama 2 providing SOTA performance on code tasks. It consists of:
The different stages of fine-tuning annotated with the number of tokens seen during training. Source
One of the best ways to try out and integrate with Code Llama is using Hugging Face ecosystem by following the blog, which has:
If the model does not perform well on your specific task, for example if none of the Code Llama models (7B/13B/34B) generate the correct answer for a text to SQL task, fine-tuning should be considered. This is a complete guide and notebook on how to fine-tune Code Llama using the 7B model hosted on Hugging Face. It uses the LoRA fine-tuning method and can run on a single GPU.
As shown in the Code Llama References (here), fine-tuning improves the performance of Code Llama on SQL code generation, and it can be critical that LLMs are able to interoperate with structured data and SQL, the primary way to access structured data - we are developing demo apps in LangChain and RAG with Llama 2 to show this.
In most of the cases, the simplest method to integrate any model size is through ollama, occasionally combined with litellm. Ollama is a program that allows quantized versions of popular LLMs to run locally. It leverages the GPU and can even run Code Llama 34B on an M1 mac. Litellm is a simple proxy that can serve an OpenAI style API, so it's easy to replace OpenAI in existing applications, in our case, extensions.
This extension can be used with ollama, allowing for easy local only execution. Additionally, it provides a simple interface to 1/ Chat with the model directly running inside VS Code and 2/ Select specific files and sections to edit or explain.
This extension is an effective way to evaluate Llama because it provides simple and useful features. It also allows developers to build trust, by creating diffs for each proposed change and showing exactly what is being changed before saving the file. Handling the context for the LLM is easy and relies heavily on keyboard shortcuts. It's important to note that all the interactions with the extension are recorded in jsonl format. The objective is to provide data for future fine tuning of the models based on the feedback recorded during real world usage as well.
This extension from Hugging Face provides an open alternative to the closed sourced GitHub Copilot, allowing for the same functionality, context based autocomplete suggestions, to work with open source models. It works out of the box with a HF Token and their Inference API but can be configured to use any TGI compatible API. For usage with a self-hosted TGI server, follow these steps:
LangChain is an open source framework for building LLM powered applications. It implements common abstractions and higher-level APIs to make the app building process easier, so you don't need to call LLM from scratch. The main building blocks/APIs of LangChain are:
The Models or LLMs API can be used to easily connect to all popular LLMs such as Hugging Face or Replicate where all types of Llama 2 models are hosted.
LangChain can be used as a powerful retrieval augmented generation (RAG) tool to integrate the internal data or more recent public data with LLM to QA or chat about the data. LangChain already supports loading many types of unstructured and structured data.
To learn more about LangChain, enroll for free in the two LangChain short courses. Be aware that the code in the courses use OpenAI ChatGPT LLM, but we've published a series of demo apps using LangChain with Llama 2.
There is also a Getting to Know Llama notebook, presented at Meta Connect 2023.
LlamaIndex is another popular open source framework for building LLM applications. Like LangChain, LlamaIndex can also be used to build RAG applications by easily integrating data not built-in the LLM with LLM. There are three key tools in LlamaIndex:
Return data-augmented answer
LlamaIndex is mainly a data framework for connecting private or domain-specific data with LLMs, so it specializes in RAG, smart data storage and retrieval, while LangChain is a more general purpose framework which can be used to build agents connecting multiple tools. The integration of the two may provide the best performant and effective solution to building real world RAG powered Llama apps.
For an example usage of how to integrate LlamaIndex with Llama 2, see here. We also published a completed demo app showing how to use LlamaIndex to chat with Llama 2 about live data via the you.com API.
It’s worth noting that LlamaIndex has implemented many RAG powered LLM evaluation tools to easily measure the quality of retrieval and response, including:
COMMUNITY SUPPORT AND RESOURCES
If you have any feature requests, suggestions, bugs to report we encourage you to report the issue in the respective github repository. If you are working in partnership with Meta on Llama 2 please request access to Asana and report any issues using Asana.
Performance & Latency
We value your feedback
Help us improve Llama by submitting feedback, suggestions, or reporting bugs.Submit feedback