Llama 2

Troubleshooting & FAQ


General

  1. We received unprecedented interest in the Llama 1 model we released for the research community – more than 100,000 individuals and organizations have applied for access to Llama 1 and tens of thousands are now using it to innovate. After external feedback, fine tuning, and extensive safety evaluations, we made the decision to release the next version of Llama more broadly.
  2. Llama 2 is also available under a permissive commercial license, whereas Llama 1 was limited to non-commercial use.
  3. Llama 2 is capable of processing longer prompts than Llama 1 and is also designed to work more efficiently.
  4. For Llama 2, we’re pairing the release of our pretrained models with versions fine-tuned for helpfulness and safety. Sharing fine-tuned versions makes it easier to use our models while also improving safety performance.

On a limited, case-by-case basis, we will consider bespoke licensing requests from individual entities. Please contact llama2@meta.com to provide more details about your request.

  • A combination of sources is used for training, including information that is publicly available online and annotated data.
  • Llama 2 is not trained on Meta user data.

We believe developers will have plenty to work with as we release our model weights and starting code for pre-trained and conversational fine-tuned versions as well as responsible use resources. While data mixes are intentionally withheld for competitive reasons, all models have gone through Meta’s internal Privacy Review process to ensure responsible data usage in building our products. We are dedicated to the responsible and ethical development of our genAI products, ensuring our policies reflect diverse contexts and meet evolving societal expectations.

Yes. There are more details about our use of human annotators in the research paper.

It's correct that the license restricts using any part of the Llama 2 models, including the response outputs, to train another AI model (LLM or otherwise). However, you can use the outputs to further train the Llama 2 family of models itself. Techniques such as Quantization-Aware Training (QAT) rely on this and are therefore allowed.

4096 tokens. If you want to use longer sequences, you will need to fine-tune the model so that it supports them. More information and examples on fine-tuning can be found in the Llama Recipes repository.
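
As a quick sanity check, here is a minimal sketch (assuming the Hugging Face `transformers` tokenizer; the checkpoint name is illustrative) for verifying that a prompt fits in the 4096-token window:

```python
# Minimal sketch: count tokens before sending a prompt to the model.
# Assumes the Hugging Face "transformers" package; checkpoint name is illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
prompt = "Summarize the following document: ..."
n_tokens = len(tokenizer(prompt)["input_ids"])
if n_tokens > 4096:
    print(f"Prompt is {n_tokens} tokens and exceeds the 4096-token context window.")
```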

The Llama models released thus far have focused mainly on English. We are looking at true multilinguality for the future, but for now there are many community projects that fine-tune Llama models to support other languages.

Linux is the only OS currently supported by this repo.

download.sh: 14: [[: not found

A: This error appears when the script is run with a shell that does not support bash's `[[` test syntax (for example, `sh download.sh` on systems where `sh` is not bash). Make sure the script is executable and run it directly as follows:

./download.sh

HTTP request sent, awaiting response... 400 Bad Request

A: This occurs when the download URL has not been copied correctly. If you right-click the link and copy it, some mail clients add a URL-defense wrapper to the copied link. To avoid this problem, select the URL text manually and copy it.

The model was primarily trained on English, with a small amount of additional data from 27 other languages (for more information, see Table 10 on page 20 of the Llama 2 paper). We do not expect the same level of performance in these languages as in English. You’ll find the full list of languages referenced in the research paper. You can look at some of the community-led projects that fine-tune Llama 2 models to support other languages (e.g., link).

The vanilla model shipped in the repository does not run on Windows or macOS out of the box. There are some community-led projects that support running Llama on Mac, Windows, iOS, Android, or anywhere (e.g., llama.cpp, MLC LLM, and Llama 2 Everywhere). You can also find a workaround at this issue based on Llama 2 fine-tuning.

Some differences between the two models include:

  1. Llama 1 was released in 7, 13, 33, and 65 billion parameter sizes, while Llama 2 comes in 7, 13, and 70 billion parameter sizes
  2. Llama 2 was trained on 40% more data
  3. Llama 2 has double the context length
  4. Llama 2 was fine-tuned for helpfulness and safety
  5. Please review the research paper and model cards (Llama 2 model card, Llama 1 model card) for more differences.

Details on how to access the models are available on our website link. Please note that the models are subject to the Acceptable Use Policy and the provided Responsible Use Guide. See the “Accessing Llama 2 Models” section of this document for more information on how to get access to the models.

  1. Models are available through multiple sources, but the place to start is https://ai.meta.com/llama/
  2. Model code, quickstart guide, and fine-tuning examples are available through our GitHub Llama repository. Model weights are available through an email link after the user submits a sign-up form.
  3. Models are also being hosted by Microsoft, Amazon Web Services, and Hugging Face, and may also be available through other hosting providers in the future.

  1. Llama 2 is broadly available to developers and licensees through a variety of hosting providers and on the Meta website.
  2. Llama 2 is licensed under the Llama 2 Community License Agreement, which provides a permissive license to the models along with certain restrictions to help ensure that the models are being used responsibly.

Hardware requirements vary based on latency, throughput, and cost constraints. For good latency, we split models across multiple GPUs with tensor parallelism in a machine with NVIDIA A100s or H100s. But TPUs, other types of GPUs, or even commodity hardware can also be used to deploy these models (e.g., llama.cpp, MLC LLM).

Only the 70B model uses grouped-query attention (GQA) for more efficient inference.

Llama 2 is an auto-regressive language model, built on the transformer architecture. Llama 2 functions by taking a sequence of words as input and predicting the next word, recursively generating text.
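
To make "recursively generating text" concrete, here is a minimal greedy-decoding sketch (assuming the Hugging Face `transformers` library; the checkpoint name is illustrative):

```python
# Minimal sketch of auto-regressive (greedy) decoding: the model repeatedly
# predicts the next token and appends it to its own input.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
for _ in range(20):  # generate at most 20 new tokens
    logits = model(input_ids).logits              # scores for every vocabulary token
    next_id = logits[0, -1].argmax()              # greedy: take the most likely next token
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)
    if next_id.item() == tokenizer.eos_token_id:  # stop at end-of-sequence
        break
print(tokenizer.decode(input_ids[0]))
```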

The vanilla Llama model does not; however, the Code Llama models have been trained with fill-in-the-middle completion to assist with tasks like code completion.
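
For reference, a hedged sketch of the fill-in-the-middle prompt format described in the Code Llama paper (the `<PRE>`/`<SUF>`/`<MID>` sentinel tokens; the code snippet itself is illustrative, and none of this applies to the base Llama 2 models):

```python
# Illustrative fill-in-the-middle prompt for a Code Llama infilling model.
# The model is asked to generate the span between the prefix and the suffix.
prefix = "def fibonacci(n):\n    "
suffix = "\n    return result"
infill_prompt = f"<PRE> {prefix} <SUF>{suffix} <MID>"
# The model generates the missing middle and signals completion with an <EOT> token.
```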

This is implementation-dependent (i.e., it depends on the code used to run the model).

The model itself supports these parameters, but whether they are exposed or not depends on implementation.
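
Assuming the question refers to sampling parameters such as temperature, top-p, and top-k, the Hugging Face `transformers` implementation, for example, exposes them through `generate()` (the parameter values below are illustrative, not recommendations):

```python
# Minimal sketch: sampling parameters as exposed by one implementation.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

inputs = tokenizer("Write a haiku about llamas.", return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,    # sample instead of greedy decoding
    temperature=0.7,   # < 1.0 sharpens the distribution, > 1.0 flattens it
    top_p=0.9,         # nucleus sampling: smallest token set with cumulative prob >= 0.9
    top_k=50,          # consider only the 50 most likely tokens
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```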

There are many ways to use RAG with Llama. The most popular libraries are LangChain and LlamaIndex, and many of our developers have used them successfully with Llama 2. See the LangChain and LlamaIndex documentation for examples.
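
Under the hood, both libraries follow the same retrieve-then-prompt pattern; here is a minimal library-agnostic sketch (assuming the `sentence-transformers` package for embeddings; the documents and model names are illustrative):

```python
# Minimal RAG sketch: embed documents, retrieve the most relevant one,
# and prepend it to the prompt sent to Llama 2.
from sentence_transformers import SentenceTransformer, util

docs = [
    "Llama 2 has a 4096-token context window.",
    "Llama 2 comes in 7B, 13B, and 70B parameter sizes.",
]
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = embedder.encode(docs, convert_to_tensor=True)

question = "How long can a Llama 2 prompt be?"
q_vec = embedder.encode(question, convert_to_tensor=True)
best = int(util.cos_sim(q_vec, doc_vecs).argmax())  # index of the most similar document

prompt = (
    "Answer the question using only the context below.\n"
    f"Context: {docs[best]}\n"
    f"Question: {question}\nAnswer:"
)
# `prompt` would then be passed to a Llama 2 model for generation.
```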

You can find steps on how to set up an EC2 instance in the AWS section here.

The AWS section has some insights on instance size that you can start with.

This depends on your application. The Llama 2 pre-trained models were trained for general large language applications, whereas the Llama 2 chat models were fine-tuned for dialogue-specific uses like chatbots. You should review the model card and research paper for more information on the models, as this will help you decide which to use.

This error can be caused by a number of different factors, including the model size being too large, inefficient memory usage, and so on. Some of the steps below have been known to help with this issue, but you might need to do some troubleshooting to figure out the exact cause of your issue. A short sketch of these steps follows the list below.

  1. Ensure your GPU has enough memory
  2. Reduce the `batch_size`
  3. Lower the precision
  4. Clear the cache
  5. Modify the model/training
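
A minimal sketch of steps 1–4 above (PyTorch / Hugging Face `transformers`; the checkpoint name is illustrative):

```python
# Minimal sketch of the memory-saving steps above.
import torch
from transformers import AutoModelForCausalLM

free, total = torch.cuda.mem_get_info()  # step 1: check available GPU memory
print(f"{free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")

torch.cuda.empty_cache()                 # step 4: release cached allocations

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,           # step 3: half precision halves weight memory
    device_map="auto",                   # spread layers across available devices
)
batch_size = 1                           # step 2: start small, scale up as memory allows
```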

If multiple calls are necessary then you could look into the following:

  1. Optimize inference so each call has less latency.
  2. Merge the calls into fewer calls. For example, summarize the data and use the summary.
  3. Possibly utilize Llama 2 function calling.
  4. Consider fine tuning the model with the updated data.

Special attention was paid to safety while fine-tuning the Llama 2 chat models. The Llama 2 chat models scored better than Falcon and MPT on the TruthfulQA and ToxiGen benchmarks. More information can be found in Section 4 of the Llama 2 paper.

Fine tuning

You can find examples on how to fine tune the Llama 2 models in the Llama Recipes repository.

You can adapt the finetuning script found here for pre-training. You can also find the hyperparams used for pretraining in Section 2 of the Llama 2 paper.

Developers may fine-tune Llama 2 models for languages beyond English provided they comply with the Llama 2 Community License and the Acceptable Use Policy.

Although prompts cannot eliminate hallucinations completely, they can reduce them significantly. Techniques like chain-of-thought, instruction-based, n-shot, and few-shot prompting can help, depending on your application. Additionally, prompting the models to back up their responses by verifying against factual data sets, or asking the models to provide the source of their information, can help as well. Overall, fine-tuning should also be helpful for reducing hallucination.
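
As an illustration, here is a few-shot prompt that combines an instruction, a worked example with reasoning (chain-of-thought style), and a request to cite the source (the example content is illustrative):

```python
# Illustrative few-shot + chain-of-thought prompt that asks the model to cite
# the supporting sentence, which discourages unsupported answers.
prompt = """Answer the question and quote the sentence from the context that supports it.

Context: Llama 2 was trained on 2 trillion tokens of publicly available data.
Question: How much data was Llama 2 trained on?
Reasoning: The context states the corpus size directly.
Answer: 2 trillion tokens (source: "trained on 2 trillion tokens").

Context: {context}
Question: {question}
Reasoning:"""
```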

Fine-tuning requirements also vary based on the amount of data, time to complete fine-tuning, and cost constraints. To fine-tune these models we have generally used multiple NVIDIA A100 machines, with data parallelism across nodes and a mix of data and tensor parallelism within each node. But using a single machine or other GPU types is definitely possible (e.g., Alpaca models were trained on a single RTX 4090: https://github.com/tloen/alpaca-lora).

The Llama 2 fine-tuned models were fine tuned for dialogue specific uses like chat bots.

You can find example fine-tuning scripts in the GitHub recipes repository.
You can also review the fine-tuning section in our Getting started with Llama guide.

The Llama 2 pre-trained models were trained for general large language applications, whereas the Llama 2 chat models were fine tuned for dialogue specific uses like chat bots.

You can find the Llama 2 model card here.

You can find some best practices in the fine-tuning section in our Getting started with Llama guide.

LoRA has made fine-tuning LLMs like Llama 2 possible on consumer GPUs (like the Tesla T4) by retraining only a very small subset of model parameters, democratizing LLM fine-tuning while still reaching performance comparable to fine-tuning the whole model on much more expensive GPUs. So LoRA is essential and very effective in Llama 2 fine-tuning.
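
A minimal LoRA setup sketch, assuming the Hugging Face `peft` library (the hyperparameter values are illustrative starting points, not tuned recommendations):

```python
# Minimal sketch: wrap a Llama 2 checkpoint with LoRA adapters so that only
# the small low-rank update matrices are trained.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling applied to the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # typically well under 1% of all weights
```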

a. It depends on the application and the type of data being fine-tuned on, but the evaluation needs to go beyond the normal harness eval sets to something that makes sense for the application. For example, for something like SQL data, running the generated code might be a better eval. Essentially, having truthful data for the specific application can help reduce the risk for that application.

b. Also, setting some sort of threshold, such as requiring output probability > 90%, might be helpful to get more confidence in the output (a short sketch follows).
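
A hedged sketch of that thresholding idea (Hugging Face `transformers` API; the 0.9 threshold and checkpoint name are illustrative):

```python
# Minimal sketch: accept a generation only if every token was high-confidence.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

inputs = tokenizer("SELECT name FROM users WHERE", return_tensors="pt")
gen = model.generate(**inputs, max_new_tokens=16,
                     return_dict_in_generate=True, output_scores=True)
token_probs = [torch.softmax(s[0], dim=-1).max().item() for s in gen.scores]
accept = min(token_probs) > 0.9  # reject outputs containing any low-confidence token
```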

You can find some fine-tuning recommendations in the Llama 2 GitHub recipes repository, as well as in the fine-tuning section of our Getting started with Llama guide.

The best approach is to review the LoRA research paper for more information on rank choices, then review similar implementations for other models, and finally experiment.

Take a look at the fine tuning section in our Getting started with Llama guide for some pointers towards fine tuning.

Prompting

You can find some helpful information here: Prompting, LangChain and RAG.

Legal

  1. This is a bespoke commercial license that balances open access to the models with responsibility and protections in place to help address potential misuse.
  2. Our license allows for broad commercial use, as well as for developers to create and redistribute additional work on top of Llama 2.
  3. We want to enable more innovation in both research and commercial use cases, but believe in taking a responsible approach to releasing AI technologies.
  4. For more details, our license can be found here.

The model is trained on a subset of publicly available text-based datasets.

Benchmarking

Yes, we will publish benchmarks alongside the release. If there are particular benchmarks partners are interested in, it may be possible to share some under NDA earlier.