On a limited, case-by-case basis, we will consider bespoke licensing requests from individual entities. Please contact email@example.com with more details about your request.
We believe developers will have plenty to work with as we release our model weights and starting code for pre-trained and conversational fine-tuned versions as well as responsible use resources. While data mixes are intentionally withheld for competitive reasons, all models have gone through Meta’s internal Privacy Review process to ensure responsible data usage in building our products. We are dedicated to the responsible and ethical development of our genAI products, ensuring our policies reflect diverse contexts and meet evolving societal expectations.
It's correct that the license restricts using any part of the Llama 2 models, including the response outputs, to train another AI model (LLM or otherwise). However, one can use the outputs to further train the Llama 2 family of models. Techniques such as Quantization Aware Training (QAT) use the outputs in this way, and hence this is allowed.
4096. If you want to use more tokens, you will need to fine-tune the model so that it supports longer sequences. More information and examples on fine-tuning can be found in the Llama Recipes repository.
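As a rough illustration of working within the 4096-token window, the sketch below trims a prompt so that the prompt plus the requested generation length still fits. The function name `fit_to_context` and the integer stand-ins for token IDs are hypothetical; a real application would use the tokenizer that ships with the model.

```python
MAX_CONTEXT = 4096  # Llama 2 context length in tokens, per the answer above


def fit_to_context(token_ids, max_new_tokens, max_context=MAX_CONTEXT):
    """Keep the most recent tokens so prompt + generation fits the window.

    Hypothetical helper for illustration only.
    """
    if max_new_tokens >= max_context:
        raise ValueError("max_new_tokens must be smaller than the context window")
    budget = max_context - max_new_tokens  # tokens left for the prompt
    return token_ids[-budget:]  # drop the oldest tokens if the prompt is too long


prompt = list(range(5000))  # stand-in for real token IDs
trimmed = fit_to_context(prompt, max_new_tokens=256)
print(len(trimmed))  # 3840
```

Dropping the oldest tokens is only one possible policy; summarizing or chunking the input are alternatives, depending on the application.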
The Llama models thus far have mainly focused on the English language. We are looking at true multilingual support for the future, but for now there are many community projects that fine-tune Llama models to support other languages.
Linux is the only OS currently supported by this repo.
A: Make sure to run the command as follows
A: The issue occurs when the URL is not copied correctly. If you right-click the link and copy it, the link may be copied with a URL defense wrapper. To avoid this problem, select the URL text manually and copy it.
The model was primarily trained on English with a bit of additional data from 27 other languages (for more information, see Table 10 on page 20 of the Llama 2 paper). We do not expect the same level of performance in these languages as in English. You’ll find the full list of languages referenced in the research paper. You can look at some of the community-led projects to fine-tune Llama 2 models to support other languages (e.g., link).
The vanilla model shipped in the repository does not run on Windows or macOS out of the box. There are some community-led projects that support running Llama on Mac, Windows, iOS, Android, or anywhere (e.g., llama.cpp, MLC LLM, and Llama 2 Everywhere). You can also find a workaround at this issue based on Llama 2 fine-tuning.
Some differences between the two models include:
Hardware requirements vary based on latency, throughput, and cost constraints. For good latency, we split models across multiple GPUs with tensor parallelism in a machine with NVIDIA A100s or H100s. However, TPUs, other types of GPUs, or even commodity hardware can also be used to deploy these models (e.g., llama.cpp, MLC LLM).
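To make the idea of tensor parallelism concrete, here is a toy sketch in plain Python that splits a weight matrix by columns into shards, multiplies each shard separately (as each GPU would), and concatenates the partial results. All function names are hypothetical, and this is not how the reference code distributes the model; it only shows why the sharded result matches the unsharded one.

```python
def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, parts):
    """Split the columns of w into `parts` equal shards, one per device."""
    cols = len(w[0])
    if cols % parts != 0:
        raise ValueError("column count must be divisible by the number of shards")
    step = cols // parts
    return [[row[k * step:(k + 1) * step] for row in w] for k in range(parts)]


def tensor_parallel_matmul(x, w, parts):
    # Each shard would live on its own GPU; here we simply loop over them.
    outputs = [matmul(x, shard) for shard in split_columns(w, parts)]
    # Concatenating the partial outputs stands in for an all-gather step.
    return [value for out in outputs for value in out]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
print(tensor_parallel_matmul(x, w, parts=2))  # [11.0, 14.0, 17.0, 20.0]
```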
Only the 70B model uses grouped-query attention (GQA) for more efficient inference.
Llama 2 is an auto-regressive language model, built on the transformer architecture. Llama 2 functions by taking a sequence of words as input and predicting the next word, recursively generating text.
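The generation loop described above can be sketched with a toy stand-in for the model. Here `toy_next_token` is a hypothetical lookup table rather than a transformer, and the `<eos>` marker is an assumed end-of-sequence token; the point is only the loop structure, in which each predicted token is appended to the input before the next prediction.

```python
def generate(next_token, prompt, max_new_tokens, eos="<eos>"):
    """Repeatedly predict the next token and append it to the sequence."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(tokens)  # the model sees everything generated so far
        if token == eos:
            break
        tokens.append(token)
    return tokens


# Hypothetical "model": maps the last token to a fixed next token.
BIGRAMS = {"the": "cat", "cat": "sat", "sat": "down", "down": "<eos>"}


def toy_next_token(tokens):
    return BIGRAMS.get(tokens[-1], "<eos>")


print(generate(toy_next_token, ["the"], max_new_tokens=10))
# ['the', 'cat', 'sat', 'down']
```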
The vanilla Llama model does not. However, the Code Llama models have been trained with fill-in-the-middle completion to assist with tasks like code completion.
This is implementation dependent (i.e. the code used to run the model).
The model itself supports these parameters, but whether they are exposed or not depends on implementation.
This depends on your application. The Llama 2 pre-trained models were trained for general large language applications, whereas the Llama 2 chat models were fine-tuned for dialogue-specific uses like chatbots. You should review the model card and research paper for more information on the models, as this will help you decide which to use.
This error can be caused by a number of different factors, including the model being too large, inefficient memory usage, and so on. Some of the steps below have been known to help with this issue, but you might need to do some troubleshooting to figure out the exact cause in your case.
If multiple calls are necessary then you could look into the following:
Special attention was paid to safety while fine-tuning the Llama 2 chat models. The Llama 2 chat models scored better than Falcon and MPT on the TruthfulQA and ToxiGen benchmarks. More information can be found in Section 4 of the Llama 2 paper.
Developers may fine-tune Llama 2 models for languages beyond English provided they comply with the Llama 2 Community License and the Acceptable Use Policy.
Although prompts cannot eliminate hallucinations completely, they can reduce them significantly. Techniques such as chain-of-thought, instruction-based, N-shot, and few-shot prompting can help, depending on your application. Additionally, prompting the model to back up its responses by verifying them against factual data sets, or asking it to provide the source of its information, can help as well. Overall, fine-tuning should also be helpful for reducing hallucinations.
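As one illustration of few-shot prompting, the sketch below assembles an instruction, a handful of worked examples, and the new question into a single prompt string. The `Q:`/`A:` layout and the function name `build_few_shot_prompt` are assumptions made for this example, not a format required by Llama 2.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and a new query into one prompt."""
    lines = [instruction, ""]
    for question, answer in examples:
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer}")
        lines.append("")
    lines.append(f"Q: {query}")
    lines.append("A:")  # left open for the model to complete
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Answer with the capital city only. Say 'unknown' if you are not sure.",
    [("France", "Paris"), ("Japan", "Tokyo")],
    "Canada",
)
print(prompt)
```

Giving the model an explicit way to decline, such as the 'unknown' option here, is one common way to discourage invented answers.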
We believe developers will have plenty to work with as we release our model weights and starting code for pre-trained and conversational fine-tuned versions as well as responsible use resources. While data mixes are intentionally withheld for competitive reasons, all models have gone through Meta’s internal Privacy Review process to ensure responsible data usage in building our products. We are dedicated to the responsible and ethical development of our genAI products, ensuring our policies reflect diverse contexts and meet evolving societal expectations.
Fine-tuning requirements also vary based on the amount of data, the time available to complete fine-tuning, and cost constraints. To fine-tune these models we have generally used multiple NVIDIA A100 machines with data parallelism across nodes and a mix of data and tensor parallelism within each node. However, using a single machine or other GPU types is definitely possible (e.g., alpaca models are trained on a single RTX4090: https://github.com/tloen/alpaca-lora).
The Llama 2 fine-tuned models were fine-tuned for dialogue-specific uses like chatbots.
The Llama 2 pre-trained models were trained for general large language applications, whereas the Llama 2 chat models were fine-tuned for dialogue-specific uses like chatbots.
You can find some best practices in the fine-tuning section in our Getting started with Llama guide.
LoRA has made fine-tuning LLMs like Llama 2 possible on consumer GPUs (like the Tesla T4) by retraining only a very small set of model parameters. This democratizes LLM fine-tuning while still reaching performance comparable to fine-tuning the whole model on much more expensive GPUs. LoRA is therefore essential and very effective in Llama 2 fine-tuning.
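To show why LoRA trains so few parameters, the sketch below counts the weights in one square projection matrix against the weights in its low-rank adapter, and then applies the adapter in a tiny forward pass. The layer size 4096, the rank 8, and all function names are illustrative assumptions, not values taken from a particular Llama 2 recipe.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full weight count, LoRA adapter weight count) for one layer."""
    full = d_in * d_out
    lora = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return full, lora


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(w, a, b, x, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x); only A and B would be trained."""
    base = matvec(w, x)                # frozen pre-trained weights
    update = matvec(b, matvec(a, x))   # low-rank path
    scale = alpha / rank
    return [y + scale * u for y, u in zip(base, update)]


full, lora = lora_param_counts(4096, 4096, rank=8)
print(full, lora, lora / full)  # 16777216 65536 0.00390625

w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 1.0]]                # rank 1
x = [2.0, 3.0]
# With B initialised to zero, the adapter leaves the base output unchanged.
print(lora_forward(w, a, [[0.0], [0.0]], x, alpha=2.0, rank=1))  # [2.0, 3.0]
print(lora_forward(w, a, [[1.0], [0.0]], x, alpha=2.0, rank=1))  # [12.0, 3.0]
```

In this example the adapter holds under 0.4% of the layer's weights, which is what makes fine-tuning feasible on smaller GPUs.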
a. It depends on the application and the type of data we are fine-tuning on, but evaluation needs to go beyond the normal harness eval sets to something that makes sense for the application. For example, for SQL data, running the generated code may be a better evaluation. Essentially, having ground-truth data for the specific application can help reduce risk in that application.
b. Setting some sort of threshold, such as requiring a probability above 90%, may also help give more confidence in the output.
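A minimal sketch of the thresholding idea in (b), assuming the application can read per-option logits from the model: convert them to probabilities with a softmax and accept the top option only when its probability exceeds the threshold. The function names and the example logits are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confident_choice(logits, labels, threshold=0.9):
    """Return (label, prob) if the top probability exceeds threshold, else (None, prob)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] > threshold:
        return labels[best], probs[best]
    return None, probs[best]


labels = ["yes", "no", "unsure"]
print(confident_choice([5.0, 0.0, 0.0], labels))  # accepted: top probability is about 0.987
print(confident_choice([1.0, 0.9, 0.0], labels))  # rejected: no option is above 0.9
```

Rejected outputs could then be routed to a fallback, such as a retry or human review.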
The best approach is to review the LoRA research paper for more information on the rank, then review similar implementations for other models, and finally experiment.
The model is trained on a subset of publicly available text-based datasets.
Yes, we will publish benchmarks alongside the release. If there are particular benchmarks partners are interested in, it may be possible to share some earlier under NDA.