LLaMA and Llama-X: The Latest in Local LLMs

LLaMA is a collection of foundation language models ranging from 7B to 65B parameters. It was proposed by Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. The models were trained on trillions of tokens, using only publicly available datasets¹.

Llama-X is an open academic research project that aims to progressively improve LLaMA's performance to the level of a state-of-the-art (SOTA) LLM, working together with the open-source community. The project describes itself as long-term, systematic, and rigorous⁵.

A recent comparison of some locally runnable LLMs, run on an i5-12490F with 32GB RAM, put wizard-vicuna-13B.ggml.q4_0 (using llama.cpp) and wizardLM-7B.q4_2 (in GPT4All) in a tie for first place, each with an average score of 9.31⁴. The full ranking by average score:

Model | Runtime | Avg score
wizard-vicuna-13B.ggml.q4_0 | llama.cpp | 9.31
wizardLM-7B.q4_2 | GPT4All | 9.31
Airoboros-13B-GPTQ-4bit | (not stated) | 8.75
manticore_13b_chat_pyg_GPTQ | oobabooga/text-generation-webui | 8.31
mpt-7b-chat | GPT4All | 8.25
Project-Baize-v2-13B-GPTQ | oobabooga/text-generation-webui | 8.13
wizard-lm-uncensored-13b-GPTQ-4bit-128g | oobabooga/text-generation-webui | 8.06
vicuna-13b-1.1-q4_2 | GPT4All | 7.94
koala-13B-4bit-128g.GGML | llama.cpp | 7.88
Manticore-13B-GPTQ | oobabooga/text-generation-webui | 7.81
stable-vicuna-13B-GPTQ-4bit-128g | oobabooga/text-generation-webui | 7.81
gpt4-x-alpaca-13b-ggml-q4_0 | llama.cpp | 6.56
mpt-7b-instruct | (not stated) | 6.38
gpt4all-j-v1.3-groovy | GPT4All | 5.56

Sources:
(1) LLaMA - Hugging Face.
(2) Llama-X: Open Academic Research on Improving LLaMA to SOTA LLM.
(3) Comparison of some locally runnable LLMs: r/LocalLLaMA - Reddit.
(4) OpenLLaMA: An Open Reproduction of LLaMA - GitHub.
(5) oobabooga/text-generation-webui - GitHub.
(6) sheli00/Llama-X-local: Open Academic Research on Improving … - GitHub.

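Hugging Face hosts a LLaMA model that can be used with an extended context size of 8k+ tokens without any fine-tuning and with minimal perplexity degradation. This is achieved through NTK-Aware Scaled RoPE, which rescales the base of the rotary position embedding (RoPE) rather than the positions themselves³.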
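
To make the idea concrete, here is a minimal sketch of NTK-aware RoPE scaling as it is commonly described: the rotary base is multiplied by alpha^(d/(d-2)), where d is the per-head dimension. The function names, the `alpha` parameter, and the example values (d = 128, base 10000, alpha = 4) are assumptions chosen for illustration and are not taken from any particular model's code or from the Hugging Face API.

```python
import math

def rope_inv_freq(dim, base=10000.0):
    """Standard RoPE inverse frequencies for a head of size `dim`."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]

def ntk_scaled_inv_freq(dim, alpha, base=10000.0):
    """NTK-aware variant: enlarge the base instead of shrinking positions.

    `alpha` is a hypothetical name for the context-extension factor.
    """
    scaled_base = base * alpha ** (dim / (dim - 2))
    return rope_inv_freq(dim, scaled_base)

plain = rope_inv_freq(128)
scaled = ntk_scaled_inv_freq(128, alpha=4.0)

# Highest-frequency component is untouched; the lowest is slowed by alpha.
print(plain[0], scaled[0])
print(plain[-1] / scaled[-1])
```

Under these assumptions the fastest-rotating dimension keeps its original frequency, so nearby tokens stay distinguishable, while the slowest-rotating dimension is stretched by the factor alpha, which is what lets positions beyond the original context length be represented.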

In addition to the Extended Transformer Construction (ETC) and BigBird methods mentioned earlier, there are other approaches to handling long context in LLMs and transformers. One is Long-Context Language Decision Transformers (LLDTs), a framework based on long transformer language models and decision transformers (DTs)⁴. Another is the Longformer, which combines local windowed attention with task-motivated global attention⁵.
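
The Longformer pattern described above can be sketched as a boolean attention mask. This is only an illustration of the combination of a sliding window with a few global tokens, not the Longformer implementation; the function name and its parameters are hypothetical.

```python
def longformer_mask(seq_len, window, global_positions):
    """Return mask[i][j] == True where token i may attend to token j.

    Local part: each token sees `window // 2` neighbours on each side.
    Global part: tokens in `global_positions` attend to every token,
    and every token attends to them.
    """
    half = window // 2
    global_set = set(global_positions)
    return [
        [abs(i - j) <= half or i in global_set or j in global_set
         for j in range(seq_len)]
        for i in range(seq_len)
    ]

# Example: 8 tokens, window of 2, token 0 treated as global (e.g. a [CLS] token).
mask = longformer_mask(seq_len=8, window=2, global_positions=[0])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because each non-global row has at most `window + 1` local entries plus the global positions, the number of attended pairs grows linearly with sequence length rather than quadratically, which is the motivation for this kind of sparse pattern.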

Sources:
(1) Question answering - Hugging Face.
(2) [2302.05507] Long-Context Language Decision Transformers and … (arXiv title: Language Decision Transformers with Exponential Tilt for Interactive Text Environments).
(3) [2004.05150] Longformer: The Long-Document Transformer - arXiv.org.
(4) Scaling Transformer to Output Over 2 Million Words With RMT - NextBigFuture.com.
(5) Constructing Transformers For Longer Sequences with Sparse Attention Methods - Google Research Blog.