This is my attempt at making the LLaMA language model work with a pure Rust CPU implementation. I was inspired by an amazing CPU implementation here: https://github.com/ggerganov/ggml that can run GPT-J 6B models.
The current performance is as follows:
```
Pure Rust implementations:

LLaMA-7B:  AMD Ryzen 3950X: 552ms / token  f16  (pure Rust)
LLaMA-7B:  AMD Ryzen 3950X: 1008ms / token f32  (pure Rust)
LLaMA-13B: AMD Ryzen 3950X: 1029ms / token f16  (pure Rust)
LLaMA-13B: AMD Ryzen 3950X: 1930ms / token f32  (pure Rust)
LLaMA-30B: AMD Ryzen 5950X: 2112ms / token f16  (pure Rust)

OpenCL (all use f16):

LLaMA-7B:  AMD Ryzen 3950X + OpenCL RTX 3090 Ti: 247ms / token  (OpenCL on GPU)
LLaMA-7B:  AMD Ryzen 3950X + OpenCL Ryzen 3950X: 680ms / token  (OpenCL on CPU)
LLaMA-13B: AMD Ryzen 3950X + OpenCL RTX 3090 Ti:
```
(Scroll to the bottom to see benchmarks over time).
I have not tried to run LLaMA-65B, but I think it would work if you have a big enough computer.
It also has a Python unpickler that understands the `.pth` files used by PyTorch. Well, almost: it doesn't unzip them automatically (see below).
The implementation uses AVX2, even in the OpenCL codepath, so this will only run on AMD64 at this time.
As of March 18, `rllama` is on crates.io. You can install it with `cargo install rllama`. You may need to explicitly enable AVX2 features:

```shell
RUSTFLAGS="-C target-feature=+sse2,+avx,+fma,+avx2" cargo install rllama
```
There is a `.cargo/config.toml` inside this repository that will enable these features if you install manually from this Git repository instead.
You will need Rust. Make sure you can run `cargo` from the command line. In particular, this project uses unstable features, so you need nightly Rust. Make sure that when you run `cargo --version` it shows that it is nightly Rust.
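If it does not, and you manage your toolchains with rustup, you can install and select nightly like this:

```shell
rustup toolchain install nightly
rustup default nightly
cargo --version   # should now report a nightly version
```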
You will need to download the LLaMA-7B weights. Refer to https://github.com/facebookresearch/llama/
Once you have the 7B weights and the `tokenizer.model` that comes with them, you need to decompress the weights:
```shell
$ cd LLaMA
$ cd 7B
$ unzip consolidated.00.pth
$ mv consolidated consolidated.00
```
You should then be ready to generate some text.
```shell
cargo run --release -- --tokenizer-model /path/to/tokenizer.model --model-path /path/to/LLaMA/7B --param-path /path/to/LLaMA/7B/params.json --prompt "The meaning of life is"
```
By default, it will use the weights in the precision they have in the source files. You can use the `--f16` command line argument to cast the largest weight matrices to float16. Using OpenCL will also cast the weight matrices to float16.
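For example, to run the same generation as above with the largest matrices cast to float16, append `--f16`:

```shell
cargo run --release -- --tokenizer-model /path/to/tokenizer.model --model-path /path/to/LLaMA/7B --param-path /path/to/LLaMA/7B/params.json --prompt "The meaning of life is" --f16
```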
You can use `--temperature`, `--top-p` and `--top-k` to adjust token sampler settings.
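As a rough mental model of what these knobs do, here is a minimal sketch of temperature + top-k + top-p (nucleus) sampling in Rust. It illustrates the general technique only; the function name and signature are invented for this example and are not rllama's actual sampler code:

```rust
// Sketch: pick a token id from raw logits using temperature, top-k and top-p.
// `rng_uniform` is a uniform random number in [0, 1) supplied by the caller.
fn sample(logits: &[f32], temperature: f32, top_k: usize, top_p: f32, rng_uniform: f32) -> usize {
    // Scale logits by temperature: lower values sharpen the distribution.
    let scaled: Vec<f32> = logits.iter().map(|&l| l / temperature).collect();

    // Softmax over the scaled logits.
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|&l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut probs: Vec<(usize, f32)> =
        exps.iter().enumerate().map(|(i, &e)| (i, e / sum)).collect();

    // Sort tokens by probability, highest first.
    probs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());

    // top-k: keep only the k most likely tokens.
    probs.truncate(top_k.max(1));

    // top-p (nucleus): keep the smallest prefix whose mass reaches top_p.
    let mut cumulative = 0.0;
    let mut cutoff = probs.len();
    for (i, &(_, p)) in probs.iter().enumerate() {
        cumulative += p;
        if cumulative >= top_p {
            cutoff = i + 1;
            break;
        }
    }
    probs.truncate(cutoff);

    // Renormalize and draw a token using the caller-provided uniform sample.
    let mass: f32 = probs.iter().map(|&(_, p)| p).sum();
    let mut r = rng_uniform * mass;
    for &(token, p) in &probs {
        if r < p {
            return token;
        }
        r -= p;
    }
    probs.last().unwrap().0
}
```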
There is a `--repetition-penalty` setting. 1.0 means no penalty. This value should likely be between 0 and 1: values smaller than 1.0 penalize tokens that already appear in the context, by scaling each such token's score `x` to `x*(repetition_penalty^num_occurrences)` before `softmax()` is applied to the output probabilities.
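A minimal sketch of that rule in Rust, assuming the pre-softmax scores live in a plain `f32` slice indexed by token id (the function and its signature are invented for illustration, not taken from rllama's code):

```rust
use std::collections::HashMap;

// Scale each seen token's score by penalty^num_occurrences, mirroring the
// formula x * (repetition_penalty ^ num_occurrences) described above.
// With penalty < 1.0 this lowers the score; penalty == 1.0 is a no-op.
fn apply_repetition_penalty(logits: &mut [f32], context: &[usize], penalty: f32) {
    // Count how many times each token id appears in the context.
    let mut occurrences: HashMap<usize, u32> = HashMap::new();
    for &token in context {
        *occurrences.entry(token).or_insert(0) += 1;
    }
    for (&token, &count) in &occurrences {
        if token < logits.len() {
            logits[token] *= penalty.powi(count as i32);
        }
    }
}
```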
You can also use `--prompt-file` to read the prompt from a file instead of from the command line.
Use the `opencl` Cargo feature:

```shell
cargo run --release --features opencl -- --tokenizer-model /path/to/tokenizer.model --model-path /path/to/LLaMA/7B --param-path /path/to/LLaMA/7B/params.json --prompt "The meaning of life is"
```
With the `opencl` feature there is also another argument, `--opencl-device`, that takes a number. That number selects the Nth OpenCL device found on the system. You can see the devices in the output when you run the program (e.g. see the screenshot below).
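For example, to pick a specific device, append `--opencl-device` with its number (whether the numbering starts at 0 or 1 is not stated here, so check the device list the program prints):

```shell
cargo run --release --features opencl -- --tokenizer-model /path/to/tokenizer.model --model-path /path/to/LLaMA/7B --param-path /path/to/LLaMA/7B/params.json --prompt "The meaning of life is" --opencl-device 0
```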
This is a hobby thing for me so don't expect updates or help.
I'm trying to track that I'm making this faster and not slower.
These benchmarks are for 50-token sequence generation:
```
cargo run --release -- --model-path /LLaMA/13B \
    --param-path /LLaMA/13B/params.json \
    --tokenizer-path /LLaMA/tokenizer.model \
    --prompt "Computers are pretty complica" --max-seq-len 50

LLaMA-7B:  AMD Ryzen 3950X: 1058ms / token
LLaMA-13B: AMD Ryzen 3950X: 2005ms / token

LLaMA-7B:  AMD Ryzen 3950X + OpenCL RTX 3090 Ti: 567ms / token
LLaMA-7B:  AMD Ryzen 3950X + OpenCL Ryzen 3950X: 956ms / token
LLaMA-13B: AMD Ryzen 3950X + OpenCL RTX 3090 Ti: 987ms / token
LLaMA-13B: AMD Ryzen 3950X + OpenCL Ryzen 3950X: 1706ms / token

LLaMA-7B:  AMD Ryzen 3950X + OpenCL RTX 3090 Ti: 283ms / token
LLaMA-7B:  AMD Ryzen 3950X + OpenCL Ryzen 3950X: 679ms / token
LLaMA-13B: AMD Ryzen 3950X + OpenCL RTX 3090 Ti:

LLaMA-7B:  AMD Ryzen 3950X + OpenCL RTX 3090 Ti: 247ms / token
LLaMA-7B:  AMD Ryzen 3950X + OpenCL Ryzen 3950X: 680ms / token
LLaMA-13B: AMD Ryzen 3950X + OpenCL RTX 3090 Ti:

LLaMA-7B:  AMD Ryzen 3950X: 552ms / token  f16
LLaMA-7B:  AMD Ryzen 3950X: 1008ms / token f32
LLaMA-13B: AMD Ryzen 3950X: 1029ms / token f16
LLaMA-13B: AMD Ryzen 3950X: 1930ms / token f32
LLaMA-30B: AMD Ryzen 5950X: 2112ms / token f16
```