This repository implements a new simple format for storing tensors safely (as opposed to pickle) and that is still fast (zero-copy).
You can install safetensors via the pip manager:
```bash
pip install safetensors
```
For the sources, you need Rust:

```bash
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Make sure it's up to date
rustup update
# Clone the repository and build the Python bindings
git clone https://github.com/huggingface/safetensors
cd safetensors/bindings/python
pip install setuptools_rust
pip install -e .
```
```python
import torch
from safetensors import safe_open
from safetensors.torch import save_file

tensors = {
    "weight1": torch.zeros((1024, 1024)),
    "weight2": torch.zeros((1024, 1024)),
}
save_file(tensors, "model.safetensors")

tensors = {}
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    for key in f.keys():
        tensors[key] = f.get_tensor(key)
```
The format itself is simple:

- 8 bytes: `N`, a little-endian u64 int, containing the size of the header.
- N bytes: a JSON UTF-8 string representing the header, mapping each tensor name to its `dtype`, `shape` and `data_offsets` within the byte buffer. A special key `__metadata__` is allowed to contain a free-form string-to-string map.
- Rest of the file: the byte buffer containing the tensor data.
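To make the layout concrete, here is a minimal sketch of reading the header by hand (a standalone illustration, not the library's API; the file name is the one from the example above):

```python
import json
import struct

def read_header(path):
    with open(path, "rb") as f:
        # First 8 bytes: N, a little-endian u64 with the header size.
        (n,) = struct.unpack("<Q", f.read(8))
        # Next N bytes: the JSON header, mapping tensor names to
        # dtype/shape/data_offsets, plus the optional __metadata__ map.
        return json.loads(f.read(n).decode("utf-8"))

header = read_header("model.safetensors")
print(header["weight1"])           # dtype, shape and data_offsets
print(header.get("__metadata__"))  # free-form metadata, if present
```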
The main rationale for this crate is to remove the need to use `pickle` on `PyTorch`, which is used by default.
There are other formats out there used by machine learning, as well as more general formats.
Let's take a look at alternatives and why this format is deemed interesting. This is my very personal and probably biased view:
| Format                  | Safe | Zero-copy | Lazy loading | No file size limit | Layout control | Flexibility | Bfloat16 |
| ----------------------- | --- | --- | --- | --- | --- | --- | --- |
| pickle (PyTorch)        | ✗ | ✗ | ✗ | 🗸 | ✗ | 🗸 | 🗸 |
| H5 (Tensorflow)         | 🗸 | ✗ | 🗸 | 🗸 | ~ | ~ | ✗ |
| SavedModel (Tensorflow) | 🗸 | ✗ | ✗ | 🗸 | 🗸 | ✗ | 🗸 |
| MsgPack (flax)          | 🗸 | 🗸 | ✗ | 🗸 | ✗ | ✗ | 🗸 |
| Protobuf (ONNX)         | 🗸 | ✗ | ✗ | ✗ | ✗ | ✗ | 🗸 |
| Cap'n'Proto             | 🗸 | 🗸 | ~ | 🗸 | 🗸 | ~ | ✗ |
| Arrow                   | ? | ? | ? | ? | ? | ? | ✗ |
| Numpy (npy,npz)         | 🗸 | ? | ? | ✗ | 🗸 | ✗ | ✗ |
| pdparams (Paddle)       | ✗ | ✗ | ✗ | 🗸 | ✗ | 🗸 | 🗸 |
| SafeTensors             | 🗸 | 🗸 | 🗸 | 🗸 | 🗸 | ✗ | 🗸 |
A few notable oppositions:

- Numpy (npz): No `bfloat16` support. Vulnerable to zip bombs (DOS).
- Arrow: No `bfloat16` support. Seems to require decoding link.

Zero-copy: No format is really zero-copy in ML: the data needs to go from disk to RAM/GPU RAM, and that takes time. Also, in PyTorch/numpy you need a mutable buffer, and we don't really want to mutate an mmapped file, so one copy is really necessary to use the tensors freely in user code. That being said, zero-copy is achievable in Rust if it's wanted and safety can be guaranteed by some other means. SafeTensors is not zero-copy for the header. The choice of JSON is pretty arbitrary, but since deserializing it takes far less time than loading the actual tensor data, and it stays human-readable, I went that way (the header is also tiny compared to the tensor data).
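As a small illustration of the mutability point, here is a sketch (using numpy, on the file from the earlier example) of mapping the data read-only and paying exactly one copy for a writable buffer:

```python
import struct
import numpy as np

# Read N, the header size, so we know where the tensor data starts.
with open("model.safetensors", "rb") as f:
    (n,) = struct.unpack("<Q", f.read(8))

# Memory-map the whole file read-only: the OS shares pages, no copy yet.
raw = np.memmap("model.safetensors", dtype=np.uint8, mode="r")
tensor_bytes = raw[8 + n:]  # still just a view over the mmapped byte buffer

# User code expects writable arrays, so one copy is the price of mutability.
writable = np.array(tensor_bytes)  # np.array copies by default
```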
Endianness: Little-endian. This can be modified later, but it feels really unnecessary at the moment.
Since we can invent a new format, we can propose additional benefits:
Prevent DOS attacks: We can craft the format in such a way that it's almost impossible to use malicious files to DOS attack a user. Currently, there's a 100MB limit on the size of the header to prevent parsing extremely large JSON. Also, when reading the file, there's a guarantee that addresses in the file do not overlap in any way, meaning that loading a file should never exceed the size of the file in memory.
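For illustration, here is a sketch of the kind of bounds/overlap check described above, assuming a header dict like the one returned by the earlier `read_header` sketch (a hypothetical helper, not the library's actual validator):

```python
def check_offsets(header, data_len):
    """Reject headers whose tensor byte ranges overlap or fall out of bounds."""
    spans = sorted(
        tuple(info["data_offsets"])
        for name, info in header.items()
        if name != "__metadata__"  # metadata carries no tensor data
    )
    end = 0
    for begin, stop in spans:
        if begin < end or begin > stop or stop > data_len:
            raise ValueError(f"invalid tensor data range ({begin}, {stop})")
        end = stop
```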
Faster load: PyTorch seems to be the fastest format to load among the major ML formats. However, it does seem to have an extra copy on CPU, which we can bypass in this lib link. Currently, CPU loading times are extremely fast with this lib compared to pickle. GPU loading times can be sped up, but that path is still hidden behind an environment variable (`SAFETENSORS_FAST_GPU=1`) because it hasn't received enough external scrutiny to be considered safe. With it enabled, loading is roughly 2X faster than PyTorch on regular Linux hardware because of this extra CPU copy skip.
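For illustration, a sketch of opting into the fast GPU path (assuming a CUDA device is available; `load_file` with a `device` argument is the library's standard loading call):

```python
import os

# Opt in before loading; the fast path is off by default for safety.
os.environ["SAFETENSORS_FAST_GPU"] = "1"

from safetensors.torch import load_file

# Load every tensor in the file directly onto the first CUDA device.
tensors = load_file("model.safetensors", device="cuda:0")
```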
Lazy loading: in distributed (multi-node or multi-gpu) settings, it's nice to be able to load only part of the tensors on the various models. For BLOOM, using this format brought loading the model on 8 GPUs down from 10 minutes with regular PyTorch weights to 45 seconds. This really speeds up feedback loops when developing on the model. For instance, you don't have to keep separate copies of the weights when changing the distribution strategy (for instance Pipeline Parallelism vs Tensor Parallelism).
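A minimal sketch of that kind of partial load, using `safe_open` and `get_slice` (tensor names are from the example above; the half-split is a toy stand-in for a real sharding scheme):

```python
from safetensors import safe_open

with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    tensor_slice = f.get_slice("weight1")
    rows, cols = tensor_slice.get_shape()
    # Each rank reads only its own rows; the rest of the file is never touched.
    part = tensor_slice[: rows // 2]
```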
License: Apache-2.0