xvc

A fast and robust MLOps tool to manage data and pipelines

⌛ When to use xvc?

✳️ What is xvc for?

🔽 Installation

You can get the binary files for Linux, macOS, and Windows from the releases page. Extract and copy the file to a directory in your `$PATH`.

Alternatively, if you have Rust [installed], you can build xvc:

```shell
$ cargo install xvc
```

🏃🏾 Quickstart

Xvc tracks your files and directories on top of Git. To start, run the following commands in the repository:

```shell
$ git init # if you're not already in a Git repository
$ xvc init
```

This initializes the metafiles in the `.xvc/` directory and adds a `.xvcignore` file in case you want to hide certain elements from Xvc.

Add your data files and directories for tracking.

```shell
$ xvc file track my-data/ --cache-type symlink
```

The command calculates data content hashes (with BLAKE-3, by default) and records them. It commits these changes to Git. It also copies these files to content-addressed directories under `.xvc/b3` and creates read-only symbolic links to them.
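To illustrate the idea of content-addressed caching, here is a minimal Python sketch. It is not xvc's actual implementation: Python's `hashlib` has no BLAKE-3, so BLAKE2b stands in for it, and the cache directory layout is a simplification.

```python
import hashlib
from pathlib import Path


def cache_and_link(path: Path, cache_root: Path) -> Path:
    """Store a copy of `path` under a content-addressed directory and
    replace the original with a symlink, like `--cache-type symlink`."""
    # Xvc uses BLAKE-3; BLAKE2b stands in here since hashlib lacks BLAKE-3.
    digest = hashlib.blake2b(path.read_bytes()).hexdigest()
    # Fan the digest out into subdirectories to keep each one small.
    cached = cache_root / digest[:2] / digest[2:4] / digest[4:] / path.name
    if not cached.exists():
        cached.parent.mkdir(parents=True, exist_ok=True)
        cached.write_bytes(path.read_bytes())
        cached.chmod(0o444)  # cached copies are read-only
    path.unlink()
    path.symlink_to(cached)
    return cached
```

Because the cache path is derived from the content hash, identical files deduplicate automatically, and a changed file lands in a new directory instead of overwriting the old version.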

You can specify different [cache-types] for files and directories to suit your use case. If you need to track model files that change frequently, you can set `--cache-type copy` (the default) to keep all versions of the models available.

```shell
$ xvc file track my-models/ --cache-type copy
```

When you want to share the files you added, configure a storage.

```shell
$ xvc storage new s3 --name my-remote --region us-east-1 --bucket-name my-xvc-remote
```

You can send the files you're tracking in Xvc to this storage.

```shell
$ xvc file send --to my-remote
```

When you (or someone else) want to access these files later, you can clone the Git repository and get the files back from the storage.

```console
$ git clone https://example.com/my-machine-learning-project
$ cd my-machine-learning-project
$ xvc file bring my-data/ --from my-remote
```

(Note that you don't need to reconfigure the storage, but you do need valid credentials to access the data. Xvc doesn't store any credentials.)

If you have commands that depend on data or code elements, you can configure a pipeline.

Create a step for each command.

```shell
$ xvc pipeline step new --step-name preprocess --command 'python3 preprocess.py'
$ xvc pipeline step new --step-name train --command 'python3 train.py'
$ xvc pipeline step new --step-name test --command 'python3 test.py'
```

Then, configure dependencies between these steps.

```console
$ xvc pipeline step dependency --step-name preprocess --glob 'my-data/*.jpg' \
                               --file preprocess.py \
                               --regex 'names.txt:/^Name:' \
                               --lines a-long-file.csv::-1000
$ xvc pipeline step dependency --step-name train --step preprocess
$ xvc pipeline step dependency --step-name test --file test-data.npz \
                               --file my-models/model.h5
$ xvc pipeline step output --step-name preprocess --output-file test-data.npz
$ xvc pipeline step output --step-name train --output-file my-models/model.h5
```

The above commands define three steps in the default pipeline. You can create multiple pipelines if you need them.

The first step, preprocess, depends on the `jpg` files in the my-data/ directory, the lines that start with Name: in names.txt, and the first 1000 lines of a-long-file.csv. It also depends on the script itself, so any change to the script invalidates the step. The second step, train, depends on the preprocess step directly; anything that causes preprocess to rerun makes train run as well. The test step depends on train and preprocess via their outputs, and runs when those outputs (test-data.npz and model.h5) change.
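The invalidation logic described above can be sketched in a few lines of Python. This is a hypothetical illustration using the example's step graph, not xvc's code: when a step's inputs change, the step and everything downstream of it must rerun, in dependency order.

```python
# Sketch of step invalidation propagating through the example's
# dependency graph; xvc's real implementation differs.
from graphlib import TopologicalSorter

# step -> set of steps it depends on (from the example pipeline)
deps = {
    "preprocess": set(),
    "train": {"preprocess"},
    "test": {"train", "preprocess"},
}


def steps_to_run(changed: set[str]) -> list[str]:
    """Return the steps to rerun, in dependency order, when the steps in
    `changed` are invalidated (e.g. an input file's hash changed)."""
    dirty = set(changed)
    # Walk in topological order: a step depending on a dirty step is dirty.
    for step in TopologicalSorter(deps).static_order():
        if deps[step] & dirty:
            dirty.add(step)
    return [s for s in TopologicalSorter(deps).static_order() if s in dirty]


print(steps_to_run({"preprocess"}))  # → ['preprocess', 'train', 'test']
print(steps_to_run({"train"}))      # → ['train', 'test']
```

Changing my-data/ or preprocess.py dirties preprocess and therefore cascades to train and test, while a change that only affects train leaves preprocess untouched.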

You can get the pipeline in Graphviz DOT format to convert to an image.

```console
$ xvc pipeline dag
digraph {
    0 [ label = "step: train (by_dependencies, python3 train.py)" ]
    1 [ label = "step: preprocess (by_dependencies, python3 preprocess.py)" ]
    2 [ label = "step: test (by_dependencies, python3 test.py)" ]
    3 [ label = "file: my-models/model.h5" ]
    4 [ label = "file: test-data.npz" ]
    0 -> 1 [ label = "" ]
    2 -> 3 [ label = "" ]
    2 -> 4 [ label = "" ]
}
```

You can also export the pipeline to JSON, edit it in your editor, and import it back.

```console
$ xvc pipeline export > my-pipeline.json
$ nvim my-pipeline.json
$ xvc pipeline import --file my-pipeline.json --overwrite
```

You can run the pipeline with:

```shell
$ xvc pipeline run
```

If the steps you defined don't depend on each other, they run in parallel.
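A minimal Python sketch of this kind of scheduling: submit every step whose dependencies have finished to a worker pool, and wait for completions to unlock the rest. The step graph here is illustrative and `run_step` is a placeholder, not xvc's API.

```python
# Hypothetical sketch of running independent steps in parallel.
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

# a and b are independent; c waits for both
deps = {"a": set(), "b": set(), "c": {"a", "b"}}


def run_step(name: str) -> str:
    return name  # placeholder for invoking the step's command


def run_pipeline(deps: dict[str, set[str]]) -> list[str]:
    done, futures, order = set(), {}, []
    with ThreadPoolExecutor() as pool:
        while len(done) < len(deps):
            # Submit every step whose dependencies are all finished.
            for step, reqs in deps.items():
                if step not in done and step not in futures and reqs <= done:
                    futures[step] = pool.submit(run_step, step)
            finished, _ = wait(futures.values(), return_when=FIRST_COMPLETED)
            for step, fut in list(futures.items()):
                if fut in finished:
                    done.add(step)
                    order.append(fut.result())
                    del futures[step]
    return order


order = run_pipeline(deps)
```

In this sketch, a and b are submitted together and can execute concurrently, while c is only submitted once both have completed.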

You can define fairly complex dependencies with globs, files, directories, regular expression searches in files, lines in files, other steps, and pipelines with `xvc pipeline step dependency` commands. More dependency types like database queries, content from URLs, S3 (and compatible) buckets, REST and GraphQL results are in the backlog. Please create an issue or discussion for any other kinds of dependencies that you'd like to be included.

Please check xvc.netlify.app for documentation.

🤟 Big Thanks

xvc stands on the following (giant) crates:

And the biggest thanks to Rust's designers, developers, and contributors. Although I don't consider myself expert enough to appreciate it fully, it's a fabulous language and environment to work with.

🚁 Support

👐 Contributing

⚠️ Disclaimer

This software is fresh and ambitious. Although I use it and test it under close-to-real-world conditions, it hasn't yet stood the test of time. Xvc can eat your files and spit them into the eternal void!