Ballista

Ballista is a proof-of-concept distributed compute platform based on Kubernetes and the Rust implementation of Apache Arrow.

This is not my first attempt at building something like this. I originally wanted DataFusion to be a distributed compute platform, but that was overly ambitious at the time, and it ended up becoming an in-memory query execution engine for the Rust implementation of Apache Arrow. However, DataFusion now provides a good foundation for another attempt at building a modern distributed compute platform in Rust.

My goal is to use this repo to move fast and try out ideas that can eventually be contributed back to Apache Arrow, and to help drive requirements for Apache Arrow and DataFusion.

Demo

This demo shows a Ballista cluster being created in Minikube and then shows the nyctaxi example being executed, causing a distributed query to run in the cluster, with each executor pod performing a projection on one partition of the data.

asciicast

Here are the commands being run, with some explanation:

```bash
# create a cluster with 12 executors
cargo run --bin ballista -- create-cluster --name nyctaxi --num-executors 12 --template examples/nyctaxi/templates/executor.yaml

# check status
kubectl get pods

# run the nyctaxi example application, which executes queries using the executors
cargo run --bin ballista -- run --name nyctaxi --template examples/nyctaxi/templates/application.yaml

# check status again to find the name of the application pod
kubectl get pods

# watch progress of the application (the pod name suffix will differ per run)
kubectl logs -f ballista-nyctaxi-app-n5kxl
```
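
As a rough illustration of what "each executor pod performing a projection on one partition" means in the demo, here is a minimal Rust sketch that uses only the standard library. It is not Ballista's code: the `Trip` struct, its field names, and the functions `project_partition` and `run_projection` are hypothetical, and plain threads stand in for executor pods.

```rust
use std::thread;

/// One row of a hypothetical trip record. The field names are
/// illustrative assumptions, not the schema used by the example.
#[derive(Debug, Clone, PartialEq)]
pub struct Trip {
    pub passenger_count: u32,
    pub trip_distance: f64,
    pub fare_amount: f64,
}

/// Projection over a single partition: keep only the
/// (trip_distance, fare_amount) columns of every row.
pub fn project_partition(partition: &[Trip]) -> Vec<(f64, f64)> {
    partition
        .iter()
        .map(|t| (t.trip_distance, t.fare_amount))
        .collect()
}

/// Run the projection on all partitions in parallel, one thread per
/// partition (standing in for one executor pod per partition).
/// Results are returned in the same order as the input partitions.
pub fn run_projection(partitions: Vec<Vec<Trip>>) -> Vec<Vec<(f64, f64)>> {
    // Spawn one worker per partition; each worker owns its partition.
    let handles: Vec<_> = partitions
        .into_iter()
        .map(|p| thread::spawn(move || project_partition(&p)))
        .collect();

    // Collect the per-partition results in partition order.
    handles
        .into_iter()
        .map(|h| h.join().expect("worker thread panicked"))
        .collect()
}
```

Calling `run_projection` with 12 partitions would mirror the shape of the demo: 12 independent workers, each projecting its own slice of the data, with the caller gathering the per-partition results.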

PoC Status