DataFusion: Distributed Query Processing in Rust

License Version Docs Build Status

This project is a proof-of-concept of a distributed data processing platform in Rust with features somewhat similar to Apache Spark but it is not intended to be a clone of Apache Spark.

DataFusion can also be used as a crate dependency in your project if you want the ability to perform SQL queries and DataFrame style data manipulation in-process.

Why am I building this?

Primarily, this is a just a fun side-project for me to use to become a better Rust developer since it involves solving some non-trivial problems. I'm also generally interested in researching distributed systems and query optimizers since I've been working with these concepts professionally for quite a few years now.

Apart from using this as a way to learn, I do think that it could result in a useful product.

I have a hypothesis that even a naive implementation in Rust will have performance that is roughly comparable to that of Apache Spark for simple use cases, but more importantly the performance will be predictable and reliable because there is no garbage collector involved.

What will be similar to Apache Spark?

What will be different to Apache Spark?

Due to the statically compiled nature of Rust, this platform will be less interactive:

Current Status

There are two working examples:

Both of these examples run a trivial query against a trivial CSV file using a single thread.

Roadmap

I've started defining milestones and issues in github issues, but here's a high level summary of the plan with some rough guesses of timescale.

POC (Q1 2018)

For the POC, I want to be able to run a single worker process (preferably dockerized) and be able to send it a query (via JSON) and have it execute that query. This will be sufficient to run some representative (but trivial) workloads to compare with Apache Spark.

The workloads will read and write CSV files from HDFS.

MVP (Q2 2018)

MVP should be fully deployable, have a good UX, have good documentation etc. It could still be lacking major features though such as JOIN, GROUP BY, user-defined functions etc.

1.0 (Q4 2018)

The 1.0 release should be able to support real-world workloads with performance, scalability, and reliability that generally exceed those of Apache Spark.

Contributing

Contributers are welcome! Please see CONTRIBUTING.md for details.