DataFusion: Big Data Platform for Rust

DataFusion is a distributed data processing platform implemented in Rust. It is heavily inspired by Apache Spark and offers a similar programming style based on DataFrames and SQL.

DataFusion can also be used as a crate dependency in your project if you want to run SQL queries and DataFrame-style data manipulation in-process.

Project Home Page

The project home page is now at https://datafusion.rs

Current Status

DataFusion can currently be used as a crate dependency to execute SQL and DataFrame operations against data in-process. It can also be deployed as a distributed data processing platform, although only with a single worker so far.

Standalone

Both of these examples run a trivial query against a trivial CSV file using a single thread.

Distributed

It is possible to start a single worker node and use a SQL console to execute queries on the remote worker.

Run Worker

```bash
cargo run --bin worker
```

```
Worker listening on 0.0.0.0:8080
```

Run Console

```bash
cargo run --bin console
```

```
DataFusion Console
$ CREATE EXTERNAL TABLE ukcities (name VARCHAR(100) NOT NULL, lat DOUBLE NOT NULL, lng DOUBLE NOT NULL)
Executing: CREATE EXTERNAL TABLE ukcities (name VARCHAR(100) NOT NULL, lat DOUBLE NOT NULL, lng DOUBLE NOT NULL)

$ SELECT name, lat, lng FROM ukcities WHERE lat < 51
Executing: SELECT name, lat, lng FROM ukcities WHERE lat < 51
Eastbourne, East Sussex, UK,50.768036,0.290472
Weymouth, Dorset, UK,50.614429,-2.457621
Bournemouth, UK,50.720806,-1.904755
Hastings, East Sussex, UK,50.854259,0.573453
Uckfield, East Sussex, UK,50.967941,0.085831
Worthing, West Sussex, UK,50.825024,-0.383835
Plymouth, UK,50.376289,-4.143841

Query executed in 0.005250537 seconds
```
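To make the console session concrete, here is a minimal, std-only sketch of what the `SELECT ... WHERE lat < 51` query computes. It does not use DataFusion's API and is not how DataFusion executes the query; the `City` struct and the `parse_city` and `cities_south_of` helpers are hypothetical names chosen for illustration. It assumes each row has the shape `name,lat,lng`, where the name may itself contain commas, as in the output above.

```rust
/// One row of the `ukcities` table (hypothetical in-memory representation).
#[derive(Debug, Clone, PartialEq)]
struct City {
    name: String,
    lat: f64,
    lng: f64,
}

/// Parse a line such as `Plymouth, UK,50.376289,-4.143841`.
/// The name contains commas, so the two numeric fields are split off from the right.
fn parse_city(line: &str) -> Option<City> {
    let mut parts = line.rsplitn(3, ',');
    let lng = parts.next()?.trim().parse::<f64>().ok()?;
    let lat = parts.next()?.trim().parse::<f64>().ok()?;
    let name = parts.next()?.trim().to_string();
    Some(City { name, lat, lng })
}

/// Rough equivalent of `SELECT name, lat, lng FROM ukcities WHERE lat < max_lat`.
/// Lines that fail to parse are skipped.
fn cities_south_of(lines: &[&str], max_lat: f64) -> Vec<City> {
    lines
        .iter()
        .filter_map(|line| parse_city(line))
        .filter(|city| city.lat < max_lat)
        .collect()
}

fn main() {
    let lines = [
        "Plymouth, UK,50.376289,-4.143841",
        "Bournemouth, UK,50.720806,-1.904755",
        // Made-up row (not from the original data) so the filter has something to reject.
        "Madeup Town, UK,52.5,-1.0",
    ];
    for city in cities_south_of(&lines, 51.0) {
        println!("{},{},{}", city.name, city.lat, city.lng);
    }
}
```

Splitting from the right with `rsplitn` keeps the sketch short; a real CSV reader would handle quoting and escaping instead.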

Roadmap

I've started defining milestones and issues on GitHub, but the current priorities are:

Implement basic partitioning so that a query can run in parallel on multiple worker nodes
Implement shuffle so that more advanced distributed jobs can be executed
Implement in-memory SORT, JOIN, GROUP BY so more categories of query can be executed
Add support for Hadoop data sources such as HDFS, Parquet, and Kudu
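The first priority, running a query in parallel over partitions, can be sketched with threads standing in for worker nodes. This is only an illustration of the idea under that assumption, not DataFusion's design: `partition` and `parallel_filter` are hypothetical helpers, the rows are reduced to bare latitude values, and two of the sample values are made up so the filter has something to reject.

```rust
use std::thread;

/// Split `rows` into at most `n` contiguous partitions of near-equal size.
fn partition<T: Clone>(rows: &[T], n: usize) -> Vec<Vec<T>> {
    let n = n.max(1);
    let size = ((rows.len() + n - 1) / n).max(1);
    rows.chunks(size).map(|chunk| chunk.to_vec()).collect()
}

/// Apply the predicate `lat < max_lat` to each partition on its own thread,
/// then merge the per-partition results in partition order.
fn parallel_filter(rows: &[f64], workers: usize, max_lat: f64) -> Vec<f64> {
    let parts = partition(rows, workers);
    thread::scope(|s| {
        // Each spawned thread plays the role of one worker node.
        let handles: Vec<_> = parts
            .iter()
            .map(|part| {
                s.spawn(move || {
                    part.iter()
                        .copied()
                        .filter(|lat| *lat < max_lat)
                        .collect::<Vec<f64>>()
                })
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}

fn main() {
    // 51.5 and 52.0 are made-up values; the rest come from the console output above.
    let lats = [50.376289, 51.5, 50.720806, 52.0, 50.854259];
    let south = parallel_filter(&lats, 2, 51.0);
    println!("{:?}", south);
}
```

A filter needs no data exchange between partitions, which is why it can be parallelised before shuffle exists; operations such as JOIN and GROUP BY generally need rows with the same key to end up on the same worker.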

Contributing

Contributors are welcome! Please see CONTRIBUTING.md for details.