DataFusion: Big Data Platform for Rust


DataFusion is a distributed data processing platform implemented in Rust. It is very much inspired by Apache Spark and has a similar programming style through the use of DataFrames and SQL.

DataFusion can also be used as a crate dependency in your project if you want the ability to perform SQL queries and DataFrame style data manipulation in-process.

Project Home Page

The project home page is now at https://datafusion.rs

Current Status

DataFusion can currently be used as a crate dependency to execute SQL and DataFrame operations against data in-process. It can also be deployed as a distributed data processing platform, but only with a single worker so far.

Standalone

Both of these examples run a trivial query against a trivial CSV file using a single thread.

Distributed

It is possible to start a single worker node and use a SQL console to execute queries in the remote worker.

Run Worker

```bash
cargo run --bin worker
```

```
Worker listening on 0.0.0.0:8080
```
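The worker/console split can be pictured as a simple request/response exchange. The following is a minimal sketch using only the Rust standard library; it is not DataFusion's implementation, and the wire format (one line of SQL in, one line of text out) is an assumption made for illustration. The names `execute`, `serve_one`, and `send_query` are hypothetical, and `execute` only echoes the statement instead of running it.

```rust
use std::io::{BufRead, BufReader, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::thread;

/// Hypothetical stand-in for query execution: only echoes the statement.
fn execute(sql: &str) -> String {
    format!("Executing: {}", sql.trim())
}

/// Worker side: accept one connection, read one line of SQL, reply with one line.
fn serve_one(listener: &TcpListener) -> std::io::Result<()> {
    let (stream, _addr) = listener.accept()?;
    let mut reader = BufReader::new(stream.try_clone()?);
    let mut sql = String::new();
    reader.read_line(&mut sql)?;
    let mut writer = stream;
    writeln!(writer, "{}", execute(&sql))?;
    Ok(())
}

/// Console side: send one statement and return the worker's one-line reply.
fn send_query(addr: SocketAddr, sql: &str) -> std::io::Result<String> {
    let mut stream = TcpStream::connect(addr)?;
    writeln!(stream, "{}", sql)?;
    let mut reply = String::new();
    BufReader::new(stream).read_line(&mut reply)?;
    Ok(reply.trim_end().to_string())
}

fn main() -> std::io::Result<()> {
    // Port 0 asks the OS for any free port, so the sketch does not clash
    // with a real worker listening on 8080.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    let worker = thread::spawn(move || serve_one(&listener));

    let reply = send_query(addr, "SELECT name, lat, lng FROM ukcities WHERE lat < 51")?;
    println!("{}", reply);

    worker.join().expect("worker thread panicked")?;
    Ok(())
}
```

Running the worker on a background thread keeps the sketch in a single binary; in the real setup the worker and console are separate processes, as the commands in this section show.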

Run Console

```bash
cargo run --bin console
```

```
DataFusion Console
$ CREATE EXTERNAL TABLE ukcities (name VARCHAR(100) NOT NULL, lat DOUBLE NOT NULL, lng DOUBLE NOT NULL)
Executing: CREATE EXTERNAL TABLE ukcities (name VARCHAR(100) NOT NULL, lat DOUBLE NOT NULL, lng DOUBLE NOT NULL)

$ SELECT name, lat, lng FROM ukcities WHERE lat < 51
Executing: SELECT name, lat, lng FROM ukcities WHERE lat < 51
Eastbourne, East Sussex, UK,50.768036,0.290472
Weymouth, Dorset, UK,50.614429,-2.457621
Bournemouth, UK,50.720806,-1.904755
Hastings, East Sussex, UK,50.854259,0.573453
Uckfield, East Sussex, UK,50.967941,0.085831
Worthing, West Sussex, UK,50.825024,-0.383835
Plymouth, UK,50.376289,-4.143841

Query executed in 0.005250537 seconds
```
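To make the `SELECT` query in the console session concrete, here is a minimal sketch of the same filter in plain Rust using only the standard library. It does not use DataFusion's API; the `City` struct and the `parse_city` and `south_of` helpers are hypothetical names introduced for illustration. It assumes rows have the shape shown in the console output, where the name itself contains commas, so the two numeric columns are split off from the right.

```rust
/// One row of the `ukcities` table (hypothetical struct, not a DataFusion type).
#[derive(Debug, Clone, PartialEq)]
struct City {
    name: String,
    lat: f64,
    lng: f64,
}

/// Parse a line such as `Plymouth, UK,50.376289,-4.143841`.
/// The name may contain commas, so the last two fields are split off from the right.
fn parse_city(line: &str) -> Option<City> {
    let mut parts = line.rsplitn(3, ',');
    let lng: f64 = parts.next()?.trim().parse().ok()?;
    let lat: f64 = parts.next()?.trim().parse().ok()?;
    let name = parts.next()?.trim().to_string();
    Some(City { name, lat, lng })
}

/// Rough equivalent of `SELECT name, lat, lng FROM ukcities WHERE lat < max_lat`.
fn south_of(cities: &[City], max_lat: f64) -> Vec<&City> {
    cities.iter().filter(|c| c.lat < max_lat).collect()
}

fn main() {
    // The first two rows come from the console output; the London row is an
    // illustrative addition so that the filter has something to exclude.
    let lines = [
        "Eastbourne, East Sussex, UK,50.768036,0.290472",
        "Plymouth, UK,50.376289,-4.143841",
        "London, UK,51.507351,-0.127758",
    ];

    // Lines that fail to parse are skipped rather than treated as errors.
    let cities: Vec<City> = lines.iter().filter_map(|l| parse_city(l)).collect();

    for c in south_of(&cities, 51.0) {
        println!("{},{},{}", c.name, c.lat, c.lng);
    }
}
```

Splitting from the right with `rsplitn` is a shortcut that works only because the last two columns are always numeric; a general CSV reader would need proper quoting rules.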

Roadmap

I've started defining milestones and issues on GitHub, but the current priorities are.

Contributing

Contributors are welcome! Please see CONTRIBUTING.md for details.