This is a Python library that binds to DataFusion, the Apache Arrow in-memory query engine.
DataFusion's Python bindings can be used as an end-user tool as well as providing a foundation for building new systems.
Here is a comparison with similar projects that may help you understand when DataFusion might be suitable or unsuitable for your needs:
DuckDB is an open source, in-process analytic database. Like DataFusion, it supports very fast execution, both from its custom file format and directly from Parquet files. Unlike DataFusion, it is written in C/C++ and is primarily used directly by users as a serverless database and query system, rather than as a library for building such systems.
Polars is one of the fastest DataFrame libraries at the time of writing. Like DataFusion, it is written in Rust and uses the Apache Arrow memory model, but unlike DataFusion it does not provide full SQL support or as many extension points.
The following example demonstrates running a SQL query against a Parquet file using DataFusion, storing the results in a Pandas DataFrame, and then plotting a chart.
The Parquet file used in this example can be downloaded from the following page:
```python
from datafusion import SessionContext

# create a query context and register the Parquet file as a table
ctx = SessionContext()
ctx.register_parquet('taxi', 'yellow_tripdata_2021-01.parquet')

# run a SQL query against the registered table
df = ctx.sql("select passenger_count, count(*) "
             "from taxi "
             "where passenger_count is not null "
             "group by passenger_count "
             "order by passenger_count")

# convert to a Pandas DataFrame and plot the result
pandas_df = df.to_pandas()
fig = pandas_df.plot(kind="bar", title="Trip Count by Number of Passengers").get_figure()
fig.savefig('chart.png')
```
This produces the following chart:
It is possible to configure the runtime (memory and disk settings) and the session configuration when creating a context.
```python
from datafusion import RuntimeConfig, SessionConfig, SessionContext

runtime = (
    RuntimeConfig()
    .with_disk_manager_os()
    .with_fair_spill_pool(10000000)
)
config = (
    SessionConfig()
    .with_create_default_catalog_and_schema(True)
    .with_default_catalog_and_schema("foo", "bar")
    .with_target_partitions(8)
    .with_information_schema(True)
    .with_repartition_joins(False)
    .with_repartition_aggregations(False)
    .with_repartition_windows(False)
    .with_parquet_pruning(False)
    .set("datafusion.execution.parquet.pushdown_filters", "true")
)
ctx = SessionContext(config, runtime)
```
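With the context configured this way, queries run just as in the earlier example. A minimal sketch (reusing the table name and Parquet file from the example above, purely for illustration):

```python
# register a Parquet file against the configured context (path reused from the earlier example)
ctx.register_parquet("taxi", "yellow_tripdata_2021-01.parquet")

# pushdown_filters was enabled above, so the filter can be evaluated during the Parquet scan
df = ctx.sql("select count(*) from taxi where passenger_count is not null")
print(df.to_pandas())
```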
Refer to the API documentation for more information.
Printing the context will show the current configuration settings.
```python
print(ctx)
```
See examples for more information.
```bash
pip install datafusion
python -m pip install datafusion
```
```bash
conda install -c conda-forge datafusion
```
You can verify the installation by running:
```python
>>> import datafusion
>>> datafusion.__version__
'0.6.0'
```
This assumes that you have Rust and Cargo installed. We use the workflow recommended by pyo3 and maturin.
The Maturin tools used in this workflow can be installed either via Conda or via Pip. Both approaches should offer the same experience; both are provided only to suit developer preference. Bootstrapping for Conda and for Pip is shown below.
Bootstrap (Conda):
```bash
git clone git@github.com:apache/arrow-datafusion-python.git
conda env create -f ./conda/environments/datafusion-dev.yaml -n datafusion-dev
conda activate datafusion-dev
```
Bootstrap (Pip):
```bash
git clone git@github.com:apache/arrow-datafusion-python.git
python3 -m venv venv
source venv/bin/activate
python -m pip install -U pip
python -m pip install -r requirements-310.txt
```
The tests rely on test data in git submodules.
```bash
git submodule init
git submodule update
```
Whenever Rust code changes (your changes or via `git pull`):

```bash
maturin develop
python -m pytest
```
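During iteration it can be faster to run only part of the test suite; `pytest`'s standard selection flags work here (the name pattern below is hypothetical):

```bash
# rebuild the extension module after Rust changes
maturin develop

# run only the tests whose names match the (hypothetical) pattern
python -m pytest -k "test_sql"
```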
arrow-datafusion-python takes advantage of pre-commit to assist developers with code linting and to reduce the number of commits that ultimately fail in CI due to linter errors. Using the pre-commit hooks is optional for the developer, but certainly helpful for keeping PRs clean and concise.
Our pre-commit hooks can be installed by running `pre-commit install`, which will install the configurations in your ARROW_DATAFUSION_PYTHON_ROOT/.github directory and run each time you perform a commit, failing the commit if an offending lint is found and allowing you to make changes locally before pushing.
The pre-commit hooks can also be run ad hoc, without installing them, by simply running `pre-commit run --all-files`.
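Put together, the two ways of running the hooks look like this:

```bash
# install the hooks so they run automatically on every commit
pre-commit install

# or run all hooks ad hoc against every file, without installing them
pre-commit run --all-files
```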
To change test dependencies, change the `requirements.in` file and run:
```bash
python -m pip install pip-tools
python -m piptools compile --generate-hashes -o requirements-310.txt
```
To update dependencies, run with `-U`:
```bash
python -m piptools compile -U --generate-hashes -o requirements-310.txt
```
More details are available in the pip-tools documentation.