DataFusion in Python

This is a Python library that binds to DataFusion, the Apache Arrow in-memory query engine.

DataFusion's Python bindings can be used as an end-user tool as well as providing a foundation for building new systems.

Features

Comparison with other projects

Here is a comparison with similar projects that may help you decide whether DataFusion is suitable for your needs:

Example Usage

The following example demonstrates running a SQL query against a Parquet file using DataFusion, storing the results in a Pandas DataFrame, and then plotting a chart.

The Parquet file used in this example can be downloaded from the New York City taxi trip record data page.

```python
from datafusion import SessionContext

# Create a DataFusion context
ctx = SessionContext()

# Register table with context
ctx.register_parquet('taxi', 'yellow_tripdata_2021-01.parquet')

# Execute SQL
df = ctx.sql("select passenger_count, count(*) "
             "from taxi "
             "where passenger_count is not null "
             "group by passenger_count "
             "order by passenger_count")

# convert to Pandas
pandas_df = df.to_pandas()

# create a chart
fig = pandas_df.plot(kind="bar", title="Trip Count by Number of Passengers").get_figure()
fig.savefig('chart.png')
```

This produces the following chart:


Configuration

Runtime settings (memory and disk) and session configuration settings can be customized when creating a context.

```python
from datafusion import RuntimeConfig, SessionConfig, SessionContext

runtime = (
    RuntimeConfig()
    .with_disk_manager_os()
    .with_fair_spill_pool(10000000)
)
config = (
    SessionConfig()
    .with_create_default_catalog_and_schema(True)
    .with_default_catalog_and_schema("foo", "bar")
    .with_target_partitions(8)
    .with_information_schema(True)
    .with_repartition_joins(False)
    .with_repartition_aggregations(False)
    .with_repartition_windows(False)
    .with_parquet_pruning(False)
    .set("datafusion.execution.parquet.pushdown_filters", "true")
)
ctx = SessionContext(config, runtime)
```

Refer to the API documentation for more information.

Printing the context will show the current configuration settings.

```python
print(ctx)
```

More Examples

See examples for more information.

- Executing Queries with DataFusion
- Running User-Defined Python Code (see the sketch below)
- Substrait Support
- Executing SQL against DataFrame Libraries (Experimental)
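
As a taste of what those examples cover, the following is a minimal sketch of running user-defined Python code through the DataFrame API. It assumes the `udf`, `column`, and `create_dataframe` helpers exposed by the `datafusion` package; refer to the examples and API documentation for the authoritative versions.

```python
import pyarrow as pa
from datafusion import SessionContext, column, udf

# a scalar UDF receives Arrow arrays and returns an Arrow array
def is_null(array: pa.Array) -> pa.Array:
    return array.is_null()

is_null_arr = udf(is_null, [pa.int64()], pa.bool_(), "stable")

ctx = SessionContext()

# build a small in-memory table from a single record batch
batch = pa.RecordBatch.from_arrays([pa.array([1, None, 3])], names=["a"])
df = ctx.create_dataframe([[batch]])

# call the UDF through the DataFrame API and collect the result
df = df.select(is_null_arr(column("a")))
print(df.to_pandas())
```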

How to install (from pip)

Pip

```bash
pip install datafusion
# or
python -m pip install datafusion
```

Conda

```bash
conda install -c conda-forge datafusion
```

You can verify the installation by running:

```python
>>> import datafusion
>>> datafusion.__version__
'0.6.0'
```
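
As a further smoke test, a trivial query run against a fresh context should work end to end. This is a minimal sketch using only the APIs shown earlier in this README:

```python
from datafusion import SessionContext

# a SELECT without a FROM clause needs no registered tables
ctx = SessionContext()
df = ctx.sql("SELECT 1 + 2 AS result")
print(df.to_pandas())
```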

How to develop

This assumes that you have Rust and Cargo installed. We use the workflow recommended by pyo3 and maturin.

The maturin tooling used in this workflow can be installed via either Conda or Pip. Both approaches should offer the same experience; both are provided simply to accommodate developer preference. Bootstrapping instructions for Conda and Pip follow.

Bootstrap (Conda):

```bash
# fetch this repo
git clone git@github.com:apache/arrow-datafusion-python.git

# create the conda environment for dev
conda env create -f ./conda/environments/datafusion-dev.yaml -n datafusion-dev

# activate the conda environment
conda activate datafusion-dev
```

Bootstrap (Pip):

```bash
# fetch this repo
git clone git@github.com:apache/arrow-datafusion-python.git

# prepare development environment (used to build wheel / install in development)
python3 -m venv venv

# activate the venv
source venv/bin/activate

# update pip itself if necessary
python -m pip install -U pip

# install dependencies (for Python 3.8+)
python -m pip install -r requirements-310.txt
```

The tests rely on test data in git submodules.

```bash
git submodule init
git submodule update
```

Whenever the Rust code changes (either your own changes or via git pull):

```bash
# make sure you activate the venv using "source venv/bin/activate" first
maturin develop
python -m pytest
```
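
For quicker iteration, pytest's standard selection flags can narrow the run to a subset of tests; the keyword below is only an illustration:

```bash
# run only the tests whose names match a keyword expression
python -m pytest -k "parquet" -v
```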

Running & Installing pre-commit hooks

arrow-datafusion-python uses pre-commit to assist developers with code linting and to reduce the number of commits that ultimately fail in CI due to linter errors. Using the pre-commit hooks is optional, but it helps keep PRs clean and concise.

Our pre-commit hooks can be installed by running `pre-commit install`, which will install the configurations in your ARROW_DATAFUSION_PYTHON_ROOT/.github directory and run each time you perform a commit, failing the commit if an offending lint is found so that you can make changes locally before pushing.

The pre-commit hooks can also be run ad hoc, without installing them, by simply running `pre-commit run --all-files`.
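
For reference, the two invocations described above look like this (assuming pre-commit is installed in the active environment):

```bash
# install the hooks so they run on every commit
pre-commit install

# or run all hooks ad hoc without installing them
pre-commit run --all-files
```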

How to update dependencies

To change test dependencies, edit requirements.in and run:

```bash
# install pip-tools (only needed once); consider running this inside the venv
python -m pip install pip-tools
python -m piptools compile --generate-hashes -o requirements-310.txt
```

To update dependencies, run with `-U`:

```bash
python -m piptools compile -U --generate-hashes -o requirements-310.txt
```

More details here