dply is a command line tool for viewing, querying, and writing csv and parquet files, inspired by dplyr and powered by polars.
A dply pipeline consists of a number of functions to read, transform, or write data to disk.
The following is an example of a three steps pipeline that reads a parquet file selects all columns that contain amount and shows some of the data[^1]:
$ dply -c 'parquet("nyctaxi.parquet") | select(contains("amount")) | head()'
shape: (10, 4)
┌─────────────┬────────────┬──────────────┬──────────────┐
│ fare_amount ┆ tip_amount ┆ tolls_amount ┆ total_amount │
│ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═════════════╪════════════╪══════════════╪══════════════╡
│ 14.5 ┆ 3.76 ┆ 0.0 ┆ 22.56 │
│ 6.5 ┆ 0.0 ┆ 0.0 ┆ 9.8 │
│ 11.5 ┆ 2.96 ┆ 0.0 ┆ 17.76 │
│ 18.0 ┆ 4.36 ┆ 0.0 ┆ 26.16 │
│ 12.5 ┆ 3.25 ┆ 0.0 ┆ 19.55 │
│ 19.0 ┆ 0.0 ┆ 0.0 ┆ 22.3 │
│ 8.5 ┆ 0.0 ┆ 0.0 ┆ 11.8 │
│ 6.0 ┆ 2.0 ┆ 0.0 ┆ 11.3 │
│ 12.0 ┆ 3.26 ┆ 0.0 ┆ 19.56 │
│ 9.0 ┆ 2.56 ┆ 0.0 ┆ 15.36 │
└─────────────┴────────────┴──────────────┴──────────────┘
A simple pipeline can be passed as a command line argument with the -c
flag or
as standard input, for more complex pipelines is convenient to store the pipeline
in a file and run dply with the file name as a command line argument.
For example the NYC taxi test file [^1] has a payment_type
and total_amount
columns, let's say we want to find out for all payment types the minimum,
maximum, and mean amount paid and the number of payments for each type sorted in
descending order, we can write the following pipeline in a dply file:
```
parquet("nyctaxi.parquet") | groupby(paymenttype) | summarize( meanprice = mean(totalamount), minprice = min(totalamount), maxprice = max(totalamount), n = n() ) | arrange(desc(n)) | show() ```
and then run the script:
$ dply payments.dply
shape: (5, 5)
┌──────────────┬────────────┬───────────┬───────────┬─────┐
│ payment_type ┆ mean_price ┆ min_price ┆ max_price ┆ n │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 ┆ u32 │
╞══════════════╪════════════╪═══════════╪═══════════╪═════╡
│ Credit card ┆ 22.378757 ┆ 8.5 ┆ 84.36 ┆ 185 │
│ Cash ┆ 18.458491 ┆ 3.3 ┆ 63.1 ┆ 53 │
│ Unknown ┆ 26.847778 ┆ 9.96 ┆ 54.47 ┆ 9 │
│ Dispute ┆ -0.5 ┆ -8.3 ┆ 7.3 ┆ 2 │
│ No charge ┆ 8.8 ┆ 8.8 ┆ 8.8 ┆ 1 │
└──────────────┴────────────┴───────────┴───────────┴─────┘
250 rows parquet file sampled from the NYC trip record data.
The following is a list of the functions that are currently supported with examples, more examples can be found in the tests folder.
arrange
sorts the rows of its input dataframe according to the values of the
given columns:
$ dply -c 'parquet("nyctaxi.parquet") |
count(payment_type, VendorID) |
arrange(payment_type, n) |
show()'
shape: (8, 3)
┌──────────────┬──────────┬─────┐
│ payment_type ┆ VendorID ┆ n │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ u32 │
╞══════════════╪══════════╪═════╡
│ Cash ┆ 1 ┆ 12 │
│ Cash ┆ 2 ┆ 41 │
│ Credit card ┆ 1 ┆ 37 │
│ Credit card ┆ 2 ┆ 148 │
│ Dispute ┆ 2 ┆ 2 │
│ No charge ┆ 1 ┆ 1 │
│ Unknown ┆ 2 ┆ 4 │
│ Unknown ┆ 1 ┆ 5 │
└──────────────┴──────────┴─────┘
To invert the ordering of a column use the desc
function:
$ dply -c 'parquet("nyctaxi.parquet") |
count(payment_type, VendorID) |
arrange(desc(payment_type), n) |
show()'
shape: (8, 3)
┌──────────────┬──────────┬─────┐
│ payment_type ┆ VendorID ┆ n │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ u32 │
╞══════════════╪══════════╪═════╡
│ Unknown ┆ 2 ┆ 4 │
│ Unknown ┆ 1 ┆ 5 │
│ No charge ┆ 1 ┆ 1 │
│ Dispute ┆ 2 ┆ 2 │
│ Credit card ┆ 1 ┆ 37 │
│ Credit card ┆ 2 ┆ 148 │
│ Cash ┆ 1 ┆ 12 │
│ Cash ┆ 2 ┆ 41 │
└──────────────┴──────────┴─────┘
count
counts the number of unique values in the given columns:
$ dply -c 'parquet("nyctaxi.parquet") |
count(payment_type, VendorID) |
show()'
shape: (8, 3)
┌──────────────┬──────────┬─────┐
│ payment_type ┆ VendorID ┆ n │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ u32 │
╞══════════════╪══════════╪═════╡
│ Cash ┆ 1 ┆ 12 │
│ Cash ┆ 2 ┆ 41 │
│ Credit card ┆ 1 ┆ 37 │
│ Credit card ┆ 2 ┆ 148 │
│ Dispute ┆ 2 ┆ 2 │
│ No charge ┆ 1 ┆ 1 │
│ Unknown ┆ 1 ┆ 5 │
│ Unknown ┆ 2 ┆ 4 │
└──────────────┴──────────┴─────┘
passing sort = true
sorts the counters in descending order:
$ dply -c 'parquet("nyctaxi.parquet") |
count(payment_type, VendorID, sort=true) |
show()'
shape: (8, 3)
┌──────────────┬──────────┬─────┐
│ payment_type ┆ VendorID ┆ n │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ u32 │
╞══════════════╪══════════╪═════╡
│ Credit card ┆ 2 ┆ 148 │
│ Cash ┆ 2 ┆ 41 │
│ Credit card ┆ 1 ┆ 37 │
│ Cash ┆ 1 ┆ 12 │
│ Unknown ┆ 1 ┆ 5 │
│ Unknown ┆ 2 ┆ 4 │
│ Dispute ┆ 2 ┆ 2 │
│ No charge ┆ 1 ┆ 1 │
└──────────────┴──────────┴─────┘
When csv
is called as the first step in a pipeline it reads a csv file from disk:
$ dply -c 'csv("nyctaxi.csv") |
select(passenger_count, trip_distance, total_amount) |
head(5)'
shape: (5, 3)
┌─────────────────┬───────────────┬──────────────┐
│ passenger_count ┆ trip_distance ┆ total_amount │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 │
╞═════════════════╪═══════════════╪══════════════╡
│ 1 ┆ 3.14 ┆ 22.56 │
│ 2 ┆ 1.06 ┆ 9.8 │
│ 1 ┆ 2.36 ┆ 17.76 │
│ 1 ┆ 5.2 ┆ 26.16 │
│ 3 ┆ 0.0 ┆ 19.55 │
└─────────────────┴───────────────┴──────────────┘
when called after the first step it writes the active dataframe to disk:
$ dply -c 'parquet("nyctaxi.parquet") |
select(passenger_count, payment_type, trip_distance, total_amount) |
csv("trips.csv", overwrite = true) |
count(passenger_count, payment_type, sort = true) |
csv("payments.csv")'
$ ls *.csv
nyctaxi.csv payments.csv trips.csv
By default csv
generates an error if the file already exists, to overwrite the
file pass overwrite = true
.
distinct
keeps unique rows in the input dataframe:
$ dply -c 'parquet("nyctaxi.parquet") |
distinct(payment_type, VendorID) |
arrange(payment_type, VendorID) |
show()'
shape: (8, 2)
┌──────────────┬──────────┐
│ payment_type ┆ VendorID │
│ --- ┆ --- │
│ str ┆ i64 │
╞══════════════╪══════════╡
│ Cash ┆ 1 │
│ Cash ┆ 2 │
│ Credit card ┆ 1 │
│ Credit card ┆ 2 │
│ Dispute ┆ 2 │
│ No charge ┆ 1 │
│ Unknown ┆ 1 │
│ Unknown ┆ 2 │
└──────────────┴──────────┘
when called without any columns it shows the distinct rows in the input dataframe.
filter
retains all the rows whose column values satisfy the given predicates.
For each predicate the left hand side of each condition must specify a column,
predicates that are comma separated are applied one after the other:
$ dply -c 'parquet("nyctaxi.parquet") |
select(payment_type, trip_distance, total_amount) |
filter(payment_type == "Cash", trip_distance < 2, total_amount < 10) |
show()'
shape: (8, 3)
┌──────────────┬───────────────┬──────────────┐
│ payment_type ┆ trip_distance ┆ total_amount │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 │
╞══════════════╪═══════════════╪══════════════╡
│ Cash ┆ 1.06 ┆ 9.8 │
│ Cash ┆ 0.0 ┆ 3.3 │
│ Cash ┆ 1.24 ┆ 7.8 │
│ Cash ┆ 1.18 ┆ 8.8 │
│ Cash ┆ 1.18 ┆ 9.8 │
│ Cash ┆ 0.9 ┆ 8.3 │
│ Cash ┆ 0.74 ┆ 8.8 │
│ Cash ┆ 1.2 ┆ 9.8 │
└──────────────┴───────────────┴──────────────┘
filter
supports logical &
and |
in predicates, their priority is right
associative, the following predicate will return all rows whose payment is Cash
or rows whose trip_distance < 2
and total_amount < 10
:
$ dply -c 'parquet("nyctaxi.parquet") |
select(payment_type, trip_distance, total_amount) |
filter(payment_type == "Cash" | trip_distance < 2 & total_amount < 10) |
glimpse()'
Rows: 68
Columns: 3
+---------------+--------+----------------------------------------------------+
| payment_type | str | "Cash", "Cash", "Cash", "Credit card", "Cash",... |
| trip_distance | f64 | 1.06, 2.39, 1.52, 0.48, 2.88, 4.67, 1.6, 0.0,... |
| total_amount | f64 | 9.8, 22.3, 11.8, 9.13, 16.3, 21.3, 12.8, 3.3, 7... |
+---------------+--------+----------------------------------------------------+
we can use parenthesis to change the priority:
$ dply -c 'parquet("nyctaxi.parquet") |
select(payment_type, trip_distance, total_amount) |
filter((payment_type == "Cash" | trip_distance < 2) & total_amount < 10) |
glimpse()'
Rows: 23
Columns: 3
+---------------+--------+----------------------------------------------------+
| payment_type | str | "Cash", "Credit card", "Cash", "Dispute", "Cred... |
| trip_distance | f64 | 1.06, 0.48, 0.0, 0.43, 0.42, 0.66, 1.1, 0.49, 0.5 |
| total_amount | f64 | 9.8, 9.13, 3.3, 7.3, 8.5, 9.36, 8.8, 8.76, 9.8 |
+---------------+--------+----------------------------------------------------+
To compare dates use the dt
function, it can parse a string with a date-time
YYYY-MM-DD HH:MM:SS
or a date YYYY-MM-DD
:
$ dply -c 'parquet("nyctaxi.parquet") |
select(ends_with("time")) |
filter(tpep_pickup_datetime < dt("2022-11-01 12:00:00")) |
show()'
shape: (4, 2)
┌──────────────────────┬───────────────────────┐
│ tpep_pickup_datetime ┆ tpep_dropoff_datetime │
│ --- ┆ --- │
│ datetime[ns] ┆ datetime[ns] │
╞══════════════════════╪═══════════════════════╡
│ 2022-11-01 10:45:13 ┆ 2022-11-01 10:53:56 │
│ 2022-11-01 07:31:16 ┆ 2022-11-01 08:19:44 │
│ 2022-11-01 11:33:46 ┆ 2022-11-01 12:03:15 │
│ 2022-11-01 11:17:08 ┆ 2022-11-01 12:08:15 │
└──────────────────────┴───────────────────────┘
glimpse
displays an overview of the input dataframe by showing each column in a
row with its type and a few values. This format is convenient when a dataframe
has many columns and a table view doesn't fit in the terminal.
$ dply -c 'parquet("nyctaxi.parquet") | glimpse()'
Rows: 250
Columns: 19
+-----------------------+--------------+---------------------------------------------+
| VendorID | i64 | 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2... |
| tpep_pickup_datetime | datetime[ns] | 2022-11-22 19:27:01, 2022-11-27 16:43:26... |
| tpep_dropoff_datetime | datetime[ns] | 2022-11-22 19:45:53, 2022-11-27 16:50:06... |
| passenger_count | i64 | 1, 2, 1, 1, 3, 1, 2, 1, 1, 2, 2, 1, 1, 1... |
| trip_distance | f64 | 3.14, 1.06, 2.36, 5.2, 0.0, 2.39, 1.52,... |
| rate_code | str | "Standard", "Standard", "Standard",... |
| store_and_fwd_flag | str | "N", "N", "N", "N", "N", "N", "N", "N",... |
| PULocationID | i64 | 234, 48, 142, 79, 237, 137, 107, 229, 16... |
| DOLocationID | i64 | 141, 142, 236, 75, 230, 140, 162, 161, 1... |
| payment_type | str | "Credit card", "Cash", "Credit card",... |
| fare_amount | f64 | 14.5, 6.5, 11.5, 18.0, 12.5, 19.0, 8.5,... |
| extra | f64 | 1.0, 0.0, 0.0, 0.5, 3.0, 0.0, 0.0, 0.0,... |
| mta_tax | f64 | 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,... |
| tip_amount | f64 | 3.76, 0.0, 2.96, 4.36, 3.25, 0.0, 0.0, 2... |
| tolls_amount | f64 | 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,... |
| improvement_surcharge | f64 | 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,... |
| total_amount | f64 | 22.56, 9.8, 17.76, 26.16, 19.55, 22.3,... |
| congestion_surcharge | f64 | 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5,... |
| airport_fee | f64 | 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,... |
+-----------------------+--------------+---------------------------------------------+
As glimpse
consumes the input dataframe it must be the last function in a
pipeline.
group_by
and summarize
work together to compute aggregations on groups of
values. group_by
specifies which columns to use for the groups and summarize
specifies which aggregate operations to compute.
summarize
supports the following aggregate functions, max
, min
, mean
,
median
, sd
, sum
, var
and quantile
.
A call to group_by
must always be followed by a summarize
.
For example to compute the mean, standard deviation, minimum and maximum price paid and number of rows for each payment type:
$ dply -c 'parquet("nyctaxi.parquet") |
group_by(payment_type) |
summarize(
mean_price = mean(total_amount),
std_price = sd(total_amount),
min_price = min(total_amount),
max_price = max(total_amount),
n = n()
) |
arrange(desc(n)) |
show()'
shape: (5, 6)
┌──────────────┬────────────┬───────────┬───────────┬───────────┬─────┐
│ payment_type ┆ mean_price ┆ std_price ┆ min_price ┆ max_price ┆ n │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ u32 │
╞══════════════╪════════════╪═══════════╪═══════════╪═══════════╪═════╡
│ Credit card ┆ 22.378757 ┆ 16.095337 ┆ 8.5 ┆ 84.36 ┆ 185 │
│ Cash ┆ 18.458491 ┆ 12.545236 ┆ 3.3 ┆ 63.1 ┆ 53 │
│ Unknown ┆ 26.847778 ┆ 14.279152 ┆ 9.96 ┆ 54.47 ┆ 9 │
│ Dispute ┆ -0.5 ┆ 11.030866 ┆ -8.3 ┆ 7.3 ┆ 2 │
│ No charge ┆ 8.8 ┆ 0.0 ┆ 8.8 ┆ 8.8 ┆ 1 │
└──────────────┴────────────┴───────────┴───────────┴───────────┴─────┘
See tests for more examples.
head
shows the first few rows from a dataframe, an optional parameter can be
used to change the number of rows that are shown.
head
must be the last step in a pipeline as it consumes the input dataframe.
mutate
creates new columns by applying transformations to existing columns. For
example to add column for trip duration and average speed in km/h:
$ dply -c 'parquet("nyctaxi.parquet") |
select(ends_with("time"), trip_distance_mi = trip_distance) |
mutate(
travel_time_ns = tpep_dropoff_datetime - tpep_pickup_datetime,
trip_distance_km = trip_distance_mi * 1.60934,
avg_speed_km_h = trip_distance_km / (travel_time_ns / 3.6e12)
) |
select(travel_time_ns, trip_distance_km, avg_speed_km_h) |
arrange(desc(travel_time_ns)) |
head(5)'
shape: (5, 3)
┌────────────────┬──────────────────┬────────────────┐
│ travel_time_ns ┆ trip_distance_km ┆ avg_speed_km_h │
│ --- ┆ --- ┆ --- │
│ duration[ns] ┆ f64 ┆ f64 │
╞════════════════╪══════════════════╪════════════════╡
│ 1h 6m ┆ 28.179543 ┆ 25.617767 │
│ 1h 2m 39s ┆ 28.630159 ┆ 27.419146 │
│ 55m 48s ┆ 26.763324 ┆ 28.777768 │
│ 53m 45s ┆ 19.988003 ┆ 22.312189 │
│ 51m 7s ┆ 14.966862 ┆ 17.567885 │
└────────────────┴──────────────────┴────────────────┘
mutate
supports also mean
, max
, min
, median
, and dt
functions see
tests for more examples.
When parquet
is called as the first step in a pipeline it reads a parquet file
from disk:
$ dply -c 'parquet("nyctaxi.parquet") |
select(passenger_count, trip_distance, total_amount) |
head(5)'
shape: (5, 3)
┌─────────────────┬───────────────┬──────────────┐
│ passenger_count ┆ trip_distance ┆ total_amount │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 │
╞═════════════════╪═══════════════╪══════════════╡
│ 1 ┆ 3.14 ┆ 22.56 │
│ 2 ┆ 1.06 ┆ 9.8 │
│ 1 ┆ 2.36 ┆ 17.76 │
│ 1 ┆ 5.2 ┆ 26.16 │
│ 3 ┆ 0.0 ┆ 19.55 │
└─────────────────┴───────────────┴──────────────┘
when called after the first step it writes the active dataframe to disk:
``` $ dply -c 'parquet("nyctaxi.parquet") | select(passengercount, paymenttype, tripdistance, totalamount) | parquet("trips.parquet", overwrite = true) | count(passengercount, paymenttype, sort = true) | parquet("payments.parquet")'
$ ls *.parquet nyctaxi.parquet payments.parquet trips.parquet ```
By default parquet
generates an error if the file already exists, to overwrite
the file pass overwrite = true
.
relocate
moves column in the dataframe, by default the given columns are moved
before the first column:
$ dply -c 'parquet("nyctaxi.parquet") |
relocate(passenger_count, payment_type, total_amount) |
glimpse()'
Rows: 250
Columns: 19
+-----------------------+--------------+----------------------------------------------------+
| passenger_count | i64 | 1, 2, 1, 1, 3, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1, 5,... |
| payment_type | str | "Credit card", "Cash", "Credit card", "Credit... |
| total_amount | f64 | 22.56, 9.8, 17.76, 26.16, 19.55, 22.3, 11.8, 11... |
| VendorID | i64 | 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2,... |
| tpep_pickup_datetime | datetime[ns] | 2022-11-22 19:27:01, 2022-11-27 16:43:26,... |
| tpep_dropoff_datetime | datetime[ns] | 2022-11-22 19:45:53, 2022-11-27 16:50:06,... |
| trip_distance | f64 | 3.14, 1.06, 2.36, 5.2, 0.0, 2.39, 1.52, 0.51,... |
| rate_code | str | "Standard", "Standard", "Standard", "Standard",... |
| store_and_fwd_flag | str | "N", "N", "N", "N", "N", "N", "N", "N", "N", "N... |
| PULocationID | i64 | 234, 48, 142, 79, 237, 137, 107, 229, 162, 48,... |
| DOLocationID | i64 | 141, 142, 236, 75, 230, 140, 162, 161, 186, 239... |
| fare_amount | f64 | 14.5, 6.5, 11.5, 18.0, 12.5, 19.0, 8.5, 6.0, 12... |
| extra | f64 | 1.0, 0.0, 0.0, 0.5, 3.0, 0.0, 0.0, 0.0, 1.0, 0.... |
| mta_tax | f64 | 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.... |
| tip_amount | f64 | 3.76, 0.0, 2.96, 4.36, 3.25, 0.0, 0.0, 2.0, 3.2... |
| tolls_amount | f64 | 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.... |
| improvement_surcharge | f64 | 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.... |
| congestion_surcharge | f64 | 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.... |
| airport_fee | f64 | 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.... |
+-----------------------+--------------+----------------------------------------------------+
relocate
also supports the options before = column
and after = column
to
move columns before or after a specific column, see tests
for examples.
rename
renames columns, each rename has new_name = old_name
format:
$ dply -c 'parquet("nyctaxi.parquet") |
rename(
vendor_id = VendorID,
pu_location_id = PULocationID,
do_location_id = DOLocationID
) |
glimpse()'
Rows: 250
Columns: 19
+-----------------------+--------------+----------------------------------------------------+
| vendor_id | i64 | 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2,... |
| tpep_pickup_datetime | datetime[ns] | 2022-11-22 19:27:01, 2022-11-27 16:43:26,... |
| tpep_dropoff_datetime | datetime[ns] | 2022-11-22 19:45:53, 2022-11-27 16:50:06,... |
| passenger_count | i64 | 1, 2, 1, 1, 3, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1, 5,... |
| trip_distance | f64 | 3.14, 1.06, 2.36, 5.2, 0.0, 2.39, 1.52, 0.51,... |
| rate_code | str | "Standard", "Standard", "Standard", "Standard",... |
| store_and_fwd_flag | str | "N", "N", "N", "N", "N", "N", "N", "N", "N", "N... |
| pu_location_id | i64 | 234, 48, 142, 79, 237, 137, 107, 229, 162, 48,... |
| do_location_id | i64 | 141, 142, 236, 75, 230, 140, 162, 161, 186, 239... |
| payment_type | str | "Credit card", "Cash", "Credit card", "Credit... |
| fare_amount | f64 | 14.5, 6.5, 11.5, 18.0, 12.5, 19.0, 8.5, 6.0, 12... |
| extra | f64 | 1.0, 0.0, 0.0, 0.5, 3.0, 0.0, 0.0, 0.0, 1.0, 0.... |
| mta_tax | f64 | 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.... |
| tip_amount | f64 | 3.76, 0.0, 2.96, 4.36, 3.25, 0.0, 0.0, 2.0, 3.2... |
| tolls_amount | f64 | 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.... |
| improvement_surcharge | f64 | 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.... |
| total_amount | f64 | 22.56, 9.8, 17.76, 26.16, 19.55, 22.3, 11.8, 11... |
| congestion_surcharge | f64 | 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.... |
| airport_fee | f64 | 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.... |
+-----------------------+--------------+----------------------------------------------------+
select
keeps the columns specified in its arguments and optionally rename them.
It accepts column names and starts_with
, ends_with
and contains
predicates:
$ dply -c 'parquet("nyctaxi.parquet") |
select(
vendor_id = VendorID,
ends_with("time"),
contains("amount")
) |
glimpse()'
Rows: 250
Columns: 7
+-----------------------+--------------+----------------------------------------------------+
| vendor_id | i64 | 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2,... |
| tpep_pickup_datetime | datetime[ns] | 2022-11-22 19:27:01, 2022-11-27 16:43:26,... |
| tpep_dropoff_datetime | datetime[ns] | 2022-11-22 19:45:53, 2022-11-27 16:50:06,... |
| fare_amount | f64 | 14.5, 6.5, 11.5, 18.0, 12.5, 19.0, 8.5, 6.0, 12... |
| tip_amount | f64 | 3.76, 0.0, 2.96, 4.36, 3.25, 0.0, 0.0, 2.0, 3.2... |
| tolls_amount | f64 | 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.... |
| total_amount | f64 | 22.56, 9.8, 17.76, 26.16, 19.55, 22.3, 11.8, 11... |
+-----------------------+--------------+----------------------------------------------------+
Any of the predicates functions can be negated with !
:
$ dply -c 'parquet("nyctaxi.parquet") |
select(!contains("a")) |
head(5)'
shape: (5, 1)
┌──────────┐
│ VendorID │
│ --- │
│ i64 │
╞══════════╡
│ 2 │
│ 2 │
│ 2 │
│ 2 │
│ 1 │
└──────────┘
show
displays all the rows in the input dataframe in table format. show
must
be the last step in a pipeline as it consumes the input dataframe.
Dataframe variables store a pipeline progress that can be used in another
pipeline, in the following example the fare_amounts
variable stores the result
of the parent select
that is then used by another group_by
:
``` $ dply -c 'parquet("nyctaxi.parquet") | select(endswith("time"), paymenttype, contains("amount")) | fareamounts | groupby(paymenttype) | summarize(meanamount = mean(total_amount)) | head()
fareamounts | groupby(paymenttype) | summarize(meantips = mean(tipamount)) | head()' shape: (5, 2) ┌──────────────┬─────────────┐ │ paymenttype ┆ meanamount │ │ --- ┆ --- │ │ str ┆ f64 │ ╞══════════════╪═════════════╡ │ Credit card ┆ 22.378757 │ │ Cash ┆ 18.458491 │ │ Dispute ┆ -0.5 │ │ Unknown ┆ 26.847778 │ │ No charge ┆ 8.8 │ └──────────────┴─────────────┘ shape: (5, 2) ┌──────────────┬───────────┐ │ paymenttype ┆ mean_tips │ │ --- ┆ --- │ │ str ┆ f64 │ ╞══════════════╪═══════════╡ │ Credit card ┆ 3.469784 │ │ Cash ┆ 0.0 │ │ Dispute ┆ 0.0 │ │ Unknown ┆ 3.082222 │ │ No charge ┆ 0.0 │ └──────────────┴───────────┘ ```
Binaries generated by the release Github action for Linux, macOS (x86), and Windows are available in the releases page.
You can also install dply
using Cargo:
bash
cargo install dply
or by building it from this repository:
bash
git clone https://github.com/vincev/dply-rs
cd dply-rs
cargo install --path .