Pipelined Relational Query Language, pronounced "Prequel".
PRQL is a modern language for transforming data — a simpler and more powerful SQL. Like SQL, it's readable, explicit and declarative. Unlike SQL, it forms a logical pipeline of transformations, and supports abstractions such as variables and functions. It can be used with any database that uses SQL, since it transpiles to SQL.
PRQL was discussed on Hacker News and Lobsters earlier this year when it was just a proposal.
PRQL just hit 0.1! This means it works for basic transformations such as `filter`, `select`, `aggregate`, `take`, `sort`, & `join`. Variables (`derive`), functions (`func`) and CTEs (`table`) work.
PRQL can be installed with `cargo`, or built from source:

```sh
cargo install prql
```

```sh
$ echo "from employees | filter has_dog" | prql compile
SELECT * FROM employees WHERE has_dog
```
See below for fuller examples of PRQL.
There is a Python implementation at qorrect/PyPrql, which can be installed with `pip install pyprql`. It has some great features, including a native interactive console with auto-complete for column names.
Here's a fairly simple SQL query:
```sql
SELECT TOP 20
    title,
    country,
    AVG(salary) AS average_salary,
    SUM(salary) AS sum_salary,
    AVG(salary + payroll_tax) AS average_gross_salary,
    SUM(salary + payroll_tax) AS sum_gross_salary,
    AVG(salary + payroll_tax + benefits_cost) AS average_gross_cost,
    SUM(salary + payroll_tax + benefits_cost) AS sum_gross_cost,
    COUNT(*) AS count
FROM employees
WHERE salary + payroll_tax + benefits_cost > 0 AND country = 'USA'
GROUP BY title, country
HAVING count > 200
ORDER BY sum_gross_cost
```
Even this simple query demonstrates some of the problems with SQL's lack of abstractions:
- The repetition in the `WHERE` clause obfuscates the meaning of the expression.
- `HAVING` & `WHERE` are fundamentally similar operations applied at different stages of the pipeline, but SQL's lack of pipeline-based precedence requires two different operators.
- The `SELECT` operator both creates new aggregations, and selects which columns to include.
- Commenting out the final line of the `SELECT` list causes a syntax error because of how commas are handled, and we need to repeat the columns in the `GROUP BY` clause in the `SELECT` list.

Here's the same query with PRQL:
```elm
from employees
filter country = "USA"                        # Each line transforms the previous result.
derive [                                      # This adds columns / variables.
  gross_salary: salary + payroll_tax,
  gross_cost: gross_salary + benefits_cost    # Variables can use other variables.
]
filter gross_cost > 0
aggregate by:[title, country] [               # `by` are the columns to group by.
  average salary,                             # These are aggregation calcs run on each group.
  sum salary,
  average gross_salary,
  sum gross_salary,
  average gross_cost,
  sum_gross_cost: sum gross_cost,
  ct: count *,
]
sort sum_gross_cost
filter ct > 200
take 20
```
As well as using variables to reduce unnecessary repetition, the query is also more readable: it flows from top to bottom, each line representing a transformation of the previous line's result. For example, `TOP 20` / `take 20` modify the final result in both queries, but only PRQL represents it as the final transformation. And context is localized: the `aggregate` function contains both the calculations and the columns to group by.
While PRQL is designed for reading & writing by people, it's also much simpler for code to construct or edit PRQL queries. In SQL, adding a filter to a query involves parsing the query to find and then modify the `WHERE` clause, or wrapping the existing query in a CTE. In PRQL, adding a filter just involves appending a `filter` transformation to the final line.
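As a toy sketch (not part of PRQL's tooling; the function name is hypothetical), appending a filter to a PRQL query can be plain string manipulation, because the end of the pipeline is always a valid place for a new transformation:

```python
def add_filter(prql_query: str, condition: str) -> str:
    """Append a `filter` transformation to a PRQL pipeline.

    Because each PRQL line transforms the previous result, a new
    filter can simply be appended; no parsing of the query is needed.
    """
    return prql_query.rstrip("\n") + f"\nfilter {condition}\n"

query = "from employees\nselect [title, salary]\n"
print(add_filter(query, "salary > 100"))
```

The equivalent edit to a SQL string would require locating (or creating) the `WHERE` clause, or wrapping the query in a CTE.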
Here's another SQL query, which calculates returns from prices on days with valid prices.
```sql
SELECT
    date,
    -- Can't use a `WHERE` clause, as it would affect the row that the `LAG` function referenced.
    IF(is_valid_price, price_adjusted / LAG(price_adjusted, 1) OVER
        (PARTITION BY sec_id ORDER BY date) - 1 + dividend_return, NULL) AS return_total,
    IF(is_valid_price, price_adjusted_usd / LAG(price_adjusted_usd, 1) OVER
        (PARTITION BY sec_id ORDER BY date) - 1 + dividend_return, NULL) AS return_usd,
    IF(is_valid_price, price_adjusted / LAG(price_adjusted, 1) OVER
        (PARTITION BY sec_id ORDER BY date) - 1 + dividend_return, NULL)
        - interest_rate / 252 AS return_excess,
    IF(is_valid_price, price_adjusted_usd / LAG(price_adjusted_usd, 1) OVER
        (PARTITION BY sec_id ORDER BY date) - 1 + dividend_return, NULL)
        - interest_rate / 252 AS return_usd_excess
FROM prices
```
This might seem like a convoluted example, but it's taken from a real query. Indeed, it's also simpler and smaller than the full logic — note that it starts from `price_adjusted`, whose logic had to be split into a previous query to avoid the SQL becoming even less readable.
Here's the same query with PRQL:
```elm prql version:0.1 db:snowflake  # Version number & database name.

func lag_day x = (
  window x
  by sec_id
  sort date
  lag 1
)
func ret x = x / (x | lag_day) - 1 + dividend_return
func excess x = (x - interest_rate) / 252
func if_valid x = is_valid_price ? x : null

from prices
derive [
  return_total:      prices_adj   | ret | if_valid   # `|` can be used rather than newlines.
  return_usd:        prices_usd   | ret | if_valid
  return_excess:     return_total | excess
  return_usd_excess: return_usd   | excess
]
select [
  date,
  sec_id,
  return_total,
  return_usd,
  return_excess,
  return_usd_excess,
]
```
Because we define the functions once rather than copying & pasting the code, we get all the benefits of encapsulation and extensibility — we can have reliable & tested functions, whose purpose is explicit, which we can share across queries and colleagues.
PRQL is intended to be a modern, simple, declarative language for transforming data, with abstractions such as variables & functions. It's intended to replace SQL, but doesn't have ambitions as a general-purpose programming language. While it's at a pre-alpha stage, it has some immutable principles.
I'm excited and inspired by the level of enthusiasm behind the project, both from individual contributors and the broader community of users who are unsatisfied with SQL. We currently have an initial working version for the intrepid early user.
I'm hoping we can build a beautiful language, an app that's approachable & powerful, and a vibrant community. Many projects have reached the current stage and fallen, so this requires compounding on what we've done so far.
Already since becoming public, the language has improved dramatically, thanks to the feedback of dozens of contributors. The basics are now stable, and while we'll hit corner cases, I expect we'll make only small changes to the existing features, even as we continue adding new ones.
Feel free to post questions or continue discussions on Language Design Issues.
Currently the documentation exists in tests, examples, and some Notes below.
If you're up for contributing and don't have a preference for writing code or not, this is the area that would most benefit from your contribution. Issues are tagged with documentation.
Currently the language is not friendly, as described in Current status. We'd like to make error messages better, sand off sharp corners, etc.
Both bug reports of unfriendliness, and code contributions to improve them are welcome; there's a friendliness label.
As well as a command-line tool that transpiles queries, we'd like to make developing in PRQL a wonderful experience, where it feels like it's on your side:
One benefit of PRQL over SQL is that auto-complete, type-inference, and error checking can be much more powerful.
This is harder to build, since it requires a connection to the database in order to understand the schema of the table.
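As a sketch of what that could look like (using SQLite's `PRAGMA table_info`; other databases expose catalogs such as `information_schema` instead), the column names needed for auto-complete could be fetched like this:

```python
import sqlite3

def column_names(conn: sqlite3.Connection, table: str) -> list[str]:
    # PRAGMA table_info returns one row per column; the name is field 1.
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, title TEXT, salary REAL)")
print(column_names(conn, "employees"))  # ['id', 'title', 'salary']
```

With the schema known, a language server could check column references and offer completions as the pipeline is typed.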
We should focus on solving a distinct problem really well. PRQL's goal is to make reading and writing analytical queries easier, and so for the moment that means putting some things out of scope:
If you're interested in joining the community to build a better SQL, there are lots of ways of contributing; big and small:
Any of these will inspire others to spend more time developing this; thank you in advance.
LINQ is a pipelined language for the .NET ecosystem which can (mostly) compile to SQL. It was one of the first languages to take this approach.

If any of these descriptions can be improved, please feel free to PR changes.
Joins are implemented as `join side:{join_type} {table} {[conditions]}`. For example:

```elm
from employees
join side:left positions [id=employee_id]
```

...is equivalent to...

```sql
SELECT * FROM employees LEFT JOIN positions ON id = employee_id
```
Possibly we could shorten `[id=id]` to `id`, and use SQL's `USING`, but it may be ambiguous with using `id` as a boolean column.
Previously the syntax was `{join_type} {table} {[conditions]}`. For example:

```elm
left_join positions [id=employee_id]
```

...but it was not expandable.
So a function like:
```elm
func lag col sort_col by_col=id = (
  window col
  by by_col
  sort sort_col
  lag 1
)
```
...is called `lag`, and takes three arguments: `col`, `sort_col` & `by_col`. The first two must be supplied; the third can optionally be supplied with `by_col:sec_id`.
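PRQL's optional named arguments behave much like Python's keyword defaults; a rough analogue (the SQL-building body here is purely illustrative):

```python
def lag(col: str, sort_col: str, by_col: str = "id") -> str:
    # `by_col` is optional with a default, mirroring PRQL's `by_col=id`;
    # callers override it by name, like PRQL's `by_col:sec_id`.
    return f"LAG({col}, 1) OVER (PARTITION BY {by_col} ORDER BY {sort_col})"

print(lag("price", "date"))                   # default partitioning column
print(lag("price", "date", by_col="sec_id"))  # overridden by name
```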
New columns or variables are created with `derive {column_name}: {calculation}` in a pipeline. `derive` can also take a list of pairs. Technically this "upserts" the column: it'll either create or overwrite a column, depending on whether it already exists.
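The upsert behaviour is analogous to assigning into a dict of columns; a minimal sketch (a hypothetical representation, not PRQL's implementation):

```python
def derive(columns: dict, name: str, values: list) -> dict:
    # Like PRQL's `derive`: creates the column if absent, overwrites it if present.
    columns[name] = values
    return columns

cols = {"salary": [100, 200]}
derive(cols, "gross_salary", [110, 220])  # creates a new column
derive(cols, "gross_salary", [120, 240])  # overwrites the existing one
print(cols["gross_salary"])  # [120, 240]
```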
One alternative syntax is `{column_name} = {calculation}`, but it breaks the pattern of every line starting with a keyword. Another is to just have `derive {column_name} ({calculation})`, which would be more consistent with the other functions, but I think less readable (and definitely less familiar).

An s-string inserts SQL directly. It's similar in form to a Python f-string, but the result is SQL, rather than a string literal; i.e.:
```elm
func sum col = s"SUM({col})"

sum salary
```
transpiles to:
```sql
SUM(salary)
```
...whereas if it were a Python f-string, it would make `"SUM(salary)"`, with the quotes.
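The contrast is easy to see in Python itself: inserting the fragment directly behaves like an s-string, while inserting it as a string literal (via `!r`) adds the quotes:

```python
col = "salary"
fragment = f"SUM({col})"

as_sql = f"SELECT {fragment} FROM employees"        # inserted directly, like an s-string
as_literal = f"SELECT {fragment!r} FROM employees"  # quoted, like a string value

print(as_sql)      # SELECT SUM(salary) FROM employees
print(as_literal)  # SELECT 'SUM(salary)' FROM employees
```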
Most keywords that take a single argument can also take a list, so these are equivalent:
```diff
from employees
-select salary
+select [salary]
```
More examples in list-equivalence.md.
A line-break generally creates a pipelined transformation. For example:
```elm
from tbl
select [
  col1,
  col2,
]
filter col1 = col2
```
...is equivalent to:
```elm
from tbl | select [col1, col2] | filter col1 = col2
```
A line-break doesn't create a pipeline in a few cases:

- within a list (e.g. the `select` example above);
- when the following line starts a new statement with a keyword such as `func`.

How functions represent the previous result: the previous result is passed as the final argument of a function; i.e. `aggregate` would be like this, where `X` is taken from the line above:
```elm
aggregate by=[] calcs X
```
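That convention (the running result passed as the final argument of each function) can be sketched with a tiny pipe helper in Python; the helper and transform names are hypothetical, only the argument order matters:

```python
def pipe(value, *stages):
    # Each stage is (func, *leading_args); the running result is passed
    # as the final argument, matching PRQL's pipeline semantics.
    for func, *args in stages:
        value = func(*args, value)
    return value

def filter_rows(pred, rows):
    return [r for r in rows if pred(r)]

def select(cols, rows):
    return [{c: r[c] for c in cols} for r in rows]

rows = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]
print(pipe(rows, (filter_rows, lambda r: r["a"] > 1), (select, ["a"])))
# [{'a': 3}]
```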
Literal strings & f-strings: https://github.com/max-sixty/prql/issues/109
Arrays: PRQL is in part inspired by DataPipes.jl, which demonstrates how effective point-free pipelines can be (Chain.jl is similar). One benefit of this is how well it deals with arbitrarily nested pipelines, which are difficult to read in SQL and even in `jq`. Could we do something similar for nested data in PRQL?
(This example is from DataPipes.jl, and we could avoid the macros / `do` / `end`):

```julia
@p begin
    text
    strip
    split(__, "\n")
    map() do __
        collect
        map() do __
            __ == chars[begin] ? 1 : 0
        end
    end
    hcat(__...)'
end
```
Partials: how functional do we want to make the language? e.g. should we have partial functions? e.g. [now based on an old version of `window`] potentially we don't need the `col` in `lag` here:

```elm
func lag col = window col by:sec_id sort:date lag:1
```
Boolean logic: how should we represent boolean logic like `or`? With some `or` function that takes `*args` (which we don't currently have a design for)? Or implement dyadic operators; either `or` or `||`? (Same for `not`.)
`from`: do we need `from`? A previous version of this proposal didn't require it; queries just started with the table name. But some initial feedback was that removing `from` made it less clear.
Readme syntax — we can't get syntax highlighting in GitHub's markdown — is there a solution to this aside from submitting a parser to GitHub / screenshots / creating a website?
Currently we use `elm`, as it coincidentally provides the best syntax highlighting (open to suggestions for others!).