PRQL

GitHub CI Status Discord Stars

pre-commit.ci Status

Pipelined Relational Query Language, pronounced "Prequel".

PRQL is a modern language for transforming data — a simpler and more powerful SQL. Like SQL, it's readable, explicit and declarative. Unlike SQL, it forms a logical pipeline of transformations, and supports abstractions such as variables and functions. It can be used with any database that uses SQL, since it transpiles to SQL.

PRQL was discussed on Hacker News and Lobsters.

PRQL is currently being implemented in:

An example

Here's a fairly simple SQL query:

sql SELECT TOP 20 title, country, AVG(salary) AS average_salary, SUM(salary) AS sum_salary, AVG(salary + payroll_tax) AS average_gross_salary, SUM(salary + payroll_tax) AS sum_gross_salary, AVG(salary + payroll_tax + benefits_cost) AS average_gross_cost, SUM(salary + payroll_tax + benefits_cost) AS sum_gross_cost, COUNT(*) as count FROM employees WHERE salary + payroll_tax + benefits_cost > 0 AND country = 'USA' GROUP BY title, country ORDER BY sum_gross_cost HAVING count > 200

Even this simple query demonstrates some of the problems with SQL's lack of abstractions:

Here's the same query with PRQL:

elm from employees filter country = "USA" # Each line transforms the previous result. derive [ # This adds columns / variables. gross_salary: salary + payroll_tax, gross_cost: gross_salary + benefits_cost # Variables can use other variables. ] filter gross_cost > 0 aggregate by:[title, country] [ # `by` are the columns to group by. average salary, # These are aggregation calcs run on each group. sum salary, average gross_salary, sum gross_salary, average gross_cost, sum_gross_cost: sum gross_cost, ct: count *, ] sort sum_gross_cost filter ct > 200 take 20

As well as using variables to reduce unnecessary repetition, the query is also more readable — it flows from top to bottom, each line representing a transformation of the previous line's result. For example, TOP 20 / take 20 modify the final result in both queries — but only PRQL represents it as the final transformation. And context is localized — the aggregate function contains both the calculations and the columns to group by.

While PRQL is designed for reading & writing by people, it's also much simpler for code to construct or edit PRQL queries. In SQL, adding a filter to a query involves parsing the query to find and then modify the WHERE statement, or wrapping the existing query in a CTE. In PRQL, adding a filter just involves adding a filter transformation to the final line.

An example using Functions

Here's another SQL query, which calculates returns from prices on days with valid prices.

sql SELECT date, -- Can't use a `WHERE` clause, as it would affect the row that the `LAG` function referenced. IF(is_valid_price, price_adjusted / LAG(price_adjusted, 1) OVER (PARTITION BY sec_id ORDER BY date) - 1 + dividend_return, NULL) AS return_total, IF(is_valid_price, price_adjusted_usd / LAG(price_adjusted_usd, 1) OVER (PARTITION BY sec_id ORDER BY date) - 1 + dividend_return, NULL) AS return_usd, IF(is_valid_price, price_adjusted / LAG(price_adjusted, 1) OVER (PARTITION BY sec_id ORDER BY date) - 1 + dividend_return, NULL) - interest_rate / 252 AS return_excess, IF(is_valid_price, price_adjusted_usd / LAG(price_adjusted_usd, 1) OVER (PARTITION BY sec_id ORDER BY date) - 1 + dividend_return, NULL) - interest_rate / 252 AS return_usd_excess FROM prices

This might seem like a convoluted example, but it's taken from a real query. Indeed, it's also simpler and smaller than the full logic — note that it starts from price_adjusted, whose logic had to be split into a previous query to avoid the SQL becoming even less readable.

Here's the same query with PRQL:

```elm prql version:0.1 db:snowflake # Version number & database name.

func lagday x = ( window x by secid sort date lag 1 ) func ret x = x / (x | lagday) - 1 + dividendreturn func excess x = (x - interestrate) / 252 func ifvalid x = isvalidprice ? x : null

from prices derive [ returntotal: pricesadj | ret | ifvalid # | can be used rather than newlines. returnusd: pricesusd | ret | ifvalid returnexcess: returntotal | excess returnusdexcess: returnusd | excess ] select [ date, secid, returntotal, returnusd, returnexcess, returnusd_excess, ] ```

Because we define the functions once rather than copying & pasting the code, we get all the benefits of encapsulation and extensibility — we can have reliable & tested functions, whose purpose is explicit, which we can share across queries and colleagues.

Principles

PRQL is intended to be a modern, simple, declarative language for transforming data, with abstractions such as variables & functions. It's intended to replace SQL, but doesn't have ambitions as a general-purpose programming language. While it's at a pre-alpha stage, it has some immutable principles:

Roadmap — getting to v0.1

I'm excited and inspired by the level of enthusiasm behind the project. Many projects have fallen at this stage, though, so I'm hoping we can demonstrate viability by shipping a working version fairly soon.

Currently, we're most of the way to an initial version that will let folks use the language for experimentation. The remaining features that we need to complete — a few minor, a few major — are listed under the 0.1 Milestone.

Develop the language

Already since becoming public, the language has improved dramatically, thanks to the feedback of dozens of contributors. The current state of the basics is now stable and while we'll hit corner-cases, I expect we'll only make small changes from here.

Feel free to post questions or continue discussions on Language Design Issues. It would be great to continue building out the examples path.

Parser

This is mostly complete. There are a couple of issues in #26 remaining.

Transpiler

There are two parts to this, both of which have some major features we need before hitting v0.1.

Roadmap — beyond v0.1

Fast feedback

As well as a command-line tool that transpiles queries, we'd like to make developing in PRQL a wonderful experience, where it feels like it's on your side:

Database cohesion

One benefit of PRQL over SQL is that auto-complete, type-inference, and error checking can be much more powerful.

This is much harder to build though, since it requires a connection to the database in order to understand the schema of the table. So this would come after having a working transpiler.

Not in focus

We should focus on solving a distinct problem really well. PRQL's goal is to make reading and writing analytical queries easier, and so for the moment that means putting some things out of scope:

Interested in seeing this happen?

If you're interested in the ideas here and would like to contribute towards them being explored:

Any of these will inspire others to spend more time developing this; thank you in advance.

Inspired by

Similar projects

If any of these descriptions can be improved, please feel free to PR changes.

Notes

Joins

Functions

Assignments

S-Strings

An s-string inserts SQL directly. It's similar in form to a python f-string, but the result is SQL, rather than a string literal; i.e.:

elm func sum col = s"SUM({col})" sum salary

transpiles to:

sql SUM(salary)

...whereas if it were a python f-string, it would make "sum(salary)", with the quotes.

Lists

Pipelines

CTEs

Thinking about