PRQL

Pipelined Relational Query Language, pronounced "Prequel".

PRQL is a modern language for transforming data — a simpler and more powerful SQL. Like SQL, it's readable, explicit and declarative. Unlike SQL, it forms a logical pipeline of transformations, and supports abstractions such as variables and functions. It can be used with any database that uses SQL, since it transpiles to SQL.

PRQL was discussed on Hacker News here and Lobsters here.

An example

Here's a fairly simple SQL query:

sql SELECT TOP 20 title, country, AVG(salary) AS average_salary, SUM(salary) AS sum_salary, AVG(salary + payroll_tax) AS average_gross_salary, SUM(salary + payroll_tax) AS sum_gross_salary, AVG(salary + payroll_tax + benefits_cost) AS average_gross_cost, SUM(salary + payroll_tax + benefits_cost) AS sum_gross_cost, COUNT(*) as count FROM employees WHERE salary + payroll_tax + benefits_cost > 0 AND country = 'USA' GROUP BY title, country ORDER BY sum_gross_cost HAVING count > 200

Even this simple query demonstrates some of the problems with SQL's lack of abstractions:

Here's the same query with PRQL:

elm from employees filter country = "USA" # Each line transforms the previous result. derive [ # This adds columns / variables. gross_salary: salary + payroll_tax, gross_cost: gross_salary + benefits_cost # Variables can use other variables. ] filter gross_cost > 0 aggregate by:[title, country] [ # `by` are the columns to group by. average salary, # These are aggregation calcs run on each group. sum salary, average gross_salary, sum gross_salary, average gross_cost, sum_gross_cost: sum gross_cost, count, ] sort sum_gross_cost filter count > 200 take 20

As well as using variables to reduce unnecessary repetition, the query is also more readable — it flows from top to bottom, each line representing a transformation of the previous line's result. For example, TOP 20 / take 20 modify the final result in both queries — but only PRQL represents it as the final transformation. And context is localized — the aggregate function contains both the calculations and the columns to group by.

An example using Functions

Here's another SQL query, which calculates returns from prices on days with valid prices.

sql SELECT date, -- Can't use a `WHERE` clause, as it would affect the row that the `LAG` function referenced. IF(is_valid_price, price_adjusted / LAG(price_adjusted, 1) OVER (PARTITION BY sec_id ORDER BY date) - 1 + dividend_return, NULL) AS return_total, IF(is_valid_price, price_adjusted_usd / LAG(price_adjusted_usd, 1) OVER (PARTITION BY sec_id ORDER BY date) - 1 + dividend_return, NULL) AS return_usd, IF(is_valid_price, price_adjusted / LAG(price_adjusted, 1) OVER (PARTITION BY sec_id ORDER BY date) - 1 + dividend_return, NULL) - interest_rate / 252 AS return_excess, IF(is_valid_price, price_adjusted_usd / LAG(price_adjusted_usd, 1) OVER (PARTITION BY sec_id ORDER BY date) - 1 + dividend_return, NULL) - interest_rate / 252 AS return_usd_excess FROM prices

This might seem like a convoluted example, but it's taken from a real query. Indeed, it's also simpler and smaller than the full logic — note that it starts from price_adjusted, whose logic had to be split into a previous query to avoid the SQL becoming even less readable.

Here's the same query with PRQL:

```elm prql version:0.1 db:snowflake # Version number & database name.

func lagday x = ( window x by secid sort date lag 1 ) func ret x = x / (x | lagday) - 1 + dividendreturn func excess x = (x - interestrate) / 252 func ifvalid x = isvalidprice ? x : null

from prices derive [ returntotal: pricesadj | ret | ifvalid # | can be used rather than newlines. returnusd: pricesusd | ret | ifvalid returnexcess: returntotal | excess returnusdexcess: returnusd | excess ] select [ date, secid, returntotal, returnusd, returnexcess, returnusd_excess, ] ```

Because we define the functions once rather than copying & pasting the code, we get all the benefits of encapsulation and extensibility — we can have reliable & tested functions, whose purpose is explicit, which we can share across queries and colleagues.

Principles

PRQL is intended to be a modern, simple, declarative language for transforming data, with abstractions such as variables & functions. It's intended to replace SQL, but doesn't have ambitions as a general-purpose programming language. While it's at a pre-alpha stage, it has some immutable principles:

Roadmap

I'm excited and inspired by the level of enthusiasm behind the project. Many projects have fallen at this stage, though, so I'm hoping we can demonstrate viability by building things that are within our immediate reach, before potentially expanding.

Here's an initial plan — feedback welcome:

Develop the language

Already since becoming public, the language has improved dramatically, thanks to the feedback of dozens of contributors. The current state of the basics is now stable (at least I don't expect the examples in the README to change much). While we're going to hit corner-cases, we're in a good enough place to start writing the code.

I'd encourage continued discussion on Issues, and examples to be added to the examples path.

Build a parser

We need to parse the PRQL into an AST. We're planning to use nom for this. I'm also open to those who've suggested writing a formal grammar concurrently.

This is the area that needs the most immediate help. If anyone is looking for an area to contribute, check out #26 — thank you!

Build a transpiler

I'm broadly envisioning two passes:

We'll need to make initial progress on the parser before starting here.

UX

As well as a command-line tool that transpiles queries, it would be great if we could allow the language to be accessible; e.g.:

Database cohesion

One benefit of PRQL over SQL is that auto-complete, type-inference, and transpile-time error checking become much more powerful.

This is much harder to build though, since it requires a connection to the database in order to understand the schema of the table. So this would come after having a working transpiler.

Not in focus

We should focus on solving a distinct problem really well. PRQL's goal is to make reading and writing analytical queries easier, and so for the moment that means putting some things out of scope:

Interested in seeing this happen?

If you're interested in the ideas here and would like to contribute towards them being explored:

Any of these will inspire others to spend more time developing this; thank you in advance.

Inspired by

Similar projects

Notes

Joins

Functions

Assignments

Lists

Pipelines

CTEs

Thinking about