hermit-crab
Paguroidea

GITHUB-BADGE

| Crate | Status | |----------------|----------------------------------------------------------------| | pag-lexer | crates.io | | pag-parser | crates.io | | pag-compiler | crates.io |

A reimplementation of the Flap parser in Rust (with our own modifications applied)!

๐Ÿšง Under Construction ๐Ÿšง

This project is still under its early-stage development. The grammar for Paguroidea is subject to change (see Issue #22). The parser generation is not thoroughly tested, which may still shake some bugs out from time to time. There are also ongoing works to improvement the quality of the generated code.

Introduction

Paguroidea is a parser generator (a.k.a. the compiler of compiler). The theoretical foundation of Paguroidea is built on a few papers:

We modified the work of flap by extending DGNF with tree-generation actions, which are similar to the "reduce" operation in a traditional shift-reduce parser.

How to use

Notice: grammar for parser definitions used in this section will be changed in the near future.

It is simple: just define your grammar and pass it to our compiler. Then, Paguroidea will output a standalone parser file.

For example, a simple S-expression parser can be defined as the following ```text lexer { definition BLANK = ' '; definition DIGIT = '0' .. '9'; definition ALPHA = 'a' .. 'z' | 'A' .. 'Z'; active token LPAREN = '('; active token RPAREN = ')'; active token ATOM = ALPHA ~ (ALPHA | DIGIT)*; silent token WHITESPACE = (BLANK | '\t' | '\n' | '\r')+; } parser sexpr { active fixpoint compound = LPAREN ~ (compound | atom) * ~ RPAREN;

active definition atom
    = ATOM;

active definition sexpr
    = compound | atom;

} ```

How to write a grammar file

You can put up your own one with the following rules:

For more complicated examples, one can see json.pag.

How to compile and use a grammar file

To compile your grammar file, the recommended way is to add pag-compiler as your build dependency. With pag-compiler, the parser file can be easily generated in a build script as the following: rust fn main() { pag_compiler::compile("csv.pag", "src/parser.rs"); println!("cargo:rerun-if-changed=csv.pag"); }

For some reasons (mostly performance issues), only nightly rust (1.71+) is supported for now. It is also required that the crate containing the parser file should be annotated with ```rust

![feature(portable_simd)]

![feature(core_intrinsics)]

![feature(array_chunks)]

```

Performance

We are continuously working on improvement the quality of our generated parser. For now, on workloads of CSV/JSON, the performance is close to or even better than those specialized parsers. === Random Generated CSV === throughput/pag time: [635.88 ยตs 637.64 ยตs 639.46 ยตs] thrpt: [622.63 MiB/s 624.41 MiB/s 626.14 MiB/s] throughput/csv time: [528.36 ยตs 541.72 ยตs 559.54 ยตs] thrpt: [711.56 MiB/s 734.97 MiB/s 753.55 MiB/s] throughput/pest time: [3.7278 ms 3.7364 ms 3.7460 ms] thrpt: [106.29 MiB/s 106.56 MiB/s 106.80 MiB/s] === Random Generated JSON === random-json/pag-json time: [22.634 ns 22.650 ns 22.666 ns] thrpt: [84.149 MiB/s 84.209 MiB/s 84.271 MiB/s] random-json/serde-json time: [12.493 ns 12.587 ns 12.694 ns] thrpt: [150.26 MiB/s 151.54 MiB/s 152.67 MiB/s] random-json/pest-json time: [177.38 ns 178.17 ns 179.17 ns] thrpt: [10.645 MiB/s 10.705 MiB/s 10.753 MiB/s] === twitter.json === twitter-json/pag-json time: [1.0923 ms 1.0941 ms 1.0961 ms] thrpt: [667.24 MiB/s 668.46 MiB/s 669.59 MiB/s] twitter-json/serde-json time: [1.2281 ms 1.2295 ms 1.2312 ms] thrpt: [594.02 MiB/s 594.88 MiB/s 595.54 MiB/s] twitter-json/pest-json time: [5.2977 ms 5.3055 ms 5.3148 ms] thrpt: [137.61 MiB/s 137.85 MiB/s 138.06 MiB/s]

Why is it fast and how can I make my grammar faster

Diagnostic Grammar Error Check

We provide diagnostic information for "type errors" in your grammar definitions. Here are some examples:

Left-recursion Error: Unguarded fixpoint โ•ญโ”€[json.pag:39:5] โ”‚ 39 โ”‚ active fixpoint json = json ~ value; โ”‚ โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ โ”‚ โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ fixpoint rule json is not guarded -- your grammar is left-recursive โ”€โ”€โ”€โ”€โ•ฏ Sequence Ambiguity

Explanation: there may be ambiguity when separating a sequence into two part according to the grammar definition

Error: When type checking a sequence of rules, the following rules are ambiguous โ•ญโ”€[json.pag:39:28] โ”‚ 39 โ”‚ active fixpoint test = NUMBER+ ~ NUMBER+; โ”‚ โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€ โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€ โ”‚ โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ type info for left-hand side: nullable: false, first set: {NUMBER}, follow set: {NUMBER} โ”‚ โ”‚ โ”‚ โ•ฐโ”€โ”€โ”€โ”€โ”€ type info for right-hand side: nullable: false, first set: {NUMBER}, follow set: {NUMBER} โ”€โ”€โ”€โ”€โ•ฏ

Alternation Ambiguity

Explanation: there may be ambiguity when select a match in an alternation of two rules. Error: When type checking an alternation of rules, the following rules are ambiguous โ•ญโ”€[json.pag:39:28] โ”‚ 39 โ”‚ active fixpoint test = NUMBER+ | NUMBER; โ”‚ โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€ โ”€โ”€โ”€โ”ฌโ”€โ”€ โ”‚ โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ type info for left-hand side: nullable false, first set: NUMBER, follow set: NUMBER โ”‚ โ”‚ โ”‚ โ•ฐโ”€โ”€โ”€โ”€ type info for right-hand side: nullable false, first set: NUMBER, follow set: โ”€โ”€โ”€โ”€โ•ฏ

There are other diagnostic information for undefined references, nullable tokens in lexer, character format error, etc.