On the left are Pomsky expressions, on the right is the compiled regex:
```py
'hello world'             # hello world
'hello'{1,5}              # (?:hello){1,5}
'hello'*                  # (?:hello)*
'hello'+                  # (?:hello)+
'hello'{1,5} lazy         # (?:hello){1,5}?
'hello'* lazy             # (?:hello)*?
'hello'+ lazy             # (?:hello)+?
'hello' | 'world'         # hello|world
['aeiou']                 # [aeiou]
['p'-'s']                 # [p-s]
[word] [space] [n]        # \w\s\n
[w 'a' 't'-'z' U+15]      # [\wat-z\x15]
!['a' 't'-'z']            # [^at-z]
[Greek] U+30F Grapheme    # \p{Greek}\u030F\X
^ $                       # ^$
% 'hello' !%              # \bhello\B
'terri' ('fic' | 'ble')   # terri(?:fic|ble)
:('test')                 # (test)
:name('test')             # (?P<name>test)
(>> 'foo' | 'bar')        # (?=foo|bar)
(<< 'foo' | 'bar')        # (?<=foo|bar)
(!>> 'foo' | 'bar')       # (?!foo|bar)
(!<< 'foo' | 'bar')       # (?<!foo|bar)
:('test') ::1             # (test)\1
:name('test') ::name      # (?P<name>test)\1
range '0'-'999'           # 0|[1-9][0-9]{0,2}
range '0'-'255'           # 0|1[0-9]{0,2}|2(?:[0-4][0-9]?|5[0-5]?|[6-9])?|[3-9][0-9]?
regex '[\w[^abg]]'        # [\w[^abg]]
```
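To see what the two `range` outputs accept, the compiled regexes from the right-hand column can be checked against every integer in a window. The following is a minimal sketch using Python's standard `re` module; the helper name `accepted` and the upper bound of 2000 are choices made for this illustration and are not part of Pomsky.

```python
import re

# Compiled regexes copied verbatim from the right-hand column above.
RANGE_0_999 = re.compile(r"0|[1-9][0-9]{0,2}")
RANGE_0_255 = re.compile(
    r"0|1[0-9]{0,2}|2(?:[0-4][0-9]?|5[0-5]?|[6-9])?|[3-9][0-9]?"
)


def accepted(pattern, upper_bound):
    """Return every integer in [0, upper_bound] whose decimal form fully matches."""
    return [n for n in range(upper_bound + 1) if pattern.fullmatch(str(n))]


# Illustrative check window: anything above the range must be rejected.
matches_255 = accepted(RANGE_0_255, 2000)
matches_999 = accepted(RANGE_0_999, 2000)

# Numbers written with leading zeros are not part of the generated language.
leading_zero_ok = RANGE_0_255.fullmatch("007") is not None
```

Using `fullmatch` matters here: with `search`, a string such as `"256"` would still contain the valid substring `"25"`.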
```rust
let operator = '+' | '-' | '*' | '/';
let number = '-'? [digit]+;

number (operator number)*
```
Read the book to get started, or check out the CLI program, the Rust library and the procedural macro.
Normal regexes are very concise, but when they get longer, they get increasingly difficult to
understand. By default, they don't have comments, and whitespace is significant. Then there's the
plethora of sigils and backslash escapes that follow no discernible system:
`(?<=)`, `(?P<>)`, `.??`, `\N`, `\p{}`, `\k<>`, `\g''` and so on. And with various inconsistencies between regex
implementations, it's the perfect recipe for confusion.
Pomsky solves these problems with a new, simpler but also more powerful syntax:
Pomsky is currently compatible with PCRE, JavaScript, Java, .NET, Python, Ruby and Rust. The regex flavor must be specified during compilation, so Pomsky can ensure that the produced regex works as desired on the targeted regex engine.
Note: You should enable Unicode support in your regex engine, if it isn't enabled by default. This is explained here.
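As an illustration of why this matters, here is a small sketch using Python's standard `re` module, where Unicode matching is the default for `str` patterns and the `re.ASCII` flag turns it off. Other engines use different switches; the sample word is arbitrary.

```python
import re

text = "żółw"  # arbitrary non-ASCII sample word

# Default mode for str patterns in Python 3: \w is Unicode-aware.
unicode_match = re.fullmatch(r"\w+", text)

# With re.ASCII, \w only covers [a-zA-Z0-9_], so the same input is rejected.
ascii_match = re.fullmatch(r"\w+", text, flags=re.ASCII)
```

With Unicode support disabled, a regex that looks correct silently stops matching non-ASCII input.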
Pomsky aims to be as portable as possible, polyfilling Unicode and unsupported features where feasible. That said, there are some cases where portability is not possible:
- Some features (e.g. lookaround, backreferences, Unicode properties) aren't supported in every flavor. Pomsky fails to compile when you're using an unsupported feature.
- `\b` (word boundary) is not Unicode aware in JavaScript. Pomsky therefore only allows word boundaries when Unicode is disabled.
- `\w` in .NET handles Unicode incorrectly, with no way to polyfill it properly. This means that in .NET, `[word]` only matches the `L`, `Mn`, `Nd`, and `Pc` general categories, instead of `Alphabetic`, `M`, `Nd`, `Pc` and `Join_Control`.
- In .NET, `.`, `Codepoint` and character classes (e.g. `[Latin]`) only match a single UTF-16 code unit rather than a codepoint.
- `[space]` matches slightly different code points in JavaScript than in Java. This will be fixed.
- Backreferences behave differently in JavaScript and Python when the referenced group has no captured text. There is nothing we can do about it, but we could add a warning for this in the future.
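Two of these limitations can be made concrete with a short sketch in Python's standard library. The first part shows that a single codepoint outside the Basic Multilingual Plane occupies two UTF-16 code units, which is why an engine that matches one code unit at a time sees only half of it. The second part shows Python's side of the backreference difference: a backreference to a group that did not participate fails to match, whereas JavaScript treats it as matching the empty string. The sample inputs are arbitrary.

```python
import re

# A single codepoint outside the Basic Multilingual Plane.
emoji = "\U0001F600"
code_points = len(emoji)                            # Python str counts codepoints
utf16_units = len(emoji.encode("utf-16-le")) // 2   # each UTF-16 code unit is 2 bytes

# Backreference to an optional group, Python behaviour.
pattern = re.compile(r"(a)?b\1")
unset_backref = pattern.fullmatch("b")    # group 1 never matched, so \1 fails
set_backref = pattern.fullmatch("aba")    # group 1 captured "a", so \1 matches "a"
```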
Never compile or execute an untrusted Pomsky expression on your critical infrastructure. This may make you vulnerable to denial-of-service attacks, like the Billion Laughs attack.
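To give a sense of scale, the sketch below models the arithmetic behind a Billion Laughs style input: a short base string where each nesting level repeats the previous level a fixed number of times. This is a simplified model written for illustration, not a description of how Pomsky expands expressions; the function name and the parameters are hypothetical.

```python
def expanded_length(base_length, copies_per_level, levels):
    """Length of the fully expanded text when each level repeats the previous one."""
    length = base_length
    for _ in range(levels):
        length *= copies_per_level
    return length


# A 3-character string, repeated 10 times per level, across 9 nested levels.
size = expanded_length(base_length=3, copies_per_level=10, levels=9)
```

A source of only a few hundred characters can therefore describe an expansion of several gigabytes, which is why untrusted input should be compiled only in an isolated environment with resource limits.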
Pomsky looks for mistakes and displays helpful diagnostics:
I wrote an in-depth comparison with similar projects, which you can find here.
The Code of Conduct can be found here.
You can contribute by using Pomsky and providing feedback. If you find a bug or have a question, please create an issue.
I also gladly accept code contributions. More information
Dual-licensed under the MIT license or the Apache 2.0 license.