Background: Modern regexes are surprisingly complex
- Capturing groups
(r)
- Lookaheads
(?=r)
- Lookbehinds
(?<=r)
- Negative lookaheads
(?!r)
- Negative lookbehinds
(?<!r)
- Backreferences
\1
- ...
Growing feature set
Increasingly subtle semantics
let rec matches s = function
| Or (l, r) -> matches l s || matches r s
| ... -> ...
- State-of-the-art implementations are complex:
Irregexp: >20k LoC
- Existing models are incomplete or wrong:
r? ≢ r|\(ε\)
- State-of-the-art implementations have bugs:
/(a?b??)*/.exec('ab') ≢ ['a','a']