Background: Modern regexes are surprisingly complex
- Capturing groups
(r) - Lookaheads
(?=r) - Lookbehinds
(?<=r) - Negative lookaheads
(?!r) - Negative lookbehinds
(?<!r) - Backreferences
\1 - ...
Growing feature set
Increasingly subtle semantics
let rec matches s = function
| Or (l, r) -> matches l s || matches r s
| ... -> ...
- State-of-the-art implementations are complex:
Irregexp: >20k LoC
- Existing models are incomplete or wrong:
r? ≢ r|\(ε\) - State-of-the-art implementations have bugs:
/(a?b??)*/.exec('ab') ≢ ['a','a']