\(\gdef\eqq{\stackrel{?}{=}}\)
Warblre logo (bird on branch)

Mechanized semantics for ECMAScript regexes

Noé De Santo, SYSTEMF, EPFL

Supervised by Aurèle Barrière and Clément Pit-Claudel

QR code linking to this page

Background: Modern regexes are surprisingly complex

1960s
\(\neq\)
2024

 

\[\begin{alignat*}{3} r\ {:}{:}{=}&\;\; a \;\; &&|\;\; \varepsilon \;\; &&|\;\; r_1r_2 \\ |&\;\; r_1 | r_2 \;\; &&|\;\; r_1^ {*} \end{alignat*}\]
\(\subsetneqq\)
  • Capturing groups (r)
  • Lookaheads (?=r)
  • Lookbehinds (?<=r)
  • Negative lookaheads (?!r)
  • Negative lookbehinds (?<!r)
  • Backreferences \1
  • ...

Growing feature set

\[ \gdef\tmatch#1#2{ #1\ \vdash\ #2 } \begin{gather*} \frac{\tmatch{r_1}{v}}{\tmatch{r_1 | r_2}{v}} \quad \frac{\tmatch{r_2}{w}}{\tmatch{r_1 | r_2}{w}} \end{gather*} \]
\(\neq\)
The full ECMAScript specification of regexes

Increasingly subtle semantics

let rec matches s = function
| Or (l, r) -> matches l s || matches r s
| ...       -> ...
\(\neq\)
  • State-of-the-art implementations are complex:

    Irregexp: >20k LoC

  • Existing models are incomplete or wrong:

    r? ≢ r|\(ε\)

  • State-of-the-art implementations have bugs:

    /(a?b??)*/.exec('ab') ≢ ['a','a']

Increasingly complex code

A pop quiz: “What does this regex do?!”

Which of the following equalities hold in JavaScript?

  1. /a|ab/.exec('ab') \(\eqq\) ['ab']
  2. /()?/.exec('') \(\eqq\) ['', '']
  3. /()|/.exec('') \(\eqq\) ['', '']
  4. /(a?b??)*/.exec('ab') \(\eqq\) ['ab', 'b']

The results would be different for Rust, POSIX, etc.

Modern regexes are diverse and subtle.

Answers: 1. No; 2. No; 3. Yes; 4. Yes.

Problem:

We wanted to verify a regex engine (optimization, compilation, matching), but there were no mechanized semantics for real-world regexes.

Solution:

We translated the ECMAScript regex specification to Coq.

Our model covers 33 pages of English specs. It is executable (in OCaml & JS), proven-safe (terminates with no exceptions), auditable (literate), faithful (passes Test262), and future-proof (model & spec match 1-to-1).

We lay robust foundations to verify efficient regex matchers and optimizers.

Our model is future-proof and straightforwardly auditable

(*>> Disjunction :: Alternative | Disjunction <<*)
| Disjunction r11 r22 
  (*>> 1. Let m1 be CompileSubpattern of Alternative with arguments rer and direction. <<*)
  let! m11 =<< compileSubPattern r11 (Disjunction_left r22  ctx) rer direction in
  (*>> 2. Let m2 be CompileSubpattern of Disjunction with arguments rer and direction. <<*)
  let! m22 =<< compileSubPattern r22 (Disjunction_right r11  ctx) rer direction in
  (*>> 3. Return a new Matcher with parameters (x, c) that captures m1 and m2 and performs
       the following steps when called: <<*)
  (λ (x: MatchState) (c: MatcherContinuation) 
    (*>> a. Assert: x is a MatchState. <<*)
    (*>> b. Assert: c is a MatcherContinuation. <<*)
    (*>> c. Let r be m1(x, c). <<*)
    let! r =<< m11 x c in
    (*>> d. If r is not failure, return r. <<*)
    if r is not failure then r
    (*>> e. Return m2(x, c). <<*)
    else m22 x c): Matcher

Our model passes tests and proofs

We prove termination and absence of errors to confirm the usability of the Coq model:

 (r: Regex) input m,
  compilePattern r = Ok m ->
  m input  out_of_fuel.
 (r: Regex) input m,
  compilePattern r = Ok m ->
  m input  assertion_failed.

We generate JS from Coq using extraction + Melange, then link with a Unicode library.

$ ./run_test262  # Test262 test suite
Ran 498 tests
495 passed

Mechanizing a large specification poses many challenges

Termination is non-obvious

/a\(\varepsilon\)*b/.exec('ab') \(=\) ???

Can \(\varepsilon\)* loop forever? We prove that it cannot.


Some operations may fail

Let ch be the character Input[index].

Is index always in bounds? We prove that it is.

Non-local references are not compatible with a simple structural traversal:

Definition: CountLeftCapturingParensBefore ( node )

Return the number of Atom :: ( GroupSpecifieropt Disjunction ) Parse Nodes contained within pattern that either occur before node or contain node.

What does pattern refer to? How do we retrieve nodes that “occur before node”?

We use a zipper to represent in-progress traversals of the pattern.