Proposal for lookbehind support

About This Document

This document was prepared to propose that RegExp of ECMAScript should support fixed-length lookbehind assertions as in Perl 5. However, as TC39 prefers variable length lookbehind assertions as in .NET, this document is likely to be superseded by another specification document for supporting variable length lookbehind assertions. (Samples of specifications for variable length lookbehinds support: Claude Pache's version, my Compact Version and my Lengthy Version).

Abstract

Basically, the evaluation for a lookbehind assertion is performed by "rewind the target sequence and do the same thing as a lookahead assertion":

  1. When compiling a regular expression pattern, count the total number of PatternCharacter and . and \ AtomEscape (except backreferences) and CharacterClass in Disjunction per lookbehind assertion. Each Atom may be followed by {n}, but may not be followed by the other quantifiers.
  2. When a lookbehind assertion is evaluated, rewind the target sequence by that number and evaluate as if a lookahead assertion from that point.

Static Semantics: Early Errors

Assertion :: ( ? < = Disjunction )
Assertion :: ( ? < ! Disjunction )
  • It is a Syntax Error if Disjunction contains Quantifier :: QuantifierPrefix except QuantifierPrefix :: { DecimalDigits }.
  • It is a Syntax Error if Disjunction contains AtomEscape :: DecimalEscape.
  • When Disjunction is separated by the | regular expression, if the number of the sequence of characters which the left Alternative matches and the number of the sequence of characters which the right Disjunction matches are different, then it is a Syntax Error.

Static Semantics: DisjunctionMatchLength

Assertion :: ( ? < = Disjunction )
Assertion :: ( ? < ! Disjunction )
  1. Return the sum of the lengths of all the Atoms in Disjunction. The length of an Atom is calculated as follows:
    • For the following Atoms, if an Atom is followed by a QuantifierPrefix :: { DecimalDigits }, then the length of the Atom is equal to the number of DecimalDigits, otherwise one.
      • Atom :: PatternCharacter
      • Atom :: .
      • Atom :: \ AtomEscape except AtomEscape :: DecimalEscape
      • Atom :: CharacterClass
    • For the following Atoms, if an Atom is followed by a QuantifierPrefix :: { DecimalDigits }, then the length of the Atom is equal to the number of DisjunctionMatchLength of the inner-Disjunction multiplied by DecimalDigits, otherwise equal to DisjunctionMatchLength of the inner-Disjunction.
      • Atom :: ( Disjunction )
      • Atom :: ( ? : Disjunction )

Assertion

The production Assertion :: ( ? < = Disjunction ) evaluates as follows:

  1. Evaluate Disjunction to obtain a Matcher m.
  2. Return an internal Matcher closure that takes two arguments, a State x and a Continuation c, and performs the following steps:
    1. Let n be DisjunctionMatchLength of Disjunction.
    2. Let xe be x's endIndex.
    3. If xe < n, return failure.
    4. Let xcap be x's captures List.
    5. Let y be the State (xe-n, xcap).
    6. Let d be a Continuation that always returns its State argument as a successful MatchResult.
    7. Call m(y, d) and let r be its result.
    8. If r is failure, return failure.
    9. Let z be r's State.
    10. Let zcap be z's captures List.
    11. Let a be the State (xe, zcap).
    12. Call c(a) and return its result.

The production Assertion :: ( ? < ! Disjunction ) evaluates as follows:

  1. Evaluate Disjunction to obtain a Matcher m.
  2. Return an internal Matcher closure that takes two arguments, a State x and a Continuation c, and performs the following steps:
    1. Let n be DisjunctionMatchLength of Disjunction.
    2. Let xe be x's endIndex.
    3. If xe < n, then call c(x) and return its result.
    4. Let xcap be x's captures List.
    5. Let y be the State (xe-n, xcap).
    6. Let d be a Continuation that always returns its State argument as a successful MatchResult.
    7. Call m(y, d) and let r be its result.
    8. If r is not failure, return failure.
    9. Call c(x) and return its result.

End of Proposal


Note

This section is not a part of the proposal, but written just for clarification.

NOTE "Input is a List consisting of all of the characters, in order, of the String being matched by the regular expression pattern. Each character is either a code unit or a code point, depending upon the kind of pattern involved" (21.2.2.1). Thus, for example,

/(?<=a.)bc/.exec("a𝄞bc"); // 𝄞 is U+1D11E, MUSICAL SYMBOL G CLEF.

returns null. Given a BMP pattern, one character is one code unit. In this case, therefore, the positive lookbehind rewinds the target sequence by two code units and Atom :: . in the assertion does not match the whole surrogate pair that represents the character U+1D11E; the lookbehind (?<=a.) matches only the sequence that consists of "a" and the first half of the surrogate pair. However,

/(?<=a.)bc/u.exec("a𝄞bc")

returns "bc". Given a Unicode pattern, one character is one code point. In this case, the positive lookbehind rewinds the target sequence by two code points and Atom :: . in the assertion can match the whole surrogate pair that represents the character U+1D11E; thus the lookbehind (?<=a.) matches the sequence "a𝄞".

History