Regular Expression Buffer Boundaries for ECMAScript

This proposal seeks to introduce \A and \z character escapes to Unicode-mode regular expressions as synonyms for ^ and $ that are not affected by the m (multiline) flag.

Status

Stage: 2
Champion: Ron Buckton (@rbuckton)

For detailed status of this proposal see TODO, below.

Authors

Ron Buckton (@rbuckton)

Motivations

NOTE: See https://github.com/rbuckton/proposal-regexp-features for an overview of how this proposal fits into other possible future features for Regular Expressions.

Buffer Boundaries are a common feature across a wide array of regular expression engines that allow you to match the start or end of the entire input regardless of whether the m (multiline) flag has been set. Buffer Boundaries also allow you to match the start/end of a line and the start/end of the input in a single RegExp using the m flag.

While it's possible to emulate \A and \z using existing patterns, the alternatives are far harder to read and require a more comprehensive working understanding of regular expressions to interpret.

For example, compare the following approaches:

// emulate `m`-mode `^` outside of `m`-mode:
const a = /^foo|(?<=^|[\u000A\u000D\u2028\u2029])bar/u;

// emulate non-`m`-mode `^` inside of `m`-mode using modifiers (proposed):
const b = /(?-m:^)foo|^bar/mu;

// using `\A`:
const c = /\Afoo|^bar/mu;

In the example above, it is far less likely that a reader will readily understand the expression in example (a). Not only is the content of the regular expression much harder to read, but understanding its purpose requires interpreting how six different features of regular expressions interact: grouping, positive lookbehind, the ^ metacharacter, disjunctions, character classes, and Unicode escapes.

Example (b) is an improvement, but still requires the reader to visually balance the parentheses as well as to interpret how four different regular expression features interact: grouping, modifiers (proposed), the m flag, and the ^ metacharacter.

In comparison, example (c) is far easier to read. The buffer boundary is a terse escape sequence of only two characters (\A), which makes it much easier to distinguish special pattern syntax from plain text segments like foo and bar.

The \A and \z escapes have broad support across multiple other languages and regular expression engines. As a result, they benefit from extensive existing documentation online, including Wikipedia, numerous tutorial websites, and the documentation of other languages. This significantly lessens the learning curve for \A and \z compared to the alternatives.

Relationship to RegExp Modifiers

This proposal can be considered syntactic sugar over RegExp modifiers (Stage 4):

\A → (?-m:^)
\z → (?-m:$)

While RegExp modifiers can accomplish this task, the \A and \z escapes are convenient and portable across many languages. They appear frequently in language-independent resources such as JSON and YAML files (for example, TextMate grammar files), which are often used by build tools and editors and are frequently consumed by ECMAScript applications. As such, introducing consistent syntax for this behavior improves portability and allows for more reuse of regular expression patterns found in documentation on the web, as well as in source code produced by LLMs and coding agents.
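To make the equivalence above concrete, here is a minimal sketch of how a tool might rewrite \A and \z in a pattern borrowed from another language so that it runs in engines without this proposal. The function name desugarBufferBoundaries is hypothetical and not part of the proposal. The sketch emits negative lookarounds rather than the (?-m:^) form, so it does not depend on engine support for modifiers, and it assumes u-mode patterns without nested (v-mode) character classes.

```javascript
// Hypothetical helper (not part of the proposal): rewrites \A and \z in a
// pattern source into lookarounds that existing engines already understand.
// Assumption: the pattern has no nested character classes (v-mode only).
function desugarBufferBoundaries(source) {
  let out = "";
  let inClass = false; // true while scanning inside [...]
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (ch === "\\" && i + 1 < source.length) {
      const next = source[i + 1];
      i++; // consume the escaped character as well
      if (!inClass && next === "A") {
        out += "(?<![\\s\\S])"; // no character before: start of input
      } else if (!inClass && next === "z") {
        out += "(?![\\s\\S])"; // no character after: end of input
      } else {
        out += ch + next; // any other escape is copied through unchanged
      }
      continue;
    }
    if (ch === "[") inClass = true;
    else if (ch === "]") inClass = false;
    out += ch;
  }
  return out;
}

// The m flag still applies to ^ and $, but not to the rewritten boundaries.
const emulated = new RegExp(desugarBufferBoundaries("\\Afoo|^bar$|baz\\z"), "mu");
console.log(emulated.test("\nfoo")); // false
console.log(emulated.test("\nbar")); // true
```

Escapes inside a character class are deliberately left alone, so a pattern such as [\A] is still rejected by the engine in u-mode instead of being silently changed.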

Prior Art

See https://rbuckton.github.io/regexp-features/features/buffer-boundaries.html for additional information.

Syntax

Buffer boundaries are similar to the ^ and $ anchors, except that they are not affected by the m (multiline) flag:

\A — Matches the start of the input.
\z — Matches the end of the input.
~~\Z — A zero-width assertion consisting of an optional newline at the end of the buffer. Equivalent to (?=\R?\z).~~

NOTE: Requires the u or v flag, as \A, \z, and \Z are currently just escapes for A, z and Z without the u or v flag.

NOTE: Not supported inside of a character class.
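The flag requirement noted above can be observed in current engines. The following sketch shows the existing behavior without the u or v flag, plus one possible way to feature-detect the new syntax. The helper name supportsBufferBoundaries is hypothetical, and the detection approach is an assumption rather than something the proposal prescribes.

```javascript
// Without the u or v flag, \A and \z are identity escapes for "A" and "z".
const legacyA = /\A/;
const legacyZ = /\z/;
console.log(legacyA.test("A"), legacyZ.test("z")); // true true
console.log(legacyA.test("")); // false: this is a literal "A", not an anchor

// Hypothetical helper: reports whether the engine accepts \A and \z in
// u-mode. Engines without this proposal throw a SyntaxError here.
function supportsBufferBoundaries() {
  try {
    new RegExp("\\A\\z", "u");
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err; // anything else is unexpected, so surface it
  }
}

console.log(supportsBufferBoundaries());
```

The pattern is built with the RegExp constructor so that an unsupported escape becomes a catchable runtime error rather than a parse error for the whole script.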

NOTE: The \Z assertion is no longer being considered as part of this proposal as of December 15th, 2021, but has been reserved for possible future use.

For more information about the v flag, see https://github.com/tc39/proposal-regexp-set-notation.

~~For more information about the \R escape sequence, see https://github.com/tc39/proposal-regexp-r-escape.~~

Examples

// without buffer boundaries
const re1 = /^foo$/u;
re1.test("foo"); // true
re1.test("foo\nbar"); // false

const re2 = /^foo$/um;
re2.test("foo"); // true
re2.test("foo\nbar"); // true

// with modifiers
const re3 = /(?-m:^)foo(?-m:$)/um;
re3.test("foo"); // true
re3.test("foo\nbar"); // false

// with buffer boundaries
const re4 = /\Afoo\z/u;
re4.test("foo"); // true
re4.test("foo\nbar"); // false

const re5 = /\Afoo\z/um;
re5.test("foo"); // true
re5.test("foo\nbar"); // false

// mixing buffer boundaries and anchors
const re = /\Afoo|^bar$|baz\z/um;
re.test("foo");         // true
re.test("foo\n");       // true
re.test("\nfoo");       // false

re.test("bar");         // true
re.test("bar\n");       // true
re.test("\nbar");       // true

re.test("baz");         // true
re.test("baz\n");       // false
re.test("\nbaz");       // true
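Because \A and \z are not yet available in shipping engines, the mixed example above cannot be run as written. As a rough check, the same results can be reproduced today by substituting negative lookarounds for the two boundaries. The constants START and END and the cases table are illustrative names, not part of the proposal.

```javascript
// Stand-ins for the proposed escapes, built from existing syntax.
const START = "(?<![\\s\\S])"; // emulates \A: no character precedes
const END = "(?![\\s\\S])";    // emulates \z: no character follows

// Emulation of /\Afoo|^bar$|baz\z/um
const mixed = new RegExp(`${START}foo|^bar$|baz${END}`, "um");

// Inputs paired with the results listed in the example above.
const cases = [
  ["foo", true], ["foo\n", true], ["\nfoo", false],
  ["bar", true], ["bar\n", true], ["\nbar", true],
  ["baz", true], ["baz\n", false], ["\nbaz", true],
];

const allMatch = cases.every(([input, expected]) => mixed.test(input) === expected);
for (const [input, expected] of cases) {
  const status = mixed.test(input) === expected ? "ok" : "MISMATCH";
  console.log(JSON.stringify(input), status);
}
```

The regular expression has no g or y flag, so repeated calls to test do not carry state between inputs.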

History

October 28, 2021 — Proposed for Stage 1 (slides)
- Outcome: Advanced to Stage 1
December 15, 2021 — Proposed for Stage 2 (slides)
- Outcome: \A and \z advanced to Stage 2 (\Z did not advance, but will be reserved)
- Stage 2 Reviewers: Richard Gibson, Waldemar Horwat

TODO

The following is a high-level list of tasks to progress through each stage of the TC39 proposal process:

Stage 1 Entrance Criteria

Identified a "champion" who will advance the addition.
Prose outlining the problem or need and the general shape of a solution.
Illustrative examples of usage.
~~High-level API.~~

Stage 2 Entrance Criteria

Initial specification text.
~~Transpiler support (Optional).~~

Stage 2.7 Entrance Criteria

Complete specification text.
Designated reviewers have signed off on the current spec text:
- Richard Gibson (#5)
- Waldemar Horwat (#4)
- Chris de Almeida
The ECMAScript editor has signed off on the current spec text.

Stage 3 Entrance Criteria

Test262 acceptance tests have been written (ecma262/test262#4975) for mainline usage scenarios and merged.

Stage 4 Entrance Criteria

Two compatible implementations which pass the acceptance tests:
- TBD
- TBD
- Engine262
A pull request has been sent to tc39/ecma262 with the integrated spec text.
The ECMAScript editor has signed off on the pull request.
