The optimizer transforms Oniguruma patterns into optimized versions of themselves. This optimization includes both minification and performance improvements. Optimized regexes always match exactly the same strings, with the same subpattern matches.
Note
Oniguruma is a regular expression engine written in C that's used in Ruby (via a fork named Onigmo), PHP (mb_ereg, etc.), TextMate grammars (used by VS Code, Shiki, etc.), and many other tools.
Example:
(?x) (?:\!{1,}) (\b(?:ark|arm|art)\b) [[^0-9A-Fa-f]\P{^Nd}\p{ Letter }]
Becomes:
!+\b(ar[kmt])\b[\H\d\p{L}]
Tip
🧪 Try the optimizer demo.
The optimizer has been battle-tested by tm-grammars, which is used by Shiki to process tens of thousands of real-world Oniguruma regexes.
- Benefits
- Install and use
- Type definition
- Optimizations
- How performance optimizations work
- Flags
- Disable specific optimizations
- Enable only specific, optional optimizations
- Contributing
- Optimized regexes are shorter. Good for minification.
- Optimized regexes run faster. Some optimizations improve performance and can eliminate or reduce the risk of ReDoS.
- Optimized regexes are typically easier to read, unless the original used flag
xfor insignificant whitespace and comments (which are removed during optimization).
In rare cases, results might be slightly longer than the input.
npm install oniguruma-parserimport {optimize} from 'oniguruma-parser/optimizer';
optimize('[aa]');
// → {pattern: 'a', flags: ''}function optimize(
pattern: string;
options?: {
flags?: string;
override?: {[key in OptimizationName]?: boolean};
rules?: {
allowOrphanBackrefs?: boolean;
captureGroup?: boolean;
singleline?: boolean;
};
}
): {
pattern: string;
flags: string;
};All of the following optimizations are on by default. Optimizations with names can optionally be disabled. Optimizations that don't yet have a name listed are currently always enabled, but will become optional in future versions (see issue).
🚀 = Can improve performance.
| Optimization name | Description | Example | |
|---|---|---|---|
| Alternation | alternationToClass 🚀 |
Use character classes for adjacent alternatives with single-length values | a|b|\d → [ab\d] |
extractPrefix 🚀 |
Extract nodes at the start of every alternative into a prefix | ^aa|^abb|^ac → ^a(?:a|bb|c) |
|
extractPrefix2 🚀 |
Extract alternating prefixes if patterns are repeated for each prefix | ^a|!a|^bb|!bb|^c|!c → (?:^|!)(?:a|bb|c) |
|
extractSuffix |
Extract nodes at the end of every alternative into a suffix | aa$|bba$|ca$ → (?:a|bb|c)a$ |
|
optionalize 🚀 |
Combine adjacent alternatives with only an added last node as the difference | ==|= → ==? |
|
| Backrefs and subroutines | Unenclose numbered backreferences | ()\k<1> → ()\1 |
|
| Remove leading zeros from backreference/subroutine numbers | ()\k<01> → ()\k<1> |
||
| Resolve relative backreference/subroutine numbers | ()\k<-1> → ()\k<1> |
||
| Callouts | simplifyCallouts |
Cleanup callout arguments, removing redundant commas, leading zeros, and empty braces | (*SKIP{,}) → (*SKIP) |
| Characters | Remove unnecessary escapes | \![\?] → ![?] |
|
| Use the simplest character representation | \u0061 → a |
||
| Encoded bytes to code points | \xE2\x82\xAC (U+20AC) → € |
||
| Remove leading zeros from enclosed code point escapes | \x{000ABCDE} → \x{ABCDE} |
||
| Character classes | mergeRanges |
Merge, dedupe, and sort ranges and characters in character classes |
[a\x61] → [a],[abcb-f] → [a-f]
|
unnestUselessClasses |
Unnest character classes when possible |
[a[b]] → [ab],[^[^a]] → [a]
|
|
unwrapNegationWrappers |
Unwrap negated classes used to negate an individual character set | [^\d] → \D |
|
unwrapUselessClasses |
Unwrap outermost non-negated character classes containing a single character or character set | [a] → a |
|
| Character sets | useShorthands |
Use shorthands (\d, \h, \s, etc.) when possible |
[[:space:]\p{Nd}] → [\s\d] |
| Always on[1] | Normalize Unicode property names | \p{-IDS- TART} → \p{ID_Start} |
|
useUnicodeAliases |
Use Unicode property aliases | \p{ID_Start} → \p{IDS} |
|
useUnicodeProps |
Use Unicode properties when possible | [\0-\x{10FFFF}] → [\p{Any}] |
|
| Use outer negation for Unicode properties | \p{^L} → \P{L} |
||
| Comments and whitespace | Remove comment groups and flag x line comments |
(?#comment)a → a |
|
Remove flag x insignificant whitespace |
(?x) a b → (?x)ab |
||
| Flags | Always on[1] | Remove duplicate flags | (?ii-m-m) → (?i-m) |
removeUselessFlags |
Remove flags (from top-level and modifiers) that have no effect | (?x)a → a |
|
| Groups | exposeAnchors |
Pull leading and trailing assertions out of capturing groups when possible; helps group unwrapping | (^a$) → ^(a)$ |
removeEmptyGroups |
Remove empty noncapturing, atomic, and flag groups, even if quantified | (?:)a → a |
|
unwrapUselessGroups |
Unwrap nonbeneficial noncapturing and atomic groups | (?:a) → a |
|
| Quantifiers | preventReDoS 🚀 |
Remove identified ReDoS vulnerabilities without changing matches | (?:a+)*b → (?:a)*b |
| Use symbols for quantifier ranges when possible | a{1,} → a+ |
||
| Remove leading zeros from quantifier ranges | a{01,03} → a{1,3} |
Optimizations are applied in a loop until no further optimization progress is made. Individual optimization transforms are typically narrow and work best when combined with others.
- This optimization isn't expected to become optional in future versions since it results from the nature of the parser, which builds an AST.
Although the optimizer's primary purpose is minification, some optimizations can improve search performance by:
- Reducing backtracking.
- Ex: Reducing use of alternation or adjusting quantifiers in ways that don't change what the regex matches.
- Although Oniguruma includes numerous sophisticated internal optimizations and, in theory, this library's optimizations could be included directly in the engine, in practice, this library is able to find additional opportunities through a combination of cleverness and not having the same extremely tight constraints on compilation time.
- Triggering internal optimizations built into regex engines.
- Ex: More clearly exposing that a particular token must match for any match to occur.
These effects can be significant. Additionally, though less significant, minification can reduce compilation time for extremely long regexes.
Performance improvements can sometimes result from a combination of transformations. For example, consider the following optimization chain:
^1$|^2$— Initial^(?:1|2)$—extractPrefix,extractSuffix^(?:[12])$—alternationToClass^[12]$—unwrapUselessGroups
This sequence of changes happens automatically, assuming none of the individual transforms have been disabled. Note that, although the extractSuffix transform doesn't typically impact performance on its own, its change helped enable removing alternation in the subsequent step, which reduces backtracking and can have a direct performance impact (in some cases, it can even eliminate ReDoS).
A real world example of performance improvements comes from Better C++, which includes a large collection of complex Oniguruma regexes used for highlighting C++ code in VS Code, Shiki, and other tools. Despite having gone through multiple rounds of performance hand-tuning over the years (and despite not including known cases of catastophic backtracking, which could lead to even greater opportunities for performance optimization), pre-running its regexes through this library resulted in a ~30% improvement in syntax highlighting performance. And that improvement wasn't specific to the Oniguruma engine. Using Oniguruma-To-ES to transpile the regexes to native JavaScript RegExp (before and after optimization) showed a comparable ~30% performance boost for JavaScript in V8.
Although the optimizer takes provided flags into account and includes a flags property on the returned object, it never changes top-level flags in ways that would change the meaning of the regex if you didn't provide the updated flags to Oniguruma. As a result, the optimized pattern can be used in situations when you aren't able to change flags. For example, the optimizer removes x from the returned flags (since its effects are always applied), but it doesn't add flags.
import {optimize} from 'oniguruma-parser/optimizer';
const pattern = '...';
const optimized = optimize(pattern, {
override: {
// Disable specific optimizations
removeEmptyGroups: false,
},
});import {optimize, getOptionalOptimizations} from 'oniguruma-parser/optimizer';
const pattern = '...';
const optimized = optimize(pattern, {
override: {
...getOptionalOptimizations({disable: true}),
// Enable specific optimizations
removeEmptyGroups: true,
},
});Contributions are welcome! Review this library's contributing guide, and use the following steps to add a new optimization transform:
- Add your optimization in a new file, e.g.
src/optimizer/transforms/foo.ts. - Import and list it in
src/optimizer/optimizations.ts. - Add tests in
test/optimizer/foo.test.ts; run them viapnpm test. - List it in this readme file with a simple example.
Optimizations can compliment each other, so you don't need to do too much in one. Ideas for new optimizations are collected here.
Created by Steven Levithan and contributors.
Inspiration for the optimizer included regexp-tree, which includes an optimizer for JavaScript regexes.
If you want to support this project, I'd love your help by contributing improvements (guide), sharing it with others, or sponsoring ongoing development.
MIT License.