Skip to content

Latest commit

 

History

History
380 lines (319 loc) · 14.3 KB

File metadata and controls

380 lines (319 loc) · 14.3 KB

Optimizer module 🪄
oniguruma-parser/optimizer

The optimizer transforms Oniguruma patterns into optimized versions of themselves. This optimization includes both minification and performance improvements. Optimized regexes always match exactly the same strings, with the same subpattern matches.

Note

Oniguruma is a regular expression engine written in C that's used in Ruby (via a fork named Onigmo), PHP (mb_ereg, etc.), TextMate grammars (used by VS Code, Shiki, etc.), and many other tools.

Example:

(?x) (?:\!{1,}) (\b(?:ark|arm|art)\b) [[^0-9A-Fa-f]\P{^Nd}\p{ Letter }]

Becomes:

!+\b(ar[kmt])\b[\H\d\p{L}]

Tip

🧪 Try the optimizer demo.

The optimizer has been battle-tested by tm-grammars, which is used by Shiki to process tens of thousands of real-world Oniguruma regexes.

Contents

Benefits

  • Optimized regexes are shorter. Good for minification.
  • Optimized regexes run faster. Some optimizations improve performance and can eliminate or reduce the risk of ReDoS.
  • Optimized regexes are typically easier to read, unless the original used flag x for insignificant whitespace and comments (which are removed during optimization).

In rare cases, results might be slightly longer than the input.

Install and use

npm install oniguruma-parser
import {optimize} from 'oniguruma-parser/optimizer';

optimize('[aa]');
// → {pattern: 'a', flags: ''}

Type definition

function optimize(
  pattern: string;
  options?: {
    flags?: string;
    override?: {[key in OptimizationName]?: boolean};
    rules?: {
      allowOrphanBackrefs?: boolean;
      captureGroup?: boolean;
      singleline?: boolean;
    };
  }
): {
  pattern: string;
  flags: string;
};

Optimizations

All of the following optimizations are on by default. Optimizations with names can optionally be disabled. Optimizations that don't yet have a name listed are currently always enabled, but will become optional in future versions (see issue).

🚀 = Can improve performance.

Optimization name Description Example
Alternation alternationToClass 🚀 Use character classes for adjacent alternatives with single-length values a|b|\d[ab\d]
extractPrefix 🚀 Extract nodes at the start of every alternative into a prefix ^aa|^abb|^ac^a(?:a|bb|c)
extractPrefix2 🚀 Extract alternating prefixes if patterns are repeated for each prefix ^a|!a|^bb|!bb|^c|!c(?:^|!)(?:a|bb|c)
extractSuffix Extract nodes at the end of every alternative into a suffix aa$|bba$|ca$(?:a|bb|c)a$
optionalize 🚀 Combine adjacent alternatives with only an added last node as the difference ==|===?
Backrefs and subroutines Unenclose numbered backreferences ()\k<1>()\1
Remove leading zeros from backreference/subroutine numbers ()\k<01>()\k<1>
Resolve relative backreference/subroutine numbers ()\k<-1>()\k<1>
Callouts simplifyCallouts Cleanup callout arguments, removing redundant commas, leading zeros, and empty braces (*SKIP{,})(*SKIP)
Characters Remove unnecessary escapes \![\?]![?]
Use the simplest character representation \u0061a
Encoded bytes to code points \xE2\x82\xAC (U+20AC) →
Remove leading zeros from enclosed code point escapes \x{000ABCDE}\x{ABCDE}
Character classes mergeRanges Merge, dedupe, and sort ranges and characters in character classes [a\x61][a],
[abcb-f][a-f]
unnestUselessClasses Unnest character classes when possible [a[b]][ab],
[^[^a]][a]
unwrapNegationWrappers Unwrap negated classes used to negate an individual character set [^\d]\D
unwrapUselessClasses Unwrap outermost non-negated character classes containing a single character or character set [a]a
Character sets useShorthands Use shorthands (\d, \h, \s, etc.) when possible [[:space:]\p{Nd}][\s\d]
Always on[1] Normalize Unicode property names \p{-IDS- TART}\p{ID_Start}
useUnicodeAliases Use Unicode property aliases \p{ID_Start}\p{IDS}
useUnicodeProps Use Unicode properties when possible [\0-\x{10FFFF}][\p{Any}]
Use outer negation for Unicode properties \p{^L}\P{L}
Comments and whitespace Remove comment groups and flag x line comments (?#comment)aa
Remove flag x insignificant whitespace (?x) a b(?x)ab
Flags Always on[1] Remove duplicate flags (?ii-m-m)(?i-m)
removeUselessFlags Remove flags (from top-level and modifiers) that have no effect (?x)aa
Groups exposeAnchors Pull leading and trailing assertions out of capturing groups when possible; helps group unwrapping (^a$)^(a)$
removeEmptyGroups Remove empty noncapturing, atomic, and flag groups, even if quantified (?:)aa
unwrapUselessGroups Unwrap nonbeneficial noncapturing and atomic groups (?:a)a
Quantifiers preventReDoS 🚀 Remove identified ReDoS vulnerabilities without changing matches (?:a+)*b(?:a)*b
Use symbols for quantifier ranges when possible a{1,}a+
Remove leading zeros from quantifier ranges a{01,03}a{1,3}

Optimizations are applied in a loop until no further optimization progress is made. Individual optimization transforms are typically narrow and work best when combined with others.

Footnotes

  1. This optimization isn't expected to become optional in future versions since it results from the nature of the parser, which builds an AST.

How performance optimizations work

Although the optimizer's primary purpose is minification, some optimizations can improve search performance by:

  • Reducing backtracking.
    • Ex: Reducing use of alternation or adjusting quantifiers in ways that don't change what the regex matches.
    • Although Oniguruma includes numerous sophisticated internal optimizations and, in theory, this library's optimizations could be included directly in the engine, in practice, this library is able to find additional opportunities through a combination of cleverness and not having the same extremely tight constraints on compilation time.
  • Triggering internal optimizations built into regex engines.
    • Ex: More clearly exposing that a particular token must match for any match to occur.

These effects can be significant. Additionally, though less significant, minification can reduce compilation time for extremely long regexes.

Performance improvements can sometimes result from a combination of transformations. For example, consider the following optimization chain:

  1. ^1$|^2$ — Initial
  2. ^(?:1|2)$extractPrefix, extractSuffix
  3. ^(?:[12])$alternationToClass
  4. ^[12]$unwrapUselessGroups

This sequence of changes happens automatically, assuming none of the individual transforms have been disabled. Note that, although the extractSuffix transform doesn't typically impact performance on its own, its change helped enable removing alternation in the subsequent step, which reduces backtracking and can have a direct performance impact (in some cases, it can even eliminate ReDoS).

A real world example of performance improvements comes from Better C++, which includes a large collection of complex Oniguruma regexes used for highlighting C++ code in VS Code, Shiki, and other tools. Despite having gone through multiple rounds of performance hand-tuning over the years (and despite not including known cases of catastophic backtracking, which could lead to even greater opportunities for performance optimization), pre-running its regexes through this library resulted in a ~30% improvement in syntax highlighting performance. And that improvement wasn't specific to the Oniguruma engine. Using Oniguruma-To-ES to transpile the regexes to native JavaScript RegExp (before and after optimization) showed a comparable ~30% performance boost for JavaScript in V8.

Flags

Although the optimizer takes provided flags into account and includes a flags property on the returned object, it never changes top-level flags in ways that would change the meaning of the regex if you didn't provide the updated flags to Oniguruma. As a result, the optimized pattern can be used in situations when you aren't able to change flags. For example, the optimizer removes x from the returned flags (since its effects are always applied), but it doesn't add flags.

Disable specific optimizations

import {optimize} from 'oniguruma-parser/optimizer';

const pattern = '...';
const optimized = optimize(pattern, {
  override: {
    // Disable specific optimizations
    removeEmptyGroups: false,
  },
});

Enable only specific, optional optimizations

import {optimize, getOptionalOptimizations} from 'oniguruma-parser/optimizer';

const pattern = '...';
const optimized = optimize(pattern, {
  override: {
    ...getOptionalOptimizations({disable: true}),
    // Enable specific optimizations
    removeEmptyGroups: true,
  },
});

Contributing

Contributions are welcome! Review this library's contributing guide, and use the following steps to add a new optimization transform:

  • Add your optimization in a new file, e.g. src/optimizer/transforms/foo.ts.
  • Import and list it in src/optimizer/optimizations.ts.
  • Add tests in test/optimizer/foo.test.ts; run them via pnpm test.
  • List it in this readme file with a simple example.

Optimizations can compliment each other, so you don't need to do too much in one. Ideas for new optimizations are collected here.

About

Created by Steven Levithan and contributors.

Inspiration for the optimizer included regexp-tree, which includes an optimizer for JavaScript regexes.

If you want to support this project, I'd love your help by contributing improvements (guide), sharing it with others, or sponsoring ongoing development.

MIT License.