@gruhn/regex-utils
    Preparing search index...

    @gruhn/regex-utils

    Regex Utils

    Zero-dependency TypeScript library for regex utilities that go beyond string matching. These are surprisingly hard to come by for any programming language. ✨

    npm install @gruhn/regex-utils
    
    import { RB } from '@gruhn/regex-utils'
    
    Feature Support Examples
    Quantifiers a*, a+, a{3,10}, a?
    Lazy Quantifiers a*?, a+?, a{3,10}?, a??
    Alternation a|b
    Character classes ., \w, [a-zA-Z], ...
    Escaping \$, \., ...
    (Non-)capturing groups (?:...), (...)
    Start/end anchors ⚠️1 ^, $
    Lookahead ⚠️2 (?=...), (?!...)
    Lookbehind ⚠️2 (?<=...), (?<!...)
    Word boundary \b, \B
    Unicode property escapes \p{...}, \P{...}
    Backreferences \1 \2 ...
    dotAll flag /.../s, (?s:...)
    global flag /.../g
    hasIndices flag /.../d
    ignoreCase flag /.../i (?i:...)
    multiline flag /.../m (?m:...)
    unicode flag /.../u
    unicodeSets flag /.../v
    sticky flag /.../y
    1. Some complex patterns are not supported like anchors inside quantifiers (^a)+ or anchors inside lookaheads (?=^a).
    2. Not supported are nested lookaheads/lookbehinds like (?=a(?=b)) and lookaheads/lookbehinds combinations like (?=a)b(?<=c).

    An UnsupportedSyntaxError is thrown when unsupported patterns are detected. The library SHOULD ALWAYS either throw an error or respect the regex specification exactly. Please report a bug if the library silently uses a faulty interpretation.

    Handling syntax-related errors:

    import { RB, ParseError, UnsupportedSyntaxError } from '@gruhn/regex-utils'

    try {
    RB(/^[a-z]*$/)
    } catch (error) {
    if (error instanceof SyntaxError) {
    // Invalid regex syntax! Native error, not emitted by this library.
    // E.g. this will also throw a `SyntaxError`: new RegExp(')')
    } else if (error instanceof ParseError) {
    // The regex syntax is valid but the internal parser could not handle it.
    // If this happens it's a bug in this library.
    } else if (error instanceof UnsupportedSyntaxError) {
    // Regex syntax is valid but not supported by this library.
    }
    }

    Generate 5 random email addresses:

    const email = RB(/^[a-z]+@[a-z]+\.[a-z]{2,3}$/)
    for (const str of email.sample().take(5)) {
    console.log(str)
    }
    ky@e.no
    cc@gg.gaj
    z@if.ojk
    vr@y.ehl
    e@zx.hzq

    Generate 5 random email addresses, which have exactly 20 characters:

    const emailLength20 = email.and(/^.{20}$/)
    for (const str of emailLength20.sample().take(5)) {
    console.log(str)
    }
    kahragjijttzyze@i.mv
    gnpbjzll@cwoktvw.hhd
    knqmyotxxblh@yip.ccc
    kopfpstjlnbq@lal.nmi
    vrskllsvblqb@gemi.wc

    ONLINE DEMO

    Say we found this incredibly complicated regex somewhere in the codebase:

    const oldRegex = /^a|b$/
    

    This can be simplified, right?

    const newRegex = /^[ab]$/
    

    But to double-check we can use .isEquivalent to verify that the new version matches exactly the same strings as the old version. That is, whether oldRegex.test(str) === newRegex.test(str) for every possible input string:

    RB(oldRegex).isEquivalent(newRegex) // false
    

    Looks like we made some mistake. We can generate counterexamples using .without(...) and .sample(...). First, we derive new regex that match exactly what newRegex matches but not oldRegex and vice versa:

    const onlyNew = RB(newRegex).without(oldRegex)
    const onlyOld = RB(oldRegex).without(newRegex)

    onlyNew turns out to be empty (onlyNew.isEmpty() === true) but onlyOld has some matches:

    for (const str of onlyOld.sample().take(5)) {
    console.log(str)
    }
    aaba
    aa
    aba
    bab
    aababa

    Why does oldRegex match all these strings with multiple characters? Shouldn't it only match "a" or "b" like newRegex? Turns out we thought that oldRegex is the same as ^(a|b)$ but in reality it's the same as (^a)|(b$).

    How do you write a regex that matches HTML comments like:

    <!-- This is a comment -->
    

    A straightforward attempt would be:

    <!--.*-->
    

    The problem is that .* also matches the end marker -->, so this is also a match:

    <!-- This is a comment --> and this shouldn't be part of it -->
    

    We need to specify that the inner part can be any string that does not contain -->. With .not() (aka. regex complement) this is easy:

    import { RB } from '@gruhn/regex-utils'

    const commentStart = RB('<!--')
    const commentInner = RB(/^.*-->.*$/).not()
    const commentEnd = RB('-->')

    const comment = commentStart.concat(commentInner).concat(commentEnd)

    With .toRegExp() we can convert back to a native JavaScript regex:

    comment.toRegExp()
    
    /^<!--(---*[^->]|-?[^-])*---*>$/
    

    ONLINE DEMO

    It's difficult to write a single regex for multiple independent constraints. For example, to specify a valid password. But with regex intersections it's very natural:

    import { RB } from '@gruhn/regex-utils'

    const passwordRegex = RB(/^[a-zA-Z0-9]{12,32}$/) // 12-32 alphanumeric characters
    .and(/[0-9]/) // contains a number
    .and(/[A-Z]/) // contains an upper case letter
    .and(/[a-z]/) // contains a lower case letter

    We can convert this back to a native JavaScript RegExp with:

    passwordRegex.toRegExp()
    
    Note

    The output RegExp can be very large.

    We can also use other utilities like .size() to determine how many potential passwords match this regex:

    console.log(passwordRegex.size())
    
    2301586451429392354821768871006991487961066695735482449920n
    

    With .sample() we can generate some of these matches:

    for (const str of passwordRegex.sample().take(10)) {
    console.log(str)
    }
    NEWJIAXQISWT0Wwm
    lxoegadrzeynezkmtfcIBzzQ9e
    ypzvhvtwpWk4u6
    MSZXXKIKEKWKXLQ8HQ7Ds
    BCBSFBSMNOLKlgQN5L
    8950244600709IW1pg
    UOTQBLVOTZQWFSAJYBXZNQBEeom0l
    520302447164378435bv4dp4ysC
    71073970686490eY2Jt4
    afgpnxqwUK5B

    In the coding puzzle Advent Of Code 2023 - Day 12 you are given pairs of string patterns. An example pair is .??..??...?##. and 1,1,3. Both patterns describe a class of strings and the task is to count the number of strings that match both patterns.

    In the first pattern, . and # stand for the literal characters "dot" and "hash". The ? stands for either . or #. This can be written as a regular expression:

    • for # we simply write #
    • for . we write o (since . is a reserved symbol in regular expressions)
    • for ? we write (o|#)

    So the pattern .??..??...?##. would be written as:

    const firstRegex = /^o(o|#)(o|#)oo(o|#)(o|#)ooo(o|#)##o$/
    

    In the second pattern, each digit stands for a sequence of # separated by at least one o. This can also be written as a regular expression:

    • For a digit like 3 we write #{3}.
    • Between digits we write o+.
    • Additionally, arbitrary many o are allowed at the start and end, so we add o* at the start and end.

    Thus, 1,1,3 would be written as:

    const secondRegex = /^o*#{1}o+#{1}o+#{3}o*$/
    

    To solve the task and find the number of strings that match both regex, we can use .and(...) and .size() from regex-utils. .and(...) computes the intersection of two regular expressions. That is, it creates a new regex which exactly matches the strings matched by both input regex.

    const intersection = RB(firstRegex).and(secondRegex)
    

    With .size() we can then determine the number of matched strings:

    console.log(intersection.size())
    
    4n
    

    While at it, we can also try .enumerate() to list all these matches:

    for (const str of intersection.enumerate()) {
    console.log(str)
    }
    oo#ooo#ooo###o
    o#oooo#ooo###o
    oo#oo#oooo###o
    o#ooo#oooo###o

    For a full solution checkout: ./benchmark/aoc2023-day12.ts.

    Heavily informed by these papers: