perf(lexer): try phf_map for `match_keyword` by Boshen · Pull Request #140 · oxc-project/oxc

Boshen · 2023-03-06T14:41:31Z

No description provided.

github-actions · 2023-03-06T14:55:57Z

Parser Benchmark Results - ubuntu-latest

group                      main                                   pr
-----                      ----                                   --
parser/babylon.max.js      1.00    122.4±0.55ms    84.3 MB/sec    1.00    122.8±0.38ms    84.1 MB/sec
parser/d3.js               1.00     14.7±0.08ms    37.1 MB/sec    1.01     14.9±0.08ms    36.6 MB/sec
parser/lodash.js           1.00      4.2±0.01ms   121.6 MB/sec    1.02      4.3±0.04ms   118.8 MB/sec
parser/pdf.js              1.00      8.4±0.03ms    47.9 MB/sec    1.02      8.5±0.08ms    47.1 MB/sec
parser/typescript.js       1.00    122.8±0.27ms    78.3 MB/sec    1.01    124.2±0.26ms    77.4 MB/sec
semantic/babylon.max.js    1.00    274.6±6.01ms    37.6 MB/sec    1.02    279.6±8.56ms    36.9 MB/sec
semantic/d3.js             1.00     20.4±0.70ms    26.7 MB/sec    1.00     20.5±0.62ms    26.6 MB/sec
semantic/lodash.js         1.00      5.3±0.30ms    97.7 MB/sec    1.02      5.4±0.08ms    95.3 MB/sec
semantic/pdf.js            1.00     15.0±0.46ms    26.8 MB/sec    1.02     15.3±0.60ms    26.2 MB/sec
semantic/typescript.js     1.00    179.3±7.34ms    53.6 MB/sec    1.02    182.3±7.39ms    52.8 MB/sec

Boshen · 2023-03-06T15:22:56Z

@YoniFeng I'm stuck optimizing this routine, do you have any trick up your sleeve?

I know @strager has done some SIMD magic, but I think this is beyond my reach:

But we should at least get LLVM to hand us some finely tuned assembly code, or have @strager point us in the right direction 😁

strager · 2023-03-06T20:06:56Z

 use std::fmt;

 use oxc_ast::Atom;
+use phf::phf_map;


phf is a terrible library in terms of performance. (In my benchmarking, it is easily beatable by a general purpose hash table with a custom hash function.) phf seems to be designed for large tables [1], not run-time performance.

If you want something off-the-shelf and easy, I suggest gperf. Unfortunately, I think gperf cannot generate Rust code. But you can have it generate C code which you wrap in Rust. Or you can add a Rust backend to gperf, or you can write a script which rewrites gperf's C output into Rust.

In quick-lint-js, switching from gperf to my custom keyword lookup function (without inline assembly) reduced jQuery parsing times by 4-5% on my Ryzen 9 5950X. Some benchmark data for just the lookup function (higher is better):

Rust std::collections::HashMap (default hash function): 31.6M lookups/s Rust phf: 37.9M lookups/s C++ std::unordered_map (default hash function): 33.9M lookups/s C++ std::unordered_map (custom hash function): 45.3M lookups/s gperf: 122.2M lookups/s mine (production): (unknown; probably around 200M lookup/s) mine (with inline assembly): 338.1M lookups/s 😉

(The "Rust std::collections::HashMap" variant used unsafe code because there was a mutable global variable. If you're allergic to unsafe, expect it to be slightly slower due to lazy_static overhead or Mutex overhead or something.)

[1] From the docs: "It currently uses the CHD algorithm and can generate a 100,000 entry map in roughly .4 seconds."

Interesting.

I just looked at your code - it's exactly what I started imagining is possible.
Super efficient hash function based on the input assumptions (voodoo), and then the key comparison is SIMD string comparison.
Neat!

YoniFei · 2023-03-06T21:41:34Z

Are you doing this for fun or you think it matters?
Lookups don't seem substantial in profiles.
Still, you nerd sniped me to take a look..

I've noticed 2 things so far:

1.
The existing match statement's assembly is comparison after comparison:

pub fn match_keyword(s: &str) -> (Self, Atom) {
 push    r15
 push    r14
 push    r12
 push    rsi
 push    rdi
 push    rbx
 sub     rsp, 72
 mov     r12, r8
 mov     r15, rdx
 mov     r14, rcx
 if len == 1 || len >= 12 {
 cmp     r8, 1
 je      .LBB2359_4
 cmp     r12, 11
 ja      .LBB2359_4
     lea     rax, [r12, -, 2]
     cmp     rax, 9
     ja      .LBB2359_124
     lea     rcx, [rip, +, .LJTI2359_0]
     movsxd  rax, dword, ptr, [rcx, +, 4*rax]
     add     rax, rcx
     jmp     rax
.LBB2359_16:
     movzx   eax, word, ptr, [r15]
     cmp     eax, 29537
 "as" => (As, Atom::new_inline("as")),
 je      .LBB2359_127
     movzx   eax, word, ptr, [r15]
     cmp     eax, 28516
 "do" => (Do, Atom::new_inline("do")),
 je      .LBB2359_128
     movzx   eax, word, ptr, [r15]
     cmp     eax, 26217
 "if" => (If, Atom::new_inline("if")),
 je      .LBB2359_129
     movzx   eax, word, ptr, [r15]
     cmp     eax, 28265
 "in" => (In, Atom::new_inline("in")),
 je      .LBB2359_130
     movzx   eax, word, ptr, [r15]
     cmp     eax, 29545
 "is" => (Is, Atom::new_inline("is")),
 je      .LBB2359_23
     movzx   eax, word, ptr, [r15]
     cmp     eax, 26223
 "of" => (Of, Atom::new_inline("of")),
 jne     .LBB2359_124

Which gets more expensive the longer the string (i.e. once it's >4 chars, >8 chars etc)

The jumps are to blocks that look like this

.LBB2359_132:
     buffer[i - 1] = text[i - 1]; (C:\Users\yfeig\.cargo\registry\src\github.com-1ecc6299db9ec823\compact_str-0.7.0\src\repr\inline.rs:63)
     xorps   xmm0, xmm0
     movups  xmmword, ptr, [rsp, +, 51], xmm0
     mov     dword, ptr, [rsp, +, 67], 0
     mov     word, ptr, [rsp, +, 48], 28518
     mov     byte, ptr, [rsp, +, 50], 114
     InlineBuffer(buffer) (C:\Users\yfeig\.cargo\registry\src\github.com-1ecc6299db9ec823\compact_str-0.7.0\src\repr\inline.rs:67)
     mov     rax, qword, ptr, [rsp, +, 63]
     mov     qword, ptr, [r14, +, 23], rax
     movzx   eax, word, ptr, [rsp, +, 48]
     mov     word, ptr, [r14, +, 8], ax
     movzx   eax, byte, ptr, [rsp, +, 50]
     mov     byte, ptr, [r14, +, 10], al
     mov     rax, qword, ptr, [rsp, +, 51]
     mov     qword, ptr, [r14, +, 11], rax
     mov     eax, dword, ptr, [rsp, +, 59]
     mov     dword, ptr, [r14, +, 19], eax
     movzx   eax, byte, ptr, [rsp, +, 63]
     mov     byte, ptr, [r14, +, 23], al

This doesn't seem right, because that's the code for new_const.

pub const fn new_const(text: &str) -> Self {
        if text.len() > MAX_SIZE {
            panic!("Provided string has a length greater than our MAX_SIZE");
        }

        let len = text.len();
        let mut buffer = [0u8; MAX_SIZE];

        // set the length
        buffer[MAX_SIZE - 1] = len as u8 | LENGTH_MASK;

        // Note: for loops aren't allowed in `const fn`, hence the while.
        // Note: Iterating forward results in badly optimized code, because the compiler tries to
        //       unroll the loop.
        let text = text.as_bytes();
        let mut i = len;
        while i > 0 {
            buffer[i - 1] = text[i - 1];
            i -= 1;
        }

        InlineBuffer(buffer)
    }

It's as if the Atom::new_inline() aren't const.
(Note that in mov word, ptr, [rsp, +, 48], 28518, the number 28518 is for UTF-8 "of". These code blocks aren't for the case where we didn't have a match. There are dozens of such blocks, they each an inline of new_inline for a specific hardcoded value.)

I didn't test, but I wonder if moving the match to its own function that returns a &'static Atom would fix this.

I cloned your branch with the phf_map, and this isn't happening. I think it might thanks to KEYWORDS being static (maybe in combination with the value tuples being explicitly stored in an array?).
I'm surprised there's no noticeable improvement. I guess it doesn't outweigh #2.

2.
phf_map! is interesting.
It creates a (general) perfect hashing map for us with no effort, which is amazing:

phf::Map {
    key: 12913932095322966823u64,
    disps: &[
        (1u32, 45u32),
        (0u32, 0u32),
        (1u32, 37u32),
        (1u32, 22u32),
        (4u32, 54u32),
        (0u32, 1u32),
        (1u32, 50u32),
        (9u32, 76u32),
        (0u32, 29u32),
        (5u32, 43u32),
        (0u32, 0u32),
        (0u32, 1u32),
        (3u32, 5u32),
        (7u32, 64u32),
        (3u32, 5u32),
        (83u32, 20u32),
        (3u32, 69u32),
    ],
    entries: &[
        ("abstract", (Abstract, Atom::new_inline("abstract"))),
        ("implements", (Implements, Atom::new_inline("implements"))),
        ("false", (False, Atom::new_inline("false"))),
        ("do", (Do, Atom::new_inline("do"))),
        ("get", (Get, Atom::new_inline("get"))),
        ...
    ],

The hashing function used is SipHash which is the default in Rust (SipHasher13).
It has cryptographic benefits that aren't relevant for our use case (and, I imagine, for the majority of phf_map uses..), so using other hash functions could result in a speedup (however, the phf crate doesn't seem to have support for this).

A hand-written PHF map and hashing function tailored to this use case (and known assumptions: keys are finite (small) set of strings, length between 2 to 12) could definitely be faster.
Like I said initially - it doesn't seem like the perf gains would be meaningful, but it could be done "just for fun".

I'm still a little baffled at how the PR performs worse.
On one hand I'd expect most of the shorter keywords to be way more common (in which case, a handful of comparisons is less work than the hash), but on the other hand I'd expect most identifiers to be variable names that fall between 2 to 12 chars (and have to go through all of the comparisons, which I'm guessing is much more work than the hash?).

Boshen · 2023-03-07T02:56:29Z

Are you doing this for fun or you think it matters? Lookups don't seem substantial in profiles.

I always have a gut feeling this can be improved, because @strager mentioned SIMD so I looked at it last night, and you showed me how to look at the assembly code.

This is the hottest path in our lexer and it's dominant on the profiler, around 2-5%.
I just couldn't sleep in peace when I know it is not optimized, I've been awoke since 5am 😭

So here are my findings:

Why does the const Atom::new_inline not compiling down to a compact representation? Is it because there's a loop inside the code? This part is beyond me.
Changing to &static seems to have a better performance, perf(lexer): try &static Atom for match_keyword #143, but the assembly code look almost identical?

If I want to fine tune the string matching part:

Why is the matching compared string by string? Should I explicitly do:
- branch by string length?
- order the keywords by frequency they appear in JavaScript / TypeScript?
- add an explicit check for TypeScript to match less keywords?

If I want to go the phf route:

I guess we need to find a better phf hashing alternative?

The best we can do is study and port the SIMD algorithm from @strager, but again, this is beyond my knowledge.

YoniFei · 2023-03-07T03:28:52Z

Why does the const Atom::new_inline not compiling down to a compact representation? Is it because there's a loop inside the code? This part is beyond me.

What do you mean?
As far as I can tell, the strings are compact - they live on the stack.

Changing to &static seems to have a better performance, #143, but the assembly code look almost identical?

You still see the bazillion inlined new_const?
Based on the improvement, I'm inclined to believe they aren't there (can take a look if needed).

Why is the matching compared string by string? Should I explicitly do:
branch by string length?
order the keywords by frequency they appear in JavaScript / TypeScript?
add an explicit check for TypeScript to match less keywords?

I'd expect it to give an improvement, yeah. Worth checking out.

I guess we need to find a better phf hashing alternative?

I searched around and didn't find anything.
Since I became vested, I might try to make something as a side project.
A basic one would just use a faster hash than phf-map. A more advanced one would also allow using a custom hash and key comparison, which would enable creating super-optimal solutions, like @strager's, per use-case.

The best we can do is study and port the SIMD algorithm

The only SIMD involved is for the string key comparison (after you hash and get an index you still need to compare the lookup key with the found key, because inputs can be arbitrary right? identifier names are arbitrary). That's keyword_lexer::key_strings_equal.

The real voodoo is static selection_type select(const char* key, std::size_t key_length), which efficiently takes the first and last two chars, which are unique in the keyword set, and uses those bits for the "dumb" hash algorithm.
It's brilliant.

Boshen · 2023-03-07T05:11:19Z

Since I became vested, I might try to make something as a side project.

There's a PR sitting there for two years, and given the amount of "Rewrite it in Rust" projects going on, this would be a great and valuable project to work on.

The real voodoo is static selection_type select(const char* key, std::size_t key_length), which efficiently takes the first and last two chars, which are unique in the keyword set, and uses those bits for the "dumb" hash algorithm.
It's brilliant.

This technique can be found from Wojciech Muła's article on SIMD-friendly algorithms for substring searching.

Boshen · 2023-03-07T05:13:17Z

So given all the things we can improve on, I'll walk up the ladder and tune this step by step.

Thank you so much for all the inputs, @YoniFeng and @strager, I learned so much from you.

perf(lexer): try phf_map for match_keyword

02a0aae

oxc-project deleted a comment from github-actions Bot Mar 6, 2023

strager reviewed Mar 6, 2023

View reviewed changes

Boshen closed this Mar 7, 2023

This was referenced Mar 10, 2023

perf(lexer): low hanging fruits #151

Closed

perf(lexer): try perfect hash table for keyword lookup #171

Closed

Boshen deleted the keyword branch March 21, 2023 02:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

perf(lexer): try phf_map for `match_keyword`#140

perf(lexer): try phf_map for `match_keyword`#140
Boshen wants to merge 1 commit into
mainfrom
keyword

Boshen commented Mar 6, 2023

Uh oh!

github-actions Bot commented Mar 6, 2023

Uh oh!

Boshen commented Mar 6, 2023

Uh oh!

strager Mar 6, 2023

Uh oh!

YoniFei Mar 6, 2023

Uh oh!

YoniFei commented Mar 6, 2023

Uh oh!

Boshen commented Mar 7, 2023 •

edited

Loading

Uh oh!

YoniFei commented Mar 7, 2023 •

edited

Loading

Uh oh!

Boshen commented Mar 7, 2023

Uh oh!

Boshen commented Mar 7, 2023 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Uh oh!

Conversation

Boshen commented Mar 6, 2023

Uh oh!

github-actions Bot commented Mar 6, 2023

Parser Benchmark Results - ubuntu-latest

Uh oh!

Boshen commented Mar 6, 2023

Uh oh!

strager Mar 6, 2023

Choose a reason for hiding this comment

Uh oh!

YoniFei Mar 6, 2023

Choose a reason for hiding this comment

Uh oh!

YoniFei commented Mar 6, 2023

Uh oh!

Boshen commented Mar 7, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

YoniFei commented Mar 7, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Boshen commented Mar 7, 2023

Uh oh!

Boshen commented Mar 7, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Boshen commented Mar 7, 2023 •

edited

Loading

YoniFei commented Mar 7, 2023 •

edited

Loading

Boshen commented Mar 7, 2023 •

edited

Loading