Skip to content

perf(lexer): try phf_map for match_keyword#140

Closed
Boshen wants to merge 1 commit into
mainfrom
keyword
Closed

perf(lexer): try phf_map for match_keyword#140
Boshen wants to merge 1 commit into
mainfrom
keyword

Conversation

@Boshen

@Boshen Boshen commented Mar 6, 2023

Copy link
Copy Markdown
Member

No description provided.

@github-actions

github-actions Bot commented Mar 6, 2023

Copy link
Copy Markdown
Contributor

Parser Benchmark Results - ubuntu-latest

group                      main                                   pr
-----                      ----                                   --
parser/babylon.max.js      1.00    122.4±0.55ms    84.3 MB/sec    1.00    122.8±0.38ms    84.1 MB/sec
parser/d3.js               1.00     14.7±0.08ms    37.1 MB/sec    1.01     14.9±0.08ms    36.6 MB/sec
parser/lodash.js           1.00      4.2±0.01ms   121.6 MB/sec    1.02      4.3±0.04ms   118.8 MB/sec
parser/pdf.js              1.00      8.4±0.03ms    47.9 MB/sec    1.02      8.5±0.08ms    47.1 MB/sec
parser/typescript.js       1.00    122.8±0.27ms    78.3 MB/sec    1.01    124.2±0.26ms    77.4 MB/sec
semantic/babylon.max.js    1.00    274.6±6.01ms    37.6 MB/sec    1.02    279.6±8.56ms    36.9 MB/sec
semantic/d3.js             1.00     20.4±0.70ms    26.7 MB/sec    1.00     20.5±0.62ms    26.6 MB/sec
semantic/lodash.js         1.00      5.3±0.30ms    97.7 MB/sec    1.02      5.4±0.08ms    95.3 MB/sec
semantic/pdf.js            1.00     15.0±0.46ms    26.8 MB/sec    1.02     15.3±0.60ms    26.2 MB/sec
semantic/typescript.js     1.00    179.3±7.34ms    53.6 MB/sec    1.02    182.3±7.39ms    52.8 MB/sec

@Boshen

Boshen commented Mar 6, 2023

Copy link
Copy Markdown
Member Author

@YoniFeng I'm stuck optimizing this routine, do you have any trick up your sleeve?

I know @strager has done some SIMD magic, but I think this is beyond my reach:

But we should at least get LLVM to hand us some finely tuned assembly code, or have @strager point us in the right direction 😁

@oxc-project oxc-project deleted a comment from github-actions Bot Mar 6, 2023
@oxc-project oxc-project deleted a comment from github-actions Bot Mar 6, 2023
use std::fmt;

use oxc_ast::Atom;
use phf::phf_map;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

phf is a terrible library in terms of performance. (In my benchmarking, it is easily beatable by a general purpose hash table with a custom hash function.) phf seems to be designed for large tables [1], not run-time performance.

If you want something off-the-shelf and easy, I suggest gperf. Unfortunately, I think gperf cannot generate Rust code. But you can have it generate C code which you wrap in Rust. Or you can add a Rust backend to gperf, or you can write a script which rewrites gperf's C output into Rust.

In quick-lint-js, switching from gperf to my custom keyword lookup function (without inline assembly) reduced jQuery parsing times by 4-5% on my Ryzen 9 5950X. Some benchmark data for just the lookup function (higher is better):

Rust std::collections::HashMap (default hash function):  31.6M lookups/s
Rust phf:                                                37.9M lookups/s
C++ std::unordered_map (default hash function):          33.9M lookups/s
C++ std::unordered_map (custom hash function):           45.3M lookups/s
gperf:                                                  122.2M lookups/s
mine (production):                                      (unknown; probably around 200M lookup/s)
mine (with inline assembly):                            338.1M lookups/s 😉

(The "Rust std::collections::HashMap" variant used unsafe code because there was a mutable global variable. If you're allergic to unsafe, expect it to be slightly slower due to lazy_static overhead or Mutex overhead or something.)

[1] From the docs: "It currently uses the CHD algorithm and can generate a 100,000 entry map in roughly .4 seconds."

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting.

I just looked at your code - it's exactly what I started imagining is possible.
Super efficient hash function based on the input assumptions (voodoo), and then the key comparison is SIMD string comparison.
Neat!

@YoniFei

YoniFei commented Mar 6, 2023

Copy link
Copy Markdown
Contributor

Are you doing this for fun or you think it matters?
Lookups don't seem substantial in profiles.
Still, you nerd sniped me to take a look..

I've noticed 2 things so far:

1.
The existing match statement's assembly is comparison after comparison:

pub fn match_keyword(s: &str) -> (Self, Atom) {
 push    r15
 push    r14
 push    r12
 push    rsi
 push    rdi
 push    rbx
 sub     rsp, 72
 mov     r12, r8
 mov     r15, rdx
 mov     r14, rcx
 if len == 1 || len >= 12 {
 cmp     r8, 1
 je      .LBB2359_4
 cmp     r12, 11
 ja      .LBB2359_4
     lea     rax, [r12, -, 2]
     cmp     rax, 9
     ja      .LBB2359_124
     lea     rcx, [rip, +, .LJTI2359_0]
     movsxd  rax, dword, ptr, [rcx, +, 4*rax]
     add     rax, rcx
     jmp     rax
.LBB2359_16:
     movzx   eax, word, ptr, [r15]
     cmp     eax, 29537
 "as" => (As, Atom::new_inline("as")),
 je      .LBB2359_127
     movzx   eax, word, ptr, [r15]
     cmp     eax, 28516
 "do" => (Do, Atom::new_inline("do")),
 je      .LBB2359_128
     movzx   eax, word, ptr, [r15]
     cmp     eax, 26217
 "if" => (If, Atom::new_inline("if")),
 je      .LBB2359_129
     movzx   eax, word, ptr, [r15]
     cmp     eax, 28265
 "in" => (In, Atom::new_inline("in")),
 je      .LBB2359_130
     movzx   eax, word, ptr, [r15]
     cmp     eax, 29545
 "is" => (Is, Atom::new_inline("is")),
 je      .LBB2359_23
     movzx   eax, word, ptr, [r15]
     cmp     eax, 26223
 "of" => (Of, Atom::new_inline("of")),
 jne     .LBB2359_124

Which gets more expensive the longer the string (i.e. once it's >4 chars, >8 chars etc)

The jumps are to blocks that look like this

.LBB2359_132:
     buffer[i - 1] = text[i - 1]; (C:\Users\yfeig\.cargo\registry\src\github.com-1ecc6299db9ec823\compact_str-0.7.0\src\repr\inline.rs:63)
     xorps   xmm0, xmm0
     movups  xmmword, ptr, [rsp, +, 51], xmm0
     mov     dword, ptr, [rsp, +, 67], 0
     mov     word, ptr, [rsp, +, 48], 28518
     mov     byte, ptr, [rsp, +, 50], 114
     InlineBuffer(buffer) (C:\Users\yfeig\.cargo\registry\src\github.com-1ecc6299db9ec823\compact_str-0.7.0\src\repr\inline.rs:67)
     mov     rax, qword, ptr, [rsp, +, 63]
     mov     qword, ptr, [r14, +, 23], rax
     movzx   eax, word, ptr, [rsp, +, 48]
     mov     word, ptr, [r14, +, 8], ax
     movzx   eax, byte, ptr, [rsp, +, 50]
     mov     byte, ptr, [r14, +, 10], al
     mov     rax, qword, ptr, [rsp, +, 51]
     mov     qword, ptr, [r14, +, 11], rax
     mov     eax, dword, ptr, [rsp, +, 59]
     mov     dword, ptr, [r14, +, 19], eax
     movzx   eax, byte, ptr, [rsp, +, 63]
     mov     byte, ptr, [r14, +, 23], al

This doesn't seem right, because that's the code for new_const.

pub const fn new_const(text: &str) -> Self {
        if text.len() > MAX_SIZE {
            panic!("Provided string has a length greater than our MAX_SIZE");
        }

        let len = text.len();
        let mut buffer = [0u8; MAX_SIZE];

        // set the length
        buffer[MAX_SIZE - 1] = len as u8 | LENGTH_MASK;

        // Note: for loops aren't allowed in `const fn`, hence the while.
        // Note: Iterating forward results in badly optimized code, because the compiler tries to
        //       unroll the loop.
        let text = text.as_bytes();
        let mut i = len;
        while i > 0 {
            buffer[i - 1] = text[i - 1];
            i -= 1;
        }

        InlineBuffer(buffer)
    }

It's as if the Atom::new_inline() aren't const.
(Note that in mov word, ptr, [rsp, +, 48], 28518, the number 28518 is for UTF-8 "of". These code blocks aren't for the case where we didn't have a match. There are dozens of such blocks, they each an inline of new_inline for a specific hardcoded value.)

I didn't test, but I wonder if moving the match to its own function that returns a &'static Atom would fix this.

I cloned your branch with the phf_map, and this isn't happening. I think it might thanks to KEYWORDS being static (maybe in combination with the value tuples being explicitly stored in an array?).
I'm surprised there's no noticeable improvement. I guess it doesn't outweigh #2.

2.
phf_map! is interesting.
It creates a (general) perfect hashing map for us with no effort, which is amazing:

phf::Map {
    key: 12913932095322966823u64,
    disps: &[
        (1u32, 45u32),
        (0u32, 0u32),
        (1u32, 37u32),
        (1u32, 22u32),
        (4u32, 54u32),
        (0u32, 1u32),
        (1u32, 50u32),
        (9u32, 76u32),
        (0u32, 29u32),
        (5u32, 43u32),
        (0u32, 0u32),
        (0u32, 1u32),
        (3u32, 5u32),
        (7u32, 64u32),
        (3u32, 5u32),
        (83u32, 20u32),
        (3u32, 69u32),
    ],
    entries: &[
        ("abstract", (Abstract, Atom::new_inline("abstract"))),
        ("implements", (Implements, Atom::new_inline("implements"))),
        ("false", (False, Atom::new_inline("false"))),
        ("do", (Do, Atom::new_inline("do"))),
        ("get", (Get, Atom::new_inline("get"))),
        ...
    ],

The hashing function used is SipHash which is the default in Rust (SipHasher13).
It has cryptographic benefits that aren't relevant for our use case (and, I imagine, for the majority of phf_map uses..), so using other hash functions could result in a speedup (however, the phf crate doesn't seem to have support for this).

A hand-written PHF map and hashing function tailored to this use case (and known assumptions: keys are finite (small) set of strings, length between 2 to 12) could definitely be faster.
Like I said initially - it doesn't seem like the perf gains would be meaningful, but it could be done "just for fun".

I'm still a little baffled at how the PR performs worse.
On one hand I'd expect most of the shorter keywords to be way more common (in which case, a handful of comparisons is less work than the hash), but on the other hand I'd expect most identifiers to be variable names that fall between 2 to 12 chars (and have to go through all of the comparisons, which I'm guessing is much more work than the hash?).

@Boshen

Boshen commented Mar 7, 2023

Copy link
Copy Markdown
Member Author

Are you doing this for fun or you think it matters? Lookups don't seem substantial in profiles.

I always have a gut feeling this can be improved, because @strager mentioned SIMD so I looked at it last night, and you showed me how to look at the assembly code.

This is the hottest path in our lexer and it's dominant on the profiler, around 2-5%.
I just couldn't sleep in peace when I know it is not optimized, I've been awoke since 5am 😭

So here are my findings:

  • Why does the const Atom::new_inline not compiling down to a compact representation? Is it because there's a loop inside the code? This part is beyond me.
  • Changing to &static seems to have a better performance, perf(lexer): try &static Atom for match_keyword  #143, but the assembly code look almost identical?

If I want to fine tune the string matching part:

  • Why is the matching compared string by string? Should I explicitly do:
    • branch by string length?
    • order the keywords by frequency they appear in JavaScript / TypeScript?
    • add an explicit check for TypeScript to match less keywords?

If I want to go the phf route:

  • I guess we need to find a better phf hashing alternative?

The best we can do is study and port the SIMD algorithm from @strager, but again, this is beyond my knowledge.

@YoniFei

YoniFei commented Mar 7, 2023

Copy link
Copy Markdown
Contributor

Why does the const Atom::new_inline not compiling down to a compact representation? Is it because there's a loop inside the code? This part is beyond me.

What do you mean?
As far as I can tell, the strings are compact - they live on the stack.

Changing to &static seems to have a better performance, #143, but the assembly code look almost identical?

You still see the bazillion inlined new_const?
Based on the improvement, I'm inclined to believe they aren't there (can take a look if needed).

Why is the matching compared string by string? Should I explicitly do:
branch by string length?
order the keywords by frequency they appear in JavaScript / TypeScript?
add an explicit check for TypeScript to match less keywords?

I'd expect it to give an improvement, yeah. Worth checking out.

I guess we need to find a better phf hashing alternative?

I searched around and didn't find anything.
Since I became vested, I might try to make something as a side project.
A basic one would just use a faster hash than phf-map. A more advanced one would also allow using a custom hash and key comparison, which would enable creating super-optimal solutions, like @strager's, per use-case.

The best we can do is study and port the SIMD algorithm

The only SIMD involved is for the string key comparison (after you hash and get an index you still need to compare the lookup key with the found key, because inputs can be arbitrary right? identifier names are arbitrary). That's keyword_lexer::key_strings_equal.

The real voodoo is static selection_type select(const char* key, std::size_t key_length), which efficiently takes the first and last two chars, which are unique in the keyword set, and uses those bits for the "dumb" hash algorithm.
It's brilliant.

@Boshen

Boshen commented Mar 7, 2023

Copy link
Copy Markdown
Member Author

Since I became vested, I might try to make something as a side project.

There's a PR sitting there for two years, and given the amount of "Rewrite it in Rust" projects going on, this would be a great and valuable project to work on.

The real voodoo is static selection_type select(const char* key, std::size_t key_length), which efficiently takes the first and last two chars, which are unique in the keyword set, and uses those bits for the "dumb" hash algorithm.
It's brilliant.

This technique can be found from Wojciech Muła's article on SIMD-friendly algorithms for substring searching.

@Boshen

Boshen commented Mar 7, 2023

Copy link
Copy Markdown
Member Author

So given all the things we can improve on, I'll walk up the ladder and tune this step by step.

Thank you so much for all the inputs, @YoniFeng and @strager, I learned so much from you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants