Conversation
Parser Benchmark Results - ubuntu-latest |
|
@YoniFeng I'm stuck optimizing this routine, do you have any trick up your sleeve? I know @strager has done some SIMD magic, but I think this is beyond my reach:
But we should at least get LLVM to hand us some finely tuned assembly code, or have @strager point us in the right direction 😁 |
| use std::fmt; | ||
|
|
||
| use oxc_ast::Atom; | ||
| use phf::phf_map; |
There was a problem hiding this comment.
phf is a terrible library in terms of performance. (In my benchmarking, it is easily beatable by a general purpose hash table with a custom hash function.) phf seems to be designed for large tables [1], not run-time performance.
If you want something off-the-shelf and easy, I suggest gperf. Unfortunately, I think gperf cannot generate Rust code. But you can have it generate C code which you wrap in Rust. Or you can add a Rust backend to gperf, or you can write a script which rewrites gperf's C output into Rust.
In quick-lint-js, switching from gperf to my custom keyword lookup function (without inline assembly) reduced jQuery parsing times by 4-5% on my Ryzen 9 5950X. Some benchmark data for just the lookup function (higher is better):
Rust std::collections::HashMap (default hash function): 31.6M lookups/s
Rust phf: 37.9M lookups/s
C++ std::unordered_map (default hash function): 33.9M lookups/s
C++ std::unordered_map (custom hash function): 45.3M lookups/s
gperf: 122.2M lookups/s
mine (production): (unknown; probably around 200M lookup/s)
mine (with inline assembly): 338.1M lookups/s 😉
(The "Rust std::collections::HashMap" variant used unsafe code because there was a mutable global variable. If you're allergic to unsafe, expect it to be slightly slower due to lazy_static overhead or Mutex overhead or something.)
[1] From the docs: "It currently uses the CHD algorithm and can generate a 100,000 entry map in roughly .4 seconds."
There was a problem hiding this comment.
Interesting.
I just looked at your code - it's exactly what I started imagining is possible.
Super efficient hash function based on the input assumptions (voodoo), and then the key comparison is SIMD string comparison.
Neat!
|
Are you doing this for fun or you think it matters? I've noticed 2 things so far: 1. pub fn match_keyword(s: &str) -> (Self, Atom) {
push r15
push r14
push r12
push rsi
push rdi
push rbx
sub rsp, 72
mov r12, r8
mov r15, rdx
mov r14, rcx
if len == 1 || len >= 12 {
cmp r8, 1
je .LBB2359_4
cmp r12, 11
ja .LBB2359_4
lea rax, [r12, -, 2]
cmp rax, 9
ja .LBB2359_124
lea rcx, [rip, +, .LJTI2359_0]
movsxd rax, dword, ptr, [rcx, +, 4*rax]
add rax, rcx
jmp rax
.LBB2359_16:
movzx eax, word, ptr, [r15]
cmp eax, 29537
"as" => (As, Atom::new_inline("as")),
je .LBB2359_127
movzx eax, word, ptr, [r15]
cmp eax, 28516
"do" => (Do, Atom::new_inline("do")),
je .LBB2359_128
movzx eax, word, ptr, [r15]
cmp eax, 26217
"if" => (If, Atom::new_inline("if")),
je .LBB2359_129
movzx eax, word, ptr, [r15]
cmp eax, 28265
"in" => (In, Atom::new_inline("in")),
je .LBB2359_130
movzx eax, word, ptr, [r15]
cmp eax, 29545
"is" => (Is, Atom::new_inline("is")),
je .LBB2359_23
movzx eax, word, ptr, [r15]
cmp eax, 26223
"of" => (Of, Atom::new_inline("of")),
jne .LBB2359_124Which gets more expensive the longer the string (i.e. once it's >4 chars, >8 chars etc) The jumps are to blocks that look like this .LBB2359_132:
buffer[i - 1] = text[i - 1]; (C:\Users\yfeig\.cargo\registry\src\github.com-1ecc6299db9ec823\compact_str-0.7.0\src\repr\inline.rs:63)
xorps xmm0, xmm0
movups xmmword, ptr, [rsp, +, 51], xmm0
mov dword, ptr, [rsp, +, 67], 0
mov word, ptr, [rsp, +, 48], 28518
mov byte, ptr, [rsp, +, 50], 114
InlineBuffer(buffer) (C:\Users\yfeig\.cargo\registry\src\github.com-1ecc6299db9ec823\compact_str-0.7.0\src\repr\inline.rs:67)
mov rax, qword, ptr, [rsp, +, 63]
mov qword, ptr, [r14, +, 23], rax
movzx eax, word, ptr, [rsp, +, 48]
mov word, ptr, [r14, +, 8], ax
movzx eax, byte, ptr, [rsp, +, 50]
mov byte, ptr, [r14, +, 10], al
mov rax, qword, ptr, [rsp, +, 51]
mov qword, ptr, [r14, +, 11], rax
mov eax, dword, ptr, [rsp, +, 59]
mov dword, ptr, [r14, +, 19], eax
movzx eax, byte, ptr, [rsp, +, 63]
mov byte, ptr, [r14, +, 23], alThis doesn't seem right, because that's the code for pub const fn new_const(text: &str) -> Self {
if text.len() > MAX_SIZE {
panic!("Provided string has a length greater than our MAX_SIZE");
}
let len = text.len();
let mut buffer = [0u8; MAX_SIZE];
// set the length
buffer[MAX_SIZE - 1] = len as u8 | LENGTH_MASK;
// Note: for loops aren't allowed in `const fn`, hence the while.
// Note: Iterating forward results in badly optimized code, because the compiler tries to
// unroll the loop.
let text = text.as_bytes();
let mut i = len;
while i > 0 {
buffer[i - 1] = text[i - 1];
i -= 1;
}
InlineBuffer(buffer)
}It's as if the I didn't test, but I wonder if moving the I cloned your branch with the phf_map, and this isn't happening. I think it might thanks to 2. phf::Map {
key: 12913932095322966823u64,
disps: &[
(1u32, 45u32),
(0u32, 0u32),
(1u32, 37u32),
(1u32, 22u32),
(4u32, 54u32),
(0u32, 1u32),
(1u32, 50u32),
(9u32, 76u32),
(0u32, 29u32),
(5u32, 43u32),
(0u32, 0u32),
(0u32, 1u32),
(3u32, 5u32),
(7u32, 64u32),
(3u32, 5u32),
(83u32, 20u32),
(3u32, 69u32),
],
entries: &[
("abstract", (Abstract, Atom::new_inline("abstract"))),
("implements", (Implements, Atom::new_inline("implements"))),
("false", (False, Atom::new_inline("false"))),
("do", (Do, Atom::new_inline("do"))),
("get", (Get, Atom::new_inline("get"))),
...
],The hashing function used is SipHash which is the default in Rust ( A hand-written PHF map and hashing function tailored to this use case (and known assumptions: keys are finite (small) set of strings, length between 2 to 12) could definitely be faster. I'm still a little baffled at how the PR performs worse. |
I always have a gut feeling this can be improved, because @strager mentioned SIMD so I looked at it last night, and you showed me how to look at the assembly code. This is the hottest path in our lexer and it's dominant on the profiler, around 2-5%. So here are my findings:
If I want to fine tune the string matching part:
If I want to go the phf route:
The best we can do is study and port the SIMD algorithm from @strager, but again, this is beyond my knowledge. |
What do you mean?
You still see the bazillion inlined
I'd expect it to give an improvement, yeah. Worth checking out.
I searched around and didn't find anything.
The only SIMD involved is for the string key comparison (after you hash and get an index you still need to compare the lookup key with the found key, because inputs can be arbitrary right? identifier names are arbitrary). That's The real voodoo is |
There's a PR sitting there for two years, and given the amount of "Rewrite it in Rust" projects going on, this would be a great and valuable project to work on.
This technique can be found from Wojciech Muła's article on SIMD-friendly algorithms for substring searching. |
|
So given all the things we can improve on, I'll walk up the ladder and tune this step by step. Thank you so much for all the inputs, @YoniFeng and @strager, I learned so much from you. |
No description provided.