Copied from #46 (comment):
Related to #193.
The other thing that's slow with strings is == comparison. This affects both == comparison itself, and hash table lookups (since they perform a == comparison too).
The problem
The problem is that you're not allowed to read out of bounds of the string so, as I understand it, string comparison for short strings (which is common for JS identifiers) is either implemented as a byte-by-byte comparison (slow), or is implemented roughly like this:
fn eq(a: &str, b: &str) -> bool {
if a.len() != b.len() {
return false;
}
let len = a.len();
let a_ptr = a.as_ptr();
let b_ptr = b.as_ptr();
if len >= 8 {
// Omitted for brevity
} else if len == 0 {
true
} else if len >= 4 {
a_ptr.cast::<u32>().read_unaligned() == b_ptr.cast::<u32>().read_unaligned()
&& a_ptr.add(len - 4).cast::<u32>().read_unaligned() == b_ptr.add(len - 4).cast::<u32>().read_unaligned()
} else if len >= 2 {
a_ptr.cast::<u16>().read_unaligned() == b_ptr.cast::<u16>().read_unaligned()
&& a_ptr.add(len - 2).cast::<u16>().read_unaligned() == b_ptr.add(len - 4).cast::<u16>().read_unaligned()
} else {
a_ptr.read() == b_ptr.read()
}
}
This involves lots of reads, and lots of branches. And as strings are of varying lengths in a pretty random pattern, it's terrible for the branch predictor - it won't be able to correctly predict the branch.
It also involves a function call into memcmp (the string comparison function from libc which does all this).
A more efficient solution
If we made sure that all strings are stored in a location where:
- The string is followed by at least
8 - (len % 8) initalized bytes (i.e. abc has at least 5 more initialized bytes after it).
- Those following bytes are never aliased by a
&mut ref, so safe to read.
...then we can always read a batch of 8 bytes, and we just mask off the ones that aren't part of the string when comparing.
fn eq(a: Ident<'_>, b: Ident<'_>) -> bool {
if a.len() != b.len() {
return false;
}
let len = a.len();
if len >= 8 {
// Omitted for brevity
} else {
let a_block = a.as_ptr().cast::<u64>().read_unaligned();
let b_block = b.as_ptr().cast::<u64>().read_unaligned();
// XOR. Any bytes which are equal = 0
let mut diff = a_block ^ b_block;
// Shift bytes which are not part of the string off the top end
diff <<= (8 - len) * 8;
// If all bytes in diff are 0, strings are equal
diff == 0
}
}
https://godbolt.org/z/o8x33Pa7d
How would we achieve the requirements?
The requirements are already met for all strings in source text except those which end less than 8 bytes before of the file.
Handle those last ones by: In lexer/parser, copy any strings/identifiers which end within 8 bytes of end of the file to elsewhere in the arena and write padding bytes after them. The lexer already has separate code paths for handling strings/identifiers in last chunk of the file, so it wouldn't affect the fast path.
All other strings which are allocated elsewhere in the arena: Write padding bytes after them to make the length up to a multiple of 8.
Or, even better: Store all strings which aren't part of source text in a different part of the arena from other data, tightly packed. So then each string has another string directly after it, satisfying the "must have initialized bytes after" requirement. We only deal in immutable strings, so none of that data can be aliased by a &mut reference. Then you only need to write padding bytes after the last string.
Future extension - SIMD
With SIMD AVX2 ops could compare 2 strings of up to 32 bytes length with only 6 instructions and 1 very predictable branch (with AVX-512 this goes up to 64 bytes with same number of ops). Very few identifiers are longer than 32 bytes, so this fast path would cover the vast majority of cases.
Copied from #46 (comment):
Related to #193.
The other thing that's slow with strings is
==comparison. This affects both==comparison itself, and hash table lookups (since they perform a==comparison too).The problem
The problem is that you're not allowed to read out of bounds of the string so, as I understand it, string comparison for short strings (which is common for JS identifiers) is either implemented as a byte-by-byte comparison (slow), or is implemented roughly like this:
This involves lots of reads, and lots of branches. And as strings are of varying lengths in a pretty random pattern, it's terrible for the branch predictor - it won't be able to correctly predict the branch.
It also involves a function call into
memcmp(the string comparison function fromlibcwhich does all this).A more efficient solution
If we made sure that all strings are stored in a location where:
8 - (len % 8)initalized bytes (i.e.abchas at least 5 more initialized bytes after it).&mutref, so safe to read....then we can always read a batch of 8 bytes, and we just mask off the ones that aren't part of the string when comparing.
https://godbolt.org/z/o8x33Pa7d
How would we achieve the requirements?
The requirements are already met for all strings in source text except those which end less than 8 bytes before of the file.
Handle those last ones by: In lexer/parser, copy any strings/identifiers which end within 8 bytes of end of the file to elsewhere in the arena and write padding bytes after them. The lexer already has separate code paths for handling strings/identifiers in last chunk of the file, so it wouldn't affect the fast path.
All other strings which are allocated elsewhere in the arena: Write padding bytes after them to make the length up to a multiple of 8.
Or, even better: Store all strings which aren't part of source text in a different part of the arena from other data, tightly packed. So then each string has another string directly after it, satisfying the "must have initialized bytes after" requirement. We only deal in immutable strings, so none of that data can be aliased by a
&mutreference. Then you only need to write padding bytes after the last string.Future extension - SIMD
With SIMD AVX2 ops could compare 2 strings of up to 32 bytes length with only 6 instructions and 1 very predictable branch (with AVX-512 this goes up to 64 bytes with same number of ops). Very few identifiers are longer than 32 bytes, so this fast path would cover the vast majority of cases.