Skip to content

perf(parser): lexer skip single space before tokens#2029

Closed
overlookmotel wants to merge 1 commit intooxc-project:mainfrom
overlookmotel:lexer-space
Closed

perf(parser): lexer skip single space before tokens#2029
overlookmotel wants to merge 1 commit intooxc-project:mainfrom
overlookmotel:lexer-space

Conversation

@overlookmotel
Copy link
Copy Markdown
Member

@overlookmotel overlookmotel commented Jan 14, 2024

It's very common for JS code to contain a single space between tokens (e.g. x = y + 1).

This PR makes read_next_token() handle this common case, by skipping a single space before tokenizing. The implementation is branchless.

Produces +1% speed up on parser benchmarks.

However, I'm not sure if this is a good idea! (hence marking this PR as draft)

std::str::Chars doesn't have an API for doing this branchlessly, so implementation uses some fairly gnarly unsafe code which relies on details of the internal implementation of Chars.

This may well not be worth the +1%. But opening this PR as a basis for discussion.

The same optimization could be achieved safely by replacing Chars iterator with an iterator over string bytes, and possibly that'd make other optimizations possible too.

@overlookmotel overlookmotel marked this pull request as draft January 14, 2024 17:00
@github-actions github-actions bot added the A-parser Area - Parser label Jan 14, 2024
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq bot commented Jan 14, 2024

CodSpeed Performance Report

Merging #2029 will not alter performance

Comparing overlookmotel:lexer-space (3018bf7) with main (7711f7a)

Summary

✅ 14 untouched benchmarks

@Boshen
Copy link
Copy Markdown
Member

Boshen commented Jan 15, 2024

May I refuse this PR due to the hard to understand code + unsafe+ minor improvement?

@Boshen
Copy link
Copy Markdown
Member

Boshen commented Jan 15, 2024

If we really want go down this route ... getting rid of the Chars interface and work with raw bytes may be a better idea, but that'll be a huge change which will make the code look like C 😅

@overlookmotel
Copy link
Copy Markdown
Member Author

Yes you certainly may refuse it! As I mentioned, I also wasn't sure it was a good idea. But do you mind leaving this PR open for now? I may return to it and see if I can come up with a cleaner way of doing the same thing.

@Boshen
Copy link
Copy Markdown
Member

Boshen commented Jan 15, 2024

Sure :-)

@overlookmotel
Copy link
Copy Markdown
Member Author

I came up with another version which is also branchless but easier to follow and less unsafe:

let next_byte = remaining.as_bytes()[0];
let skip = usize::from(next_byte == b' ');
// SAFETY: `skip` is 1 only if next byte is an ASCII space, so can safely slice off `skip` bytes.
// Otherwise, `skip` is 0 in which case this is a no-op.
unsafe { self.current.chars = remaining.get_unchecked(skip..).chars() };

Unfortunately, this is a few more instructions than my initial quite unsafe version, and so does not produce a benefit.

In any case, the benefit of the original version has now been lessened due to other optimizations to read_next_token() made since I originally opened this.

So I'm going to close this.

@overlookmotel
Copy link
Copy Markdown
Member Author

Am deleting the branch, so just to leave a record of what I was trying to do here, here's the original version:

fn read_next_token(&mut self) -> Kind {
    // Return `Kind::Eof` if end of file
    let remaining = self.current.chars.as_str();
    if remaining.is_empty() {
        self.current.token.start = self.offset();
        return Kind::Eof;
    }

    // Skip a single space.
    // It's very common for JS code to contain a single space between tokens (e.g. `x = y + 1`),
    // so handle this common case without branching.
    // We skip a space without branching by manually mutating the `Chars` iterator's pointer
    // - add 1 if next char is a space, or add 0 if it's not.
    // The unconditional add is what makes this branchless.
    let byte = remaining.as_bytes()[0];
    let increment = usize::from(byte == 32);
    // SAFETY: Depends on internal implementation details of `Chars<'a>`.
    // `Chars<'a>` wraps a `slice::Iter<'a, u8>`.
    // `slice::Iter<'a, u8>` has fields:
    //   * `ptr`: pointer to current position in the string being iterated.
    //   * `end_or_len`: pointer to end of the string being iterated.
    // Here we mutate value of the `ptr` field.
    // We handle these fields being in either order, and check that both fields are what's expected
    // for a sample string. So highly likely a change to internal implementation of `Chars<'a>`
    // would cause these checks to fail, rather than cause UB.
    // But it is theoretically possible this could become unsound if layout of `Chars<'a>` changed.
    unsafe {
        let p = std::ptr::addr_of_mut!(self.current.chars).cast::<usize>();

        // Offset pointer to point to `ptr` field of `slice::Iter<'a, u8>`.
        // This is const-folded down to either nothing, or just `let p = p.add(1);`.
        let ptr_offset = {
            let str = "xyz";
            #[allow(clippy::transmute_undefined_repr)]
            let parts = std::mem::transmute::<_, [*const u8; 2]>(str.chars());
            if parts[0] == str.as_ptr() {
                assert!(parts[1] == str.as_ptr().add(str.len()));
                0
            } else {
                assert!(parts[1] == str.as_ptr());
                assert!(parts[0] == str.as_ptr().add(str.len()));
                1
            }
        };
        let p = p.add(ptr_offset);

        *p += increment;
    }

    // Get next token, skipping over further whitespace
    // ...
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-parser Area - Parser

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants