perf(parser): lexer skip single space before tokens #2029
overlookmotel wants to merge 1 commit into oxc-project:main
Conversation
CodSpeed Performance Report: merging #2029 will not alter performance.
May I refuse this PR due to the hard-to-understand code + unsafe + minor improvement?

If we really want to go down this route ... getting rid of the

Yes, you certainly may refuse it! As I mentioned, I also wasn't sure it was a good idea. But do you mind leaving this PR open for now? I may return to it and see if I can come up with a cleaner way of doing the same thing.

Sure :-)
Force-pushed from fa5b653 to 3018bf7
I came up with another version which is also branchless but easier to follow and less unsafe:

```rust
let next_byte = remaining.as_bytes()[0];
let skip = usize::from(next_byte == b' ');
// SAFETY: `skip` is 1 only if next byte is an ASCII space, so can safely slice off `skip` bytes.
// Otherwise, `skip` is 0 in which case this is a no-op.
unsafe { self.current.chars = remaining.get_unchecked(skip..).chars() };
```

Unfortunately, this is a few more instructions than my initial quite unsafe version, and so does not produce a benefit. In any case, the benefit of the original version has now been lessened due to other optimizations. So I'm going to close this.
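For context, the slicing idea above can also be written as a fully safe standalone function (the name `skip_single_space` is hypothetical, not from the PR). Slicing off 0 or 1 bytes is always on a char boundary when the skipped byte is an ASCII space, so the only thing `get_unchecked` buys over this is eliding the bounds check:

```rust
/// Branchless skip of at most one leading ASCII space (hypothetical sketch).
fn skip_single_space(remaining: &str) -> &str {
    // `skip` is 1 if the first byte is a space, 0 otherwise (including empty input).
    let skip = usize::from(remaining.as_bytes().first() == Some(&b' '));
    // Safe slicing: removing 0 bytes is a no-op, and removing 1 byte is valid
    // because an ASCII space is a complete char, so `skip..` is a char boundary.
    &remaining[skip..]
}
```

Note that only a single space is skipped; runs of whitespace still fall through to the lexer's normal whitespace handling.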
Am deleting the branch, so just to leave a record of what I was trying to do here, here's the original version:

```rust
fn read_next_token(&mut self) -> Kind {
    // Return `Kind::Eof` if end of file
    let remaining = self.current.chars.as_str();
    if remaining.is_empty() {
        self.current.token.start = self.offset();
        return Kind::Eof;
    }

    // Skip a single space.
    // It's very common for JS code to contain a single space between tokens (e.g. `x = y + 1`),
    // so handle this common case without branching.
    // We skip a space without branching by manually mutating the `Chars` iterator's pointer
    // - add 1 if next char is a space, or add 0 if it's not.
    // The unconditional add is what makes this branchless.
    let byte = remaining.as_bytes()[0];
    let increment = usize::from(byte == 32);

    // SAFETY: Depends on internal implementation details of `Chars<'a>`.
    // `Chars<'a>` wraps a `slice::Iter<'a, u8>`.
    // `slice::Iter<'a, u8>` has fields:
    // * `ptr`: pointer to current position in the string being iterated.
    // * `end_or_len`: pointer to end of the string being iterated.
    // Here we mutate value of the `ptr` field.
    // We handle these fields being in either order, and check that both fields are what's expected
    // for a sample string. So it's highly likely a change to internal implementation of `Chars<'a>`
    // would cause these checks to fail, rather than cause UB.
    // But it is theoretically possible this could become unsound if layout of `Chars<'a>` changed.
    unsafe {
        let p = std::ptr::addr_of_mut!(self.current.chars).cast::<usize>();
        // Offset pointer to point to `ptr` field of `slice::Iter<'a, u8>`.
        // This is const-folded down to either nothing, or just `let p = p.add(1);`.
        let ptr_offset = {
            let str = "xyz";
            #[allow(clippy::transmute_undefined_repr)]
            let parts = std::mem::transmute::<_, [*const u8; 2]>(str.chars());
            if parts[0] == str.as_ptr() {
                assert!(parts[1] == str.as_ptr().add(str.len()));
                0
            } else {
                assert!(parts[1] == str.as_ptr());
                assert!(parts[0] == str.as_ptr().add(str.len()));
                1
            }
        };
        let p = p.add(ptr_offset);
        *p += increment;
    }

    // Get next token, skipping over further whitespace
    // ...
}
```
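The layout probe in the SAFETY block above can be pulled out into a standalone sketch (the function name is hypothetical, not from the PR). It transmutes a `Chars` into two raw pointers and determines which one is the current-position `ptr` field. The transmute is only possible while `Chars<'_>` is exactly two pointers wide; if the standard library changed that size, this would fail to compile rather than silently misbehave, though it still relies on unspecified `repr(Rust)` layout:

```rust
/// Hypothetical standalone version of the PR's `Chars` layout probe.
/// Returns 0 if the current-position pointer is the first field, 1 if it's the second.
fn chars_ptr_field_index() -> usize {
    let s = "xyz";
    let start = s.as_ptr();
    // SAFETY: `start.add(s.len())` is one-past-the-end of `s`, which is valid to compute.
    let end = unsafe { start.add(s.len()) };
    // SAFETY (assumption): relies on `Chars<'_>` being exactly two pointer-sized
    // fields — a current-position pointer and an end pointer — in some order.
    #[allow(clippy::transmute_undefined_repr)]
    let parts = unsafe { std::mem::transmute::<std::str::Chars<'_>, [*const u8; 2]>(s.chars()) };
    if parts[0] == start {
        assert_eq!(parts[1], end);
        0
    } else {
        // Fields are in the opposite order.
        assert_eq!(parts[0], end);
        assert_eq!(parts[1], start);
        1
    }
}
```

The internal assertions are what the PR relied on to turn a hypothetical layout change into a loud failure instead of silent UB.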
It's very common for JS code to contain a single space between tokens (e.g. `x = y + 1`). This PR makes `read_next_token()` handle this common case by skipping a single space before tokenizing. The implementation is branchless. Produces a +1% speed-up on parser benchmarks.

However, I'm not sure if this is a good idea! (hence marking this PR as draft)

`std::str::Chars` doesn't have an API for doing this branchlessly, so the implementation uses some fairly gnarly unsafe code which relies on details of the internal implementation of `Chars`. This may well not be worth the +1%. But opening this PR as a basis for discussion.

The same optimization could be achieved safely by replacing the `Chars` iterator with an iterator over string bytes, and possibly that'd make other optimizations possible too.
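The byte-iterator alternative mentioned above could look something like this minimal sketch (the `Cursor` type and its methods are hypothetical, not oxc's actual lexer): a position index into `&[u8]` can be bumped branchlessly with safe code only, because an out-of-range index is simply never formed.

```rust
/// Hypothetical byte-based lexer cursor, sketching the safe alternative.
struct Cursor<'a> {
    bytes: &'a [u8],
    pos: usize,
}

impl<'a> Cursor<'a> {
    fn new(source: &'a str) -> Self {
        Self { bytes: source.as_bytes(), pos: 0 }
    }

    /// Branchless skip of at most one leading space before reading a token.
    fn skip_single_space(&mut self) {
        // `get` returns `None` at EOF, so the comparison never reads out of bounds,
        // and `pos` only advances past a byte that actually exists.
        self.pos += usize::from(self.bytes.get(self.pos) == Some(&b' '));
    }

    fn peek(&self) -> Option<u8> {
        self.bytes.get(self.pos).copied()
    }
}
```

With this representation, UTF-8 decoding only needs to happen for the (rare) non-ASCII bytes, which is presumably one of the "other optimizations" a byte iterator would open up.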