Skip to content

Improve boundary classification#17005

Merged
RobinMalfait merged 6 commits into
mainfrom
feat/classify-boundaries
Mar 7, 2025
Merged

Improve boundary classification#17005
RobinMalfait merged 6 commits into
mainfrom
feat/classify-boundaries

Conversation

@RobinMalfait

@RobinMalfait RobinMalfait commented Mar 6, 2025

Copy link
Copy Markdown
Member

This PR cleans up the boundary character checking by using similar classification techniques as we used for other classification problems.

For starters, this moves the boundary related items to its own file, next we setup the classification enum.

Last but not least, we removed } as an after boundary character, and instead handle that situation in the Ruby pre processor where we need it. This means the %w{flex} will still work in Ruby files.


This PR is a followup for #17001, the main goal is to clean up some of the boundary character checking code. The other big improvement is performance. Changing the boundary character checking to use a classification instead results in:

Took the best score of 10 runs each:

- CandidateMachine: Throughput: 311.96 MB/s
+ CandidateMachine: Throughput: 333.52 MB/s

So a ~20MB/s improvement.

Test plan

  1. Existing tests should pass. Due to the removal of } as an after boundary character, some tests are updated.
  2. Added new tests to ensure the Ruby pre processor still works as expected.

@RobinMalfait RobinMalfait requested a review from a team as a code owner March 6, 2025 23:44
This is a relatively small change, but with big impact:

```diff
- CandidateMachine: Throughput: 308.62 MB/s
+ CandidateMachine: Throughput: 324.34 MB/s
```

Almost 20 MB/s more throughput from this change alone.
We still need it for `%w{…}` in Ruby land, but let's solve that using
the pre processor instead.
@RobinMalfait RobinMalfait force-pushed the feat/classify-boundaries branch from 797ef23 to b3c597b Compare March 6, 2025 23:47
Comment thread crates/oxide/src/extractor/boundary.rs Outdated
Co-authored-by: Jordan Pittman <jordan@cryptica.me>
@RobinMalfait RobinMalfait enabled auto-merge (squash) March 6, 2025 23:58
@RobinMalfait RobinMalfait merged commit d0a9746 into main Mar 7, 2025
@RobinMalfait RobinMalfait deleted the feat/classify-boundaries branch March 7, 2025 00:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants