Add JSON pre-processor #17125

Merged
RobinMalfait merged 2 commits into main from feat/add-json-pre-processor
Mar 11, 2025
Member

@RobinMalfait RobinMalfait commented Mar 11, 2025

This PR adds a small JSON pre-processor to speed up the parsing of JSON files. The extractor creates "sub machines" whenever it encounters a [ or a { in the input. We do this because of things like %w[…] strings in Ruby or className={clsx({flex: true})} in JSX.

Due to the sheer number of potential [ and ] brackets in a JSON file, parsing it can be much slower than it needs to be.

To tackle this, after this PR, when given an input like this:

[1,[2,[3,4,["flex flex-1 content-['hello_world']"]]], {"flex": true}]

We'll preprocess the structural brackets and braces (the ones outside of strings) by replacing them with spaces, so the extractor doesn't need any special casing:

1, 2, 3,4, "flex flex-1 content-['hello_world']" , "flex": true
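
As a rough illustration of the idea, here is a minimal sketch of such a pre-processor in Rust. It is not the code from this PR: the function name `preprocess_json` and the byte-by-byte string tracking are assumptions made for the example. It blanks out `[`, `]`, `{` and `}` outside of string literals and leaves everything else, including brackets inside strings, untouched.

```rust
/// Hypothetical sketch (not the actual implementation): replace structural
/// JSON brackets and braces with spaces, keeping string contents as-is.
fn preprocess_json(input: &[u8]) -> Vec<u8> {
    let mut out = input.to_vec();
    let mut in_string = false;
    let mut i = 0;

    while i < out.len() {
        let byte = out[i];
        match byte {
            // Inside a string, a backslash escapes the next byte (e.g. `\"`),
            // so skip both bytes without touching the string state.
            b'\\' if in_string => {
                i += 2;
                continue;
            }
            // An unescaped double quote opens or closes a string.
            b'"' => in_string = !in_string,
            // Structural brackets and braces become spaces, which keeps the
            // byte length (and therefore any positions) unchanged.
            b'[' | b']' | b'{' | b'}' if !in_string => out[i] = b' ',
            _ => {}
        }
        i += 1;
    }

    out
}
```

Because each bracket is replaced by a single space, the output has the same length as the input, and arbitrary values such as `content-['hello_world']` survive because they sit inside a string.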

We saw this while debugging this issue: #17092

Test plan

  1. Added a test to verify that the preprocessing works
  2. Existing tests still pass

@RobinMalfait RobinMalfait requested a review from a team as a code owner March 11, 2025 12:06
@RobinMalfait RobinMalfait force-pushed the feat/add-json-pre-processor branch from c0ed038 to 08b0203 March 11, 2025 12:06
@RobinMalfait RobinMalfait force-pushed the feat/add-json-pre-processor branch from 3fe4791 to a45a6c4 March 11, 2025 12:12
@RobinMalfait RobinMalfait enabled auto-merge (squash) March 11, 2025 12:15
@RobinMalfait RobinMalfait merged commit 9ddeb09 into main Mar 11, 2025
5 checks passed
@RobinMalfait RobinMalfait deleted the feat/add-json-pre-processor branch March 11, 2025 12:15
RobinMalfait pushed a commit that referenced this pull request Mar 26, 2026
## Summary

This specializes the `.jsonl` and `.ndjson` file extensions so they're
preprocessed like JSON instead of by the standard scanner. This prevents
them from creating thousands of sub machines and reduces scanning time
(see #17125 where this was done for `.json` files).
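
One way to picture the extension handling is the sketch below. It is only an assumption about the shape of such a dispatch: the `PreProcessor` enum and the `pre_processor_for` function are hypothetical names, not identifiers from the `oxide` crate.

```rust
use std::path::Path;

/// Hypothetical classification of how a file is treated before extraction.
#[derive(Debug, PartialEq)]
enum PreProcessor {
    /// Blank out structural brackets and braces first.
    Json,
    /// Hand the file to the standard scanner unchanged.
    Passthrough,
}

/// Pick a pre-processor based on the file extension (assumed dispatch).
fn pre_processor_for(path: &Path) -> PreProcessor {
    match path.extension().and_then(|ext| ext.to_str()) {
        Some("json") | Some("jsonl") | Some("ndjson") => PreProcessor::Json,
        _ => PreProcessor::Passthrough,
    }
}
```

Since replacing brackets with spaces never touches newline characters, a JSON-style pre-processor can plausibly be applied to a whole newline-delimited file without splitting it into records first.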

It seems reasonable to handle newline-delimited JSON files as well;
otherwise, scanning these files can take quite a long time.

It's quite unlikely that these will contain classes so, alternatively,
these *could* go in the binary extensions list so they get ignored
entirely.

## Test plan

I ran manual tests inside the `oxide` crate against some large-ish JSONL
files (5MB–15MB). These changes bring down scanning time from 2s–3s on
my M3 Max (via `cargo test --release …`) to less than 20ms.

I also ran tests through a full CLI build pipeline on a low-spec Linux
box. This change brought scanning time down from ~90s to ~300ms for a
single ~15MB file.