JSON_EXTRACT: zero-copy byte slicing for object, array, and number extraction #143702

Merged
quackaplop merged 7 commits into elastic:main from quackaplop:feature/json-extract-byte-slicing
Mar 6, 2026

@quackaplop
Contributor

Summary

Optimizes JSON_EXTRACT to use zero-copy byte slicing instead of copyCurrentStructure()
re-serialization when extracting objects, arrays, and numbers from JSON input. This builds on
the byte offset API exposed in #143501.

What changed

Object/array extraction: Previously, extracting a nested object or array walked every token
in the subtree and rebuilt JSON from scratch via XContentBuilder.copyCurrentStructure(). Now it
slices bytes directly from the input buffer using getTokenLocation().byteOffset() →
skipChildren() → getCurrentLocation().byteOffset(). Zero allocation, zero re-parsing.
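
The slicing idea can be sketched without any Jackson or Elasticsearch classes. In this minimal, illustrative version, `Slice` is a hypothetical stand-in for BytesRef, and `endOfValue` is a hand-rolled scanner that plays the role of skipChildren() followed by getCurrentLocation().byteOffset(). The PR itself takes its offsets from the parser rather than from a scanner like this one.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: Slice and endOfValue are hypothetical stand-ins, not Elasticsearch APIs. */
final class ByteSliceSketch {

    /** A view over part of a byte array; plays the role BytesRef plays in the PR. */
    record Slice(byte[] bytes, int offset, int length) {
        String utf8() {
            return new String(bytes, offset, length, StandardCharsets.UTF_8);
        }
    }

    /**
     * Returns the offset just past the object or array that starts at {@code start}.
     * Tracks bracket depth and skips string contents so brackets inside strings
     * are not counted. Assumes well-formed JSON.
     */
    static int endOfValue(byte[] json, int start) {
        int depth = 0;
        boolean inString = false;
        for (int i = start; i < json.length; i++) {
            byte b = json[i];
            if (inString) {
                if (b == '\\') {
                    i++; // skip the escaped character
                } else if (b == '"') {
                    inString = false;
                }
            } else if (b == '"') {
                inString = true;
            } else if (b == '{' || b == '[') {
                depth++;
            } else if (b == '}' || b == ']') {
                depth--;
                if (depth == 0) {
                    return i + 1;
                }
            }
        }
        throw new IllegalArgumentException("unterminated value at offset " + start);
    }

    /** Describes the object or array starting at {@code start} without copying its bytes. */
    static Slice sliceValue(byte[] json, int start) {
        return new Slice(json, start, endOfValue(json, start) - start);
    }

    public static void main(String[] args) {
        byte[] doc = "{\"a\":{\"b\":[1,\"}\"]},\"c\":2}".getBytes(StandardCharsets.UTF_8);
        Slice nested = sliceValue(doc, 5); // offset of the '{' that follows "a":
        System.out.println(nested.utf8());
    }
}
```

The returned `Slice` shares the input array, so the caller must not mutate or recycle that array while the slice is still in use.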

Number extraction: Previously this called parser.text(), which makes Jackson convert the number
to a Java String, and then wrapped the result in a BytesRef. It now byte-slices the number
literal directly from the input array, avoiding the String allocation entirely.
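
A rough sketch of the difference, using only the JDK. The helper names (`endOfNumber`, `viaString`, `viaSlice`) are hypothetical and exist only for illustration; the PR obtains the literal's bounds from the parser's byte offsets rather than by scanning.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: these helpers are hypothetical, not part of Jackson or Elasticsearch. */
final class NumberSliceSketch {

    /** Returns the offset just past the JSON number literal starting at {@code start}. */
    static int endOfNumber(byte[] json, int start) {
        int i = start;
        while (i < json.length) {
            byte b = json[i];
            boolean partOfNumber = (b >= '0' && b <= '9')
                || b == '-' || b == '+' || b == '.' || b == 'e' || b == 'E';
            if (!partOfNumber) {
                break;
            }
            i++;
        }
        return i;
    }

    /** Old shape: decode the literal to a String, then encode it back to bytes. */
    static byte[] viaString(byte[] json, int start) {
        int length = endOfNumber(json, start) - start;
        String text = new String(json, start, length, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** New shape: describe the literal as {offset, length} into the original array. */
    static int[] viaSlice(byte[] json, int start) {
        return new int[] { start, endOfNumber(json, start) - start };
    }

    public static void main(String[] args) {
        byte[] doc = "{\"price\":-12.5e3,\"n\":1}".getBytes(StandardCharsets.UTF_8);
        int[] slice = viaSlice(doc, 9); // offset of '-' after "price":
        System.out.println(new String(doc, slice[0], slice[1], StandardCharsets.UTF_8));
    }
}
```

Both paths yield the same bytes; the sliced path simply never materializes the intermediate String.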

Boolean extraction: Reuses static TRUE_BYTES / FALSE_BYTES constants instead of
allocating a new BytesRef("true") / BytesRef("false") per call.
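
The constant-reuse pattern looks roughly like this. The sketch uses plain `byte[]` constants as a stand-in for the BytesRef constants named above, and the method names `shared` and `perCall` are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: byte[] stands in for BytesRef; method names are hypothetical. */
final class BooleanConstantsSketch {

    // Encoded once at class initialization and shared by every call.
    private static final byte[] TRUE_BYTES = "true".getBytes(StandardCharsets.UTF_8);
    private static final byte[] FALSE_BYTES = "false".getBytes(StandardCharsets.UTF_8);

    /** New shape: returns one of two shared instances, no allocation per call. */
    static byte[] shared(boolean value) {
        return value ? TRUE_BYTES : FALSE_BYTES;
    }

    /** Old shape: builds a String and a fresh byte array on every call. */
    static byte[] perCall(boolean value) {
        return Boolean.toString(value).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(shared(true) == shared(true));   // same instance each time
        System.out.println(perCall(true) == perCall(true)); // distinct instances
    }
}
```

Sharing only works if consumers treat the returned bytes as read-only, since every caller sees the same array.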

Navigation refactoring: Replaced recursive descent (extractValue → navigateObject →
extractValue → ...) with an iterative loop. Navigation methods are now pure parser-positioning
helpers that don't need the byte-slicing context, keeping raw byte access confined to the
extraction point.
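
As a simplified illustration of the recursive-to-iterative change, the loop below walks a list of path segments over an already-parsed tree of `Map` and `List` values. This is an assumption made to keep the sketch self-contained: the PR positions a streaming parser instead of walking a materialized tree, and `navigate` is a hypothetical name.

```java
import java.util.List;
import java.util.Map;

/** Illustrative only: walks a materialized tree, whereas the PR positions a streaming parser. */
final class IterativeNavigationSketch {

    /**
     * Resolves each segment in turn: String segments index into maps,
     * Integer segments index into lists. Returns null if any step fails.
     */
    static Object navigate(Object root, List<Object> segments) {
        Object current = root;
        for (Object segment : segments) {
            if (current instanceof Map<?, ?> map && segment instanceof String key) {
                current = map.get(key);
            } else if (current instanceof List<?> list && segment instanceof Integer index) {
                current = (index >= 0 && index < list.size()) ? list.get(index) : null;
            } else {
                return null; // segment type does not match the current container
            }
            if (current == null) {
                return null; // missing key or out-of-range index
            }
        }
        return current;
    }

    public static void main(String[] args) {
        Object doc = Map.of("a", Map.of("b", List.of(10, 20, 30)));
        System.out.println(navigate(doc, List.<Object>of("a", "b", 1)));
    }
}
```

The loop only moves a cursor; whatever happens to the value once it is found (slicing, copying) stays outside the navigation code, which mirrors the separation described above.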

Non-JSON _source formats (SMILE/CBOR/YAML) fall back to copyCurrentStructure().

Benchmarks

Also adds json_extract and json_extract_object scenarios to EvalBenchmark, and a
dedicated JsonExtractBenchmark with 10 scenarios through the full eval pipeline (EvalMapper →
Layout → Page → Evaluator).

Environment: Apple M3 Max, JDK 25.0.1, JMH 1.37, warmup 3×2s, measurement 5×2s.

| Scenario | Before (ns/op) | After (ns/op) | Change |
|---|---|---|---|
| small_object (30B) | 222.0 ± 2.8 | 115.9 ± 3.1 | -47.8% |
| medium_object (500B) | 1,275.9 ± 27.0 | 662.2 ± 15.7 | -48.1% |
| large_object (4KB) | 24,531.3 ± 1,641 | 15,938.0 ± 721 | -35.0% |
| large_nested_extract (10KB doc) | 12,323.1 ± 458 | 6,664.0 ± 180 | -45.9% |
| array_of_objects ([25] of 50) | 4,253.0 ± 76 | 3,853.5 ± 68 | -9.4% |
| nested_scalar (5 levels) | 206.2 ± 4.9 | 178.9 ± 3.1 | -13.2% |
| deep_nesting (10 levels) | 478.7 ± 54 | 324.9 ± 10.7 | -32.1% |
| number | 160.0 ± 2.4 | 133.1 ± 3.2 | -16.8% |
| boolean | 106.0 ± 2.1 | 100.6 ± 2.5 | -5.1% |
| string | 107.6 ± 2.8 | 103.1 ± 2.0 | -4.2% |

Largest wins on object/array extraction (35–48%) where copyCurrentStructure was the hot path.
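
For readers checking the numbers, the Change column is consistent with the relative difference of the mean timings, (after - before) / before. A small sketch of that arithmetic (the class and method names are hypothetical):

```java
import java.util.Locale;

/** Illustrative only: recomputes the Change column from the mean ns/op values. */
final class ChangeColumnSketch {

    /** Relative change in percent; negative means the "after" run is faster. */
    static double percentChange(double beforeNs, double afterNs) {
        return (afterNs - beforeNs) / beforeNs * 100.0;
    }

    public static void main(String[] args) {
        System.out.println(String.format(Locale.ROOT, "small_object: %.1f%%", percentChange(222.0, 115.9)));
        System.out.println(String.format(Locale.ROOT, "medium_object: %.1f%%", percentChange(1275.9, 662.2)));
    }
}
```

This uses only the reported means and ignores the ± error terms, so for rows where the intervals are wide relative to the difference, the percentage should be read as approximate.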

Relates #142873

…traction

Replace copyCurrentStructure() re-serialization with zero-copy byte
slicing for JSON input. When the extracted value is an object, array,
or number, slice bytes directly from the input buffer using
XContentLocation.byteOffset() offsets (exposed in elastic#143501).

Also refactors navigation from recursive descent to iterative loop,
confining raw byte access to the extraction point. Adds JMH benchmarks
for JSON_EXTRACT through the full eval pipeline.

@elasticsearchmachine added the v9.4.0 and Team:Analytics (meta label for analytical engine team: ESQL/Aggs/Geo) labels Mar 5, 2026
@elasticsearchmachine
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@quackaplop quackaplop requested a review from nik9000 March 5, 2026 23:53

@nik9000 (Member) left a comment

Asked a few questions, mostly "for later". The only one we should resolve before merging is: do we want to leave the benchmark in EvalBenchmark now that we have a big fancy one?

"boolean",
"string" }
)
public String scenario;

Member

Do we want both of these benchmarks? The one in EvalBenchmark and here.

Member

I suppose this is much more specific. I'd be fine having just this one.

Contributor Author

This one is more broad, yeah

  switch (token) {
- case VALUE_STRING, VALUE_NUMBER -> builder.appendBytesRef(new BytesRef(parser.text()));
- case VALUE_BOOLEAN -> builder.appendBytesRef(new BytesRef(Boolean.toString(parser.booleanValue())));
+ case VALUE_STRING -> builder.appendBytesRef(new BytesRef(parser.text()));

Member

Can this one use the slice too? That'd save a bunch of allocations and potentially the utf-8 -> utf-16 -> utf-8 juggling.

Member

You can do this even if the original format isn't json - just so long as it leaves the underlying stuff in utf-8. Which, I think, all of our supported formats do.

Anyway, this can wait for later I think.
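
A minimal sketch of what that string-slicing follow-up could look like, assuming UTF-8 input. The helper `sliceIfNoEscapes` is hypothetical and not part of this PR. One caveat the sketch makes explicit: a JSON string containing backslash escapes cannot be used verbatim, because its raw bytes differ from its decoded value, so such strings would still need the decoding path.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: a hypothetical follow-up, not code from this PR. */
final class StringSliceSketch {

    /**
     * Returns {offset, length} of the content between the quotes of the string that
     * starts at {@code openingQuote}, or null if the content contains a backslash
     * escape and therefore has to be decoded instead of sliced.
     */
    static int[] sliceIfNoEscapes(byte[] json, int openingQuote) {
        int contentStart = openingQuote + 1;
        int i = contentStart;
        while (i < json.length && json[i] != '"') {
            if (json[i] == '\\') {
                return null; // escaped content: raw bytes are not the decoded value
            }
            i++;
        }
        if (i >= json.length) {
            throw new IllegalArgumentException("unterminated string at offset " + openingQuote);
        }
        return new int[] { contentStart, i - contentStart };
    }

    public static void main(String[] args) {
        byte[] doc = "{\"k\":\"h\u00e9llo\"}".getBytes(StandardCharsets.UTF_8);
        int[] slice = sliceIfNoEscapes(doc, 5); // offset of the opening quote of the value
        System.out.println(new String(doc, slice[0], slice[1], StandardCharsets.UTF_8));
    }
}
```

Multi-byte UTF-8 characters pass through untouched because the scan only looks for the ASCII bytes `"` and `\`, which never occur inside a multi-byte sequence.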

Contributor Author

ooooh good point

if (tokenLocation.hasValidByteOffset() && currentLocation.hasValidByteOffset()) {
int start = (int) tokenLocation.byteOffset() + rawOffset;
int end = (int) currentLocation.byteOffset() + rawOffset;
builder.appendBytesRef(new BytesRef(rawBytes, start, end - start));

Member

This trick should work for yaml as well. And you could do a different trick for cbor et al. I suppose the question is, "do we use those frequently enough for it to be worth it?" Also for later.

Contributor Author

++

@quackaplop quackaplop merged commit a9ad9f5 into elastic:main Mar 6, 2026
36 checks passed
@quackaplop quackaplop requested a review from nik9000 March 6, 2026 15:05
sidosera pushed a commit to sidosera/elasticsearch that referenced this pull request Mar 6, 2026
…traction (elastic#143702)

* JSON_EXTRACT: zero-copy byte slicing for object, array, and number extraction

Replace copyCurrentStructure() re-serialization with zero-copy byte
slicing for JSON input. When the extracted value is an object, array,
or number, slice bytes directly from the input buffer using
XContentLocation.byteOffset() offsets (exposed in elastic#143501).

Also refactors navigation from recursive descent to iterative loop,
confining raw byte access to the extraction point. Adds JMH benchmarks
for JSON_EXTRACT through the full eval pipeline.

* Add changelog for elastic#143702

* [CI] Auto commit changes from spotless

* Clean up navigation helpers to avoid threading unused parameters

Navigation methods now only position the parser — they no longer carry
builder, segments, depth, rawBytes, or rawOffset.

* Use full variable names instead of abbreviations

---------

Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>

Labels

:Analytics/ES|QL AKA ESQL >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v9.4.0
