JSON_EXTRACT: zero-copy byte slicing for object, array, and number extraction by quackaplop · Pull Request #143702 · elastic/elasticsearch

quackaplop · 2026-03-05T17:18:07Z

Summary

Optimizes JSON_EXTRACT to use zero-copy byte slicing instead of copyCurrentStructure()
re-serialization when extracting objects, arrays, and numbers from JSON input. This builds on
the byte offset API exposed in #143501.

What changed

Object/array extraction — Previously, extracting a nested object or array walked every token
in the subtree and rebuilt JSON from scratch via XContentBuilder.copyCurrentStructure(). Now it
slices bytes directly from the input buffer using getTokenLocation().byteOffset() →
skipChildren() → getCurrentLocation().byteOffset(). Zero allocation, zero re-parsing.

Number extraction — Previously called parser.text() which makes Jackson convert the number
to a Java String, then wraps in BytesRef. Now byte-slices the number literal directly from
the input array, avoiding the String allocation entirely.
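As a rough illustration of the number path, the sketch below finds the end of a number literal in the raw bytes and keeps it as an (offset, length) pair instead of building a String first. `NumberSliceSketch` and `endOfNumber` are hypothetical names; the real change takes its offsets from the parser rather than scanning by hand.

```java
import java.nio.charset.StandardCharsets;

final class NumberSliceSketch {
    /** Returns the index just past the JSON number literal that starts at {@code start}. */
    static int endOfNumber(byte[] json, int start) {
        int i = start;
        while (i < json.length) {
            byte b = json[i];
            boolean numberChar = (b >= '0' && b <= '9')
                || b == '-' || b == '+' || b == '.' || b == 'e' || b == 'E';
            if (!numberChar) {
                break;
            }
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        byte[] doc = "{\"price\":-12.50e3,\"qty\":2}".getBytes(StandardCharsets.UTF_8);
        int start = 9; // offset of the '-' that begins the value of "price"
        int end = endOfNumber(doc, start);
        // The pair (start, end - start) is the result; a String is built here only for printing.
        System.out.println(new String(doc, start, end - start, StandardCharsets.UTF_8));
    }
}
```

Because the bytes are taken as written, the literal keeps its original spelling (for example the trailing zero and exponent in this input).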

Boolean extraction — Reuses static TRUE_BYTES / FALSE_BYTES constants instead of
allocating a new BytesRef("true") / BytesRef("false") per call.
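A minimal sketch of the constant-reuse pattern, using plain `byte[]` instead of `BytesRef` so it runs on the JDK alone. The class and method names are hypothetical; only the `TRUE_BYTES` / `FALSE_BYTES` names come from the description above.

```java
import java.nio.charset.StandardCharsets;

final class BooleanBytesSketch {
    // Encoded once when the class is initialized and shared by every call.
    static final byte[] TRUE_BYTES = "true".getBytes(StandardCharsets.UTF_8);
    static final byte[] FALSE_BYTES = "false".getBytes(StandardCharsets.UTF_8);

    /** Returns a shared constant; callers must treat the array as read-only. */
    static byte[] booleanBytes(boolean value) {
        return value ? TRUE_BYTES : FALSE_BYTES;
    }

    public static void main(String[] args) {
        // Both calls return the same array instance, so nothing is allocated per call.
        System.out.println(booleanBytes(true) == booleanBytes(true));
    }
}
```

Sharing is only safe when every consumer treats the constants as read-only, which this sketch assumes.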

Navigation refactoring — Replaced recursive descent (extractValue → navigateObject →
extractValue → ...) with an iterative loop. Navigation methods are now pure parser-positioning
helpers that don't need the byte-slicing context, keeping raw byte access confined to the
extraction point.
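To show the shape of the iterative navigation, here is a sketch that walks a path one segment per loop iteration. It is an assumption-laden stand-in: it navigates already-parsed `Map`/`List` values rather than a streaming parser, and `IterativeNavigationSketch`, `navigate`, and the list-of-segments path form are hypothetical.

```java
import java.util.List;
import java.util.Map;

final class IterativeNavigationSketch {
    /** Walks {@code path} one segment per iteration; returns null if any segment is missing. */
    static Object navigate(Object root, List<String> path) {
        Object current = root;
        for (String segment : path) {
            if (current instanceof Map) {
                current = ((Map<?, ?>) current).get(segment);
            } else if (current instanceof List) {
                List<?> list = (List<?>) current;
                int index;
                try {
                    index = Integer.parseInt(segment);
                } catch (NumberFormatException e) {
                    return null; // a non-numeric segment cannot index an array
                }
                if (index < 0 || index >= list.size()) {
                    return null;
                }
                current = list.get(index);
            } else {
                return null; // scalars have no children to descend into
            }
            if (current == null) {
                return null;
            }
        }
        return current; // positioned on the target; extraction happens separately
    }

    public static void main(String[] args) {
        Object doc = Map.of("a", Map.of("b", List.of(10, 20, 30)));
        System.out.println(navigate(doc, List.of("a", "b", "1")));
    }
}
```

The loop only positions on the target value and knows nothing about how the value is later extracted, which mirrors the separation described above.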

Non-JSON _source formats (SMILE/CBOR/YAML) fall back to copyCurrentStructure().

Benchmarks

Also adds json_extract and json_extract_object scenarios to EvalBenchmark, and a
dedicated JsonExtractBenchmark with 10 scenarios through the full eval pipeline (EvalMapper →
Layout → Page → Evaluator).

Environment: Apple M3 Max, JDK 25.0.1, JMH 1.37, warmup 3×2s, measurement 5×2s.

Scenario	Before (ns/op)	After (ns/op)	Change
small_object (30B)	222.0 ± 2.8	115.9 ± 3.1	-47.8%
medium_object (500B)	1,275.9 ± 27.0	662.2 ± 15.7	-48.1%
large_object (4KB)	24,531.3 ± 1,641	15,938.0 ± 721	-35.0%
large_nested_extract (10KB doc)	12,323.1 ± 458	6,664.0 ± 180	-45.9%
array_of_objects ([25] of 50)	4,253.0 ± 76	3,853.5 ± 68	-9.4%
nested_scalar (5 levels)	206.2 ± 4.9	178.9 ± 3.1	-13.2%
deep_nesting (10 levels)	478.7 ± 54	324.9 ± 10.7	-32.1%
number	160.0 ± 2.4	133.1 ± 3.2	-16.8%
boolean	106.0 ± 2.1	100.6 ± 2.5	-5.1%
string	107.6 ± 2.8	103.1 ± 2.0	-4.2%

The largest wins are on object extraction (35–48%), where copyCurrentStructure was the hot path; the array_of_objects scenario improves by 9.4%.

Relates #142873

…traction Replace copyCurrentStructure() re-serialization with zero-copy byte slicing for JSON input. When the extracted value is an object, array, or number, slice bytes directly from the input buffer using XContentLocation.byteOffset() offsets (exposed in elastic#143501). Also refactors navigation from recursive descent to iterative loop, confining raw byte access to the extraction point. Adds JMH benchmarks for JSON_EXTRACT through the full eval pipeline.

elasticsearchmachine · 2026-03-05T17:18:32Z

Pinging @elastic/es-analytical-engine (Team:Analytics)



nik9000

Asked a few questions, mostly "for later". The only one we should resolve before merging is "do we want to leave the benchmark in EvalBenchmark now that we have a big fancy one?"

nik9000 · 2026-03-06T14:46:02Z

benchmarks/src/main/java/org/elasticsearch/benchmark/esql/JsonExtractBenchmark.java

+            "boolean",
+            "string" }
+    )
+    public String scenario;


Do we want both of these benchmarks? The one in EvalBenchmark and here.

I suppose this is much more specific. I'd be fine having just this one.

This one is more broad, yeah

nik9000 · 2026-03-06T14:52:51Z

...rc/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/string/JsonExtract.java

        switch (token) {
-            case VALUE_STRING, VALUE_NUMBER -> builder.appendBytesRef(new BytesRef(parser.text()));
-            case VALUE_BOOLEAN -> builder.appendBytesRef(new BytesRef(Boolean.toString(parser.booleanValue())));
+            case VALUE_STRING -> builder.appendBytesRef(new BytesRef(parser.text()));


Can this one use the slice too? That'd save a bunch of allocations and potentially the utf-8 -> utf-16 -> utf-8 juggling.

You can do this even if the original format isn't json - just so long as it leaves the underlying stuff in utf-8. Which, I think, all of our supported formats do.

Anyway, this can wait for later I think.

ooooh good point
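A sketch of what the string suggestion might look like, under one added assumption that is not from the thread: a raw slice equals the decoded value only when the string contains no backslash escapes, so escaped strings still need decoding. `StringSliceSketch` and `sliceUnescaped` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

final class StringSliceSketch {
    /**
     * Returns {offset, length} of the content between the quotes when it can be used as-is,
     * or null when the string contains a backslash escape and must be decoded instead.
     */
    static int[] sliceUnescaped(byte[] json, int openQuote) {
        for (int i = openQuote + 1; i < json.length; i++) {
            if (json[i] == '\\') {
                return null; // escapes such as \n differ between raw and decoded form
            }
            if (json[i] == '"') {
                return new int[] { openQuote + 1, i - openQuote - 1 };
            }
        }
        throw new IllegalArgumentException("unterminated string at offset " + openQuote);
    }

    public static void main(String[] args) {
        byte[] doc = "{\"name\":\"caf\u00e9\",\"note\":\"a\\nb\"}".getBytes(StandardCharsets.UTF_8);
        int[] name = sliceUnescaped(doc, 8); // opening quote of the "name" value
        System.out.println(new String(doc, name[0], name[1], StandardCharsets.UTF_8));
        System.out.println(sliceUnescaped(doc, 23) == null); // "note" contains an escape
    }
}
```

Multi-byte UTF-8 content passes through untouched because the scan only looks for the ASCII quote and backslash bytes, which never occur inside a multi-byte sequence.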

nik9000 · 2026-03-06T14:58:33Z

...rc/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/string/JsonExtract.java

+                    if (tokenLocation.hasValidByteOffset() && currentLocation.hasValidByteOffset()) {
+                        int start = (int) tokenLocation.byteOffset() + rawOffset;
+                        int end = (int) currentLocation.byteOffset() + rawOffset;
+                        builder.appendBytesRef(new BytesRef(rawBytes, start, end - start));


This trick should work for yaml as well. And you could do a different trick for cbor et al. I suppose the question is, "do we use those frequently enough for it to be worth it?" Also for later.
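The offset arithmetic in the diff above can be sketched in isolation. This assumes, as the snippet suggests, that parser offsets are relative to where parsing began and that `rawOffset` is where the value starts inside a larger backing array. `OffsetTranslationSketch` and `toAbsolute` are hypothetical, a negative offset stands in for "no valid byte offset", and `Math.toIntExact` replaces the plain `(int)` cast as a defensive choice of this sketch.

```java
import java.nio.charset.StandardCharsets;

final class OffsetTranslationSketch {
    /**
     * Translates parser-relative offsets into {start, length} within the backing array,
     * or returns null to signal that the caller should fall back to copying.
     */
    static int[] toAbsolute(long tokenOffset, long currentOffset, int rawOffset) {
        if (tokenOffset < 0 || currentOffset < tokenOffset) {
            return null; // offsets unavailable or inconsistent
        }
        int start = Math.toIntExact(tokenOffset) + rawOffset;
        int end = Math.toIntExact(currentOffset) + rawOffset;
        return new int[] { start, end - start };
    }

    public static void main(String[] args) {
        // Four bytes of unrelated data precede the JSON value in the shared array.
        byte[] backing = "xxxx{\"k\":[1,2]}".getBytes(StandardCharsets.UTF_8);
        int rawOffset = 4;
        // Relative to the JSON start, the array opens at 5 and the parser ends just past it at 10.
        int[] slice = toAbsolute(5, 10, rawOffset);
        System.out.println(new String(backing, slice[0], slice[1], StandardCharsets.UTF_8));
    }
}
```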


…traction (elastic#143702)

* JSON_EXTRACT: zero-copy byte slicing for object, array, and number extraction

  Replace copyCurrentStructure() re-serialization with zero-copy byte slicing for JSON input. When the extracted value is an object, array, or number, slice bytes directly from the input buffer using XContentLocation.byteOffset() offsets (exposed in elastic#143501). Also refactors navigation from recursive descent to iterative loop, confining raw byte access to the extraction point. Adds JMH benchmarks for JSON_EXTRACT through the full eval pipeline.

* Add changelog for elastic#143702

* [CI] Auto commit changes from spotless

* Clean up navigation helpers to avoid threading unused parameters

  Navigation methods now only position the parser; they no longer carry builder, segments, depth, rawBytes, or rawOffset.

* Use full variable names instead of abbreviations

---------

Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>

quackaplop added >enhancement :Analytics/ES|QL AKA ESQL labels Mar 5, 2026

elasticsearchmachine added v9.4.0 Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) labels Mar 5, 2026

quackaplop and others added 4 commits March 5, 2026 17:19

Add changelog for elastic#143702

8de1155

[CI] Auto commit changes from spotless

56f0451

Clean up navigation helpers to avoid threading unused parameters

8d49e7c

Navigation methods now only position the parser — they no longer carry builder, segments, depth, rawBytes, or rawOffset.

Merge branch 'main' into feature/json-extract-byte-slicing

20101a2

quackaplop requested a review from nik9000 March 5, 2026 23:53

quackaplop added 2 commits March 6, 2026 00:21

Use full variable names instead of abbreviations

cdf2e27

Merge branch 'main' into feature/json-extract-byte-slicing

67771c8

nik9000 approved these changes Mar 6, 2026


quackaplop merged commit a9ad9f5 into elastic:main Mar 6, 2026
36 checks passed

quackaplop requested a review from nik9000 March 6, 2026 15:05

prwhelan mentioned this pull request Mar 6, 2026

[ML] Wait for cluster state in test #143767

Merged

prwhelan mentioned this pull request Mar 9, 2026

[Transform] Disable PIT for CPS #143876

Closed
