JSON_EXTRACT: zero-copy byte slicing for object, array, and number extraction #143702
…traction
Replace copyCurrentStructure() re-serialization with zero-copy byte slicing for JSON input. When the extracted value is an object, array, or number, slice bytes directly from the input buffer using XContentLocation.byteOffset() offsets (exposed in elastic#143501).
Also refactors navigation from recursive descent to an iterative loop, confining raw byte access to the extraction point.
Adds JMH benchmarks for JSON_EXTRACT through the full eval pipeline.
Pinging @elastic/es-analytical-engine (Team:Analytics)
Navigation methods now only position the parser — they no longer carry builder, segments, depth, rawBytes, or rawOffset.
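To illustrate the shape of the refactoring described in this commit, here is a minimal sketch of recursion-free navigation. It is not the PR's code: it walks an already-parsed Map/List tree instead of positioning a streaming parser, and the class name, the `navigate` helper, and the pre-split `segments` list are all hypothetical. What it shares with the description above is that the loop takes only the current position and the path, with no builder or raw-byte context threaded through.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: iterative path navigation over a parsed tree.
final class IterativeNavigationSketch {

    /**
     * Walks the path one segment per loop iteration instead of recursing.
     * Returns null when a segment does not resolve.
     */
    static Object navigate(Object root, List<String> segments) {
        Object current = root;
        for (String segment : segments) {
            if (current instanceof Map<?, ?> map) {
                current = map.get(segment);
            } else if (current instanceof List<?> list) {
                current = elementAt(list, segment);
            } else {
                return null; // reached a scalar before the path was exhausted
            }
            if (current == null) {
                return null;
            }
        }
        return current;
    }

    /** Treats the segment as an array index; returns null if it is not a valid one. */
    private static Object elementAt(List<?> list, String segment) {
        try {
            int index = Integer.parseInt(segment);
            return index >= 0 && index < list.size() ? list.get(index) : null;
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        Map<String, Object> doc = Map.of("a", Map.of("b", List.of(10, 20, 30)));
        System.out.println(navigate(doc, List.of("a", "b", "1")));
        System.out.println(navigate(doc, List.of("a", "missing")));
    }
}
```

Because the helper only moves a cursor, whatever happens at the final position (slicing, copying, or reading a scalar) can stay in a single place in the caller.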
nik9000 left a comment
Asked a few questions, mostly "for later". The only one we should resolve before merging is "do we want to leave the benchmark in EvalBenchmark now that we have a big fancy one?"
"boolean",
"string" }
)
public String scenario;
Do we want both of these benchmarks? The one in EvalBenchmark and here.
I suppose this is much more specific. I'd be fine having just this one.
This one is more broad, yeah
switch (token) {
case VALUE_STRING, VALUE_NUMBER -> builder.appendBytesRef(new BytesRef(parser.text()));
case VALUE_BOOLEAN -> builder.appendBytesRef(new BytesRef(Boolean.toString(parser.booleanValue())));
case VALUE_STRING -> builder.appendBytesRef(new BytesRef(parser.text()));
Can this one use the slice too? That'd save a bunch of allocations and potentially the utf-8 -> utf-16 -> utf-8 juggling.
You can do this even if the original format isn't json - just so long as it leaves the underlying stuff in utf-8. Which, I think, all of our supported formats do.
Anyway, this can wait for later I think.
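The "utf-8 -> utf-16 -> utf-8 juggling" mentioned above can be made concrete with a small, self-contained comparison. This is only a sketch using JDK classes: `ByteBuffer.wrap` stands in for a `BytesRef` over the input array, the class and method names are hypothetical, and the byte offsets are hard-coded rather than taken from a parser. It also assumes the string value contains no JSON escape sequences, since a raw slice would return the escaped form rather than the decoded text.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: String round trip versus a view over the input bytes.
final class StringSliceSketch {

    /** Round trip: decode UTF-8 bytes into a UTF-16 String, then encode into a fresh byte[]. */
    static byte[] viaString(byte[] input, int start, int end) {
        String text = new String(input, start, end - start, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Slice: a view over the same backing array; no bytes are decoded or copied. */
    static ByteBuffer viaSlice(byte[] input, int start, int end) {
        return ByteBuffer.wrap(input, start, end - start);
    }

    public static void main(String[] args) {
        // JSON object with one field "name" whose value ends in a two-byte UTF-8 character.
        byte[] input = "{\"name\":\"caf\u00e9\"}".getBytes(StandardCharsets.UTF_8);
        int start = 9;  // first byte after the opening quote of the value
        int end = 14;   // offset of the closing quote (exclusive bound)

        byte[] copied = viaString(input, start, end);
        ByteBuffer view = viaSlice(input, start, end);

        System.out.println(copied.length);           // bytes in the re-encoded copy
        System.out.println(view.remaining());        // bytes visible through the view
        System.out.println(view.array() == input);   // the view shares the input array
        System.out.println(StandardCharsets.UTF_8.decode(view.duplicate()));
    }
}
```

Both paths describe the same five bytes, but the first allocates a String plus a new byte array, while the second allocates only the small view object.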
if (tokenLocation.hasValidByteOffset() && currentLocation.hasValidByteOffset()) {
int start = (int) tokenLocation.byteOffset() + rawOffset;
int end = (int) currentLocation.byteOffset() + rawOffset;
builder.appendBytesRef(new BytesRef(rawBytes, start, end - start));
This trick should work for yaml as well. And you could do a different trick for cbor et al. I suppose the question is, "do we use those frequently enough for it to be worth it?" Also for later.
…traction (elastic#143702)
* JSON_EXTRACT: zero-copy byte slicing for object, array, and number extraction
Replace copyCurrentStructure() re-serialization with zero-copy byte slicing for JSON input. When the extracted value is an object, array, or number, slice bytes directly from the input buffer using XContentLocation.byteOffset() offsets (exposed in elastic#143501). Also refactors navigation from recursive descent to iterative loop, confining raw byte access to the extraction point. Adds JMH benchmarks for JSON_EXTRACT through the full eval pipeline.
* Add changelog for elastic#143702
* [CI] Auto commit changes from spotless
* Clean up navigation helpers to avoid threading unused parameters
Navigation methods now only position the parser; they no longer carry builder, segments, depth, rawBytes, or rawOffset.
* Use full variable names instead of abbreviations
---------
Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
Summary
Optimizes JSON_EXTRACT to use zero-copy byte slicing instead of copyCurrentStructure() re-serialization when extracting objects, arrays, and numbers from JSON input. This builds on the byte offset API exposed in #143501.
What changed
Object/array extraction: Previously, extracting a nested object or array walked every token in the subtree and rebuilt JSON from scratch via XContentBuilder.copyCurrentStructure(). Now it slices bytes directly from the input buffer using getTokenLocation().byteOffset() → skipChildren() → getCurrentLocation().byteOffset(). Zero allocation, zero re-parsing.
Number extraction: Previously called parser.text(), which makes Jackson convert the number to a Java String, then wraps it in a BytesRef. Now byte-slices the number literal directly from the input array, avoiding the String allocation entirely.
Boolean extraction: Reuses static TRUE_BYTES/FALSE_BYTES constants instead of allocating a new BytesRef("true")/BytesRef("false") per call.
Navigation refactoring: Replaced recursive descent (extractValue → navigateObject → extractValue → ...) with an iterative loop. Navigation methods are now pure parser-positioning helpers that don't need the byte-slicing context, keeping raw byte access confined to the extraction point.
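The object/array and boolean changes above can be sketched without any parser dependency. In the PR the start and end offsets come from the parser (token location, skipChildren(), current location); in this sketch a hand-rolled bracket scanner stands in for that, and `Slice`, `endOfContainer`, and `sliceContainer` are hypothetical names, with `Slice` playing the role of a `BytesRef`. It assumes well-formed JSON and a start offset that points at an opening `{` or `[`.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: return a view over the input bytes instead of re-serializing.
final class ZeroCopySliceSketch {

    /** Stand-in for BytesRef: describes a region of a shared array without copying it. */
    record Slice(byte[] bytes, int offset, int length) {
        String utf8() {
            return new String(bytes, offset, length, StandardCharsets.UTF_8);
        }
    }

    // Shared constants, so returning a boolean result allocates nothing per call.
    static final Slice TRUE_BYTES = constant("true");
    static final Slice FALSE_BYTES = constant("false");

    private static Slice constant(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new Slice(bytes, 0, bytes.length);
    }

    /**
     * Returns the offset just past the object or array that starts at {@code start}.
     * Stand-in for the parser's skipChildren(); brackets inside strings are ignored.
     */
    static int endOfContainer(byte[] json, int start) {
        int depth = 0;
        boolean inString = false;
        for (int i = start; i < json.length; i++) {
            byte c = json[i];
            if (inString) {
                if (c == '\\') {
                    i++; // skip the byte that follows a backslash
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '{' || c == '[') {
                depth++;
            } else if (c == '}' || c == ']') {
                if (--depth == 0) {
                    return i + 1;
                }
            }
        }
        throw new IllegalArgumentException("unterminated JSON value");
    }

    /** Slices the container at {@code start} out of the input without copying bytes. */
    static Slice sliceContainer(byte[] json, int start) {
        int end = endOfContainer(json, start);
        return new Slice(json, start, end - start);
    }

    public static void main(String[] args) {
        byte[] json = "{\"a\":{\"b\":[1,\"]\"]},\"c\":2}".getBytes(StandardCharsets.UTF_8);
        Slice nested = sliceContainer(json, 5); // offset 5 is the '{' that opens the value of "a"
        System.out.println(nested.utf8());
        System.out.println(nested.bytes() == json);
    }
}
```

The slice preserves the input's original formatting byte for byte, which is also why no re-parsing or rebuilding is needed.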
Non-JSON _source formats (SMILE/CBOR/YAML) fall back to copyCurrentStructure().
Benchmarks
Also adds json_extract and json_extract_object scenarios to EvalBenchmark, and a dedicated JsonExtractBenchmark with 10 scenarios through the full eval pipeline (EvalMapper → Layout → Page → Evaluator).
Environment: Apple M3 Max, JDK 25.0.1, JMH 1.37, warmup 3×2s, measurement 5×2s.
Largest wins on object/array extraction (35–48%) where copyCurrentStructure was the hot path.
Relates #142873