Summary
XContentParser.getTokenLocation() returns an XContentLocation(lineNumber, columnNumber). The underlying Jackson JsonParser provides byte offsets via JsonLocation.getByteOffset(), but JsonXContentParser discards this information when wrapping the Jackson location. Exposing byte offsets would enable zero-copy extraction of sub-structures (objects/arrays) from the source byte array.
Motivation
The JSON_EXTRACT ES|QL function (#142375) extracts values from JSON strings. When the extracted value is an object or array, it currently uses XContentBuilder.copyCurrentStructure(parser) to serialize the sub-structure back to JSON — walking every token and rebuilding the string from scratch, even though the original bytes are already valid JSON in the source array.
With byte offsets, extraction becomes a direct array slice:
long start = parser.getTokenLocation().byteOffset();
parser.skipChildren();
long end = parser.getCurrentLocation().byteOffset();
builder.appendBytesRef(new BytesRef(bytes, offset + (int) start, (int) (end - start)));
This eliminates all intermediate parsing, string allocation, and JSON escaping for the sub-structure.
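To make the slice concrete, here is a minimal, self-contained sketch that uses neither XContentParser nor Jackson. The class `ByteSliceSketch` and its helper `findStructureEnd` are hypothetical: the helper stands in for what `skipChildren()` followed by `getCurrentLocation().byteOffset()` would report, and it assumes well-formed UTF-8 JSON. It only illustrates that the bytes between the two offsets are already a valid standalone document.

```java
import java.nio.charset.StandardCharsets;

/** Illustration only: a stand-in for offset-based slicing, not Elasticsearch code. */
final class ByteSliceSketch {

    /**
     * Hypothetical stand-in for skipChildren() + getCurrentLocation().byteOffset().
     * Given the offset of an opening '{' or '[', returns the offset just past the
     * matching closing bracket. Brackets inside string literals are ignored.
     */
    static int findStructureEnd(byte[] json, int start) {
        int depth = 0;
        boolean inString = false;
        for (int i = start; i < json.length; i++) {
            byte b = json[i];
            if (inString) {
                if (b == '\\') {
                    i++; // skip the character that follows a backslash
                } else if (b == '"') {
                    inString = false;
                }
            } else if (b == '"') {
                inString = true;
            } else if (b == '{' || b == '[') {
                depth++;
            } else if (b == '}' || b == ']') {
                depth--;
                if (depth == 0) {
                    return i + 1;
                }
            }
        }
        throw new IllegalArgumentException("unterminated structure at offset " + start);
    }

    /** Decodes the sub-structure starting at byte offset {@code start} without re-serializing it. */
    static String sliceToString(String json, int start) {
        byte[] bytes = json.getBytes(StandardCharsets.UTF_8);
        int end = findStructureEnd(bytes, start);
        return new String(bytes, start, end - start, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The nested object starts at byte offset 5.
        System.out.println(sliceToString("{\"a\":{\"b\":[1,\"x}\"]},\"c\":2}", 5));
    }
}
```

The sketch decodes to a String only so the result is easy to inspect. The snippet above it wraps the same range in a BytesRef view instead, so no bytes would be copied.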
Proposed API Change
XContentLocation — add byteOffset with a backward-compatible constructor:
public record XContentLocation(int lineNumber, int columnNumber, long byteOffset) {
    public XContentLocation(int lineNumber, int columnNumber) {
        this(lineNumber, columnNumber, -1L);
    }
}
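As a quick check of the backward-compatibility claim, the following standalone sketch redeclares the proposed record (package-private, outside Elasticsearch) and exercises both constructors. The `LocationDemo` class and its `hasByteOffset` helper are hypothetical and not part of the proposal. They only show how a caller might treat -1 as "unknown".

```java
/** Standalone copy of the proposed record, for illustration only. */
record XContentLocation(int lineNumber, int columnNumber, long byteOffset) {

    /** Existing two-argument form: the byte offset is unknown, so it defaults to -1. */
    XContentLocation(int lineNumber, int columnNumber) {
        this(lineNumber, columnNumber, -1L);
    }
}

final class LocationDemo {

    /** Hypothetical helper (not part of the proposal): true when a real offset is present. */
    static boolean hasByteOffset(XContentLocation location) {
        return location.byteOffset() >= 0;
    }

    public static void main(String[] args) {
        XContentLocation legacy = new XContentLocation(3, 14);      // existing call sites compile unchanged
        XContentLocation precise = new XContentLocation(3, 14, 42L);
        System.out.println(hasByteOffset(legacy));  // false
        System.out.println(hasByteOffset(precise)); // true
    }
}
```

One consequence worth keeping in mind: because records derive `equals` and `hashCode` from all components, two locations with the same line and column but different byte offsets would no longer compare equal.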
XContentParser — add getCurrentLocation() (the other method already exists):
XContentLocation getTokenLocation(); // already exists — starts populating byteOffset
XContentLocation getCurrentLocation(); // new — position just past the last consumed byte
Implementation Impact
The XContentParser hierarchy has 19 implementations, but the change is concentrated in one place: JsonXContentParser.
Leaf implementations (5):
| Class | Change needed | Notes |
| --- | --- | --- |
| JsonXContentParser | Pass through JsonLocation.getByteOffset() instead of discarding it. Add getCurrentLocation() delegating to Jackson. | ~10 lines changed |
| SmileXContentParser | None; inherits from JsonXContentParser | |
| CborXContentParser | None; inherits from JsonXContentParser | |
| YamlXContentParser | None; inherits from JsonXContentParser. Jackson's YAML parser returns -1 for byte offsets (only char offsets available). | |
| MapXContentParser | Return -1 byte offset (no byte stream) | Trivial |
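The JsonXContentParser change amounts to carrying one more field across when the Jackson location is wrapped. The sketch below shows that mapping under stated assumptions: `JacksonLocationStandIn` is a hypothetical stand-in for Jackson's `JsonLocation` (whose real accessors are `getLineNr()`, `getColumnNr()`, and `getByteOffset()`), and `toXContentLocation` is an illustrative name, not the actual method in Elasticsearch.

```java
/** Illustration only: how the Jackson location could be wrapped without dropping the offset. */
final class LocationMappingSketch {

    /** Hypothetical stand-in for Jackson's JsonLocation, with just the three values used here. */
    record JacksonLocationStandIn(int lineNr, int columnNr, long byteOffset) {}

    /** Local copy of the proposed three-component record. */
    record XContentLocation(int lineNumber, int columnNumber, long byteOffset) {}

    /** Carries the byte offset across; a -1 from the source passes through unchanged. */
    static XContentLocation toXContentLocation(JacksonLocationStandIn loc) {
        return new XContentLocation(loc.lineNr(), loc.columnNr(), loc.byteOffset());
    }
}
```

Because -1 passes through unchanged, parsers whose Jackson backend cannot report byte offsets (YAML, per the table) need no special handling in this mapping.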
Decorators (13): All transparent — delegate through FilterXContentParser.delegate(). Zero changes needed for 11 of 13. The two with overrides:
| Class | Notes |
| --- | --- |
| DotExpandingXContentParser | Returns the saved location for synthesized tokens; these would carry a -1 byte offset, while original content keeps its real offsets |
| CompletionFieldMapper.MultiFieldParser | Returns a fixed location, which would carry a -1 byte offset |
Test-only (1): ParameterizableYamlXContentParser — delegates, transparent.
Byte Slicing Feasibility by Format
Not all content types support raw byte slicing even with offsets available:
| Format | Byte offsets? | Slicing safe? | Why |
| --- | --- | --- | --- |
| JSON | Yes | Yes | Sliced sub-structure is valid standalone JSON |
| CBOR | Yes | Yes | Self-contained data items, no back-references |
| SMILE | Yes | No | Back-references for repeated field names/strings; a sliced fragment may contain unresolvable references |
| YAML | No (-1) | No | Whitespace-sensitive grammar, anchor/alias system |
Consumers must check the content type before slicing. JSON and CBOR are safe; SMILE and YAML require the XContentBuilder.copyCurrentStructure fallback.
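A consumer-side guard for this rule could look like the sketch below. Everything in it is hypothetical: the `Format` enum merely mirrors the four rows of the table rather than reusing Elasticsearch's own content-type enum, and `canSliceRaw` is an illustrative name. It also treats a negative offset as "unknown", matching the -1 convention proposed above.

```java
/** Illustration only: deciding between a raw byte slice and the copy fallback. */
final class SliceGuardSketch {

    /** Hypothetical enum mirroring the formats in the feasibility table. */
    enum Format { JSON, CBOR, SMILE, YAML }

    /**
     * Returns true only when the format tolerates raw slicing and both offsets
     * are known. Callers would fall back to copyCurrentStructure otherwise.
     */
    static boolean canSliceRaw(Format format, long startOffset, long endOffset) {
        boolean formatSafe = format == Format.JSON || format == Format.CBOR;
        boolean offsetsKnown = startOffset >= 0 && endOffset >= startOffset;
        return formatSafe && offsetsKnown;
    }
}
```

Checking the offsets as well as the format keeps the guard correct for parsers such as MapXContentParser, which would report -1 regardless of the declared content type.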