Expose byte offsets on XContentParser for zero-copy sub-structure extraction #142873

@quackaplop

Summary

XContentParser.getTokenLocation() returns an XContentLocation(lineNumber, columnNumber). The underlying Jackson JsonParser provides byte offsets via JsonLocation.getByteOffset(), but JsonXContentParser discards this information when wrapping the Jackson location. Exposing byte offsets would enable zero-copy extraction of sub-structures (objects/arrays) from the source byte array.

Motivation

The JSON_EXTRACT ES|QL function (#142375) extracts values from JSON strings. When the extracted value is an object or array, it currently uses XContentBuilder.copyCurrentStructure(parser) to serialize the sub-structure back to JSON — walking every token and rebuilding the string from scratch, even though the original bytes are already valid JSON in the source array.

With byte offsets, extraction becomes a direct array slice:

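// bytes and offset: the backing array and start position of the JSON source being parsed
long start = parser.getTokenLocation().byteOffset();   // first byte of the opening '{' or '['
parser.skipChildren();                                 // advance to the matching '}' or ']'
long end = parser.getCurrentLocation().byteOffset();   // just past the closing byte
builder.appendBytesRef(new BytesRef(bytes, offset + (int) start, (int) (end - start)));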

The parser still scans the sub-structure to find its end, but nothing is re-serialized: this eliminates the per-token builder calls, string allocation, and JSON escaping for the sub-structure.
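To make the slicing idea concrete without depending on the proposed API, here is a minimal, JDK-only sketch. The class `JsonSliceSketch` and its methods are hypothetical; the hand-rolled scanner only stands in for `skipChildren()` plus `getCurrentLocation().byteOffset()` and assumes well-formed UTF-8 JSON. It copies the slice into a `String` for demonstration, whereas the real change would wrap the existing array in a `BytesRef`.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: a tiny scanner standing in for parser-reported byte offsets. */
final class JsonSliceSketch {

    /**
     * Returns the offset just past the object or array that starts at {@code start}.
     * Tracks nesting depth and skips string contents so that braces or brackets
     * inside string values do not affect the depth count.
     */
    static int endOfStructure(byte[] json, int start) {
        if (json[start] != '{' && json[start] != '[') {
            throw new IllegalArgumentException("offset " + start + " is not the start of an object or array");
        }
        int depth = 0;
        boolean inString = false;
        for (int i = start; i < json.length; i++) {
            byte b = json[i];
            if (inString) {
                if (b == '\\') {
                    i++; // skip the escaped byte so that \" does not end the string
                } else if (b == '"') {
                    inString = false;
                }
            } else if (b == '"') {
                inString = true;
            } else if (b == '{' || b == '[') {
                depth++;
            } else if (b == '}' || b == ']') {
                if (--depth == 0) {
                    return i + 1;
                }
            }
        }
        throw new IllegalArgumentException("unterminated structure at offset " + start);
    }

    /** Returns the raw bytes of the sub-structure at {@code start}, decoded as UTF-8. */
    static String slice(String json, int start) {
        byte[] bytes = json.getBytes(StandardCharsets.UTF_8);
        int end = endOfStructure(bytes, start);
        return new String(bytes, start, end - start, StandardCharsets.UTF_8);
    }
}
```

The extracted text is byte-for-byte what the source contained, including original whitespace and key order, which is also a behavioral difference from re-serializing through a builder.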

Proposed API Change

XContentLocation — add byteOffset with a backward-compatible constructor:

public record XContentLocation(int lineNumber, int columnNumber, long byteOffset) {
    public XContentLocation(int lineNumber, int columnNumber) {
        this(lineNumber, columnNumber, -1L);
    }
}
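A small standalone sketch of how the compatibility constructor would behave. The record is renamed `XContentLocationSketch` to make clear it is not the real class, and `UNKNOWN_OFFSET` and `hasByteOffset()` are hypothetical conveniences that are not part of this proposal.

```java
/** Mirrors the proposed record shape; names other than the three components are assumptions. */
record XContentLocationSketch(int lineNumber, int columnNumber, long byteOffset) {

    /** Sentinel for sources that cannot report a byte position. */
    static final long UNKNOWN_OFFSET = -1L;

    /** Existing two-argument call sites keep compiling and report an unknown offset. */
    XContentLocationSketch(int lineNumber, int columnNumber) {
        this(lineNumber, columnNumber, UNKNOWN_OFFSET);
    }

    /** Hypothetical helper: true only when a real byte position was recorded. */
    boolean hasByteOffset() {
        return byteOffset >= 0;
    }
}
```

One consequence worth noting: because records derive `equals` from all components, a location built with the two-argument constructor compares equal only to other locations whose offset is also -1.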

XContentParser — add getCurrentLocation() (the other method already exists):

XContentLocation getTokenLocation();    // already exists — starts populating byteOffset
XContentLocation getCurrentLocation();  // new — position just past the last consumed byte

Implementation Impact

The XContentParser hierarchy has 19 implementations. The change is concentrated in one place.

Leaf implementations (5):

| Class | Change needed | Notes |
| --- | --- | --- |
| JsonXContentParser | Pass through JsonLocation.getByteOffset() instead of discarding it. Add getCurrentLocation() delegating to Jackson. | ~10 lines changed |
| SmileXContentParser | None (inherits from JsonXContentParser) | |
| CborXContentParser | None (inherits from JsonXContentParser) | |
| YamlXContentParser | None (inherits from JsonXContentParser) | Jackson's YAML parser returns -1 for byte offsets (only char offsets available). |
| MapXContentParser | Return -1 byte offset (no byte stream) | Trivial |

Decorators (13): All transparent — delegate through FilterXContentParser.delegate(). Zero changes needed for 11 of 13. The two with overrides:

| Class | Notes |
| --- | --- |
| DotExpandingXContentParser | Returns saved location for synthesized tokens. Would carry -1 byte offset for synthetic tokens, real offsets for original content. |
| CompletionFieldMapper.MultiFieldParser | Returns fixed location. Would carry -1 byte offset. |

Test-only (1): ParameterizableYamlXContentParser — delegates, transparent.
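The decorator behavior described above can be sketched with stand-in types. Everything here (`Loc`, `LocatingParser`, `ForwardingParser`, `FixedLocationParser`) is hypothetical and only models the shape of XContentLocation, XContentParser, FilterXContentParser, and a fixed-location override; it is not the actual class hierarchy.

```java
/** Stand-in for the extended XContentLocation. */
record Loc(int line, int column, long byteOffset) {}

/** Stand-in for the two location methods on XContentParser. */
interface LocatingParser {
    Loc getTokenLocation();
    Loc getCurrentLocation();
}

/** A transparent decorator: forwarding to the delegate carries byte offsets through unchanged. */
class ForwardingParser implements LocatingParser {
    private final LocatingParser delegate;

    ForwardingParser(LocatingParser delegate) {
        this.delegate = delegate;
    }

    @Override
    public Loc getTokenLocation() {
        return delegate.getTokenLocation();
    }

    @Override
    public Loc getCurrentLocation() {
        return delegate.getCurrentLocation();
    }
}

/** A decorator that reports a fixed location has no real byte position, so it reports -1. */
class FixedLocationParser extends ForwardingParser {
    private final Loc fixed;

    FixedLocationParser(LocatingParser delegate, int line, int column) {
        super(delegate);
        this.fixed = new Loc(line, column, -1L);
    }

    @Override
    public Loc getTokenLocation() {
        return fixed;
    }

    @Override
    public Loc getCurrentLocation() {
        return fixed;
    }
}
```

Under this model, only decorators that already override the location methods need any attention; the rest pick up byte offsets for free.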

Byte Slicing Feasibility by Format

Not all content types support raw byte slicing even with offsets available:

| Format | Byte offsets? | Slicing safe? | Why |
| --- | --- | --- | --- |
| JSON | Yes | Yes | Sliced sub-structure is valid standalone JSON |
| CBOR | Yes | Yes | Self-contained data items, no back-references |
| SMILE | Yes | No | Back-references for repeated field names/strings: a sliced fragment may contain unresolvable references |
| YAML | No (-1) | No | Whitespace-sensitive grammar, anchor/alias system |

Consumers must check the content type before slicing. JSON and CBOR are safe; SMILE and YAML require the XContentBuilder.copyCurrentStructure fallback.
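A consumer-side guard could look roughly like the following. The enum `SliceFormat` and the method `canSlice` are hypothetical names for illustration; real code would consult the parser's content type rather than a separate enum. The safe/unsafe flags simply encode the feasibility summary above.

```java
/** Hypothetical model of the per-format slicing rules. */
enum SliceFormat {
    JSON(true),
    CBOR(true),
    SMILE(false), // back-references may point outside the slice
    YAML(false);  // no byte offsets, and not safely sliceable anyway

    private final boolean sliceSafe;

    SliceFormat(boolean sliceSafe) {
        this.sliceSafe = sliceSafe;
    }

    /**
     * Slice only when the format is self-contained and both offsets are known
     * (a value of -1 means the parser could not report a byte position).
     * When this returns false, callers fall back to copying the structure token by token.
     */
    boolean canSlice(long startOffset, long endOffset) {
        return sliceSafe && startOffset >= 0 && endOffset >= startOffset;
    }
}
```

Checking the offsets as well as the format keeps the fallback path correct for sources such as map-backed parsers, which would report -1 regardless of the nominal content type.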
