
Validation of the _source to reject contradicting and ambiguous requests#20742

Draft
urmichm wants to merge 15 commits into opensearch-project:main from urmichm:20612-source-validation

Conversation

@urmichm
Contributor

@urmichm urmichm commented Feb 27, 2026

Description

Validation of the _source object has been added to avoid confusion and to reject contradicting and ambiguous requests:

  • "_source": {} At least one of includes or excludes must be defined. 🚫
  • "_source": [] An explicitly defined empty array of includes is not allowed. 🚫
  • "_source": { "includes": "text", "excludes": ["title", "text"] } The text field appears in both includes and excludes, which is a contradiction. 🚫
  • "_source": { "includes": [], "excludes": ["title"] } or "_source": { "includes": ["title"], "excludes": [] } Explicitly defined empty arrays are not allowed, to avoid confusion. 🚫

To include the whole _source object or to exclude it completely, use the existing boolean forms instead:
"_source": true and "_source": false
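The contradiction rule above can be sketched with plain collections. This is an illustrative sketch only; the class and method names (SourceFilterCheck, validate) are invented for this example and are not the PR's actual FetchSourceContext code, though the error message mirrors the one in the PR.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch: reject any field listed in both includes and excludes.
class SourceFilterCheck {
    static void validate(List<String> includes, List<String> excludes) {
        Set<String> includeSet = new HashSet<>(includes);
        for (String exclude : excludes) {
            if (includeSet.contains(exclude)) {
                throw new IllegalArgumentException(
                    "The same entry [" + exclude + "] cannot be both included and excluded in _source.");
            }
        }
    }

    public static void main(String[] args) {
        validate(List.of("title"), List.of("body")); // disjoint lists pass
        try {
            validate(List.of("text"), List.of("title", "text")); // contradiction
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The check is one-directional on purpose: building a set from includes and scanning excludes finds any overlap in linear time.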

Unit tests have been added to cover the behaviour of the parsing implementation, and existing unit tests have been extended to cover invalid cases.

Related Issues

Resolves #20612

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Contributor

github-actions bot commented Feb 27, 2026

PR Reviewer Guide 🔍

(Review updated until commit 706f30b)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Core _source validation logic and unit tests

Relevant files:

  • server/src/main/java/org/opensearch/search/fetch/subphase/FetchSourceContext.java
  • server/src/test/java/org/opensearch/search/fetch/subphase/FetchSourceContextTests.java

Sub-PR theme: gRPC and REST API integration tests for _source validation

Relevant files:

  • modules/transport-grpc/src/test/java/org/opensearch/transport/grpc/proto/request/common/FetchSourceContextProtoUtilsTests.java
  • rest-api-spec/src/main/resources/rest-api-spec/test/search/10_source_filtering.yml

⚡ Recommended focus areas for review

Breaking Change

The validateAmbiguousFields() call in the constructor and the new empty-array/empty-object restrictions in fromXContent and parseSourceFieldArray are breaking changes. Existing users who pass empty arrays or overlapping includes/excludes (even programmatically via the public constructor) will now get exceptions. This could break backward compatibility for existing integrations and stored queries.

public FetchSourceContext(boolean fetchSource, String[] includes, String[] excludes) {
    this.fetchSource = fetchSource;
    this.includes = includes == null ? Strings.EMPTY_ARRAY : includes;
    this.excludes = excludes == null ? Strings.EMPTY_ARRAY : excludes;
    validateAmbiguousFields();
}

public FetchSourceContext(boolean fetchSource) {
    this(fetchSource, Strings.EMPTY_ARRAY, Strings.EMPTY_ARRAY);
}

public FetchSourceContext(StreamInput in) throws IOException {
    fetchSource = in.readBoolean();
    includes = in.readStringArray();
    excludes = in.readStringArray();
}

/**
 * The same entry cannot be both included and excluded in _source.
 * Since the constructors are public, this validation is required to be called in the constructor.
 * */
private void validateAmbiguousFields() {
    Set<String> includeSet = new HashSet<>(Arrays.asList(this.includes));
    for (String exclude : this.excludes) {
        if (includeSet.contains(exclude)) {
            throw new OpenSearchException(AMBIGUOUS_FIELD_MESSAGE, exclude);
        }
    }
}
StreamInput Bypass

The FetchSourceContext(StreamInput in) constructor reads fields directly and does not call validateAmbiguousFields(). This means deserialized instances from the wire (e.g., from older nodes in a mixed-version cluster) bypass the new validation, creating an inconsistency where the same data can be valid when deserialized but invalid when constructed directly.

public FetchSourceContext(StreamInput in) throws IOException {
    fetchSource = in.readBoolean();
    includes = in.readStringArray();
    excludes = in.readStringArray();
}
Public API Exposure

parseSourceObject is now a public static method, which exposes an internal parsing detail as a public API. This seems unintentional and may cause issues if callers invoke it with a non-START_OBJECT token. It was previously private logic inside fromXContent.

public static FetchSourceContext parseSourceObject(XContentParser parser) throws IOException {
    XContentParser.Token token = parser.currentToken();
    Set<String> includes = Collections.emptySet();
    Set<String> excludes = Collections.emptySet();
    String currentFieldName = null;
    if (token != XContentParser.Token.START_OBJECT) {
        throw new ParsingException(
            parser.getTokenLocation(),
            "Expected a " + XContentParser.Token.START_OBJECT + " but got a " + token + " in [" + parser.currentName() + "]."
        );
    }
    while ((token = parser.nextToken()) != XContentParser.Token.END_OBJECT) {
        if (token == XContentParser.Token.FIELD_NAME) {
            currentFieldName = parser.currentName();
            continue; // only field name is required in this iteration
        }
        if (currentFieldName == null) {
            throw new ParsingException(
                parser.getTokenLocation(),
                "Expected a field name but got a " + token + " in [" + parser.currentName() + "]."
            );
        }
        // process field value
        switch (token) {
            case XContentParser.Token.START_ARRAY -> {
                if (INCLUDES_FIELD.match(currentFieldName, parser.getDeprecationHandler())) {
                    includes = parseSourceFieldArray(parser, INCLUDES_FIELD, excludes);
                } else if (EXCLUDES_FIELD.match(currentFieldName, parser.getDeprecationHandler())) {
                    excludes = parseSourceFieldArray(parser, EXCLUDES_FIELD, includes);
                } else {
                    throw new ParsingException(
                        parser.getTokenLocation(),
                        "Unknown key for a " + token + " in [" + currentFieldName + "]."
                    );
                }
            }
            case XContentParser.Token.VALUE_STRING -> {
                if (INCLUDES_FIELD.match(currentFieldName, parser.getDeprecationHandler())) {
                    String includeEntry = parser.text();
                    if (excludes.contains(includeEntry)) {
                        throw new ParsingException(parser.getTokenLocation(), AMBIGUOUS_FIELD_MESSAGE, includeEntry);
                    }
                    includes = Collections.singleton(includeEntry);
                } else if (EXCLUDES_FIELD.match(currentFieldName, parser.getDeprecationHandler())) {
                    String excludeEntry = parser.text();
                    if (includes.contains(excludeEntry)) {
                        throw new ParsingException(parser.getTokenLocation(), AMBIGUOUS_FIELD_MESSAGE, excludeEntry);
                    }
                    excludes = Collections.singleton(excludeEntry);
                } else {
                    throw new ParsingException(
                        parser.getTokenLocation(),
                        "Unknown key for a " + token + " in [" + currentFieldName + "]."
                    );
                }
            }
            default -> {
                throw new ParsingException(parser.getTokenLocation(), "Unknown key for a " + token + " in [" + currentFieldName + "].");
            }
        }
    }
    if (includes.isEmpty() && excludes.isEmpty()) {
        // no valid field names -> empty or unrecognized fields; not allowed
        throw new ParsingException(
            parser.getTokenLocation(),
            "Expected at least one of [" + INCLUDES_FIELD.getPreferredName() + "] or [" + EXCLUDES_FIELD.getPreferredName() + "]"
        );
    }
    return new FetchSourceContext(true, includes.toArray(new String[0]), excludes.toArray(new String[0]));
}
Duplicate Entry

The same changelog entry for _source validation is added twice — once under the first section (line 14) and once under a second section (line 32). One of these entries should be removed.

- Add validation of the `_source` object to reject contradicting and ambiguous requests. ([#20612](https://github.com/opensearch-project/OpenSearch/issues/20612))
Duplicate Deduplication

The test testFetchSourceArray adds "include2" twice and asserts result.includes().length == 2 (no duplicates). However, the old implementation using ArrayList would have kept duplicates. This test implicitly validates that LinkedHashSet deduplication is now applied to top-level arrays, but this behavior change is not documented or explicitly called out as intentional.

public void testFetchSourceArray() throws IOException {
    final XContentBuilder source = XContentFactory.jsonBuilder()
        .startObject()
        .field("_source")
        .startArray()
        .value("include1")
        .value("include2")
        .value("include2")
        .endArray()
        .endObject();
    final XContentParser parser = createSourceParser(source);

    FetchSourceContext result = FetchSourceContext.fromXContent(parser);
    assertTrue(result.fetchSource()); // fetch source
    // validate includes
    assertEquals(2, result.includes().length); // no duplicates
    assertTrue(Arrays.asList(result.includes()).containsAll(Arrays.asList("include1", "include2")));
    // validate no excludes
    assertEquals(0, result.excludes().length);
}
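The deduplication behaviour that testFetchSourceArray asserts can be reproduced in isolation. A minimal sketch, assuming nothing beyond the JDK (the class name DedupDemo is invented for illustration): a LinkedHashSet drops repeated entries while preserving the order in which they were first seen.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Sketch of LinkedHashSet deduplication: repeats are dropped,
// first-seen insertion order is kept.
class DedupDemo {
    static String[] dedupe(String... entries) {
        Set<String> seen = new LinkedHashSet<>(Arrays.asList(entries));
        return seen.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] result = dedupe("include1", "include2", "include2");
        System.out.println(Arrays.toString(result)); // [include1, include2]
    }
}
```

This matches the test's expectation of length 2, which is why the review flags the silent behaviour change relative to the old ArrayList-based parsing.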

@github-actions
Contributor

github-actions bot commented Feb 27, 2026

PR Code Suggestions ✨

Latest suggestions up to 706f30b

Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Validate deserialized instances for ambiguous fields

The StreamInput constructor does not call validateAmbiguousFields(), meaning
deserialized instances can bypass the ambiguity validation. If a node running an
older version serializes a FetchSourceContext with conflicting includes/excludes and
sends it to a node running the new version, the new node will accept it silently.
Add the validation call here as well.

server/src/main/java/org/opensearch/search/fetch/subphase/FetchSourceContext.java [91-95]

 public FetchSourceContext(StreamInput in) throws IOException {
     fetchSource = in.readBoolean();
     includes = in.readStringArray();
     excludes = in.readStringArray();
+    validateAmbiguousFields();
 }
Suggestion importance[1-10]: 6


Why: The StreamInput constructor bypasses validateAmbiguousFields(), which could allow invalid states when deserializing data from older nodes. Adding the validation call is a legitimate correctness concern, though the cross-version scenario is somewhat edge-case.

Low
General
Improve empty object validation accuracy

The validation only checks if both includes and excludes are empty, but it doesn't
account for the case where unrecognized field names were encountered (e.g.,
{"unknown_field": ["val"]}). In that scenario, the sets remain empty and the error
message is misleading. Consider tracking whether any recognized fields were parsed
to provide a more accurate error.

server/src/main/java/org/opensearch/search/fetch/subphase/FetchSourceContext.java [259-264]

 if (includes.isEmpty() && excludes.isEmpty()) {
     // no valid field names -> empty or unrecognized fields; not allowed
     throw new ParsingException(
         parser.getTokenLocation(),
-        "Expected at least one of [" + INCLUDES_FIELD.getPreferredName() + "] or [" + EXCLUDES_FIELD.getPreferredName() + "]"
+        "Expected at least one of [" + INCLUDES_FIELD.getPreferredName() + "] or [" + EXCLUDES_FIELD.getPreferredName() + "] but found none"
     );
 }
Suggestion importance[1-10]: 2


Why: The suggestion changes only the error message text by appending "but found none", which is a very minor improvement. The actual logic issue described (unrecognized fields) is not addressed by the improved_code, and unrecognized fields already throw a ParsingException in the default case of the switch statement before reaching this check.

Low

Previous suggestions

Suggestions up to commit 53120e7
Category | Suggestion | Impact
Possible issue
Validate deserialized data in stream constructor

The deserialization constructor FetchSourceContext(StreamInput in) does not call
validateAmbiguousFields(), unlike the other constructors. This means a
FetchSourceContext deserialized from a stream could bypass the new validation,
potentially allowing contradicting includes/excludes to be reconstructed from
persisted or transported data. Add a call to validateAmbiguousFields() at the end of
this constructor.

server/src/main/java/org/opensearch/search/fetch/subphase/FetchSourceContext.java [91-95]

 public FetchSourceContext(StreamInput in) throws IOException {
     fetchSource = in.readBoolean();
     includes = in.readStringArray();
     excludes = in.readStringArray();
+    validateAmbiguousFields();
 }
Suggestion importance[1-10]: 6


Why: The StreamInput constructor bypasses validateAmbiguousFields(), which could allow invalid FetchSourceContext objects to be reconstructed from serialized data. Adding validation here closes a real consistency gap, though in practice serialized data would have been validated before serialization.

Low
General
Reject duplicate entries within the same array

The parseSourceFieldArray method does not detect duplicate entries within the same
array (e.g., includes: ["field1", "field1"]). While LinkedHashSet silently
deduplicates them, the caller and the FetchSourceContext constructor will receive a
deduplicated array without any warning or error. If the intent is to reject
ambiguous/contradicting requests, duplicate entries within the same array should
either be rejected or at least consistently handled. Consider throwing a
ParsingException when a duplicate entry is detected within the same array.

server/src/main/java/org/opensearch/search/fetch/subphase/FetchSourceContext.java [269-279]

 private static Set<String> parseSourceFieldArray(XContentParser parser, ParseField parseField, Set<String> opposite)
     throws IOException {
     Set<String> sourceArr = new LinkedHashSet<>(); // include or exclude lists, LinkedHashSet preserves the order of fields
     while (parser.nextToken() != XContentParser.Token.END_ARRAY) {
         if (parser.currentToken() == XContentParser.Token.VALUE_STRING) {
             String entry = parser.text();
             if (opposite != null && opposite.contains(entry)) {
                 throw new ParsingException(parser.getTokenLocation(), AMBIGUOUS_FIELD_MESSAGE, entry);
             }
-            sourceArr.add(entry);
+            if (!sourceArr.add(entry)) {
+                throw new ParsingException(
+                    parser.getTokenLocation(),
+                    "Duplicate entry [" + entry + "] in [" + parseField.getPreferredName() + "] of _source."
+                );
+            }
Suggestion importance[1-10]: 4


Why: Silently deduplicating entries within the same array could be surprising to users. However, this is a behavioral change beyond the PR's stated goal of rejecting contradicting includes/excludes, and duplicate entries within a single array are not inherently contradictory.

Low
Improve empty object error message clarity

The check includes.isEmpty() && excludes.isEmpty() will also throw when the object
has unrecognized field names that were silently skipped, but more critically, it
won't throw when only one of the two sets is populated (which is valid). However,
the real issue is that this check fires even when the object contained only
unrecognized field names that were never assigned — the default branch already
throws for unknown tokens, but if a field name is recognized yet has an empty array,
the parseSourceFieldArray method throws first. The logic appears correct for the
intended cases, but the condition should also guard against the case where the
object was parsed but both sets remain as Collections.emptySet() due to unrecognized
field names not caught by the default branch. Consider verifying that the condition
correctly distinguishes between "no fields provided" and "only unrecognized fields
provided" to avoid misleading error messages.

server/src/main/java/org/opensearch/search/fetch/subphase/FetchSourceContext.java [259-265]

 if (includes.isEmpty() && excludes.isEmpty()) {
     // no valid field names -> empty or unrecognized fields; not allowed
     throw new ParsingException(
         parser.getTokenLocation(),
-        "Expected at least one of [" + INCLUDES_FIELD.getPreferredName() + "] or [" + EXCLUDES_FIELD.getPreferredName() + "]"
+        "Expected at least one of [" + INCLUDES_FIELD.getPreferredName() + "] or [" + EXCLUDES_FIELD.getPreferredName() + "] in [_source]"
     );
 }
Suggestion importance[1-10]: 2


Why: The suggestion only adds " in [_source]" to the error message, which is a very minor UX improvement. The improved_code is nearly identical to existing_code with only a cosmetic change, and the existing REST API tests already validate the current message format.

Low
Clarify error message for top-level empty array

When _source is a top-level array (list of include fields), parseSourceFieldArray is
called with null as the opposite parameter, which correctly skips the ambiguity
check. However, the parseSourceFieldArray method throws if the array is empty. This
means _source: [] will throw with the message "Expected at least one value for an
array of [includes]", which is consistent with the REST API test. This is fine, but
note that the INCLUDES_FIELD parse field is passed here even though this is not
inside an object context — the error message will reference "includes" which may be
slightly confusing to users who wrote _source: []. This is a minor UX issue but
worth noting.

server/src/main/java/org/opensearch/search/fetch/subphase/FetchSourceContext.java [171-174]

+case XContentParser.Token.START_ARRAY -> {
+    String[] includes = parseSourceFieldArray(parser, INCLUDES_FIELD, null).toArray(new String[0]);
+    return new FetchSourceContext(true, includes, null);
+}
 
-
Suggestion importance[1-10]: 1


Why: The improved_code is identical to the existing_code, making this a zero-change suggestion. It only asks the user to verify/note a minor UX concern about the error message referencing "includes" for a top-level array context.

Low
Suggestions up to commit 7cb356d
Category | Suggestion | Impact
Possible issue
Validate deserialized objects for ambiguous fields

The StreamInput constructor does not call validateAmbiguousFields(), unlike the
other constructors. This means deserialized FetchSourceContext objects (e.g., from
inter-node communication) bypass the ambiguity validation, potentially allowing
invalid states to propagate through the cluster. The validation should be called
here as well.

server/src/main/java/org/opensearch/search/fetch/subphase/FetchSourceContext.java [91-95]

 public FetchSourceContext(StreamInput in) throws IOException {
     fetchSource = in.readBoolean();
     includes = in.readStringArray();
     excludes = in.readStringArray();
+    validateAmbiguousFields();
 }
Suggestion importance[1-10]: 6


Why: The StreamInput constructor skips validateAmbiguousFields(), which could allow invalid states to propagate during deserialization. However, in practice, data written to StreamOutput would have already been validated at construction time, making this a low-risk but valid defensive improvement.

Low
General
Improve readability of null-coalescing boolean expression

The expression fetchSource == null || fetchSource is a subtle change from the
original fetchSource == null ? true : fetchSource. While logically equivalent, the
new form is less readable and could be confusing to future maintainers. The original
ternary was clearer in intent. Consider reverting to the explicit ternary or using a
named variable for clarity.

server/src/main/java/org/opensearch/search/fetch/subphase/FetchSourceContext.java [156]

-return new FetchSourceContext(fetchSource == null || fetchSource, sourceIncludes, sourceExcludes);
+return new FetchSourceContext(fetchSource == null ? true : fetchSource, sourceIncludes, sourceExcludes);
Suggestion importance[1-10]: 3


Why: The expression fetchSource == null || fetchSource is logically equivalent to fetchSource == null ? true : fetchSource but slightly less readable. This is a minor style improvement with no functional impact.

Low
Clarify empty-object validation logic comment

The check includes.isEmpty() && excludes.isEmpty() only catches the case where both
sets are empty (i.e., an empty object {}). However, if the object contains
unrecognized field names, the sets will also be empty, but the error message says
"Expected at least one of [includes] or [excludes]", which is misleading. More
importantly, a valid object with only excludes and no includes (or vice versa) is
allowed and should not throw. The current logic is correct for the empty-object
case, but consider whether a valid single-field object (e.g., only excludes) should
be accepted — based on the test testFetchSourceAsObject, it should be, so this logic
appears correct. However, the condition should also guard against unrecognized field
names that silently produce empty sets.

server/src/main/java/org/opensearch/search/fetch/subphase/FetchSourceContext.java [259-264]

 if (includes.isEmpty() && excludes.isEmpty()) {
-    // no valid field names -> empty or unrecognized fields; not allowed
+    // no valid field names -> empty object or only unrecognized fields; not allowed
     throw new ParsingException(
         parser.getTokenLocation(),
         "Expected at least one of [" + INCLUDES_FIELD.getPreferredName() + "] or [" + EXCLUDES_FIELD.getPreferredName() + "]"
     );
 }
Suggestion importance[1-10]: 1


Why: The suggestion only changes a comment (adding "or only unrecognized fields" to the comment text), which is a minor documentation improvement with no functional impact. The improved_code is nearly identical to existing_code with only a comment change.

Low
Document silent deduplication behavior in array parsing

Duplicate entries within the same array are silently deduplicated by the
LinkedHashSet, but there is no warning or error for them. While deduplication is
acceptable, the test testFetchSourceArray explicitly asserts that duplicates are
silently removed. However, a duplicate entry that also appears in the opposite set
will throw an ambiguity error, which is correct. The issue is that if opposite is
null (as called from fromXContent for the top-level array case), duplicates are
silently dropped — this is fine. The logic is consistent, but consider documenting
this behavior explicitly.

server/src/main/java/org/opensearch/search/fetch/subphase/FetchSourceContext.java [269-279]

 private static Set<String> parseSourceFieldArray(XContentParser parser, ParseField parseField, Set<String> opposite)
     throws IOException {
-    Set<String> sourceArr = new LinkedHashSet<>(); // include or exclude lists, LinkedHashSet preserves the order of fields
+    Set<String> sourceArr = new LinkedHashSet<>(); // LinkedHashSet preserves insertion order and deduplicates entries
     while (parser.nextToken() != XContentParser.Token.END_ARRAY) {
         if (parser.currentToken() == XContentParser.Token.VALUE_STRING) {
             String entry = parser.text();
             if (opposite != null && opposite.contains(entry)) {
                 throw new ParsingException(parser.getTokenLocation(), AMBIGUOUS_FIELD_MESSAGE, entry);
             }
-            sourceArr.add(entry);
+            sourceArr.add(entry); // duplicate entries within the same array are silently deduplicated
         } else {
Suggestion importance[1-10]: 1


Why: This suggestion only modifies comments to document existing behavior, with no functional code changes. Comment-only improvements have minimal impact on correctness or functionality.

Low
Suggestions up to commit 0b4c2c8
Category | Suggestion | Impact
Possible issue
Add validation to deserialization constructor

The StreamInput constructor does not call validateAmbiguousFields(), unlike the
other constructors. This means deserialized FetchSourceContext objects (e.g., from
inter-node communication) bypass the ambiguity validation. Call
validateAmbiguousFields() in this constructor as well to ensure consistency.

server/src/main/java/org/opensearch/search/fetch/subphase/FetchSourceContext.java [91-95]

 public FetchSourceContext(StreamInput in) throws IOException {
     fetchSource = in.readBoolean();
     includes = in.readStringArray();
     excludes = in.readStringArray();
+    validateAmbiguousFields();
 }
Suggestion importance[1-10]: 7


Why: The StreamInput constructor bypasses validateAmbiguousFields(), creating an inconsistency where deserialized objects skip validation. This is a valid concern for correctness, though in practice the data would have been validated before serialization. Adding the validation call ensures consistency across all construction paths.

Medium
Replace immutable singleton sets with mutable sets

The parseSourceObject method throws an error when both includes and excludes are
empty, but this check uses Collections.emptySet() as the initial value. If a user
provides only unrecognized field names (not includes/excludes), the sets remain
empty and the error is thrown correctly. However, if a user provides only includes
or only excludes, the other set remains Collections.emptySet() — this is fine. But
the check should also guard against the case where both sets are non-empty but were
set to Collections.emptySet() due to logic flow. The current logic is actually
correct, but the condition should be includes.isEmpty() && excludes.isEmpty() which
is already the case. No change needed here — but note that Collections.emptySet() is
immutable, so calling .contains() on it is safe but calling .add() would throw.
Since parseSourceFieldArray returns a new LinkedHashSet, this is fine. The real
issue is that Collections.singleton() (used for VALUE_STRING case) is also immutable
— if the same field appears twice (e.g., two includes string values), the second
assignment would silently overwrite the first without error. Consider using a
mutable set instead of Collections.singleton().

server/src/main/java/org/opensearch/search/fetch/subphase/FetchSourceContext.java [234-252]

-if (includes.isEmpty() && excludes.isEmpty()) {
-    // no valid field names -> empty or unrecognized fields; not allowed
-    throw new ParsingException(
-        parser.getTokenLocation(),
-        "Expected at least one of [" + INCLUDES_FIELD.getPreferredName() + "] or [" + EXCLUDES_FIELD.getPreferredName() + "]"
-    );
+case XContentParser.Token.VALUE_STRING -> {
+    if (INCLUDES_FIELD.match(currentFieldName, parser.getDeprecationHandler())) {
+        String includeEntry = parser.text();
+        if (excludes.contains(includeEntry)) {
+            throw new ParsingException(parser.getTokenLocation(), AMBIGUOUS_FIELD_MESSAGE, includeEntry);
+        }
+        Set<String> newIncludes = new LinkedHashSet<>(includes);
+        newIncludes.add(includeEntry);
+        includes = newIncludes;
+    } else if (EXCLUDES_FIELD.match(currentFieldName, parser.getDeprecationHandler())) {
+        String excludeEntry = parser.text();
+        if (includes.contains(excludeEntry)) {
+            throw new ParsingException(parser.getTokenLocation(), AMBIGUOUS_FIELD_MESSAGE, excludeEntry);
+        }
+        Set<String> newExcludes = new LinkedHashSet<>(excludes);
+        newExcludes.add(excludeEntry);
+        excludes = newExcludes;
+    } else {
+        throw new ParsingException(
+            parser.getTokenLocation(),
+            "Unknown key for a " + token + " in [" + currentFieldName + "]."
+        );
+    }
 }
Suggestion importance[1-10]: 6


Why: The suggestion correctly identifies that Collections.singleton() is immutable and could cause issues if the same field name appears twice in the JSON (the second assignment silently overwrites the first). Using a mutable LinkedHashSet would be more robust. However, in practice, JSON parsers typically don't produce duplicate field names, so this is a minor edge case.

Low
General
Reject duplicate entries within the same array

The parseSourceFieldArray method silently ignores duplicate entries within the same
array (e.g., ["field1", "field1"]) by using a LinkedHashSet. While deduplication is
handled, there is no warning or error for duplicate entries within the same array.
This may hide user mistakes. Consider throwing a ParsingException when a duplicate
is detected within the same array to provide clearer feedback.

server/src/main/java/org/opensearch/search/fetch/subphase/FetchSourceContext.java [269-279]

-private static Set<String> parseSourceFieldArray(XContentParser parser, ParseField parseField, Set<String> opposite)
-    throws IOException {
-    Set<String> sourceArr = new LinkedHashSet<>(); // include or exclude lists, LinkedHashSet preserves the order of fields
-    while (parser.nextToken() != XContentParser.Token.END_ARRAY) {
-        if (parser.currentToken() == XContentParser.Token.VALUE_STRING) {
-            String entry = parser.text();
-            if (opposite != null && opposite.contains(entry)) {
-                throw new ParsingException(parser.getTokenLocation(), AMBIGUOUS_FIELD_MESSAGE, entry);
-            }
-            sourceArr.add(entry);
+String entry = parser.text();
+if (opposite != null && opposite.contains(entry)) {
+    throw new ParsingException(parser.getTokenLocation(), AMBIGUOUS_FIELD_MESSAGE, entry);
+}
+if (!sourceArr.add(entry)) {
+    throw new ParsingException(
+        parser.getTokenLocation(),
+        "Duplicate entry [" + entry + "] in [" + parseField.getPreferredName() + "] of _source."
+    );
+}
Suggestion importance[1-10]: 3


Why: While throwing on duplicates would provide clearer feedback, the current behavior of silently deduplicating via LinkedHashSet is a reasonable and common approach. The test in FetchSourceContextTests.java at line 68-69 even explicitly tests that duplicates are deduplicated, suggesting this is intentional behavior.

Low
Suggestions up to commit 9d97f05
Category | Suggestion | Impact
Possible issue
Fix placeholder format mismatch in error messages

The AMBIGUOUS_FIELD_MESSAGE uses {} as a placeholder (SLF4J/log4j style), but
ParsingException uses String.format-style formatting with %s. This means the error
message thrown from parseSourceFieldArray and parseSourceObject will contain the
literal {} instead of the actual field name. The OpenSearchException used in
validateAmbiguousFields does support {} placeholders, so that path is correct. The
ParsingException calls should use String.format or inline the value directly.

server/src/main/java/org/opensearch/search/fetch/subphase/FetchSourceContext.java [269-279]

-private static Set<String> parseSourceFieldArray(XContentParser parser, ParseField parseField, Set<String> opposite)
-    throws IOException {
-    Set<String> sourceArr = new LinkedHashSet<>(); // include or exclude lists, LinkedHashSet preserves the order of fields
-    while (parser.nextToken() != XContentParser.Token.END_ARRAY) {
-        if (parser.currentToken() == XContentParser.Token.VALUE_STRING) {
-            String entry = parser.text();
-            if (opposite != null && opposite.contains(entry)) {
-                throw new ParsingException(parser.getTokenLocation(), AMBIGUOUS_FIELD_MESSAGE, entry);
-            }
-            sourceArr.add(entry);
+private static final String AMBIGUOUS_FIELD_MESSAGE = "The same entry [%s] cannot be both included and excluded in _source.";
 
+// In validateAmbiguousFields, use OpenSearchException with its own formatting:
+throw new OpenSearchException("The same entry [{}] cannot be both included and excluded in _source.", exclude);
+
+// In parseSourceFieldArray and parseSourceObject, use:
+throw new ParsingException(parser.getTokenLocation(), String.format(AMBIGUOUS_FIELD_MESSAGE, entry));
+
Suggestion importance[1-10]: 8

__

Why: This is a real bug — ParsingException uses String.format-style %s placeholders, not {} style, so the error messages from parseSourceFieldArray and parseSourceObject would contain literal {} instead of the actual field name. The test at line 211 (assertEquals("The same entry [AAA] cannot be both included and excluded in _source.", result.getMessage())) would fail if this bug exists, confirming the issue is critical.

Medium
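The placeholder mismatch is easy to reproduce in isolation. The sketch below (a hypothetical demo class, not code from the PR) shows that `String.format` passes an SLF4J-style `{}` through verbatim while a `%s` specifier is substituted as expected:

```java
public class PlaceholderDemo {
    public static void main(String[] args) {
        // SLF4J/log4j-style template: String.format does not recognize "{}",
        // so the argument is silently ignored and the braces survive.
        String slf4jStyle = "The same entry [{}] cannot be both included and excluded in _source.";
        System.out.println(String.format(slf4jStyle, "title"));
        // %-style template: the argument is substituted as expected.
        String formatStyle = "The same entry [%s] cannot be both included and excluded in _source.";
        System.out.println(String.format(formatStyle, "title"));
    }
}
```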
Validate deserialized instances for ambiguous fields

The StreamInput constructor does not call validateAmbiguousFields(), unlike the
other constructors. This means deserialized instances from the network or storage
could bypass the new validation, potentially allowing contradicting
includes/excludes to be reconstructed without error. The validation should be called
here as well to ensure consistency.

server/src/main/java/org/opensearch/search/fetch/subphase/FetchSourceContext.java [91-95]

 public FetchSourceContext(StreamInput in) throws IOException {
     fetchSource = in.readBoolean();
     includes = in.readStringArray();
     excludes = in.readStringArray();
+    validateAmbiguousFields();
 }
Suggestion importance[1-10]: 7

__

Why: The StreamInput constructor skips validateAmbiguousFields(), creating an inconsistency where deserialized objects bypass the new validation. This is a valid concern for correctness and consistency, though in practice serialized data would have been validated before serialization.

Medium
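For context, here is a runnable stand-in for the validation being discussed. `IllegalArgumentException` substitutes for `OpenSearchException` so the sketch compiles on its own, and the method name mirrors the PR's `validateAmbiguousFields()`; everything else is an assumption for illustration:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class AmbiguityCheckDemo {
    // Stand-in for the PR's validateAmbiguousFields(): reject any entry
    // present in both the includes and excludes arrays.
    static void validateAmbiguousFields(String[] includes, String[] excludes) {
        Set<String> included = new HashSet<>(Arrays.asList(includes));
        for (String exclude : excludes) {
            if (included.contains(exclude)) {
                throw new IllegalArgumentException(
                    "The same entry [" + exclude + "] cannot be both included and excluded in _source.");
            }
        }
    }

    public static void main(String[] args) {
        // Disjoint lists pass; an overlapping entry is rejected.
        validateAmbiguousFields(new String[] { "title" }, new String[] { "text" });
        try {
            validateAmbiguousFields(new String[] { "title", "text" }, new String[] { "text" });
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Calling this from the `StreamInput` constructor as well, as the suggestion proposes, would make all construction paths consistent.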
General
Avoid unnecessarily exposing internal parsing method

The check includes.isEmpty() && excludes.isEmpty() correctly throws only when neither field was provided: a valid object with only excludes (no includes) leaves includes as Collections.emptySet() and is still accepted, so the && semantics are right and the error message "Expected at least one of [includes] or [excludes]" is accurate. No change is needed for that logic. The actual concern is that the parseSourceObject method is now public, which exposes internal parsing
logic unnecessarily. Consider making it package-private or keeping
it private.

server/src/main/java/org/opensearch/search/fetch/subphase/FetchSourceContext.java [198]

-if (includes.isEmpty() && excludes.isEmpty()) {
-    // no valid field names -> empty or unrecognized fields; not allowed
-    throw new ParsingException(
-        parser.getTokenLocation(),
-        "Expected at least one of [" + INCLUDES_FIELD.getPreferredName() + "] or [" + EXCLUDES_FIELD.getPreferredName() + "]"
-    );
-}
+static FetchSourceContext parseSourceObject(XContentParser parser) throws IOException {
Suggestion importance[1-10]: 3

__

Why: Making parseSourceObject package-private instead of public is a minor visibility concern. However, the test file FetchSourceContextTests.java directly calls FetchSourceContext.parseSourceObject(parser) at line 279, so changing visibility would break the test. The suggestion's improved_code doesn't address this dependency.

Low
Suggestions up to commit 68f6ccd
Category | Suggestion | Impact
Possible issue
Validate deserialized instances for ambiguous fields

The StreamInput constructor does not call validateAmbiguousFields(), meaning
deserialized instances bypass the new validation. If a node on an older version
serializes a FetchSourceContext with conflicting includes/excludes and sends it to a
newer node, the validation will be silently skipped. Add the validation call here as
well.

server/src/main/java/org/opensearch/search/fetch/subphase/FetchSourceContext.java [91-95]

 public FetchSourceContext(StreamInput in) throws IOException {
     fetchSource = in.readBoolean();
     includes = in.readStringArray();
     excludes = in.readStringArray();
+    validateAmbiguousFields();
 }
Suggestion importance[1-10]: 6

__

Why: The StreamInput constructor bypasses validateAmbiguousFields(), which could allow invalid deserialized instances to skip the new validation. Adding the validation call here ensures consistency across all construction paths, though the practical risk is limited since conflicting fields would typically be caught at the source.

Low
Verify empty-object validation allows excludes-only input

The validation that rejects an empty object fires even when the object contained
unrecognized field names (which already threw a ParsingException). More critically,
a valid object with only excludes (no includes) will be incorrectly rejected. The
check should allow either includes or excludes to be non-empty independently, which
is already the case, but the error message implies both are required. Consider
verifying the logic is correct for the case where only excludes is provided.

server/src/main/java/org/opensearch/search/fetch/subphase/FetchSourceContext.java [259-266]

 if (includes.isEmpty() && excludes.isEmpty()) {
     // no valid field names -> empty or unrecognized fields; not allowed
     throw new ParsingException(
         parser.getTokenLocation(),
         "Expected at least one of [" + INCLUDES_FIELD.getPreferredName() + "] or [" + EXCLUDES_FIELD.getPreferredName() + "]"
     );
 }
+return new FetchSourceContext(true, includes.toArray(new String[0]), excludes.toArray(new String[0]));
Suggestion importance[1-10]: 2

__

Why: The suggestion asks to verify the logic is correct for excludes-only input, but the existing code already handles this correctly: the check includes.isEmpty() && excludes.isEmpty() only throws when both are empty. The improved_code just adds the return statement that already exists on line 266, making this suggestion inaccurate about the actual issue.

Low
General
Improve error message for empty top-level source array

When _source is specified as a top-level array (not inside an object),
parseSourceFieldArray is called with INCLUDES_FIELD and null as the opposite
argument. In this case, duplicate entries are silently deduplicated by the
LinkedHashSet. This is fine, but the parseField parameter is only used in the
empty-array error message, so passing INCLUDES_FIELD for a top-level array is
misleading. This is a minor concern, but the real issue is that the empty-array
check for a top-level _source: [] will produce the message "Expected at least one
value for an array of [includes]" which may confuse users since includes was not
explicitly used.

server/src/main/java/org/opensearch/search/fetch/subphase/FetchSourceContext.java [269-279]

-private static Set<String> parseSourceFieldArray(XContentParser parser, ParseField parseField, Set<String> opposite)
-    throws IOException {
-    Set<String> sourceArr = new LinkedHashSet<>(); // include or exclude lists, LinkedHashSet preserves the order of fields
-    while (parser.nextToken() != XContentParser.Token.END_ARRAY) {
+case XContentParser.Token.START_ARRAY -> {
+    if (parser.nextToken() == XContentParser.Token.END_ARRAY) {
+        throw new ParsingException(
+            parser.getTokenLocation(),
+            "Expected at least one value for [_source] array"
+        );
+    }
+    Set<String> includes = new LinkedHashSet<>();
+    do {
         if (parser.currentToken() == XContentParser.Token.VALUE_STRING) {
-            String entry = parser.text();
-            if (opposite != null && opposite.contains(entry)) {
-                throw new ParsingException(parser.getTokenLocation(), AMBIGUOUS_FIELD_MESSAGE, entry);
-            }
-            sourceArr.add(entry);
+            includes.add(parser.text());
+        } else {
+            throw new ParsingException(
+                parser.getTokenLocation(),
+                "Unknown key for a " + parser.currentToken() + " in [" + parser.currentName() + "]."
+            );
+        }
+    } while (parser.nextToken() != XContentParser.Token.END_ARRAY);
+    return new FetchSourceContext(true, includes.toArray(new String[0]), null);
+}
Suggestion importance[1-10]: 3

__

Why: The suggestion raises a valid UX concern about the misleading error message "Expected at least one value for an array of [includes]" when _source: [] is used at the top level. However, the improved_code replaces the wrong section (it modifies fromXContent switch case instead of parseSourceFieldArray), making it inaccurate relative to the existing_code snippet.

Low

@github-actions
Contributor

❌ Gradle check result for c2718e8: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

Persistent review updated to latest commit 418c9b7

@github-actions
Contributor

Persistent review updated to latest commit 8f222c4

@github-actions
Contributor

Persistent review updated to latest commit 49d359c

@gaobinlong
Contributor

Is the version in the skip block not an OpenSearch version, but rather the qa:mixed-cluster one?
Where can I find the latest version for the YAML tests?

We should use - 3.5.99 for now since the current latest version in main branch is 3.6.0 now.

@urmichm urmichm force-pushed the 20612-source-validation branch from fe8b28f to 7cb356d Compare April 1, 2026 15:33
@github-actions
Contributor

github-actions bot commented Apr 1, 2026

Persistent review updated to latest commit 7cb356d

@urmichm urmichm force-pushed the 20612-source-validation branch from 7cb356d to 53120e7 Compare April 1, 2026 15:39
@github-actions
Contributor

github-actions bot commented Apr 1, 2026

Persistent review updated to latest commit 53120e7

@urmichm
Contributor Author

urmichm commented Apr 1, 2026

thank you @gaobinlong !
tests updated, please have a look when you have a chance

@github-actions
Contributor

github-actions bot commented Apr 1, 2026

❌ Gradle check result for 53120e7: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@gaobinlong
Contributor

❌ Gradle check result for 53120e7: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@urmichm, sorry for that. I think we need to change the skip version to 3.6.99, as the 3.6.0 release is code frozen and we won't catch that train; the current latest version has been changed to 3.7.0, which is why these YAML rest tests failed.

@gaobinlong
Contributor

gaobinlong commented Apr 2, 2026

And please merge the latest main branch into your branch, thanks!


package org.opensearch.search.fetch.subphase;

import org.opensearch.OpenSearchException;
Member


Could you keep this PR as a pure refactor. And raise another PR for the change? That would be very helpful to the reviewer. I'm sure pure refactor will be merged quickly :)

Regarding the change, I notice you already think we cannot introduce breaking behavior for 3.x here, which I agree!

Unfortunately we don't have a 4.x branch so we cannot get the breaking change in. But at least we can use the deprecation log to capture the effort being spent here. At the time of 4.0, we should do a scan of all the deprecation logs and change to exceptions.

cc: @andrross

Member


Example

} else if (INDICES_BOOST_FIELD.match(currentFieldName, parser.getDeprecationHandler())) {
deprecationLogger.deprecate(
"indices_boost_object_format",
"Object format in indices_boost is deprecated, please use array format instead"
);
while ((token = parser.nextToken()) != XContentParser.Token.END_OBJECT) {

} else if (INDICES_BOOST_FIELD.match(currentFieldName, parser.getDeprecationHandler())) {
while ((token = parser.nextToken()) != XContentParser.Token.END_ARRAY) {
indexBoosts.add(new IndexBoost(parser));
}

Contributor Author


I have found your suggestion of splitting changes into multiple PRs very useful :)

I have created an MR for refactoring only #21086
Please have a look :)

I will update this PR (or create a new one) once the refactoring is merged

cc: @andrross

@urmichm
Contributor Author

urmichm commented Apr 2, 2026

@gaobinlong @bowenlan-amzn Thank you for the feedback!
I convert the PR to Draft and update/refactor the changes!

@urmichm urmichm marked this pull request as draft April 2, 2026 11:09
urmichm and others added 15 commits April 2, 2026 13:39
Signed-off-by: Mikhail Urmich <m.urmich@jobware.de>
Signed-off-by: Mikhail Urmich <m.urmich@jobware.de>
Signed-off-by: Mikhail Urmich <m.urmich@jobware.de>
extract array parsing as its own function

Signed-off-by: Mikhail Urmich <m.urmich@jobware.de>
parseSourceObject: split key-value process into different code-blocks

Signed-off-by: Mikhail Urmich <urmich.m@gmail.com>
include and exclude collections must not contain the same entries

Signed-off-by: Mikhail Urmich <urmich.m@gmail.com>
The same entry MUST NOT be present in both includes AND excludes arrays.
This leads to ambiguity.

Signed-off-by: Mikhail Urmich <m.urmich@jobware.de>
Explicitly defined source object '_source={}' not allowed

Signed-off-by: Mikhail Urmich <m.urmich@jobware.de>
Signed-off-by: Mikhail Urmich <m.urmich@jobware.de>
Signed-off-by: Mikhail Urmich <m.urmich@jobware.de>
Signed-off-by: Mikhail Urmich <urmich.m@gmail.com>
Signed-off-by: Mikhail Urmich <m.urmich@jobware.de>
empty object validation updated to cover all possible cases

Signed-off-by: Mikhail Urmich <m.urmich@jobware.de>
toXContent to create valid object

Signed-off-by: Mikhail Urmich <urmich.m@gmail.com>
Signed-off-by: Mikhail Urmich <m.urmich@jobware.de>
@urmichm urmichm force-pushed the 20612-source-validation branch from 53120e7 to 706f30b Compare April 2, 2026 11:39
@github-actions
Contributor

github-actions bot commented Apr 2, 2026

Persistent review updated to latest commit 706f30b

@github-actions
Contributor

github-actions bot commented Apr 2, 2026

❌ Gradle check result for 706f30b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


Labels

enhancement (Enhancement or improvement to existing feature or request), good first issue (Good for newcomers), Search:Query Insights

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature Request] [BUG] _source include / exclude validation

5 participants