Skip to content

Improve CPS cluster exclusion handling#143488

Merged
richard-dennehy merged 12 commits intoelastic:mainfrom
richard-dennehy:remote-project-exclusion-4
Mar 12, 2026
Merged

Improve CPS cluster exclusion handling#143488
richard-dennehy merged 12 commits intoelastic:mainfrom
richard-dennehy:remote-project-exclusion-4

Conversation

@richard-dennehy
Copy link
Copy Markdown
Contributor

Allow expressions such as *:logs,-linked_project:*.

Assume that missing remote resolution results for a project alias are due to cluster exclusion, unless we have an exception

@richard-dennehy richard-dennehy added >non-issue :Security/Security Security issues without another label labels Mar 3, 2026
@elasticsearchmachine elasticsearchmachine added v9.4.0 serverless-linked Added by automation, don't add manually labels Mar 3, 2026
@richard-dennehy richard-dennehy force-pushed the remote-project-exclusion-4 branch 3 times, most recently from adb38fb to 29c88b2 Compare March 10, 2026 14:52
@richard-dennehy richard-dennehy changed the title WIP handle CPS cluster exclusions Improve CPS cluster exclusion handling Mar 10, 2026
terminalHandler.run();
}, failure -> {
logger.info("failed to resolve indices on remote cluster [" + clusterAlias + "]", failure);
remoteExceptions.put(clusterAlias, failure);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming we get at most one response per cluster alias

Comment on lines -295 to -298
Map<String, ResolvedIndexExpressions.Builder> resolvedRemotelyBuilder = new ConcurrentHashMap<>();
for (String clusterAlias : remoteClusterIndices.keySet()) {
resolvedRemotelyBuilder.put(clusterAlias, ResolvedIndexExpressions.builder());
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pre-populating ResolvedIndexExpressions for each cluster like this conflicts with the error handling added in validate, as we expect missing entries for excluded clusters.

I haven't run into any reason why it needs to be pre-populated like this, beyond the single unit test that I changed

Copy link
Copy Markdown
Contributor

@n1v0lg n1v0lg Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@piergm could you validate that this change is okay? Seems fine to me because the only consumers I'm aware of filter for SUCCESS anyhow but would be good to get a second opinion

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me (and nice simplification), plus we have enough tests to catch any corner cases in Field caps, ESQL, etc that I am confident that the change is ok if the CI is green!

Comment on lines -360 to -362
Map<String, ResolvedIndexExpressions> resolvedRemotely = resolvedRemotelyBuilder.entrySet()
.stream()
.collect(Collectors.toMap(Map.Entry::getKey, e -> e.getValue().build()));
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's a side effect of the expressions being cloned from the response to the map here - I assume we don't need that either

new ActionListenerResponseHandler<>(groupedListener.delegateResponse((l, failure) -> {
logger.info("Error occurred on remote cluster [" + clusterAlias + "]", failure);
l.onResponse(Map.entry(clusterAlias, new SearchPlanningPhaseResolutionResult(null, failure)));
delegate.onResponse(Map.entry(clusterAlias, new SearchPlanningPhaseResolutionResult(null, failure)));
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we need this change, I can't see that it makes a difference; I'm just semi-blindly copying it from the search changes

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this change is fine and still leads to correctly wrapping failure with a Map.entry instead of hard-failing. I want to simplify the listener handling here in a follow up there are more delegate.... calls in this flow than necessary.

public final void onFailure(Exception e) {
ShardSearchFailure f = new ShardSearchFailure(e);
logCCSError(f, clusterAlias, skipOnFailure);
remoteExceptions.put(clusterAlias, e);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need something like this to capture connection errors in MRT=false, but having to provide an empty HashMap in the MRT=true path is something of a code smell

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't come up with anything clean other than CCSActionListener sub-classes and something like an onInnerFailure method -- feels over-engineered though so lets leave as is.

resolvedIndexExpressions,
resolvedRemotely
resolvedRemotely,
Map.of()
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to add tests to cover this lead to the issues I raised with EQL and SQL; I've removed the EQL and SQL tests for now.

As I understand, this is here because the field caps call has lenient index options; I'd much prefer we passed the original index options (but with CPS rewriting disabled) so field caps fails when appropriate, as trying to gather remote exceptions here is potentially awkward.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah lets defer to the analytics team to fix those

@richard-dennehy richard-dennehy marked this pull request as ready for review March 11, 2026 11:13
@elasticsearchmachine elasticsearchmachine added the Team:Security Meta label for security team label Mar 11, 2026
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-security (Team:Security)

Copy link
Copy Markdown
Contributor

@n1v0lg n1v0lg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM -- nice work on this!

new ActionListenerResponseHandler<>(groupedListener.delegateResponse((l, failure) -> {
logger.info("Error occurred on remote cluster [" + clusterAlias + "]", failure);
l.onResponse(Map.entry(clusterAlias, new SearchPlanningPhaseResolutionResult(null, failure)));
delegate.onResponse(Map.entry(clusterAlias, new SearchPlanningPhaseResolutionResult(null, failure)));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this change is fine and still leads to correctly wrapping failure with a Map.entry instead of hard-failing. I want to simplify the listener handling here in a follow up there are more delegate.... calls in this flow than necessary.

resolvedIndexExpressions,
resolvedRemotely
resolvedRemotely,
Map.of()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah lets defer to the analytics team to fix those

public final void onFailure(Exception e) {
ShardSearchFailure f = new ShardSearchFailure(e);
logCCSError(f, clusterAlias, skipOnFailure);
remoteExceptions.put(clusterAlias, e);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't come up with anything clean other than CCSActionListener sub-classes and something like an onInnerFailure method -- feels over-engineered though so lets leave as is.

@richard-dennehy richard-dennehy force-pushed the remote-project-exclusion-4 branch from ce165b0 to 6ec48b0 Compare March 12, 2026 12:15
@richard-dennehy richard-dennehy added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Mar 12, 2026
@richard-dennehy
Copy link
Copy Markdown
Contributor Author

CI is passing, but one of the jobs hasn't reported that to GH - I promise I'm merging on green CI 🙏

@richard-dennehy richard-dennehy merged commit 498aa28 into elastic:main Mar 12, 2026
41 of 42 checks passed
szybia added a commit to szybia/elasticsearch that referenced this pull request Mar 12, 2026
…elocations

* upstream/main: (49 commits)
  CCS logging fixes (elastic#144070)
  Improve CPS cluster exclusion handling (elastic#143488)
  Remove snapshot condition now that node_reduce phase is in non-snapshot builds (elastic#144090)
  Drop deprecation warnings when updating a mapping in the cluster state applier (elastic#143884) (elastic#144040)
  Add ensureGreenAndNoInitializingShards helper (elastic#144044)
  Removed unnecessary applies_to blocks from deprecated query (elastic#144096)
  [CPS] Use single CrossProjectModeDecider instance (elastic#144030)
  Fix ESQL TS requests with LIMIT 0 (elastic#144031)
  ESQL: Remove `create` methods in aggs (elastic#144098)
  ES|QL: Refactor ChangeLimitOperator (elastic#144017)
  Add Paginated Hit Source Tests (elastic#142592)
  Fix test failure not preferred (elastic#144019)
  Remove serialization logic from EIS authorization response (elastic#144021)
  ESQL: CSV schema inference and parsing enhancements (elastic#144050)
  ESQL: Fix incorrectly optimized fork with nullify unmapped_fields (elastic#143030)
  Fix MMR release test using subqueries (elastic#144087)
  Refactoring `UserAgentPlugin` (elastic#140712)
  Drop non-finite samples in Prometheus remote write (elastic#144055)
  [TEST] Wait for internal inference indices to be created in authorization IT (elastic#143885)
  Disable ndjson datasource QA tests in release-tests (elastic#143992)
  ...
michalborek pushed a commit to michalborek/elasticsearch that referenced this pull request Mar 23, 2026
Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >non-issue :Security/Security Security issues without another label serverless-linked Added by automation, don't add manually Team:Security Meta label for security team v9.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants