Improve CPS cluster exclusion handling#143488
Conversation
adb38fb to
29c88b2
Compare
| terminalHandler.run(); | ||
| }, failure -> { | ||
| logger.info("failed to resolve indices on remote cluster [" + clusterAlias + "]", failure); | ||
| remoteExceptions.put(clusterAlias, failure); |
There was a problem hiding this comment.
Assuming we get at most one response per cluster alias
| Map<String, ResolvedIndexExpressions.Builder> resolvedRemotelyBuilder = new ConcurrentHashMap<>(); | ||
| for (String clusterAlias : remoteClusterIndices.keySet()) { | ||
| resolvedRemotelyBuilder.put(clusterAlias, ResolvedIndexExpressions.builder()); | ||
| } |
There was a problem hiding this comment.
Pre-populating ResolvedIndexExpressions for each cluster like this conflicts with the error handling added in validate, as we expect missing entries for excluded clusters.
I haven't run into any reason why it needs to be pre-populated like this, beyond the single unit test that I changed
There was a problem hiding this comment.
@piergm could you validate that this change is okay? Seems fine to me because the only consumers I'm aware of filter for SUCCESS anyhow but would be good to get a second opinion
There was a problem hiding this comment.
Looks good to me (and nice simplification), plus we have enough tests to catch any corner cases in Field caps, ESQL, etc that I am confident that the change is ok if the CI is green!
| Map<String, ResolvedIndexExpressions> resolvedRemotely = resolvedRemotelyBuilder.entrySet() | ||
| .stream() | ||
| .collect(Collectors.toMap(Map.Entry::getKey, e -> e.getValue().build())); |
There was a problem hiding this comment.
there's a side effect of the expressions being cloned from the response to the map here - I assume we don't need that either
| new ActionListenerResponseHandler<>(groupedListener.delegateResponse((l, failure) -> { | ||
| logger.info("Error occurred on remote cluster [" + clusterAlias + "]", failure); | ||
| l.onResponse(Map.entry(clusterAlias, new SearchPlanningPhaseResolutionResult(null, failure))); | ||
| delegate.onResponse(Map.entry(clusterAlias, new SearchPlanningPhaseResolutionResult(null, failure))); |
There was a problem hiding this comment.
I'm not sure if we need this change, I can't see that it makes a difference; I'm just semi-blindly copying it from the search changes
There was a problem hiding this comment.
Right, this change is fine and still leads to correctly wrapping failure with a Map.entry instead of hard-failing. I want to simplify the listener handling here in a follow up there are more delegate.... calls in this flow than necessary.
| public final void onFailure(Exception e) { | ||
| ShardSearchFailure f = new ShardSearchFailure(e); | ||
| logCCSError(f, clusterAlias, skipOnFailure); | ||
| remoteExceptions.put(clusterAlias, e); |
There was a problem hiding this comment.
we need something like this to capture connection errors in MRT=false, but having to provide an empty HashMap in the MRT=true path is something of a code smell
There was a problem hiding this comment.
can't come up with anything clean other than CCSActionListener sub-classes and something like an onInnerFailure method -- feels over-engineered though so lets leave as is.
| resolvedIndexExpressions, | ||
| resolvedRemotely | ||
| resolvedRemotely, | ||
| Map.of() |
There was a problem hiding this comment.
Trying to add tests to cover this lead to the issues I raised with EQL and SQL; I've removed the EQL and SQL tests for now.
As I understand, this is here because the field caps call has lenient index options; I'd much prefer we passed the original index options (but with CPS rewriting disabled) so field caps fails when appropriate, as trying to gather remote exceptions here is potentially awkward.
There was a problem hiding this comment.
Yeah lets defer to the analytics team to fix those
|
Pinging @elastic/es-security (Team:Security) |
n1v0lg
left a comment
There was a problem hiding this comment.
LGTM -- nice work on this!
| new ActionListenerResponseHandler<>(groupedListener.delegateResponse((l, failure) -> { | ||
| logger.info("Error occurred on remote cluster [" + clusterAlias + "]", failure); | ||
| l.onResponse(Map.entry(clusterAlias, new SearchPlanningPhaseResolutionResult(null, failure))); | ||
| delegate.onResponse(Map.entry(clusterAlias, new SearchPlanningPhaseResolutionResult(null, failure))); |
There was a problem hiding this comment.
Right, this change is fine and still leads to correctly wrapping failure with a Map.entry instead of hard-failing. I want to simplify the listener handling here in a follow up there are more delegate.... calls in this flow than necessary.
| resolvedIndexExpressions, | ||
| resolvedRemotely | ||
| resolvedRemotely, | ||
| Map.of() |
There was a problem hiding this comment.
Yeah lets defer to the analytics team to fix those
| public final void onFailure(Exception e) { | ||
| ShardSearchFailure f = new ShardSearchFailure(e); | ||
| logCCSError(f, clusterAlias, skipOnFailure); | ||
| remoteExceptions.put(clusterAlias, e); |
There was a problem hiding this comment.
can't come up with anything clean other than CCSActionListener sub-classes and something like an onInnerFailure method -- feels over-engineered though so lets leave as is.
# Conflicts: # server/src/main/java/org/elasticsearch/action/search/TransportOpenPointInTimeAction.java
ce165b0 to
6ec48b0
Compare
|
CI is passing, but one of the jobs hasn't reported that to GH - I promise I'm merging on green CI 🙏 |
…elocations * upstream/main: (49 commits) CCS logging fixes (elastic#144070) Improve CPS cluster exclusion handling (elastic#143488) Remove snapshot condition now that node_reduce phase is in non-snapshot builds (elastic#144090) Drop deprecation warnings when updating a mapping in the cluster state applier (elastic#143884) (elastic#144040) Add ensureGreenAndNoInitializingShards helper (elastic#144044) Removed unnecessary applies_to blocks from deprecated query (elastic#144096) [CPS] Use single CrossProjectModeDecider instance (elastic#144030) Fix ESQL TS requests with LIMIT 0 (elastic#144031) ESQL: Remove `create` methods in aggs (elastic#144098) ES|QL: Refactor ChangeLimitOperator (elastic#144017) Add Paginated Hit Source Tests (elastic#142592) Fix test failure not preferred (elastic#144019) Remove serialization logic from EIS authorization response (elastic#144021) ESQL: CSV schema inference and parsing enhancements (elastic#144050) ESQL: Fix incorrectly optimized fork with nullify unmapped_fields (elastic#143030) Fix MMR release test using subqueries (elastic#144087) Refactoring `UserAgentPlugin` (elastic#140712) Drop non-finite samples in Prometheus remote write (elastic#144055) [TEST] Wait for internal inference indices to be created in authorization IT (elastic#143885) Disable ndjson datasource QA tests in release-tests (elastic#143992) ...
Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
Allow expressions such as
*:logs,-linked_project:*.Assume that missing remote resolution results for a project alias are due to cluster exclusion, unless we have an exception