Track the count of failed invocations since last successful policy snapshot #88398
Merged
jbaiera merged 8 commits into elastic:master on Jul 12, 2022
Pinging @elastic/es-data-management (Team:Data Management)
Hi @jbaiera, I've created a changelog YAML for you.
dakrone reviewed on Jul 8, 2022
dakrone (Member) left a comment:
This generally looks good to me, but I think we should make it non-null and treat the default missing value as 0 invocations. What do you think?
...gin/core/src/main/java/org/elasticsearch/xpack/core/slm/SnapshotLifecyclePolicyMetadata.java
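One way to read the suggestion above is sketched here. This is a hypothetical illustration, not the code from the PR: the class name `InvocationCount` and the method `fromNullable` are assumptions made for the example. It stores the count as a primitive `long` and maps a missing value to 0.

```java
// Hypothetical sketch; not the actual Elasticsearch implementation.
final class InvocationCount {
    // A primitive long can never be null, so callers need no null checks.
    private final long invocationsSinceLastSuccess;

    private InvocationCount(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("count must be non-negative, got " + value);
        }
        this.invocationsSinceLastSuccess = value;
    }

    // A missing (null) value, for example from state written before the
    // field existed, is treated as 0 invocations.
    static InvocationCount fromNullable(Long parsed) {
        return new InvocationCount(parsed == null ? 0L : parsed);
    }

    long get() {
        return invocationsSinceLastSuccess;
    }
}
```

With this shape, `InvocationCount.fromNullable(null).get()` yields 0, so older state that lacks the field reads the same as a policy that has not failed since its last success.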
@elasticmachine run elasticsearch-ci/docs
weizijun added a commit to weizijun/elasticsearch that referenced this pull request on Jul 13, 2022:
* upstream/master: (38 commits)
  Simplify map copying (elastic#88432)
  Make DiffableUtils.diff implementation agnostic (elastic#88403)
  Ingest: Start separating Metadata from IngestSourceAndMetadata (elastic#88401)
  Move runtime fields base scripts out of scripting fields api package. (elastic#88488)
  Enable TRACE Logging for test and increase timeout (elastic#88477)
  Mute ReactiveStorageIT#testScaleDuringSplitOrClone (elastic#88480)
  Track the count of failed invocations since last successful policy snapshot (elastic#88398)
  Avoid noisy exceptions on data nodes when aborting snapshots (elastic#88476)
  Fix ReactiveStorageDeciderServiceTests testNodeSizeForDataBelowLowWatermark (elastic#88452)
  INFO logging of snapshot restore and completion (elastic#88257)
  unmute test (elastic#88454)
  Updatable API keys - noop check (elastic#88346)
  Corrected an incomplete sentence. (elastic#86542)
  Use consistent shard map type in IndexService (elastic#88465)
  Stop registering TestGeoShapeFieldMapperPlugin in ESIntegTestCase (elastic#88460)
  TSDB: RollupShardIndexer logging improvements (elastic#88416)
  Audit API key ID when create or grant API keys (elastic#88456)
  Bound random negative size test in SearchSourceBuilderTests#testNegativeSizeErrors (elastic#88457)
  Updatable API keys - logging audit trail event (elastic#88276)
  Polish reworked LoggedExec task (elastic#88424)
  ...
# Conflicts:
#   x-pack/plugin/rollup/src/main/java/org/elasticsearch/xpack/rollup/v2/RollupShardIndexer.java
When an automated snapshot fails, the last failure for a policy is captured and stored in the cluster state. We also store the last successful snapshot invocation. We do not, however, track how many invocations have passed between the last successful snapshot and the most recent failure. That count would be helpful for reporting on SLM policy health.
Snapshot lifecycle policies are scheduled using a cron expression rather than a fixed delay, so the time between snapshot attempts can vary. This makes it difficult to select a window of time after which continuous snapshot failure indicates a real problem instead of a transient issue. By including the count of failed invocations since the last success, we can provide health reporting logic that tolerates some transient failures while remaining agnostic of the variable execution times that cron can produce.
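The counting idea described above might be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation in this PR: the class `PolicyFailureTracker`, its method names, and the threshold parameter are all hypothetical, and the PR description does not specify any particular health rule.

```java
// Hypothetical sketch of counting failed invocations since the last success.
final class PolicyFailureTracker {
    // Number of failed invocations since the most recent successful snapshot.
    private long invocationsSinceLastSuccess = 0;

    // A successful snapshot resets the counter.
    void recordSuccess() {
        invocationsSinceLastSuccess = 0;
    }

    // Each failed invocation increments the counter, regardless of how much
    // wall-clock time the cron schedule put between attempts.
    void recordFailure() {
        invocationsSinceLastSuccess++;
    }

    long invocationsSinceLastSuccess() {
        return invocationsSinceLastSuccess;
    }

    // Assumed health rule: unhealthy once the failure streak reaches the
    // threshold. The threshold value is an assumption for illustration.
    boolean isUnhealthy(long failureThreshold) {
        return invocationsSinceLastSuccess >= failureThreshold;
    }
}
```

Because the rule counts invocations rather than elapsed time, it behaves the same for a policy that runs hourly and one that runs weekly: a threshold of, say, 3 tolerates two consecutive transient failures in either case.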