Track the count of failed invocations since last successful policy snapshot by jbaiera · Pull Request #88398 · elastic/elasticsearch

jbaiera · 2022-07-08T20:19:16Z

When an automated snapshot fails, the last failure for the policy is captured and stored in the cluster state. We also store the last successful snapshot invocation. We do not track how many invocations have failed between the last successful snapshot and the most recent failure. That count would be helpful for reporting on SLM policy health.

Snapshot lifecycle policies are scheduled with a cron expression rather than a fixed delay, so the time between snapshot attempts can vary. This makes it difficult to pick a window of time after which continuous snapshot failure indicates a real problem rather than a transient issue. By including the count of failed invocations since the last success, we can provide health reporting logic that tolerates some transient failures while remaining agnostic of the variable execution times that cron can produce.
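A minimal sketch of the idea described above, not the code in this PR: the counter resets on every successful snapshot and increments on every failed one, so a health check can compare it against a threshold without knowing the cron schedule. The class `PolicyInvocationTracker`, its methods, and the threshold parameter are hypothetical names chosen for illustration; only the `invocationsSinceLastSuccess` name comes from the commit titles.

```java
/** Hypothetical illustration only; not the Elasticsearch implementation. */
final class PolicyInvocationTracker {
    // Failed invocations since the last successful snapshot; 0 means "no failures pending".
    private long invocationsSinceLastSuccess = 0L;

    /** A successful snapshot clears the failure streak. */
    void onSuccess() {
        invocationsSinceLastSuccess = 0L;
    }

    /** A failed snapshot extends the failure streak by one invocation. */
    void onFailure() {
        invocationsSinceLastSuccess++;
    }

    long invocationsSinceLastSuccess() {
        return invocationsSinceLastSuccess;
    }

    /**
     * Assumed health rule: unhealthy once the streak reaches the threshold.
     * The threshold counts invocations, so it is independent of how far apart
     * the cron schedule spaces them.
     */
    boolean isUnhealthy(long failureThreshold) {
        return invocationsSinceLastSuccess >= failureThreshold;
    }
}
```

With an assumed threshold of 3, two consecutive failures would still be treated as transient, while a third would flag the policy, whether the policy runs hourly or weekly.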


elasticmachine · 2022-07-08T20:19:19Z

Pinging @elastic/es-data-management (Team:Data Management)

elasticsearchmachine · 2022-07-08T20:19:40Z

Hi @jbaiera, I've created a changelog YAML for you.

dakrone

This generally looks good to me, but I think we should make it non-null and treat the default missing value as 0 invocations, what do you think?

...gin/core/src/main/java/org/elasticsearch/xpack/core/slm/SnapshotLifecyclePolicyMetadata.java
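As a rough illustration of the reviewer's suggestion, the sketch below reads the count as a primitive `long` and treats a missing value as 0 invocations. It is an assumption-laden example, not the change made in `SnapshotLifecyclePolicyMetadata`: the class `InvocationCountReader`, the map-based input, and the key name are all hypothetical.

```java
import java.util.Map;

/** Hypothetical illustration only; not the Elasticsearch parsing code. */
final class InvocationCountReader {
    // Assumed key name, used here only for the example.
    static final String FIELD = "invocations_since_last_success";

    /**
     * Returns the stored count, or 0 when the field is absent (for example,
     * metadata written before the field existed). Callers never see null.
     */
    static long read(Map<String, ?> source) {
        Object raw = source.get(FIELD);
        return raw instanceof Number ? ((Number) raw).longValue() : 0L;
    }
}
```

Defaulting to 0 keeps the value non-null, so consumers do not need a separate "unknown" branch for older metadata.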

jbaiera · 2022-07-11T21:11:16Z

@elasticmachine run elasticsearch-ci/docs

dakrone

LGTM

* upstream/master: (38 commits)
  Simplify map copying (elastic#88432)
  Make DiffableUtils.diff implementation agnostic (elastic#88403)
  Ingest: Start separating Metadata from IngestSourceAndMetadata (elastic#88401)
  Move runtime fields base scripts out of scripting fields api package. (elastic#88488)
  Enable TRACE Logging for test and increase timeout (elastic#88477)
  Mute ReactiveStorageIT#testScaleDuringSplitOrClone (elastic#88480)
  Track the count of failed invocations since last successful policy snapshot (elastic#88398)
  Avoid noisy exceptions on data nodes when aborting snapshots (elastic#88476)
  Fix ReactiveStorageDeciderServiceTests testNodeSizeForDataBelowLowWatermark (elastic#88452)
  INFO logging of snapshot restore and completion (elastic#88257)
  unmute test (elastic#88454)
  Updatable API keys - noop check (elastic#88346)
  Corrected an incomplete sentence. (elastic#86542)
  Use consistent shard map type in IndexService (elastic#88465)
  Stop registering TestGeoShapeFieldMapperPlugin in ESIntegTestCase (elastic#88460)
  TSDB: RollupShardIndexer logging improvements (elastic#88416)
  Audit API key ID when create or grant API keys (elastic#88456)
  Bound random negative size test in SearchSourceBuilderTests#testNegativeSizeErrors (elastic#88457)
  Updatable API keys - logging audit trail event (elastic#88276)
  Polish reworked LoggedExec task (elastic#88424)
  ...
# Conflicts:
# x-pack/plugin/rollup/src/main/java/org/elasticsearch/xpack/rollup/v2/RollupShardIndexer.java

jbaiera added 4 commits July 8, 2022 14:41

Add a count of invocations since the last success on snapshot lifecycle policy metadata.

84a3aec

Add invocations since last success logic to lifecycle task

8858efb

Assert invocationsSinceLastSuccess needs a success present.

145bc34

Include new field in serialization tests.

be0e6f3

jbaiera added the >enhancement, :Data Management/ILM+SLM, and v8.4.0 labels Jul 8, 2022

elasticmachine added the Team:Data Management (obsolete) label Jul 8, 2022

Update docs/changelog/88398.yaml

f93b85b

dakrone reviewed Jul 8, 2022


...gin/core/src/main/java/org/elasticsearch/xpack/core/slm/SnapshotLifecyclePolicyMetadata.java (outdated, resolved)

jbaiera added 3 commits July 11, 2022 14:38

fix setting bug

f72a75b

Use non nullable long value

0503839

Fix xcontent nullity logic

f8975bb

dakrone approved these changes Jul 11, 2022


jbaiera merged commit b790256 into elastic:master Jul 12, 2022

jbaiera deleted the slm-add-invocation-counts branch July 12, 2022 15:26

jbaiera merged 8 commits into elastic:master from jbaiera:slm-add-invocation-counts