Report shard counts for ongoing snapshots by arteam · Pull Request #78507 · elastic/elasticsearch

arteam · 2021-09-30T10:35:42Z

For in-progress snapshots we can try to calculate the number of successful
shards based on the shard state. We can use a similar approach for failures.

Closes #76704
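
A minimal sketch of the idea in the description, assuming a simplified shard-state model. The names `ShardCountSketch`, `ShardState`, `ShardCounts`, and `countShards` are hypothetical and are not the Elasticsearch classes. Which states count as failures is also an assumption here; `MISSING` is left out of the failure count to mirror the later commit "Make sure MISSING shards are not counted as failures".

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model; NOT the actual Elasticsearch classes.
class ShardCountSketch {

    // Assumed set of per-shard states for an in-progress snapshot.
    enum ShardState { INIT, SUCCESS, FAILED, ABORTED, MISSING }

    // Immutable holder for the derived counts.
    static final class ShardCounts {
        final int total;
        final int successful;
        final int failed;

        ShardCounts(int total, int successful, int failed) {
            this.total = total;
            this.successful = successful;
            this.failed = failed;
        }
    }

    // Derive the counts from the current per-shard states.
    static ShardCounts countShards(List<ShardState> states) {
        int successful = 0;
        int failed = 0;
        for (ShardState state : states) {
            if (state == ShardState.SUCCESS) {
                successful++;
            } else if (state == ShardState.FAILED || state == ShardState.ABORTED) {
                // Assumption: only these two states are treated as failures.
                failed++;
            }
            // INIT and MISSING shards contribute only to the total.
        }
        return new ShardCounts(states.size(), successful, failed);
    }

    public static void main(String[] args) {
        ShardCounts counts = countShards(Arrays.asList(
            ShardState.SUCCESS, ShardState.SUCCESS, ShardState.INIT, ShardState.FAILED, ShardState.MISSING));
        System.out.println(counts.total + " total, " + counts.successful + " successful, " + counts.failed + " failed");
    }
}
```

Because the counts are derived from a point-in-time view of the shard states, two reads taken at different moments can legitimately disagree while the snapshot is still running.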

arteam · 2021-10-01T14:11:52Z

@elasticmachine update branch

elasticmachine · 2021-10-01T18:23:10Z

Pinging @elastic/es-distributed (Team:Distributed)

arteam · 2021-10-01T18:27:01Z

...src/internalClusterTest/java/org/elasticsearch/snapshots/SharedClusterSnapshotRestoreIT.java

+        assertThat(snapshotInfo.totalShards(), equalTo(snapshotStatus.getIndices().get("test-idx").getShardsStats().getTotalShards()));
+        assertThat(
+            snapshotInfo.successfulShards(),
+            lessThanOrEqualTo(snapshotStatus.getIndices().get("test-idx").getShardsStats().getDoneShards())


I had to relax the condition from equalTo to lessThanOrEqualTo because the getSnapshot and snapshotStatus API calls are made at different times, so more shards can finish in between and we can't guarantee that successfulShards and doneShards are the same.

Are these (the less-than comparisons) the only tests for the new shard count numbers? Shouldn't we have at least one stable, deterministic test that actually makes sure that the counts come out right?
I think these tests would pass even without the changes here, right? (0 is always less than or equal to what the status API returns.)

Yeah, that's the crux of the problem. I don't know a way to pin down the number of processed shards.
Do you have any ideas?

Ping @original-brownbear

One way I see here is to set the snapshot pool size to exactly 1, use a single data node, and then snapshot a couple of shards that all have files to snapshot. Then concurrently set the repository to blocked and wait for the block. That way you have a clearly defined situation as soon as the data node has become blocked. (You have to account for the fact that it takes the data node a short time to report successful shards to the master here; a busy assert seems like an easy solution to this.)
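
The setup suggested above can be illustrated with a plain-Java simulation. This is only a sketch under stated assumptions: a single-threaded executor stands in for a snapshot pool of size 1, a latch stands in for the blocked repository, and a delayed task stands in for the data node reporting to the master. The class and method names (`BlockedSnapshotSketch`, `awaitCondition`, `runUntilBlocked`) are hypothetical and none of this uses the Elasticsearch test framework.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Plain-Java simulation of the suggested setup; NOT Elasticsearch test-framework code.
class BlockedSnapshotSketch {

    // Hand-rolled "busy assert": poll until the condition holds or the timeout expires.
    static void awaitCondition(BooleanSupplier condition, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline > 0) {
                throw new AssertionError("condition not met within " + timeoutMillis + " ms");
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
    }

    // Processes shards 0..totalShards-1 on ONE worker; the shard at index blockAt waits on a latch.
    // Returns the number of shards reported as successful once the worker is blocked.
    static int runUntilBlocked(int totalShards, int blockAt) {
        ExecutorService snapshotPool = Executors.newSingleThreadExecutor();
        ScheduledExecutorService reporter = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch blocked = new CountDownLatch(1);          // worker has reached the block
        CountDownLatch repositoryBlock = new CountDownLatch(1);  // never released in this sketch
        AtomicInteger reportedSuccessful = new AtomicInteger();
        try {
            for (int shard = 0; shard < totalShards; shard++) {
                final int current = shard;
                snapshotPool.execute(() -> {
                    if (current == blockAt) {
                        blocked.countDown();
                        try {
                            repositoryBlock.await(); // stays blocked until the pool is shut down
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                            return;
                        }
                    }
                    // Simulate the short delay before a finished shard is reported.
                    reporter.schedule(() -> { reportedSuccessful.incrementAndGet(); }, 20, TimeUnit.MILLISECONDS);
                });
            }
            if (!blocked.await(5, TimeUnit.SECONDS)) {
                throw new AssertionError("worker never reached the block");
            }
            // One worker processes the queue in order, so exactly blockAt shards finished before the block.
            awaitCondition(() -> reportedSuccessful.get() == blockAt, 5000);
            return reportedSuccessful.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            snapshotPool.shutdownNow(); // interrupts the blocked shard and drops queued ones
            reporter.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("reported successful shards: " + runUntilBlocked(5, 3));
    }
}
```

With a single worker the number of finished shards is fixed once the block is reached, so an exact equality check becomes possible; the polling loop only covers the reporting delay.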

arteam · 2021-10-01T18:27:18Z

server/src/internalClusterTest/java/org/elasticsearch/snapshots/SnapshotStatusApisIT.java

+        assertThat(snapshotInfo.totalShards(), equalTo(snapshotStatus.getIndices().get(indexName).getShardsStats().getTotalShards()));
+        assertThat(
+            snapshotInfo.successfulShards(),
+            lessThanOrEqualTo(snapshotStatus.getIndices().get(indexName).getShardsStats().getDoneShards())


Same here: I had to relax the condition from equalTo to lessThanOrEqualTo because the getSnapshot and snapshotStatus API calls are made at different times, so more shards can finish in between and we can't guarantee that successfulShards and doneShards are the same.

tlrx

LGTM, but better wait for Armin's feedback given the noise it generated on CI.

arteam · 2021-10-05T10:59:20Z

@original-brownbear Can you please take a look? Thanks!

original-brownbear · 2021-10-07T09:40:30Z

server/src/main/java/org/elasticsearch/snapshots/SnapshotInfo.java

-            SnapshotState.IN_PROGRESS,
-            Collections.emptyMap()
-        );
+        int successfulShards = 0;


It would be really nice if we could stick with just one primary constructor for this class, adding duplication for this kind of complicated constructor isn't great.
Can't we just make it a static constructor method that computes the counts and then invokes one of the existing constructors?

I've extracted it into the static method inProgress, which calls a constructor of SnapshotInfo.
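
The shape of that refactoring can be sketched as follows. This is an illustration only: `SnapshotSummarySketch` and its fields are hypothetical stand-ins, not the real `SnapshotInfo`, and the per-shard input is reduced to a list of booleans for brevity. The point is that the static factory computes the counts and then delegates to the one primary constructor.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a snapshot summary class; NOT the real SnapshotInfo.
final class SnapshotSummarySketch {
    final String name;
    final String state;
    final int totalShards;
    final int successfulShards;
    final Map<String, Object> userMetadata;

    // The single primary constructor: every creation path funnels through here.
    SnapshotSummarySketch(String name, String state, int totalShards, int successfulShards,
                          Map<String, Object> userMetadata) {
        this.name = name;
        this.state = state;
        this.totalShards = totalShards;
        this.successfulShards = successfulShards;
        this.userMetadata = userMetadata;
    }

    // Static factory for in-progress snapshots: compute the counts first, then delegate.
    // Assumption: each list entry is true when that shard has already finished successfully.
    static SnapshotSummarySketch inProgress(String name, List<Boolean> shardSucceeded) {
        int successfulShards = 0;
        for (boolean succeeded : shardSucceeded) {
            if (succeeded) {
                successfulShards++;
            }
        }
        return new SnapshotSummarySketch(
            name, "IN_PROGRESS", shardSucceeded.size(), successfulShards,
            Collections.<String, Object>emptyMap());
    }

    public static void main(String[] args) {
        SnapshotSummarySketch summary = inProgress("snap-1", Arrays.asList(true, false, true));
        System.out.println(summary.name + ": " + summary.successfulShards + "/" + summary.totalShards);
    }
}
```

Keeping the computation in a factory avoids a second constructor that would duplicate the long parameter list.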

original-brownbear · 2021-10-07T09:43:58Z

...src/internalClusterTest/java/org/elasticsearch/snapshots/SharedClusterSnapshotRestoreIT.java

+        assertThat(snapshotInfo.totalShards(), equalTo(snapshotStatus.getIndices().get("test-idx").getShardsStats().getTotalShards()));
+        assertThat(
+            snapshotInfo.successfulShards(),
+            lessThanOrEqualTo(snapshotStatus.getIndices().get("test-idx").getShardsStats().getDoneShards())


Are these (the less-than comparisons) the only tests for the new shard count numbers? Shouldn't we have at least one stable, deterministic test that actually makes sure that the counts come out right?
I think these tests would pass even without the changes here, right? (0 is always less than or equal to what the status API returns.)

arteam · 2021-10-21T09:39:22Z

Thanks, Armin! I will do a follow-up PR for the MISSING state.

original-brownbear · 2021-10-21T09:43:46Z

@arteam please fix the MISSING state here :) The test can go into a follow-up, but let's not half-fix things here, please.

arteam · 2021-10-21T09:45:08Z

Alright, will do!

arteam · 2021-10-21T09:55:56Z

@elasticmachine update branch

elasticsearchmachine · 2021-10-21T11:05:32Z

💚 Backport successful

Status	Branch	Result
✅	7.16

Start asserting snapshots in progress only in case when they reach a stable state (the first index has finished, the second has been blocked). Fixes elastic#79779 Relates elastic#78507

Start asserting snapshots in progress only in case when they reach a stable state (the first index has finished, the second has been blocked). * Move LARGE_SNAPSHOT_SETTINGS to AbstractSnapshotRestTestCase to be reused * Check that test-index-2 is blocked * Be more clear that the 2nd index is blocked Fixes #79779 Relates #78507

Report shard counts for ongoing snapshots

475c1a5

arteam marked this pull request as draft September 30, 2021 10:35

arteam added the v8.0.0, :Distributed/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs), Team:Distributed (Meta label for distributed team), auto-backport (Automatically create backport pull requests when merged), and v7.16.0 labels Sep 30, 2021

Merge branch 'master' into report-shard-counts-2

e20f15a

elasticmachine and others added 4 commits October 2, 2021 00:11

Merge branch 'master' into report-shard-counts-2

2939143

Soften the checks on the amount of successfully migrated shards

8737da5

Fix checkstyle warning

9a6dab1

Apply spotless

e91c3ed

arteam marked this pull request as ready for review October 1, 2021 18:23

arteam commented Oct 1, 2021

arteam requested review from original-brownbear and tlrx October 4, 2021 08:30

tlrx reviewed Oct 4, 2021

arteam and others added 2 commits October 4, 2021 18:29

Update server/src/internalClusterTest/java/org/elasticsearch/snapshot…

9c90310

…s/SharedClusterSnapshotRestoreIT.java Co-authored-by: Tanguy Leroux <tlrx.dev@gmail.com>

Update server/src/internalClusterTest/java/org/elasticsearch/snapshot…

95b90b9

…s/SnapshotStatusApisIT.java Co-authored-by: Tanguy Leroux <tlrx.dev@gmail.com>

arteam requested a review from tlrx October 4, 2021 16:41

tlrx approved these changes Oct 5, 2021

This was linked to issues Oct 6, 2021

[CI] SharedClusterSnapshotRestoreIT testSnapshotStatus failing #78371

Closed

[CI] ConcurrentSnapshotsIT.testConcurrentRestoreDeleteAndClone #78436

Closed

original-brownbear suggested changes Oct 7, 2021

arteam and others added 2 commits October 7, 2021 12:48

Rename the in-progress constructor to a static method

bb8aab4

Merge branch 'master' into report-shard-counts-2

63605a0

arteam added v7.16.1 and removed v7.16.0 labels Oct 21, 2021

Make sure MISSING shards are not counted as failures

8736ab7

Merge branch 'master' into report-shard-counts-2

fc5fcd6

arteam merged commit 6f422d8 into elastic:master Oct 21, 2021

arteam deleted the report-shard-counts-2 branch October 21, 2021 11:04

arteam mentioned this pull request Oct 21, 2021

[7.16] Report shard counts for ongoing snapshots (#78507) #79614

Merged

original-brownbear mentioned this pull request Oct 26, 2021

[CI] RestGetSnapshotsIT testSortAndPaginateWithInProgress failing #79779

Closed

jakelandis added v8.0.0-beta1 and removed v8.0.0 labels Oct 27, 2021

danhermann added v7.16.0 and removed v7.16.1 labels Oct 27, 2021

This was referenced Oct 27, 2021

[CI] SnapshotStressTestsIT testRandomActivities failing #79718

Closed

Don't Fill Stack Traces in SnapshotShardFailure #80009

Merged

arteam mentioned this pull request Nov 9, 2021

Make testSortAndPaginateWithInProgress test stable #80530

Merged

danhermann added the >non-issue label Dec 3, 2021


7 participants
