Report shard counts for ongoing snapshots#78507
Conversation
For in-progress snapshots we can try to calculate the amount of successful shards based on the shard state. We can use a similar approach for failures. Closes elastic#76704
|
@elasticmachine update branch |
|
Pinging @elastic/es-distributed (Team:Distributed) |
| assertThat(snapshotInfo.totalShards(), equalTo(snapshotStatus.getIndices().get("test-idx").getShardsStats().getTotalShards())); | ||
| assertThat( | ||
| snapshotInfo.successfulShards(), | ||
| lessThanOrEqualTo(snapshotStatus.getIndices().get("test-idx").getShardsStats().getDoneShards()) |
There was a problem hiding this comment.
I had to relax the condition to lessThanOrEqualTo from equalTo, because the getSnapshot and snapshotStatus API calls happen with a delay, so we can't guarantee that successfulShards and doneShards are actually the same
There was a problem hiding this comment.
These (the less than comparison ones) are the only tests for the new shard count numbers? Shouldn't we have at least one stable+deterministic test that actually makes sure that the counts come out right?
I think these tests would even pass without the changes here right (0 is always less than what the status API returns)?
There was a problem hiding this comment.
Yeah, that's the crux of the problem. I'm don't know a way to force a pin down the amount of processed shards.
Do you have any ideas?
There was a problem hiding this comment.
One way I see here is to set the snapshot pool size to exactly 1 and use a single data node and then run over a couple of shards that all have files to snapshot. Then concurrently set the repository to blocked and wait for block. That way you will have a clearly defined situation as soon as the data node has become blocked. (you have to account for the fact that it takes the data node a short time to report successful shards to master here, a busy assert seems like an easy solution to this)
| assertThat(snapshotInfo.totalShards(), equalTo(snapshotStatus.getIndices().get(indexName).getShardsStats().getTotalShards())); | ||
| assertThat( | ||
| snapshotInfo.successfulShards(), | ||
| lessThanOrEqualTo(snapshotStatus.getIndices().get(indexName).getShardsStats().getDoneShards()) |
There was a problem hiding this comment.
Same here: I had to relax the condition to lessThanOrEqualTo from equalTo, because the getSnapshot and snapshotStatus API calls happen with a delay, so we can't guarantee that successfulShards and doneShards are actually the same
...src/internalClusterTest/java/org/elasticsearch/snapshots/SharedClusterSnapshotRestoreIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/elasticsearch/snapshots/SnapshotStatusApisIT.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/snapshots/SnapshotInfo.java
Outdated
Show resolved
Hide resolved
…s/SharedClusterSnapshotRestoreIT.java Co-authored-by: Tanguy Leroux <tlrx.dev@gmail.com>
…s/SnapshotStatusApisIT.java Co-authored-by: Tanguy Leroux <tlrx.dev@gmail.com>
tlrx
left a comment
There was a problem hiding this comment.
LGTM, but better wait for Armin's feedback given the noise it generated on CI.
|
@original-brownbear Can you please take a look? Thanks! |
server/src/internalClusterTest/java/org/elasticsearch/snapshots/SnapshotStatusApisIT.java
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/elasticsearch/snapshots/GetSnapshotsIT.java
Show resolved
Hide resolved
| SnapshotState.IN_PROGRESS, | ||
| Collections.emptyMap() | ||
| ); | ||
| int successfulShards = 0; |
There was a problem hiding this comment.
It would be really nice if we could stick with just one primary constructor for this class, adding duplication for this kind of complicated constructor isn't great.
Can't we just make it a static constructor method that computes the counts and then invokes one of the existing constructors?
There was a problem hiding this comment.
I've extracted it to the static method inProgress which calls a constructor of SnapshotInfo
| assertThat(snapshotInfo.totalShards(), equalTo(snapshotStatus.getIndices().get("test-idx").getShardsStats().getTotalShards())); | ||
| assertThat( | ||
| snapshotInfo.successfulShards(), | ||
| lessThanOrEqualTo(snapshotStatus.getIndices().get("test-idx").getShardsStats().getDoneShards()) |
There was a problem hiding this comment.
These (the less than comparison ones) are the only tests for the new shard count numbers? Shouldn't we have at least one stable+deterministic test that actually makes sure that the counts come out right?
I think these tests would even pass without the changes here right (0 is always less than what the status API returns)?
|
Thanks, Armin! I will do a follow up PR with the |
|
@arteam please fix the missing here :) The test can go into a follow-up, but lets not half-fix things here please. |
|
Alright, will do! |
|
@elasticmachine update branch |
For in-progress snapshots we can try to calculate the amount of successful shards based on the shard state. We can use a similar approach for failures. Closes elastic#76704
💚 Backport successful
|
Start asserting snapshots in progress only in case when they reach a stable state (the first index has finished, the second has been blocked). Fixes elastic#79779 Relates elastic#78507
Start asserting snapshots in progress only in case when they reach a stable state (the first index has finished, the second has been blocked). * Move LARGE_SNAPSHOT_SETTINGS to AbstractSnapshotRestTestCase to be reused * Check that test-index-2 is blocked * Be more clear that the 2nd index is blocked Fixes #79779 Relates #78507
Start asserting snapshots in progress only in case when they reach a stable state (the first index has finished, the second has been blocked). * Move LARGE_SNAPSHOT_SETTINGS to AbstractSnapshotRestTestCase to be reused * Check that test-index-2 is blocked * Be more clear that the 2nd index is blocked Fixes elastic#79779 Relates elastic#78507
Start asserting snapshots in progress only in case when they reach a stable state (the first index has finished, the second has been blocked). * Move LARGE_SNAPSHOT_SETTINGS to AbstractSnapshotRestTestCase to be reused * Check that test-index-2 is blocked * Be more clear that the 2nd index is blocked Fixes elastic#79779 Relates elastic#78507
Start asserting snapshots in progress only in case when they reach a stable state (the first index has finished, the second has been blocked). * Move LARGE_SNAPSHOT_SETTINGS to AbstractSnapshotRestTestCase to be reused * Check that test-index-2 is blocked * Be more clear that the 2nd index is blocked Fixes #79779 Relates #78507
…82759) Start asserting snapshots in progress only in case when they reach a stable state (the first index has finished, the second has been blocked). * Move LARGE_SNAPSHOT_SETTINGS to AbstractSnapshotRestTestCase to be reused * Check that test-index-2 is blocked * Be more clear that the 2nd index is blocked Fixes #79779 Relates #78507
For in-progress snapshots we can try to calculate the amount of successful
shards based on the shard state. We can use a similar approach for failures.
Closes #76704