Skip to content

Report shard counts for ongoing snapshots#78507

Merged
arteam merged 30 commits intoelastic:masterfrom
arteam:report-shard-counts-2
Oct 21, 2021
Merged

Report shard counts for ongoing snapshots#78507
arteam merged 30 commits intoelastic:masterfrom
arteam:report-shard-counts-2

Conversation

@arteam
Copy link
Copy Markdown
Contributor

@arteam arteam commented Sep 30, 2021

For in-progress snapshots we can try to calculate the amount of successful
shards based on the shard state. We can use a similar approach for failures.

Closes #76704

For in-progress snapshots we can try to calculate the amount of successful
shards based on the shard state. We can use a similar approach for failures.

Closes elastic#76704
@arteam arteam marked this pull request as draft September 30, 2021 10:35
@arteam arteam added v8.0.0 :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed Meta label for distributed team. auto-backport Automatically create backport pull requests when merged v7.16.0 labels Sep 30, 2021
@arteam
Copy link
Copy Markdown
Contributor Author

arteam commented Oct 1, 2021

@elasticmachine update branch

@arteam arteam marked this pull request as ready for review October 1, 2021 18:23
@elasticmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

assertThat(snapshotInfo.totalShards(), equalTo(snapshotStatus.getIndices().get("test-idx").getShardsStats().getTotalShards()));
assertThat(
snapshotInfo.successfulShards(),
lessThanOrEqualTo(snapshotStatus.getIndices().get("test-idx").getShardsStats().getDoneShards())
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to relax the condition to lessThanOrEqualTo from equalTo, because the getSnapshot and snapshotStatus API calls happen with a delay, so we can't guarantee that successfulShards and doneShards are actually the same

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These (the less than comparison ones) are the only tests for the new shard count numbers? Shouldn't we have at least one stable+deterministic test that actually makes sure that the counts come out right?
I think these tests would even pass without the changes here right (0 is always less than what the status API returns)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's the crux of the problem. I'm don't know a way to force a pin down the amount of processed shards.
Do you have any ideas?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One way I see here is to set the snapshot pool size to exactly 1 and use a single data node and then run over a couple of shards that all have files to snapshot. Then concurrently set the repository to blocked and wait for block. That way you will have a clearly defined situation as soon as the data node has become blocked. (you have to account for the fact that it takes the data node a short time to report successful shards to master here, a busy assert seems like an easy solution to this)

assertThat(snapshotInfo.totalShards(), equalTo(snapshotStatus.getIndices().get(indexName).getShardsStats().getTotalShards()));
assertThat(
snapshotInfo.successfulShards(),
lessThanOrEqualTo(snapshotStatus.getIndices().get(indexName).getShardsStats().getDoneShards())
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here: I had to relax the condition to lessThanOrEqualTo from equalTo, because the getSnapshot and snapshotStatus API calls happen with a delay, so we can't guarantee that successfulShards and doneShards are actually the same

arteam and others added 2 commits October 4, 2021 18:29
…s/SharedClusterSnapshotRestoreIT.java

Co-authored-by: Tanguy Leroux <tlrx.dev@gmail.com>
…s/SnapshotStatusApisIT.java

Co-authored-by: Tanguy Leroux <tlrx.dev@gmail.com>
@arteam arteam requested a review from tlrx October 4, 2021 16:41
Copy link
Copy Markdown
Member

@tlrx tlrx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but better wait for Armin's feedback given the noise it generated on CI.

@arteam
Copy link
Copy Markdown
Contributor Author

arteam commented Oct 5, 2021

@original-brownbear Can you please take a look? Thanks!

SnapshotState.IN_PROGRESS,
Collections.emptyMap()
);
int successfulShards = 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be really nice if we could stick with just one primary constructor for this class, adding duplication for this kind of complicated constructor isn't great.
Can't we just make it a static constructor method that computes the counts and then invokes one of the existing constructors?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've extracted it to the static method inProgress which calls a constructor of SnapshotInfo

assertThat(snapshotInfo.totalShards(), equalTo(snapshotStatus.getIndices().get("test-idx").getShardsStats().getTotalShards()));
assertThat(
snapshotInfo.successfulShards(),
lessThanOrEqualTo(snapshotStatus.getIndices().get("test-idx").getShardsStats().getDoneShards())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These (the less than comparison ones) are the only tests for the new shard count numbers? Shouldn't we have at least one stable+deterministic test that actually makes sure that the counts come out right?
I think these tests would even pass without the changes here right (0 is always less than what the status API returns)?

@arteam
Copy link
Copy Markdown
Contributor Author

arteam commented Oct 21, 2021

Thanks, Armin! I will do a follow up PR with the MISSING state.

@arteam arteam added v7.16.1 and removed v7.16.0 labels Oct 21, 2021
@original-brownbear
Copy link
Copy Markdown
Contributor

@arteam please fix the missing here :) The test can go into a follow-up, but lets not half-fix things here please.

@arteam
Copy link
Copy Markdown
Contributor Author

arteam commented Oct 21, 2021

Alright, will do!

@arteam
Copy link
Copy Markdown
Contributor Author

arteam commented Oct 21, 2021

@elasticmachine update branch

@arteam arteam merged commit 6f422d8 into elastic:master Oct 21, 2021
@arteam arteam deleted the report-shard-counts-2 branch October 21, 2021 11:04
arteam added a commit to arteam/elasticsearch that referenced this pull request Oct 21, 2021
For in-progress snapshots we can try to calculate the amount of successful
shards based on the shard state. We can use a similar approach for failures.

Closes elastic#76704
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

💚 Backport successful

Status Branch Result
7.16

arteam added a commit that referenced this pull request Oct 21, 2021
For in-progress snapshots we can try to calculate the amount of successful
shards based on the shard state. We can use a similar approach for failures.

Closes #76704
arteam added a commit to arteam/elasticsearch that referenced this pull request Nov 9, 2021
Start asserting snapshots in progress only in case when they reach
a stable state (the first index has finished, the second has been
blocked).

Fixes elastic#79779
Relates elastic#78507
arteam added a commit that referenced this pull request Jan 18, 2022
Start asserting snapshots in progress only in case when they reach
a stable state (the first index has finished, the second has been
blocked).

* Move LARGE_SNAPSHOT_SETTINGS to AbstractSnapshotRestTestCase to be reused
* Check that test-index-2 is blocked
* Be more clear that the 2nd index is blocked

Fixes #79779
Relates #78507
arteam added a commit to arteam/elasticsearch that referenced this pull request Jan 18, 2022
Start asserting snapshots in progress only in case when they reach
a stable state (the first index has finished, the second has been
blocked).

* Move LARGE_SNAPSHOT_SETTINGS to AbstractSnapshotRestTestCase to be reused
* Check that test-index-2 is blocked
* Be more clear that the 2nd index is blocked

Fixes elastic#79779
Relates elastic#78507
arteam added a commit to arteam/elasticsearch that referenced this pull request Jan 18, 2022
Start asserting snapshots in progress only in case when they reach
a stable state (the first index has finished, the second has been
blocked).

* Move LARGE_SNAPSHOT_SETTINGS to AbstractSnapshotRestTestCase to be reused
* Check that test-index-2 is blocked
* Be more clear that the 2nd index is blocked

Fixes elastic#79779
Relates elastic#78507
arteam added a commit that referenced this pull request Jan 18, 2022
Start asserting snapshots in progress only in case when they reach
a stable state (the first index has finished, the second has been
blocked).

* Move LARGE_SNAPSHOT_SETTINGS to AbstractSnapshotRestTestCase to be reused
* Check that test-index-2 is blocked
* Be more clear that the 2nd index is blocked

Fixes #79779
Relates #78507
arteam added a commit that referenced this pull request Jan 19, 2022
…82759)

Start asserting snapshots in progress only in case when they reach
a stable state (the first index has finished, the second has been
blocked).

* Move LARGE_SNAPSHOT_SETTINGS to AbstractSnapshotRestTestCase to be reused
* Check that test-index-2 is blocked
* Be more clear that the 2nd index is blocked

Fixes #79779
Relates #78507
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-backport Automatically create backport pull requests when merged :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >non-issue Team:Distributed Meta label for distributed team. v7.16.0 v8.0.0-beta1

Projects

None yet

7 participants