Skip to content

Fix Snapshot Completion Listener Lost on Master Failover#54286

Merged
original-brownbear merged 4 commits intoelastic:masterfrom
original-brownbear:master-failover-after-finalization-bug
Mar 27, 2020
Merged

Fix Snapshot Completion Listener Lost on Master Failover#54286
original-brownbear merged 4 commits intoelastic:masterfrom
original-brownbear:master-failover-after-finalization-bug

Conversation

@original-brownbear
Copy link
Copy Markdown
Contributor

@original-brownbear original-brownbear commented Mar 26, 2020

If master fails over before (or we run into any other exception) when removing
the snapshot from the CS we must still resolve all the completion listeners for
the snapshot.

Without this change master failing over during a CS update in snapshot finalization will lead to leaking the snapshot listeners in the SnapshotsService listeners map and will snapshot and snapshot delete requests to never be answered on the transport layer + it's a (small) memory leak.

If master fails over before (or we run into any other exception) when removing
the snapshot from the CS we must still resolve all the completion listeners for
the snapshot.
@original-brownbear original-brownbear added >bug :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.0.0 v7.8.0 labels Mar 26, 2020
@elasticmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

Copy link
Copy Markdown
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Ideally, in the future, we will support master failovers on these listeners (as we currently communicate the snapshot as failed even though it might be completed successfully by another master?)

@original-brownbear
Copy link
Copy Markdown
Contributor Author

Thanks Yannick!

Hmpf BwC tests are busted because of #54313

Ideally, in the future, we will support master failovers on these listeners (as we currently communicate the snapshot as failed even though it might be completed successfully by another master?)

Already doing that in the concurrent snapshots branch via cluster state observers :)

@original-brownbear
Copy link
Copy Markdown
Contributor Author

Jenkins test this (seems Jenkins completely locked up mid-way)

@original-brownbear original-brownbear merged commit 1e60848 into elastic:master Mar 27, 2020
@original-brownbear original-brownbear deleted the master-failover-after-finalization-bug branch March 27, 2020 12:18
original-brownbear added a commit that referenced this pull request Mar 27, 2020
…4330)

* Fix Snapshot Completion Listener Lost on Master Failover

If master fails over before (or we run into any other exception) when removing
the snapshot from the CS we must still resolve all the completion listeners for
the snapshot.
original-brownbear added a commit that referenced this pull request Mar 27, 2020
…4332)

* Fix Snapshot Completion Listener Lost on Master Failover

If master fails over before (or we run into any other exception) when removing
the snapshot from the CS we must still resolve all the completion listeners for
the snapshot.
@bpintea bpintea added v7.7.0 and removed v7.7.1 labels Apr 21, 2020
@original-brownbear original-brownbear restored the master-failover-after-finalization-bug branch August 6, 2020 18:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>bug :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v7.7.0 v7.8.0 v8.0.0-alpha1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants