
BMM Restart Improvements Part 4. Edge cases, cleanup and general improvements #924

Merged
jzakaryan merged 11 commits into linkedin:master from jzakaryan:bmmRestart4 on Feb 8, 2023



@jzakaryan
Collaborator

This pull request is part of a series of changes meant to improve BMM Restart and make it easier to debug restart failures and trace them to the faulty hosts. Part 4 handles edge cases such as leader failover. It also introduces the following improvements suggested during review of the previous parts:

  • A metric counting the datastreams that are inferred to be stopping has been added to help debug the feature (@vmaheshw's suggestion).
  • The timeout for which the node handling the Rest.li request waits for the leader to mark the datastream as STOPPED has been increased from 60 seconds to 90 seconds. This gives the leader enough time to complete stop propagation, which has a 60-second timeout of its own (@shrinandthakkar's suggestion).
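The timeout relationship in the second bullet can be illustrated with a minimal Java sketch. It assumes the request-handler node simply polls the datastream status until the leader marks it STOPPED; the class, constant, method and status-string names are hypothetical and not the project's actual API. Only the 60-second and 90-second values come from the description above.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch: a request-handler node polls the datastream status until the
// leader marks it STOPPED or the handler's own deadline expires. All names here are
// illustrative assumptions, not the project's real API.
public final class StopRequestWaitSketch {
  // Values from the PR description; the constant names are assumptions.
  static final Duration LEADER_STOP_PROPAGATION_TIMEOUT = Duration.ofSeconds(60);
  static final Duration REQUEST_HANDLER_TIMEOUT = Duration.ofSeconds(90);

  private StopRequestWaitSketch() { }

  // Returns true if the status became "STOPPED" before the timeout elapsed.
  static boolean waitForStopped(Supplier<String> status, Duration timeout, Duration pollInterval)
      throws InterruptedException {
    long deadline = System.nanoTime() + timeout.toNanos();
    while (true) {
      if ("STOPPED".equals(status.get())) {
        return true;
      }
      long remaining = deadline - System.nanoTime();
      if (remaining <= 0) {
        return false;
      }
      // Sleep for the poll interval, but never past the deadline.
      long sleepNanos = Math.min(pollInterval.toNanos(), remaining);
      Thread.sleep(sleepNanos / 1_000_000, (int) (sleepNanos % 1_000_000));
    }
  }

  public static void main(String[] args) throws InterruptedException {
    // The handler's budget must exceed the leader's propagation timeout so the
    // leader has time left to finish the stop before the handler gives up.
    Duration slack = REQUEST_HANDLER_TIMEOUT.minus(LEADER_STOP_PROPAGATION_TIMEOUT);
    System.out.println("Extra time beyond leader propagation: " + slack.getSeconds() + "s");

    // Simulated status that flips to STOPPED on the third poll.
    AtomicInteger polls = new AtomicInteger();
    Supplier<String> status = () -> polls.incrementAndGet() >= 3 ? "STOPPED" : "STOPPING";
    boolean stopped = waitForStopped(status, Duration.ofSeconds(1), Duration.ofMillis(10));
    System.out.println("stopped=" + stopped + " after " + polls.get() + " polls");
  }
}
```

Polling against a single absolute deadline keeps the total wait bounded no matter how long each status read takes. Because the handler's 90-second budget is larger than the leader's 60-second propagation timeout, a stop that the leader completes within its own timeout should be observed before the handler gives up.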

The PRs in this series deal with the following aspects:
Part 1 – Introduction of assignment tokens and support for the leader coordinator to issue them (#919)
Part 2 – Changes to the followers' handleAssignmentChange to make them claim the tokens issued by the leader (#921)
Part 3 – Changes to the leader to make it poll ZooKeeper and wait for the assignment change (stop) to be propagated and executed by the cluster (#922)
Part 4 – Edge cases, cleanup and general improvements (this PR, #924)
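The token handshake described in Parts 1 to 3 can be modelled with a minimal Java sketch. It assumes one token per affected instance and uses an in-memory map in place of ZooKeeper; `AssignmentTokenSketch`, its methods and the instance names are hypothetical, and the real implementation's data layout may differ.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of the assignment-token handshake; an in-memory map stands in
// for ZooKeeper. Names and structure are illustrative assumptions.
public final class AssignmentTokenSketch {
  // datastream name -> instances whose tokens have not been claimed yet
  private final Map<String, Set<String>> outstanding = new ConcurrentHashMap<>();

  // Leader (Part 1): issue one token per instance affected by the stop.
  public void issueTokens(String datastream, List<String> instances) {
    Set<String> tokens = ConcurrentHashMap.newKeySet();
    tokens.addAll(instances);
    outstanding.put(datastream, tokens);
  }

  // Follower (Part 2): claim its token once it has handled the assignment change.
  public void claimToken(String datastream, String instance) {
    Set<String> tokens = outstanding.get(datastream);
    if (tokens != null) {
      tokens.remove(instance);
    }
  }

  // Leader (Part 3): poll until every token is claimed or the timeout elapses.
  // Returns the instances that never claimed their token (empty on success).
  public Set<String> awaitClaims(String datastream, Duration timeout, Duration pollInterval)
      throws InterruptedException {
    long deadline = System.nanoTime() + timeout.toNanos();
    while (true) {
      Set<String> tokens = outstanding.getOrDefault(datastream, Collections.emptySet());
      if (tokens.isEmpty() || System.nanoTime() - deadline >= 0) {
        return Set.copyOf(tokens);
      }
      Thread.sleep(pollInterval.toMillis());
    }
  }

  public static void main(String[] args) throws InterruptedException {
    AssignmentTokenSketch store = new AssignmentTokenSketch();
    store.issueTokens("ds1", List.of("host-a", "host-b", "host-c"));
    store.claimToken("ds1", "host-a");
    store.claimToken("ds1", "host-b");
    // host-c never claims its token, so the leader can report it as the likely faulty host.
    Set<String> unclaimed = store.awaitClaims("ds1", Duration.ofMillis(100), Duration.ofMillis(10));
    System.out.println("Unclaimed tokens: " + unclaimed);
  }
}
```

Returning the set of unclaimed tokens instead of a plain boolean lets the leader name the hosts that did not complete the stop, which is the series' stated goal of tracing restart failures to faulty hosts.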

jzakaryan marked this pull request as draft on January 26, 2023 01:36
jzakaryan marked this pull request as ready for review on January 30, 2023 21:31
jzakaryan requested a review from ehoner on January 31, 2023 19:19
ehoner previously approved these changes on Feb 6, 2023
jzakaryan merged commit a35caa7 into linkedin:master on Feb 8, 2023

4 participants