fix rebalancing-tasks bug and added tests#900

Merged
shrinandthakkar merged 2 commits into linkedin:master from shrinandthakkar:fixRebalancingTasksBug
Apr 12, 2022

Conversation

@shrinandthakkar
Collaborator

@shrinandthakkar shrinandthakkar commented Mar 25, 2022

Problem:

Both the LoadBasedPartition and StickyPartition strategies follow the StickyMulticastStrategy to assign datastream tasks across the available instances.

In the StickyMulticastStrategy, the last step of the process is to rebalance the tasks across instances such that the number of tasks across the instances adheres to the imbalance threshold value.

That is, the difference between the max #tasks assigned to any instance and the min #tasks assigned to any other instance should not exceed the imbalance threshold.

The current implementation has a minor bug that, in some cases, fails to achieve that rebalance. The resulting imbalance causes validation failures and thus availability misses in the mirror-making service.


Failure Scenario: Observed in a production service with 6 instances connected to ZooKeeper and 538 tasks in total.

We saw from the production logs that the tasks were distributed across the instances with counts: 89, 89, 89, 89, 93, 89

And given that we enforce an imbalance threshold of 1, the distribution should have been something like:
89, 89, 90, 90, 90, 90

In the rebalance code pasted below:

tasksTotal = 538
minTasksPerInstance = 538 / 6 = 89 (integer division)

Since the minimum number of tasks assigned to any instance (89) already equals minTasksPerInstance, the while loop that performs the rebalance is never entered.

```java
// STEP 3: Trigger rebalance if the task-count difference exceeds the configured threshold
if (newAssignment.get(instancesBySize.get(instancesBySize.size() - 1)).size()
    - newAssignment.get(instancesBySize.get(0)).size() > _imbalanceThreshold) {
  int tasksTotal = newAssignment.values().stream().mapToInt(Set::size).sum();
  int minTasksPerInstance = tasksTotal / instances.size();

  // some rebalance to increase the task count in instances below the minTasksPerInstance
  while (newAssignment.get(instancesBySize.get(0)).size() < minTasksPerInstance) {
    ...
  }
}
```
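To see why the guard above never triggers any rebalancing here, consider a minimal sketch over plain task counts (illustrative names; this is not the actual Brooklin code, which operates on assignment maps):

```java
import java.util.Arrays;

public class RebalanceBugDemo {
    // The buggy loop guard, expressed over plain counts: rebalance only
    // while the smallest assignment is below tasksTotal / instances.
    static boolean buggyGuardFires(int[] counts) {
        int tasksTotal = Arrays.stream(counts).sum();          // 538
        int minTasksPerInstance = tasksTotal / counts.length;  // 538 / 6 = 89 (integer division)
        int min = Arrays.stream(counts).min().getAsInt();      // 89
        return min < minTasksPerInstance;                      // 89 < 89 -> false: loop never runs
    }

    public static void main(String[] args) {
        int[] counts = {89, 89, 89, 89, 93, 89}; // observed production distribution
        int max = Arrays.stream(counts).max().getAsInt();
        int min = Arrays.stream(counts).min().getAsInt();
        System.out.println(max - min);               // 4: exceeds the threshold of 1
        System.out.println(buggyGuardFires(counts)); // false: yet no task is moved
    }
}
```

So the outer if fires (93 - 89 > 1), but the inner while condition is already false, leaving the assignment unbalanced.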

Solution:

Update the rebalancing loop condition so that it terminates only when the min #tasks assigned plus the imbalance threshold is at least the max #tasks assigned; in other words, keep rebalancing while min + threshold < max.
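A simulation sketch of the corrected condition over plain counts (illustrative, not the merged code), moving one task from the fullest to the emptiest instance per iteration:

```java
import java.util.Arrays;

public class FixedRebalanceDemo {
    // Rebalance until min + threshold >= max, one task at a time.
    static int[] rebalance(int[] counts, int threshold) {
        counts = counts.clone();
        while (true) {
            Arrays.sort(counts); // smallest first, largest last
            if (counts[0] + threshold >= counts[counts.length - 1]) {
                break; // within the imbalance threshold: done
            }
            counts[counts.length - 1]--; // take a task from the fullest instance
            counts[0]++;                 // give it to the emptiest instance
        }
        return counts;
    }

    public static void main(String[] args) {
        // Production distribution from the description, threshold of 1
        int[] result = rebalance(new int[]{89, 89, 89, 89, 93, 89}, 1);
        System.out.println(Arrays.toString(result)); // prints [89, 89, 90, 90, 90, 90]
    }
}
```

With the fixed condition the loop runs until the counts settle at 89, 89, 90, 90, 90, 90, matching the expected distribution above.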



Collaborator

@vmaheshw vmaheshw left a comment


Can you improve the description explaining the problem in detail and how are you solving the problem?

@shrinandthakkar
Collaborator Author

> Can you improve the description explaining the problem in detail and how are you solving the problem?

Done


```diff
   // some rebalance to increase the task count in instances below the minTasksPerInstance
-  while (newAssignment.get(instancesBySize.get(0)).size() < minTasksPerInstance) {
+  while (newAssignment.get(instancesBySize.get(0)).size() + _imbalanceThreshold < newAssignment.get(
```
Collaborator


This is a lot to parse in one line. Consider pulling out into a couple vars so the logic is more obvious.
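One way to apply that suggestion is to pull the two set lookups into named locals before the loop guard. A hypothetical sketch (variable and class names are illustrative, not the merged code):

```java
import java.util.*;

public class ReadableConditionDemo {
    // Hypothetical refactor per the review comment: name the two lookups
    // so the guard reads as "min + threshold < max".
    static boolean needsRebalance(Map<String, Set<String>> newAssignment,
                                  List<String> instancesBySize, int imbalanceThreshold) {
        int minAssigned = newAssignment.get(instancesBySize.get(0)).size();
        int maxAssigned = newAssignment.get(instancesBySize.get(instancesBySize.size() - 1)).size();
        return minAssigned + imbalanceThreshold < maxAssigned;
    }

    public static void main(String[] args) {
        // Illustrative assignment: instance -> task names
        Map<String, Set<String>> assignment = new HashMap<>();
        assignment.put("i0", new HashSet<>(Arrays.asList("t1", "t2")));
        assignment.put("i1", new HashSet<>(Arrays.asList("t3", "t4", "t5", "t6")));
        List<String> instancesBySize = new ArrayList<>(assignment.keySet());
        instancesBySize.sort(Comparator.comparingInt((String i) -> assignment.get(i).size()));
        System.out.println(needsRebalance(assignment, instancesBySize, 1)); // true: 2 + 1 < 4
    }
}
```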

@shrinandthakkar shrinandthakkar merged commit b821911 into linkedin:master Apr 12, 2022