[DNM] mgr/balancer: set upmap_max_deviation to 1 #57216
JoshuaGabriel wants to merge 1 commit into ceph:main
Field experience shows that the previous value of 5 leads to skewed OSD utilization. Set it to 1 and update the relevant test.

Fixes: https://tracker.ceph.com/issues/65748
Signed-off-by: Joshua Blanch <joshua.blanch@clyso.com>
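To illustrate why a looser deviation can leave utilization skewed, here is a minimal sketch. It is not the balancer module's code: it assumes equal-weight OSDs and a single pool, so the target is simply the mean PG count, and the helper name `osds_outside_deviation` and the sample PG counts are hypothetical. It only shows the thresholding idea: OSDs whose PG count is within the allowed deviation of the target are treated as balanced and left alone.

```python
from statistics import mean


def osds_outside_deviation(pg_counts, max_deviation):
    """Return {osd_id: deviation} for OSDs whose PG count differs from the
    mean by more than max_deviation.

    Simplifying assumption: every OSD has the same weight, so the target
    PG count is the plain mean. The real balancer is more involved.
    """
    target = mean(pg_counts.values())
    return {
        osd: count - target
        for osd, count in pg_counts.items()
        if abs(count - target) > max_deviation
    }


# Hypothetical PG counts per OSD id (illustrative data, not from a real cluster).
pg_counts = {0: 100, 1: 104, 2: 96, 3: 100}

# With a deviation of 5, no OSD is considered out of balance.
loose = osds_outside_deviation(pg_counts, 5)

# With a deviation of 1, OSDs 1 and 2 are flagged for rebalancing.
tight = osds_outside_deviation(pg_counts, 1)

print("max_deviation=5:", loose)
print("max_deviation=1:", tight)
```

Under these assumptions, an 8-PG spread between the fullest and emptiest OSD goes uncorrected at a deviation of 5, while a deviation of 1 would flag both ends of that spread.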
Hi @ljflores, here's a simple balancer QoL improvement.
ljflores
left a comment
LGTM. I have made this adjustment often in testing.
This PR is under test in https://tracker.ceph.com/issues/65796.
ljflores
left a comment
In talking to @neha-ojha, she mentioned that there may be certain CRUSH rules where a deviation of 1 doesn't result in good mappings. Putting "Request Changes" and DNM on this for now so we can look into it.
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
FYI, this passed RADOS testing (https://tracker.ceph.com/projects/rados/wiki/MAIN#httpstrackercephcomissues65796). However, we are not merging due to the comments above.
@neha-ojha do you have more info?
@dvanders This PR needs further testing to show that this setting works well with the CRUSH configurations currently in the wild. Neha and I discussed that the experiments previously conducted to set the default to 5 (see #32247 (comment) for the history) should be repeated with current osdmaps from the wild. I had collected some of these osdmaps for read balancer testing (ref: https://tracker.ceph.com/issues/53622), and I plan to use them to repeat the past experiments and determine whether this setting is better. If @JoshuaGabriel also wants to provide evidence along those lines, that is fine too. But ultimately this change needs to be further justified against real-world CRUSH configurations.
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!