ceph-mon: Improve efficiency of upmap cleanup in osdmaps for large clusters #66204
rzarzynski merged 1 commit into ceph:main
Conversation
src/osd/OSDMap.cc
Outdated
auto it = rule_weight_map.find(crush_rule);
if (it == rule_weight_map.end()) {
  auto r = crush->get_rule_weight_osd_map(crush_rule, &weight_map);
  it = rule_weight_map.emplace(crush_rule, map<int, float>()).first;
Nit: According to the C++ documentation, emplace still constructs a temporary map even if the key already exists in rule_weight_map. For large numbers of iterations, could this behavior lead to noticeable CPU overhead? I’m wondering if using try_emplace might save a meaningful amount of CPU in this case.
This line is executed once per crush root (not once per PG) - i.e. only a handful of times, and inserts an empty weight_map into rule_weight_map. The cost of populating the weights (the get_rule_weight_osd_map call that follows) is orders of magnitude more than copying an empty map once. Also - the key doesn't exist for sure - that's what the if statement above checks.
Great work @apataki. I've recently seen a 7000+ OSD cluster that needed mon_lease = 10 in order to keep quorum. This was probably the root cause.
src/osd/OSDMap.cc
Outdated
} else {
  weight_map = it->second;
}
const map<int, float> &weight_map = it->second;
I would suggest renaming it to rule_entry or something more descriptive, as it was difficult to work out what exactly it was referencing. Now I understand that it refers to the full rule entry, including the rule id and its associated weight map.
I've rewritten it here in full:
// below we check against crush-topology changing..
auto rule_entry = rule_weight_map.find(crush_rule);
if (rule_entry == rule_weight_map.end()) {
  rule_entry = rule_weight_map.emplace(crush_rule, map<int, float>()).first;
  auto r = crush->get_rule_weight_osd_map(crush_rule, &rule_entry->second);
  if (r < 0) {
    lderr(cct) << __func__ << " unable to get crush weight_map for "
               << "crush_rule " << crush_rule
               << dendl;
    rule_weight_map.erase(rule_entry);
    continue;
  }
}
const map<int, float> &weight_map = rule_entry->second;
ldout(cct, 10) << __func__ << " pg " << pg
               << " weight_map " << weight_map
               << dendl;
Also, I have another suggestion for readability. It seems redundant to combine find() with emplace(), since emplace() already returns a pair containing:
- an iterator to the crush rule entry
- a boolean indicating whether the entry was freshly inserted

If the crush entry already exists, my understanding is that rule_weight_map.emplace(crush_rule, map<int, float>()) would return a pair containing (1) the existing entry and (2) false, indicating that it did not create a fresh entry.
Currently we only make use of (1) and ignore (2), but (2) could be a good indicator of whether or not we need to populate the weight map.
What if we did something like this and got rid of the iterator?
// below we check against crush-topology changing..
// emplace() inserts a fresh entry only if one does not already exist.
auto rule_check = rule_weight_map.emplace(crush_rule, map<int, float>());
auto rule_entry = rule_check.first;
bool new_entry_inserted = rule_check.second;
if (new_entry_inserted) {
  // Populate the empty weight map
  auto r = crush->get_rule_weight_osd_map(crush_rule, &rule_entry->second);
  if (r < 0) {
    lderr(cct) << __func__ << " unable to get crush weight_map for "
               << "crush_rule " << crush_rule
               << dendl;
    rule_weight_map.erase(rule_entry);
    continue;
  }
}
const map<int, float> &weight_map = rule_entry->second;
ldout(cct, 10) << __func__ << " pg " << pg
               << " weight_map " << weight_map
               << dendl;
I don't think this compromises logic or performance, but please disregard if it does. The first option I listed is definitely the more conservative one for improving readability.
I've made the change for renaming 'it' to 'rule_entry' - it definitely makes the code clearer.
The concern I have about the second change is that an empty map<int, float>() is created and destroyed for every PG iteration, adding lots of memory allocations/deallocations, which isn't ideal.
Got it, yeah we don't want to compromise performance. LGTM otherwise!
Don't copy large weight_map objects.

Signed-off-by: Andras Pataki <apataki@flatironinstitute.org>
Force-pushed from 68e5099 to c9cf2e7
rule_weight_map[crush_rule] = weight_map;
} else {
  weight_map = it->second;
Ouch, that was the problem. How was it found?
QA results under review.
RADOS Approved: https://tracker.ceph.com/issues/73968#note-3
This is an automated message by src/script/redmine-upkeep.py. I have resolved the following tracker ticket due to the merge of this PR: No backports are pending for the ticket. If this is incorrect, please update the tracker.

Update Log: https://github.com/ceph/ceph/actions/runs/20243547063
ceph-mon: Improve efficiency of upmap cleanup in osdmaps for large clusters
OSDMap::check_pg_upmaps is called within ceph-mon as the state of PGs changes while backfills complete: a job is scheduled to review and clean up upmaps, which ultimately calls OSDMap::check_pg_upmaps.
The function loops over all PGs to be checked and, for each PG, builds a map from osd_id to crush weight for the OSDs under the PG's crush root (weight_map). Doing this for every PG is expensive, and although the code attempts to cache this data outside the loop (rule_weight_map), the cache isn't effective because the large weight maps are still copied on every iteration (weight_map is a local variable).
This PR changes weight_map to a reference into the cache, avoiding the copy on every iteration.
Fixes: https://tracker.ceph.com/issues/73795
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist
Show available Jenkins commands
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins test ceph-volume all
jenkins test windows
jenkins test rook e2e

You must only issue one Jenkins command per comment. Jenkins does not understand comments with more than one command.