ceph-mon: Improve efficiency of upmap cleanup in osdmaps for large clusters#66204

Merged
rzarzynski merged 1 commit into ceph:main from apataki:73795-cephmon-cleanupmap on Dec 15, 2025
Conversation


@apataki apataki commented Nov 11, 2025

ceph-mon: Improve efficiency of upmap cleanup in osdmaps for large clusters

OSDMap::check_pg_upmaps is called within ceph-mon when the state of PGs changes as backfills complete: a job is scheduled to review and clean up upmaps, which ultimately calls OSDMap::check_pg_upmaps.
The function loops over all PGs to be checked, and for each PG builds a map (weight_map) from osd_id to crush weight for the OSDs under the PG's crush root. Doing this for every PG is expensive, and the code attempts to cache the data outside the loop (in rule_weight_map); however, the cache is ineffective because the large weight maps are still copied on every iteration (weight_map is a local variable).
This PR changes weight_map to a reference into the cache, avoiding the copy on every iteration.
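The caching pattern can be sketched as follows. This is a simplified, self-contained sketch, not the actual OSDMap code: rule_weight_map mirrors the cache described above, while lookup_weight_map, populate_weights, and the sample weights are illustrative stand-ins.

```cpp
#include <cassert>
#include <map>

// crush_rule -> (osd_id -> crush weight); mirrors the cache described above.
std::map<int, std::map<int, float>> rule_weight_map;

// Hypothetical stand-in for crush->get_rule_weight_osd_map().
void populate_weights(int crush_rule, std::map<int, float> *out) {
  (*out)[0] = 1.0f;
  (*out)[1] = 1.0f;
}

const std::map<int, float>& lookup_weight_map(int crush_rule) {
  auto it = rule_weight_map.find(crush_rule);
  if (it == rule_weight_map.end()) {
    // Cache miss: insert an empty map once per rule, then fill it in place.
    it = rule_weight_map.emplace(crush_rule, std::map<int, float>()).first;
    populate_weights(crush_rule, &it->second);
  }
  // Returning a reference into the cache is the key point: before the fix,
  // the cached map was copied into a local variable on every PG iteration.
  return it->second;
}
```

A caller that binds `const std::map<int, float> &weight_map = lookup_weight_map(rule);` per PG then pays the population cost only once per rule.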

Fixes: https://tracker.ceph.com/issues/73795

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@apataki apataki requested a review from a team as a code owner November 11, 2025 21:13
@github-actions github-actions bot added the core label Nov 11, 2025
    auto it = rule_weight_map.find(crush_rule);
    if (it == rule_weight_map.end()) {
      auto r = crush->get_rule_weight_osd_map(crush_rule, &weight_map);
      it = rule_weight_map.emplace(crush_rule, map<int, float>()).first;


Nit: According to the C++ documentation, emplace still constructs a temporary map even if the key already exists in rule_weight_map. For large numbers of iterations, could this behavior lead to noticeable CPU overhead? I’m wondering if using try_emplace might save a meaningful amount of CPU in this case.

Contributor Author

This line is executed once per crush root (not once per PG), i.e. only a handful of times, and inserts an empty weight_map into rule_weight_map. The cost of populating the weights (the get_rule_weight_osd_map call that follows) is orders of magnitude higher than copying an empty map once. Also, the key is guaranteed not to exist at this point; that is exactly what the preceding if statement checks.

@dvanders
Contributor

Great work @apataki. I've recently seen a 7000+ OSD cluster that needed mon_lease = 10 in order to keep quorum. This was probably the root cause.

    } else {
      weight_map = it->second;
    }
    const map<int, float> &weight_map = it->second;
Member


I would suggest renaming 'it' to 'rule_entry' or something more descriptive, as it was difficult to parse out what exactly the iterator was referencing. Now I understand that it refers to the full rule entry, including the rule id and its associated weight map.

I've rewritten it here in full:

    // below we check against crush-topology changing..
    auto rule_entry = rule_weight_map.find(crush_rule);
    if (rule_entry == rule_weight_map.end()) {
      rule_entry = rule_weight_map.emplace(crush_rule, map<int, float>()).first;
      auto r = crush->get_rule_weight_osd_map(crush_rule, &rule_entry->second);
      if (r < 0) {
        lderr(cct) << __func__ << " unable to get crush weight_map for "
                   << "crush_rule " << crush_rule
                   << dendl;
        rule_weight_map.erase(rule_entry);
        continue;
      }
    }
    const map<int, float> &weight_map = rule_entry->second;
    ldout(cct, 10) << __func__ << " pg " << pg
                   << " weight_map " << weight_map
                   << dendl;

Member

@ljflores ljflores Nov 14, 2025


Also, I have another suggestion for readability. It seems redundant to use emplace(), which returns a pair containing:

  1. An iterator to the crush rule entry
  2. A boolean indicating whether the entry was freshly inserted

If the crush entry already exists, my understanding is that rule_weight_map.emplace(crush_rule, map<int, float>()) would return a pair containing the existing entry and false, indicating that it did not create a fresh entry.

Currently we only make use of the iterator and ignore the boolean, yet the boolean is a good indicator of whether or not we still need to populate the weight map.

What if we did something like this and got rid of the iterator?

    // below we check against crush-topology changing..
    // rule_check inserts a fresh entry only if one does not already exist.
    auto rule_check = rule_weight_map.emplace(crush_rule, map<int, float>());
    auto rule_entry = rule_check.first;
    bool new_entry_inserted = rule_check.second;
    if (new_entry_inserted) {
      // Populate empty weight map
      auto r = crush->get_rule_weight_osd_map(crush_rule, &rule_entry->second);
      if (r < 0) {
        lderr(cct) << __func__ << " unable to get crush weight_map for "
                   << "crush_rule " << crush_rule
                   << dendl;
        rule_weight_map.erase(rule_entry);
        continue;
      }
    }
    const map<int, float> &weight_map = rule_entry->second;
    ldout(cct, 10) << __func__ << " pg " << pg
                   << " weight_map " << weight_map
                   << dendl;

I don't think this compromises logic or performance. But please disregard if it does. Definitely the first option I listed is the more conservative option for improving readability.

Contributor Author


I've made the change for renaming 'it' to 'rule_entry' - it definitely makes the code clearer.
The concern I have about the second change is that an empty map<int, float>() is created and destroyed for every PG iteration, adding lots of memory allocations/deallocations, which isn't ideal.
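The cost being discussed can be demonstrated with a small counting type. This is an illustrative sketch, not Ceph code: std::map::emplace evaluates its mapped-value argument before looking up the key, so a temporary is materialized on every call even when the key already exists, whereas try_emplace (or the find() guard used in this PR) constructs the mapped value only when an insertion actually happens.

```cpp
#include <map>

// Count how many times the mapped type is constructed.
// emplace(key, Counted()) builds the temporary argument unconditionally,
// even when the key is already present; try_emplace on an existing key
// constructs nothing (C++17).
static int constructions = 0;

struct Counted {
  Counted() { ++constructions; }
  Counted(const Counted&) { ++constructions; }
  Counted(Counted&&) { ++constructions; }
};

std::map<int, Counted> cache;
```

With a real map<int, float> in place of Counted, that per-call temporary is exactly the allocation/deallocation traffic described above.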

Member


Got it, yeah we don't want to compromise performance. LGTM otherwise!

Don't copy large weight_map objects.

Signed-off-by: Andras Pataki <apataki@flatironinstitute.org>
@apataki apataki force-pushed the 73795-cephmon-cleanupmap branch from 68e5099 to c9cf2e7 Compare November 15, 2025 15:37
Comment on lines -2159 to -2161
      rule_weight_map[crush_rule] = weight_map;
    } else {
      weight_map = it->second;
Contributor


Ouch, that was the problem. How was it found?

@rzarzynski
Contributor

QA results under review.

@Naveenaidu
Contributor

RADOS Approved: https://tracker.ceph.com/issues/73968#note-3

@rzarzynski rzarzynski merged commit a954f89 into ceph:main Dec 15, 2025
13 checks passed
@github-actions

This is an automated message by src/script/redmine-upkeep.py.

I have resolved the following tracker ticket due to the merge of this PR:

No backports are pending for the ticket. If this is incorrect, please update the tracker
ticket and reset to Pending Backport state.

Update Log: https://github.com/ceph/ceph/actions/runs/20243547063

