ceph-mon: Improve efficiency of upmap cleanup in osdmaps for large clusters #66204
rzarzynski merged 1 commit into ceph:main
Conversation
src/osd/OSDMap.cc
Outdated
auto it = rule_weight_map.find(crush_rule);
if (it == rule_weight_map.end()) {
  auto r = crush->get_rule_weight_osd_map(crush_rule, &weight_map);
  it = rule_weight_map.emplace(crush_rule, map<int, float>()).first;
Nit: According to the C++ documentation, emplace still constructs a temporary map even if the key already exists in rule_weight_map. For large numbers of iterations, could this behavior lead to noticeable CPU overhead? I’m wondering if using try_emplace might save a meaningful amount of CPU in this case.
This line is executed once per crush root (not once per PG) - i.e. only a handful of times, and inserts an empty weight_map into rule_weight_map. The cost of populating the weights (the get_rule_weight_osd_map call that follows) is orders of magnitude more than copying an empty map once. Also - the key doesn't exist for sure - that's what the if statement above checks.
Great work @apataki. I've recently seen a 7000+ OSD cluster that needed mon_lease = 10 in order to keep quorum. This was probably the root cause.
src/osd/OSDMap.cc
Outdated
} else {
  weight_map = it->second;
}
const map<int, float> &weight_map = it->second;
I would suggest renaming it to rule_entry or something more descriptive, as it was difficult to work out what exactly it was referencing. Now I understand that it refers to the full rule entry, including the rule id and its associated weight map.
I've rewritten it here in full:
// below we check against crush-topology changing..
auto rule_entry = rule_weight_map.find(crush_rule);
if (rule_entry == rule_weight_map.end()) {
  rule_entry = rule_weight_map.emplace(crush_rule, map<int, float>()).first;
  auto r = crush->get_rule_weight_osd_map(crush_rule, &rule_entry->second);
  if (r < 0) {
    lderr(cct) << __func__ << " unable to get crush weight_map for "
               << "crush_rule " << crush_rule
               << dendl;
    rule_weight_map.erase(rule_entry);
    continue;
  }
}
const map<int, float> &weight_map = rule_entry->second;
ldout(cct, 10) << __func__ << " pg " << pg
               << " weight_map " << weight_map
               << dendl;
Also, I have another suggestion for readability. It seems redundant to combine find() with emplace(), since emplace() already returns a pair containing:
- an iterator to the crush rule entry
- a boolean indicating whether the entry was freshly inserted

If the crush entry already exists, my understanding is that rule_weight_map.emplace(crush_rule, map<int, float>()) would return a pair containing (1) the existing entry and (2) false, indicating that it did not create a fresh entry.
Currently we only make use of (1) and ignore (2), but (2) could be a good indicator of whether or not we need to populate the weight map.
What if we did something like this and got rid of the iterator?
// below we check against crush-topology changing..
// emplace() inserts a fresh entry only if one does not already exist.
auto rule_check = rule_weight_map.emplace(crush_rule, map<int, float>());
auto rule_entry = rule_check.first;
bool new_entry_inserted = rule_check.second;
if (new_entry_inserted) {
  // Populate the empty weight map
  auto r = crush->get_rule_weight_osd_map(crush_rule, &rule_entry->second);
  if (r < 0) {
    lderr(cct) << __func__ << " unable to get crush weight_map for "
               << "crush_rule " << crush_rule
               << dendl;
    rule_weight_map.erase(rule_entry);
    continue;
  }
}
const map<int, float> &weight_map = rule_entry->second;
ldout(cct, 10) << __func__ << " pg " << pg
               << " weight_map " << weight_map
               << dendl;
I don't think this compromises logic or performance, but please disregard if it does. The first option I listed is definitely the more conservative one for improving readability.
I've made the change for renaming 'it' to 'rule_entry' - it definitely makes the code clearer.
The concern I have about the second change is that an empty map<int, float>() is created and destroyed for every PG iteration, adding lots of memory allocations/deallocations, which isn't ideal.
Got it, yeah we don't want to compromise performance. LGTM otherwise!
Don't copy large weight_map objects.

Signed-off-by: Andras Pataki <apataki@flatironinstitute.org>
Force-pushed from 68e5099 to c9cf2e7
rule_weight_map[crush_rule] = weight_map;
} else {
  weight_map = it->second;
Ouch, that was the problem. How was it found?
QA results under review.
RADOS Approved: https://tracker.ceph.com/issues/73968#note-3
This is an automated message by src/script/redmine-upkeep.py. I have resolved the following tracker ticket due to the merge of this PR: No backports are pending for the ticket. If this is incorrect, please update the tracker.

Update Log: https://github.com/ceph/ceph/actions/runs/20243547063
ceph-mon: Improve efficiency of upmap cleanup in osdmaps for large clusters
OSDMap::check_pg_upmaps is called within ceph-mon as the state of PGs changes while backfills complete: a job is scheduled to review and clean up upmaps, which ultimately calls OSDMap::check_pg_upmaps.
The function loops over all PGs to be checked and, for each PG, builds a map from osd_id to crush weight for the OSDs under the PG's crush root (weight_map). Doing this for every PG is expensive, and although the code attempts to cache this data outside the loop (rule_weight_map), the cache isn't effective because the large weight maps are still copied on every iteration (weight_map is a local variable).
This PR changes weight_map to a reference into the cache, avoiding the copy on every iteration.
Fixes: https://tracker.ceph.com/issues/73795
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist
Show available Jenkins commands
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins test ceph-volume all
jenkins test windows
jenkins test rook e2e

You must only issue one Jenkins command per comment. Jenkins does not understand comments with more than one command.