Bug #73795


ceph-mon high CPU usage and sluggish responsiveness when expanding large clusters

Added by Andras Pataki 4 months ago. Updated 3 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Performance/Resource Usage
Target version:
-
% Done:

0%

Source:
Community (user)
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Monitor
Pull request ID:
Tags (freeform):
Fixed In:
v20.3.0-4607-ga954f89c87
Released In:
Upkeep Timestamp:
2025-12-15T18:46:35+00:00

Description

When expanding a large cluster with 5000+ OSDs, ceph-mon gets very busy as backfills complete and new osdmaps are created: it uses 4 cores (400% CPU in top) and its responsiveness becomes very sluggish. Running 'ceph -s' often takes up to 5 seconds. This is quite repeatable on our large cluster, and there is concern that the monitors could potentially lose quorum.
The cluster this is observed on runs Quincy 17.2.8 and currently has 6050 OSDs and about 67000 PGs.


Files

ceph-mon-wcp-1.txt (197 KB) - Wall clock profiler output - Andras Pataki, 11/11/2025 04:58 PM
Actions #1

Updated by Andras Pataki 4 months ago

Here is an analysis of the problem:

On certain PG state changes (for example when a backfill completes), OSDMonitor::encode_pending() is called
It creates a CleanUpmapJob and runs it in parallel on the CPU threads of the monitor:

    // clean inappropriate pg_upmap/pg_upmap_items (if any)
    {
      // check every upmapped pg for now
      // until we could reliably identify certain cases to ignore,
      // which is obviously the hard part TBD..
    ...
        CleanUpmapJob job(cct, tmp, pending_inc);
        mapper.queue(&job, g_conf()->mon_clean_pg_upmaps_per_chunk, pgs_to_check);
        job.wait();
    ...

This job goes through the PGs with upmaps (which may be many) and calls OSDMap::check_pg_upmaps():

  struct CleanUpmapJob : public ParallelPGMapper::Job {
    CleanUpmapJob(CephContext *cct, const OSDMap& om, OSDMap::Incremental& pi)
      : ParallelPGMapper::Job(&om),
      ...

    void process(const std::vector<pg_t>& to_check) override {
      ...
      osdmap.check_pg_upmaps(cct, to_check, &to_cancel, &to_remap);
      ...
    }

OSDMap::check_pg_upmaps() has a local variable called weight_map that maps osd id to osd weight for all OSDs under the crush root of the current PG:

    map<int, float> weight_map;

This is inside the loop over PGs, so the map is created and destroyed on every loop iteration (every PG checked). On a large cluster it can contain thousands of entries, and there are tens of thousands of PGs to iterate through, i.e. this is O(N^2) work in the size of the cluster (N being the number of OSDs).
What makes this worse is that a large cluster generates O(N) PG state changes during expansion/recovery, so the total work ceph-mon does during a cluster expansion can be as bad as O(N^3) for these cleanups alone.

Note: there is an attempt to cache data in a local cache variable:

  map<int, map<int, float>> rule_weight_map;

to cache the weight_map objects per crush root.
This is ineffective, however, because weight_map is a local copy of one value from that cache: each iteration still copies, and then destroys, a full weight_map.

Actions #2

Updated by Andras Pataki 4 months ago

Attaching profiling output for the process, captured with Mark Nelson's wall clock profiler.
The ceph-mon process in question has mon_cpu_threads=12 (increased from the default 4) in an attempt to alleviate the problem.
Most of the 12 threads are spinning in memory management operations (creating/destroying the STL tree underlying weight_map).

Actions #3

Updated by Andras Pataki 4 months ago

I am working on a fix.

Actions #4

Updated by Laura Flores 4 months ago

  • Status changed from New to In Progress
  • Assignee set to Andras Pataki

@apataki I'm assigning this to you; feel free to assign me as reviewer when you have the fix.

Actions #5

Updated by Radoslaw Zarzynski 4 months ago

  • Priority changed from Normal to High

Let's keep the ticket on top of our bug scrub's queue.

Actions #6

Updated by Andras Pataki 4 months ago

  • Pull request ID set to 66204

Here is a fix that resolves the issue on my cluster:
PR: https://github.com/ceph/ceph/pull/66204

Actions #7

Updated by Neha Ojha 4 months ago

  • Status changed from In Progress to Fix Under Review
Actions #8

Updated by Laura Flores 4 months ago

PR under review.

Actions #9

Updated by Radoslaw Zarzynski 4 months ago · Edited

scrub note: PR went into QA: https://tracker.ceph.com/issues/73968

Actions #10

Updated by Radoslaw Zarzynski 4 months ago

scrub note: still in QA.

Actions #12

Updated by Radoslaw Zarzynski 3 months ago

Merged!

Actions #13

Updated by Radoslaw Zarzynski 3 months ago

Do we need to backport this?

Actions #14

Updated by Upkeep Bot 3 months ago

  • Status changed from Fix Under Review to Resolved
  • Merge Commit set to a954f89c8767cd87d8f455b82fa4d620e237657f
  • Fixed In set to v20.3.0-4607-ga954f89c87
  • Upkeep Timestamp set to 2025-12-15T18:46:35+00:00
Actions #15

Updated by Andras Pataki 3 months ago

Yes - please backport to all branches that currently receive backports. The code is the same in recent releases, so the patch should apply easily (I'm running a Quincy release patched with this same fix).
