Bug #73795
ceph-mon high CPU usage and sluggish responsiveness when expanding large clusters
Status: Closed
Description
When expanding a large cluster with 5000+ OSDs, ceph-mon gets very busy as backfills complete and new osdmaps are created: it uses 4 cores (400% CPU in top) and its responsiveness becomes very sluggish. Running 'ceph -s' often takes up to 5 seconds. This is quite repeatable on our large cluster, and there is concern that the monitors could potentially lose quorum.
The cluster on which this is observed runs Quincy 17.2.8 and currently has 6050 OSDs and about 67000 PGs.
Updated by Andras Pataki 4 months ago
Here is an analysis of the problem:
On certain PG state changes (for example, when a backfill completes), OSDMonitor::encode_pending() is called.
It creates a CleanUpmapJob and runs it in parallel on the monitor's CPU threads:
// clean inappropriate pg_upmap/pg_upmap_items (if any)
{
// check every upmapped pg for now
// until we could reliably identify certain cases to ignore,
// which is obviously the hard part TBD..
...
CleanUpmapJob job(cct, tmp, pending_inc);
mapper.queue(&job, g_conf()->mon_clean_pg_upmaps_per_chunk, pgs_to_check);
job.wait();
...
This job goes through PGs with upmaps (of which there may be many) and calls OSDMap::check_pg_upmaps():
struct CleanUpmapJob : public ParallelPGMapper::Job {
CleanUpmapJob(CephContext *cct, const OSDMap& om, OSDMap::Incremental& pi)
: ParallelPGMapper::Job(&om),
...
void process(const std::vector<pg_t>& to_check) override {
...
osdmap.check_pg_upmaps(cct, to_check, &to_cancel, &to_remap);
...
}
OSDMap::check_pg_upmaps() has a local variable called weight_map that maps osd_id to OSD weight for all OSDs under the crush root of the current PG:
map<int, float> weight_map;
This lives inside the loop over PGs, so the map is created and destroyed on every iteration (for every PG checked). On a large cluster it can contain thousands of items, and there are tens of thousands of PGs to iterate through; i.e., this is O(N^2) work in terms of the size of the cluster (N being the number of OSDs).
What makes this worse is that large clusters see O(N) PG state changes during expansion/recovery, so the total amount of work done by ceph-mon during a cluster expansion can be as bad as O(N^3) just for these cleanups.
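To make the cost concrete, here is a minimal, self-contained sketch of the pattern described above (the names build_weight_map/check_pgs are illustrative, not the actual Ceph code): a per-OSD weight map is rebuilt on every PG iteration, so checking P PGs over N OSDs performs O(P * N) map insertions plus the matching destructions.

```cpp
#include <cstddef>
#include <map>

static std::size_t insertions = 0;  // counts map inserts to expose the cost

// Stand-in for building the weight map for one crush root.
std::map<int, float> build_weight_map(int num_osds) {
  std::map<int, float> weight_map;
  for (int osd = 0; osd < num_osds; ++osd) {
    weight_map[osd] = 1.0f;  // stand-in for the real CRUSH weight
    ++insertions;
  }
  return weight_map;
}

// Models the per-PG loop: the map is rebuilt on every iteration.
std::size_t check_pgs(int num_pgs, int num_osds) {
  insertions = 0;
  for (int pg = 0; pg < num_pgs; ++pg) {
    // Created and destroyed each time -- this is the O(N^2) term.
    std::map<int, float> weight_map = build_weight_map(num_osds);
    (void)weight_map;  // the real code would validate the PG's upmap here
  }
  return insertions;
}
```

With 6050 OSDs and tens of thousands of upmapped PGs, this per-iteration rebuild is exactly where the monitor's CPU time goes.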
Note: there is an attempt to cache data in a local cache variable:
map<int, map<int, float>> rule_weight_map;
to cache the weight_map objects per crush root.
This cache is ineffective because weight_map is a local copy of one value from it: on each iteration a full weight_map is copied out and then destroyed.
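A minimal sketch of the fix direction (illustrative names, not the actual PR code): look up the cached map by const reference instead of copying the value out of rule_weight_map, so each crush rule's weight map is built once and never copied per PG.

```cpp
#include <cstddef>
#include <map>

static std::size_t builds = 0;  // counts full weight-map constructions

// Return a const reference into the cache, building the entry at most once.
const std::map<int, float>& get_weight_map(
    std::map<int, std::map<int, float>>& rule_weight_map,
    int rule, int num_osds) {
  auto it = rule_weight_map.find(rule);
  if (it == rule_weight_map.end()) {
    std::map<int, float> wm;
    for (int osd = 0; osd < num_osds; ++osd)
      wm[osd] = 1.0f;  // stand-in for the real CRUSH weight
    ++builds;  // the full build happens once per rule
    it = rule_weight_map.emplace(rule, std::move(wm)).first;
  }
  return it->second;  // no per-PG copy: the caller binds a const reference
}

// Models the per-PG loop using the cache correctly.
std::size_t check_pgs_cached(int num_pgs, int num_osds) {
  builds = 0;
  std::map<int, std::map<int, float>> rule_weight_map;
  for (int pg = 0; pg < num_pgs; ++pg) {
    const std::map<int, float>& weight_map =
        get_weight_map(rule_weight_map, /*rule=*/0, num_osds);
    (void)weight_map;  // validation of the PG's upmap would happen here
  }
  return builds;
}
```

Binding `const std::map<int, float>&` instead of `std::map<int, float>` is the one-character-class difference between O(1) and O(N) work per PG iteration.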
Updated by Andras Pataki 4 months ago
- File ceph-mon-wcp-1.txt added
Attaching profiling output of the code, captured with Mark Nelson's wall clock profiler.
The ceph-mon process in question has mon_cpu_threads=12 (increased from the default 4) in an attempt to alleviate the problem.
Most of the 12 threads are spinning in memory management operations (creating and destroying the STL tree underlying weight_map).
Updated by Laura Flores 4 months ago
- Status changed from New to In Progress
- Assignee set to Andras Pataki
@apataki I'm assigning this to you; feel free to assign me as reviewer when you have the fix.
Updated by Radoslaw Zarzynski 4 months ago
- Priority changed from Normal to High
Let's keep the ticket on top of our bug scrub's queue.
Updated by Andras Pataki 4 months ago
- Pull request ID set to 66204
Here is a fix that resolves the issue on my cluster:
PR: https://github.com/ceph/ceph/pull/66204
Updated by Radoslaw Zarzynski 4 months ago · Edited
scrub note: PR went into QA: https://tracker.ceph.com/issues/73968
Updated by Laura Flores 4 months ago
In QA here: https://tracker.ceph.com/issues/73968
Updated by Upkeep Bot 3 months ago
- Status changed from Fix Under Review to Resolved
- Merge Commit set to a954f89c8767cd87d8f455b82fa4d620e237657f
- Fixed In set to v20.3.0-4607-ga954f89c87
- Upkeep Timestamp set to 2025-12-15T18:46:35+00:00
Updated by Andras Pataki 3 months ago
Yes - to all branches that currently get backports, please. The code is the same in recent releases, so the patch should apply easily (I'm running a Quincy release with the same patch applied).