Bug #73795
ceph-mon high CPU usage and sluggish responsiveness when expanding large clusters
Status: Closed
Description
When expanding a large cluster with 5000+ OSDs, ceph-mon gets very busy as backfills complete and new osdmaps are created: it uses 4 cores (400% CPU in top) and its responsiveness becomes very sluggish. Running 'ceph -s' often takes up to 5 seconds. This is quite repeatable on our large cluster, and there is concern that the monitors could potentially lose quorum.
The cluster on which this is observed runs Quincy 17.2.8 and currently has 6050 OSDs and about 67000 PGs.
Updated by Andras Pataki 4 months ago
Here is an analysis of the problem:
On certain PG state changes (for example, when a backfill completes), OSDMonitor::encode_pending() is called.
It creates a CleanUpmapJob and runs it in parallel on the monitor's CPU threads:
// clean inappropriate pg_upmap/pg_upmap_items (if any)
{
// check every upmapped pg for now
// until we could reliably identify certain cases to ignore,
// which is obviously the hard part TBD..
...
CleanUpmapJob job(cct, tmp, pending_inc);
mapper.queue(&job, g_conf()->mon_clean_pg_upmaps_per_chunk, pgs_to_check);
job.wait();
...
This job goes through PGs with upmaps (of which there may be many) and calls OSDMap::check_pg_upmaps():
struct CleanUpmapJob : public ParallelPGMapper::Job {
CleanUpmapJob(CephContext *cct, const OSDMap& om, OSDMap::Incremental& pi)
: ParallelPGMapper::Job(&om),
...
void process(const std::vector<pg_t>& to_check) override {
...
osdmap.check_pg_upmaps(cct, to_check, &to_cancel, &to_remap);
...
}
OSDMap::check_pg_upmaps() has a local variable called weight_map that maps osd_id to OSD weight for all OSDs under the crush root of the current PG:
map<int, float> weight_map;
This lives inside the loop over PGs, so the map is created and destroyed on every iteration (for every PG checked). On a large cluster it can contain thousands of items, and there are tens of thousands of PGs to iterate through; i.e., this is O(N^2) work in terms of the size of the cluster (N being the number of OSDs).
What makes this worse is that large clusters see O(N) PG state changes during expansion/recovery, so the total amount of work done by ceph-mon during a cluster expansion can be as bad as O(N^3) just for these cleanups.
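To make the cost concrete, here is a minimal, self-contained sketch of the pattern described above (the names build_weight_map/check_pgs are illustrative, not the actual Ceph code): a per-OSD weight map is rebuilt on every PG iteration, so checking P PGs over N OSDs performs O(P * N) map insertions plus the matching destructions.

```cpp
#include <cstddef>
#include <map>

static std::size_t insertions = 0;  // counts map inserts to expose the cost

// Stand-in for building the weight map for one crush root.
std::map<int, float> build_weight_map(int num_osds) {
  std::map<int, float> weight_map;
  for (int osd = 0; osd < num_osds; ++osd) {
    weight_map[osd] = 1.0f;  // stand-in for the real CRUSH weight
    ++insertions;
  }
  return weight_map;
}

// Models the per-PG loop: the map is rebuilt on every iteration.
std::size_t check_pgs(int num_pgs, int num_osds) {
  insertions = 0;
  for (int pg = 0; pg < num_pgs; ++pg) {
    // Created and destroyed each time -- this is the O(N^2) term.
    std::map<int, float> weight_map = build_weight_map(num_osds);
    (void)weight_map;  // the real code would validate the PG's upmap here
  }
  return insertions;
}
```

With 6050 OSDs and tens of thousands of upmapped PGs, this per-iteration rebuild is exactly where the monitor's CPU time goes.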
Note: there is an attempt to cache data in a local cache variable:
map<int, map<int, float>> rule_weight_map;
to cache the weight_map objects per crush root.
This cache is ineffective because weight_map is a local copy of one value from it: on each iteration a full weight_map is copied out and then destroyed.
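A minimal sketch of the fix direction (illustrative names, not the actual PR code): look up the cached map by const reference instead of copying the value out of rule_weight_map, so each crush rule's weight map is built once and never copied per PG.

```cpp
#include <cstddef>
#include <map>

static std::size_t builds = 0;  // counts full weight-map constructions

// Return a const reference into the cache, building the entry at most once.
const std::map<int, float>& get_weight_map(
    std::map<int, std::map<int, float>>& rule_weight_map,
    int rule, int num_osds) {
  auto it = rule_weight_map.find(rule);
  if (it == rule_weight_map.end()) {
    std::map<int, float> wm;
    for (int osd = 0; osd < num_osds; ++osd)
      wm[osd] = 1.0f;  // stand-in for the real CRUSH weight
    ++builds;  // the full build happens once per rule
    it = rule_weight_map.emplace(rule, std::move(wm)).first;
  }
  return it->second;  // no per-PG copy: the caller binds a const reference
}

// Models the per-PG loop using the cache correctly.
std::size_t check_pgs_cached(int num_pgs, int num_osds) {
  builds = 0;
  std::map<int, std::map<int, float>> rule_weight_map;
  for (int pg = 0; pg < num_pgs; ++pg) {
    const std::map<int, float>& weight_map =
        get_weight_map(rule_weight_map, /*rule=*/0, num_osds);
    (void)weight_map;  // validation of the PG's upmap would happen here
  }
  return builds;
}
```

Binding `const std::map<int, float>&` instead of `std::map<int, float>` is the one-character-class difference between O(1) and O(N) work per PG iteration.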
Updated by Andras Pataki 4 months ago
- File ceph-mon-wcp-1.txt added
Attaching profiling output of the code, captured with Mark Nelson's wall clock profiler.
The ceph-mon process in question has mon_cpu_threads=12 (increased from the default 4) in an attempt to alleviate the problem.
Most of the 12 threads are spinning in memory management operations (creating and destroying the STL tree underlying weight_map).
Updated by Laura Flores 4 months ago
- Status changed from New to In Progress
- Assignee set to Andras Pataki
@apataki I'm assigning this to you; feel free to assign me as reviewer when you have the fix.
Updated by Radoslaw Zarzynski 4 months ago
- Priority changed from Normal to High
Let's keep the ticket on top of our bug scrub's queue.
Updated by Andras Pataki 4 months ago
- Pull request ID set to 66204
Here is a fix that resolves the issue on my cluster:
PR: https://github.com/ceph/ceph/pull/66204
Updated by Radoslaw Zarzynski 4 months ago · Edited
scrub note: PR went into QA: https://tracker.ceph.com/issues/73968
Updated by Laura Flores 4 months ago
In QA here: https://tracker.ceph.com/issues/73968
Updated by Upkeep Bot 3 months ago
- Status changed from Fix Under Review to Resolved
- Merge Commit set to a954f89c8767cd87d8f455b82fa4d620e237657f
- Fixed In set to v20.3.0-4607-ga954f89c87
- Upkeep Timestamp set to 2025-12-15T18:46:35+00:00
Updated by Andras Pataki 3 months ago
Yes - to all branches that currently get backports, please. The code is the same in recent releases, so the patch should apply easily (I'm running a Quincy release with the same patch applied).