Bug #71136
scrub uses unbounded memory and OOM when repairing the inotable
Status: Closed
Description
When fixing "scrub: inode wrongly marked free", the MDS scrub queues up inotable writes faster than they can be written to the OSD, apparently because it's not respecting the objecter throttle.
Here's an example. I'm scrubbing a filesystem with ~2M files after the inotable was wiped. The MDS goes OOM if I don't pause the scrub to let it "catch up" on pending inotable writes.
Here's a scrub error (there are millions like this to fix):
Apr 29 18:03:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.cache.ino(0x10001a83e90) decoded 198 bytes of backtrace successfully
Apr 29 18:03:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.cache.ino(0x10001a83e90) scrub: inotable ino = 0x10001a83e90
Apr 29 18:03:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.cache.ino(0x10001a83e90) scrub: inotable free says 1
Apr 29 18:03:10 ceph-prod-mon1 ceph-mds[1713024]: log_channel(cluster) log [ERR] : scrub: inode wrongly marked free: 0x10001a83e90
Apr 29 18:03:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.inotable: repair: before status. ino = 0x10001a83e90 pver =567542 ver= 567542
Apr 29 18:03:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.inotable: repair: after status. ino = 0x10001a83e90 pver =567543 ver= 567543
Apr 29 18:03:10 ceph-prod-mon1 ceph-mds[1713024]: log_channel(cluster) log [ERR] : inode table repaired for inode: 0x10001a83e90
Apr 29 18:03:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.inotable: save v 567543
Apr 29 18:03:10 ceph-prod-mon1 conmon[1712997]: 2025-04-29T22:03:10.557+0000 7eff5cd03700 -1 log_channel(cluster) log [ERR] : scrub: inode wrongly marked free: 0x10001a83e90
Apr 29 18:03:10 ceph-prod-mon1 conmon[1712997]: 2025-04-29T22:03:10.557+0000 7eff5cd03700 -1 log_channel(cluster) log [ERR] : inode table repaired for inode: 0x10001a83e90
Apr 29 18:03:10 ceph-prod-mon1 ceph-mds[1713024]: MDSContext::complete: 16C_InodeValidated
The MDS goes OOM, with the memory mostly in buffer_anon:
"mempool": {
...
"buffer_anon_bytes": 69473374766,
"buffer_anon_items": 1083935,
"buffer_meta_bytes": 1468368,
"buffer_meta_items": 16686,
...
"osdmap_bytes": 69536,
"osdmap_items": 1794,
...
"mds_co_bytes": 3026272591,
"mds_co_items": 67441982,
...
},
It seems that the objecter throttles are not applied:
"throttle-objecter_bytes": {
"val": 70154258238,
"max": 104857600,
"get_started": 0,
"get": 0,
"get_sum": 0,
"get_or_fail_fail": 0,
"get_or_fail_success": 0,
"take": 606016,
"take_sum": 438504722789,
"put": 589195,
"put_sum": 368350464551,
"wait": {
"avgcount": 0,
"sum": 0.000000000,
"avgtime": 0.000000000
}
},
"throttle-objecter_ops": {
"val": 16821,
"max": 1024,
"get_started": 0,
"get": 0,
"get_sum": 0,
"get_or_fail_fail": 0,
"get_or_fail_success": 0,
"take": 606016,
"take_sum": 606016,
"put": 589195,
"put_sum": 589195,
"wait": {
"avgcount": 0,
"sum": 0.000000000,
"avgtime": 0.000000000
}
},
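Reading the counters above: get, get_sum and the wait histogram are all zero while take/take_sum keep growing and val sits far above max, so the bytes appear to be accounted through a non-blocking path that never waits for space. Below is a minimal standalone sketch of that distinction (illustrative only, not Ceph's actual Throttle class), assuming a take() that records usage unconditionally versus a get() that blocks until the reservation fits:

#include <condition_variable>
#include <cstdint>
#include <iostream>
#include <mutex>

class SimpleThrottle {
  std::mutex m;
  std::condition_variable cv;
  uint64_t val = 0;
  const uint64_t max;
public:
  explicit SimpleThrottle(uint64_t max_) : max(max_) {}

  // Non-blocking: account for the bytes even when that pushes val past max.
  void take(uint64_t c) {
    std::lock_guard<std::mutex> l(m);
    val += c;                       // "take"/"take_sum" grow, "wait" never does
  }

  // Blocking: wait until the reservation fits under max (the back-pressure path).
  void get(uint64_t c) {
    std::unique_lock<std::mutex> l(m);
    cv.wait(l, [&] { return val + c <= max; });
    val += c;
  }

  // Release bytes when the operation completes (e.g. the OSD acks the write).
  void put(uint64_t c) {
    std::lock_guard<std::mutex> l(m);
    val -= c;
    cv.notify_all();
  }

  uint64_t current() {
    std::lock_guard<std::mutex> l(m);
    return val;
  }
};

int main() {
  SimpleThrottle t(104857600);      // 100 MiB, like throttle-objecter_bytes "max"
  for (int i = 0; i < 16821; ++i)
    t.take(4187026);                // every submission succeeds immediately
  std::cout << "val=" << t.current() << " (far above max, zero waits)\n";
}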
Here's the log with debug_objecter set to 10; we see a writefull of the whole inotable for each individual inode fixed:
Apr 29 19:06:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.inotable: save_2 v 621873
Apr 29 19:06:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.objecter ms_dispatch 0x56112e17a000 osd_op_reply(549401 mds0_inotable [writefull 0~4187026] v36198'52482307 uv52482307 ondisk = 0) v8
Apr 29 19:06:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.objecter in handle_osd_op_reply
Apr 29 19:06:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.objecter handle_osd_op_reply 549401 ondisk uv 52482307 in 8.3 attempt 137
Apr 29 19:06:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.objecter op 0 rval 0 len 0
Apr 29 19:06:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.objecter 8441 in flight
Apr 29 19:06:10 ceph-prod-mon1 ceph-mds[1713024]: MDSIOContextBase::complete: 12C_IO_MT_Save
Apr 29 19:06:10 ceph-prod-mon1 ceph-mds[1713024]: MDSContext::complete: 12C_IO_MT_Save
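A rough cross-check ties the dumps together (assuming every pending save is a full-table writefull of the size shown above, which is an inference from the log rather than something the counters state directly):

16,821 in-flight ops × 4,187,026 bytes per writefull ≈ 70.4 GB

which lines up with throttle-objecter_bytes val (70,154,258,238 ≈ 70.2 GB) and with buffer_anon_bytes (69,473,374,766 ≈ 69.5 GB): essentially all of the memory driving the OOM is queued inotable writes.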
As a workaround to scrub this FS, I'm pausing and resuming the scrub until it gets through all the inodes.
Updated by Venky Shankar 11 months ago
- Assignee set to Patrick Donnelly
- Target version set to v21.0.0
Updated by Dan van der Ster 3 months ago
- Status changed from Triaged to Fix Under Review
- Pull request ID set to 66578
I sent https://github.com/ceph/ceph/pull/66578, untested. I wouldn't be surprised if this change causes other breakage -- but ideally we should try to get this throttle in.
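For illustration only, here is the general shape of such back-pressure (a hypothetical sketch, not the contents of PR 66578): bound the number of in-flight saves so the submitter blocks instead of buffering without limit. A C++20 counting_semaphore stands in for the objecter byte/ops throttle:

#include <chrono>
#include <iostream>
#include <semaphore>
#include <thread>

std::counting_semaphore<16> inflight(16);   // cap on queued saves; the real throttle counts bytes and ops

void async_save() {
  inflight.acquire();               // back-pressure: blocks once 16 saves are outstanding
  std::thread([] {
    std::this_thread::sleep_for(std::chrono::milliseconds(5));  // stand-in for OSD write latency
    inflight.release();             // "ack" arrived, free the slot
  }).detach();
}

int main() {
  for (int i = 0; i < 1000; ++i)
    async_save();                   // the submitter now proceeds at roughly the OSD's pace
  std::this_thread::sleep_for(std::chrono::milliseconds(200));  // let the last detached writers finish
  std::cout << "done submitting\n";
}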
Updated by Md Mahamudur Rahaman Sajib 3 months ago
@Dan van der Ster https://tracker.ceph.com/issues/71167 Isn't this the same ticket? I made an attempt to significantly reduce the memory used by scrub in https://github.com/ceph/ceph/pull/65858
Would you like to have a look at that PR as well? I think we are working on duplicate tickets. I will also have a look at your PR.
Updated by Dan van der Ster 3 months ago
Md Mahamudur Rahaman Sajib wrote in #note-4:
@Dan van der Ster https://tracker.ceph.com/issues/71167 Isn't this the same ticket? I made an attempt to significantly reduce the memory used by scrub in https://github.com/ceph/ceph/pull/65858
Would you like to have a look at that PR as well? I think we are working on duplicate tickets. I will also have a look at your PR.
@Md Mahamudur Rahaman Sajib this is a different issue.
That tracker is about pinned inodes.
This tracker is about queuing up writes faster than they can be written.
Updated by Venky Shankar 19 days ago
- Status changed from Fix Under Review to Resolved
Updated by Upkeep Bot 19 days ago
- Merge Commit set to 15d87f6c9cfcd19a7a408bcd822e93d1478a757e
- Fixed In set to v20.3.0-5752-g15d87f6c9c
- Upkeep Timestamp set to 2026-03-03T05:13:38+00:00