Bug #71136


scrub uses unbounded memory and OOM when repairing the inotable

Added by Dan van der Ster 11 months ago. Updated 19 days ago.

Status:
Resolved
Priority:
Normal
Category:
fsck/damage handling
Target version:
v21.0.0
% Done:

0%

Source:
Community (user)
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
scrub
Pull request ID:
66578
Tags (freeform):
Fixed In:
v20.3.0-5752-g15d87f6c9c
Released In:
Upkeep Timestamp:
2026-03-03T05:13:38+00:00

Description

When fixing "scrub: inode wrongly marked free", the mds scrub queues up inotable writes faster than they can be written to the OSD, apparently because it's not respecting the objecter throttle.

Here's an example. I'm scrubbing a filesystem with ~2M files after the inotable was wiped. The MDS goes OOM if I don't pause it to let it "catch up" on pending inotable writes.

Here's a scrub error (there are millions like this to fix):

Apr 29 18:03:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.cache.ino(0x10001a83e90) decoded 198 bytes of backtrace successfully
Apr 29 18:03:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.cache.ino(0x10001a83e90) scrub: inotable ino = 0x10001a83e90
Apr 29 18:03:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.cache.ino(0x10001a83e90) scrub: inotable free says 1
Apr 29 18:03:10 ceph-prod-mon1 ceph-mds[1713024]: log_channel(cluster) log [ERR] : scrub: inode wrongly marked free: 0x10001a83e90
Apr 29 18:03:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.inotable: repair: before status. ino = 0x10001a83e90 pver =567542 ver= 567542
Apr 29 18:03:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.inotable: repair: after status. ino = 0x10001a83e90 pver =567543 ver= 567543
Apr 29 18:03:10 ceph-prod-mon1 ceph-mds[1713024]: log_channel(cluster) log [ERR] : inode table repaired for inode: 0x10001a83e90
Apr 29 18:03:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.inotable: save v 567543
Apr 29 18:03:10 ceph-prod-mon1 conmon[1712997]: 2025-04-29T22:03:10.557+0000 7eff5cd03700 -1 log_channel(cluster) log [ERR] : scrub: inode wrongly marked free: 0x10001a83e90
Apr 29 18:03:10 ceph-prod-mon1 conmon[1712997]: 2025-04-29T22:03:10.557+0000 7eff5cd03700 -1 log_channel(cluster) log [ERR] : inode table repaired for inode: 0x10001a83e90
Apr 29 18:03:10 ceph-prod-mon1 ceph-mds[1713024]: MDSContext::complete: 16C_InodeValidated

The MDS goes OOM, with the memory mostly in buffer_anon:

    "mempool": {
...
        "buffer_anon_bytes": 69473374766,
        "buffer_anon_items": 1083935,
        "buffer_meta_bytes": 1468368,
        "buffer_meta_items": 16686,
...
        "osdmap_bytes": 69536,
        "osdmap_items": 1794,
...
        "mds_co_bytes": 3026272591,
        "mds_co_items": 67441982,
...
    },

It seems that the objecter throttles are not applied:

    "throttle-objecter_bytes": {
        "val": 70154258238,
        "max": 104857600,
        "get_started": 0,
        "get": 0,
        "get_sum": 0,
        "get_or_fail_fail": 0,
        "get_or_fail_success": 0,
        "take": 606016,
        "take_sum": 438504722789,
        "put": 589195,
        "put_sum": 368350464551,
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
    "throttle-objecter_ops": {
        "val": 16821,
        "max": 1024,
        "get_started": 0,
        "get": 0,
        "get_sum": 0,
        "get_or_fail_fail": 0,
        "get_or_fail_success": 0,
        "take": 606016,
        "take_sum": 606016,
        "put": 589195,
        "put_sum": 589195,
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    },
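The counters make the failure mode legible: get_sum is zero while take_sum is enormous, i.e. the MDS only ever charges these throttles unconditionally and never takes a blocking or failing acquire path, so "val" sails far past "max". A minimal sketch of that distinction (hypothetical Python, loosely modeled on the semantics of Ceph's common/Throttle; the real class is C++ with more paths):

```python
# Hypothetical throttle sketch: take() charges unconditionally (what the
# counters above show being used), while get_or_fail() refuses to exceed
# the limit. With only take(), val grows without bound past max.
class Throttle:
    def __init__(self, maximum):
        self.max = maximum
        self.val = 0

    def take(self, c):
        # unconditional: the caller is never blocked or refused
        self.val += c
        return self.val

    def get_or_fail(self, c):
        # bounded: refuse the charge rather than exceed the limit
        if self.val + c > self.max:
            return False
        self.val += c
        return True

    def put(self, c):
        self.val -= c

t = Throttle(maximum=104857600)   # 100 MiB, as in throttle-objecter_bytes
for _ in range(100):
    t.take(4187026)               # one ~4 MB inotable writefull each
print(t.val > t.max)              # True: val has blown past max
print(t.get_or_fail(4187026))     # False: a bounded path would stop here
```

This matches the dump: get/get_sum/get_or_fail_* are all zero, and only take/take_sum and put/put_sum move.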

Here's the output with debug_objecter set to 10. Each individual inode repair triggers a full writefull of the inotable object:

Apr 29 19:06:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.inotable: save_2 v 621873
Apr 29 19:06:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.objecter ms_dispatch 0x56112e17a000 osd_op_reply(549401 mds0_inotable [writefull 0~4187026] v36198'52482307 uv52482307 ondisk = 0) v8
Apr 29 19:06:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.objecter in handle_osd_op_reply
Apr 29 19:06:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.objecter handle_osd_op_reply 549401 ondisk uv 52482307 in 8.3 attempt 137
Apr 29 19:06:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.objecter  op 0 rval 0 len 0
Apr 29 19:06:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.objecter 8441 in flight
Apr 29 19:06:10 ceph-prod-mon1 ceph-mds[1713024]: MDSIOContextBase::complete: 12C_IO_MT_Save
Apr 29 19:06:10 ceph-prod-mon1 ceph-mds[1713024]: MDSContext::complete: 12C_IO_MT_Save
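The numbers are self-consistent: every repair queues a full ~4 MB copy of the inotable object, so the throttle byte counter is roughly the in-flight op count times the writefull size. A quick sanity check using values copied from the dumps above (the two snapshots were taken at different moments, hence "roughly"):

```python
# Values copied from the throttle dump and debug_objecter log above.
writefull_bytes = 4187026       # size of one full inotable writefull
inflight_ops = 16821            # throttle-objecter_ops "val"
queued_bytes = 70154258238      # throttle-objecter_bytes "val" (~70 GB)

estimate = inflight_ops * writefull_bytes
print(estimate)                                            # 70429964346
print(abs(estimate - queued_bytes) / queued_bytes < 0.01)  # True: within 1%
```

That is also consistent with buffer_anon_bytes (~69.5 GB): the pending write buffers are what's eating the memory.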

As a workaround to scrub this FS, I'm pausing and resuming the scrub until it gets through all the inodes.
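For anyone hitting the same problem, that pause/resume dance can be driven with the MDS scrub commands. Something like the following (per the cephfs scrub admin interface; substitute your filesystem name for `<fsname>`, and note this requires a live cluster):

```shell
# Kick off a recursive repair scrub on the whole filesystem
ceph tell mds.<fsname>:0 scrub start / recursive,repair

# When MDS memory grows, pause so pending inotable writes can drain
ceph tell mds.<fsname>:0 scrub pause

# Check progress and confirm the pause took effect
ceph tell mds.<fsname>:0 scrub status

# Resume once memory has come back down; repeat as needed
ceph tell mds.<fsname>:0 scrub resume
```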

Actions #1

Updated by Venky Shankar 11 months ago

  • Assignee set to Patrick Donnelly
  • Target version set to v21.0.0
Actions #2

Updated by Venky Shankar 11 months ago

  • Status changed from New to Triaged
Actions #3

Updated by Dan van der Ster 3 months ago

  • Status changed from Triaged to Fix Under Review
  • Pull request ID set to 66578

I sent https://github.com/ceph/ceph/pull/66578, untested. I wouldn't be surprised if this change causes other breakage, but ideally we should try to get this throttle in.
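Conceptually, the fix is to make the repair path block before queueing another inotable save once too many are already in flight, instead of letting pending writes pile up in memory. This is not the actual change in the PR (which applies the existing objecter throttle); it is just an illustration of the back-pressure pattern, with invented names:

```python
import threading

# Hypothetical back-pressure sketch: at most max_inflight saves may be
# queued at once; the repair loop blocks in acquire() instead of
# accumulating unbounded pending writes.
max_inflight = 128
slots = threading.BoundedSemaphore(max_inflight)

def save_inotable(version, submit, on_complete):
    slots.acquire()               # blocks when too many saves are in flight
    def _done(result):
        slots.release()           # free the slot when the OSD write commits
        on_complete(result)
    submit(version, _done)

# Synchronous demo: submit completes immediately, so the slot is returned
done = []
save_inotable(7, lambda v, cb: cb("committed"), done.append)
print(done)                       # ['committed']
```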

Actions #4

Updated by Md Mahamudur Rahaman Sajib 3 months ago

@Dan van der Ster, isn't https://tracker.ceph.com/issues/71167 the same ticket? I made an attempt to significantly reduce scrub's memory usage in https://github.com/ceph/ceph/pull/65858.

Would you like to take a look at that PR as well? I think we are working on duplicate tickets. I will also take a look at your PR.

Actions #5

Updated by Dan van der Ster 3 months ago

Md Mahamudur Rahaman Sajib wrote in #note-4:

@Dan van der Ster, isn't https://tracker.ceph.com/issues/71167 the same ticket? I made an attempt to significantly reduce scrub's memory usage in https://github.com/ceph/ceph/pull/65858.

Would you like to take a look at that PR as well? I think we are working on duplicate tickets. I will also take a look at your PR.

@Md Mahamudur Rahaman Sajib this is a different issue.
That tracker is about pinned inodes.

This tracker is about queuing up writes faster than they can be written.

Actions #6

Updated by Venky Shankar 19 days ago

  • Status changed from Fix Under Review to Resolved
Actions #7

Updated by Upkeep Bot 19 days ago

  • Merge Commit set to 15d87f6c9cfcd19a7a408bcd822e93d1478a757e
  • Fixed In set to v20.3.0-5752-g15d87f6c9c
  • Upkeep Timestamp set to 2026-03-03T05:13:38+00:00