Bug #71136
scrub uses unbounded memory and OOM when repairing the inotable
Status: Closed
Description
When fixing "scrub: inode wrongly marked free", the MDS scrub queues up inotable writes faster than they can be written to the OSD, apparently because it's not respecting the objecter throttle.
Here's an example. I'm scrubbing a filesystem with ~2M files after the inotable was wiped. The MDS goes OOM if I don't pause the scrub to let it "catch up" on pending inotable writes.
Here's a scrub error (there are millions like this to fix):
Apr 29 18:03:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.cache.ino(0x10001a83e90) decoded 198 bytes of backtrace successfully
Apr 29 18:03:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.cache.ino(0x10001a83e90) scrub: inotable ino = 0x10001a83e90
Apr 29 18:03:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.cache.ino(0x10001a83e90) scrub: inotable free says 1
Apr 29 18:03:10 ceph-prod-mon1 ceph-mds[1713024]: log_channel(cluster) log [ERR] : scrub: inode wrongly marked free: 0x10001a83e90
Apr 29 18:03:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.inotable: repair: before status. ino = 0x10001a83e90 pver =567542 ver= 567542
Apr 29 18:03:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.inotable: repair: after status. ino = 0x10001a83e90 pver =567543 ver= 567543
Apr 29 18:03:10 ceph-prod-mon1 ceph-mds[1713024]: log_channel(cluster) log [ERR] : inode table repaired for inode: 0x10001a83e90
Apr 29 18:03:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.inotable: save v 567543
Apr 29 18:03:10 ceph-prod-mon1 conmon[1712997]: 2025-04-29T22:03:10.557+0000 7eff5cd03700 -1 log_channel(cluster) log [ERR] : scrub: inode wrongly marked free: 0x10001a83e90
Apr 29 18:03:10 ceph-prod-mon1 conmon[1712997]: 2025-04-29T22:03:10.557+0000 7eff5cd03700 -1 log_channel(cluster) log [ERR] : inode table repaired for inode: 0x10001a83e90
Apr 29 18:03:10 ceph-prod-mon1 ceph-mds[1713024]: MDSContext::complete: 16C_InodeValidated
The MDS goes OOM, with the memory mostly in buffer_anon:
"mempool": {
...
"buffer_anon_bytes": 69473374766,
"buffer_anon_items": 1083935,
"buffer_meta_bytes": 1468368,
"buffer_meta_items": 16686,
...
"osdmap_bytes": 69536,
"osdmap_items": 1794,
...
"mds_co_bytes": 3026272591,
"mds_co_items": 67441982,
...
},
It seems that the objecter throttles are not applied:
"throttle-objecter_bytes": {
"val": 70154258238,
"max": 104857600,
"get_started": 0,
"get": 0,
"get_sum": 0,
"get_or_fail_fail": 0,
"get_or_fail_success": 0,
"take": 606016,
"take_sum": 438504722789,
"put": 589195,
"put_sum": 368350464551,
"wait": {
"avgcount": 0,
"sum": 0.000000000,
"avgtime": 0.000000000
}
},
"throttle-objecter_ops": {
"val": 16821,
"max": 1024,
"get_started": 0,
"get": 0,
"get_sum": 0,
"get_or_fail_fail": 0,
"get_or_fail_success": 0,
"take": 606016,
"take_sum": 606016,
"put": 589195,
"put_sum": 589195,
"wait": {
"avgcount": 0,
"sum": 0.000000000,
"avgtime": 0.000000000
}
},
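Reading the counters above: get, get_sum and the wait histogram are all zero while take/take_sum keep growing and val sits far above max, so the bytes appear to be accounted through a non-blocking path that never waits for space. Below is a minimal standalone sketch of that distinction (illustrative only, not Ceph's actual Throttle class), assuming a take() that records usage unconditionally versus a get() that blocks until the reservation fits:

#include <condition_variable>
#include <cstdint>
#include <iostream>
#include <mutex>

class SimpleThrottle {
  std::mutex m;
  std::condition_variable cv;
  uint64_t val = 0;
  const uint64_t max;
public:
  explicit SimpleThrottle(uint64_t max_) : max(max_) {}

  // Non-blocking: account for the bytes even when that pushes val past max.
  void take(uint64_t c) {
    std::lock_guard<std::mutex> l(m);
    val += c;                       // "take"/"take_sum" grow, "wait" never does
  }

  // Blocking: wait until the reservation fits under max (the back-pressure path).
  void get(uint64_t c) {
    std::unique_lock<std::mutex> l(m);
    cv.wait(l, [&] { return val + c <= max; });
    val += c;
  }

  // Release bytes when the operation completes (e.g. the OSD acks the write).
  void put(uint64_t c) {
    std::lock_guard<std::mutex> l(m);
    val -= c;
    cv.notify_all();
  }

  uint64_t current() {
    std::lock_guard<std::mutex> l(m);
    return val;
  }
};

int main() {
  SimpleThrottle t(104857600);      // 100 MiB, like throttle-objecter_bytes "max"
  for (int i = 0; i < 16821; ++i)
    t.take(4187026);                // every submission succeeds immediately
  std::cout << "val=" << t.current() << " (far above max, zero waits)\n";
}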
Here's the log with debug_objecter set to 10; we see a writefull of the whole inotable for each individual inode fixed:
Apr 29 19:06:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.inotable: save_2 v 621873
Apr 29 19:06:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.objecter ms_dispatch 0x56112e17a000 osd_op_reply(549401 mds0_inotable [writefull 0~4187026] v36198'52482307 uv52482307 ondisk = 0) v8
Apr 29 19:06:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.objecter in handle_osd_op_reply
Apr 29 19:06:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.objecter handle_osd_op_reply 549401 ondisk uv 52482307 in 8.3 attempt 137
Apr 29 19:06:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.objecter op 0 rval 0 len 0
Apr 29 19:06:10 ceph-prod-mon1 ceph-mds[1713024]: mds.0.objecter 8441 in flight
Apr 29 19:06:10 ceph-prod-mon1 ceph-mds[1713024]: MDSIOContextBase::complete: 12C_IO_MT_Save
Apr 29 19:06:10 ceph-prod-mon1 ceph-mds[1713024]: MDSContext::complete: 12C_IO_MT_Save
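A rough cross-check ties the dumps together (assuming every pending save is a full-table writefull of the size shown above, which is an inference from the log rather than something the counters state directly):

16,821 in-flight ops × 4,187,026 bytes per writefull ≈ 70.4 GB

which lines up with throttle-objecter_bytes val (70,154,258,238 ≈ 70.2 GB) and with buffer_anon_bytes (69,473,374,766 ≈ 69.5 GB): essentially all of the memory driving the OOM is queued inotable writes.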
As a workaround to scrub this FS, I'm pausing and resuming the scrub until it gets through all the inodes.
Updated by Venky Shankar 11 months ago
- Assignee set to Patrick Donnelly
- Target version set to v21.0.0
Updated by Dan van der Ster 3 months ago
- Status changed from Triaged to Fix Under Review
- Pull request ID set to 66578
I sent https://github.com/ceph/ceph/pull/66578, untested. I wouldn't be surprised if this change causes other breakage -- but ideally we should try to get this throttle in.
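For illustration only, here is the general shape of such back-pressure (a hypothetical sketch, not the contents of PR 66578): bound the number of in-flight saves so the submitter blocks instead of buffering without limit. A C++20 counting_semaphore stands in for the objecter byte/ops throttle:

#include <chrono>
#include <iostream>
#include <semaphore>
#include <thread>

std::counting_semaphore<16> inflight(16);   // cap on queued saves; the real throttle counts bytes and ops

void async_save() {
  inflight.acquire();               // back-pressure: blocks once 16 saves are outstanding
  std::thread([] {
    std::this_thread::sleep_for(std::chrono::milliseconds(5));  // stand-in for OSD write latency
    inflight.release();             // "ack" arrived, free the slot
  }).detach();
}

int main() {
  for (int i = 0; i < 1000; ++i)
    async_save();                   // the submitter now proceeds at roughly the OSD's pace
  std::this_thread::sleep_for(std::chrono::milliseconds(200));  // let the last detached writers finish
  std::cout << "done submitting\n";
}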
Updated by Md Mahamudur Rahaman Sajib 3 months ago
@Dan van der Ster https://tracker.ceph.com/issues/71167 Isn't this the same ticket? I made an attempt to significantly reduce the memory used by scrub in https://github.com/ceph/ceph/pull/65858
Would you like to have a look at that PR as well? I think we are working on duplicate tickets. I will also have a look at your PR.
Updated by Dan van der Ster 3 months ago
Md Mahamudur Rahaman Sajib wrote in #note-4:
@Dan van der Ster https://tracker.ceph.com/issues/71167 Isn't this the same ticket? I made an attempt to significantly reduce the memory used by scrub in https://github.com/ceph/ceph/pull/65858
Would you like to have a look at that PR as well? I think we are working on duplicate tickets. I will also have a look at your PR.
@Md Mahamudur Rahaman Sajib this is a different issue.
That tracker is about pinned inodes.
This tracker is about queuing up writes faster than they can be written.
Updated by Venky Shankar 19 days ago
- Status changed from Fix Under Review to Resolved
Updated by Upkeep Bot 19 days ago
- Merge Commit set to 15d87f6c9cfcd19a7a408bcd822e93d1478a757e
- Fixed In set to v20.3.0-5752-g15d87f6c9c
- Upkeep Timestamp set to 2026-03-03T05:13:38+00:00