Bug #72375
RGW: Bulk Delete of Versioned Objects Triggers Excessive OLH Updates Leading to RGW Lock-up and OOM
Description
TLDR¶
RGW becomes highly inefficient and eventually locks up when processing bulk deletes (up to 1000 object versions per request) in versioned buckets. The cause is the recursive and duplicative way RGW currently handles object deletions, in particular how it invokes update_olh. A proposed change that limits update_olh to a single invocation per bulk-delete request significantly improves system behavior under such workloads.
Background and Motivation¶
Recently, a client workload that issued large-scale bulk deletes on a versioned bucket caused our RGW nodes to lock up. Each object had thousands of versions, and the client issued highly concurrent multi-object deletes. The system became sluggish within a few minutes of the workload starting, and within 30-45 minutes it was mostly unresponsive, impacting not just the offending workload but also other clients through soaring latencies and timeouts. We had to ask the client to cease the operation - which wasn't enough by itself - and eventually had to restart all RGW instances to recover.
Reproduction Recipe¶
To simulate the issue:
- Bucket Setup: Created a versioned bucket with a single object having 10,000 versions.
- Client Workload Simulation: Repeatedly retrieved and deleted versions in batches of 1,000 using concurrent bulk delete requests.
while True:
    fill_delete_queue_up_to_10K_versions()
    if delete_queue_empty:
        break
    while delete_queue_is_not_empty:
        chunk = get_1K_versions()
        send_bulk_delete_request(chunk)
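The client loop above can be fleshed out as a small Python sketch. The S3 calls are stubbed out (`list_versions` and `bulk_delete` are hypothetical stand-ins for the SDK's list-object-versions and multi-object-delete calls), but the batching logic mirrors the workload:

```python
from typing import Callable, Iterator, List


def chunked(keys: List[str], size: int = 1000) -> Iterator[List[str]]:
    """Split a list of version keys into bulk-delete payloads of at
    most `size` entries (1000 is the S3 multi-object-delete limit)."""
    for i in range(0, len(keys), size):
        yield keys[i:i + size]


def run_workload(list_versions: Callable[[], List[str]],
                 bulk_delete: Callable[[List[str]], None],
                 batch: int = 1000) -> int:
    """Repeatedly list pending versions and delete them in 1K-sized
    bulk requests until the listing comes back empty. Returns the
    number of bulk-delete requests sent."""
    requests_sent = 0
    while True:
        queue = list_versions()   # fill_delete_queue_up_to_10K_versions()
        if not queue:             # delete_queue_empty
            break
        for chunk in chunked(queue, batch):
            bulk_delete(chunk)    # send_bulk_delete_request(chunk)
            requests_sent += 1
    return requests_sent
```

With 2500 versions pending, this sends three bulk requests (1000 + 1000 + 500) in the first pass and then stops once the listing is empty.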
Root Cause: Excessive and Redundant update_olh Calls¶
Each bulk delete request (which can contain up to 1000 objects in its payload) is handled with concurrency (rgw_multi_obj_del_max_aio, default = 16). For each object version in the request, the following flow is triggered:
- RGWDeleteMultiObj::execute (https://bbgithub.dev.bloomberg.com/ceph/ceph/blob/5abc34b8658517e473864cd8f95e6009393c0d64/src/rgw/rgw_op.cc#L7674) →
- handle_individual_object →
- RGWRados::Object::Delete::delete_obj (https://bbgithub.dev.bloomberg.com/ceph/ceph/blob/5abc34b8658517e473864cd8f95e6009393c0d64/src/rgw/driver/rados/rgw_rados.cc#L6359) → {set_olh, unlink_obj_instance}
- RGWRados::update_olh (https://bbgithub.dev.bloomberg.com/ceph/ceph/blob/5abc34b8658517e473864cd8f95e6009393c0d64/src/rgw/driver/rados/rgw_rados.cc#L9170) → reads and replays OLH logs.
This results in:
- Recursive behavior.
- Exponential duplication of work.
- Lock-up of RGW threads.
- Requests timing out on the client end; retries at both the SDK level and in the client's own logic (listing and re-sending bulk deletes after a timeout) add extra pressure on the RGWs.
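The duplication can be illustrated with a deliberately simplified model (this is not the actual RGW replay logic, just an illustration of the cost shape): each per-object delete appends an entry to the OLH log and then replays every entry accumulated so far, re-issuing delete_obj per entry (most of which return -ENOENT). N deletes against one OLH then cost on the order of N² delete_obj calls rather than N:

```python
def deletes_with_per_object_olh_update(n_versions: int) -> int:
    """Toy model of the current behavior: each of the n deletes is
    logged to the OLH log, then update_olh replays the whole log,
    re-issuing one delete_obj per accumulated entry. Returns the
    total number of delete_obj calls issued."""
    olh_log = []
    delete_calls = 0
    for version in range(n_versions):
        olh_log.append(version)         # the delete is logged...
        delete_calls += len(olh_log)    # ...then the whole log is replayed
    return delete_calls
```

For a 1000-object bulk delete this model issues 500500 delete_obj calls instead of 1000, which matches the shape of the blow-up observed in the decorated logs below.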
For testing purposes, I increased the logging level and decorated the for loop that removes object instances in RGWRados::apply_olh_log, to show the extent of the calls into apply_olh_log and the duplication:
ldpp_dout(dpp, 0) << " apply olh start of remove_instances for loop" << dendl;
/* first remove object instances */
for (list<cls_rgw_obj_key>::iterator liter = remove_instances.begin();
     liter != remove_instances.end(); ++liter) {
  cls_rgw_obj_key& key = *liter;
  rgw_obj obj_instance(bucket, key);
  ldpp_dout(dpp, 0) << " apply olh calls delete_obj obj_instance=" << obj_instance << dendl;
  int ret = delete_obj(dpp, obj_ctx, bucket_info, obj_instance, 0, y,
                       null_verid, RGW_BILOG_FLAG_VERSIONED_OP,
                       ceph::real_time(), zones_trace, log_op, force);
  ldpp_dout(dpp, 0) << " apply olh call to delete_obj returned obj_instance=" << obj_instance << " ret=" << ret << dendl;
  if (ret < 0 && ret != -ENOENT) {
    ldpp_dout(dpp, 0) << "ERROR: delete_obj() returned " << ret << " obj_instance=" << obj_instance << dendl;
    return ret;
  }
}
Note that, since bulk-delete requests take too long to complete, the client SDK (if configured to do so) retries the request, which exacerbates the problem. I tested this using a vstart cluster; as soon as the bulk-delete workload started, memory usage of the RGW instance began to creep up, to the point that it was OOM-killed a few hours later:

- During the whole test, there were only 237 distinct requests:
# egrep -o "req [0-9]+" radosgw.8000.log | sort | uniq -c | wc -l
237
- However, there were 17715 calls into apply_olh_log:
# grep "apply olh start of remove_instances for loop" radosgw.8000.log | wc -l
17715
- It is not only that each object found in a bulk delete results in applying the OLH log; the handling of a single object also duplicates work within itself (recursive behavior), as the list of events below shows:
- one bulk-delete request (req 9522416970407138023)
- handling the deletion of one version found in the bulk delete (y4T7s5K5F06nryAtOCuBSYVgN7FJ6LU)
- replaying the OLH log tries to apply the removal of the same version (qV.tx3vPtBJPdMipXe3-QISykee9xTr) over and over again:
2025-07-28T19:15:29.903+0000 7fc059d98700 0 req 9522416970407138023 2093.866210938s s3:multi_object_delete object-0[y4T7s5K5F06nryAtOCuBSYVgN7FJ6LU] apply olh calls delete_obj obj_instance=versioned-thisisbcstestuser001-0:_:qV.tx3vPtBJPdMipXe3-QISykee9xTr_object-0
2025-07-28T19:15:30.744+0000 7fbfc64ac700 0 req 9522416970407138023 2094.707275391s s3:multi_object_delete object-0[y4T7s5K5F06nryAtOCuBSYVgN7FJ6LU] apply olh call to delete_obj returned obj_instance=versioned-thisisbcstestuser001-0:_:qV.tx3vPtBJPdMipXe3-QISykee9xTr_object-0 ret=-2
2025-07-28T19:55:40.542+0000 7fbff16f1700 0 req 9522416970407138023 4504.504882812s s3:multi_object_delete object-0[y4T7s5K5F06nryAtOCuBSYVgN7FJ6LU] apply olh calls delete_obj obj_instance=versioned-thisisbcstestuser001-0:_:qV.tx3vPtBJPdMipXe3-QISykee9xTr_object-0
2025-07-28T19:55:42.083+0000 7fbfdb8ce700 0 req 9522416970407138023 4506.046386719s s3:multi_object_delete object-0[y4T7s5K5F06nryAtOCuBSYVgN7FJ6LU] apply olh call to delete_obj returned obj_instance=versioned-thisisbcstestuser001-0:_:qV.tx3vPtBJPdMipXe3-QISykee9xTr_object-0 ret=-2
2025-07-28T20:35:39.051+0000 7fbfd86c9700 0 req 9522416970407138023 6903.013671875s s3:multi_object_delete object-0[y4T7s5K5F06nryAtOCuBSYVgN7FJ6LU] apply olh calls delete_obj obj_instance=versioned-thisisbcstestuser001-0:_:qV.tx3vPtBJPdMipXe3-QISykee9xTr_object-0
2025-07-28T20:35:39.751+0000 7fc04a37f700 0 req 9522416970407138023 6903.713867188s s3:multi_object_delete object-0[y4T7s5K5F06nryAtOCuBSYVgN7FJ6LU] apply olh call to delete_obj returned obj_instance=versioned-thisisbcstestuser001-0:_:qV.tx3vPtBJPdMipXe3-QISykee9xTr_object-0 ret=-2
2025-07-28T21:12:50.387+0000 7fbf9f86e700 0 req 9522416970407138023 9134.349609375s s3:multi_object_delete object-0[y4T7s5K5F06nryAtOCuBSYVgN7FJ6LU] apply olh calls delete_obj obj_instance=versioned-thisisbcstestuser001-0:_:qV.tx3vPtBJPdMipXe3-QISykee9xTr_object-0
2025-07-28T21:12:51.773+0000 7fc0787c9700 0 req 9522416970407138023 9135.736328125s s3:multi_object_delete object-0[y4T7s5K5F06nryAtOCuBSYVgN7FJ6LU] apply olh call to delete_obj returned obj_instance=versioned-thisisbcstestuser001-0:_:qV.tx3vPtBJPdMipXe3-QISykee9xTr_object-0 ret=-2
2025-07-28T21:20:31.602+0000 7fc09cc03700 0 req 9522416970407138023 9595.565429688s s3:multi_object_delete object-0[y4T7s5K5F06nryAtOCuBSYVgN7FJ6LU] apply olh calls delete_obj obj_instance=versioned-thisisbcstestuser001-0:_:qV.tx3vPtBJPdMipXe3-QISykee9xTr_object-0
2025-07-28T21:20:33.175+0000 7fc052f8d700 0 req 9522416970407138023 9597.137695312s s3:multi_object_delete object-0[y4T7s5K5F06nryAtOCuBSYVgN7FJ6LU] apply olh call to delete_obj returned obj_instance=versioned-thisisbcstestuser001-0:_:qV.tx3vPtBJPdMipXe3-QISykee9xTr_object-0 ret=-2
2025-07-29T00:46:15.079+0000 7fbfcdcb8700 0 req 9522416970407138023 21939.041015625s s3:multi_object_delete object-0[y4T7s5K5F06nryAtOCuBSYVgN7FJ6LU] apply olh calls delete_obj obj_instance=versioned-thisisbcstestuser001-0:_:qV.tx3vPtBJPdMipXe3-QISykee9xTr_object-0
2025-07-29T00:46:19.062+0000 7fbfebce8700 0 req 9522416970407138023 21943.025390625s s3:multi_object_delete object-0[y4T7s5K5F06nryAtOCuBSYVgN7FJ6LU] apply olh call to delete_obj returned obj_instance=versioned-thisisbcstestuser001-0:_:qV.tx3vPtBJPdMipXe3-QISykee9xTr_object-0 ret=-2
- Eventually, after the OOM kill and a restart of the RGW, "bucket stats" showed that only 3030 objects (out of 10K) had been deleted.
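The duplication visible in the log excerpt above can be quantified with a small parser (a hypothetical helper, written against the decorated log format shown above; any (request, obj_instance) pair counted more than once is duplicated replay work):

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Matches the decorated "apply olh calls delete_obj" lines added in
# RGWRados::apply_olh_log above.
LINE_RE = re.compile(
    r"req (?P<req>\d+).*apply olh calls delete_obj "
    r"obj_instance=(?P<instance>\S+)"
)


def count_duplicate_replays(log_lines: Iterable[str]) -> Counter:
    """Count how often the same (request, obj_instance) pair is
    re-deleted during OLH log replay."""
    counts: Counter[Tuple[str, str]] = Counter()
    for line in log_lines:
        m = LINE_RE.search(line)
        if m:
            counts[(m.group("req"), m.group("instance"))] += 1
    return counts
```

Running this over radosgw.8000.log makes the per-instance replay counts explicit rather than eyeballing repeated timestamps.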
Proposed Solution¶
TLDR: Optimize bulk-delete handling by limiting update_olh to a single call per request rather than one per object version.
RGW: multi object delete op; skip olh update for all deletes but the last one (https://github.com/BBoozmen/ceph/commit/a30b33af7b34be8d2bfc1f86da607660010d4274) implements the solution: If a multi-object delete request includes multiple object versions, skip update_olh on all but the final object in the batch.
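The shape of the change can be sketched as follows (a simplified Python rendering of the control flow, not the actual C++ from the commit; `delete_obj` and `update_olh` are stand-ins for the RGWRados operations):

```python
from typing import Callable, List


def handle_bulk_delete(keys: List[str],
                       delete_obj: Callable[[str], None],
                       update_olh: Callable[[], None]) -> None:
    """Simplified control flow of the proposed fix: every
    per-version delete skips the OLH update except the last one,
    so update_olh runs once for the whole bulk-delete request
    instead of once per object version."""
    for i, key in enumerate(keys):
        delete_obj(key)
        if i == len(keys) - 1:   # only the final delete updates the OLH
            update_olh()
```

For a full 1000-key payload this issues 1000 deletes but only one OLH update, instead of the 1000 per-object updates (each replaying the pending log) in the current flow.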
Results with the Proposed Fix¶
Observations:
- Client-side object listings reflect deletions earlier.
- Retries are minimized (client sees deletions and avoids repeat delete requests).
- The same workload now completes in ~40 minutes, as opposed to never finishing under the previous implementation.
- No RGW crashes or OOM events.
- Memory usage remained stable.

- Workload start: 14:41 EST
- Workload complete: 15:26 EST
- Final request completion at rgw-side: 21:00 EST
2025-07-31T01:00:35.259+0000 7f6110974700 1 beast: 0x7f5fd5c635f0: 10.74.16.44 - myuser1 [30/Jul/2025:18:41:37.370 +0000] "POST /versioned-dev-thisisbcstestuser001-0?delete HTTP/1.1" 200 132 - "aiobotocore/2.22.0 m d/Botocore#1.37.3 ua/2.0 os/linux#4.18.0-553.45.1.el8_10.x86_64 md/arch#x86_64 lang/python#3.12.10 md/pyimpl#CPython cfg/retry-mode#standard botocore/1.37.3" - latency=22737.888671875s
- In the end, both bucket stats and the bucket listing showed the removal completed successfully: all 9999 of the 10K versions were deleted (the workload's retain count was 1, so the latest version was kept).
- Actual deletions still take hours. The change improves system stability and client experience but doesn't significantly speed up backend deletions.
- It removes the recursive duplication, making the system robust under concurrent deletes.
Updated by Oguzhan Ozmen 8 months ago
- Pull request ID set to 64800
Added https://github.com/ceph/ceph/pull/64800 (RGW: multi object delete op; skip olh update for all deletes but the last one #64800) as the potential fix implementing the proposed solution.
Updated by Upkeep Bot 6 months ago
- Status changed from In Progress to Resolved
- Merge Commit set to 6f543b897a6474a6992de5ac34e0a1f67893a9b6
- Fixed In set to v20.3.0-2930-g6f543b897a
- Upkeep Timestamp set to 2025-09-10T19:37:42+00:00
Updated by Casey Bodley 6 months ago
- Status changed from Resolved to Pending Backport
- Backport set to squid tentacle
Updated by Upkeep Bot 6 months ago
- Copied to Backport #72970: tentacle: RGW: Bulk Delete of Versioned Objects Triggers Excessive OLH Updates Leading to RGW Lock-up and OOM added
Updated by Upkeep Bot 6 months ago
- Copied to Backport #72971: squid: RGW: Bulk Delete of Versioned Objects Triggers Excessive OLH Updates Leading to RGW Lock-up and OOM added