rgw/dedup: full object dedup continuous work#63560

Merged
yuvalif merged 1 commit into ceph:main from benhanokh:dedup_pool
Jun 30, 2025

Conversation

@benhanokh
Contributor

@benhanokh benhanokh commented May 28, 2025

Moved all control objects (EPOCH, WATCH, Tokens) to the default.rgw.control pool.
The rgw.dedup pool is created on dedup start and removed when the scan is over.

Report space still duplicated after dedup because of the head object.
Report potential dedup for smaller objects (64KB-4MB).
Code cleanup.


@benhanokh benhanokh self-assigned this May 28, 2025
@benhanokh benhanokh requested a review from a team as a code owner May 28, 2025 14:57
@benhanokh benhanokh requested a review from yuvalif May 28, 2025 14:58
@benhanokh
Contributor Author

jenkins test api

@benhanokh
Contributor Author

jenkins test windows

@cbodley
Contributor

cbodley commented May 29, 2025

Can you please give this PR and commit a real title? You just copied the exact same text from #62179.

@benhanokh
Contributor Author

jenkins test windows

@benhanokh benhanokh changed the title from "rgw/dedup: full object dedup" to "rgw/dedup: full object dedup continuous work" May 29, 2025
@benhanokh
Contributor Author

jenkins test windows

@benhanokh
Contributor Author

jenkins test make check

int ret = get_epoch(store, dpp, &old_epoch, __func__);
if (ret != 0) {
  return ret;
}
// generate an empty epoch with zero counters
Contributor

this is a change in behavior.
what could be the reason for the failure? do we always create an empty epoch regardless of why we failed?

Contributor Author

We need an EPOCH object in order to run scans.
The EPOCH object is created on the first scan (we used to create one when loading the code).
The only objects created at startup are the watch objects.

return false;
ldpp_dout(dpp, 1) << __func__ << "::failed shard_progress_t decode!" << dendl;
completed_arr[shard] = TOKEN_STATE_CORRUPTED;
continue;
Contributor

this is a behavior change - not bailing out on a malformed token?

Contributor Author

The first pass collects state from all tokens, but will still fail the call.
After a 120-second wait, the caller passes the "FORCE" option, allowing it to skip past failing tokens.
Active in-process tokens are still waited on for completion.

@yuvalif
Contributor

yuvalif commented Jun 9, 2025

are the following two counters the same, or could they be different:

  • one-part objects that were not deduplicated because we do not dedup head objects?
  • objects that were not deduplicated because they were below some threshold (in case this threshold is larger than 4MB)?

@yuvalif
Contributor

yuvalif commented Jun 10, 2025

> are the following two counters the same, or could they be different:
>
>   • one-part objects that were not deduplicated because we do not dedup head objects?
>   • objects that were not deduplicated because they were below some threshold (in case this threshold is larger than 4MB)?

answered above

@benhanokh
Contributor Author

jenkins test api

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

Moved all control objects (EPOCH, WATCH, Tokens) to the default.rgw.control pool.
Added dedup_pool to RGWZoneParams to make the name unique across zones.
The rgw.dedup pool is created on dedup start and removed when the scan is over.

Report space still duplicated after dedup because of the head object.
Report potential dedup for smaller objects (64KB-4MB).
Added tests for the new reporting facilities.

Signed-off-by: Gabriel BenHanokh <gbenhano@redhat.com>
@github-project-automation github-project-automation bot moved this from New to Reviewer approved in Ceph-Dashboard Jun 19, 2025
@benhanokh
Contributor Author

restarted failing tests (mostly No module named 'tasks.ceph')
https://pulpito.ceph.com/benhanokh-2025-06-24_14:33:40-rgw-wip_dedup_pool_A-distro-default-smithi/

@benhanokh
Contributor Author

benhanokh commented Jun 29, 2025

rebased and repeated the failing tests:
https://pulpito.ceph.com/benhanokh-2025-06-29_05:50:46-rgw-wip_dedup_pool_C-distro-default-smithi/
Looks much better after the rebase.
We are down to 6 failures:
3 valgrind errors from Objecter::start_tick() Objecter::start(OSDMap const*)

1 failure in rgw multisite test caused by a missing SSH private key:
"smithi050.front.sepia.ceph.com: SSHException('No existing session') (No SSH private key found!)"

1 failure from kafka_failover with rgw stdout reporting
"Cluster is is misconfigured!"
https://qa-proxy.ceph.com/teuthology/benhanokh-2025-06-29_05:50:46-rgw-wip_dedup_pool_C-distro-default-smithi/8356484/remote/smithi137/log/rgw.ceph.client.0.stdout

1 more failure I don't understand from rgw/singleton/
https://pulpito.ceph.com/benhanokh-2025-06-29_05:50:46-rgw-wip_dedup_pool_C-distro-default-smithi/8356482/
Traceback reports:
"ERROR:teuthology.run_tasks:Saw exception from tasks"

I saw nothing of interest on the system, except a report in ceph.log complaining that osd.0 was immediately reported failed by osd.1, osd.2 and osd.3.
The only osd log we got on the system is from osd.0 (which probably means all the other osds refused to start because they suspected osd.0).

@yuvalif
Contributor

yuvalif commented Jun 30, 2025

> rebased and repeated the failing tests: https://pulpito.ceph.com/benhanokh-2025-06-29_05:50:46-rgw-wip_dedup_pool_C-distro-default-smithi/ Looks much better after the rebase. We are down to 6 failures: 3 valgrind errors from Objecter::start_tick() Objecter::start(OSDMap const*)
>
> 1 failure in rgw multisite test caused by a missing SSH private key: "smithi050.front.sepia.ceph.com: SSHException('No existing session') (No SSH private key found!)"
>
> 1 failure from kafka_failover with rgw stdout reporting "Cluster is is misconfigured!" https://qa-proxy.ceph.com/teuthology/benhanokh-2025-06-29_05:50:46-rgw-wip_dedup_pool_C-distro-default-smithi/8356484/remote/smithi137/log/rgw.ceph.client.0.stdout
>
> 1 more failure I don't understand from rgw/singleton/ https://pulpito.ceph.com/benhanokh-2025-06-29_05:50:46-rgw-wip_dedup_pool_C-distro-default-smithi/8356482/ Traceback reports: "ERROR:teuthology.run_tasks:Saw exception from tasks"
>
> I saw nothing of interest on the system except for a report on ceph.log complaining that osd.0 was reported immediately failed by osd.1, osd.2 and osd.3 The only osd log we got in the system is from osd.0 (probably means that all other osds refused to start because they suspect osd.0)

  • the valgrind issues are known
  • the kafka failover issue is known
  • the multisite issue is known
  • this failure is also seen here

@yuvalif yuvalif merged commit 407e9b7 into ceph:main Jun 30, 2025
13 checks passed
@github-project-automation github-project-automation bot moved this from Reviewer approved to Done in Ceph-Dashboard Jun 30, 2025