rgw/dedup: full object dedup continuous work#63560

Merged
yuvalif merged 1 commit into ceph:main from benhanokh:dedup_pool
Jun 30, 2025

Conversation

@benhanokh
Contributor

@benhanokh benhanokh commented May 28, 2025

Moved all control objects (EPOCH, WATCH, Tokens) to the default.rgw.control pool.
The rgw.dedup pool is created on dedup start and removed when the scan is over.

Report space still duplicated after dedup because of the head object.
Report potential dedup for smaller objects (64KB-4MB).
Code cleanup.


@benhanokh benhanokh self-assigned this May 28, 2025
@benhanokh benhanokh requested a review from a team as a code owner May 28, 2025 14:57
@benhanokh benhanokh requested a review from yuvalif May 28, 2025 14:58
@benhanokh
Contributor Author

jenkins test api

@benhanokh
Contributor Author

jenkins test windows

@cbodley
Contributor

cbodley commented May 29, 2025

Can you please give this PR and commit a real title? You just copied the exact same text from #62179.

@benhanokh
Contributor Author

jenkins test windows

@benhanokh benhanokh changed the title from "rgw/dedup: full object dedup" to "rgw/dedup: full object dedup continuous work" May 29, 2025
@benhanokh
Contributor Author

jenkins test windows

@benhanokh
Contributor Author

jenkins test make check

int ret = get_epoch(store, dpp, &old_epoch, __func__);
if (ret != 0) {
  return ret;
}
// generate an empty epoch with zero counters
Contributor

this is a change in behavior.
what could be the reason for the failure? do we always create an empty epoch regardless of why we failed?

Contributor Author

We need an EPOCH object in order to run scans.
The EPOCH object is created on the first scan (we used to create one when loading the code).
The only objects created at startup are the watch objects.

return false;
ldpp_dout(dpp, 1) << __func__ << "::failed shard_progress_t decode!" << dendl;
completed_arr[shard] = TOKEN_STATE_CORRUPTED;
continue;
Contributor

this is a behavior change - not bailing out on a malformed token?

Contributor Author

The first pass collects state from all tokens, but will still fail the call.
After a 120-second wait, the caller passes the "FORCE" option, allowing it to skip past failing tokens.
Active in-process tokens are still waited on for completion.

@yuvalif
Contributor

yuvalif commented Jun 9, 2025

are the following two counters the same, or could they be different:

  • one-part objects that were not deduplicated because we do not dedup head objects?
  • objects that were not deduplicated because they were below some threshold (in case this threshold is larger than 4MB)?

@yuvalif
Contributor

yuvalif commented Jun 10, 2025

> are the following two counters the same, or could they be different:
>
>   • one-part objects that were not deduplicated because we do not dedup head objects?
>   • objects that were not deduplicated because they were below some threshold (in case this threshold is larger than 4MB)?

answered above

@benhanokh
Contributor Author

jenkins test api

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

Moved all control objects (EPOCH, WATCH, Tokens) to the default.rgw.control pool.
Added dedup_pool to RGWZoneParams to make the name unique across zones.
The rgw.dedup pool is created on dedup start and removed when the scan is over.

Report space still duplicated after dedup because of the head object.
Report potential dedup for smaller objects (64KB-4MB).
Added tests for the new reporting facilities.

Signed-off-by: Gabriel BenHanokh <gbenhano@redhat.com>
@github-project-automation github-project-automation bot moved this from New to Reviewer approved in Ceph-Dashboard Jun 19, 2025
@benhanokh
Contributor Author

restarted failing tests (mostly No module named 'tasks.ceph')
https://pulpito.ceph.com/benhanokh-2025-06-24_14:33:40-rgw-wip_dedup_pool_A-distro-default-smithi/

@benhanokh
Contributor Author

benhanokh commented Jun 29, 2025

rebased and repeated the failing tests:
https://pulpito.ceph.com/benhanokh-2025-06-29_05:50:46-rgw-wip_dedup_pool_C-distro-default-smithi/
Looks much better after the rebase.
We are down to 6 failures:
3 valgrind errors from Objecter::start_tick() Objecter::start(OSDMap const*)

1 failure in rgw multisite test caused by a missing SSH private key:
"smithi050.front.sepia.ceph.com: SSHException('No existing session') (No SSH private key found!)"

1 failure from kafka_failover with rgw stdout reporting
"Cluster is is misconfigured!"
https://qa-proxy.ceph.com/teuthology/benhanokh-2025-06-29_05:50:46-rgw-wip_dedup_pool_C-distro-default-smithi/8356484/remote/smithi137/log/rgw.ceph.client.0.stdout

1 more failure I don't understand from rgw/singleton/
https://pulpito.ceph.com/benhanokh-2025-06-29_05:50:46-rgw-wip_dedup_pool_C-distro-default-smithi/8356482/
Traceback reports:
"ERROR:teuthology.run_tasks:Saw exception from tasks"

I saw nothing of interest on the system, except a report in ceph.log complaining that osd.0 was immediately reported failed by osd.1, osd.2 and osd.3.
The only osd log we got on the system is from osd.0 (which probably means all the other osds refused to start because they suspected osd.0).

@yuvalif
Contributor

yuvalif commented Jun 30, 2025

> rebased and repeated the failing tests: https://pulpito.ceph.com/benhanokh-2025-06-29_05:50:46-rgw-wip_dedup_pool_C-distro-default-smithi/ Looks much better after the rebase. We are down to 6 failures: 3 valgrind errors from Objecter::start_tick() Objecter::start(OSDMap const*)
>
> 1 failure in rgw multisite test caused by a missing SSH private key: "smithi050.front.sepia.ceph.com: SSHException('No existing session') (No SSH private key found!)"
>
> 1 failure from kafka_failover with rgw stdout reporting "Cluster is is misconfigured!" https://qa-proxy.ceph.com/teuthology/benhanokh-2025-06-29_05:50:46-rgw-wip_dedup_pool_C-distro-default-smithi/8356484/remote/smithi137/log/rgw.ceph.client.0.stdout
>
> 1 more failure I don't understand from rgw/singleton/ https://pulpito.ceph.com/benhanokh-2025-06-29_05:50:46-rgw-wip_dedup_pool_C-distro-default-smithi/8356482/ Traceback reports: "ERROR:teuthology.run_tasks:Saw exception from tasks"
>
> I saw nothing of interest on the system except for a report on ceph.log complaining that osd.0 was reported immediately failed by osd.1, osd.2 and osd.3 The only osd log we got in the system is from osd.0 (probably means that all other osds refused to start because they suspect osd.0)

  • the valgrind issues are known
  • the kafka failover issue is known
  • the multisite issue is known
  • this failure is also seen here

@yuvalif yuvalif merged commit 407e9b7 into ceph:main Jun 30, 2025
13 checks passed
@github-project-automation github-project-automation bot moved this from Reviewer approved to Done in Ceph-Dashboard Jun 30, 2025