rgw/multisite: fix sync requests on existing objects #65479

Merged
cbodley merged 1 commit into ceph:main from smanjara:wip-fix-object-sync-trace
Sep 30, 2025

Conversation

@smanjara
Contributor

@smanjara smanjara commented Sep 10, 2025

resolves https://tracker.ceph.com/issues/72950

the fix here resets RGW_ATTR_OBJ_REPLICATION_TRACE during object attr changes.

otherwise, if a zone receives an s3 object api request such as PutObjectAcl or PutObjectTagging, and this zone was originally the source zone for the object put request, then the subsequent sync ops will fail. this is because the zone id was added to the replication trace to ensure that we don't sync the object back to it, for example in a put/delete race during full sync (https://tracker.ceph.com/issues/58911). so, if the same zone ever becomes the destination for a later sync request on the same object, we compare this zone, as the destination zone, against the zone entries in the replication trace, and because its entry is already present in the trace, the sync operation returns -ERR_NOT_MODIFIED.
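
The failure mode described above can be modeled in a few lines. This is only a sketch under stated assumptions, not the RGW implementation: `Attrs`, `check_fetch`, `record_sync`, `FetchResult` and the `kReplicationTrace` key are hypothetical stand-ins, and the trace is represented as a plain set of zone ids rather than the encoded RGW_ATTR_OBJ_REPLICATION_TRACE attr.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-ins for illustration; these are not the RGW types.
using ZoneId = std::string;
using Attrs = std::map<std::string, std::set<ZoneId>>;

// Illustrative key only; the real attr is RGW_ATTR_OBJ_REPLICATION_TRACE.
const std::string kReplicationTrace = "replication-trace";

enum class FetchResult { Ok, NotModified };

// Destination-side bookkeeping: after syncing an object from source_zone,
// remember that zone in the trace stored with the local copy.
void record_sync(Attrs& attrs, const ZoneId& source_zone) {
  assert(!source_zone.empty());
  attrs[kReplicationTrace].insert(source_zone);
}

// Check made when another zone fetches the object: if the requesting
// (destination) zone already appears in the stored trace, report
// "not modified" instead of serving the object.
FetchResult check_fetch(const Attrs& attrs, const ZoneId& dst_zone) {
  auto it = attrs.find(kReplicationTrace);
  if (it != attrs.end() && it->second.count(dst_zone) > 0) {
    return FetchResult::NotModified;
  }
  return FetchResult::Ok;
}
```

In this model, once a copy has recorded "zone-a" via `record_sync`, every later `check_fetch` for "zone-a" returns `FetchResult::NotModified`, even if the object's attrs changed in between. That is the stale-trace behavior this PR addresses.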


@smanjara smanjara requested a review from a team as a code owner September 10, 2025 18:10
@smanjara smanjara requested a review from cbodley September 10, 2025 18:12
@cbodley
Contributor

cbodley commented Sep 10, 2025

from https://tracker.ceph.com/issues/72950:

this is because when the initial object creation syncs from zone A, datasync on destination zone B adds zone A into source_trace_entry which is stored as an object attr RGW_ATTR_OBJ_REPLICATION_TRACE. now when we add/modify any attributes on this object in the opposite direction, where zone A becomes the destination zone, we add zone A into dst_zone_trace. during a GetObj on zone B, we then go on to compare if we have already synced to zone A by comparing dst_zone_trace with entries stored in RGW_ATTR_OBJ_REPLICATION_TRACE and since zone A already exists in the trace as part of the initial object create op, we return ERR_NOT_MODIFIED, thus failing to sync the obj attribute.

for object uploads/overwrites, the new object starts with a fresh RGW_ATTR_OBJ_REPLICATION_TRACE. maybe metadata ops like PutObjectAcl etc should also reset the RGW_ATTR_OBJ_REPLICATION_TRACE, since we want that change to re-replicate everywhere?
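
A minimal sketch of that suggestion, assuming a simplified object model: `Object`, `already_synced_to` and `set_attr_resetting_trace` are hypothetical names, and the trace is kept as a separate set here, whereas in RGW it is stored in the RGW_ATTR_OBJ_REPLICATION_TRACE attr. The only point illustrated is that a metadata-only change clears the trace, the same way a fresh upload starts with an empty one.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical object model for illustration only.
struct Object {
  std::map<std::string, std::string> attrs;
  std::set<std::string> replication_trace;  // zone ids already synced
};

// True if dst_zone is already recorded in the object's trace.
bool already_synced_to(const Object& obj, const std::string& dst_zone) {
  return obj.replication_trace.count(dst_zone) > 0;
}

// Stand-in for a metadata-only op (PutObjectAcl, PutObjectTagging, ...):
// update the attr and drop the trace so the change can replicate to
// every zone again, including the zone that originally wrote the object.
void set_attr_resetting_trace(Object& obj, const std::string& name,
                              const std::string& value) {
  assert(!name.empty());
  obj.attrs[name] = value;
  obj.replication_trace.clear();
}
```

With the reset in place, an object whose trace contains "zone-a" no longer reports `already_synced_to(obj, "zone-a")` after an attr change, so the modified attrs are eligible to sync back to that zone.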

@smanjara smanjara force-pushed the wip-fix-object-sync-trace branch from 864b814 to bb5b58f Compare September 11, 2025 15:29
@smanjara
Contributor Author

hmm I added a change to set_attrs() to erase the attr. but I don't think it is clearing it. not sure what is missing.

@anrao19
Contributor

anrao19 commented Sep 16, 2025

pr testing completed: https://tracker.ceph.com/issues/73008, and it was approved by @ivancich.
@smanjara, if no further testing is needed then this pr can be merged.

@smanjara
Contributor Author

hi @anrao19 i pushed new changes to the pr. sorry we will have to re-run it once approved. thanks!

Contributor

@cbodley cbodley left a comment

the op.rmxattr() changes look good 👍

…r changes.

Signed-off-by: Shilpa Jagannath <smanjara@redhat.com>
@smanjara smanjara force-pushed the wip-fix-object-sync-trace branch from af5adab to e1ac09e Compare September 16, 2025 20:39
@anrao19
Contributor

anrao19 commented Sep 17, 2025

Hi @smanjara, could you please let me know once the pr is ready for re-test? For now I will remove the tables added.

@smanjara
Contributor Author

yes, it is ready to be tested. thanks @anrao19

@ivancich
Member

ivancich commented Sep 29, 2025

@smanjara : I see that this code change involves calling "set_canned_acl". The QA run is also seeing this error test_block_public_object_canned_acls. I'm wondering if they're related and if you could look into it.

The specific error occurs here: https://qa-proxy.ceph.com/teuthology/anuchaithra-2025-09-27_12:21:44-rgw-wip-anrao3-testing-2025-09-27-1011-distro-default-smithi/8522524/teuthology.log

The full run is here: https://pulpito.ceph.com/anuchaithra-2025-09-27_12:21:44-rgw-wip-anrao3-testing-2025-09-27-1011-distro-default-smithi/

I'm going to remove the needs-qa label until you've had a chance to look into this. Thanks!

@ivancich ivancich removed the needs-qa label Sep 29, 2025
@ivancich
Member

It looks like @cbodley is narrowing in on the issue. So I believe this can now be merged.

@cbodley cbodley merged commit c452120 into ceph:main Sep 30, 2025
15 of 16 checks passed