
rgw/notifications: prevent deletion of skipped notifications#64010

Merged
yuvalif merged 4 commits into ceph:main from yuvalif:wip-yuval-70756 on Jun 30, 2025


@yuvalif
Contributor

@yuvalif yuvalif commented Jun 18, 2025

Fixes: https://tracker.ceph.com/issues/70756

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@yuvalif yuvalif marked this pull request as ready for review June 18, 2025 15:01
@yuvalif yuvalif requested a review from a team as a code owner June 18, 2025 15:01
@yuvalif yuvalif requested a review from AliMasarweh June 18, 2025 15:02
Member

@AliMasarweh AliMasarweh left a comment


Looks good

@yuvalif
Contributor Author

yuvalif commented Jun 19, 2025

notification tests are failing on known issues: https://pulpito.ceph.com/yuvalif-2025-06-19_08:22:33-rgw:notifications-wip-yuval-70756-distro-default-smithi/
rgw regression: https://pulpito.ceph.com/yuvalif-2025-06-19_08:22:33-rgw:notifications-wip-yuval-70756-distro-default-smithi/
rerun has 12 failures: https://pulpito.ceph.com/yuvalif-2025-06-19_08:24:27-rgw-wip-yuval-70756-distro-default-smithi/

  • known issues
  • timeout expired in wait_for_all_osds_up - probably unrelated
  • Command failed on smithi037 with status 1: 'sudo fuser -v /var/lib/dpkg/lock-frontend' looks like a test issue

@yuvalif
Contributor Author

yuvalif commented Jun 23, 2025

re-rerun: https://pulpito.ceph.com/yuvalif-2025-06-19_15:17:03-rgw-wip-yuval-70756-distro-default-smithi/
has 11 failures:

  • run-d4n.sh - known
  • valgrind error: Leak_PossiblyLost operator new[](unsigned long) Objecter::start_tick() Objecter::start(OSDMap const*) - known
  • kafka_failover - known
  • timeout expired in wait_for_all_osds_up - looks like a test issue
  • test_bucket_log_trim_after_delete_bucket_secondary_reshard - known

@yuvalif
Contributor Author

yuvalif commented Jun 23, 2025

jenkins test make check arm64

@yuvalif
Contributor Author

yuvalif commented Jun 23, 2025

jenkins test api

@yuvalif
Contributor Author

yuvalif commented Jun 24, 2025

jenkins test api

@yuvalif
Contributor Author

yuvalif commented Jun 24, 2025

jenkins test make check arm64

@yuvalif
Contributor Author

yuvalif commented Jun 24, 2025

jenkins test submodules

@yuvalif
Contributor Author

yuvalif commented Jun 24, 2025

jenkins test make check

@yuvalif
Contributor Author

yuvalif commented Jun 25, 2025

jenkins test make check

@yuvalif
Contributor Author

yuvalif commented Jun 25, 2025

jenkins test make check arm64

@yuvalif
Contributor Author

yuvalif commented Jun 25, 2025

jenkins test submodules

yuvalif added 4 commits June 25, 2025 15:09
if a notification retry should be skipped, we should stop processing
all notifications. if we successfully process another notification after
the skipped one, it will not be removed (as we remove only up to the
marker of the skipped notification). as a result, the successful
notification will be processed again.

Fixes: https://tracker.ceph.com/issues/70756

Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
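
The behavior described in the commit message above can be illustrated with a small model. This is only a sketch under stated assumptions, not the RGW implementation: the queue is a plain Python list, the "marker" is modeled as the index of the first skipped entry, and the names `Result`, `process_batch`, `run_passes` and `stop_on_skip` are hypothetical.

```python
from enum import Enum


class Result(Enum):
    SENT = "sent"
    SKIPPED = "skipped"  # retry is not due yet, so the entry must stay queued


def process_batch(queue, deliver, stop_on_skip=True):
    """Process entries in order and remove only the prefix before the
    first skipped entry (a stand-in for "remove up to the marker").

    Returns the entries that were delivered in this pass.
    """
    delivered = []
    remove_upto = len(queue)  # index of the first entry that must stay
    for i, entry in enumerate(queue):
        if deliver(entry) is Result.SKIPPED:
            remove_upto = min(remove_upto, i)
            if stop_on_skip:
                break  # the fix: do not touch anything after a skipped entry
            continue
        delivered.append(entry)
    del queue[:remove_upto]
    return delivered


def run_passes(stop_on_skip, passes=2):
    """Run several processing passes where entry "b" is always skipped."""
    queue = ["a", "b", "c"]
    log = []

    def deliver(entry):
        return Result.SKIPPED if entry == "b" else Result.SENT

    for _ in range(passes):
        log.extend(process_batch(queue, deliver, stop_on_skip))
    return log


if __name__ == "__main__":
    print(run_passes(stop_on_skip=False))  # "c" is delivered on every pass
    print(run_passes(stop_on_skip=True))   # "c" waits behind the skipped "b"
```

In this model, continuing past the skipped entry delivers "c" on every pass because it sits after the removal point, while stopping at the skipped entry leaves "c" untouched until "b" is resolved.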
if the RGW is slow, and the client retries, it may cause the test to
fail since the number of notifications would be off.
in addition, with a slow RGW, we need to verify that the expiry time did
not pass before checking the queue, so we see the expected number of
entries in the queue before they expire.

Fixes: https://tracker.ceph.com/issues/70756

Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
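
One way a test could guard against the expiry race described in the commit message above is to check the elapsed time before asserting on the queue size. The following is a hedged sketch only; the helper names, the `time_to_live` parameter and the choice to treat a late check as inconclusive are assumptions, not the code of the actual test suite.

```python
import time


def queue_check_is_meaningful(started_at, time_to_live, now=None):
    """Return True if entries created at `started_at` cannot have expired yet."""
    now = time.monotonic() if now is None else now
    return (now - started_at) < time_to_live


def check_queue_size(get_size, expected, started_at, time_to_live, now=None):
    """Compare the queue size with `expected` only while the check is valid.

    Returns True if the size was checked and matched, and False if the
    expiry time already passed and the check was skipped as inconclusive.
    """
    if not queue_check_is_meaningful(started_at, time_to_live, now):
        return False  # too late: entries may already have expired
    assert get_size() == expected
    return True
```

A caller would record `started_at = time.monotonic()` just before creating the objects, and pass a callable that queries the queue size.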
…ep" state

this will prevent re-reading the queue when there is no work to do
also, put into "idle" state in case of failure with -EBUSY error code

Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
…ations

if we fail to decode a notification entry we should remove it.
otherwise we will keep failing on that entry

Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
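
The effect of the last commit message above can be sketched with a toy queue. This is an illustration under assumptions, not the RGW code: JSON stands in for the real entry encoding, and `process_pass` and `remove_undecodable` are hypothetical names.

```python
import json


def process_pass(queue, handled, remove_undecodable=True):
    """Process every raw entry once and return the entries left in the queue.

    Decoded entries are appended to `handled` and removed. An entry that
    fails to decode is removed as well when `remove_undecodable` is True.
    When it is False the entry stays, so every later pass fails on it again.
    """
    remaining = []
    for raw in queue:
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            if not remove_undecodable:
                remaining.append(raw)
            continue
        handled.append(event)
    return remaining
```

With `remove_undecodable=False`, a queue holding a single corrupt entry is returned unchanged on every pass, which is the repeated failure the commit message describes.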
@yuvalif
Contributor Author

yuvalif commented Jun 26, 2025

jenkins test make check

@yuvalif
Contributor Author

yuvalif commented Jun 26, 2025

jenkins test make check arm64

@yuvalif
Contributor Author

yuvalif commented Jun 26, 2025

jenkins test make check

@yuvalif yuvalif merged commit 26b60e4 into ceph:main Jun 30, 2025
13 checks passed