Skip to content

librados/watch_notify: reconnect after socket injection#45825

Merged
yuriw merged 2 commits intoceph:masterfrom
NitzanMordhai:wip-nitzan-watch-notify-reconnect-107
May 24, 2022
Merged

librados/watch_notify: reconnect after socket injection#45825
yuriw merged 2 commits intoceph:masterfrom
NitzanMordhai:wip-nitzan-watch-notify-reconnect-107

Conversation

@NitzanMordhai
Copy link
Contributor

@NitzanMordhai NitzanMordhai commented Apr 8, 2022

For some tests, if socket failure injection cause watch to fail, we can re-watch using the error callback

Fixes: https://tracker.ceph.com/issues/45868
Signed-off-by: Nitzan Mordechai nmordec@redhat.com

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@github-actions github-actions bot added the tests label Apr 8, 2022
@NitzanMordhai NitzanMordhai requested a review from badone April 8, 2022 09:03
Copy link
Contributor

@badone badone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good so far but could you remove "?next_issue_id=52488" from the tracker link in the commit message and change "socket injection" to "socket failure injection". Please also take a look at the comment I left on the tracker regarding expanding this solution to cover more of the known failure cases.

For some tests, if socket failure injection cause watch to fail, we can re-watch using the error callback

Fixes: https://tracker.ceph.com/issues/45868
Signed-off-by: Nitzan Mordechai <nmordec@redhat.com>
When socket failure injection or any delay that time out watch we need reconnect the watch
using the error callback to reconnect

Fixes: https://tracker.ceph.com/issues/47025
Signed-off-by: Nitzan Mordechai <nmordec@redhat.com>
@neha-ojha neha-ojha added the core label Apr 12, 2022
@badone badone added the bug-fix label Apr 21, 2022
@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-watch-notify-reconnect-107 branch from 20e6e39 to d2c5d7a Compare April 26, 2022 06:03
@NitzanMordhai
Copy link
Contributor Author

This looks good so far but could you remove "?next_issue_id=52488" from the tracker link in the commit message and change "socket injection" to "socket failure injection". Please also take a look at the comment I left on the tracker regarding expanding this solution to cover more of the known failure cases.

I added another commit to handle another known issue with test

Copy link
Contributor

@badone badone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@badone
Copy link
Contributor

badone commented Apr 27, 2022

jenkins test api

@ljflores
Copy link
Member

http://pulpito.front.sepia.ceph.com/yuriw-2022-05-19_01:43:57-rados-wip-yuri7-testing-2022-05-18-1636-distro-default-smithi/

Failures, unrelated:
1. https://tracker.ceph.com/issues/52124
2. https://tracker.ceph.com/issues/55741
3. https://tracker.ceph.com/issues/51835
4. https://tracker.ceph.com/issues/52321

Details:
1. Invalid read of size 8 in handle_recovery_delete() - Ceph - RADOS
2. cephadm/test_dashboard_e2e.sh: Unable to find element cd-modal .custom-control-label when testing on orchestrator/01-hosts.e2e-spec.ts - Ceph - Mgr - Dashboard
3. mgr/DaemonServer.cc: FAILED ceph_assert(pending_service_map.epoch > service_map.epoch) - Ceph - RADOS
4. qa/tasks/rook times out: 'check osd count' reached maximum tries (90) after waiting for 900 seconds - Ceph - Orchestrator

@ljflores
Copy link
Member

@NitzanMordhai perhaps similar logic can be applied for this Tracker: https://tracker.ceph.com/issues/45721. Let me know if you have any thoughts!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants