osdc: fix lingerOp stray #36694

Merged
tchaikov merged 1 commit into ceph:master from shun-s:fix-linger-op-stray on Aug 21, 2020

@shun-s
Contributor

@shun-s shun-s commented Aug 18, 2020

When a linger ping fails with an error such as ENOTCONN,
last_error is set to that error.
After that, last_error never recovers to success (0),
even after a successful reconnect, which stops linger pings from being sent to the OSD.
As a result, this otherwise healthy client **can't receive notify messages**
once osd_client_watch_timeout has elapsed.

Fixes: https://tracker.ceph.com/issues/47004

Signed-off-by: Song Shun <song.shun3@zte.com.cn>
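
To make the failure mode described above concrete, here is a minimal toy model of a sticky error flag that gates periodic pings. It is only a sketch: `ToyLingerOp`, `tick`, `on_ping_reply`, `on_reconnect`, and the `clear_on_success` switch are hypothetical names invented for illustration and are not the Objecter code or the actual patch. The only details taken from the description are that `last_error` is set on a failed ping, that it was never reset to 0, and that a non-zero value stops further pings.

```cpp
#include <cassert>
#include <cerrno>

// Hypothetical, simplified stand-in for a linger (watch) operation.
struct ToyLingerOp {
  int last_error = 0;  // 0 means "healthy"
  int pings_sent = 0;  // number of pings the periodic tick has sent
};

// Periodic timer: a ping is only sent while no error is recorded.
void tick(ToyLingerOp& op) {
  if (op.last_error == 0) {
    ++op.pings_sent;
  }
}

// A ping reply with a negative result records the first error seen.
void on_ping_reply(ToyLingerOp& op, int r) {
  if (r < 0 && op.last_error == 0) {
    op.last_error = r;
  }
}

// Reconnect completion. With clear_on_success == false the error stays
// sticky (the behaviour described as the bug); with true, a successful
// reconnect resets it (an assumed shape for a fix, not the real patch).
void on_reconnect(ToyLingerOp& op, int r, bool clear_on_success) {
  if (r < 0) {
    if (op.last_error == 0) {
      op.last_error = r;
    }
  } else if (clear_on_success) {
    op.last_error = 0;
  }
}

// Returns how many pings are sent in three ticks after a failed ping
// followed by a successful reconnect.
int pings_after_recovery(bool clear_on_success) {
  ToyLingerOp op;
  tick(op);                      // healthy: one ping goes out
  on_ping_reply(op, -ENOTCONN);  // the ping fails
  assert(op.last_error == -ENOTCONN);
  on_reconnect(op, 0, clear_on_success);
  const int before = op.pings_sent;
  for (int i = 0; i < 3; ++i) {
    tick(op);
  }
  return op.pings_sent - before;
}
```

In this model, `pings_after_recovery(false)` returns 0 because the stale error keeps suppressing pings, while `pings_after_recovery(true)` returns 3.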

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Member

@jdurgin jdurgin left a comment

hmm, this may explain some transient watch/notify test failures! nice find!

could you file a tracker ticket and reference it in the commit message so we can backport this as well? the logic here hasn't changed since 2014

@jdurgin
Member

jdurgin commented Aug 18, 2020

@badone if you had a reliable way to get the watch/notify tests to fail it'd be interesting to try with this PR.

@dillaman not sure if you've seen similar issues at the rbd level

@shun-s
Contributor Author

shun-s commented Aug 18, 2020

@jdurgin here is the ticket https://tracker.ceph.com/issues/47004

  When a linger ping fails with an error such as ENOTCONN,
  last_error is set to that error.
  After that, last_error never recovers to success (0),
  even after a successful reconnect, which stops linger pings from being sent to the OSD.
  As a result, this otherwise healthy client **can't receive notify messages**
  once osd_client_watch_timeout has elapsed.

  Fixes: https://tracker.ceph.com/issues/47004

Signed-off-by: Song Shun <song.shun3@zte.com.cn>
@shun-s shun-s force-pushed the fix-linger-op-stray branch from f886150 to 65d05fd Compare August 18, 2020 09:13
@dillaman

dillaman commented Aug 18, 2020

> @dillaman not sure if you've seen similar issues at the rbd level

Watches can be expected to fail randomly. librbd uses the handle_error callback [1] to re-establish the watch (and perform possible recovery from potentially missing some notifications).

[1] https://github.com/ceph/ceph/blob/master/src/include/rados/librados.hpp#L191
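
As a rough illustration of the recovery pattern described in this comment, the sketch below re-establishes a watch from an error callback and flags that notifications may have been missed. It is a self-contained toy, not librbd or librados code: `ToyCluster`, `ToyWatcher`, `register_watch`, and `needs_resync` are hypothetical names, and only the idea of a `handle_error(cookie, err)` callback that triggers a re-watch comes from the comment above.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>

// Hypothetical stand-in for the cluster side of watch registration.
struct ToyCluster {
  uint64_t next_cookie = 1;
  // Hands out a fresh cookie for every new watch registration.
  uint64_t register_watch() { return next_cookie++; }
};

// Hypothetical watcher that recovers from watch errors on its own.
struct ToyWatcher {
  ToyCluster& cluster;
  uint64_t cookie;            // cookie of the currently active watch
  bool needs_resync = false;  // set when notifications may have been missed
  int rewatch_count = 0;      // number of times the watch was re-established

  explicit ToyWatcher(ToyCluster& c)
      : cluster(c), cookie(c.register_watch()) {}

  // Error callback: ignore errors for stale cookies, otherwise register a
  // new watch and remember that state must be re-read from the source.
  void handle_error(uint64_t failed_cookie, int err) {
    (void)err;  // the toy treats every error the same way
    if (failed_cookie != cookie) {
      return;
    }
    cookie = cluster.register_watch();
    needs_resync = true;
    ++rewatch_count;
  }
};
```

Calling `handle_error` with the active cookie and `-ENOTCONN` leaves the watcher with a new cookie and `needs_resync` set, while a repeated call with the old cookie is ignored as stale.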

@shun-s
Contributor Author

shun-s commented Aug 19, 2020

@jdurgin the ceph API test failure seems to be caused by high kv commit latency, so it is not related to this PR?

@neha-ojha
Member

> @badone if you had a reliable way to get the watch/notify tests to fail it'd be interesting to try with this PR.

We have at least 3 tracker tickets related to watch/notify tests, if not more.
https://tracker.ceph.com/issues/45424
https://tracker.ceph.com/issues/45615
https://tracker.ceph.com/issues/47025

It would be good to verify whether this PR fixes any or all of these issues.

> @dillaman not sure if you've seen similar issues at the rbd level

@badone
Contributor

badone commented Aug 19, 2020

@jdurgin @neha-ojha Going to run a few iterations of these to see if I can reproduce reasonably reliably. I'll update here with findings.

@badone
Contributor

badone commented Aug 21, 2020

@jdurgin @neha-ojha I successfully completed 50 iterations without being able to reproduce so I don't really have a way of testing this fix against those bugs.

@tchaikov tchaikov merged commit 2e029fa into ceph:master Aug 21, 2020
@shun-s shun-s deleted the fix-linger-op-stray branch August 24, 2020 09:15

6 participants