Skip to content

msg/async/ProtocolV2: take care of features when replacing the socket#35816

Merged
idryomov merged 2 commits intoceph:masterfrom
idryomov:wip-msgr21-fix-reuse
Jul 2, 2020
Merged

msg/async/ProtocolV2: take care of features when replacing the socket#35816
idryomov merged 2 commits intoceph:masterfrom
idryomov:wip-msgr21-fix-reuse

Conversation

@idryomov
Copy link
Contributor

No description provided.

idryomov added 2 commits June 28, 2020 10:32
Currently it's a mix of hex and dec, making it hard to grep for.
Converge on hex to match client_cookie.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
reuse_connection() can be called on exproto in BANNER_CONNECTING
(i.e. without peer_supported_features and with tx/rx_frame_asm set to
msgr2.0), but this state isn't carried over.  If the donor connection
is msgr2.1, this leads to repeated connection faults on crc or auth tag
mismatches because we end up assembling 2.0 frames while the peer is
expecting 2.1 frames.

Fixes: https://tracker.ceph.com/issues/46180
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
@idryomov idryomov added bug-fix messenger Issues involving one of the Ceph messenger implementations labels Jun 28, 2020
Copy link

@dillaman dillaman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, but do we need to worry about the potential for a downgrade? I can only think of a corner-case example of a package downgrade and program restart.

@idryomov
Copy link
Contributor Author

@dillaman This is not an up/downgrade issue, or at least I don't think so. The scenario here is two peers connecting to each other at the same time. To handle this race, one of the peers determines the winning connection and closes the losing one. The issue is that the winning connection can be behind in terms of the phase of the connection (i.e. how far it got), so some state must be carried over from the losing connection. If one of the peers is restarted, both connections would get closed.

@sebastian-philipp sebastian-philipp added wip-swagner-testing My Teuthology tests and removed wip-swagner-testing My Teuthology tests labels Jun 29, 2020
@neha-ojha
Copy link
Member

-s rados/singleton-nomsgr --filter 'all/health-warnings rados' -N 20 / 200:
https://pulpito.ceph.com/dis-2020-06-28_18:37:38-rados:singleton-nomsgr-wip-msgr21-fix-reuse-rebuildci-distro-basic-smithi/
https://pulpito.ceph.com/dis-2020-06-29_08:50:54-rados:singleton-nomsgr-wip-msgr21-fix-reuse-rebuildci-distro-basic-smithi/

-s rados --filter-out cephadm --subset 1/3333:
https://pulpito.ceph.com/dis-2020-06-28_18:43:20-rados-wip-msgr21-fix-reuse-rebuildci-distro-basic-smithi/

@idryomov I don't see any related failures in rados. I think we also want to run this through rgw, fs and upgrade suites, fyi @cbodley @batrick

Copy link
Member

@batrick batrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixes it for cephfs. Thanks @idryomov !

@neha-ojha
Copy link
Member

@sebastian-philipp
Copy link
Contributor

sebastian-philipp commented Jul 1, 2020

@sebastian-philipp
Copy link
Contributor

sebastian-philipp commented Jul 1, 2020

got a strange SELinux error https://tracker.ceph.com/issues/46300 in my run testing this PR. any idea if this is related to this PR?

@idryomov
Copy link
Contributor Author

idryomov commented Jul 1, 2020

got a strange SELinux error https://tracker.ceph.com/issues/46300 in my run testing this PR. any idea if this is related to this PR?

type=AVC msg=audit(1593601294.109:4683): avc:  denied  { module_request } for
pid=18957 comm="ksmtuned" kmod="binfmt-464c" scontext=system_u:system_r:ksmtuned_t:s0
tcontext=system_u:system_r:kernel_t:s0 tclass=system permissive=1

From a cursory look, it appears that ksmtuned triggered auto-load of binfmt-464c kernel module, which would suggest an unrecognized ELF binary or some weird interaction between ksmtuned and the rest of the system? Definitely not related.

@idryomov idryomov merged commit 0195fd5 into ceph:master Jul 2, 2020
@idryomov idryomov deleted the wip-msgr21-fix-reuse branch July 8, 2020 11:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix messenger Issues involving one of the Ceph messenger implementations wip-pdonnell-testing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants