msg/async/ProtocolV2: take care of features when replacing the socket#35816
idryomov merged 2 commits into ceph:master
Currently it's a mix of hex and dec, making it hard to grep for. Converge on hex to match client_cookie.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
reuse_connection() can be called on exproto in BANNER_CONNECTING (i.e. without peer_supported_features and with tx/rx_frame_asm set to msgr2.0), but this state isn't carried over. If the donor connection is msgr2.1, this leads to repeated connection faults on crc or auth tag mismatches because we end up assembling 2.0 frames while the peer is expecting 2.1 frames.

Fixes: https://tracker.ceph.com/issues/46180
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
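As a rough illustration of the failure mode described in this commit message, the sketch below models the state that has to move from the donor connection to the existing protocol object when the socket is replaced. It is not the actual Ceph code: `ProtoSketch`, `FrameAssemblerSketch`, `FEATURE_REV1_SKETCH` and `carry_over_features()` are hypothetical names, and the feature bit value is an assumption chosen only for the example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-ins; these are NOT the real Ceph classes.
struct FrameAssemblerSketch {
  bool is_rev1 = false;  // false: msgr2.0 framing, true: msgr2.1 framing
};

struct ProtoSketch {
  std::uint64_t peer_supported_features = 0;  // 0: nothing negotiated yet
  FrameAssemblerSketch tx_frame_asm;
  FrameAssemblerSketch rx_frame_asm;
};

// Assumed feature bit meaning "peer speaks msgr2.1" (illustrative value only).
constexpr std::uint64_t FEATURE_REV1_SKETCH = 1ULL << 0;

// Copy the negotiated state from the donor (the connection whose socket is
// being taken over) to the surviving protocol object. Without this step the
// survivor would keep assembling msgr2.0 frames on a msgr2.1 session.
inline void carry_over_features(ProtoSketch& existing,
                                const ProtoSketch& donor) {
  existing.peer_supported_features = donor.peer_supported_features;
  existing.tx_frame_asm.is_rev1 = donor.tx_frame_asm.is_rev1;
  existing.rx_frame_asm.is_rev1 = donor.rx_frame_asm.is_rev1;
}
```

In this model, an `existing` object still in its default (msgr2.0) state that takes over a msgr2.1 donor ends up with both assemblers switched to revision 1, so both peers agree on the frame format.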
-s rados/singleton-nomsgr --filter 'all/health-warnings rados' -N 20 / 200: -s rados --filter-out cephadm --subset 1/3333:
dillaman left a comment
lgtm, but do we need to worry about the potential for a downgrade? I can only think of a corner-case example of a package downgrade and program restart.
@dillaman This is not an up/downgrade issue, or at least I don't think so. The scenario here is two peers connecting to each other at the same time. To handle this race, one of the peers determines the winning connection and closes the losing one. The issue is that the winning connection can be behind in terms of the phase of the connection (i.e. how far it got), so some state must be carried over from the losing connection. If one of the peers is restarted, both connections would get closed.
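The race described above can be sketched roughly as follows. This is a toy model under stated assumptions, not ProtocolV2 itself: `PhaseSketch`, `ConnSketch` and `resolve_race()` are hypothetical names, and how the winner is chosen is deliberately left out, since the comment does not say.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical phases, ordered by how far the handshake got.
enum class PhaseSketch : int { BannerConnecting = 0, Negotiated = 1, Ready = 2 };

struct ConnSketch {
  PhaseSketch phase = PhaseSketch::BannerConnecting;
  std::uint64_t peer_features = 0;  // only meaningful once negotiated
  bool open = true;
};

// Resolve a connection race: `winner` survives and `loser` is closed.
// If the loser got further in the handshake, its negotiated state is
// moved to the winner before the loser goes away.
inline void resolve_race(ConnSketch& winner, ConnSketch& loser) {
  if (static_cast<int>(loser.phase) > static_cast<int>(winner.phase)) {
    winner.phase = loser.phase;
    winner.peer_features = loser.peer_features;
  }
  loser.open = false;
}
```

A winner still at `BannerConnecting` paired with a loser at `Negotiated` ends up holding the loser's features, which is the case this PR is concerned with. A restart of either peer would close both connections, so no stale state survives in that case.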
@idryomov I don't see any related failures in rados. I think we also want to run this through rgw, fs and upgrade suites, fyi @cbodley @batrick
no related failures in upgrade: https://pulpito.ceph.com/nojha-2020-06-29_20:23:12-upgrade-wip-msgr21-fix-reuse-rebuildci-distro-basic-smithi/
got a strange SELinux error https://tracker.ceph.com/issues/46300 in my run testing this PR. any idea if this is related to this PR?
From a cursory look, it appears that ksmtuned triggered an auto-load of the binfmt-464c kernel module, which would suggest an unrecognized ELF binary or some weird interaction between ksmtuned and the rest of the system. Definitely not related.