Bug #66003
mds: session reclaim could miss blocklisting an old session
Status: Closed
Description
This is coming from a downstream BZ - https://bugzilla.redhat.com/show_bug.cgi?id=2260165
Went through the logs and I think there is a bug lurking in the session reclaim path. This is what happened:
client.101504 initiates a session reclaim for old session uuid `ganesha-controller-2-03e9':
2024-03-14T14:04:51.793+0000 7f29eeee7700 1 -- [v2:172.17.3.117:6802/1420758655,v1:172.17.3.117:6803/1420758655] <== client.101504 172.17.3.138:0/1137839911 3 ==== client_reclaim(ganesha-controller-2-03e9 flags 0x1) v1 ==== 33+0+0 (crc 0 0 0) 0x55681e107760 con 0x55681efc1c00
2024-03-14T14:04:51.793+0000 7f29f16ec700 10 --2- [v2:172.17.3.117:6802/1420758655,v1:172.17.3.117:6803/1420758655] >> 172.17.3.138:0/1137839911 conn(0x55681efc1c00 0x55681f0ccf00 crc :-1 s=READY pgs=3 cs=0 l=0 rev1=1 rx=0 tx=0).write_event
2024-03-14T14:04:51.793+0000 7f29eeee7700 20 mds.0.30 get_session have 0x55681f0ce300 client.101504 172.17.3.138:0/1137839911 state open
2024-03-14T14:04:51.793+0000 7f29eeee7700 3 mds.0.server handle_client_reclaim client_reclaim(ganesha-controller-2-03e9 flags 0x1) v1 from client.101504
Normally, reclaim should find a session with the uuid in its client metadata and then evict (+blocklist) it. However, there isn't any evidence in the logs that the old session was blocklisted and evicted. The reclaim request was nonetheless replied to with success (and epoch 0):
2024-03-14T14:04:51.793+0000 7f29eeee7700 1 -- [v2:172.17.3.117:6802/1420758655,v1:172.17.3.117:6803/1420758655] --> 172.17.3.138:0/1137839911 -- client_reclaim_reply(0 e 0) v1 -- 0x55681e106580 con 0x55681efc1c00
Note: (0 e 0) == return value:0 in epoch:0 - hinting that the old session wasn't blocklisted.
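For context, the uuid lookup the reclaim path is expected to perform can be sketched as a small toy in C++. All names here (FakeSession, find_session_by_uuid) are illustrative and are not the actual MDS code; the real path lives in Server::handle_client_reclaim.

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical, simplified model of the MDS session table: each
// session carries client-supplied metadata, including a "uuid" key.
struct FakeSession {
  long client_id;
  std::map<std::string, std::string> metadata;
};

// Sketch of the lookup: scan sessions for one whose metadata uuid
// matches the reclaim uuid. The matching session is the old one that
// should then be evicted and blocklisted.
std::optional<long>
find_session_by_uuid(const std::map<long, FakeSession>& sessions,
                     const std::string& uuid) {
  for (const auto& [id, s] : sessions) {
    auto it = s.metadata.find("uuid");
    if (it != s.metadata.end() && it->second == uuid)
      return id;  // old session found: evict + blocklist it
  }
  // bug symptom in this report: no session found, so the MDS replies
  // with success and epoch 0, having blocklisted nothing
  return std::nullopt;
}
```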
Now, later when client.101504 initiates session open, it passes in the UUID in its client metadata:
2024-03-14T14:09:01.183+0000 7f29eeee7700 20 mds.0.30 get_session have 0x55681f0ce300 client.101504 172.17.3.138:0/1137839911 state closing
2024-03-14T14:09:01.183+0000 7f29eeee7700 3 mds.0.server handle_client_session client_session(request_open) v5 from client.101504
2024-03-14T14:09:01.183+0000 7f29eeee7700 20 mds.0.server handle_client_session CEPH_SESSION_REQUEST_OPEN metadata entries:
2024-03-14T14:09:01.183+0000 7f29eeee7700 20 mds.0.server features: '0x000000000004ffff'
2024-03-14T14:09:01.183+0000 7f29eeee7700 20 mds.0.server metric specification: [{metric_flags: '0x000000000000ffff'}]
2024-03-14T14:09:01.183+0000 7f29eeee7700 20 mds.0.server ceph_sha1: 0edb63afd9bd3edb333364f2e0031b77e62f4896
2024-03-14T14:09:01.183+0000 7f29eeee7700 20 mds.0.server ceph_version: ceph version 16.2.10-248.el8cp (0edb63afd9bd3edb333364f2e0031b77e62f4896) pacific (stable)
2024-03-14T14:09:01.183+0000 7f29eeee7700 20 mds.0.server entity_id: ganesha-03dad586-51ba-437d-94bc-f30fc6d2cdd5
2024-03-14T14:09:01.183+0000 7f29eeee7700 20 mds.0.server hostname: controller-2
2024-03-14T14:09:01.183+0000 7f29eeee7700 20 mds.0.server pid: 142
2024-03-14T14:09:01.183+0000 7f29eeee7700 20 mds.0.server root: /volumes/_nogroup/03dad586-51ba-437d-94bc-f30fc6d2cdd5
2024-03-14T14:09:01.183+0000 7f29eeee7700 20 mds.0.server timeout: 300
2024-03-14T14:09:01.183+0000 7f29eeee7700 20 mds.0.server uuid: ganesha-controller-2-03e9
At this point, the MDS performs the same uuid search it did during session reclaim, but this time it finds the old session and complains about a duplicate session uuid:
2024-03-14T14:09:01.183+0000 7f29eeee7700 0 log_channel(cluster) log [WRN] : client session with duplicated session uuid 'ganesha-controller-2-03e9' denied (client.101504 172.17.3.138:0/1137839911)
Updated by Venky Shankar almost 2 years ago
The MDS log has a dump of sessions in the up:reconnect state:
2024-03-14T14:03:56.980+0000 7f29eeee7700 4 mds.0.30 apply_blocklist: killed 0, blocklisted sessions (25 blocklist entries, 1)
2024-03-14T14:03:56.980+0000 7f29eeee7700 1 mds.0.server reconnect_clients -- 1 sessions
2024-03-14T14:03:56.980+0000 7f29eeee7700 10 mds.0.sessionmap dump
2024-03-14T14:03:56.980+0000 7f29eeee7700 10 mds.0.sessionmap client.70093 0x55681f13f900 state open completed {} free_prealloc_inos [] delegated_inos []
and the reclaim message from the client came in later
2024-03-14T14:04:51.793+0000 7f29eeee7700 3 mds.0.server handle_client_reclaim client_reclaim(ganesha-controller-2-03e9 flags 0x1) v1 from client.101504
at this point, the MDS should have found the old session with the uuid, but it didn't.
Updated by Venky Shankar almost 2 years ago
I'm going to push a PR that adds some debug logs when the session with a uuid isn't found. Also, the reclaim reply message sent back to the client does not dump the reclaimed client addr. Those can be added for debugging.
Updated by Venky Shankar almost 2 years ago
debug pr to log missing session: https://github.com/ceph/ceph/pull/57859
Updated by Venky Shankar almost 2 years ago
When replaying the journal, there is a series of open -> close -> open for some sessions. E.g.:
2024-03-14T14:03:55.999+0000 7f29e76d8700 10 mds.0.log _replay 4276642~121 / 4314625 2024-03-14T12:12:46.476034+0000: ESession client.56097 172.17.3.37:0/3465111544 close cmapv 43
2024-03-14T14:03:55.999+0000 7f29e76d8700 10 mds.0.journal ESession.replay sessionmap 42 < 43 close client.56097 172.17.3.37:0/3465111544
2024-03-14T14:03:55.999+0000 7f29e76d8700 10 mds.0.journal removed session client.56097 172.17.3.37:0/3465111544
2024-03-14T14:03:55.999+0000 7f29e76d8700 10 mds.0.sessionmap remove_session s=0x55681f13e500 name=client.56097
2024-03-14T14:03:55.999+0000 7f29e76d8700 10 mds.0.log _replay 4276783~155 / 4314625 2024-03-14T12:14:12.519419+0000: ESession client.70093 172.17.3.37:0/149367982 open cmapv 44
2024-03-14T14:03:55.999+0000 7f29e76d8700 10 mds.0.journal ESession.replay sessionmap 43 < 44 open client.70093 172.17.3.37:0/149367982
2024-03-14T14:03:55.999+0000 7f29e76d8700 10 mds.0.journal opened session client.70093 172.17.3.37:0/149367982
2024-03-14T14:03:55.999+0000 7f29e76d8700 20 mds.0.sessionmap replay_dirty_session s=0x55681f13e500 name=client.70093 v=43
2024-03-14T14:03:55.999+0000 7f29e76d8700 10 mds.0.log _replay 4276958~121 / 4314625 2024-03-14T12:14:13.125078+0000: ESession client.27141 172.17.3.37:0/601521236 close cmapv 45
2024-03-14T14:03:55.999+0000 7f29e76d8700 10 mds.0.journal ESession.replay sessionmap 44 < 45 close client.27141 172.17.3.37:0/601521236
2024-03-14T14:03:55.999+0000 7f29e76d8700 10 mds.0.journal removed session client.27141 172.17.3.37:0/601521236
2024-03-14T14:03:55.999+0000 7f29e76d8700 10 mds.0.sessionmap remove_session s=0x55681f13ea00 name=client.27141
2024-03-14T14:03:55.999+0000 7f29e76d8700 10 mds.0.log _replay 4277099~121 / 4314625 2024-03-14T12:14:13.131195+0000: ESession client.70093 172.17.3.37:0/149367982 close cmapv 46
2024-03-14T14:03:55.999+0000 7f29e76d8700 10 mds.0.journal ESession.replay sessionmap 45 < 46 close client.70093 172.17.3.37:0/149367982
2024-03-14T14:03:55.999+0000 7f29e76d8700 10 mds.0.journal removed session client.70093 172.17.3.37:0/149367982
2024-03-14T14:03:55.999+0000 7f29e76d8700 10 mds.0.sessionmap remove_session s=0x55681f13e500 name=client.70093
2024-03-14T14:03:55.999+0000 7f29e76d8700 10 mds.0.log _replay 4277240~528 / 4314625 2024-03-14T12:14:13.143734+0000: ESession client.70093 172.17.3.37:0/149367982 open cmapv 47
2024-03-14T14:03:55.999+0000 7f29e76d8700 10 mds.0.journal ESession.replay sessionmap 46 < 47 open client.70093 172.17.3.37:0/149367982
2024-03-14T14:03:55.999+0000 7f29e76d8700 10 mds.0.journal opened session client.70093 172.17.3.37:0/149367982
2024-03-14T14:03:55.999+0000 7f29e76d8700 20 mds.0.sessionmap replay_dirty_session s=0x55681f13f900 name=client.70093 v=46
2024-03-14T14:03:55.999+0000 7f29e76d8700 10 mds.0.log _replay 4277788~386 / 4314625 2024-03-14T12:43:09.039075+0000: ESession client.76035 172.17.3.37:0/2206945919 open cmapv 48
2024-03-14T14:03:55.999+0000 7f29e76d8700 10 mds.0.journal ESession.replay sessionmap 47 < 48 open client.76035 172.17.3.37:0/2206945919
2024-03-14T14:03:55.999+0000 7f29e76d8700 10 mds.0.journal opened session client.76035 172.17.3.37:0/2206945919
2024-03-14T14:03:55.999+0000 7f29e76d8700 20 mds.0.sessionmap replay_dirty_session s=0x55681f13fe00 name=client.76035 v=47
Consider the session client.70093 (which is the session that was reclaimed):
2024-03-14T14:03:55.999+0000 7f29e76d8700 10 mds.0.log _replay 4276783~155 / 4314625 2024-03-14T12:14:12.519419+0000: ESession client.70093 172.17.3.37:0/149367982 open cmapv 44
2024-03-14T14:03:55.999+0000 7f29e76d8700 10 mds.0.log _replay 4277099~121 / 4314625 2024-03-14T12:14:13.131195+0000: ESession client.70093 172.17.3.37:0/149367982 close cmapv 46
2024-03-14T14:03:55.999+0000 7f29e76d8700 10 mds.0.log _replay 4277240~528 / 4314625 2024-03-14T12:14:13.143734+0000: ESession client.70093 172.17.3.37:0/149367982 open cmapv 47
Why does this happen? I'd expect an ESession(..., open, ...) followed by an ESession(..., close, ...) only when the client disconnects.
Updated by Venky Shankar almost 2 years ago
FYI - this issue seems to happen during upgrade, but I'm not totally sure. I'll have to reconfirm/retest.
Updated by Venky Shankar over 1 year ago
https://github.com/ceph/ceph/pull/57859, a debug change to aid in debugging when this happens again, has been merged.
Updated by Konstantin Shalygin about 1 year ago
- Backport changed from quincy,reef,squid to reef,squid
Updated by ymo zhang 9 months ago
hello Venky Shankar
I also encountered this problem during my recent test.
2025-06-18T14:42:54.003+0800 fffe882db8d0 10 MDSRank.cc:1555 mds.0.8708 send_message_client client.14394138 10.61.2.239:0/1921725021 client_session(reject) v5
2025-06-18T14:42:54.003+0800 fffe882db8d0 2 Server.cc:644 mds.0.server New client session: addr="10.61.2.239:0/1921725021",elapsed=0.000200,throttled=0.000016,status="REJECTED",error="duplicated session uuid",root="/dir1"
Analysis of the MDS logs uncovered the following scenario: after receiving the reclaim_reply, the client library sets the UUID and then sends MClientSession(request_close). Since the MDS is still in clientreplay, this message is queued in wait_for_active. When the MDS transitions through active_start, the MClientSession(request_close) is retried. While the close is waiting for its _session_logged callback, the MClientSession(request_open) that the library sends on seeing the active state change in handle_mds_map arrives. Because the close has not yet removed the old session, its UUID is still cached in the MDS, so the open fails.
2025-06-18T14:42:54.003+0800 fffe882db8d0 20 MDSRank.cc:1424 mds.0.8708 get_session have 0xfffd00581400 client.14394138 10.61.2.239:0/1921725021 state open
2025-06-18T14:42:54.003+0800 fffe882db8d0 3 Server.cc:548 mds.0.server handle_client_session client_session(request_close) v1 from client.14394138
2025-06-18T14:42:54.003+0800 fffe882db8d0 10 Server.cc:1373 mds.0.server journal_close_session : client.14394138 10.61.2.239:0/1921725021 pending_prealloc_inos [] free_prealloc_inos [] delegated_inos []
2025-06-18T14:42:54.003+0800 fffe882db8d0 20 SessionMap.cc:791 mds.0.sessionmap mark_projected s=0xfffd00581400 name=client.14394138 pv=7272 -> 7273
2025-06-18T14:42:54.003+0800 fffe882db8d0 1 MDSRank.cc:2461 mds.0.8708 cluster recovered.
2025-06-18T14:42:54.003+0800 fffe811fb8d0 5 MDLog.cc:390 mds.0.log _submit_thread 7057848~121 : ESession client.14394138 10.61.2.239:0/1921725021 close cmapv 7273
2025-06-18T14:42:54.003+0800 fffe882db8d0 4 MDSRank.cc:1571 mds.0.8708 set_osd_epoch_barrier: epoch=17207
2025-06-18T14:42:54.003+0800 fffe882db8d0 10 MetricAggregator.cc:349 mds.metric.aggregator notify_mdsmap
2025-06-18T14:42:54.003+0800 fffe882db8d0 3 Server.cc:548 mds.0.server handle_client_session client_session(request_open) v5 from client.14394138
2025-06-18T14:42:54.003+0800 fffe882db8d0 20 Server.cc:678 mds.0.server handle_client_session CEPH_SESSION_REQUEST_OPEN metadata entries:
2025-06-18T14:42:54.003+0800 fffe882db8d0 20 Server.cc:679 mds.0.server features: '0x000000000007ffff'
2025-06-18T14:42:54.003+0800 fffe882db8d0 20 Server.cc:680 mds.0.server metric specification: [{metric_flags: '0x000000000000ffff'}]
2025-06-18T14:42:54.003+0800 fffe882db8d0 20 Server.cc:682 mds.0.server ceph_sha1: no_version
2025-06-18T14:42:54.003+0800 fffe882db8d0 20 Server.cc:682 mds.0.server ceph_version: ceph version Development (no_version) quincy (stable)
2025-06-18T14:42:54.003+0800 fffe882db8d0 20 Server.cc:682 mds.0.server entity_id: nfs.nas-nfs.1
2025-06-18T14:42:54.003+0800 fffe882db8d0 20 Server.cc:682 mds.0.server hostname: ceph239
2025-06-18T14:42:54.003+0800 fffe882db8d0 20 Server.cc:682 mds.0.server pid: 3449
2025-06-18T14:42:54.003+0800 fffe882db8d0 20 Server.cc:682 mds.0.server root: /dir1
2025-06-18T14:42:54.003+0800 fffe882db8d0 20 Server.cc:682 mds.0.server timeout: 300
2025-06-18T14:42:54.003+0800 fffe882db8d0 20 Server.cc:682 mds.0.server uuid: ganesha-ceph239-0001
2025-06-18T14:42:54.003+0800 fffe882db8d0 10 MDSRank.cc:1555 mds.0.8708 send_message_client client.14394138 10.61.2.239:0/1921725021 client_session(reject) v5
2025-06-18T14:42:54.003+0800 fffe882db8d0 2 Server.cc:644 mds.0.server New client session: addr="10.61.2.239:0/1921725021",elapsed=0.000200,throttled=0.000016,status="REJECTED",error="duplicated session uuid",root="/dir1"
2025-06-18T14:42:54.003+0800 fffe882db8d0 0 LogClient.cc:159 log_channel(cluster) log [WRN] : client session with duplicated session uuid 'ganesha-ceph239-0001' denied (client.14394138 10.61.2.239:0/1921725021)
2025-06-18T14:42:54.003+0800 fffe882db8d0 10 SessionMap.cc:665 mds.0.sessionmap add_session s=0xfffb800d0500 name=client.14394497
2025-06-18T14:42:54.003+0800 fffe882db8d0 20 SessionMap.cc:791 mds.0.sessionmap mark_projected s=0xfffb800d0500 name=client.14394497 pv=7274 -> 7275
2025-06-18T14:42:54.003+0800 fffe882db8d0 10 SessionMap.cc:698 mds.0.sessionmap touch_session s=0xfffb800d0500 name=client.14394497
2025-06-18T14:42:54.003+0800 fffe811fb8d0 5 MDLog.cc:390 mds.0.log _submit_thread 7058328~319 : ESession client.14394497 10.61.2.239:0/3856640865 open cmapv 7275
2025-06-18T14:42:54.003+0800 fffe8b33b8d0 5 Beacon.cc:130 mds.beacon.mds.ceph239.wwvgyj received beacon reply up:active seq 83 rtt 0.980009
2025-06-18T14:42:54.003+0800 fffe8221b8d0 10 MDSContext.cc:94 MDSIOContextBase::complete: 20C_MDS_session_finish
2025-06-18T14:42:54.003+0800 fffe8221b8d0 10 MDSContext.cc:28 MDSContext::complete: 20C_MDS_session_finish
2025-06-18T14:42:54.003+0800 fffe8221b8d0 10 Server.cc:847 mds.0.server _session_logged client.14394138 10.61.2.239:0/1921725021 state_seq 3 close 7273 inos_to_free [] inotablev 0 inos_to_purge []
2025-06-18T14:42:54.003+0800 fffe8221b8d0 20 SessionMap.cc:729 mds.0.sessionmap mark_dirty s=0xfffd00581400 name=client.14394138 v=7272
2025-06-18T14:42:54.003+0800 fffe8221b8d0 10 MDSRank.cc:1555 mds.0.8708 send_message_client client.14394138 10.61.2.239:0/1921725021 client_session(close) v5
2025-06-18T14:42:54.003+0800 fffe8221b8d0 10 MetricsHandler.cc:98 remove_session: mds.metrics: session=0xfffd00581400, client=client.14394138 10.61.2.239:0/1921725021
2025-06-18T14:42:54.003+0800 fffe8221b8d0 10 MetricsHandler.cc:114 remove_session: mds.metrics: metric lus=0, last_updated_seq=0
2025-06-18T14:42:54.003+0800 fffe8221b8d0 10 SessionMap.cc:682 mds.0.sessionmap remove_session s=0xfffd00581400 name=client.14394138
2025-06-18T14:42:54.003+0800 fffe8221b8d0 10 MDSContext.cc:94 MDSIOContextBase::complete: 20C_MDS_session_finish
2025-06-18T14:42:54.003+0800 fffe8221b8d0 10 MDSContext.cc:28 MDSContext::complete: 20C_MDS_session_finish
2025-06-18T14:42:54.003+0800 fffe8221b8d0 10 Server.cc:847 mds.0.server _session_logged client.14394498 10.61.2.239:0/2822061139 state_seq 1 open 7274 inos_to_free [] inotablev 0 inos_to_purge []
2025-06-18T14:42:54.003+0800 fffe8221b8d0 20 SessionMap.cc:729 mds.0.sessionmap mark_dirty s=0xfffb800d3200 name=client.14394498 v=7273
Please confirm whether this is helpful for solving the problem. Thanks!!
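The race described above can be modeled with a small, self-contained toy (ToyMds and all other names here are illustrative, not Ceph code): the close's effect on the session map is deferred to a journal callback, so an open that arrives before the callback fires still sees the old uuid and is rejected.

```cpp
#include <functional>
#include <queue>
#include <set>
#include <string>

// Toy model of the race. A close does not remove the session's uuid
// immediately; the removal runs only when the journal entry is "safe"
// (modeled here by an explicit flush of queued callbacks).
struct ToyMds {
  std::set<std::string> uuids;                // uuids of live sessions
  std::queue<std::function<void()>> journal;  // pending journal callbacks

  void handle_close(const std::string& uuid) {
    // removal is deferred until the journal entry is safe, like the
    // real MDS waiting for _session_logged
    journal.push([this, uuid] { uuids.erase(uuid); });
  }

  bool handle_open(const std::string& uuid) {
    if (uuids.count(uuid))
      return false;  // "client session with duplicated session uuid"
    uuids.insert(uuid);
    return true;
  }

  void flush_journal() {
    while (!journal.empty()) {
      journal.front()();
      journal.pop();
    }
  }
};
```

An open that is processed between handle_close() and flush_journal() is rejected; the same open issued after the flush succeeds, which matches the ordering seen in the log excerpt above.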
Updated by Venky Shankar 8 months ago
ymo zhang wrote in #note-8:
hello Venky Shankar
I also encountered this problem during my recent test.[...]
Through the analysis of the MDS logs, a scenario was discovered: after receiving the reclaim_reply, the lib sets the UUID and then sends MClientSession(request_close). Since the MDS status is still in client_replay, this message is placed in wait_for_active. When the MDS recovers to active_start, MClientSession(request_close) performs MsgRetry. While the close message is waiting for the callback _session_logged, at this time, MClientSession(request_open) sent by the lib due to the active state change of handle_mds_map is received. As the previous session information has not been removed by the close yet, the UUID is still cached in the MDS, resulting in failure.
[...]
[...]
[...]!!
Please confirm if it is helpful to solve this problem. Thanks !!
Thanks for the detailed information. I will have a look today.
Updated by Venky Shankar 8 months ago · Edited
ymo zhang wrote in #note-8:
hello Venky Shankar
I also encountered this problem during my recent test.[...]
Through the analysis of the MDS logs, a scenario was discovered: after receiving the reclaim_reply, the lib sets the UUID and then sends MClientSession(request_close). Since the MDS status is still in client_replay, this message is placed in wait_for_active. When the MDS recovers to active_start, MClientSession(request_close) performs MsgRetry. While the close message is waiting for the callback _session_logged, at this time, MClientSession(request_open) sent by the lib due to the active state change of handle_mds_map is received. As the previous session information has not been removed by the close yet, the UUID is still cached in the MDS, resulting in failure.
I think your analysis is accurate. The `_close_sessions()` call in `Client::set_uuid()` will drop `client_lock` waiting for the MDS to send `MClientSession(CEPH_SESSION_CLOSE)`. While handling the MDS map, `Client::connect_mds_targets()` will call `Client::_open_mds_session()`, which in turn sends `MClientSession(CEPH_SESSION_REQUEST_OPEN)` with the updated metadata (uuid).
I am thinking of a fix right now...
Updated by ymo zhang 8 months ago
Thank you for your reply.
My temporary solution for handling MClientSession (CEPH_SESSION_REQUEST_OPEN) is to check the current status of the session.
If it is closing, then:
if (session->is_closing()) {
  mdlog->wait_for_safe(new MDSInternalContextWrapper(mds,
                         new C_MDS_RetryMessage(mds, m)));
  return;
}
Please review whether it is feasible to wait for the processing of MClientSession (CEPH_SESSION_REQUEST_CLOSE) to be completed before handling the CEPH_SESSION_REQUEST_OPEN message. Are there any scenarios that have not been considered?
Thank you very much!
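The proposed fix can be sketched with a similar self-contained toy (again, all names are illustrative, not Ceph code): an open that finds the session still closing is parked until the journal is safe and then retried, instead of being rejected as a duplicate uuid.

```cpp
#include <functional>
#include <queue>
#include <set>
#include <string>

// Toy model of the proposed fix. "closing" tracks sessions whose close
// is in flight; an open for such a uuid is re-queued behind the journal
// (the analogue of mdlog->wait_for_safe(new C_MDS_RetryMessage(...))).
struct ToyMdsFixed {
  std::set<std::string> uuids;                // uuids of live sessions
  std::set<std::string> closing;              // closes still being journaled
  std::queue<std::function<void()>> journal;  // runs when the log is "safe"
  bool last_open_ok = false;

  void handle_close(const std::string& uuid) {
    closing.insert(uuid);
    journal.push([this, uuid] {
      uuids.erase(uuid);    // _session_logged analogue
      closing.erase(uuid);
    });
  }

  void handle_open(const std::string& uuid) {
    if (closing.count(uuid)) {
      // park the open and retry it once the close is safe
      journal.push([this, uuid] { handle_open(uuid); });
      return;
    }
    last_open_ok = !uuids.count(uuid);
    if (last_open_ok)
      uuids.insert(uuid);
  }

  void flush_journal() {
    while (!journal.empty()) {
      auto cb = std::move(journal.front());
      journal.pop();
      cb();
    }
  }
};
```

With this ordering the racing open no longer fails: it is replayed after the close's journal callback has removed the old session.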
Updated by Venky Shankar 8 months ago
ymo zhang wrote in #note-12:
Thank you for your reply.
My temporary solution for handling MClientSession (CEPH_SESSION_REQUEST_OPEN) is to check the current status of the session.
If it is closing, then:
if (session->is_closing()) {
  mdlog->wait_for_safe(new MDSInternalContextWrapper(mds,
                         new C_MDS_RetryMessage(mds, m)));
  return;
}
Please review whether it is feasible to wait for the processing of MClientSession (CEPH_SESSION_REQUEST_CLOSE) to be completed before handling the CEPH_SESSION_REQUEST_OPEN message. Are there any scenarios that have not been considered?
Thank you very much!
I will check this out this week and update. Thanks for the proposed fix.
Updated by Venky Shankar 7 months ago
Venky Shankar wrote in #note-13:
ymo zhang wrote in #note-12:
Thank you for your reply.
My temporary solution for handling MClientSession (CEPH_SESSION_REQUEST_OPEN) is to check the current status of the session.
If it is closing, then:
if (session->is_closing()) {
  mdlog->wait_for_safe(new MDSInternalContextWrapper(mds,
                         new C_MDS_RetryMessage(mds, m)));
  return;
}
Please review whether it is feasible to wait for the processing of MClientSession (CEPH_SESSION_REQUEST_CLOSE) to be completed before handling the CEPH_SESSION_REQUEST_OPEN message. Are there any scenarios that have not been considered?
Thank you very much!
I will check this out this week and update. Thanks for the proposed fix.
Sorry for the delay. I will be working on this today.
Updated by Venky Shankar 7 months ago
ymo zhang wrote in #note-12:
Thank you for your reply.
My temporary solution for handling MClientSession (CEPH_SESSION_REQUEST_OPEN) is to check the current status of the session.
If it is closing, then:
if (session->is_closing()) {
  mdlog->wait_for_safe(new MDSInternalContextWrapper(mds,
                         new C_MDS_RetryMessage(mds, m)));
  return;
}
Please review whether it is feasible to wait for the processing of MClientSession (CEPH_SESSION_REQUEST_CLOSE) to be completed before handling the CEPH_SESSION_REQUEST_OPEN message. Are there any scenarios that have not been considered?
Thank you very much!
I am thinking if the fix needs to be in the user-space client, since this essentially is a race b/w session close and open messages (sent by and to the MDS).
Updated by Venky Shankar 7 months ago · Edited
Venky Shankar wrote in #note-15:
ymo zhang wrote in #note-12:
Thank you for your reply.
My temporary solution for handling MClientSession (CEPH_SESSION_REQUEST_OPEN) is to check the current status of the session.
If it is closing, then:
if (session->is_closing()) {
  mdlog->wait_for_safe(new MDSInternalContextWrapper(mds,
                         new C_MDS_RetryMessage(mds, m)));
  return;
}
Please review whether it is feasible to wait for the processing of MClientSession (CEPH_SESSION_REQUEST_CLOSE) to be completed before handling the CEPH_SESSION_REQUEST_OPEN message. Are there any scenarios that have not been considered?
Thank you very much!
I am thinking if the fix needs to be in the user-space client, since this is essentially a race b/w session close and open messages (sent by and to the MDS).
Hi @zhang6731 - I think your proposed solution to requeue the request on the MDS is a bit better than doing it on the client side. If done on the client side when handling the MDS map update, we would block the messenger thread even though client_lock is dropped. I will push a PR with your proposed fix and credit you.
@zhang6731 ^^
Updated by Venky Shankar 6 months ago
Venky Shankar wrote in #note-18:
I am pushing a fix today for this.
Sorry. I wasn't able to get this done in the time I mentioned above.
It is in my list for tomorrow.
Updated by Venky Shankar 4 months ago
- Status changed from In Progress to Pending Backport
- Pull request ID set to 66180
This is a bit hard to reproduce and would possibly require introducing synthetic delays in the MDS to trigger the racing message.
Updated by Venky Shankar 4 months ago
- Priority changed from High to Normal
- Target version set to v21.0.0
- Source set to Development
- Backport changed from reef,squid to tentacle,squid
Updated by Upkeep Bot 4 months ago
- Copied to Backport #73767: tentacle: mds: session reclaim could miss blocklisting an old session added
Updated by Upkeep Bot 4 months ago
- Copied to Backport #73768: squid: mds: session reclaim could miss blocklisting an old session added
Updated by Venky Shankar 4 months ago
- Status changed from Pending Backport to Fix Under Review
- Tags (freeform) deleted (backport_processed)
Ugh. Bad status change.
Updated by Venky Shankar 4 months ago
- Status changed from Fix Under Review to Pending Backport
Updated by Upkeep Bot 4 months ago
- Status changed from Pending Backport to Resolved
- Merge Commit set to 684e8c00d12037fb0af402f746487bf343504b82
- Fixed In set to v20.3.0-4461-g684e8c00d1
- Upkeep Timestamp set to 2025-12-05T10:39:34+00:00