Bug #62098
long latency of repop delivering
Status: Closed
Description
Recently, we observed the following phenomenon: the latency between sending and receiving "repop" messages between two single-core crimson-osds can reach 40+ ms even when both OSDs are running on the same machine, while, on the contrary, "osd_ping" messages don't have this issue.
To facilitate our analysis, we changed the level of a few messenger log lines from debug to info; here's an example:
osd.2 sent out two repops to osd.0 and osd.1 at 14:57:39,570:
INFO 2023-07-20 14:57:39,570 [shard 0] ms - [0x600000dd09a0 osd.2(cluster) v2:10.57.37.119:6810/2287946471 >> osd.0 v2:10.57.37.119:6801/541234587@63925] do_send() got message -- osd_repop(client.4151.0:100 1.1 e14/10 1:9f31207a:::rbd_object_map.1035add9be74:head v 14'75, rollback_to(legacy)=14'75) v2
INFO 2023-07-20 14:57:39,570 [shard 0] ms - [0x600000dd09a0 osd.2(cluster) v2:10.57.37.119:6810/2287946471 >> osd.0 v2:10.57.37.119:6801/541234587@63925] --> #144 === osd_repop(client.4151.0:100 1.1 14/10 1:9f31207a:::rbd_object_map.1035add9be74:head v 14'75, mlcod=14'75) v3 (112)
INFO 2023-07-20 14:57:39,571 [shard 0] osd - send_to_osd: sending osd_repop(client.4151.0:100 1.1 e14/10 1:9f31207a:::rbd_object_map.1035add9be74:head v 14'75, rollback_to(legacy)=14'75) v2, to osd.1
INFO 2023-07-20 14:57:39,571 [shard 0] ms - [0x600000dd08c0 osd.2(cluster) v2:10.57.37.119:6810/2287946471 >> osd.1 v2:10.57.37.119:6805/3644417809@58394] do_send() got message -- osd_repop(client.4151.0:100 1.1 e14/10 1:9f31207a:::rbd_object_map.1035add9be74:head v 14'75, rollback_to(legacy)=14'75) v2
INFO 2023-07-20 14:57:39,571 [shard 0] ms - [0x600000dd08c0 osd.2(cluster) v2:10.57.37.119:6810/2287946471 >> osd.1 v2:10.57.37.119:6805/3644417809@58394] --> #216 === osd_repop(client.4151.0:100 1.1 e14/10 1:9f31207a:::rbd_object_map.1035add9be74:head v 14'75, mlcod=14'75) v3 (112)
osd.1 received the repop message at 14:57:39,613 and sent out the repop reply at 14:57:39,614
INFO 2023-07-20 14:57:39,613 [shard 0] ms - [0x600000dd09a0 osd.1(cluster) v2:10.57.37.119:6805/3644417809@58394 >> osd.2 v2:10.57.37.119:6810/2287946471] <== #216,222 === osd_repop(client.4151.0:100 1.1 e14/10) v3 (112)
......
INFO 2023-07-20 14:57:39,614 [shard 0] ms - [0x600000dd09a0 osd.1(cluster) v2:10.57.37.119:6805/3644417809@58394 >> osd.2 v2:10.57.37.119:6810/2287946471] do_send() got message -- osd_repop_reply(client.4151.0:100 1.1 e14/10 ondisk, result = 0) v2
INFO 2023-07-20 14:57:39,614 [shard 0] ms - [0x600000dd09a0 osd.1(cluster) v2:10.57.37.119:6805/3644417809@58394 >> osd.2 v2:10.57.37.119:6810/2287946471] --> #223 === osd_repop_reply(client.4151.0:100 1.1 e14/10 ondisk, result = 0) v2 (113)
osd.0 received the repop message at 14:57:39,571 and sent out the repop reply at 14:57:39,573
INFO 2023-07-20 14:57:39,571 [shard 0] ms - [0x600000dd18c0 osd.0(cluster) v2:10.57.37.119:6801/541234587@63925 >> osd.2 v2:10.57.37.119:6810/2287946471] <== #144,142 === osd_repop(client.4151.0:100 1.1 e14/10) v3 (112)
......
INFO 2023-07-20 14:57:39,573 [shard 0] ms - [0x600000dd18c0 osd.0(cluster) v2:10.57.37.119:6801/541234587@63925 >> osd.2 v2:10.57.37.119:6810/2287946471] do_send() got message -- osd_repop_reply(client.4151.0:100 1.1 e14/10 ondisk, result = 0) v2
INFO 2023-07-20 14:57:39,573 [shard 0] ms - [0x600000dd18c0 osd.0(cluster) v2:10.57.37.119:6801/541234587@63925 >> osd.2 v2:10.57.37.119:6810/2287946471] --> #143 === osd_repop_reply(client.4151.0:100 1.1 e14/10 ondisk, result = 0) v2 (113)
osd.2 received the repop replies from osd.0 and osd.1 at 14:57:39,617 and 14:57:39,657 respectively
INFO 2023-07-20 14:57:39,617 [shard 0] ms - [0x600000dd09a0 osd.2(cluster) v2:10.57.37.119:6810/2287946471 >> osd.0 v2:10.57.37.119:6801/541234587@63925] <== #143,144 === osd_repop_reply(client.4151.0:100 1.1 e14/10) v2 (113)
INFO 2023-07-20 14:57:39,657 [shard 0] ms - [0x600000dd08c0 osd.2(cluster) v2:10.57.37.119:6810/2287946471 >> osd.1 v2:10.57.37.119:6805/3644417809@58394] <== #223,216 === osd_repop_reply(client.4151.0:100 1.1 e14/10) v2 (113)
As can be seen, the latencies between sending and receiving of the "repop" from osd.2 to osd.1, and of the repop replies from osd.0/osd.1 back to osd.2, all exceed 40 ms.
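For reference, the per-hop latencies can be read straight off the log timestamps quoted above; a minimal sketch (the helper name is made up for illustration, the timestamps are copied from the excerpts):

```python
from datetime import datetime

def ms_between(sent: str, received: str) -> float:
    """Millisecond delta between two messenger log timestamps."""
    fmt = "%Y-%m-%d %H:%M:%S,%f"  # matches the "14:57:39,571" comma-separated millisecond style
    delta = datetime.strptime(received, fmt) - datetime.strptime(sent, fmt)
    return delta.total_seconds() * 1000

# repop osd.2 -> osd.1: sent 39,571, received 39,613 -> ~42 ms
print(ms_between("2023-07-20 14:57:39,571", "2023-07-20 14:57:39,613"))
# repop_reply osd.0 -> osd.2: sent 39,573, received 39,617 -> ~44 ms
print(ms_between("2023-07-20 14:57:39,573", "2023-07-20 14:57:39,617"))
# repop_reply osd.1 -> osd.2: sent 39,614, received 39,657 -> ~43 ms
print(ms_between("2023-07-20 14:57:39,614", "2023-07-20 14:57:39,657"))
```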
The client request behind the repop:
client_request(id=296, detail=m=[osd_op(client.4151.0:100 1.1 1:9f31207a:::rbd_object_map.1035add9be74:head {call lock.assert_locked in=44b, call rbd.object_map_update in=38b} snapc 0={} ondisk+write+known_if_redirected+supports_pool_eio e14) v8])
BTW, the CPU utilization during the test was very low, only about 5%.
Updated by Xuehan Xu over 2 years ago
- Subject changed from long latency repop delivering to long latency of repop delivering
Updated by Upkeep Bot 9 months ago
- Merge Commit set to 86b36a132dfee795ee579842ed1b5fa7b3a1eea7
- Fixed In set to v18.0.0-5149-g86b36a132df
- Released In set to v19.2.0~1936
- Upkeep Timestamp set to 2025-07-12T04:44:16+00:00