Bug #62098

closed

long latency of repop delivering

Added by Xuehan Xu over 2 years ago. Updated 9 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Tags (freeform):
Fixed In:
v18.0.0-5149-g86b36a132df
Released In:
v19.2.0~1936
Upkeep Timestamp:
2025-07-12T04:44:16+00:00

Description

Recently, we observed the following phenomenon: the latency between sending and receiving "repop" messages between two single-core crimson-osds can reach 40+ ms even when both OSDs are running on the same machine, whereas "osd_ping" messages don't have this issue.

To facilitate our analysis, we changed the level of a few messenger logs from debug to info; here's an example:

osd.2 sent out two repops to osd.0 and osd.1 at 14:57:39,570:

INFO  2023-07-20 14:57:39,570 [shard 0] ms - [0x600000dd09a0 osd.2(cluster) v2:10.57.37.119:6810/2287946471 >> osd.0 v2:10.57.37.119:6801/541234587@63925] do_send() got message -- osd_repop(client.4151.0:100 1.1 e14/10 1:9f31207a:::rbd_object_map.1035add9be74:head v 14'75, rollback_to(legacy)=14'75) v2
INFO  2023-07-20 14:57:39,570 [shard 0] ms - [0x600000dd09a0 osd.2(cluster) v2:10.57.37.119:6810/2287946471 >> osd.0 v2:10.57.37.119:6801/541234587@63925] --> #144 === osd_repop(client.4151.0:100 1.1  14/10 1:9f31207a:::rbd_object_map.1035add9be74:head v 14'75, mlcod=14'75) v3 (112)
INFO  2023-07-20 14:57:39,571 [shard 0] osd - send_to_osd: sending osd_repop(client.4151.0:100 1.1 e14/10 1:9f31207a:::rbd_object_map.1035add9be74:head v 14'75, rollback_to(legacy)=14'75) v2, to osd.1
INFO  2023-07-20 14:57:39,571 [shard 0] ms - [0x600000dd08c0 osd.2(cluster) v2:10.57.37.119:6810/2287946471 >> osd.1 v2:10.57.37.119:6805/3644417809@58394] do_send() got message -- osd_repop(client.4151.0:100 1.1 e14/10 1:9f31207a:::rbd_object_map.1035add9be74:head v 14'75, rollback_to(legacy)=14'75) v2
INFO  2023-07-20 14:57:39,571 [shard 0] ms - [0x600000dd08c0 osd.2(cluster) v2:10.57.37.119:6810/2287946471 >> osd.1 v2:10.57.37.119:6805/3644417809@58394] --> #216 === osd_repop(client.4151.0:100 1.1 e14/10 1:9f31207a:::rbd_object_map.1035add9be74:head v 14'75, mlcod=14'75) v3 (112)

osd.1 received the repop message at 14:57:39,613 and sent out the repop reply at 14:57:39,614:

INFO  2023-07-20 14:57:39,613 [shard 0] ms - [0x600000dd09a0 osd.1(cluster) v2:10.57.37.119:6805/3644417809@58394 >> osd.2 v2:10.57.37.119:6810/2287946471] <== #216,222 === osd_repop(client.4151.0:100 1.1 e14/10) v3 (112)
......
INFO  2023-07-20 14:57:39,614 [shard 0] ms - [0x600000dd09a0 osd.1(cluster) v2:10.57.37.119:6805/3644417809@58394 >> osd.2 v2:10.57.37.119:6810/2287946471] do_send() got message -- osd_repop_reply(client.4151.0:100 1.1 e14/10 ondisk, result = 0) v2
INFO  2023-07-20 14:57:39,614 [shard 0] ms - [0x600000dd09a0 osd.1(cluster) v2:10.57.37.119:6805/3644417809@58394 >> osd.2 v2:10.57.37.119:6810/2287946471] --> #223 === osd_repop_reply(client.4151.0:100 1.1 e14/10 ondisk, result = 0) v2 (113)

osd.0 received the repop message at 14:57:39,571 and sent out the repop reply at 14:57:39,573:

INFO  2023-07-20 14:57:39,571 [shard 0] ms - [0x600000dd18c0 osd.0(cluster) v2:10.57.37.119:6801/541234587@63925 >> osd.2 v2:10.57.37.119:6810/2287946471] <== #144,142 === osd_repop(client.4151.0:100 1.1 e14/10) v3 (112)
......
INFO  2023-07-20 14:57:39,573 [shard 0] ms - [0x600000dd18c0 osd.0(cluster) v2:10.57.37.119:6801/541234587@63925 >> osd.2 v2:10.57.37.119:6810/2287946471] do_send() got message -- osd_repop_reply(client.4151.0:100 1.1 e14/10 ondisk, result = 0) v2
INFO  2023-07-20 14:57:39,573 [shard 0] ms - [0x600000dd18c0 osd.0(cluster) v2:10.57.37.119:6801/541234587@63925 >> osd.2 v2:10.57.37.119:6810/2287946471] --> #143 === osd_repop_reply(client.4151.0:100 1.1 e14/10 ondisk, result = 0) v2 (113)

osd.2 received the repop replies from osd.0 and osd.1 at 14:57:39,617 and 14:57:39,657 respectively:

INFO  2023-07-20 14:57:39,617 [shard 0] ms - [0x600000dd09a0 osd.2(cluster) v2:10.57.37.119:6810/2287946471 >> osd.0 v2:10.57.37.119:6801/541234587@63925] <== #143,144 === osd_repop_reply(client.4151.0:100 1.1 e14/10) v2 (113)
INFO  2023-07-20 14:57:39,657 [shard 0] ms - [0x600000dd08c0 osd.2(cluster) v2:10.57.37.119:6810/2287946471 >> osd.1 v2:10.57.37.119:6805/3644417809@58394] <== #223,216 === osd_repop_reply(client.4151.0:100 1.1 e14/10) v2 (113)

As can be seen, the latencies between sending and receiving the "repop" from osd.2 to osd.1 and the repop replies from osd.0/osd.1 to osd.2 are all more than 40 ms.
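For reference, the per-hop latencies can be recomputed directly from the timestamps quoted above (a minimal sketch; the timestamp strings are copied from the log excerpts, and the comma is the milliseconds separator used by this log format):

```python
from datetime import datetime

# Log timestamp format used in the excerpts above ("," separates milliseconds).
FMT = "%Y-%m-%d %H:%M:%S,%f"

def ms_between(sent: str, received: str) -> float:
    """Delivery latency in milliseconds between two log timestamps."""
    t0 = datetime.strptime(sent, FMT)
    t1 = datetime.strptime(received, FMT)
    return (t1 - t0).total_seconds() * 1000

# (sent, received) pairs taken from the log lines above.
hops = {
    "repop       osd.2 -> osd.1": ("2023-07-20 14:57:39,571", "2023-07-20 14:57:39,613"),
    "repop_reply osd.1 -> osd.2": ("2023-07-20 14:57:39,614", "2023-07-20 14:57:39,657"),
    "repop_reply osd.0 -> osd.2": ("2023-07-20 14:57:39,573", "2023-07-20 14:57:39,617"),
}

for hop, (sent, received) in hops.items():
    print(f"{hop}: {ms_between(sent, received):.0f} ms")
```

This yields 42 ms, 43 ms, and 44 ms for the three slow hops, while the repop from osd.2 to osd.0 (sent 14:57:39,571, received 14:57:39,571) was delivered within the same millisecond.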

The repop:

client_request(id=296, detail=m=[osd_op(client.4151.0:100 1.1 1:9f31207a:::rbd_object_map.1035add9be74:head {call lock.assert_locked in=44b, call rbd.object_map_update in=38b} snapc 0={} ondisk+write+known_if_redirected+supports_pool_eio e14) v8])

BTW, the CPU utilization during the test was very low, only about 5%.

Actions #1

Updated by Xuehan Xu over 2 years ago

  • Subject changed from long latency repop delivering to long latency of repop delivering
Actions #2

Updated by Xuehan Xu over 2 years ago

  • Description updated (diff)
Actions #3

Updated by Xuehan Xu over 2 years ago

  • Description updated (diff)
Actions #4

Updated by Xuehan Xu over 2 years ago

  • Description updated (diff)
Actions #5

Updated by Xuehan Xu over 2 years ago

  • Pull request ID set to 52592
Actions #6

Updated by Matan Breizman about 2 years ago

  • Status changed from New to Resolved
Actions #7

Updated by Upkeep Bot 9 months ago

  • Merge Commit set to 86b36a132dfee795ee579842ed1b5fa7b3a1eea7
  • Fixed In set to v18.0.0-5149-g86b36a132df
  • Released In set to v19.2.0~1936
  • Upkeep Timestamp set to 2025-07-12T04:44:16+00:00