Bug #65309
openqa: dbench.sh failed with "ERROR: handle 10318 was not found"
Description
Link to the job - https://pulpito.ceph.com/rishabh-2024-03-27_05:27:11-fs-wip-rishabh-testing-20240326.131558-testing-default-smithi/7625621
On the surface this failure looks similar to https://tracker.ceph.com/issues/57656, but it is not the same issue, because the failed job for this ticket doesn't contain the following lines -
2022-09-22T14:04:02.650 INFO:tasks.workunit.client.0.smithi124.stdout:[1415095] write failed on handle 10009 (Resource temporarily unavailable)
2022-09-22T14:04:02.650 INFO:tasks.workunit.client.0.smithi124.stdout:Child failed with status 1
Instead it contains the following lines -
2024-03-27T07:16:18.287 DEBUG:teuthology.orchestra.run.smithi033:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph osd pool get cephfs_metadata pg_num
2024-03-27T07:16:18.431 INFO:tasks.workunit.client.0.smithi033.stdout:[2060] open ./clients/client0/~dmtmp/PARADOX/__50172.DB failed for handle 10318 (Transport endpoint is not connected)
2024-03-27T07:16:18.431 INFO:tasks.workunit.client.0.smithi033.stdout:(2062) ERROR: handle 10318 was not found
2024-03-27T07:16:18.431 INFO:tasks.workunit.client.0.smithi033.stdout:Child failed with status 1
2024-03-27T07:16:18.435 DEBUG:teuthology.orchestra.run:got remote process result: 1
2024-03-27T07:16:18.435 INFO:tasks.workunit:Stopping ['suites/dbench.sh'] on client.0...
The following entries were found in /a/rishabh-2024-03-27_05:27:11-fs-wip-rishabh-testing-20240326.131558-testing-default-smithi/7625621/remote/smithi033/log/ceph-client.0.44701.log.gz -
169792:2024-03-27T07:15:16.165+0000 143de640 10 client.4605 _lookup concluded ENOENT locally for 0x1000000021f.head(faked_ino=0 nref=5 ll_ref=187 cap_refs={} open={} mode=40700 size=0/0 nlink=1 btime=2024-03-27T07:13:12.869114+0000 mtime=2024-03-27T07:15:13.929586+0000 ctime=2024-03-27T07:15:13.929586+0000 change_attr=27 caps=pAsLsXsFsx(0=pAsLsXsFsx) COMPLETE parents=0x10000000004.head["PARADOX"] 0x15d9f720) dn 'ERRORCHG.FAM'
169853:2024-03-27T07:15:16.197+0000 13bdd640 10 client.4605 _lookup concluded ENOENT locally for 0x1000000021f.head(faked_ino=0 nref=6 ll_ref=188 cap_refs={} open={} mode=40700 size=0/0 nlink=1 btime=2024-03-27T07:13:12.869114+0000 mtime=2024-03-27T07:15:13.929586+0000 ctime=2024-03-27T07:15:13.929586+0000 change_attr=27 caps=pAsLsXsFsx(0=pAsLsXsFsx) COMPLETE parents=0x10000000004.head["PARADOX"] 0x15d9f720) dn 'ERRORCHG.DB'
There are more entries in the same log for ERRORCHG.DB and ERRORCHG.FAM.
Updated by Venky Shankar almost 2 years ago
- Category set to Correctness/Safety
- Assignee set to Xiubo Li
- Target version set to v20.0.0
- Backport set to quincy,reef,squid
Updated by Xiubo Li almost 2 years ago · Edited
The ceph-fuse mountpoint was unmounted while the dbench test was still running:
2024-03-27T07:16:09.559 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.mds.c is failed for ~235s
2024-03-27T07:16:09.559 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.mds.b is failed for ~296s
2024-03-27T07:16:10.466 INFO:tasks.workunit.client.0.smithi033.stdout: 1 1997 0.04 MB/sec execute 80 sec latency 681.539 ms
2024-03-27T07:16:11.467 INFO:tasks.workunit.client.0.smithi033.stdout: 1 2018 0.04 MB/sec execute 81 sec latency 937.546 ms
2024-03-27T07:16:12.355 DEBUG:teuthology.orchestra.run.smithi033:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2024-03-27T07:16:12.364 DEBUG:teuthology.orchestra.run.smithi130:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2024-03-27T07:16:12.466 INFO:tasks.workunit.client.0.smithi033.stdout: 1 2034 0.04 MB/sec execute 82 sec latency 214.462 ms
2024-03-27T07:16:13.466 INFO:tasks.workunit.client.0.smithi033.stdout: 1 2042 0.04 MB/sec execute 83 sec latency 789.478 ms
2024-03-27T07:16:14.467 INFO:tasks.workunit.client.0.smithi033.stdout: 1 2043 0.04 MB/sec execute 84 sec latency 882.551 ms
2024-03-27T07:16:15.467 INFO:tasks.workunit.client.0.smithi033.stdout: 1 2051 0.04 MB/sec execute 85 sec latency 932.632 ms
2024-03-27T07:16:16.271 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.mds.c is failed for ~242s
2024-03-27T07:16:16.271 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.mds.b is failed for ~303s
2024-03-27T07:16:16.271 INFO:tasks.daemonwatchdog.daemon_watchdog:BARK! unmounting mounts and killing all daemons
2024-03-27T07:16:16.272 DEBUG:teuthology.orchestra.run.smithi033:> set -ex
This happened because the watchdog barked on the mds.b and mds.c daemons, and then killed all the daemons and unmounted the ceph-fuse client at the same time.
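The watchdog behaviour visible in the log above can be sketched roughly as follows. This is a minimal illustrative model, not teuthology's actual DaemonWatchdog code; the function names and the 120s grace value are assumptions for the sketch:

```python
import time

WATCHDOG_GRACE = 120  # seconds a daemon may stay failed before the bark (assumed value)

def check_daemons(failed_since, now=None):
    """Return True (bark) if any daemon has been failed past the grace period.

    failed_since maps daemon name -> timestamp when it entered the failed state.
    """
    now = time.time() if now is None else now
    bark = False
    for name, since in failed_since.items():
        failed_for = now - since
        # Mirrors the "daemon ceph.mds.X is failed for ~Ns" lines above.
        print("daemon %s is failed for ~%ds" % (name, failed_for))
        if failed_for > WATCHDOG_GRACE:
            bark = True
    if bark:
        # Once any daemon exceeds the grace period, everything is torn down,
        # which is what unmounted the ceph-fuse client under dbench.
        print("BARK! unmounting mounts and killing all daemons")
    return bark
```

The key point for this ticket is that the bark tears down the client mount as collateral damage, producing the "Transport endpoint is not connected" errors in dbench.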
For example, mds.b sent the beacon with seq 9 but never got a reply, and then sent a new beacon with seq 10:
-1197> 2024-03-27T07:08:13.574+0000 ddf7640 10 monclient: _check_auth_tickets
-1196> 2024-03-27T07:08:13.574+0000 ddf7640 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2024-03-27T07:07:43.575530+0000)
-1195> 2024-03-27T07:08:13.627+0000 edf9640 5 mds.beacon.b Sending beacon up:active seq 8
-1194> 2024-03-27T07:08:13.627+0000 edf9640 10 monclient: _send_mon_message to mon.b at v2:172.21.15.130:3300/0
-1193> 2024-03-27T07:08:13.627+0000 edf9640 1 -- [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] --> [v2:172.21.15.130:3300/0,v1:172.21.15.130:6789/0] -- mdsbeacon(4285/b up:active seq=8 v10) -- 0x166a94c0 con 0x8c3ed40
-1192> 2024-03-27T07:08:13.628+0000 edf9640 20 mds.beacon.b sender thread waiting interval 4s
-1191> 2024-03-27T07:08:13.651+0000 9def640 1 -- [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] <== mon.1 v2:172.21.15.130:3300/0 60 ==== mdsbeacon(4285/b up:active seq=8 v10) ==== 130+0+0 (secure 0 0 0) 0x145853f0 con 0x8c3ed40
-1190> 2024-03-27T07:08:13.651+0000 9def640 5 mds.beacon.b received beacon reply up:active seq 8 rtt 0.0239996
-1189> 2024-03-27T07:08:13.653+0000 edf9640 20 mds.beacon.b sender thread waiting interval 3.974s
-1188> 2024-03-27T07:08:13.843+0000 111fd640 2 mds.0.cache Memory usage: total 618156, rss 326256, heap 577588, baseline 498920, 0 / 13 inodes have caps, 0 caps, 0 caps per inode
-1187> 2024-03-27T07:08:13.843+0000 111fd640 20 mds.0.cache upkeep thread trimming cache; last trim 0.999981821s ago
...
-861> 2024-03-27T07:08:57.499+0000 edf9640 5 mds.beacon.b Sending beacon up:active seq 9
-860> 2024-03-27T07:08:57.499+0000 edf9640 10 monclient: _send_mon_message to mon.b at v2:172.21.15.130:3300/0
-859> 2024-03-27T07:08:57.499+0000 edf9640 1 -- [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] --> [v2:172.21.15.130:3300/0,v1:172.21.15.130:6789/0] -- mdsbeacon(4285/b up:active seq=9 v12) -- 0xf794310 con 0x8c3ed40
-858> 2024-03-27T07:08:57.499+0000 edf9640 20 mds.beacon.b sender thread waiting interval 4s
-857> 2024-03-27T07:08:57.513+0000 15605640 10 mds.metric.aggregator ping_all_active_ranks: pinging 2 active mds(s)
-856> 2024-03-27T07:08:57.513+0000 15605640 20 mds.metric.aggregator ping_all_active_ranks: pinging rank=0 addr=[v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123]
-855> 2024-03-27T07:08:57.513+0000 15605640 10 mds.pinger send_ping: rank=0
-854> 2024-03-27T07:08:57.513+0000 15605640 10 mds.pinger send_ping: sending ping with sequence=4 to rank=0
-853> 2024-03-27T07:08:57.514+0000 15605640 1 -- [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] send_to--> mds [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] -- mdsping -- ?+0 0x142f6cc0
-852> 2024-03-27T07:08:57.514+0000 15605640 1 -- [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] --> [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] -- mdsping -- 0x142f6cc0 con 0x8c1d2a0
-851> 2024-03-27T07:08:57.514+0000 15605640 20 mds.metric.aggregator ping_all_active_ranks: pinging rank=1 addr=[v2:172.21.15.33:6836/1439810540,v1:172.21.15.33:6837/1439810540]
-850> 2024-03-27T07:08:57.514+0000 15605640 10 mds.pinger send_ping: rank=1
...
-124> 2024-03-27T07:11:07.737+0000 edf9640 5 mds.beacon.b Sending beacon up:active seq 10
-123> 2024-03-27T07:11:07.737+0000 edf9640 10 monclient: _send_mon_message to mon.b at v2:172.21.15.130:3300/0
-122> 2024-03-27T07:11:07.737+0000 edf9640 1 -- [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] --> [v2:172.21.15.130:3300/0,v1:172.21.15.130:6789/0] -- mdsbeacon(4285/b up:active seq=10 v12) -- 0xf7f94f0 con 0x8d2f4d0
-121> 2024-03-27T07:11:07.737+0000 edf9640 20 mds.beacon.b sender thread waiting interval 4s
-120> 2024-03-27T07:11:07.737+0000 cdf5640 1 mds.b Updating MDS map to version 13 from mon.1
-119> 2024-03-27T07:11:07.739+0000 cdf5640 10 mds.b my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2,11=minor log segments,12=quiesce subvolumes}
-118> 2024-03-27T07:11:07.739+0000 cdf5640 10 mds.b mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2,11=minor log segments,12=quiesce subvolumes}
-117> 2024-03-27T07:11:07.739+0000 cdf5640 10 mds.b my gid is 4285
-116> 2024-03-27T07:11:07.739+0000 cdf5640 10 mds.b map says I am mds.0.4 state up:active
-115> 2024-03-27T07:11:07.739+0000 cdf5640 10 mds.b msgr says I am [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123]
-114> 2024-03-27T07:11:07.739+0000 cdf5640 10 mds.b handle_mds_map: handling map as rank 0
-113> 2024-03-27T07:11:07.741+0000 cdf5640 4 mds.0.4 set_osd_epoch_barrier: epoch=24
-112> 2024-03-27T07:11:07.741+0000 cdf5640 10 mds.metric.aggregator notify_mdsmap
...
But by the time the beacon with seq 10 was sent, the MON had already replaced mds.b.
Meanwhile, the mon.b logs show that mds.b's beacon seq 9 was received only around 2 minutes later. The beacon message must have been stuck somewhere for some reason:
2024-03-27T07:11:06.705+0000 ffa6640 1 -- [v2:172.21.15.130:3300/0,v1:172.21.15.130:6789/0] <== mds.? v2:172.21.15.130:6832/1011073123 54 ==== mdsbeacon(4285/b up:active seq=9 v12) ==== 512+0+0 (secure 0 0 0) 0x7d55a80 con 0x14c5dfa0
2024-03-27T07:11:06.708+0000 ffa6640 20 mon.b@1(peon) e1 _ms_dispatch existing session 0x14cc7170 for mds.?
2024-03-27T07:11:06.708+0000 ffa6640 20 mon.b@1(peon) e1 entity_name mds.b global_id 4285 (new_ok) caps allow *
2024-03-27T07:11:06.708+0000 ffa6640 10 mon.b@1(peon).paxosservice(mdsmap 1..21) dispatch 0x7d55a80 mdsbeacon(4285/b up:active seq=9 v12) from mds.? v2:172.21.15.130:6832/1011073123 con 0x14c5dfa0
2024-03-27T07:11:06.708+0000 ffa6640 10 mon.b@1(peon).paxosservice(mdsmap 1..21) discarding message from disconnected client mds.? v2:172.21.15.130:6832/1011073123 mdsbeacon(4285/b up:active seq=9 v12)
2024-03-27T07:11:06.708+0000 ffa6640 10 mon.b@1(peon) e1 ms_handle_reset 0x14c5dfa0 v2:172.21.15.130:6832/1011073123
But the MON only allows 60s without a beacon; after that it assumes mds.b is dead and replaces it:
2024-03-27T07:09:17.648+0000 ffa6640 4 mon.a@0(leader).mgr e5 beacon from 4106
2024-03-27T07:09:17.648+0000 ffa6640 10 mon.a@0(leader).mgr e5 from existing standby 4106
2024-03-27T07:09:17.648+0000 ffa6640 10 mon.a@0(leader).mgr e5 no change
2024-03-27T07:09:17.658+0000 d7a1640 11 mon.a@0(leader) e1 tick
2024-03-27T07:09:17.661+0000 d7a1640 1 mon.a@0(leader).mds e14 no beacon from mds.0.4 (gid: 4285 addr: [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] state: up:active) since 64.0174
2024-03-27T07:09:17.668+0000 d7a1640 1 mon.a@0(leader).mds e14 marking 4285 [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] mds.0.4 up:active laggy
2024-03-27T07:09:17.689+0000 d7a1640 1 mon.a@0(leader).mds e14 replacing 4285 [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] mds.0.4 up:active with 4221/e [v2:172.21.15.33:6834/3132869688,v1:172.21.15.33:6835/3132869688]
2024-03-27T07:09:17.695+0000 d7a1640 0 log_channel(cluster) log [WRN] : Replacing daemon mds.b as rank 0 with standby daemon mds.e
2024-03-27T07:09:17.696+0000 d7a1640 1 -- [v2:172.21.15.33:3300/0,v1:172.21.15.33:6789/0] --> [v2:172.21.15.33:3300/0,v1:172.21.15.33:6789/0] -- log(1 entries from seq 330 at 2024-03-27T07:09:17.696139+0000) -- 0x67fe000 con 0xb230840
2024-03-27T07:09:17.700+0000 d7a1640 1 mon.a@0(leader).mds e14 fail_mds_gid 4285 mds.b role 0
2024-03-27T07:09:17.707+0000 d7a1640 10 mon.a@0(leader).osd e26 blocklist [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] until 2024-03-28T07:09:17.705494+0000
2024-03-27T07:09:17.735+0000 d7a1640 0 log_channel(cluster) log [INF] : MDS daemon mds.b is removed because it is dead or otherwise unavailable.
2024-03-27T07:09:17.736+0000 d7a1640 1 -- [v2:172.21.15.33:3300/0,v1:172.21.15.33:6789/0] --> [v2:172.21.15.33:3300/0,v1:172.21.15.33:6789/0] -- log(1 entries from seq 331 at 2024-03-27T07:09:17.736390+0000) -- 0x68e07d0 con 0xb230840
2024-03-27T07:09:17.738+0000 d7a1640 20 mon.a@0(leader).mds e14 maybe_resize_cluster in 3 max 3
On the MDS side, beacons are sent every 4s; see the mds option mds_beacon_interval.
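The timing relationship between the two sides can be sketched quickly. Assuming the values observed in this run (beacons every 4s per mds_beacon_interval, a ~60s grace before the MON gives up, per the "since 64.0174" line above):

```python
def beacons_lost_before_replacement(beacon_interval=4.0, beacon_grace=60.0):
    """How many consecutive beacons must go missing before the MON
    marks the MDS laggy and replaces it.

    Defaults reflect the settings observed in this run, not upstream
    defaults; both are tunable (mds_beacon_interval / mds_beacon_grace).
    """
    return int(beacon_grace // beacon_interval)
```

So roughly 15 consecutive beacons have to be lost or delayed before replacement kicks in; here the seq 9 beacon was delayed by about two minutes, well past the grace period, so the replacement itself was working as designed.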
From the logs I couldn't tell exactly why the beacon message was stuck. But IMO we need a fix around the mds_beacon_interval option?
Updated by Xiubo Li almost 2 years ago
- Related to Bug #6608: samba teuthology dbench failure added
Updated by Xiubo Li almost 2 years ago
- Related to Bug #6718: handle not found error running dbench.sh added
Updated by Xiubo Li almost 2 years ago
- Related to Bug #42920: mds: removed from map due to dropped (?) beacons added
Updated by Xiubo Li almost 2 years ago
- Related to Bug #42919: mds: heartbeat timeout during large scale git-clone/rm workload added
Updated by Xiubo Li almost 2 years ago · Edited
Both the beacon and mdsping messages were sent out, but neither the MON nor the MDS received them:
-861> 2024-03-27T07:08:57.499+0000 edf9640 5 mds.beacon.b Sending beacon up:active seq 9
-860> 2024-03-27T07:08:57.499+0000 edf9640 10 monclient: _send_mon_message to mon.b at v2:172.21.15.130:3300/0
-859> 2024-03-27T07:08:57.499+0000 edf9640 1 -- [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] --> [v2:172.21.15.130:3300/0,v1:172.21.15.130:6789/0] -- mdsbeacon(4285/b up:active seq=9 v12) -- 0xf794310 con 0x8c3ed40
-858> 2024-03-27T07:08:57.499+0000 edf9640 20 mds.beacon.b sender thread waiting interval 4s
-857> 2024-03-27T07:08:57.513+0000 15605640 10 mds.metric.aggregator ping_all_active_ranks: pinging 2 active mds(s)
-856> 2024-03-27T07:08:57.513+0000 15605640 20 mds.metric.aggregator ping_all_active_ranks: pinging rank=0 addr=[v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123]
-855> 2024-03-27T07:08:57.513+0000 15605640 10 mds.pinger send_ping: rank=0
-854> 2024-03-27T07:08:57.513+0000 15605640 10 mds.pinger send_ping: sending ping with sequence=4 to rank=0
-853> 2024-03-27T07:08:57.514+0000 15605640 1 -- [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] send_to--> mds [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] -- mdsping -- ?+0 0x142f6cc0
-852> 2024-03-27T07:08:57.514+0000 15605640 1 -- [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] --> [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] -- mdsping -- 0x142f6cc0 con 0x8c1d2a0
-851> 2024-03-27T07:08:57.514+0000 15605640 20 mds.metric.aggregator ping_all_active_ranks: pinging rank=1 addr=[v2:172.21.15.33:6836/1439810540,v1:172.21.15.33:6837/1439810540]
-850> 2024-03-27T07:08:57.514+0000 15605640 10 mds.pinger send_ping: rank=1
-849> 2024-03-27T07:08:57.515+0000 15605640 20 mds.pinger send_ping: init ping pong state for rank=1
-848> 2024-03-27T07:08:57.515+0000 15605640 10 mds.pinger send_ping: sending ping with sequence=1 to rank=1
-847> 2024-03-27T07:08:57.515+0000 15605640 1 -- [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] send_to--> mds [v2:172.21.15.33:6836/1439810540,v1:172.21.15.33:6837/1439810540] -- mdsping -- ?+0 0x142f6e80
-846> 2024-03-27T07:08:57.515+0000 15605640 1 -- [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] --> [v2:172.21.15.33:6836/1439810540,v1:172.21.15.33:6837/1439810540] -- mdsping -- 0x142f6e80 con 0x16f45ce0
-845> 2024-03-27T07:08:57.519+0000 e5f8640 20 mds.0.4 get_task_status
-844> 2024-03-27T07:08:57.519+0000 e5f8640 20 mds.0.4 schedule_update_timer_task
-843> 2024-03-27T07:08:57.519+0000 e5f8640 10 mds.0.server find_idle_sessions. last cleared laggy state 1199.64s ago
The messages must have been stuck in the protocol sender queue or somewhere along the route.
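The cross-log matching done above (beacons the MDS claims to have sent vs. beacons the MON actually dispatched) can be automated with a small helper. This is a hedged sketch: the regexes are derived from the exact log lines quoted in this ticket, not from any stable log format, so they may need adjusting for other Ceph versions or debug levels:

```python
import re

# "mds.beacon.b Sending beacon up:active seq 9" (MDS side, send path)
SENT_RE = re.compile(r"Sending beacon up:\w+ seq (\d+)")
# "<== mds.? ... mdsbeacon(4285/b up:active seq=9 v12)" (MON side, receive path)
RECV_RE = re.compile(r"<==.*mdsbeacon\(\d+/\w+ up:\w+ seq=(\d+)")

def missing_beacons(mds_log_lines, mon_log_lines):
    """Return the beacon seq numbers sent by the MDS but never seen by the MON."""
    sent = {int(m.group(1)) for line in mds_log_lines for m in SENT_RE.finditer(line)}
    received = {int(m.group(1)) for line in mon_log_lines for m in RECV_RE.finditer(line)}
    return sorted(sent - received)
```

Running this over the mds.b and mon logs would pinpoint exactly which seqs vanished in flight, narrowing down where in the messenger the stall occurred.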
Updated by Xiubo Li almost 2 years ago
Xiubo Li wrote in #note-10:
Both the beacon and mdsping messages were sent out, but neither the MON nor the MDS received them:
[...]
The messages should be stuck in the protocol sender queue or the route.
Raised a PR to add more debug logs in the protocol code: https://github.com/ceph/ceph/pull/57857.
Updated by Konstantin Shalygin about 1 year ago
- Backport changed from quincy,reef,squid to reef,squid