Bug #65309


qa: dbench.sh failed with "ERROR: handle 10318 was not found"

Added by Rishabh Dave almost 2 years ago. Updated 9 months ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
-
% Done:

0%

Source:
Backport:
reef,squid
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
qa-failure
Pull request ID:
Tags (freeform):
Merge Commit:
Fixed In:
Released In:
Upkeep Timestamp:

Description

Link to the job - https://pulpito.ceph.com/rishabh-2024-03-27_05:27:11-fs-wip-rishabh-testing-20240326.131558-testing-default-smithi/7625621

On the surface, this failure looks similar to https://tracker.ceph.com/issues/57656, but it isn't, because the failed job for this ticket doesn't contain the following lines:

2022-09-22T14:04:02.650 INFO:tasks.workunit.client.0.smithi124.stdout:[1415095] write failed on handle 10009 (Resource temporarily unavailable)
2022-09-22T14:04:02.650 INFO:tasks.workunit.client.0.smithi124.stdout:Child failed with status 1

Instead, it contains the following lines:

2024-03-27T07:16:18.287 DEBUG:teuthology.orchestra.run.smithi033:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph osd pool get cephfs_metadata pg_num
2024-03-27T07:16:18.431 INFO:tasks.workunit.client.0.smithi033.stdout:[2060] open ./clients/client0/~dmtmp/PARADOX/__50172.DB failed for handle 10318 (Transport endpoint is not connected)
2024-03-27T07:16:18.431 INFO:tasks.workunit.client.0.smithi033.stdout:(2062) ERROR: handle 10318 was not found
2024-03-27T07:16:18.431 INFO:tasks.workunit.client.0.smithi033.stdout:Child failed with status 1
2024-03-27T07:16:18.435 DEBUG:teuthology.orchestra.run:got remote process result: 1
2024-03-27T07:16:18.435 INFO:tasks.workunit:Stopping ['suites/dbench.sh'] on client.0...

The following entries were found in /a/rishabh-2024-03-27_05:27:11-fs-wip-rishabh-testing-20240326.131558-testing-default-smithi/7625621/remote/smithi033/log/ceph-client.0.44701.log.gz:

169792:2024-03-27T07:15:16.165+0000 143de640 10 client.4605 _lookup concluded ENOENT locally for 0x1000000021f.head(faked_ino=0 nref=5 ll_ref=187 cap_refs={} open={} mode=40700 size=0/0 nlink=1 btime=2024-03-27T07:13:12.869114+0000 mtime=2024-03-27T07:15:13.929586+0000 ctime=2024-03-27T07:15:13.929586+0000 change_attr=27 caps=pAsLsXsFsx(0=pAsLsXsFsx) COMPLETE parents=0x10000000004.head["PARADOX"] 0x15d9f720) dn 'ERRORCHG.FAM'
169853:2024-03-27T07:15:16.197+0000 13bdd640 10 client.4605 _lookup concluded ENOENT locally for 0x1000000021f.head(faked_ino=0 nref=6 ll_ref=188 cap_refs={} open={} mode=40700 size=0/0 nlink=1 btime=2024-03-27T07:13:12.869114+0000 mtime=2024-03-27T07:15:13.929586+0000 ctime=2024-03-27T07:15:13.929586+0000 change_attr=27 caps=pAsLsXsFsx(0=pAsLsXsFsx) COMPLETE parents=0x10000000004.head["PARADOX"] 0x15d9f720) dn 'ERRORCHG.DB'

There are more entries in the same log for ERRORCHG.DB and ERRORCHG.FAM.
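A quick way to pull all of those entries out of the gzipped client log (this is just a convenience sketch; the path is the one quoted above, and the pattern covers the ERRORCHG dentries plus the failing handle):

```shell
# Extract the relevant lines from the gzipped ceph-client log.
# LOG points at the file referenced above on the teuthology archive host.
LOG=/a/rishabh-2024-03-27_05:27:11-fs-wip-rishabh-testing-20240326.131558-testing-default-smithi/7625621/remote/smithi033/log/ceph-client.0.44701.log.gz

# -n keeps the original line numbers (e.g. 169792, 169853 above);
# the pattern matches both ERRORCHG dentries and handle 10318.
zgrep -n -E 'ERRORCHG\.(DB|FAM)|handle 10318' "$LOG" | head -n 20
```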


Related issues 4 (2 open, 2 closed)

Related to CephFS - Bug #6608: samba teuthology dbench failure (Can't reproduce, 10/21/2013)
Related to CephFS - Bug #6718: handle not found error running dbench.sh (Duplicate, Greg Farnum, 11/05/2013)
Related to CephFS - Bug #42920: mds: removed from map due to dropped (?) beacons (New)
Related to CephFS - Bug #42919: mds: heartbeat timeout during large scale git-clone/rm workload (New)
Actions #1

Updated by Rishabh Dave almost 2 years ago

  • Description updated (diff)
Actions #2

Updated by Rishabh Dave almost 2 years ago

  • Labels (FS) qa-failure added
Actions #3

Updated by Venky Shankar almost 2 years ago

  • Category set to Correctness/Safety
  • Assignee set to Xiubo Li
  • Target version set to v20.0.0
  • Backport set to quincy,reef,squid
Actions #4

Updated by Xiubo Li almost 2 years ago

  • Status changed from New to In Progress
Actions #5

Updated by Xiubo Li almost 2 years ago · Edited

The ceph-fuse mountpoint was unmounted while the dbench test was still running:

2024-03-27T07:16:09.559 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.mds.c is failed for ~235s
2024-03-27T07:16:09.559 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.mds.b is failed for ~296s
2024-03-27T07:16:10.466 INFO:tasks.workunit.client.0.smithi033.stdout:   1      1997     0.04 MB/sec  execute  80 sec  latency 681.539 ms
2024-03-27T07:16:11.467 INFO:tasks.workunit.client.0.smithi033.stdout:   1      2018     0.04 MB/sec  execute  81 sec  latency 937.546 ms
2024-03-27T07:16:12.355 DEBUG:teuthology.orchestra.run.smithi033:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2024-03-27T07:16:12.364 DEBUG:teuthology.orchestra.run.smithi130:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2024-03-27T07:16:12.466 INFO:tasks.workunit.client.0.smithi033.stdout:   1      2034     0.04 MB/sec  execute  82 sec  latency 214.462 ms
2024-03-27T07:16:13.466 INFO:tasks.workunit.client.0.smithi033.stdout:   1      2042     0.04 MB/sec  execute  83 sec  latency 789.478 ms
2024-03-27T07:16:14.467 INFO:tasks.workunit.client.0.smithi033.stdout:   1      2043     0.04 MB/sec  execute  84 sec  latency 882.551 ms
2024-03-27T07:16:15.467 INFO:tasks.workunit.client.0.smithi033.stdout:   1      2051     0.04 MB/sec  execute  85 sec  latency 932.632 ms
2024-03-27T07:16:16.271 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.mds.c is failed for ~242s
2024-03-27T07:16:16.271 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.mds.b is failed for ~303s
2024-03-27T07:16:16.271 INFO:tasks.daemonwatchdog.daemon_watchdog:BARK! unmounting mounts and killing all daemons
2024-03-27T07:16:16.272 DEBUG:teuthology.orchestra.run.smithi033:> set -ex

This is because the watchdog barked on the mds.b and mds.c daemons, then killed all the daemons and, at the same time, unmounted the ceph-fuse client.

For example, mds.b sent the beacon with seq 9 but didn't get any reply, and then sent a new beacon with seq 10:

 -1197> 2024-03-27T07:08:13.574+0000 ddf7640 10 monclient: _check_auth_tickets
 -1196> 2024-03-27T07:08:13.574+0000 ddf7640 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2024-03-27T07:07:43.575530+0000)
 -1195> 2024-03-27T07:08:13.627+0000 edf9640  5 mds.beacon.b Sending beacon up:active seq 8
 -1194> 2024-03-27T07:08:13.627+0000 edf9640 10 monclient: _send_mon_message to mon.b at v2:172.21.15.130:3300/0
 -1193> 2024-03-27T07:08:13.627+0000 edf9640  1 -- [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] --> [v2:172.21.15.130:3300/0,v1:172.21.15.130:6789/0] -- mdsbeacon(4285/b up:active seq=8 v10) -- 0x166a94c0 con 0x8c3ed40
 -1192> 2024-03-27T07:08:13.628+0000 edf9640 20 mds.beacon.b sender thread waiting interval 4s
 -1191> 2024-03-27T07:08:13.651+0000 9def640  1 -- [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] <== mon.1 v2:172.21.15.130:3300/0 60 ==== mdsbeacon(4285/b up:active seq=8 v10) ==== 130+0+0 (secure 0 0 0) 0x145853f0 con 0x8c3ed40
 -1190> 2024-03-27T07:08:13.651+0000 9def640  5 mds.beacon.b received beacon reply up:active seq 8 rtt 0.0239996
 -1189> 2024-03-27T07:08:13.653+0000 edf9640 20 mds.beacon.b sender thread waiting interval 3.974s
 -1188> 2024-03-27T07:08:13.843+0000 111fd640  2 mds.0.cache Memory usage:  total 618156, rss 326256, heap 577588, baseline 498920, 0 / 13 inodes have caps, 0 caps, 0 caps per inode
 -1187> 2024-03-27T07:08:13.843+0000 111fd640 20 mds.0.cache upkeep thread trimming cache; last trim 0.999981821s ago

...

  -861> 2024-03-27T07:08:57.499+0000 edf9640  5 mds.beacon.b Sending beacon up:active seq 9
  -860> 2024-03-27T07:08:57.499+0000 edf9640 10 monclient: _send_mon_message to mon.b at v2:172.21.15.130:3300/0
  -859> 2024-03-27T07:08:57.499+0000 edf9640  1 -- [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] --> [v2:172.21.15.130:3300/0,v1:172.21.15.130:6789/0] -- mdsbeacon(4285/b up:active seq=9 v12) -- 0xf794310 con 0x8c3ed40
  -858> 2024-03-27T07:08:57.499+0000 edf9640 20 mds.beacon.b sender thread waiting interval 4s
  -857> 2024-03-27T07:08:57.513+0000 15605640 10 mds.metric.aggregator ping_all_active_ranks: pinging 2 active mds(s)
  -856> 2024-03-27T07:08:57.513+0000 15605640 20 mds.metric.aggregator ping_all_active_ranks: pinging rank=0 addr=[v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123]
  -855> 2024-03-27T07:08:57.513+0000 15605640 10 mds.pinger send_ping: rank=0
  -854> 2024-03-27T07:08:57.513+0000 15605640 10 mds.pinger send_ping: sending ping with sequence=4 to rank=0
  -853> 2024-03-27T07:08:57.514+0000 15605640  1 -- [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] send_to--> mds [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] -- mdsping -- ?+0 0x142f6cc0
  -852> 2024-03-27T07:08:57.514+0000 15605640  1 -- [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] --> [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] -- mdsping -- 0x142f6cc0 con 0x8c1d2a0
  -851> 2024-03-27T07:08:57.514+0000 15605640 20 mds.metric.aggregator ping_all_active_ranks: pinging rank=1 addr=[v2:172.21.15.33:6836/1439810540,v1:172.21.15.33:6837/1439810540]
  -850> 2024-03-27T07:08:57.514+0000 15605640 10 mds.pinger send_ping: rank=1

...

  -124> 2024-03-27T07:11:07.737+0000 edf9640  5 mds.beacon.b Sending beacon up:active seq 10
  -123> 2024-03-27T07:11:07.737+0000 edf9640 10 monclient: _send_mon_message to mon.b at v2:172.21.15.130:3300/0
  -122> 2024-03-27T07:11:07.737+0000 edf9640  1 -- [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] --> [v2:172.21.15.130:3300/0,v1:172.21.15.130:6789/0] -- mdsbeacon(4285/b up:active seq=10 v12) -- 0xf7f94f0 con 0x8d2f4d0
  -121> 2024-03-27T07:11:07.737+0000 edf9640 20 mds.beacon.b sender thread waiting interval 4s
  -120> 2024-03-27T07:11:07.737+0000 cdf5640  1 mds.b Updating MDS map to version 13 from mon.1
  -119> 2024-03-27T07:11:07.739+0000 cdf5640 10 mds.b      my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2,11=minor log segments,12=quiesce subvolumes}
  -118> 2024-03-27T07:11:07.739+0000 cdf5640 10 mds.b  mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2,11=minor log segments,12=quiesce subvolumes}
  -117> 2024-03-27T07:11:07.739+0000 cdf5640 10 mds.b my gid is 4285
  -116> 2024-03-27T07:11:07.739+0000 cdf5640 10 mds.b map says I am mds.0.4 state up:active
  -115> 2024-03-27T07:11:07.739+0000 cdf5640 10 mds.b msgr says I am [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123]
  -114> 2024-03-27T07:11:07.739+0000 cdf5640 10 mds.b handle_mds_map: handling map as rank 0
  -113> 2024-03-27T07:11:07.741+0000 cdf5640  4 mds.0.4 set_osd_epoch_barrier: epoch=24
  -112> 2024-03-27T07:11:07.741+0000 cdf5640 10 mds.metric.aggregator notify_mdsmap
...

But by the time the beacon with seq 10 was sent, the MON had already replaced mds.b.

Meanwhile, the mon.b logs show that mds.b's beacon seq 9 was only received around two minutes later (sent at 07:08:57, received at 07:11:06). The beacon message must have been stuck somewhere for some reason:

2024-03-27T07:11:06.705+0000 ffa6640  1 -- [v2:172.21.15.130:3300/0,v1:172.21.15.130:6789/0] <== mds.? v2:172.21.15.130:6832/1011073123 54 ==== mdsbeacon(4285/b up:active seq=9 v12) ==== 512+0+0 (secure 0 0 0) 0x7d55a80 con 0x14c5dfa0
2024-03-27T07:11:06.708+0000 ffa6640 20 mon.b@1(peon) e1 _ms_dispatch existing session 0x14cc7170 for mds.?
2024-03-27T07:11:06.708+0000 ffa6640 20 mon.b@1(peon) e1  entity_name mds.b global_id 4285 (new_ok) caps allow *
2024-03-27T07:11:06.708+0000 ffa6640 10 mon.b@1(peon).paxosservice(mdsmap 1..21) dispatch 0x7d55a80 mdsbeacon(4285/b up:active seq=9 v12) from mds.? v2:172.21.15.130:6832/1011073123 con 0x14c5dfa0
2024-03-27T07:11:06.708+0000 ffa6640 10 mon.b@1(peon).paxosservice(mdsmap 1..21)  discarding message from disconnected client mds.? v2:172.21.15.130:6832/1011073123 mdsbeacon(4285/b up:active seq=9 v12)
2024-03-27T07:11:06.708+0000 ffa6640 10 mon.b@1(peon) e1 ms_handle_reset 0x14c5dfa0 v2:172.21.15.130:6832/1011073123

But the MON only allows about 60s without a beacon before it assumes mds.b is dead and replaces it:

2024-03-27T07:09:17.648+0000 ffa6640  4 mon.a@0(leader).mgr e5 beacon from 4106
2024-03-27T07:09:17.648+0000 ffa6640 10 mon.a@0(leader).mgr e5 from existing standby 4106
2024-03-27T07:09:17.648+0000 ffa6640 10 mon.a@0(leader).mgr e5 no change
2024-03-27T07:09:17.658+0000 d7a1640 11 mon.a@0(leader) e1 tick
2024-03-27T07:09:17.661+0000 d7a1640  1 mon.a@0(leader).mds e14 no beacon from mds.0.4 (gid: 4285 addr: [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] state: up:active) since 64.0174
2024-03-27T07:09:17.668+0000 d7a1640  1 mon.a@0(leader).mds e14  marking 4285 [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] mds.0.4 up:active laggy
2024-03-27T07:09:17.689+0000 d7a1640  1 mon.a@0(leader).mds e14  replacing 4285 [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] mds.0.4 up:active with 4221/e [v2:172.21.15.33:6834/3132869688,v1:172.21.15.33:6835/3132869688]
2024-03-27T07:09:17.695+0000 d7a1640  0 log_channel(cluster) log [WRN] : Replacing daemon mds.b as rank 0 with standby daemon mds.e
2024-03-27T07:09:17.696+0000 d7a1640  1 -- [v2:172.21.15.33:3300/0,v1:172.21.15.33:6789/0] --> [v2:172.21.15.33:3300/0,v1:172.21.15.33:6789/0] -- log(1 entries from seq 330 at 2024-03-27T07:09:17.696139+0000) -- 0x67fe000 con 0xb230840
2024-03-27T07:09:17.700+0000 d7a1640  1 mon.a@0(leader).mds e14 fail_mds_gid 4285 mds.b role 0
2024-03-27T07:09:17.707+0000 d7a1640 10 mon.a@0(leader).osd e26 blocklist [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] until 2024-03-28T07:09:17.705494+0000
2024-03-27T07:09:17.735+0000 d7a1640  0 log_channel(cluster) log [INF] : MDS daemon mds.b is removed because it is dead or otherwise unavailable.
2024-03-27T07:09:17.736+0000 d7a1640  1 -- [v2:172.21.15.33:3300/0,v1:172.21.15.33:6789/0] --> [v2:172.21.15.33:3300/0,v1:172.21.15.33:6789/0] -- log(1 entries from seq 331 at 2024-03-27T07:09:17.736390+0000) -- 0x68e07d0 con 0xb230840
2024-03-27T07:09:17.738+0000 d7a1640 20 mon.a@0(leader).mds e14 maybe_resize_cluster in 3 max 3

On the MDS side, the beacon is sent every 4s; see the mds option mds_beacon_interval.
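The timing relationship can be sketched with a quick back-of-the-envelope check. The 4s interval comes from mds_beacon_interval; the ~60s cutoff is what the mon log shows here ("since 64.0174"), so the grace value below is an assumption taken from this job rather than a universal default:

```python
# Rough sketch of the beacon timing seen in this job (values are
# assumptions read off the logs above, not authoritative defaults).
mds_beacon_interval = 4.0  # seconds between beacons on the MDS side
beacon_grace = 60.0        # seconds the MON tolerates without a beacon here

# How many consecutive beacons can be lost before the MON marks the
# MDS laggy and replaces it with a standby.
missed_before_replacement = int(beacon_grace // mds_beacon_interval)
print(missed_before_replacement)  # 15 consecutive beacons
```

So roughly fifteen beacons in a row went unanswered before mds.b was replaced, which is consistent with the gap between seq 8 (07:08:13) and the replacement (07:09:17).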

From the logs I couldn't tell exactly why the beacon message got stuck. But IMO we need a fix around the mds_beacon_interval option.

Actions #6

Updated by Xiubo Li almost 2 years ago

  • Related to Bug #6608: samba teuthology dbench failure added
Actions #7

Updated by Xiubo Li almost 2 years ago

  • Related to Bug #6718: handle not found error running dbench.sh added
Actions #8

Updated by Xiubo Li almost 2 years ago

  • Related to Bug #42920: mds: removed from map due to dropped (?) beacons added
Actions #9

Updated by Xiubo Li almost 2 years ago

  • Related to Bug #42919: mds: heartbeat timeout during large scale git-clone/rm workload added
Actions #10

Updated by Xiubo Li almost 2 years ago · Edited

Both the beacon and mdsping messages were sent out, but neither the MON nor the MDS received them:

  -861> 2024-03-27T07:08:57.499+0000 edf9640  5 mds.beacon.b Sending beacon up:active seq 9
  -860> 2024-03-27T07:08:57.499+0000 edf9640 10 monclient: _send_mon_message to mon.b at v2:172.21.15.130:3300/0
  -859> 2024-03-27T07:08:57.499+0000 edf9640  1 -- [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] --> [v2:172.21.15.130:3300/0,v1:172.21.15.130:6789/0] -- mdsbeacon(4285/b up:active seq=9 v12) -- 0xf794310 con 0x8c3ed40
  -858> 2024-03-27T07:08:57.499+0000 edf9640 20 mds.beacon.b sender thread waiting interval 4s
  -857> 2024-03-27T07:08:57.513+0000 15605640 10 mds.metric.aggregator ping_all_active_ranks: pinging 2 active mds(s)
  -856> 2024-03-27T07:08:57.513+0000 15605640 20 mds.metric.aggregator ping_all_active_ranks: pinging rank=0 addr=[v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123]
  -855> 2024-03-27T07:08:57.513+0000 15605640 10 mds.pinger send_ping: rank=0
  -854> 2024-03-27T07:08:57.513+0000 15605640 10 mds.pinger send_ping: sending ping with sequence=4 to rank=0
  -853> 2024-03-27T07:08:57.514+0000 15605640  1 -- [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] send_to--> mds [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] -- mdsping -- ?+0 0x142f6cc0
  -852> 2024-03-27T07:08:57.514+0000 15605640  1 -- [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] --> [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] -- mdsping -- 0x142f6cc0 con 0x8c1d2a0
  -851> 2024-03-27T07:08:57.514+0000 15605640 20 mds.metric.aggregator ping_all_active_ranks: pinging rank=1 addr=[v2:172.21.15.33:6836/1439810540,v1:172.21.15.33:6837/1439810540]
  -850> 2024-03-27T07:08:57.514+0000 15605640 10 mds.pinger send_ping: rank=1
  -849> 2024-03-27T07:08:57.515+0000 15605640 20 mds.pinger send_ping: init ping pong state for rank=1
  -848> 2024-03-27T07:08:57.515+0000 15605640 10 mds.pinger send_ping: sending ping with sequence=1 to rank=1
  -847> 2024-03-27T07:08:57.515+0000 15605640  1 -- [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] send_to--> mds [v2:172.21.15.33:6836/1439810540,v1:172.21.15.33:6837/1439810540] -- mdsping -- ?+0 0x142f6e80
  -846> 2024-03-27T07:08:57.515+0000 15605640  1 -- [v2:172.21.15.130:6832/1011073123,v1:172.21.15.130:6833/1011073123] --> [v2:172.21.15.33:6836/1439810540,v1:172.21.15.33:6837/1439810540] -- mdsping -- 0x142f6e80 con 0x16f45ce0
  -845> 2024-03-27T07:08:57.519+0000 e5f8640 20 mds.0.4 get_task_status
  -844> 2024-03-27T07:08:57.519+0000 e5f8640 20 mds.0.4 schedule_update_timer_task
  -843> 2024-03-27T07:08:57.519+0000 e5f8640 10 mds.0.server find_idle_sessions. last cleared laggy state 1199.64s ago

The messages must be stuck in the protocol sender queue or somewhere en route.
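As a toy illustration (not Ceph code), this is the failure shape being described: the caller "sends" a message by enqueuing it onto the messenger's outgoing queue and returns immediately, so if the writer side stalls, the message counts as sent in the caller's log yet never reaches the peer. All names below are hypothetical:

```python
import queue
import threading
import time

out_q = queue.Queue()   # stand-in for the protocol's sender queue
delivered = []          # what actually made it "onto the wire"

def writer(stalled):
    # Stand-in for the messenger writer thread: drains the queue
    # until it stalls, after which enqueued messages go nowhere.
    while not stalled.is_set():
        try:
            delivered.append(out_q.get(timeout=0.05))
        except queue.Empty:
            pass

stall = threading.Event()
t = threading.Thread(target=writer, args=(stall,))
t.start()

out_q.put("mdsbeacon seq=9")   # caller returns immediately: logged as "sent"
time.sleep(0.2)                # writer is healthy, seq 9 is delivered
stall.set()
t.join()
out_q.put("mdsbeacon seq=10")  # writer stalled: enqueued but never delivered

print(delivered)  # only seq=9 made it out
```

This is why the MDS-side log can show the beacon being sent ("--> ... mdsbeacon(...)") while the MON sees nothing until much later.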

Actions #11

Updated by Xiubo Li almost 2 years ago

Xiubo Li wrote in #note-10:

Both the beacon and mdsping messages were sent out, but neither the MON nor the MDS received them:

[...]

The messages must be stuck in the protocol sender queue or somewhere en route.

Raised a PR to add more debug logs in the protocol code: https://github.com/ceph/ceph/pull/57857.

Actions #12

Updated by Konstantin Shalygin about 1 year ago

  • Backport changed from quincy,reef,squid to reef,squid
Actions #13

Updated by Patrick Donnelly 9 months ago

  • Target version deleted (v20.0.0)