Bug #67565

/tmp/ccxRxwnL.s: Fatal error: can't close fs/file.o: Bad file descriptor in kernel_untar_build.sh

Added by Brad Hubbard over 1 year ago. Updated 5 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
% Done:

0%

Source:
Q/A
Backport:
reef,squid
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
powercycle
Component(FS):
Labels (FS):
Pull request ID:
Tags (freeform):
backport_processed
Fixed In:
v19.3.0-4564-g704d4f67a0
Released In:
v20.2.0~2140
Upkeep Timestamp:
2025-11-01T01:27:44+00:00

Description

/a/yuriw-2024-08-02_15:42:13-powercycle-squid-release-distro-default-smithi/7833420

2024-08-03T06:55:27.453 INFO:tasks.workunit.client.0.smithi052.stderr:/tmp/ccxRxwnL.s: Assembler messages:
2024-08-03T06:55:27.453 INFO:tasks.workunit.client.0.smithi052.stderr:/tmp/ccxRxwnL.s: Fatal error: can't close fs/file.o: Bad file descriptor
2024-08-03T06:55:27.453 INFO:tasks.workunit.client.0.smithi052.stderr:make[3]: *** [scripts/Makefile.build:243: fs/file.o] Error 1
2024-08-03T06:55:27.453 INFO:tasks.workunit.client.0.smithi052.stderr:make[3]: *** Waiting for unfinished jobs....
...
2024-08-03T07:07:19.346 DEBUG:teuthology.orchestra.run:got remote process result: 2
2024-08-03T07:07:19.347 INFO:tasks.workunit.client.0.smithi052.stderr:make: *** [Makefile:234: __sub-make] Error 2

Looks like an actual build failure building the kernel.


Related issues (4): 1 open, 3 closed

Related to CephFS - Bug #67567: stdout:Probably out of disk space in /qa/workunits/suites/ffsb.sh (New, Venky Shankar)
Related to CephFS - Bug #66030: dbench.sh fails with Bad file descriptor (fs:cephadm:multivolume) (Duplicate, Venky Shankar)
Copied to CephFS - Backport #67775: squid: /tmp/ccxRxwnL.s: Fatal error: can't close fs/file.o: Bad file descriptor in kernel_untar_build.sh (Resolved, Venky Shankar)
Copied to CephFS - Backport #67776: reef: /tmp/ccxRxwnL.s: Fatal error: can't close fs/file.o: Bad file descriptor in kernel_untar_build.sh (Resolved, Venky Shankar)
#1

Updated by Brad Hubbard over 1 year ago

  • Description updated (diff)
#2

Updated by Brad Hubbard over 1 year ago · Edited

I reproduced this and when I try to change into the test directory I see the following.

$ cd /home/ubuntu/cephtest/mnt.0/client.0/tmp
-bash: cd: /home/ubuntu/cephtest/mnt.0/client.0/tmp: Transport endpoint is not connected

$ mount|grep cephtest
nsfs on /run/netns/ceph-ns--home-ubuntu-cephtest-mnt.0 type nsfs (rw)
nsfs on /run/netns/ceph-ns--home-ubuntu-cephtest-mnt.0 type nsfs (rw)
ceph-fuse on /home/ubuntu/cephtest/mnt.0 type fuse.ceph-fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)

In the logs I see the following, and I noticed the same stack trace in some of
my other reproducers, but not all of them. Chicken or egg?

2024-08-15T06:39:41.465+0000 7fa575ffb640 -1 *** Caught signal (Segmentation fault) **
 in thread 7fa575ffb640 thread_name:ceph-fuse

 ceph version 19.1.0-1260-g26c3fb8e (26c3fb8e197dcf7a49a54d1f4c8a7362ee35a8ea) squid (rc)
 1: /lib64/libc.so.6(+0x3e6f0) [0x7fa59383e6f0]
 2: (Client::ll_write(Fh*, long, long, char const*)+0xd0) [0x55ede16eac40]
 3: ceph-fuse(+0x9167d) [0x55ede164b67d]
 4: /lib64/libfuse.so.2(+0x13f13) [0x7fa594b26f13]
 5: /lib64/libfuse.so.2(+0x1f9ac) [0x7fa594b329ac]
 6: /lib64/libfuse.so.2(+0x109ad) [0x7fa594b239ad]
 7: /lib64/libc.so.6(+0x89c02) [0x7fa593889c02]
 8: /lib64/libc.so.6(+0x10ec40) [0x7fa59390ec40]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Looking at the coredump now.

(gdb) bt
#0  0x00007fa59388b94c in __pthread_kill_implementation () from /lib64/libc.so.6
#1  0x00007fa59383e646 in raise () from /lib64/libc.so.6
#2  0x000055ede1724d9a in reraise_fatal (signum=11) at /usr/src/debug/ceph-19.1.0-1260.g26c3fb8e.el9.x86_64/src/global/signal_handler.cc:88
#3  handle_oneshot_fatal_signal (signum=11) at /usr/src/debug/ceph-19.1.0-1260.g26c3fb8e.el9.x86_64/src/global/signal_handler.cc:367
#4  <signal handler called>
#5  std::_Rb_tree<Fh*, Fh*, std::_Identity<Fh*>, std::less<Fh*>, std::allocator<Fh*> >::_S_left (__x=<optimized out>) at /usr/include/c++/11/bits/stl_function.h:447
#6  std::_Rb_tree<Fh*, Fh*, std::_Identity<Fh*>, std::less<Fh*>, std::allocator<Fh*> >::_M_lower_bound (__k=@0x7fa575ff8e38: 0x7fa51c267c10, __y=0x7fa4c8230ee0, __x=0x6be8e0, this=0x55ede205cb98)
    at /usr/include/c++/11/bits/stl_tree.h:1922
#7  std::_Rb_tree<Fh*, Fh*, std::_Identity<Fh*>, std::less<Fh*>, std::allocator<Fh*> >::find (__k=@0x7fa575ff8e38: 0x7fa51c267c10, this=0x55ede205cb98) at /usr/include/c++/11/bits/stl_tree.h:2536
#8  std::set<Fh*, std::less<Fh*>, std::allocator<Fh*> >::count (__x=@0x7fa575ff8e38: 0x7fa51c267c10, this=0x55ede205cb98) at /usr/include/c++/11/bits/stl_set.h:749
#9  Client::_ll_fh_exists (f=0x7fa51c267c10, this=0x55ede205bea0) at /usr/src/debug/ceph-19.1.0-1260.g26c3fb8e.el9.x86_64/src/client/Client.h:1054
#10 Client::ll_write (this=0x55ede205bea0, fh=0x7fa51c267c10, off=2624, len=40, data=0x7fa55a55b310 "") at /usr/src/debug/ceph-19.1.0-1260.g26c3fb8e.el9.x86_64/src/client/Client.cc:15933
#11 0x000055ede164b67d in fuse_ll_write (req=0x7fa4f0b5f110, ino=<optimized out>, buf=0x7fa55a55b310 "", size=40, off=2624, fi=0x7fa575ff8f10) at /usr/src/debug/ceph-19.1.0-1260.g26c3fb8e.el9.x86_64/src/client/fuse_ll.cc:906
#12 0x00007fa594b26f13 in do_write.lto_priv () from /lib64/libfuse.so.2
#13 0x00007fa594b329ac in fuse_ll_process_buf () from /lib64/libfuse.so.2
#14 0x00007fa594b239ad in fuse_do_work () from /lib64/libfuse.so.2
#15 0x00007fa593889c02 in start_thread () from /lib64/libc.so.6
#16 0x00007fa59390ec40 in clone3 () from /lib64/libc.so.6
(gdb) f 10
#10 Client::ll_write (this=0x55ede205bea0, fh=0x7fa51c267c10, off=2624, len=40, data=0x7fa55a55b310 "") at /usr/src/debug/ceph-19.1.0-1260.g26c3fb8e.el9.x86_64/src/client/Client.cc:15933
15933     if (fh == NULL || !_ll_fh_exists(fh)) {
(gdb) down
#9  Client::_ll_fh_exists (f=0x7fa51c267c10, this=0x55ede205bea0) at /usr/src/debug/ceph-19.1.0-1260.g26c3fb8e.el9.x86_64/src/client/Client.h:1054
1054        return ll_unclosed_fh_set.count(f);
(gdb) whatis ll_unclosed_fh_set
type = std::set<Fh*>
(gdb) p ll_unclosed_fh_set
$5 = std::set with 45 elements = {
  [0] = 0x7fa4a03f5590,
  [1] = 0x7fa4a04062b0,
  [2] = 0x7fa4ac3976c0,
  [3] = 0x7fa4ac3a5a50,
  [4] = 0x7fa4b44a3ec0,
  [5] = 0x7fa4b44d31b0,
  [6] = 0x7fa4b44eac70,
  [7] = 0x7fa4b44eafd0,
  [8] = 0x7fa4b44eb4a0,
  [9] = 0x7fa4b44eb7e0,
  [10] = 0x7fa4b44ed450,
  [11] = 0x7fa4b44ee9e0,
  [12] = 0x7fa4b44f1d00,
  [13] = 0x7fa4b44f2550,
  [14] = 0x7fa4bc4bf070,
  [15] = 0x7fa4bc4c5800,
  [16] = 0x7fa4c87accc0,
  [17] = 0x7fa4c87f4090,
  [18] = 0x7fa4d494a1f0,
  [19] = 0x7fa4e810ff20,
  [20] = 0x7fa4e814f8c0,
  [21] = 0x7fa4ec1bc7a0,
  [22] = 0x7fa4f0102450,
  [23] = 0x7fa4f01f7600,
  [24] = 0x7fa4f0ba37a0,
  [25] = 0x7fa4fc1bb4a0,
  [26] = 0x7fa50c252820,
  [27] = 0x7fa50c252980,
  [28] = 0x7fa5100aa540,
  [29] = 0x7fa5100fc510,
  [30] = 0x7fa51c1d1250,
  [31] = 0x7fa51c24da30,
  [32] = 0x7fa51c267350,
  [33] = 0x7fa51c267c10,
  [34] = 0x7fa51c272850,
  [35] = 0x7fa5401e5ae0,
  [36] = 0x7fa540c53f90,
  [37] = 0x7fa5415b3cc0,
  [38] = 0x7fa55811fff0,
  [39] = 0x7fa559f44350,
  [40] = 0x7fa55aab1c00,
  [41] = 0x7fa55b0174c0,
  [42] = 0x7fa55b088c20,
  [43] = 0x7fa55b1f38c0,
  [44] = 0x7fa55bb91250
}
(gdb) p f
$6 = (Fh *) 0x7fa51c267c10

So count() should have returned 1 but evidently encountered corruption; it's
not immediately clear why.

#3

Updated by Brad Hubbard over 1 year ago · Edited

  • Project changed from Ceph to CephFS
  • Category deleted (qa)

Moving this to FS for triage there.

Could this be related to https://tracker.ceph.com/issues/66771 ?

#4

Updated by Brad Hubbard over 1 year ago

  • Related to Bug #67567: stdout:Probably out of disk space in /qa/workunits/suites/ffsb.sh added
#5

Updated by Venky Shankar over 1 year ago

Brad Hubbard wrote in #note-2:

I reproduced this and when I try to change into the test directory I see the following.

[...]

In the logs I see the following, and I noticed the same stack trace in some of
my other reproducers, but not all of them. Chicken or egg?

[...]

Looking at the coredump now.

[...]

So count() should have returned 1 but evidently encountered corruption; it's
not immediately clear why.

Likely because the call to _ll_fh_exists should happen under the client_lock.
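
The diagnosis above can be sketched in a minimal standalone form. This is not Ceph's actual code: the `Client` class below, its `ll_register`/`ll_release` helpers, and the bare `Fh` struct are simplified stand-ins invented for illustration; only the lock-the-membership-check pattern corresponds to the real fix tracked in PR 59300.

```cpp
#include <cassert>
#include <mutex>
#include <set>

// Simplified stand-in for Ceph's file-handle type.
struct Fh {};

class Client {
  std::mutex client_lock;            // serializes access to client state
  std::set<Fh*> ll_unclosed_fh_set;  // handles opened via the ll_* interface

public:
  void ll_register(Fh* f) {          // hypothetical helper: track a new handle
    std::scoped_lock l(client_lock);
    ll_unclosed_fh_set.insert(f);
  }

  void ll_release(Fh* f) {           // hypothetical helper: drop a handle
    std::scoped_lock l(client_lock);
    ll_unclosed_fh_set.erase(f);     // erase() rebalances the red-black tree
  }

  // The membership check must run under the same lock: if another thread's
  // erase() rebalances the tree while count() is walking it, the traversal
  // can follow a stale node pointer -- the segfault in the backtrace above.
  bool ll_fh_exists(Fh* f) {
    std::scoped_lock l(client_lock);
    return ll_unclosed_fh_set.count(f) > 0;
  }
};
```

Without the lock in `ll_fh_exists`, the check races with a concurrent `ll_release`: `std::set` makes no thread-safety guarantees, so reading the tree during a rebalance is undefined behavior even when, as the coredump shows, the element is genuinely present in the set.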

#6

Updated by Venky Shankar over 1 year ago

  • Category set to Correctness/Safety
  • Status changed from In Progress to Fix Under Review
  • Target version set to v20.0.0
  • Backport set to reef,squid
  • Pull request ID set to 59300
#7

Updated by Venky Shankar over 1 year ago

Brad Hubbard wrote in #note-2:

I reproduced this and when I try to change into the test directory I see the following.

Thanks for reproducing this and for the crash backtrace - it helped immensely.

I checked the failed teuthology job and unfortunately there isn't a coredump captured o_O

#8

Updated by Venky Shankar over 1 year ago

  • Related to Bug #66030: dbench.sh fails with Bad file descriptor (fs:cephadm:multivolume) added
#9

Updated by Venky Shankar over 1 year ago

  • Status changed from Fix Under Review to Pending Backport
#10

Updated by Upkeep Bot over 1 year ago

  • Copied to Backport #67775: squid: /tmp/ccxRxwnL.s: Fatal error: can't close fs/file.o: Bad file descriptor in kernel_untar_build.sh added
#11

Updated by Upkeep Bot over 1 year ago

  • Copied to Backport #67776: reef: /tmp/ccxRxwnL.s: Fatal error: can't close fs/file.o: Bad file descriptor in kernel_untar_build.sh added
#12

Updated by Upkeep Bot over 1 year ago

  • Tags (freeform) set to backport_processed
#13

Updated by Brad Hubbard about 1 year ago

  • Status changed from Pending Backport to Resolved
#14

Updated by Patrick Donnelly 9 months ago

  • Merge Commit set to 704d4f67a0d63a551e28aa7f6c4b729554357954
  • Fixed In set to v19.3.0-4564-g704d4f67a0d6
#15

Updated by Upkeep Bot 9 months ago

  • Fixed In changed from v19.3.0-4564-g704d4f67a0d6 to v19.3.0-4564-g704d4f67a0d
  • Upkeep Timestamp set to 2025-07-08T18:29:42+00:00
#16

Updated by Upkeep Bot 8 months ago

  • Fixed In changed from v19.3.0-4564-g704d4f67a0d to v19.3.0-4564-g704d4f67a0
  • Upkeep Timestamp changed from 2025-07-08T18:29:42+00:00 to 2025-07-14T17:10:23+00:00
#17

Updated by Upkeep Bot 5 months ago

  • Released In set to v20.2.0~2140
  • Upkeep Timestamp changed from 2025-07-14T17:10:23+00:00 to 2025-11-01T01:27:44+00:00