Bug #63265: qa: fs/snaps/snaptest-git-ceph.sh failed when reseting to tag 'v0.1' - CephFS - Ceph

Actions

Copy link

Bug #63265

closed

qa: fs/snaps/snaptest-git-ceph.sh failed when reseting to tag 'v0.1'

Added by Xiubo Li over 2 years ago. Updated 5 months ago.

Status:

Resolved

Priority:

Normal

Assignee:

Xiubo Li

Category:

Correctness/Safety

Target version:

Ceph - v20.0.0

% Done:

100%

Source:

Q/A

Backport:

squid,reef,quincy

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Component(FS):

MDS

Labels (FS):

Pull request ID:

57754

Tags (freeform):

backport_processed

Merge Commit:

63ad865ce7ec5198a565736e2ffff3dc51b79fbe

Fixed In:

v19.3.0-2922-g63ad865ce7

Released In:

v20.2.0~2643

Upkeep Timestamp:

2025-11-01T01:27:11+00:00

Description

https://pulpito.ceph.com/yuriw-2023-10-16_14:43:00-fs-wip-yuri4-testing-2023-10-11-0735-reef-distro-default-smithi/7429579/

Failed with " unable to create file src/mon/MDSMonitor.cc: File exists":

2023-10-17T01:33:09.933 INFO:tasks.workunit.client.0.smithi062.stderr:Updating files:   0% (17/12853)^MUpdating files:   0% (30/12853)^MUpdating files:   0% (54/12853)^MUpdating files:   1% (129/12853)^MUpdating files:   1% (138/12853)^MUpdating files:   2% (258/12853)^MUpdating files:   2% (298/12853)^MUpdating files:   3% (386/12853)^MUpdating files:   3% (417/12853)^MUpdating files:   3% (490/12853)^MUpdating files:   4% (515/12853)^MUpdating files:   4% (561/12853)^MUpdating files:   4% (627/12853)^MUpdating files:   5% (643/12853)^MUpdating files:   5% (679/12853)^MUpdating files:   5% (722/12853)^MUpdating files:   6% (772/12853)^MUpdating files:   6% (786/12853)^MUpdating files:   6% (853/12853)^MUpdating files:   7% (900/12853)^MUpdating files:   7% (910/12853)^MUpdating files:   7% (977/12853)^MUpdating files:   7% (997/12853)^MUpdating files:   8% (1029/12853)^MUpdating files:   8% (1041/12853)^MUpdating files:   8% (1053/12853)^MUpdating files:   8% (1068/12853)^MUpdating files:   8% (1095/12853)^MUpdating files:   8% (1116/12853)^MUpdating files:   8% (1134/12853)^MUpdating files:   8% (1147/12853)^MUpdating files:   9% (1157/12853)^MUpdating files:   9% (1194/12853)^MUpdating files:   9% (1255/12853)^MUpdating files:  10% (1286/12853)^MUpdating files:  10% (1303/12853)^MUpdating files:  10% (1396/12853)^MUpdating files:  11% (1414/12853)^MUpdating files:  11% (1427/12853)^MUpdating files:  11% (1444/12853)^MUpdating files:  11% (1451/12853)^MUpdating files:  11% (1491/12853)^MUpdating files:  11% (1538/12853)^MUpdating files:  12% (1543/12853)^MUpdating files:  12% (1651/12853)^MUpdating files:  12% (1652/12853)^MUpdating files:  13% (1671/12853)^MUpdating files:  13% (1712/12853)^MUpdating files:  13% (1740/12853)^MUpdating files:  13% (1741/12853)^MUpdating files:  14% (1800/12853)^MUpdating files:  14% (1860/12853)^MUpdating files:  14% (1875/12853)^MUpdating files:  15% (1928/12853)^MUpdating files:  15% (1932/12853)^MUpdating files:  15% (2016/12853)^MUpdating files:  15% (2055/12853)^MUpdating files:  16% (2057/12853)^MUpdating files:  16% (2176/12853)^MUpdating files:  17% (2186/12853)^MUpdating files:  17% (2218/12853)^MUpdating files:  17% (2268/12853)^MUpdating files:  18% (2314/12853)^MUpdating files:  18% (2370/12853)^MUpdating files:  18% (2384/12853)^MUpdating files:  18% (2393/12853)^MUpdating files:  18% (2401/12853)^MUpdating files:  18% (2414/12853)^MUpdating files:  18% (2416/12853)^MUpdating files:  19% (2443/12853)^MUpdating files:  19% (2462/12853)^MUpdating files:  19% (2544/12853)^MUpdating files:  20% (2571/12853)^MUpdating files:  20% (2613/12853)^MUpdating files:  20% (2651/12853)^MUpdating files:  21% (2700/12853)^MUpdating files:  21% (2711/12853)^MUpdating files:  21% (2781/12853)^MUpdating files:  22% (2828/12853)^MUpdating files:  22% (2860/12853)^MUpdating files:  22% (2893/12853)^MUpdating files:  22% (2919/12853)^MUpdating files:  22% (2944/12853)^MUpdating files:  23% (2957/12853)^MUpdating files:  24% (3085/12853)^MUpdating files:  24% (3124/12853)^MUpdating files:  25% (3214/12853)^MUpdating files:  25% (3232/12853)^MUpdating files:  26% (3342/12853)^MUpdating files:  26% (3356/12853)^MUpdating files:  27% (3471/12853)^MUpdating files:  27% (3490/12853)^MUpdating files:  27% (3517/12853)^MUpdating files:  28% (3599/12853)^MUpdating files:  28% (3601/12853)^MUpdating files:  28% (3676/12853)^MUpdating files:  29% (3728/12853)^MUpdating files:  29% (3751/12853)^MUpdating files:  30% (3856/12853)^MUpdating files:  30% (3878/12853)^MUpdating files:  30% (3921/12853)^MUpdating files:  30% (3976/12853)^MUpdating files:  31% (3985/12853)^MUpdating files:  32% (4113/12853)^MUpdating files:  32% (4177/12853)^MUpdating files:  32% (4180/12853)^MUpdating files:  32% (4224/12853)^MUpdating files:  33% (4242/12853)^MUpdating files:  33% (4254/12853)^MUpdating files:  33% (4285/12853)^MUpdating files:  33% (4359/12853)^MUpdating files:  34% (4371/12853)^MUpdating files:  34% (4389/12853)^MUpdating files:  34% (4423/12853)^MUpdating files:  35% (4499/12853)^MUpdating files:  35% (4624/12853)^MUpdating files:  36% (4628/12853)^MUpdating files:  36% (4667/12853)^MUpdating files:  36% (4725/12853)^MUpdating files:  37% (4756/12853)^MUpdating files:  38% (4885/12853)^MUpdating files:  38% (4889/12853)^MUpdating files:  38% (4922/12853)^MUpdating files:  38% (4951/12853)^MUpdating files:  38% (5000/12853)^MUpdating files:  39% (5013/12853)^MUpdating files:  39% (5094/12853)^MUpdating files:  40% (5142/12853)^MUpdating files:  40% (5168/12853)^MUpdating files:  40% (5193/12853)^MUpdating files:  41% (5270/12853)^MUpdating files:  41% (5305/12853)^MUpdating files:  41% (5344/12853)^MUpdating files:  42% (5399/12853)^MUpdating files:  42% (5439/12853)^MUpdating files:  42% (5502/12853)^MUpdating files:  43% (5527/12853)^MUpdating files:  43% (5596/12853)^MUpdating files:  44% (5656/12853)^MUpdating files:  44% (5666/12853)^MUpdating files:  44% (5735/12853)^MUpdating files:  45% (5784/12853)^MUpdating files:  45% (5845/12853)^MUpdating files:  45% (5853/12853)^MUpdating files:  45% (5882/12853)^MUpdating files:  46% (5913/12853)^MUpdating files:  46% (6012/12853)^MUpdating files:  47% (6041/12853)^MUpdating files:  47% (6101/12853)^MUpdating files:  47% (6169/12853)^MUpdating files:  48% (6170/12853)^MUpdating files:  48% (6252/12853)^MUpdating files:  48% (6289/12853)^MUpdating files:  49% (6298/12853)^MUpdating files:  49% (6326/12853)^MUpdating files:  49% (6408/12853)^MUpdating files:  50% (6427/12853)^MUpdating files:  51% (6556/12853)^MUpdating files:  51% (6609/12853)^MUpdating files:  51% (6648/12853)^MUpdating files:  52% (6684/12853)^MUpdating files:  53% (6813/12853)^MUpdating files:  53% (6843/12853)^MUpdating files:  54% (6941/12853)^MUpdating files:  54% (7005/12853)^MUpdating files:  54% (7021/12853)^MUpdating files:  54% (7048/12853)^MUpdating files:  54% (7062/12853)^MUpdating files:  55% (7070/12853)^MUpdating files:  56% (7198/12853)^MUpdating files:  56% (7213/12853)^MUpdating files:  57% (7327/12853)^MUpdating files:  57% (7338/12853)^MUpdating files:  57% (7359/12853)^MUpdating files:  58% (7455/12853)^MUpdating files:  58% (7503/12853)^MUpdating files:  58% (7582/12853)^MUpdating files:  59% (7584/12853)^MUpdating files:  59% (7686/12853)^MUpdating files:  60% (7712/12853)^MUpdating files:  61% (7841/12853)^MUpdating files:  61% (7851/12853)^MUpdating files:  61% (7935/12853)^MUpdating files:  62% (7969/12853)^MUpdating files:  62% (8004/12853)^MUpdating files:  62% (8007/12853)^MUpdating files:  62% (8038/12853)^MUpdating files:  62% (8089/12853)^MUpdating files:  63% (8098/12853)^MUpdating files:  63% (8113/12853)^MUpdating files:  63% (8147/12853)^MUpdating files:  64% (8226/12853)^MUpdating files:  64% (8259/12853)^MUpdating files:  64% (8354/12853)^MUpdating files:  65% (8355/12853)^MUpdating files:  65% (8449/12853)^MUpdating files:  66% (8483/12853)^MUpdating files:  66% (8495/12853)^MUpdating files:  66% (8524/12853)^MUpdating files:  66% (8552/12853)^MUpdating files:  66% (8591/12853)^MUpdating files:  67% (8612/12853)^MUpdating files:  67% (8701/12853)^MUpdating files:  68% (8741/12853)^MUpdating files:  68% (8749/12853)^MUpdating files:  68% (8806/12853)^MUpdating files:  69% (8869/12853)^MUpdating files:  69% (8887/12853)^MUpdating files:  69% (8979/12853)^MUpdating files:  69% (8992/12853)^MUpdating files:  70% (8998/12853)^MUpdating files:  70% (9037/12853)^MUpdating files:  70% (9109/12853)^MUpdating files:  71% (9126/12853)^MUpdating files:  71% (9203/12853)^MUpdating files:  72% (9255/12853)^MUpdating files:  72% (9329/12853)^MUpdating files:  72% (9369/12853)^MUpdating files:  73% (9383/12853)^MUpdating files:  73% (9427/12853)^MUpdating files:  74% (9512/12853)^MUpdating files:  74% (9534/12853)^MUpdating files:  74% (9569/12853)^MUpdating files:  75% (9640/12853)^MUpdating files:  75% (9681/12853)^MUpdating files:  75% (9686/12853)^MUpdating files:  76% (9769/12853)^MUpdating files:  76% (9781/12853)^MUpdating files:  77% (9897/12853)^MUpdating files:  77% (9911/12853)^MUpdating files:  77% (10012/12853)^MUpdating files:  77% (10013/12853)^MUpdating files:  78% (10026/12853)^MUpdating files:  78% (10135/12853)^MUpdating files:  79% (10154/12853)^MUpdating files:  79% (10262/12853)^MUpdating files:  80% (10283/12853)^MUpdating files:  80% (10286/12853)^MUpdating files:  80% (10325/12853)^MUpdating files:  80% (10343/12853)^MUpdating files:  80% (10401/12853)^MUpdating files:  81% (10411/12853)^MUpdating files:  81% (10468/12853)^MUpdating files:  81% (10523/12853)^MUpdating files:  82% (10540/12853)^MUpdating files:  82% (10576/12853)^MUpdating files:  82% (10645/12853)^MUpdating files:  83% (10668/12853)^MUpdating files:  83% (10697/12853)^MUpdating files:  83% (10755/12853)^MUpdating files:  83% (10775/12853)^MUpdating files:  84% (10797/12853)^MUpdating files:  84% (10908/12853)^MUpdating files:  85% (10926/12853)^MUpdating files:  85% (10929/12853)^MUpdating files:  85% (10956/12853)^MUpdating files:  85% (10972/12853)^MUpdating files:  85% (10987/12853)^MUpdating files:  85% (11026/12853)^MUpdating files:  85% (11040/12853)^MUpdating files:  86% (11054/12853)^MUpdating files:  86% (11098/12853)^MUpdating files:  86% (11175/12853)^MUpdating files:  86% (11177/12853)^MUpdating files:  87% (11183/12853)^MUpdating files:  87% (11237/12853)^MUpdating files:  87% (11241/12853)^MUpdating files:  87% (11273/12853)^MUpdating files:  88% (11311/12853)^MUpdating files:  88% (11346/12853)^MUpdating files:  88% (11382/12853)^MUpdating files:  89% (11440/12853)^MUpdating files:  89% (11479/12853)^MUpdating files:  89% (11553/12853)^MUpdating files:  90% (11568/12853)^MUpdating files:  90% (11576/12853)^MUpdating files:  90% (11623/12853)^MUpdating files:  91% (11697/12853)^MUpdating files:  91% (11735/12853)^MUpdating files:  91% (11802/12853)^MUpdating files:  92% (11825/12853)^MUpdating files:  92% (11865/12853)^MUpdating files:  93% (11954/12853)^MUpdating files:  93% (12029/12853)^MUpdating files:  94% (12082/12853)^MUpdating files:  94% (12083/12853)^MUpdating files:  95% (12211/12853)^MUpdating files:  95% (12255/12853)^MUpdating files:  95% (12286/12853)^MUpdating files:  95% (12328/12853)^MUpdating files:  96% (12339/12853)^MUpdating files:  96% (12400/12853)^MUpdating files:  97% (12468/12853)^MUpdating files:  98% (12596/12853)^MUpdating files:  98% (12609/12853)^MUpdating files:  98% (12610/12853)^MUpdating files:  98% (12663/12853)^MUpdating files:  99% (12725/12853)^Merror: unable to create file src/mon/MDSMonitor.cc: File exists
...
2023-10-17T01:33:13.535 INFO:tasks.workunit.client.0.smithi062.stderr:Updating files:  99% (12726/12853)^MUpdating files:  99% (12727/12853)^MUpdating files:  99% (12783/12853)^MUpdating files:  99% (12846/12853)^MUpdating files: 100% (12853/12853)^MUpdating files: 100% (12853/12853), done.
2023-10-17T01:33:13.535 INFO:tasks.workunit.client.0.smithi062.stderr:fatal: Could not reset index file to revision 'v0.1'.
2023-10-17T01:33:13.548 DEBUG:teuthology.orchestra.run:got remote process result: 128
2023-10-17T01:33:13.549 INFO:tasks.workunit:Stopping ['fs/snaps'] on client.0...
2023-10-17T01:33:13.550 DEBUG:teuthology.orchestra.run.smithi062:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2023-10-17T01:33:14.545 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_8cdab074dcca9a68965bc5a50e9c30b691949723/teuthology/run_tasks.py", line 105, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_teuthology_8cdab074dcca9a68965bc5a50e9c30b691949723/teuthology/run_tasks.py", line 83, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_ab9fe43522bc41de59746200b95f65d4fbcccffe/qa/tasks/workunit.py", line 145, in task
    _spawn_on_all_clients(ctx, refspec, all_tasks, config.get('env'),
  File "/home/teuthworker/src/github.com_ceph_ceph-c_ab9fe43522bc41de59746200b95f65d4fbcccffe/qa/tasks/workunit.py", line 295, in _spawn_on_all_clients
    p.spawn(_run_tests, ctx, refspec, role, [unit], env,
  File "/home/teuthworker/src/git.ceph.com_teuthology_8cdab074dcca9a68965bc5a50e9c30b691949723/teuthology/parallel.py", line 84, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_teuthology_8cdab074dcca9a68965bc5a50e9c30b691949723/teuthology/parallel.py", line 98, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_teuthology_8cdab074dcca9a68965bc5a50e9c30b691949723/teuthology/parallel.py", line 30, in resurrect_traceback
    raise exc.exc_info[1]
  File "/home/teuthworker/src/git.ceph.com_teuthology_8cdab074dcca9a68965bc5a50e9c30b691949723/teuthology/parallel.py", line 23, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_ab9fe43522bc41de59746200b95f65d4fbcccffe/qa/tasks/workunit.py", line 424, in _run_tests
    remote.run(
  File "/home/teuthworker/src/git.ceph.com_teuthology_8cdab074dcca9a68965bc5a50e9c30b691949723/teuthology/orchestra/remote.py", line 522, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_teuthology_8cdab074dcca9a68965bc5a50e9c30b691949723/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_teuthology_8cdab074dcca9a68965bc5a50e9c30b691949723/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_teuthology_8cdab074dcca9a68965bc5a50e9c30b691949723/teuthology/orchestra/run.py", line 181, in _raise_for_status
    raise CommandFailedError(
teuthology.exceptions.CommandFailedError: Command failed (workunit test fs/snaps/snaptest-git-ceph.sh) on smithi062 with status 128: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=ab9fe43522bc41de59746200b95f65d4fbcccffe TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 6h /home/ubuntu/cephtest/clone.client.0/qa/workunits/fs/snaps/snaptest-git-ceph.sh'

Related issues 4 (0 open — 4 closed)

Actions

Copy link

Updated by Xiubo Li over 2 years ago

Description updated (diff)

Actions

Copy link

Updated by Milind Changire over 2 years ago

Assignee set to Xiubo Li

Actions

Copy link

Updated by Xiubo Li over 2 years ago

reef: https://pulpito.ceph.com/yuriw-2023-10-24_00:06:34-fs-wip-yuri6-testing-2023-10-23-1148-reef-distro-default-smithi/7435313/

Actions

Copy link

Updated by Xiubo Li over 2 years ago

Xiubo Li wrote:

https://pulpito.ceph.com/yuriw-2023-10-16_14:43:00-fs-wip-yuri4-testing-2023-10-11-0735-reef-distro-default-smithi/7429579/

Failed with " unable to create file src/mon/MDSMonitor.cc: File exists":

[...]

From the mds.a log file, the src/mon/MDSMonitor.cc file was successfully created without any issue:

2023-10-17T01:32:43.352+0000 7f28bdbf6700 20 mds.0.server lat 0.001938
2023-10-17T01:32:43.352+0000 7f28b6be8700  5 mds.0.log _submit_thread 354800868~4258 : EUpdate unlink_local [metablob 0x10000002e46.1*, 4 dirs]
2023-10-17T01:32:43.352+0000 7f28bdbf6700  1 -- [v2:172.21.15.62:6832/3723432109,v1:172.21.15.62:6833/3723432109] <== client.4676 172.21.15.62:0/1312196696 21204 ==== client_request(client.4676:65053 create #0x1000000392c/MDSMonitor.cc 2023-10-17T01:32:43.352476+0000 caller_uid=1000, caller_gid=1285{6,36,1000,1285,}) v6 ==== 201+0+62 (crc 0 0 0) 0x5568b2151b00 con 0x5568b2383400
2023-10-17T01:32:43.352+0000 7f28bdbf6700  4 mds.0.server handle_client_request client_request(client.4676:65053 create #0x1000000392c/MDSMonitor.cc 2023-10-17T01:32:43.352476+0000 caller_uid=1000, caller_gid=1285{6,36,1000,1285,}) v6
2023-10-17T01:32:43.352+0000 7f28bdbf6700 20 mds.0.89 get_session have 0x5568ae392800 client.4676 172.21.15.62:0/1312196696 state open
2023-10-17T01:32:43.352+0000 7f28bdbf6700 15 mds.0.server  oldest_client_tid=64732
2023-10-17T01:32:43.352+0000 7f28bdbf6700  7 mds.0.cache request_start request(client.4676:65053 nref=2 cr=0x5568b2151b00)
2023-10-17T01:32:43.352+0000 7f28bdbf6700  7 mds.0.server dispatch_client_request client_request(client.4676:65053 create #0x1000000392c/MDSMonitor.cc 2023-10-17T01:32:43.352476+0000 caller_uid=1000, caller_gid=1285{6,36,1000,1285,}) v6
2023-10-17T01:32:43.352+0000 7f28bdbf6700  7 mds.0.server open w/ O_CREAT on #0x1000000392c/MDSMonitor.cc
2023-10-17T01:32:43.352+0000 7f28bdbf6700 10 mds.0.server rdlock_path_xlock_dentry request(client.4676:65053 nref=2 cr=0x5568b2151b00) #0x1000000392c/MDSMonitor.cc
2023-10-17T01:32:43.352+0000 7f28bdbf6700  7 mds.0.cache traverse: opening base ino 0x1000000392c snap head
2023-10-17T01:32:43.352+0000 7f28bdbf6700 10 mds.0.locker found lock cache for create on [inode 0x1000000392c [...1fa,head] /client.0/tmp/ceph/src/mon/ auth v7336 pv10235 ap=10 DIRTYPARENT f(v0 m2023-10-17T01:31:19.978737+0000 16=16+0)->f(v0 m2023-10-17T01:32:43.350476+0000 17=17+0) n(v0 rc2023-10-17T01:31:19.978737+0000 b1086509 17=16+1)->n(v0 rc2023-10-17T01:32:43.350476+0000 b981500 18=17+1) (iauth sync r=1) (isnap sync r=2) (inest lock w=3) (ipolicy sync r=1) (ifile excl w=2) (iversion lock w=8 last_client=4676) caps={4676=pAsLsXsFsxcr/pAsLsXsFsxcral@106},l=4676 | request=0 lock=6 dirfrag=1 caps=1 dirtyparent=1 dirty=0 waiter=0 authpin=1 0x5568b10dfb80]
2023-10-17T01:32:43.353+0000 7f28bdbf6700 12 mds.0.cache traverse: path seg depth 0 'MDSMonitor.cc' snapid head
2023-10-17T01:32:43.353+0000 7f28bdbf6700 20 mds.0.cache.dir(0x1000000392c) lookup (MDSMonitor.cc, 'head')
2023-10-17T01:32:43.353+0000 7f28bdbf6700 20 mds.0.cache.dir(0x1000000392c)   hit -> (MDSMonitor.cc,head)
...
2023-10-17T01:32:43.353+0000 7f28bdbf6700 10 mds.0.server early_reply 0 ((0) Success) client_request(client.4676:65053 create #0x1000000392c/MDSMonitor.cc 2023-10-17T01:32:43.352476+0000 caller_uid=1000, caller_gid=1285{6,36,1000,1285,}) v6
...
2023-10-17T01:32:43.472+0000 7f28b7bea700 20 mds.0.bal hit_dir 1 pop is 8.99723, frag * size 18 [pop IRD:[C 2.95e+00] IWR:[C 9.00e+00] RDR:[C 0.00e+00] FET:[C 0.00e+00] STR:[C 0.00e+00] *LOAD:20.9]
2023-10-17T01:32:43.472+0000 7f28b7bea700  7 mds.0.server reply_client_request 0 ((0) Success) client_request(client.4676:65053 create #0x1000000392c/MDSMonitor.cc 2023-10-17T01:32:43.352476+0000 caller_uid=1000, caller_gid=1285{6,36,1000,1285,}) v6

Xiubo Li wrote:

reef: https://pulpito.ceph.com/yuriw-2023-10-24_00:06:34-fs-wip-yuri6-testing-2023-10-23-1148-reef-distro-default-smithi/7435313/

And for this test failure, there is not any log about failing to create any file when resetting head to v0.74:

2023-10-24T23:28:47.491 INFO:teuthology.orchestra.run.smithi029.stderr:2023-10-24T23:28:47.483+0000 7f4b16cc1700  1 -- 172.21.15.29:0/989232246 shutdown_connections
2023-10-24T23:28:47.491 INFO:teuthology.orchestra.run.smithi029.stderr:2023-10-24T23:28:47.483+0000 7f4b16cc1700  1 -- 172.21.15.29:0/989232246 wait complete.
2023-10-24T23:28:47.492 INFO:teuthology.orchestra.run.smithi029.stderr:dumped fsmap epoch 193
2023-10-24T23:28:47.508 INFO:tasks.mds_thrash.fs.[a]:waiting till mds map indicates mds.c is in active, standby or standby-replay
2023-10-24T23:28:48.300 INFO:tasks.workunit.client.0.smithi029.stderr:Updating files:  46% (90/192)^MUpdating files:  47% (91/192)^MUpdating files:  48% (93/192)^MUpdating files:  49% (95/192)^MUpdating files:  50% (96/192)^MUpdating files:  51% (98/192)^MUpdating files:  52% (100/192)^MUpdating files:  53% (102/192)^MUpdating files:  54% (104/192)^MUpdating files:  55% (106/192)^MUpdating files:  56% (108/192)^MUpdating files:  57% (110/192)^MUpdating files:  58% (112/192)^MUpdating files:  59% (114/192)^MUpdating files:  60% (116/192)^MUpdating files:  61% (118/192)^MUpdating files:  62% (120/192)^MUpdating files:  63% (121/192)^MUpdating files:  64% (123/192)^MUpdating files:  65% (125/192)^MUpdating files:  66% (127/192)^MUpdating files:  67% (129/192)^MUpdating files:  68% (131/192)^MUpdating files:  69% (133/192)^MUpdating files:  70% (135/192)^MUpdating files:  71% (137/192)^MUpdating files:  72% (139/192)^MUpdating files:  73% (141/192)^MUpdating files:  74% (143/192)^MUpdating files:  75% (144/192)^MUpdating files:  76% (146/192)^MUpdating files:  77% (148/192)^MUpdating files:  78% (150/192)^MUpdating files:  79% (152/192)^MUpdating files:  80% (154/192)^MUpdating files:  81% (156/192)^MUpdating files:  82% (158/192)^MUpdating files:  83% (160/192)^MUpdating files:  84% (162/192)^MUpdating files:  85% (164/192)^MUpdating files:  86% (166/192)^MUpdating files:  87% (168/192)^MUpdating files:  88% (169/192)^MUpdating files:  89% (171/192)^MUpdating files:  89% (172/192)^MUpdating files:  90% (173/192)^MUpdating files:  91% (175/192)^MUpdating files:  92% (177/192)^MUpdating files:  93% (179/192)^MUpdating files:  94% (181/192)^MUpdating files:  95% (183/192)^MUpdating files:  96% (185/192)^MUpdating files:  97% (187/192)^MUpdating files:  98% (189/192)^MUpdating files:  99% (191/192)^MUpdating files: 100% (192/192)^MUpdating files: 100% (192/192), done.
2023-10-24T23:28:48.300 INFO:tasks.workunit.client.0.smithi029.stderr:fatal: Could not reset index file to revision 'v0.74'.
2023-10-24T23:28:48.312 DEBUG:teuthology.orchestra.run:got remote process result: 128
2023-10-24T23:28:48.313 INFO:tasks.workunit:Stopping ['fs/snaps'] on client.0...

And also I didn't find any failure from mds logs and kclient logs.

While for both the above two failures I could see the git command itself crashed:

2023-10-17T01:40:07.786 INFO:teuthology.task.internal:Transferring binaries for coredumps...
2023-10-17T01:40:07.826 INFO:teuthology.task.internal: core looks like: /home/teuthworker/archive/yuriw-2023-10-16_14:43:00-fs-wip-yuri4-testing-2023-10-11-0735-reef-distro-default-smithi/7429579/remote/smithi183/coredump/1697506436.127268.core: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), SVR4-style, from 'git reset --hard v0.7', real uid: 1000, effective uid: 1000, real gid: 1285, effective gid: 1285, execfn: '/usr/bin/git', platform: 'x86_64'

And

2023-10-24T23:38:23.501 INFO:teuthology.task.internal:Transferring binaries for coredumps...
2023-10-24T23:38:23.780 INFO:teuthology.task.internal: core looks like: /home/teuthworker/archive/yuriw-2023-10-24_00:06:34-fs-wip-yuri6-testing-2023-10-23-1148-reef-distro-default-smithi/7435313/remote/smithi186/coredump/1698190173.43665.core: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), SVR4-style, from 'git reset --hard v0.86', real uid: 1000, effective uid: 1000, real gid: 1267, effective gid: 1267, execfn: '/usr/bin/git', platform: 'x86_64'

I just suspect this should be a git tool's bug.

Actions

Copy link

Updated by Xiubo Li over 2 years ago

Status changed from New to In Progress

Actions

Copy link

Updated by Venky Shankar about 2 years ago

Xiubo, a pretty similar issue got hit here: https://pulpito.ceph.com/vshankar-2024-01-10_15:00:23-fs-wip-vshankar-testing-20240103.072409-1-testing-default-smithi/7511467/

However, this instance the errno is EIO and there is no git binary crash dump.

2024-01-11T17:50:09.316 INFO:tasks.workunit.client.0.smithi064.stderr:Updating files:  59% (114/193)^MUpdating files:  60% (116/193)^MUpdating files:  61% (118/193)^MUpdating files:  62% (120/193)^MUpdating files:  63% (122/193)^MUpdating files:  64% (124/193)^MUpdating files:  65% (126/193)^MUpdating files:  66% (128/193)^MUpdating files:  67% (130/193)^MUpdating files:  68% (132/193)^MUpdating files:  69% (134/193)^MUpdating files:  70% (136/193)^MUpdating files:  71% (138/193)^MUpdating files:  72% (139/193)^MUpdating files:  73% (141/193)^MUpdating files:  74% (143/193)^MUpdating files:  75% (145/193)^MUpdating files:  76% (147/193)^MUpdating files:  77% (149/193)^MUpdating files:  78% (151/193)^MUpdating files:  79% (153/193)^MUpdating files:  80% (155/193)^MUpdating files:  81% (157/193)^MUpdating files:  82% (159/193)^MUpdating files:  83% (161/193)^MUpdating files:  84% (163/193)^Merror: unable to create file src/osd/OSD.cc: Input/output error
2024-01-11T17:50:09.317 INFO:tasks.workunit.client.0.smithi064.stderr:error: unable to create file src/osd/OSD.h: Input/output error

Nothing in MDS or the kernel ring buffer.

Actions

Copy link

Updated by Xiubo Li about 2 years ago

Venky Shankar wrote:

Xiubo, a pretty similar issue got hit here: https://pulpito.ceph.com/vshankar-2024-01-10_15:00:23-fs-wip-vshankar-testing-20240103.072409-1-testing-default-smithi/7511467/

However, this instance the errno is EIO and there is no git binary crash dump.

[...]

Nothing in MDS or the kernel ring buffer.

Okay, let me have a look later. Thanks!

Actions

Copy link

Updated by Venky Shankar about 2 years ago

/a/vshankar-2024-01-22_07:03:31-fs-wip-vshankar-testing-20240119.075157-1-testing-default-smithi/7525694

Actions

Copy link

Updated by Rishabh Dave about 2 years ago

https://pulpito.ceph.com/rishabh-2024-03-08_19:54:47-fs-rishabh-2024mar8-testing-default-smithi/7588250.

Same error occurred on this job but with a different version number.

Following is the log entry - stderr:fatal: Could not reset index file to revision 'v0.64'.

Actions

Copy link

#10

Updated by Venky Shankar about 2 years ago

/a/yuriw-2024-03-16_15:03:17-fs-wip-yuri10-testing-2024-03-15-1653-reef-distro-default-smithi/7606232

Failure is with a fuse client. Nothing in client logs that point to an Input/output error.

Actions

Copy link

#11

Updated by Xiubo Li almost 2 years ago

Rishabh Dave wrote in #note-9:

https://pulpito.ceph.com/rishabh-2024-03-08_19:54:47-fs-rishabh-2024mar8-testing-default-smithi/7588250.

Same error occurred on this job but with a different version number.

Following is the log entry - stderr:fatal: Could not reset index file to revision 'v0.64'.

From teuthology.log it failed with creating the ceph_fuse.cc file:

2024-03-08T22:59:47.344 INFO:tasks.workunit.client.0.smithi039.stdout:HEAD is now at 054e96cf79e v0.63
2024-03-08T22:59:47.346 INFO:tasks.workunit.client.0.smithi039.stderr:+ mkdir .snap/v0.63
2024-03-08T22:59:47.412 INFO:tasks.workunit.client.0.smithi039.stdout:v0.64
2024-03-08T22:59:47.413 INFO:tasks.workunit.client.0.smithi039.stderr:+ for v in $versions
2024-03-08T22:59:47.413 INFO:tasks.workunit.client.0.smithi039.stderr:+ '[' 64 -eq 48 ']'
2024-03-08T22:59:47.413 INFO:tasks.workunit.client.0.smithi039.stderr:+ ver=v0.64
2024-03-08T22:59:47.413 INFO:tasks.workunit.client.0.smithi039.stderr:+ echo v0.64
2024-03-08T22:59:47.413 INFO:tasks.workunit.client.0.smithi039.stderr:+ git reset --hard v0.64
...
2024-03-08T23:00:13.070 INFO:tasks.workunit.client.0.smithi039.stderr:error: unable to create file src/ceph_fuse.cc: File exists
...
2024-03-08T23:00:18.314 INFO:tasks.workunit.client.0.smithi039.stderr:Updating files:  22% (32/143)^MUpdating files:  23% (33/143)^MUpdating files:  24% (35/143)^MUpdating files:  25% (36/143)^MUpdating files:  26% (38/143)^MUpdating files:  27% (39/143)^MUpdating files:  28% (41/143)^MUpdating files:  29% (42/143)^MUpdating files:  30% (43/143)^MUpdating files:  31% (45/143)^MUpdating files:  32% (46/143)^MUpdating files:  32% (47/143)^MUpdating files:  33% (48/143)^MUpdating files:  34% (49/143)^MUpdating files:  35% (51/143)^MUpdating files:  36% (52/143)^MUpdating files:  37% (53/143)^MUpdating files:  38% (55/143)^MUpdating files:  39% (56/143)^MUpdating files:  40% (58/143)^MUpdating files:  41% (59/143)^MUpdating files:  42% (61/143)^MUpdating files:  43% (62/143)^MUpdating files:  44% (63/143)^MUpdating files:  45% (65/143)^MUpdating files:  46% (66/143)^MUpdating files:  46% (67/143)^MUpdating files:  47% (68/143)^MUpdating files:  48% (69/143)^MUpdating files:  49% (71/143)^MUpdating files:  50% (72/143)^MUpdating files:  51% (73/143)^MUpdating files:  52% (75/143)^MUpdating files:  53% (76/143)^MUpdating files:  54% (78/143)^MUpdating files:  55% (79/143)^MUpdating files:  56% (81/143)^MUpdating files:  57% (82/143)^MUpdating files:  58% (83/143)^MUpdating files:  59% (85/143)^MUpdating files:  60% (86/143)^MUpdating files:  61% (88/143)^MUpdating files:  62% (89/143)^MUpdating files:  63% (91/143)^MUpdating files:  64% (92/143)^MUpdating files:  65% (93/143)^MUpdating files:  66% (95/143)^MUpdating files:  67% (96/143)^MUpdating files:  68% (98/143)^MUpdating files:  69% (99/143)^MUpdating files:  70% (101/143)^MUpdating files:  71% (102/143)^MUpdating files:  72% (103/143)^MUpdating files:  73% (105/143)^MUpdating files:  74% (106/143)^MUpdating files:  75% (108/143)^MUpdating files:  76% (109/143)^MUpdating files:  77% (111/143)^MUpdating files:  78% (112/143)^MUpdating files:  79% (113/143)^MUpdating files:  80% (115/143)^MUpdating files:  81% (116/143)^MUpdating files:  82% (118/143)^MUpdating files:  83% (119/143)^MUpdating files:  84% (121/143)^MUpdating files:  85% (122/143)^MUpdating files:  86% (123/143)^MUpdating files:  87% (125/143)^MUpdating files:  88% (126/143)^MUpdating files:  89% (128/143)^MUpdating files:  90% (129/143)^MUpdating files:  91% (131/143)^MUpdating files:  92% (132/143)^MUpdating files:  93% (133/143)^MUpdating files:  94% (135/143)^MUpdating files:  95% (136/143)^MUpdating files:  96% (138/143)^MUpdating files:  97% (139/143)^MUpdating files:  98% (141/143)^MUpdating files:  99% (142/143)^MUpdating files: 100% (143/143)^MUpdating files: 100% (143/143), done.
2024-03-08T23:00:18.315 INFO:tasks.workunit.client.0.smithi039.stderr:fatal: Could not reset index file to revision 'v0.64'.
2024-03-08T23:00:18.320 DEBUG:teuthology.orchestra.run:got remote process result: 128
2024-03-08T23:00:18.321 INFO:tasks.workunit:Stopping ['fs/snaps'] on client.0...

Actually the create #0x10000002e89/ceph_fuse.cc request had succeeded just before the mds was failed over:

2024-03-08T22:59:48.971+0000 7fbdde1d4640  1 -- [v2:172.21.15.39:6834/1247150009,v1:172.21.15.39:6835/1247150009] <== client.4559 v1:172.21.15.39:0/38623762 4477 ==== client_request(client.4559:104000 create owner_uid=1000, owner_gid=1301 #0x10000002e89/ceph_fuse.cc 2024-03-08T22:59:48.971091+0000 caller_uid=1000, caller_gid=1301{6,36,1000,1301,}) v4 ==== 180+0+65 (unknown 1253445394 0 2155696498) 0x5564dd02d500 con 0x5564d8211c00
2024-03-08T22:59:48.971+0000 7fbdde1d4640  4 mds.0.server handle_client_request client_request(client.4559:104000 create owner_uid=1000, owner_gid=1301 #0x10000002e89/ceph_fuse.cc 2024-03-08T22:59:48.971091+0000 caller_uid=1000, caller_gid=1301{6,36,1000,1301,}) v4
2024-03-08T22:59:48.971+0000 7fbdde1d4640 20 mds.0.151 get_session have 0x5564d5b73200 client.4559 v1:172.21.15.39:0/38623762 state open
2024-03-08T22:59:48.971+0000 7fbdde1d4640 15 mds.0.server  oldest_client_tid=103934
2024-03-08T22:59:48.971+0000 7fbdd69c5640  5 mds.0.log _submit_thread 1500400575~55420 : EUpdate unlink_local [metablob 0x100000017b0.000*, 4 dirs]
2024-03-08T22:59:48.971+0000 7fbdde1d4640  7 mds.0.cache request_start request(client.4559:104000 nref=2 cr=0x5564dd02d500)
2024-03-08T22:59:48.971+0000 7fbdde1d4640  7 mds.0.server dispatch_client_request client_request(client.4559:104000 create owner_uid=1000, owner_gid=1301 #0x10000002e89/ceph_fuse.cc 2024-03-08T22:59:48.971091+0000 caller_uid=1000, caller_gid=1301{6,36,1000,1301,}) v4
2024-03-08T22:59:48.971+0000 7fbdde1d4640  7 mds.0.server open w/ O_CREAT on #0x10000002e89/ceph_fuse.cc
2024-03-08T22:59:48.971+0000 7fbdde1d4640 10 mds.0.server rdlock_path_xlock_dentry request(client.4559:104000 nref=2 cr=0x5564dd02d500) #0x10000002e89/ceph_fuse.cc
2024-03-08T22:59:48.971+0000 7fbdde1d4640  7 mds.0.cache traverse: opening base ino 0x10000002e89 snap head
2024-03-08T22:59:48.971+0000 7fbdde1d4640 10 mds.0.locker found lock cache for create on [inode 0x10000002e89 [...238,head] /client.0/tmp/ceph/src/ auth fragtree_t(*^1 0*^3 0000*^3 0001*^2 000100*^1 000101*^1 000110*^1 0010*^2 001000*^1 001010*^1 001011*^1 0011*^1 00110*^1 001101*^1 0101*^3 0110*^2 011001*^1 011011*^1 1*^1 10*^3 10100*^1 101000*^1 1010000*^1 10110*^1 101101*^1 1011011*^1 10110111*^2 1011011100*^1 1011011101*^1 1011011111*^1 11*^3 11000*^1 11001*^1 11010*^1 11100*^1 11101*^1 111010*^1 1110101*^1 11101011*^1 11110*^1) v18649 pv18659 ap=7 f(v9 m2024-03-08T22:59:46.602138+0000 111=74+37)->f(v9 m2024-03-08T22:59:48.969091+0000 110=73+37) n(v124 rc2024-03-08T22:59:46.603138+0000 b13653799 1559=1437+122)->n(v124 rc2024-03-08T22:59:48.969091+0000 b13521593 1558=1436+122) old_inodes=62 (iauth sync r=1) (isnap sync r=4) (inest lock w=4 dirty) (ipolicy sync r=2) (ifile excl w=2) (iversion lock w=5 last_client=4559) caps={4559=pAsLsXsFsxcr/pAsLsXsFsxcral@255},l=4559 | dirtyscattered=1 request=0 lock=6 dirfrag=79 caps=1 dirtyrstat=0 dirty=1 authpin=1 0x5564d72da580]
2024-03-08T22:59:48.971+0000 7fbdde1d4640 12 mds.0.cache traverse: path seg depth 0 'ceph_fuse.cc' snapid head
2024-03-08T22:59:48.971+0000 7fbdde1d4640 20 mds.0.cache.dir(0x10000002e89.0110110*) lookup (ceph_fuse.cc, 'head')
2024-03-08T22:59:48.971+0000 7fbdde1d4640 20 mds.0.cache.dir(0x10000002e89.0110110*)   hit -> (ceph_fuse.cc,head)
2024-03-08T22:59:48.971+0000 7fbdde1d4640 10 mds.0.locker acquire_locks request(client.4559:104000 nref=3 cr=0x5564dd02d500)
...
2024-03-08T22:59:48.972+0000 7fbdde1d4640 10 mds.0.server adding created_ino and delegated_inos
2024-03-08T22:59:48.972+0000 7fbdde1d4640 10 mds.0.server journal_and_reply tracei 0x5564d98e6580 tracedn 0x5564d85a8500
2024-03-08T22:59:48.972+0000 7fbdde1d4640 10 mds.0.locker set_xlocks_done on (dn xlock x=2 by 0x5564d6a9b680) [dentry #0x1/client.0/tmp/ceph/src/ceph_fuse.cc [238,head] auth (dn xlock x=2 by 0x5564d6a9b680) (dversion lock w=2 last_client=4559) pv=16732 v=16686 ap=3 ino=0x100000081b9 state=1073741824 | request=2 lock=3 inodepin=1 dirty=0 authpin=1 0x5564d85a8500]
2024-03-08T22:59:48.972+0000 7fbdde1d4640 10 mds.0.server early_reply 0 ((0) Success) client_request(client.4559:104000 create owner_uid=1000, owner_gid=1301 #0x10000002e89/ceph_fuse.cc 2024-03-08T22:59:48.971091+0000 caller_uid=1000, caller_gid=1301{6,36,1000,1301,}) v4
...
lock=2 inodepin=1 dirty=1 authpin=1 0x5564d85a8500]
2024-03-08T22:59:49.081+0000 7fbdd79c7640 20 mds.0.bal hit_dir 1 pop is 1.99972, frag 0110110* size 2 [pop IRD:[C 1.52e+00] IWR:[C 2.00e+00] RDR:[C 0.00e+00] FET:[C 0.00e+00] STR:[C 1.23e-01] *LOAD:6.0]
2024-03-08T22:59:49.081+0000 7fbdd79c7640  7 mds.0.server reply_client_request 0 ((0) Success) client_request(client.4559:104000 create owner_uid=1000, owner_gid=1301 #0x10000002e89/ceph_fuse.cc 2024-03-08T22:59:48.971091+0000 caller_uid=1000, caller_gid=1301{6,36,1000,1301,}) v4
2024-03-08T22:59:49.081+0000 7fbdd79c7640 10 mds.0.server apply_allocated_inos 0x0 / [] / 0x100000088a9
2024-03-08T22:59:49.081+0000 7fbdd79c7640 20 mds.0.sessionmap mark_dirty s=0x5564d5b73200 name=client.4559 v=34559
...
2024-03-08T22:59:49.081+0000 7fbdd89c9640 10 mds.0.purge_queue _consume:  cannot consume right now
2024-03-08T22:59:49.081+0000 7fbdd79c7640  1 -- [v2:172.21.15.39:6834/1247150009,v1:172.21.15.39:6835/1247150009] --> v1:172.21.15.39:0/38623762 -- client_reply(???:104000 = 0 (0) Success safe) v1 -- 0x5564d69d1600 con 0x5564d8211c00
2024-03-08T22:59:49.081+0000 7fbdd79c7640  7 mds.0.cache request_finish request(client.4559:104000 nref=2 cr=0x5564dd02d500)
2024-03-08T22:59:49.081+0000 7fbdd89c9640 10 mds.0.purge_queue _execute_item_complete: complete at 0x77b577
2024-03-08T22:59:49.081+0000 7fbdd89c9640 10 mds.0.purge_queue _execute_item_complete: non-sequential completion, not expiring anything
2024-03-08T22:59:49.081+0000 7fbdd79c7640 15 mds.0.cache request_cleanup request(client.4559:104000 nref=2 cr=0x5564dd02d500)
2024-03-08T22:59:49.081+0000 7fbdd89c9640 10 mds.0.purge_queue _execute_item_complete: completed item for ino 0x100000087a0
2024-03-08T22:59:49.081+0000 7fbdd89c9640 10 mds.0.purge_queue _execute_item_complete: in_flight.size() now 31

But in the new mds the request was handled again even it was marked as completed:

2024-03-08T23:00:10.838+0000 7f6a026aa640 20 mds.0.cache.dir(0x10000002e89.0110110*) lookup_exact_snap (head, 'ceph_fuse.cc')
2024-03-08T23:00:10.838+0000 7f6a026aa640 10 mds.0.journal EMetaBlob.replay for [238,head] had [dentry #0x1/client.0/tmp/ceph/src/ceph_fuse.cc [238,head] auth NULL (dversion lock) v=16732 ino=(nil) state=1610612736 | inodepin=0 dirty=1 0x55dc6bfd4780]
2024-03-08T23:00:10.838+0000 7f6a026aa640 12 mds.0.cache.dir(0x10000002e89.0110110*) link_primary_inode [dentry #0x1/client.0/tmp/ceph/src/ceph_fuse.cc [238,head] auth NULL (dversion lock) v=16732 ino=(nil) state=1610612736 | inodepin=0 dirty=1 0x55dc6bfd4780] [inode 0x100000088a9 [238,head] #100000088a9 auth v16732 s=0 n(v0 1=1+0) (iversion lock) cr={4559=0-4194304@237} 0x55dc69c78680]
2024-03-08T23:00:10.838+0000 7f6a026aa640 10 mds.0.journal EMetaBlob.replay added [inode 0x100000088a9 [238,head] /client.0/tmp/ceph/src/ceph_fuse.cc auth v16732 s=0 n(v0 1=1+0) (iversion lock) cr={4559=0-4194304@237} 0x55dc69c78680]
2024-03-08T23:00:10.838+0000 7f6a026aa640 10 mds.0.cache.ino(0x100000088a9) mark_dirty_parent
2024-03-08T23:00:10.838+0000 7f6a026aa640 10 mds.0.journal EMetaBlob.replay noting opened inode [inode 0x100000088a9 [238,head] /client.0/tmp/ceph/src/ceph_fuse.cc auth v16732 DIRTYPARENT s=0 n(v0 1=1+0) (iversion lock) cr={4559=0-4194304@237} | dirtyparent=1 dirty=1 0x55dc69c78680]
2024-03-08T23:00:10.838+0000 7f6a026aa640 10 mds.0.journal EMetaBlob.replay inotable tablev 73 <= table 73
2024-03-08T23:00:10.838+0000 7f6a026aa640 10 mds.0.journal EMetaBlob.replay sessionmap v 34560, table 34559 prealloc [] used 0x100000088a9
2024-03-08T23:00:10.838+0000 7f6a026aa640 20 mds.0.journal  (session prealloc [0x10000007dff~0x79,0x10000008310~0xfb,0x100000086a6~0x68,0x100000087aa~0x13,0x100000088a9~0x24e])
2024-03-08T23:00:10.838+0000 7f6a026aa640  5 mds.0.journal received ino 0x100000088a9 from the session
2024-03-08T23:00:10.838+0000 7f6a026aa640 20 mds.0.sessionmap replay_dirty_session s=0x55dc69663e00 name=client.4559 v=34559
2024-03-08T23:00:10.838+0000 7f6a026aa640 10 mds.0.journal EMetaBlob.replay request client.4559:104000 trim_to 103934
2024-03-08T23:00:10.838+0000 7f6a026aa640 10 mds.0.log _replay: read_pos == write_pos
2024-03-08T23:00:10.838+0000 7f6a026aa640 10 mds.0.log _replay - complete, 34584 events
2024-03-08T23:00:10.838+0000 7f6a026aa640 10 mds.0.log _replay_thread kicking waiters
2024-03-08T23:00:10.838+0000 7f6a026aa640 10 MDSContext::complete: 15C_MDS_BootStart
2024-03-08T23:00:10.838+0000 7f6a026aa640  1 mds.0.156 Finished replaying journal
...
2024-03-08T23:00:10.977+0000 7f6a0a6ba640 10 mds.d map says I am mds.0.156 state up:reconnect
2024-03-08T23:00:10.977+0000 7f6a0a6ba640 10 mds.d msgr says I am [v2:172.21.15.173:6835/3994328214,v1:172.21.15.173:6837/3994328214]
2024-03-08T23:00:10.977+0000 7f6a0a6ba640 10 mds.d handle_mds_map: handling map as rank 0
2024-03-08T23:00:10.977+0000 7f6a0a6ba640  1 mds.0.156 handle_mds_map I am now mds.0.156
2024-03-08T23:00:10.977+0000 7f6a0a6ba640  1 mds.0.156 handle_mds_map state change up:replay --> up:reconnect
2024-03-08T23:00:10.977+0000 7f6a0a6ba640  1 mds.0.156 reconnect_start
2024-03-08T23:00:10.977+0000 7f6a0a6ba640  1 mds.0.156 reopen_log
2024-03-08T23:00:10.977+0000 7f6a0a6ba640 10 mds.0.cache rollback_uncommitted_fragments: 0 pending
2024-03-08T23:00:10.977+0000 7f6a0a6ba640 10 mds.0.server apply_blocklist: killed 0
2024-03-08T23:00:10.977+0000 7f6a0a6ba640  4 mds.0.156 apply_blocklist: killed 0, blocklisted sessions (54 blocklist entries, 1)
2024-03-08T23:00:10.977+0000 7f6a0a6ba640  1 mds.0.server reconnect_clients -- 1 sessions
2024-03-08T23:00:10.977+0000 7f6a0a6ba640 10 mds.0.sessionmap dump
2024-03-08T23:00:10.977+0000 7f6a0a6ba640 10 mds.0.sessionmap client.4559 0x55dc69663e00 state open completed {103934=0x0,103935=0x1000000888f,103936=0x0,103937=0x10000008890,103938=0x0,103939=0x10000008891,103940=0x0,103941=0x10000008892,103942=0x0,103943=0x10000008893,103944=0x0,103945=0x10000008894,103946=0x0,103947=0x10000008895,103949=0x10000007dfc,103951=0x10000007dfd,103952=0x0,103953=0x10000008896,103954=0x0,103956=0x10000008897,103958=0x0,103959=0x10000008898,103961=0x10000007dfe,103962=0x0,103963=0x10000008899,103965=0x0,103966=0x1000000889a,103967=0x0,103968=0x1000000889b,103969=0x0,103971=0x1000000889c,103972=0x0,103973=0x1000000889d,103974=0x0,103975=0x1000000889e,103976=0x0,103977=0x1000000889f,103978=0x0,103979=0x100000088a0,103980=0x0,103981=0x100000088a1,103982=0x0,103983=0x100000088a2,103984=0x0,103985=0x100000088a3,103987=0x0,103989=0x100000088a4,103991=0x0,103992=0x100000088a5,103993=0x0,103994=0x100000088a6,103995=0x0,103996=0x100000088a7,103997=0x0,103998=0x100000088a8,103999=0x0,104000=0x100000088a9} free_prealloc_inos [0x10000007dff~0x79,0x10000008310~0xfb,0x100000086a6~0x68,0x100000087aa~0x13,0x100000088aa~0x24d] delegated_inos []
2024-03-08T23:00:10.977+0000 7f6a0a6ba640  7 mds.0.tableserver(snaptable) finish_recovery
2024-03-08T23:00:10.977+0000 7f6a0a6ba640  7 mds.0.tableserver(snaptable) _do_server_recovery 0
...
2024-03-08T23:00:10.984+0000 7f6a0a6ba640  3 mds.0.server queuing replayed op
2024-03-08T23:00:10.984+0000 7f6a0a6ba640  1 -- [v2:172.21.15.173:6835/3994328214,v1:172.21.15.173:6837/3994328214] <== client.4559 v1:172.21.15.39:0/38623762 56 ==== client_request(client.4559:103999 unlink #0x10000002e89/ceph_fuse.cc 2024-03-08T22:59:48.969091+0000 RETRY=1 REPLAY caller_uid=1000, caller_gid=1301{6,36,1000,1301,}) v4 ==== 180+0+0 (unknown 1564592980 0 0) 0x55dc69ed1500 con 0x55dc6eb7c800
2024-03-08T23:00:10.984+0000 7f6a0a6ba640 20 mds.0.156 get_session have 0x55dc69663e00 client.4559 v1:172.21.15.39:0/38623762 state open
2024-03-08T23:00:10.984+0000 7f6a0a6ba640  5 mds.0.server dispatch request in up:reconnect: client_request(client.4559:103999 unlink #0x10000002e89/ceph_fuse.cc 2024-03-08T22:59:48.969091+0000 RETRY=1 REPLAY caller_uid=1000, caller_gid=1301{6,36,1000,1301,}) v4
2024-03-08T23:00:10.984+0000 7f6a0a6ba640  3 mds.0.server queuing replayed op
2024-03-08T23:00:10.984+0000 7f6a0a6ba640  1 -- [v2:172.21.15.173:6835/3994328214,v1:172.21.15.173:6837/3994328214] <== client.4559 v1:172.21.15.39:0/38623762 57 ==== client_request(client.4559:104000 create owner_uid=1000, owner_gid=1301 #0x10000002e89/ceph_fuse.cc 2024-03-08T22:59:48.971091+0000 RETRY=1 caller_uid=1000, caller_gid=1301{6,36,1000,1301,}) v4 ==== 180+0+65 (unknown 2784880835 0 2155696498) 0x55dc69ed1880 con 0x55dc6eb7c800
2024-03-08T23:00:10.984+0000 7f6a0a6ba640 20 mds.0.156 get_session have 0x55dc69663e00 client.4559 v1:172.21.15.39:0/38623762 state open
2024-03-08T23:00:10.984+0000 7f6a0a6ba640  5 mds.0.server dispatch request in up:reconnect: client_request(client.4559:104000 create owner_uid=1000, owner_gid=1301 #0x10000002e89/ceph_fuse.cc 2024-03-08T22:59:48.971091+0000 RETRY=1 caller_uid=1000, caller_gid=1301{6,36,1000,1301,}) v4
...
2024-03-08T23:00:13.061+0000 7f6a056b0640  4 mds.0.server handle_client_request client_request(client.4559:103999 unlink #0x10000002e89/ceph_fuse.cc 2024-03-08T22:59:48.969091+0000 RETRY=1 REPLAY QUEUED_FOR_REPLAY caller_uid=1000, caller_gid=1301{6,36,1000,1301,}) v4
2024-03-08T23:00:13.061+0000 7f6a04eaf640 20 mds.0.purge_queue _consume:  executing item (0x1000000878a)
2024-03-08T23:00:13.061+0000 7f6a04eaf640 10 mds.0.purge_queue _execute_item:  0~53801 objects 0~1 snapc 236=[236,235] on 0x1000000878a
2024-03-08T23:00:13.061+0000 7f6a056b0640 20 mds.0.156 get_session have 0x55dc69663e00 client.4559 v1:172.21.15.39:0/38623762 state open
2024-03-08T23:00:13.061+0000 7f6a056b0640  5 mds.0.server already completed client.4559:103999
2024-03-08T23:00:13.061+0000 7f6a04eaf640 20 mds.0.purge_queue _can_consume: 18/32 ops, 18/64 files
2024-03-08T23:00:13.061+0000 7f6a04eaf640 20 mds.0.purge_queue _consume:  decoding entry
2024-03-08T23:00:13.061+0000 7f6a04eaf640 20 mds.0.purge_queue _consume:  executing item (0x1000000878b)
2024-03-08T23:00:13.061+0000 7f6a056b0640 10 mds.0.156 send_message_client client.4559 v1:172.21.15.39:0/38623762 client_reply(???:103999 = 0 (0) Success safe) v1
2024-03-08T23:00:13.061+0000 7f6a04eaf640 10 mds.0.purge_queue _execute_item:  0~17431 objects 0~1 snapc 236=[236,235] on 0x1000000878b
2024-03-08T23:00:13.061+0000 7f6a04eaf640 20 mds.0.purge_queue _can_consume: 19/32 ops, 19/64 files
2024-03-08T23:00:13.061+0000 7f6a04eaf640 20 mds.0.purge_queue _consume:  decoding entry
2024-03-08T23:00:13.061+0000 7f6a056b0640  1 -- [v2:172.21.15.173:6835/3994328214,v1:172.21.15.173:6837/3994328214] --> v1:172.21.15.39:0/38623762 -- client_reply(???:103999 = 0 (0) Success safe) v1 -- 0x55dc69ecc200 con 0x55dc6eb7c800
2024-03-08T23:00:13.061+0000 7f6a04eaf640 20 mds.0.purge_queue _consume:  executing item (0x10000007420)
2024-03-08T23:00:13.061+0000 7f6a056b0640 10 mds.0.156  queued next replay op
2024-03-08T23:00:13.061+0000 7f6a04eaf640 10 mds.0.purge_queue _execute_item:  0~8493 objects 0~1 snapc 236=[236,235,234,233,232,231,230,22f,22e,22d,22c,22b,22a,229,228,227,226,225,224,223,222,221] on 0x10000007420
2024-03-08T23:00:13.061+0000 7f6a056b0640 10 mds.0.156  finish 0x55dc69c75240
2024-03-08T23:00:13.061+0000 7f6a056b0640 10 MDSContext::complete: 18C_MDS_RetryMessage
2024-03-08T23:00:13.061+0000 7f6a04eaf640 20 mds.0.purge_queue _can_consume: 20/32 ops, 20/64 files
2024-03-08T23:00:13.061+0000 7f6a04eaf640 20 mds.0.purge_queue _consume:  decoding entry
2024-03-08T23:00:13.061+0000 7f6a04eaf640 20 mds.0.purge_queue _consume:  executing item (0x1000000878e)
2024-03-08T23:00:13.061+0000 7f6a056b0640  4 mds.0.server handle_client_request client_request(client.4559:104000 create owner_uid=1000, owner_gid=1301 #0x10000002e89/ceph_fuse.cc 2024-03-08T22:59:48.971091+0000 RETRY=1 QUEUED_FOR_REPLAY caller_uid=1000, caller_gid=1301{6,36,1000,1301,}) v4
2024-03-08T23:00:13.061+0000 7f6a04eaf640 10 mds.0.purge_queue _execute_item:  0~3299 objects 0~1 snapc 236=[236,235] on 0x1000000878e
2024-03-08T23:00:13.061+0000 7f6a056b0640 20 mds.0.156 get_session have 0x55dc69663e00 client.4559 v1:172.21.15.39:0/38623762 state open
2024-03-08T23:00:13.061+0000 7f6a04eaf640 20 mds.0.purge_queue _can_consume: 21/32 ops, 21/64 files
2024-03-08T23:00:13.061+0000 7f6a04eaf640 20 mds.0.purge_queue _consume:  decoding entry
2024-03-08T23:00:13.061+0000 7f6a056b0640 15 mds.0.server  oldest_client_tid=103934
2024-03-08T23:00:13.061+0000 7f6a04eaf640 20 mds.0.purge_queue _consume:  executing item (0x10000008164)
2024-03-08T23:00:13.061+0000 7f6a04eaf640 10 mds.0.purge_queue _execute_item:  0~22282 objects 0~1 snapc 236=[236,235,234,233,232,231,230] on 0x10000008164
2024-03-08T23:00:13.061+0000 7f6a04eaf640 20 mds.0.purge_queue _can_consume: 22/32 ops, 22/64 files
2024-03-08T23:00:13.061+0000 7f6a04eaf640 20 mds.0.purge_queue _consume:  decoding entry
2024-03-08T23:00:13.061+0000 7f6a056b0640  7 mds.0.cache request_start request(client.4559:104000 nref=2 cr=0x55dc69ed1880)
2024-03-08T23:00:13.061+0000 7f6a04eaf640 20 mds.0.purge_queue _consume:  executing item (0x10000007075)
2024-03-08T23:00:13.061+0000 7f6a04eaf640 10 mds.0.purge_queue _execute_item:  0~6618 objects 0~1 snapc 236=[236,235,234,233,232,231,230,22f,22e,22d,22c,22b,22a,229] on 0x10000007075
2024-03-08T23:00:13.061+0000 7f6a056b0640  7 mds.0.server dispatch_client_request client_request(client.4559:104000 create owner_uid=1000, owner_gid=1301 #0x10000002e89/ceph_fuse.cc 2024-03-08T22:59:48.971091+0000 RETRY=1 QUEUED_FOR_REPLAY caller_uid=1000, caller_gid=1301{6,36,1000,1301,}) v4
2024-03-08T23:00:13.061+0000 7f6a04eaf640 20 mds.0.purge_queue _can_consume: 23/32 ops, 23/64 files
2024-03-08T23:00:13.061+0000 7f6a04eaf640 20 mds.0.purge_queue _consume:  decoding entry
2024-03-08T23:00:13.061+0000 7f6a04eaf640 20 mds.0.purge_queue _consume:  executing item (0x10000007d84)
2024-03-08T23:00:13.061+0000 7f6a04eaf640 10 mds.0.purge_queue _execute_item:  0~16380 objects 0~1 snapc 236=[236,235,234,233,232,231,230,22f,22e,22d] on 0x10000007d84
2024-03-08T23:00:13.061+0000 7f6a04eaf640 20 mds.0.purge_queue _can_consume: 24/32 ops, 24/64 files
2024-03-08T23:00:13.061+0000 7f6a04eaf640 20 mds.0.purge_queue _consume:  decoding entry
2024-03-08T23:00:13.061+0000 7f6a04eaf640 20 mds.0.purge_queue _consume:  executing item (0x10000008792)
2024-03-08T23:00:13.061+0000 7f6a056b0640  7 mds.0.server open on #0x10000002e89/ceph_fuse.cc
2024-03-08T23:00:13.061+0000 7f6a04eaf640 10 mds.0.purge_queue _execute_item:  0~32996 objects 0~1 snapc 236=[236,235] on 0x10000008792
2024-03-08T23:00:13.061+0000 7f6a056b0640 10 mds.0.server rdlock_path_pin_ref request(client.4559:104000 nref=2 cr=0x55dc69ed1880) #0x10000002e89/ceph_fuse.cc
...
2024-03-08T23:00:13.062+0000 7f6a056b0640  7 mds.0.server reply_client_request 0 ((0) Success) client_request(client.4559:104000 create owner_uid=1000, owner_gid=1301 #0x10000002e89/ceph_fuse.cc 2024-03-08T22:59:48.971091+0000 RETRY=1 QUEUED_FOR_REPLAY caller_uid=1000, caller_gid=1301{6,36,1000,1301,}) v4
...

Actually it should be just skip re-handling like the following request:

2024-03-08T23:00:13.061+0000 7f6a056b0640  4 mds.0.server handle_client_request client_request(client.4559:103999 unlink #0x10000002e89/ceph_fuse.cc 2024-03-08T22:59:48.969091+0000 RETRY=1 REPLAY QUEUED_FOR_REPLAY caller_uid=1000, caller_gid=1301{6,36,1000,1301,}) v4
2024-03-08T23:00:13.061+0000 7f6a056b0640 20 mds.0.156 get_session have 0x55dc69663e00 client.4559 v1:172.21.15.39:0/38623762 state open
2024-03-08T23:00:13.061+0000 7f6a056b0640  5 mds.0.server already completed client.4559:103999

Actions

Copy link

#12

Updated by Xiubo Li almost 2 years ago

This will be a problem when the kclient sends a create request and the mds successfully handles it and responses to the kclient. But the kclient couldn't receive the replies just before the socket connection is reset and then the create request will be retried.

Then in the mds side it will handle the create request again, but this time it will just open the existing file instead of creating a new one. And then in kclient side it will skip marking the file->f_mode |= FMODE_CREATED in ceph_atomic_open(). And then finally in VFS fs/namei.c it will fail the create by returning the -EEXIST errno:

3595 /*      
3596  * Handle the last step of open()
3597  */     
3598 static int do_open(struct nameidata *nd,
3599                    struct file *file, const struct open_flags *op)
3600 {               
3601         struct mnt_idmap *idmap;
3602         int open_flag = op->open_flag;
3603         bool do_truncate;
3604         int acc_mode;
3605         int error;
...
3612         if (!(file->f_mode & FMODE_CREATED))
3613                 audit_inode(nd->name, nd->path.dentry, 0);
3614         idmap = mnt_idmap(nd->path.mnt);
3615         if (open_flag & O_CREAT) {
3616                 if ((open_flag & O_EXCL) && !(file->f_mode & FMODE_CREATED))
3617                         return -EEXIST;
...
3651         if (do_truncate)
3652                 mnt_drop_write(nd->path.mnt);
3653         return error;
3654 }

So we need to handle this case in mds and return the correct info instead of just as a open request.

Actions

Copy link

#13

Updated by Xiubo Li almost 2 years ago

Status changed from In Progress to Fix Under Review
Pull request ID set to 57754

Actions

Copy link

#14

Updated by Xiubo Li almost 2 years ago

While there has a protocol bug when retrying the create client request:

 2492 void Server::handle_client_request(const cref_t<MClientRequest> &req)
 2493 {
 2494   dout(4) << "handle_client_request " << *req << dendl;
 2495  
...
 2533   // completed request?
 2534   bool has_completed = false;
 2535   if (req->is_replay() || req->get_retry_attempt()) {
 2536     ceph_assert(session);
 2537     inodeno_t created;
 2538     if (session->have_completed_request(req->get_reqid().tid, &created)) {
 2539       has_completed = true;
 2540       if (!session->is_open())
 2541         return;
 2542       // Don't send traceless reply if the completed request has created
 2543       // new inode. Treat the request as lookup request instead.
 2544       if (req->is_replay() ||
 2545           ((created == inodeno_t() || !mds->is_clientreplay()) &&
 2546            req->get_op() != CEPH_MDS_OP_OPEN &&
 2547            req->get_op() != CEPH_MDS_OP_CREATE)) {
 2548         dout(5) << "already completed " << req->get_reqid() << dendl;
 2549         auto reply = make_message<MClientReply>(*req, 0);
 2550         if (created != inodeno_t()) {
 2551           bufferlist extra;
 2552           encode(created, extra);
 2553           reply->set_extra_bl(extra);
 2554         }
 2555         mds->send_message_client(reply, session);
 2556 
 2557         if (req->is_queued_for_replay())
 2558           mds->queue_one_replay();
 2559 
 2560         return;
 2561       }
 2562       if (req->get_op() != CEPH_MDS_OP_OPEN &&
 2563           req->get_op() != CEPH_MDS_OP_CREATE) {
 2564         dout(10) << " completed request which created new inode " << created
 2565                  << ", convert it to lookup request" << dendl;
 2566         req->head.op = req->get_dentry_wanted() ? CEPH_MDS_OP_LOOKUP : CEPH_MDS_OP_GETATTR;
 2567         req->head.args.getattr.mask = CEPH_STAT_CAP_INODE_ALL;
 2568       }
 2569     }
 2570   }                                                       
...

If the create request was totally completed previously but the client only received the unsafe reply, then when retrying the create request it will be marked as REPLAY. But in Line#2550~2554 it will only encode the extra info use the legacy format, and this could cause decoding corruption in client side, such as in kclient:

 661 /*
 662  * parse create results
 663  */
 664 static int parse_reply_info_create(void **p, void *end,
 665                                   struct ceph_mds_reply_info_parsed *info,
 666                                   u64 features, struct ceph_mds_session *s)
 667 {
 668         int ret;
 669   
 670         if (features == (u64)-1 ||
 671             (features & CEPH_FEATURE_REPLY_CREATE_INODE)) {                                                                                                                                                                                              
 672                 if (*p == end) {
 673                         /* Malformed reply? */              
 674                         info->has_create_ino = false;       
 675                 } else if (test_bit(CEPHFS_FEATURE_DELEG_INO, &s->s_features)) {
 676                         info->has_create_ino = true;        
 677                         /* struct_v, struct_compat, and len */
 678                         ceph_decode_skip_n(p, end, 2 + sizeof(u32), bad);
 679                         ceph_decode_64_safe(p, end, info->ino, bad);
 680                         ret = ceph_parse_deleg_inos(p, end, s); 
 681                         if (ret)
 682                                 return ret;                         
 683                 } else {
 684                         /* legacy */                        
 685                         ceph_decode_64_safe(p, end, info->ino, bad);
 686                         info->has_create_ino = true;        
 687                 }
 688         } else {
 689                 if (*p != end)
 690                         goto bad;
 691         }
 692   
 693         /* Skip over any unrecognized fields */
 694         *p = end;
 695         return 0;
 696 bad:
 697         return -EIO;
 698 }

Actions

Copy link

#15

Updated by Patrick Donnelly almost 2 years ago

Category set to Correctness/Safety
Target version set to v20.0.0
Source set to Q/A
Backport set to squid,reef,quincy
Component(FS) MDS added

Actions

Copy link

#16

Updated by Patrick Donnelly over 1 year ago

Status changed from Fix Under Review to Pending Backport

Actions

Copy link

#17

Updated by Upkeep Bot over 1 year ago

Copied to Backport #66613: quincy: qa: fs/snaps/snaptest-git-ceph.sh failed when reseting to tag 'v0.1' added

Actions

Copy link

#18

Updated by Upkeep Bot over 1 year ago

Copied to Backport #66614: reef: qa: fs/snaps/snaptest-git-ceph.sh failed when reseting to tag 'v0.1' added

Actions

Copy link

#19

Updated by Upkeep Bot over 1 year ago

Copied to Backport #66626: squid: qa: fs/snaps/snaptest-git-ceph.sh failed when reseting to tag 'v0.1' added

Actions

Copy link

#20

Updated by Upkeep Bot over 1 year ago

Tags (freeform) set to backport_processed

Actions

Copy link

#21

Updated by Milind Changire about 1 year ago

squid:
https://pulpito.ceph.com/vshankar-2024-11-25_18:02:39-fs-wip-vshankar-testing-20241125.064001-squid-debug-testing-default-smithi/8008481

Actions

Copy link

#22

Updated by Konstantin Shalygin about 1 year ago

Status changed from Pending Backport to Resolved
% Done changed from 0 to 100

Actions

Copy link

#23

Updated by Upkeep Bot 9 months ago

Merge Commit set to 63ad865ce7ec5198a565736e2ffff3dc51b79fbe
Fixed In set to v19.3.0-2922-g63ad865ce7e
Upkeep Timestamp set to 2025-07-02T03:46:12+00:00

Actions

Copy link

#24

Updated by Upkeep Bot 8 months ago

Fixed In changed from v19.3.0-2922-g63ad865ce7e to v19.3.0-2922-g63ad865ce7
Upkeep Timestamp changed from 2025-07-02T03:46:12+00:00 to 2025-07-14T16:45:47+00:00