
test/libcephfs: validate asynchronous write and fsync executing concurrently #63636

Merged: vshankar merged 5 commits into ceph:main from vshankar:wip-71515 on Sep 9, 2025
@vshankar vshankar commented Jun 2, 2025

Still a WIP (kind of).

Reproducer test case.

@vshankar vshankar requested a review from a team June 2, 2025 05:31
@vshankar vshankar added the cephfs Ceph File System label Jun 2, 2025
vshankar commented Jun 2, 2025

Just a note that this change still needs work to be able to reproduce the client crash. I will be updating this change over the coming days.

@chrisphoffman chrisphoffman left a comment

When I revert the fix, the test still passes. Ah, I see the above note now, will rereview then.

vshankar commented Jun 2, 2025

> When I revert the fix, the test still passes. Ah, I see the above note now, will rereview then.

Yeh, I still need to finish up.

@vshankar vshankar force-pushed the wip-71515 branch 2 times, most recently from ad5a42c to 21763db Compare June 3, 2025 12:17
@vshankar vshankar requested a review from mchangir June 3, 2025 12:17
vshankar commented Jun 3, 2025

> > When I revert the fix, the test still passes. Ah, I see the above note now, will rereview then.
>
> Yeh, I still need to finish up.

This is done. The test changes reproduce the issue. The bug is essentially a buggy reference count decrement (for the MetaRequest), so it is caught by asserting if the reference count drops below zero. This suffices IMO, since the exact backtrace is hard to reproduce: it requires the MetaRequest to be tracked in some [x]list, which in turn requires the test case to perform many more operations and hope that the request is still part of one of the [x]lists.
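
As a rough illustration of the check described above, here is a minimal sketch of a guarded reference drop. `FakeRequest`, `asserting_put` and `checked_put` are hypothetical names used only for illustration; they are not Ceph's `MetaRequest` or `Client::put_request`, and the real client code may differ in detail.

```cpp
#include <cassert>

// Hypothetical stand-in for a reference-counted request (not Ceph's MetaRequest).
struct FakeRequest {
  int ref = 1;  // the creator holds one reference
};

// Assert-style drop: abort if the caller does not actually hold a reference,
// so an extra put is caught at the point where it happens.
void asserting_put(FakeRequest& req) {
  assert(req.ref >= 1);
  --req.ref;
}

// Non-aborting variant for demonstration: returns false instead of asserting
// when the drop would take the count below zero.
bool checked_put(FakeRequest& req) {
  if (req.ref < 1) {
    return false;  // buggy extra drop detected
  }
  --req.ref;
  return true;
}
```

With a fresh `FakeRequest`, the first `checked_put` succeeds and leaves `ref` at 0; a second call is the "buggy drop" and is rejected rather than silently producing a negative count.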

@vshankar vshankar requested review from a team, chrisphoffman and dparmar18 June 3, 2025 12:32
@vshankar

jenkins test make check

@vshankar

jenkins retest this please

github-actions bot commented Jun 13, 2025

Config Diff Tool Output

+ added: client_inject_write_delay_secs (mds-client.yaml.in)

The above configuration changes are found in the PR. Please update the relevant release documentation if necessary.
Ignore this comment if docs are already updated. To make the "Check ceph config changes" CI check pass, please comment /config check ok and re-run the test.

@vshankar

Testing update: https://tracker.ceph.com/issues/71514#note-12

The jobs have picked up the crimson flavour for OSD, which is causing lots of failures. Not sure how that happened. This is under resolution, and we are talking to the crimson team. Stand by!

@vshankar

jenkins test windows

@vshankar

> Config Diff Tool Output
>
> + added: client_inject_write_delay_secs (mds-client.yaml.in)
>
> The above configuration changes are found in the PR. Please update the relevant release documentation if necessary.

So, after fixing up the test case to set the correct config, I ran the test locally, and the issue reproduces without PR #63619:

```
➜  build git:(wip-71515) ✗ ./bin/ceph_test_libcephfs --gtest_filter=LibCephFS.ConcurrentWriteAndFsync
Note: Google Test filter = LibCephFS.ConcurrentWriteAndFsync
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from LibCephFS
[ RUN      ] LibCephFS.ConcurrentWriteAndFsync
: setxattr thread sleeping: fsync thread sleeping

: setxattr thread wokeup
: fsync thread wokeup
: waiting for fsync to finish
: waiting for write to finish
/home/vshankar/ceph/src/client/Client.cc: In function 'void Client::put_request(MetaRequest*)' thread 7f63cdffb640 time 2025-06-18T05:24:37.869462+0000
/home/vshankar/ceph/src/client/Client.cc: 2290: FAILED ceph_assert(request->ref >= 1)
 ceph version 20.3.0-891-g7c63afd9659 (7c63afd9659b17fa5bac6cb1eb3feccb855fe863) tentacle (dev - Debug)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x119) [0x7f63eb8ebb98]
 2: (ceph::__ceph_assert_fail(ceph::assert_data const&)+0x17) [0x7f63eb8ebedc]
 3: /home/vshankar/ceph/build/lib/libcephfs.so.2(+0x6a075) [0x7f63ec46a075]
 4: /home/vshankar/ceph/build/lib/libcephfs.so.2(+0xb8217) [0x7f63ec4b8217]
 5: /home/vshankar/ceph/build/lib/libcephfs.so.2(+0xef06e) [0x7f63ec4ef06e]
 6: /home/vshankar/ceph/build/lib/libcephfs.so.2(+0xf1136) [0x7f63ec4f1136]
 7: (Context::complete(int)+0x9) [0x7f63ec45949d]
 8: /home/vshankar/ceph/build/lib/libcephfs.so.2(+0x16a1c7) [0x7f63ec56a1c7]
 9: /home/vshankar/ceph/build/lib/libcephfs.so.2(+0xa7c51) [0x7f63ec4a7c51]
 10: /home/vshankar/ceph/build/lib/libcephfs.so.2(+0xf0d47) [0x7f63ec4f0d47]
 11: /home/vshankar/ceph/build/lib/libcephfs.so.2(+0x16da11) [0x7f63ec56da11]
 12: (Context::complete(int)+0x9) [0x7f63ec45949d]
 13: /home/vshankar/ceph/build/lib/libcephfs.so.2(+0x6d108) [0x7f63ec46d108]
 14: (Context::complete(int)+0x9) [0x7f63ec45949d]
 15: (Finisher::finisher_thread_entry()+0x665) [0x7f63eb86fdc1]
 16: (Finisher::FinisherThread::entry()+0xd) [0x7f63eb870ddf]
 17: (Thread::entry_wrapper()+0x2f) [0x7f63eb8b88f5]
 18: (Thread::_entry_func(void*)+0x9) [0x7f63eb8b8907]
 19: /lib64/libc.so.6(+0x89e92) [0x7f63ea689e92]
 20: /lib64/libc.so.6(+0x10ef20) [0x7f63ea70ef20]
[1]    199494 IOT instruction (core dumped)  ./bin/ceph_test_libcephfs --gtest_filter=LibCephFS.ConcurrentWriteAndFsync
```

The teuthology test ran without this comment incorporated; however, I think it's safe to merge this fix, since we would defer the backport as mentioned here: https://tracker.ceph.com/issues/71514#note-12

@vshankar

jenkins test windows

@vshankar

This PR is under test in https://tracker.ceph.com/issues/71837.

vshankar added a commit to vshankar/ceph that referenced this pull request Jun 25, 2025
* refs/pull/63636/head:
	test/libcephfs: validate asynchronous write and fsync executing concurrently
	client: catch buggy reference count drop for MetaRequest
	client: synthetically delay write operation
	client: log unsafe operation count (for debugging)
	libcephfs/client: asynchronous fsync interface

Reviewed-by: Dhairya Parmar <dparmar@redhat.com>
Reviewed-by: Christopher Hoffman <choffman@redhat.com>
vshankar commented Jul 9, 2025

jenkins test make check

vshankar commented Jul 9, 2025

jenkins test make check arm64

@vshankar

jenkins test make check

@vshankar

jenkins test make check

@vshankar

jenkins test make check

@vshankar

jenkins test make check

@vshankar

jenkins retest this please

@vshankar

jenkins retest this please

@vshankar

jenkins retest this please

vshankar commented Aug 4, 2025

jenkins retest this please

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

Mostly for writing a test to hunt [0].

[0]: https://tracker.ceph.com/issues/71510

Signed-off-by: Venky Shankar <vshankar@redhat.com>
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Allow the client to hold Fb caps for an extended period of time
so that an asynchronous fsync can intervene and block, in order
to hunt [0].

[0]: https://tracker.ceph.com/issues/71510

Signed-off-by: Venky Shankar <vshankar@redhat.com>
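
A minimal sketch of the idea behind this commit, assuming a test-only delay applied before a write proceeds. `inject_write_delay` is a hypothetical helper written for illustration, not the client's code; in this PR the delay is controlled by the `client_inject_write_delay_secs` option, and the sketch uses milliseconds only to keep the example fast.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical test-only hook (not Ceph code): stall the caller for `delay`
// before a write would proceed. A zero delay disables the hook.
// Returns the time actually spent sleeping.
std::chrono::milliseconds inject_write_delay(std::chrono::milliseconds delay) {
  using clock = std::chrono::steady_clock;
  const auto zero = std::chrono::milliseconds::zero();
  assert(delay >= zero);  // a negative delay would be a caller bug
  if (delay == zero) {
    return zero;  // production default in this sketch: no delay
  }
  const auto start = clock::now();
  std::this_thread::sleep_for(delay);
  return std::chrono::duration_cast<std::chrono::milliseconds>(clock::now() - start);
}
```

Keeping the default at zero means the hook costs nothing unless a test opts in, which is the usual shape for fault-injection settings of this kind.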
The prior commit introduces a synthetic delay in the write
operation so that a test reproducer can interleave an asynchronous
fsync with an operation that makes the MDS send an early reply to
the client (the client therefore tracks the early-replied request
for the inode in Inode::unsafe_ops). This is enough to trick the
client into the code path that causes a buggy reference drop for
the request (MetaRequest), but hitting the _exact_ crash backtrace
requires the request to be in various [x]lists.

This last bit is tricky to synthetically massage in the test. So,
in order to catch the buggy reference drop, it suffices to assert
on the reference count dropping below zero.

Signed-off-by: Venky Shankar <vshankar@redhat.com>
…rrently

This synthetic reproducer does three things:

- sets up a client mount with a configuration to delay write operations and
  initiates a write operation via a thread
- starts a thread that invokes asynchronous fsync
- starts a thread that invokes setxattr so that the client tracks early replies
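
The three-thread shape listed above can be modelled roughly as follows. This is only a sketch under stated assumptions: it does not call libcephfs, the function names and event strings are hypothetical, and "fsync waits for the in-flight write" is a simplification of the real cap handling.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <future>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical model of the reproducer's structure; returns completion order.
std::vector<std::string> run_interleaving_sketch(std::chrono::milliseconds write_delay) {
  std::mutex m;
  std::vector<std::string> events;
  auto record = [&](const std::string& e) {
    std::lock_guard<std::mutex> l(m);
    events.push_back(e);
  };

  std::promise<void> write_done;
  std::shared_future<void> write_done_f = write_done.get_future().share();

  // Writer: the injected delay keeps the write in flight.
  std::thread writer([&] {
    std::this_thread::sleep_for(write_delay);
    record("write done");
    write_done.set_value();
  });

  // Asynchronous fsync: blocks until the delayed write completes.
  std::thread fsyncer([&, write_done_f] {
    write_done_f.wait();
    record("fsync done");
  });

  // setxattr: a metadata operation issued while the other two are pending.
  std::thread xattr([&] { record("setxattr done"); });

  writer.join();
  fsyncer.join();
  xattr.join();
  assert(events.size() == 3);
  return events;
}

// True if the write completed before the fsync in the recorded order.
bool write_precedes_fsync(const std::vector<std::string>& events) {
  const auto w = std::find(events.begin(), events.end(), "write done");
  const auto f = std::find(events.begin(), events.end(), "fsync done");
  return w != events.end() && f != events.end() && w < f;
}
```

Calling `run_interleaving_sketch(std::chrono::milliseconds(20))` yields three events in which the write always precedes the fsync; where the setxattr entry lands depends on scheduling.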

Without the fix[0], the test reproduces the following crash:

```
/home/vshankar/ceph/src/client/Client.cc: In function 'void Client::put_request(MetaRequest*)' thread 7f7210ff9640 time 2025-06-03T09:34:45.634974+0000
/home/vshankar/ceph/src/client/Client.cc: 2290: FAILED ceph_assert(request->ref >= 1)
 ceph version 20.3.0-673-gdd152807f7e (dd15280) tentacle (dev - Debug)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x119) [0x7f72222ebb98]
 2: (ceph::__ceph_assert_fail(ceph::assert_data const&)+0x17) [0x7f72222ebedc]
 3: /home/vshankar/ceph/build/lib/libcephfs.so.2(+0x6a075) [0x7f7222e6a075]
 4: /home/vshankar/ceph/build/lib/libcephfs.so.2(+0xb8289) [0x7f7222eb8289]
 5: /home/vshankar/ceph/build/lib/libcephfs.so.2(+0xee951) [0x7f7222eee951]
 6: /home/vshankar/ceph/build/lib/libcephfs.so.2(+0xf167c) [0x7f7222ef167c]
 7: (Context::complete(int)+0x9) [0x7f7222e5949d]
 8: /home/vshankar/ceph/build/lib/libcephfs.so.2(+0x16a853) [0x7f7222f6a853]
 9: /home/vshankar/ceph/build/lib/libcephfs.so.2(+0xa7cc5) [0x7f7222ea7cc5]
 10: /home/vshankar/ceph/build/lib/libcephfs.so.2(+0xf128d) [0x7f7222ef128d]
 11: /home/vshankar/ceph/build/lib/libcephfs.so.2(+0x16e09d) [0x7f7222f6e09d]
 12: (Context::complete(int)+0x9) [0x7f7222e5949d]
 13: /home/vshankar/ceph/build/lib/libcephfs.so.2(+0x6d108) [0x7f7222e6d108]
 14: (Context::complete(int)+0x9) [0x7f7222e5949d]
 15: (Finisher::finisher_thread_entry()+0x665) [0x7f722226fdc1]
 16: (Finisher::FinisherThread::entry()+0xd) [0x7f7222270ddf]
 17: (Thread::entry_wrapper()+0x2f) [0x7f72222b88f5]
 18: (Thread::_entry_func(void*)+0x9) [0x7f72222b8907]
 19: /lib64/libc.so.6(+0x89e92) [0x7f7221089e92]
 20: /lib64/libc.so.6(+0x10ef20) [0x7f722110ef20]
[1]    2162689 IOT instruction (core dumped)  ./bin/ceph_test_libcephfs --gtest_filter=LibCephFS.ConcurrentWriteAndFsync
```

[0]: ceph#63619

Fixes: http://tracker.ceph.com/issues/71515
Signed-off-by: Venky Shankar <vshankar@redhat.com>
@vshankar vshankar merged commit 598c41f into ceph:main Sep 9, 2025
12 of 13 checks passed