test/libcephfs: validate asynchronous write and fsync executing concurrently#63636
|
Just a note that this change still needs work to be able to reproduce the client crash. I will be updating this change over the coming days. |
Yeah, I still need to finish up. |
Force-pushed from ad5a42c to 21763db.
This is done. The test changes reproduce the issue. The bug is essentially a buggy reference count decrement (for the MetaRequest), so it is caught by asserting if the reference count drops below zero. This suffices IMO, since the exact backtrace is hard to reproduce: it requires the MetaRequest to be tracked in some [x]list, which in turn requires the test case to perform many more operations and hope that the request is still actively part of one of the [x]lists. |
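A minimal sketch of the "assert on the reference count" idea described above, assuming a toy request type. `ToyRequest` and `extra_put_is_caught` are hypothetical names for illustration and are not Ceph's `MetaRequest` or `Client::put_request`. The backtrace in the commit message shows the real check as `ceph_assert(request->ref >= 1)`; the sketch throws instead of aborting only so that the condition can be exercised without killing the process.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for a reference-counted request (not Ceph's MetaRequest).
struct ToyRequest {
  int ref = 1;  // the creator holds the initial reference

  void get() { ++ref; }

  // Drops one reference. Returns true when the last reference is gone.
  // Throws if there is no reference left to drop, which is the condition
  // an assertion such as "ref >= 1" would catch.
  bool put() {
    if (ref < 1) {
      throw std::logic_error("put() without a matching reference");
    }
    return --ref == 0;
  }
};

// Simulates one get() followed by three put() calls: the third put() is
// the buggy extra decrement and must be detected.
bool extra_put_is_caught() {
  ToyRequest req;
  req.get();  // ref: 1 -> 2
  req.put();  // ref: 2 -> 1
  req.put();  // ref: 1 -> 0 (last legitimate drop)
  try {
    req.put();  // buggy extra drop
  } catch (const std::logic_error&) {
    return true;
  }
  return false;
}
```

The point of checking before the decrement is that the extra drop is flagged at the call site that performs it, without depending on the request still being linked into any list.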
|
jenkins test make check |
|
jenkins retest this please |
Config Diff Tool Output
+ added: client_inject_write_delay_secs (mds-client.yaml.in)
The above configuration changes are found in the PR. Please update the relevant release documentation if necessary. |
|
Testing update: https://tracker.ceph.com/issues/71514#note-12 |
|
jenkins test windows |
So, after fixing up the test case to set the correct config, I have run the test locally and the issue reproduces without PR #63619. The teuthology test ran without this comment incorporated; however, I think it's safe to merge this fix, as we would defer the backport as mentioned here: https://tracker.ceph.com/issues/71514#note-12 |
|
jenkins test windows |
|
This PR is under test in https://tracker.ceph.com/issues/71837. |
* refs/pull/63636/head:
  test/libcephfs: validate asynchronous write and fsync executing concurrently
  client: catch buggy reference count drop for MetaRequest
  client: synthetically delay write operation
  client: log unsafe operation count (for debugging)
  libcephfs/client: asynchronous fsync interface

Reviewed-by: Dhairya Parmar <dparmar@redhat.com>
Reviewed-by: Christopher Hoffman <choffman@redhat.com>
|
jenkins test make check |
|
jenkins test make check arm64 |
|
jenkins retest this please |
|
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved |
Mostly for writing test for hunting [0].

[0]: https://tracker.ceph.com/issues/71510

Signed-off-by: Venky Shankar <vshankar@redhat.com>
Signed-off-by: Venky Shankar <vshankar@redhat.com>
To allow the client to hold Fb caps for an extended period of time, to allow an asynchronous fsync to intervene and block, so as to hunt [0].

[0]: https://tracker.ceph.com/issues/71510

Signed-off-by: Venky Shankar <vshankar@redhat.com>
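A rough sketch of a config-gated synthetic delay, assuming a toy write path. The config diff names the real option `client_inject_write_delay_secs`; everything else here (`ToyClientConfig`, `SleepFn`, `toy_write`) is hypothetical and does not reflect how the Ceph client implements it. The sleep is passed in as a callable so the behaviour can be checked without actually waiting.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical config holder; a zero delay means "no injection".
struct ToyClientConfig {
  std::chrono::milliseconds inject_write_delay{0};
};

// The sleep primitive is injected so callers can record it instead of sleeping.
using SleepFn = std::function<void(std::chrono::milliseconds)>;

// Toy write path: stalls first when a delay is configured. In the scenario
// described above, this stall is the window in which another thread can
// issue an asynchronous fsync. Returns true if a delay was injected.
bool toy_write(const ToyClientConfig& conf, const SleepFn& sleep_fn) {
  assert(conf.inject_write_delay.count() >= 0);
  bool delayed = false;
  if (conf.inject_write_delay.count() > 0) {
    sleep_fn(conf.inject_write_delay);
    delayed = true;
  }
  // ... the actual write work would happen here ...
  return delayed;
}
```

In a real run the callable would wrap `std::this_thread::sleep_for`; with the default configuration the write path is unchanged.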
The prior commit introduces a synthetic delay in the write operation so that a test reproducer can interleave an asynchronous fsync with an operation that makes the MDS send an early reply to the client (the client therefore tracks the early-replied request for the inode in Inode::unsafe_ops). This is enough to trick the client into the code path that causes a buggy reference drop for the request (MetaRequest), but hitting the _exact_ crash backtrace requires the request to be in various [x]lists. That last bit is tricky to synthetically massage in the test. So, to catch the buggy reference drop, it suffices to assert on the reference count dropping below zero (0).

Signed-off-by: Venky Shankar <vshankar@redhat.com>
…rrently

This synthetic reproducer does three things:

- set up a client mount with a configuration to delay write operations and initiate a write operation via a thread
- a thread that invokes asynchronous fsync
- a thread that invokes setxattr for the client to track early replies

Without the fix[0], the test reproduces the following crash:

```
/home/vshankar/ceph/src/client/Client.cc: In function 'void Client::put_request(MetaRequest*)' thread 7f7210ff9640 time 2025-06-03T09:34:45.634974+0000
/home/vshankar/ceph/src/client/Client.cc: 2290: FAILED ceph_assert(request->ref >= 1)

 ceph version 20.3.0-673-gdd152807f7e (dd15280) tentacle (dev - Debug)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x119) [0x7f72222ebb98]
 2: (ceph::__ceph_assert_fail(ceph::assert_data const&)+0x17) [0x7f72222ebedc]
 3: /home/vshankar/ceph/build/lib/libcephfs.so.2(+0x6a075) [0x7f7222e6a075]
 4: /home/vshankar/ceph/build/lib/libcephfs.so.2(+0xb8289) [0x7f7222eb8289]
 5: /home/vshankar/ceph/build/lib/libcephfs.so.2(+0xee951) [0x7f7222eee951]
 6: /home/vshankar/ceph/build/lib/libcephfs.so.2(+0xf167c) [0x7f7222ef167c]
 7: (Context::complete(int)+0x9) [0x7f7222e5949d]
 8: /home/vshankar/ceph/build/lib/libcephfs.so.2(+0x16a853) [0x7f7222f6a853]
 9: /home/vshankar/ceph/build/lib/libcephfs.so.2(+0xa7cc5) [0x7f7222ea7cc5]
 10: /home/vshankar/ceph/build/lib/libcephfs.so.2(+0xf128d) [0x7f7222ef128d]
 11: /home/vshankar/ceph/build/lib/libcephfs.so.2(+0x16e09d) [0x7f7222f6e09d]
 12: (Context::complete(int)+0x9) [0x7f7222e5949d]
 13: /home/vshankar/ceph/build/lib/libcephfs.so.2(+0x6d108) [0x7f7222e6d108]
 14: (Context::complete(int)+0x9) [0x7f7222e5949d]
 15: (Finisher::finisher_thread_entry()+0x665) [0x7f722226fdc1]
 16: (Finisher::FinisherThread::entry()+0xd) [0x7f7222270ddf]
 17: (Thread::entry_wrapper()+0x2f) [0x7f72222b88f5]
 18: (Thread::_entry_func(void*)+0x9) [0x7f72222b8907]
 19: /lib64/libc.so.6(+0x89e92) [0x7f7221089e92]
 20: /lib64/libc.so.6(+0x10ef20) [0x7f722110ef20]
[1] 2162689 IOT instruction (core dumped) ./bin/ceph_test_libcephfs --gtest_filter=LibCephFS.ConcurrentWriteAndFsync
```

[0]: ceph#63619

Fixes: http://tracker.ceph.com/issues/71515
Signed-off-by: Venky Shankar <vshankar@redhat.com>
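The three-thread shape of the reproducer can be sketched as below. This is only an illustration of releasing several operations at the same moment on separate threads; `run_interleaved` is a hypothetical helper, it makes no libcephfs calls, and the actual test in this PR may be structured differently. The callables stand in for the delayed write, the asynchronous fsync, and the setxattr.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

// Starts one thread per operation, holds them all at a start flag, then
// releases them together so the operations overlap in time.
// Returns how many operations ran to completion.
int run_interleaved(const std::vector<std::function<void()>>& ops) {
  std::atomic<bool> go{false};
  std::atomic<int> done{0};
  std::vector<std::thread> threads;
  threads.reserve(ops.size());

  for (const auto& op : ops) {
    threads.emplace_back([&go, &done, &op] {
      // Spin until every thread has been created and the flag is raised.
      while (!go.load(std::memory_order_acquire)) {
        std::this_thread::yield();
      }
      op();
      done.fetch_add(1, std::memory_order_relaxed);
    });
  }

  go.store(true, std::memory_order_release);
  for (auto& t : threads) {
    t.join();
  }
  return done.load();
}
```

In a reproducer along these lines, the first callable would be the write that is held up by the injected delay, so that the other two operations run inside that window.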
Still a WIP (kind of). Reproducer test case.