client: Fix libcephfs aio metadata corruption.#59987
Conversation
|
Need to add test |
42f75b2 to
be26259
Compare
|
@vshankar @gregsfortytwo
This is also the reason that our libcephfs nonblocking io tests are not catching this issue. It is using objectcacher and everything works fine with it. |
nfs-ganesha runs without oc intentionally
I can add some test cases in #54435 (which is yet to merged). |
be26259 to
d990535
Compare
For now I have added a test to run the exisiting tests with objectcacher disabled. |
|
jenkins test make check |
1 similar comment
|
jenkins test make check |
d990535 to
3e60a67
Compare
|
jenkins test make check |
7e54d44 to
1c68510
Compare
|
Rebased and did small nfs test fix |
|
jenkins retest this please |
1c68510 to
3f42326
Compare
|
Fixed test_nfs |
|
jenkins test make check |
|
This PR is under test in https://tracker.ceph.com/issues/68382. |
|
jenkins test make check |
|
jenkins test make check |
|
@vshankar The make check failure is because of the flake8 error on the test_nfs.py. I will fix this and rebase. |
Fixes: https://tracker.ceph.com/issues/68146 Signed-off-by: Kotresh HR <khiremat@redhat.com>
Fixes: https://tracker.ceph.com/issues/68146 Signed-off-by: Kotresh HR <khiremat@redhat.com>
The same bufferlist is used without cleaning for multiple calls. The test 'LlreadvLlwritev' used to fail because of it. Fixed the same. Fixes: https://tracker.ceph.com/issues/68146 Signed-off-by: Kotresh HR <khiremat@redhat.com>
Problem: With cephfs nfs-ganesha, there were following asserts hit while doing write on a file. 1. FAILED ceph_assert((bool)_front == (bool)_size) 2. FAILED ceph_assert(cap_refs[c] > 0) Cause: In aio path, the client_lock was not being held in the internal callback after the io is done where it's expected to be taken leading to corruption. Fix: Take client_lock in the callback Fixes: https://tracker.ceph.com/issues/68146 Signed-off-by: Kotresh HR <khiremat@redhat.com>
When libcephfs aio tests (src/test/client) are run with objectcacher disabled (ceph_test_client --client_oc=false), the TestClient.LlreadvLlwritev fails and core dumps. The client hits the assert 'caps_ref[c]<0'. This patch fixes the same. There is no need to give out cap_ref and take it again between multiple read because of short reads. In some cases, the get_caps used to fail in C_Read_Sync_NonBlocking::finish causing cap_ref to go negative when put_cap_ref is done at last in C_Read_Finish::finish_io Fixes: https://tracker.ceph.com/issues/68308 Signed-off-by: Kotresh HR <khiremat@redhat.com>
The following test fails when run with objectcacher disabled. TestClient.LlreadvLlwritevZeroBytes Failure - nonblocking.cc ceph/src/osdc/Striper.cc: 186: FAILED ceph_assert(len > 0) Traceback: ceph version Development (no_version) squid (dev) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x125) [0x7fc0a340aafe] 2: (ceph::register_assert_context(ceph::common::CephContext*)+0) [0x7fc0a340ad20] 3: (Striper::file_to_extents(ceph::common::CephContext*, file_layout_t const*, ...)+0x184) [0x562727e13ab4] 4: (Striper::file_to_extents(ceph::common::CephContext*, char const*, ...)+0x97) [0x562727e145d1] 5: (Striper::file_to_extents(ceph::common::CephContext*, inodeno_t, ...)+0x75) [0x562727d29520] 6: (Filer::read_trunc(inodeno_t, file_layout_t const*, snapid_t, ...)+0x61) [0x562727d66ea5] 7: (Client::C_Read_Sync_NonBlocking::retry()+0x10c) [0x562727cd8a8e] 8: (Client::_read(Fh*, long, unsigned long, ceph::buffer::v15_2_0::list*, Context*)+0x578) [0x562727d10cb6] 9: (Client::_preadv_pwritev_locked(Fh*, iovec const*, int, long, bool, ...)+0x3a7) [0x562727d18159] 10: (Client::ll_preadv_pwritev(Fh*, iovec const*, int, long, bool, ...)+0x179) [0x562727d18b99] 11: (TestClient_LlreadvLlwritevZeroBytes_Test::TestBody()+0x592) [0x562727ca5352] 12: (void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, ...)+0x1b) [0x562727d9dea3] 13: (void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, ...)+0x80) [0x562727da2b26] 14: (testing::Test::Run()+0xb4) [0x562727d927ae] 15: (testing::TestInfo::Run()+0x104) [0x562727d92988] 16: (testing::TestSuite::Run()+0xb2) [0x562727d92b34] 17: (testing::internal::UnitTestImpl::RunAllTests()+0x36b) [0x562727d95303] 18: (bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, ...)(), char const*)+0x1b) [0x562727d9e15f] 19: (bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, ...)+0x80) [0x562727da3083] 20: (testing::UnitTest::Run()+0x63) [0x562727d92813] 21: (RUN_ALL_TESTS()+0x11) [0x562727c828d9] 22: main() The patch fixes the same. Fixes: https://tracker.ceph.com/issues/68309 Signed-off-by: Kotresh HR <khiremat@redhat.com>
3f42326 to
942474c
Compare
Done |
Never mind. I built the change locally before building packages in shaman. Just that when running test I'll point to a custom qa suite branch (saves time). |
|
jenkins test make check arm64 |
* refs/pull/59987/head: client: Fix aio zerobyte file read client: Fix caps_ref[c]<0 assert client: Fix libcephfs aio metadata corruption. test/client: Fix aio nonblocking test qa: Add libcephfs client test with objectcacher disabled qa: Add data read/write test for nfs-ganesha
vshankar
left a comment
There was a problem hiding this comment.
This is good to merge. Preparing run wiki now - will merge soon.
| // C_Read_Sync_NonBlocking::finish(). | ||
|
|
||
| // trim read based on file size? | ||
| if ((offset >= in->size) || (size == 0)) { |
There was a problem hiding this comment.
seeing gcc warnings from this:
ceph/src/client/Client.cc:10967:17: warning: comparison of integer expressions of different signedness: ‘int64_t’ {aka ‘long int’} and ‘uint64_t’ {aka ‘long unsigned int’} [-Wsign-compare]
10967 | if ((offset >= in->size) || (size == 0)) {
| ~~~~~~~^~~~~~~~~~~
tracked in https://tracker.ceph.com/issues/68552
Problem:
With cephfs nfs-ganesha, there were following
asserts hit while doing write on a file.
Cause:
In aio path, the client_lock was not being held
in the internal callback after the io is done where it's expected to be taken leading to corruption.
Fix:
Take client_lock in the callback
Fixes: https://tracker.ceph.com/issues/68146
Fixes: https://tracker.ceph.com/issues/68308
Fixes: https://tracker.ceph.com/issues/68309
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an
xbetween the brackets:[x]. Spaces and capitalization matter when checking off items this way.Checklist
Show available Jenkins commands
jenkins retest this pleasejenkins test classic perfjenkins test crimson perfjenkins test signedjenkins test make checkjenkins test make check arm64jenkins test submodulesjenkins test dashboardjenkins test dashboard cephadmjenkins test apijenkins test docsjenkins render docsjenkins test ceph-volume alljenkins test ceph-volume toxjenkins test windowsjenkins test rook e2e