Skip to content

BUG-FIX: NCB code was reporting Bogus error when we had an overlapped…#42991

Merged
yuriw merged 1 commit intoceph:masterfrom
benhanokh:suppress_bogus_errors
Sep 17, 2021
Merged

BUG-FIX: NCB code was reporting Bogus error when we had an overlapped…#42991
yuriw merged 1 commit intoceph:masterfrom
benhanokh:suppress_bogus_errors

Conversation

@benhanokh
Copy link
Contributor

NCB sanity check was aborting when found an overlapped extent between shared-blobs on the same Onode (which is a legal case)

The error check was refined to skip shared-blobs
I also added an assert for copy-allocator spillover (should never happen) and removed an early exit when an empty extent was found (we report and skip it)
Signed-off-by: Gabriel Benhanokh gbenhano@redhat.com

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

if (!blob.is_shared() && (lcl_itr != lcl_extnt_map.end()) ) {
// repeated extents must have the same length!

// --Note--
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete note.

if (lcl_itr != lcl_extnt_map.end()) {
// shared blobs might have differnt length
if (!blob.is_shared() && (lcl_itr != lcl_extnt_map.end()) ) {
// repeated extents must have the same length!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If blob is shared, then we should not damage offset map:

lcl_extnt_map[offset] = length;

if (idx > *p_num_entries) {
derr << "****spillover, num_entries=" << *p_num_entries << ", spillover=" << (idx - *p_num_entries) << dendl;
return -1;
//return -1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just delete.

ceph_assert(idx <= *p_num_entries);
}

#if 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete.

@@ -17551,7 +17559,8 @@ void BlueStore::read_allocation_from_single_onode(

// skip repeating extents
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are 2 reasonable long term solutions:

  1. make offset check complete (preferred)
  2. drop it altogether

@tchaikov

This comment has been minimized.

@benhanokh benhanokh force-pushed the suppress_bogus_errors branch 2 times, most recently from 8466536 to fbf444d Compare September 1, 2021 17:30
…tents

Fixes: https://tracker.ceph.com/issues/52138

BUG-FIX: NCB code was reporting Bogus error when we had an overlapped extent between shared-blobs on the same Onode (which is a legal case)
The error check was refined to skip shared-blobs
I also added an assert for copy-allocator spillover (should never happen) and removed an early exit when an empty extent was found (we report and skip it)
Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
@benhanokh benhanokh force-pushed the suppress_bogus_errors branch from fbf444d to d33acd8 Compare September 1, 2021 17:33
@sseshasa
Copy link
Contributor

sseshasa commented Sep 9, 2021

Teuthology Testing Result:

Original Run:
https://pulpito.ceph.com/yuriw-2021-09-07_21:38:25-rados-wip-yuri2-testing-2021-09-07-1258-distro-basic-smithi/

Re-run of dead and failed jobs from the above run:
https://pulpito.ceph.com/yuriw-2021-09-08_15:10:21-rados-wip-yuri2-testing-2021-09-07-1258-distro-basic-smithi/

Failures from the Re-run:

  1. https://pulpito.ceph.com/yuriw-2021-09-08_15:10:21-rados-wip-yuri2-testing-2021-09-07-1258-distro-basic-smithi/6379886

    osd.0 didn't respond to the following command after it was revived as part of _do_thrash().
    First error:

  • 2021-09-08T15:47:47.945 DEBUG:teuthology.orchestra.run.smithi064:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph -- tell osd.0 injectargs --filestore_debug_random_read_err=0.0

  • 2021-09-08T15:47:48.067 INFO:teuthology.orchestra.run.smithi064.stderr:Error ENXIO: problem getting command descriptions from osd.0
    New tracker added: https://tracker.ceph.com/issues/52562
    @benhanokh Could you please help take a look at the logs for this failure and see if it is related to this PR and update the above tracker?

  1. https://pulpito.ceph.com/yuriw-2021-09-08_15:10:21-rados-wip-yuri2-testing-2021-09-07-1258-distro-basic-smithi/6379887

    test_dashboard_e2e.sh failure in orchestrator/01-hosts-force-maintenance.e2e-spec.ts
    Tracker: https://tracker.ceph.com/issues/51728 (Fix under review)

  2. https://pulpito.ceph.com/yuriw-2021-09-08_15:10:21-rados-wip-yuri2-testing-2021-09-07-1258-distro-basic-smithi/6379888

    Rook related: 'check osd count' reached maximum tries
    Tracked by: https://tracker.ceph.com/issues/52321

  3. https://pulpito.ceph.com/yuriw-2021-09-08_15:10:21-rados-wip-yuri2-testing-2021-09-07-1258-distro-basic-smithi/6379899

    Rook related: 'check osd count' reached maximum tries
    Tracked by: https://tracker.ceph.com/issues/52321

  4. https://pulpito.ceph.com/yuriw-2021-09-08_15:10:21-rados-wip-yuri2-testing-2021-09-07-1258-distro-basic-smithi/6379901

    Cephadm related: 2021-09-08T15:25:50.215 INFO:teuthology.orchestra.run.smithi043.stderr:Error: OCI runtime error: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: process_linux.go:422: setting cgroup config for procHooks process caused: Unit libpod 86831cf6c897f95a9d72d8d1b74dab77bc37c86c0b5dceba03fc16656e6f160d.scope not found. 2021-09-08T15:25:50.235 DEBUG:teuthology.orchestra.run:got remote process result: 127
    Tracked by: https://tracker.ceph.com/issues/49287

  5. https://pulpito.ceph.com/yuriw-2021-09-08_15:10:21-rados-wip-yuri2-testing-2021-09-07-1258-distro-basic-smithi/6379908

    First error reported: raise RuntimeError("Synthetic exception in serve")
    Test that failed: tasks.mgr.test_module_selftest.TestModuleSelftest.
    Tracked by: https://tracker.ceph.com/issues/38455

  6. https://pulpito.ceph.com/yuriw-2021-09-08_15:10:21-rados-wip-yuri2-testing-2021-09-07-1258-distro-basic-smithi/6379917

    Same as 2 above.

@yuriw
Copy link
Contributor

yuriw commented Sep 17, 2021

From @neha-ojha approval:

No related failures, existing ones tracked in

https://tracker.ceph.com/issues/52621
https://tracker.ceph.com/issues/49287
https://tracker.ceph.com/issues/51815
https://tracker.ceph.com/issues/52652

@yuriw yuriw merged commit 98870af into ceph:master Sep 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants