src/test: fix to avoid fail notification when testing manifest refcount by myoungwon · Pull Request #38937 · ceph/ceph

myoungwon · 2021-01-18T03:30:09Z

Because manifest snap refcounting is false-positive by design,
a message to decrement the refcount can be lost.
When such a mismatch occurs, this commit checks whether the manifest
object's state is correct, so that the unit test does not abort.

The mismatch that happened here will be fixed by chunk scrub later.

The following is a specific scenario.
Assume that there are OSDs 1, 6, and 7, and that object A is the manifest object on osd 1.

[osd 1, 6, 7] dec_refcount is invoked by osd 1 on object A.

[osd 1, 6, 7] The manifest info in object A, including chunk_info, is updated without waiting for dec_refcount to complete.

[osd 1 is out] The message to decrement the reference cannot be delivered.

[osd 6, 7] Nothing happens, but the manifest info in object A is already updated.
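
The scenario above can be modeled in a few lines. This is a toy sketch, not Ceph code: `ToyManifest`, `ToyChunk`, `drop_reference`, and `lost_decrement_gap` are hypothetical names, and the `delivered` flag is an assumption that stands in for whether the OSD that issued dec_refcount stayed up long enough to send it.

```cpp
// Toy model only; none of these types or functions exist in Ceph.
struct ToyChunk {
  int refcount;       // refcount recorded on the chunk (target) object
};

struct ToyManifest {
  int refs_to_chunk;  // references recorded in the manifest's chunk_info
};

// Drop one reference from the manifest to the chunk.
// The manifest is updated first and does not wait for the decrement,
// so a lost decrement leaves the chunk's refcount too high.
void drop_reference(ToyManifest& src, ToyChunk& dst, bool delivered) {
  src.refs_to_chunk -= 1;  // manifest info is updated immediately
  if (delivered) {
    dst.refcount -= 1;     // dec_refcount reached the chunk object
  }
  // Otherwise the issuing OSD went out and the message is lost.
}

// Returns how far the chunk over-counts after one lost decrement.
int lost_decrement_gap() {
  ToyManifest src{2};
  ToyChunk dst{2};
  drop_reference(src, dst, /*delivered=*/false);
  return dst.refcount - src.refs_to_chunk;
}
```

In this model the chunk can over-count but never under-count, which is the mismatch the description expects chunk scrub to repair later.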

fixes: https://tracker.ceph.com/issues/48786
https://tracker.ceph.com/issues/48915
https://tracker.ceph.com/issues/47024

Signed-off-by: Myoungwon Oh <myoungwon.oh@samsung.com>

Checklist

References tracker ticket
Updates documentation if necessary
Includes tests for new functionality or reproducer for bug

myoungwon · 2021-01-18T03:32:13Z

Before going ahead, #38767 should be merged first.

athanatos · 2021-01-21T19:29:36Z

This appears to fix a problem introduced in #38767. Why not include it in that PR?

myoungwon · 2021-01-21T23:34:05Z

I opened #38767 to get the refcounting right after reviewing the current chunk scrub code. This issue was reported afterwards, so I created a separate PR. Do you think this PR should be merged into #38767?

athanatos · 2021-01-21T23:48:29Z

If it's independent, then this is fine.

src/test/librados/tier_cxx.cc

myoungwon · 2021-01-27T07:06:57Z

@tchaikov I addressed your comment.

myoungwon · 2021-01-27T12:07:29Z

@tchaikov Sorry, the failure in your recent test suite (https://pulpito.ceph.com/kchai-2021-01-27_08:19:23-rados-wip-kefu-testing-2021-01-27-1353-distro-basic-smithi/5833001/) was caused by this PR: I mistakenly added code that invokes exec() with a null string. I've re-pushed the commit to fix this, so it would be good to include this PR in your next test suite run.

tchaikov · 2021-01-28T13:53:10Z

@myoungwon ack. will rerun the failed tests with the latest change in this PR.

neha-ojha · 2021-02-04T01:57:38Z

jenkins test make check

tchaikov · 2021-02-04T04:22:17Z

myoungwon · 2021-02-04T07:30:31Z

@athanatos What do you think?

athanatos · 2021-02-04T22:02:02Z

src/test/librados/tier_cxx.cc

      ASSERT_TRUE(0);
    }
-    ASSERT_EQ(1u, refs.count());
+    if (refs.count() != 1u) {


Why these if checks? Is refs supposed to have just 1?

Yeah, the expected ref is 1 here. If the ref is not 1, the following code just checks whether the references of both source and target are correct.

I'm really not following. Why not just embed that check in is_intended_refcount_state?

myoungwon · 2021-02-05T03:27:44Z

@athanatos I added a commit. Is this what you pointed out?

athanatos · 2021-02-05T06:02:48Z

What, specifically, is causing the refcount decrement to fail? I don't see anything in this unit test that should cause that behaviour, so I wonder whether this patch would actually cover a bug.

myoungwon · 2021-02-05T06:18:01Z

Assume that there are OSDs 1, 6, and 7, and that object A is the manifest object on osd 1.

[osd 1, 6, 7] dec_refcount is invoked by osd 1 on object A.

[osd 1, 6, 7] The manifest info in object A, including chunk_info, is updated without waiting for dec_refcount to complete.

[osd 1 is out] The message to decrement the reference cannot be delivered.

[osd 6, 7] Nothing happens, but the manifest info in object A is already updated.

athanatos · 2021-02-05T07:17:22Z

Does this test actually mark osd 1 out?

athanatos · 2021-02-05T07:17:48Z

Oh, I suppose it's running with osd thrashing in the background in teuthology?

myoungwon · 2021-02-05T07:20:31Z

Yes. The case I mentioned occurs when it's running in teuthology with osd thrashing.

athanatos · 2021-02-05T07:21:18Z

src/test/librados/tier_cxx.cc

+    }
+    ceph_assert(src_refcount >= 0);
+  }
+  if (src_refcount > dst_refcount) {


Isn't this backwards? The design is that the refcount recorded on the target is always >= the real number of references, right? We always increment the refcount on the target before updating the source metadata, and update the source metadata before decrementing the target refcount, right?

Nvm, that's exactly what's happening here.

src_refcount should always match expected_refcount, right? I think you should always read dst_refcount and src_refcount and assert:

src_refcount == expected_refcount

dst_refcount >= src_refcount

Here, src means the manifest object on the upper tier, and dst means the chunked object on the low tier.
~~So, isn't dst_refcount == expected_refcount correct?~~

I don't think so, we specifically allow dst_refcount to be larger for the reasons outlined in my first comment in this chain.

Yeah, dst_refcount >= src_refcount is correct.
But expected_refcount here is used to check whether dst_refcount matches the refcount we expect.

Oh, sorry, I misunderstood (your comment is the same as what I said). The assertions

src_refcount == expected_refcount

dst_refcount >= src_refcount

are correct. Can you take a look at the commit I added?
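
The two conditions agreed on in this thread could be expressed as a single predicate. This is a minimal sketch, not the code in tier_cxx.cc: `refcounts_consistent` and its signature are hypothetical, and the three counts are assumed to have been read already.

```cpp
// Hypothetical helper; the name and signature are illustrative and are
// not taken from tier_cxx.cc.
bool refcounts_consistent(int src_refcount,
                          int dst_refcount,
                          int expected_refcount) {
  // The manifest (source) side must match what the test expects exactly.
  if (src_refcount != expected_refcount) {
    return false;
  }
  // The chunk (target) side may over-count, but must never under-count.
  return dst_refcount >= src_refcount;
}
```

With this shape, a lost decrement (for example src 1, dst 2, expected 1) still passes, while an under-counting target or a wrong source count fails.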

Due to false-positive design on manifest snap refcounting, a message to decrement the refcount can be missing. This commit checks whether the manifest object's state is correct when such mismatch happens to prevent aborting unit test.

Signed-off-by: Myoungwon Oh <myoungwon.oh@samsung.com>

myoungwon · 2021-02-09T09:27:20Z

@athanatos any other comments?

athanatos · 2021-02-09T23:37:19Z

Looks good to me. Has this gone through testing yet?

myoungwon · 2021-02-15T08:47:05Z

@tchaikov @athanatos The failure in the last test run (https://pulpito.ceph.com/kchai-2021-02-14_08:35:02-rados-wip-kefu-testing-2021-02-14-1248-distro-basic-smithi/5880425/) occurs during ManifestSnapRefcount2() because I missed that the snap removal triggered by selfmanaged_snap_remove() is done asynchronously. As a result, cls_cas_references_chunk(), which counts the source object's refcount, is delivered before trim_object() is executed. To resolve this, I added a commit that allows expected_refcounts to have multiple values, covering two cases: the snapshot has been removed, or it has not.

@tchaikov Can you please re-include this PR in your next test run?

athanatos · 2021-02-16T01:37:56Z

I'm worried that that kind of renders the test pointless -- it'll succeed even if the snap removal never actually happens or if the snap removal happens, but the refcount update is incorrect. What if you returned an error if there are untrimmed snaps? You could then loop until the snaps have been trimmed. You can check the current OSDMap for each snap to check whether it still exists (get_osdmap()->in_removed_snaps_queue(info.pgid.pgid.pool(), oid.snap)).

After calling selfmanaged_snap_remove, we don't know when snapshot trimming has finished. So, we make the OSD return EBUSY if the snapshot is in removed_snap_queue, and the unit test then waits for completion.

Signed-off-by: Myoungwon Oh <myoungwon.oh@samsung.com>
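
The wait-on-EBUSY approach described in this commit message might look roughly like the loop below on the test side. This is a sketch under stated assumptions, not the merged code: `retry_while_busy` is a hypothetical helper, `op` stands in for whatever call the test issues, and it is assumed to return 0 on success and a negative errno on failure.

```cpp
#include <cerrno>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper, not part of Ceph or librados.
// Calls `op` until it returns something other than -EBUSY, or until
// max_attempts calls have been made.
int retry_while_busy(const std::function<int()>& op,
                     int max_attempts,
                     std::chrono::milliseconds delay) {
  int r = -EBUSY;
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    r = op();
    if (r != -EBUSY) {
      return r;  // success, or an error that retrying will not fix
    }
    // Assumed meaning of -EBUSY here: snap trimming is still pending.
    std::this_thread::sleep_for(delay);
  }
  return r;  // still -EBUSY after max_attempts
}
```

Bounding the number of attempts keeps the test from hanging forever if the snapshot is never trimmed, which addresses the concern that the test should fail when snap removal does not actually happen.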

myoungwon · 2021-02-17T02:02:48Z

@athanatos Done. Make sense?

athanatos

Seems good pending a suite run.

tchaikov · 2021-02-25T06:26:46Z

github-actions bot added the tests label Jan 18, 2021

myoungwon force-pushed the fix-snap-refcount branch from 9af7397 to 9df398b Compare January 19, 2021 08:37

neha-ojha added the core label Jan 19, 2021

myoungwon requested a review from athanatos January 21, 2021 01:55

tchaikov added the wip-kefu-testing label Jan 26, 2021

myoungwon force-pushed the fix-snap-refcount branch from 9df398b to 70f82dd Compare January 27, 2021 02:50

myoungwon changed the title ~~WIP: src/test: fix to avoid fail notification when testing manifest refcount~~ src/test: fix to avoid fail notification when testing manifest refcount Jan 27, 2021

myoungwon added the bug-fix label Jan 27, 2021

tchaikov reviewed Jan 27, 2021

tchaikov reviewed Jan 27, 2021

myoungwon force-pushed the fix-snap-refcount branch from 70f82dd to e3c42fc Compare January 27, 2021 06:52

myoungwon force-pushed the fix-snap-refcount branch from e3c42fc to 4b88daf Compare January 27, 2021 12:03

neha-ojha added the needs-qa label Feb 4, 2021

athanatos reviewed Feb 4, 2021

athanatos reviewed Feb 5, 2021

myoungwon force-pushed the fix-snap-refcount branch 2 times, most recently from 9e2524a to 9ed7c48 Compare February 5, 2021 08:09

myoungwon force-pushed the fix-snap-refcount branch from 9ed7c48 to d0369dc Compare February 5, 2021 12:51

tchaikov removed the wip-kefu-testing label Feb 8, 2021

athanatos approved these changes Feb 9, 2021

tchaikov added the wip-kefu-testing label Feb 12, 2021

myoungwon force-pushed the fix-snap-refcount branch 2 times, most recently from 68a029e to d0aa7f4 Compare February 15, 2021 08:22

myoungwon force-pushed the fix-snap-refcount branch from d0aa7f4 to d6f9f23 Compare February 16, 2021 05:48

tchaikov removed the wip-kefu-testing label Feb 18, 2021

athanatos approved these changes Feb 18, 2021

tchaikov added the wip-kefu-testing label Feb 19, 2021

tchaikov merged commit e836413 into ceph:master Feb 25, 2021

myoungwon mentioned this pull request Mar 2, 2021

pacific: osd, test: fix to avoid fail notification when testing manifest refcount #39773

Merged
