Skip to content

src/test: fix to avoid fail notification when testing manifest refcount#38937

Merged
tchaikov merged 2 commits intoceph:masterfrom
myoungwon:fix-snap-refcount
Feb 25, 2021
Merged

src/test: fix to avoid fail notification when testing manifest refcount#38937
tchaikov merged 2 commits intoceph:masterfrom
myoungwon:fix-snap-refcount

Conversation

@myoungwon
Copy link
Member

@myoungwon myoungwon commented Jan 18, 2021

Due to false-positive design on manifest snap refcounting,
a message to decrement the refcount can be missing.
This commit checks whether the manifest object's state is correct
when such mismatch happens to prevent aborting unit test.

The mismatch that happened here will be fixed by chunk scrub later.

The following is a specific scenario.
Assuming that there are osd 1, 6, 7, and object A is the manifest object in osd 1.

[osd 1, 6, 7], dec_refcount is invoked by osd 1 on object A

[osd 1, 6, 7], manifest info including chunk_info in object A is updated without waiting dec_refcount's completion

[osd 1 is out] a message to decrement the referece can not be delivered

[osd 6, 7] nothing happens, but the manifest info in the object A is already updated.

fixes: https://tracker.ceph.com/issues/48786
https://tracker.ceph.com/issues/48915
https://tracker.ceph.com/issues/47024

Signed-off-by: Myoungwon Oh myoungwon.oh@samsung.com

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@github-actions github-actions bot added the tests label Jan 18, 2021
@myoungwon
Copy link
Member Author

Before going ahead, #38767 should be merged first.

@athanatos
Copy link
Contributor

This appears to fix a problem introduced in #38767, why not include this in that PR?

@myoungwon
Copy link
Member Author

I opend #38767 for appropriate refcounting after reviewing current chunk scrub code. Then, this issue is reported. So, I created another PR. Do you think this pr needs to be merged into #38767?

@athanatos
Copy link
Contributor

athanatos commented Jan 21, 2021

If it's independent, then this is fine.

@myoungwon myoungwon changed the title WIP: src/test: fix to avoid fail notification when testing manifest refcount src/test: fix to avoid fail notification when testing manifest refcount Jan 27, 2021
@myoungwon
Copy link
Member Author

@tchaikov I addressed your comment.

@myoungwon
Copy link
Member Author

@tchaikov Sorry, the failure in your recent test suite (https://pulpito.ceph.com/kchai-2021-01-27_08:19:23-rados-wip-kefu-testing-2021-01-27-1353-distro-basic-smithi/5833001/) caused by this PR because I added the code mistakenly, which invokes exec() with null string. I've re-pushed the commit to fix this, so it would be good if you include this PR when your next test suite runs.

@tchaikov
Copy link
Contributor

@myoungwon ack. will rerun the failed tests with the latest change in this PR.

@neha-ojha
Copy link
Member

jenkins test make check

@myoungwon
Copy link
Member Author

@athanatos What do you think?

ASSERT_TRUE(0);
}
ASSERT_EQ(1u, refs.count());
if (refs.count() != 1u) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why these if checks? Is refs supposed to have just 1?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, The expected ref is 1 here. If the ref is not 1, just check whether the reference of both source and target is correct or not by the following code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really not following. Why not just embed that check in is_intended_refcount_state?

@myoungwon
Copy link
Member Author

@athanatos I added a commit. Is this what you pointed out?

@athanatos
Copy link
Contributor

What, specifically, is causing the refcount decrement to fail? I don't see anything in this unit test that should cause that behaviour, so I wonder whether this patch would actually cover a bug.

@myoungwon
Copy link
Member Author

Assuming that there are osd 1, 6, 7, and object A is the manifest object in osd 1.

[osd 1, 6, 7], dec_refcount is invoked by osd 1 on object A

[osd 1, 6, 7], manifest info including chunk_info in object A is updated without waiting dec_refcount's completion

[osd 1 is out] a message to decrement the referece can not be delivered

[osd 6, 7] nothing happens, but the manifest info in the object A is already updated.

@athanatos
Copy link
Contributor

Does this test actually mark osd 1 out?

@athanatos
Copy link
Contributor

Oh, I suppose it's running with osd thrashing in the background in teuthology?

@myoungwon
Copy link
Member Author

Yes. The case I mentioned occurs when it's running in teuthology with osd thrashing.

}
ceph_assert(src_refcount >= 0);
}
if (src_refcount > dst_refcount) {
Copy link
Contributor

@athanatos athanatos Feb 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this backwards? The design is that the refcount recorded on the target is always >= than the real number of references, right? We always increment the refcount on the target before updating the source metadata and update the source metadata before decrementing the target refcount, right?

Nvm, that's exactly what's happening here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src_refcount should always match expected_refcount, right? I think you should always read dst_refcount and src_refcount and assert:

  1. src_refcount == expected_refcount
  2. dst_refcount >= src_refcount

Copy link
Member Author

@myoungwon myoungwon Feb 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src means here is the manifest object on the upper tier, and dst means that the the chunked object on the low tier.
So, isn't dst_refcount == expected_refcount correct?

Copy link
Contributor

@athanatos athanatos Feb 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, we specifically allow dst_refcount to be larger for the reasons outlined in my first comment in this chain.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, dst_refcount >= src_refcount is correct.
but expected_refcount here is to check whether dst_refcount is the refcount as our expectation or not.

Copy link
Member Author

@myoungwon myoungwon Feb 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry, I misunderstood (your comment is the same as what I said). assert:

src_refcount == expected_refcount
dst_refcount >= src_refcount

are correct. Can you take a look a commit I added?

@myoungwon myoungwon force-pushed the fix-snap-refcount branch 2 times, most recently from 9e2524a to 9ed7c48 Compare February 5, 2021 08:09
Due to false-positive design on manifest snap refcounting,
a message to decrement the refcount can be missing.
This commit checks whether the manifest object's state is correct
when such mismatch happens to prevent aborting unit test.

Signed-off-by: Myoungwon Oh <myoungwon.oh@samsung.com>
@myoungwon
Copy link
Member Author

@athanatos any other comments?

@athanatos
Copy link
Contributor

Looks good to me. Has this gone through testing yet?

@myoungwon myoungwon force-pushed the fix-snap-refcount branch 2 times, most recently from 68a029e to d0aa7f4 Compare February 15, 2021 08:22
@myoungwon
Copy link
Member Author

@tchaikov @athanatos The fail in the last test run (https://pulpito.ceph.com/kchai-2021-02-14_08:35:02-rados-wip-kefu-testing-2021-02-14-1248-distro-basic-smithi/5880425/) occurs during ManifestSnapRefcount2() because I didn't catch a point that snap_remove is done asynchronously by selfmanaged_snap_remove(). So, cls_cas_references_chunk() to count source object's refcount is delivered before trim_object() is executed. To resolve this, I added a commit, which allow expected_refcounts to have multiple values, which considers two cases; the snapshot is removed or not.

@tchaikov Can you please re-include this PR when your next test run?

@athanatos
Copy link
Contributor

I'm worried that that kind of renders the test pointless -- it'll succeed even if the snap removal never actually happens or if the snap removal happens, but the refcount update is incorrect. What if you returned an error if there are untrimmed snaps? You could then loop until the snaps have been trimmed. You can check the current OSDMap for each snap to check whether it still exists (get_osdmap()->in_removed_snaps_queue(info.pgid.pgid.pool(), oid.snap)).

After calling selfmanaged_snap_remove, we don't know
when trimming snapshot is finished.
So, we make the OSD to return EBUSY if the snapshot in removed_snap_queue,
then the unit test waits the completion

Signed-off-by: Myoungwon Oh <myoungwon.oh@samsung.com>
@myoungwon
Copy link
Member Author

@athanatos Done. Make sense?

Copy link
Contributor

@athanatos athanatos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems good pending a suite run.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants