src/test: fix to avoid fail notification when testing manifest refcount#38937
src/test: fix to avoid fail notification when testing manifest refcount#38937tchaikov merged 2 commits intoceph:masterfrom
Conversation
|
Before going ahead, #38767 should be merged first. |
9af7397 to
9df398b
Compare
|
This appears to fix a problem introduced in #38767, why not include this in that PR? |
|
If it's independent, then this is fine. |
9df398b to
70f82dd
Compare
70f82dd to
e3c42fc
Compare
|
@tchaikov I addressed your comment. |
e3c42fc to
4b88daf
Compare
|
@tchaikov Sorry, the failure in your recent test suite (https://pulpito.ceph.com/kchai-2021-01-27_08:19:23-rados-wip-kefu-testing-2021-01-27-1353-distro-basic-smithi/5833001/) caused by this PR because I added the code mistakenly, which invokes exec() with null string. I've re-pushed the commit to fix this, so it would be good if you include this PR when your next test suite runs. |
|
@myoungwon ack. will rerun the failed tests with the latest change in this PR. |
|
jenkins test make check |
|
@athanatos What do you think? |
src/test/librados/tier_cxx.cc
Outdated
| ASSERT_TRUE(0); | ||
| } | ||
| ASSERT_EQ(1u, refs.count()); | ||
| if (refs.count() != 1u) { |
There was a problem hiding this comment.
Why these if checks? Is refs supposed to have just 1?
There was a problem hiding this comment.
Yeah, The expected ref is 1 here. If the ref is not 1, just check whether the reference of both source and target is correct or not by the following code.
There was a problem hiding this comment.
I'm really not following. Why not just embed that check in is_intended_refcount_state?
|
@athanatos I added a commit. Is this what you pointed out? |
|
What, specifically, is causing the refcount decrement to fail? I don't see anything in this unit test that should cause that behaviour, so I wonder whether this patch would actually cover a bug. |
|
Assuming that there are osd 1, 6, 7, and object A is the manifest object in osd 1. [osd 1, 6, 7], dec_refcount is invoked by osd 1 on object A [osd 1, 6, 7], manifest info including chunk_info in object A is updated without waiting dec_refcount's completion [osd 1 is out] a message to decrement the referece can not be delivered [osd 6, 7] nothing happens, but the manifest info in the object A is already updated. |
|
Does this test actually mark osd 1 out? |
|
Oh, I suppose it's running with osd thrashing in the background in teuthology? |
|
Yes. The case I mentioned occurs when it's running in teuthology with osd thrashing. |
src/test/librados/tier_cxx.cc
Outdated
| } | ||
| ceph_assert(src_refcount >= 0); | ||
| } | ||
| if (src_refcount > dst_refcount) { |
There was a problem hiding this comment.
Isn't this backwards? The design is that the refcount recorded on the target is always >= than the real number of references, right? We always increment the refcount on the target before updating the source metadata and update the source metadata before decrementing the target refcount, right?
Nvm, that's exactly what's happening here.
There was a problem hiding this comment.
src_refcount should always match expected_refcount, right? I think you should always read dst_refcount and src_refcount and assert:
- src_refcount == expected_refcount
- dst_refcount >= src_refcount
There was a problem hiding this comment.
src means here is the manifest object on the upper tier, and dst means that the the chunked object on the low tier.
So, isn't dst_refcount == expected_refcount correct?
There was a problem hiding this comment.
I don't think so, we specifically allow dst_refcount to be larger for the reasons outlined in my first comment in this chain.
There was a problem hiding this comment.
Yeah, dst_refcount >= src_refcount is correct.
but expected_refcount here is to check whether dst_refcount is the refcount as our expectation or not.
There was a problem hiding this comment.
Oh, sorry, I misunderstood (your comment is the same as what I said). assert:
src_refcount == expected_refcount
dst_refcount >= src_refcount
are correct. Can you take a look a commit I added?
9e2524a to
9ed7c48
Compare
Due to false-positive design on manifest snap refcounting, a message to decrement the refcount can be missing. This commit checks whether the manifest object's state is correct when such mismatch happens to prevent aborting unit test. Signed-off-by: Myoungwon Oh <myoungwon.oh@samsung.com>
9ed7c48 to
d0369dc
Compare
|
@athanatos any other comments? |
|
Looks good to me. Has this gone through testing yet? |
68a029e to
d0aa7f4
Compare
|
@tchaikov @athanatos The fail in the last test run (https://pulpito.ceph.com/kchai-2021-02-14_08:35:02-rados-wip-kefu-testing-2021-02-14-1248-distro-basic-smithi/5880425/) occurs during ManifestSnapRefcount2() because I didn't catch a point that snap_remove is done asynchronously by selfmanaged_snap_remove(). So, cls_cas_references_chunk() to count source object's refcount is delivered before trim_object() is executed. To resolve this, I added a commit, which allow expected_refcounts to have multiple values, which considers two cases; the snapshot is removed or not. @tchaikov Can you please re-include this PR when your next test run? |
|
I'm worried that that kind of renders the test pointless -- it'll succeed even if the snap removal never actually happens or if the snap removal happens, but the refcount update is incorrect. What if you returned an error if there are untrimmed snaps? You could then loop until the snaps have been trimmed. You can check the current OSDMap for each snap to check whether it still exists (get_osdmap()->in_removed_snaps_queue(info.pgid.pgid.pool(), oid.snap)). |
After calling selfmanaged_snap_remove, we don't know when trimming snapshot is finished. So, we make the OSD to return EBUSY if the snapshot in removed_snap_queue, then the unit test waits the completion Signed-off-by: Myoungwon Oh <myoungwon.oh@samsung.com>
d0aa7f4 to
d6f9f23
Compare
|
@athanatos Done. Make sense? |
athanatos
left a comment
There was a problem hiding this comment.
Seems good pending a suite run.
Due to false-positive design on manifest snap refcounting,
a message to decrement the refcount can be missing.
This commit checks whether the manifest object's state is correct
when such mismatch happens to prevent aborting unit test.
The mismatch that happened here will be fixed by chunk scrub later.
The following is a specific scenario.
Assuming that there are osd 1, 6, 7, and object A is the manifest object in osd 1.
[osd 1, 6, 7], dec_refcount is invoked by osd 1 on object A
[osd 1, 6, 7], manifest info including chunk_info in object A is updated without waiting dec_refcount's completion
[osd 1 is out] a message to decrement the referece can not be delivered
[osd 6, 7] nothing happens, but the manifest info in the object A is already updated.
fixes: https://tracker.ceph.com/issues/48786
https://tracker.ceph.com/issues/48915
https://tracker.ceph.com/issues/47024
Signed-off-by: Myoungwon Oh myoungwon.oh@samsung.com
Checklist
Show available Jenkins commands
jenkins retest this pleasejenkins test classic perfjenkins test crimson perfjenkins test signedjenkins test make checkjenkins test make check arm64jenkins test submodulesjenkins test dashboardjenkins test apijenkins test docsjenkins render docsjenkins test ceph-volume alljenkins test ceph-volume tox