osd: avoid two copy with same src cancel each other #39593
tchaikov merged 1 commit into ceph:master
|
@tchaikov please help to have a look |
|
@liewegas please help to have a look |
|
I wonder if it would be cleaner to pull the
|
c24105e to c2318c6
|
@liewegas Good idea, done. Please help review. |
|
Code looks good. Is this a case you can reproduce? It would be great to add a test for it |
|
Let me try it. |
a972035 to 2296f83
|
@liewegas I reproduced it and updated the description: it is not two rollback ops. The first op is a rollback and the second is a snap read. Please help review. |
|
@neha-ojha Please help review, thanks. |
2296f83 to 88031b5
|
@tchaikov I have rebased it, is that ok? |
|
@mychoxin thanks! looks great! |
80c890c to 9e77de1
neha-ojha left a comment
makes sense to me, @myoungwon @athanatos WDYT?
nit: the description in #39593 (comment) looks great, can we please add the same to the commit description as well?
For a cache tier, suppose a head object has two snaps that share the same clone object, and the clone object has been flushed/evicted from the cache pool. If a rollback request and a snap read request target these two snaps at the same time, they generate two promote requests for the same clone object, and those promote requests generate two copy ops with the same src. The second copy op then cancels the first by calling cancel_copy and kick_object_context_blocked. After kick_object_context_blocked is called, however, the promote request behind the first copy op is restarted and generates a new copy op, which in turn cancels the second copy op. The two promote requests therefore keep cancelling each other's copy ops and run into a dead loop.
Fixes: https://tracker.ceph.com/issues/49409
Signed-off-by: YuanXin <yuanxin@didiglobal.com>
9e77de1 to 617f711
done |
|
LGTM |
|
lgtm |
|
@liewegas Is there any question? |
|
@mychoxin it's just pending on a rados suite run. |
|
OK, it reports that ceph_test_rados crashed, but gives no details about where it crashed. |
|
@myoungwon Can you help @mychoxin interpret the test failure? |
|
@mychoxin @myoungwon following is an excerpt from |
|
@athanatos @tchaikov I'll take a look. |
|
@mychoxin @myoungwon I am not able to reproduce the failure when rerunning the failed test |
|
this change is not related: https://pulpito.ceph.com/kchai-2021-03-08_10:48:51-rados-wip-kefu2-testing-2021-03-08-1335-distro-basic-smithi/ |
https://tracker.ceph.com/issues/49726 is showing up in almost every rados run on master after this PR merged. @mychoxin can you please address this? |
|
#40057 has been created in the meantime, until we have a fix. |
|
If it's that common, probably worth just reverting it. |
|
@athanatos pretty reproducible. see https://pulpito.ceph.com/kchai-2021-03-12_05:42:44-rados-wip-kefu-testing-2021-03-12-1106-distro-basic-smithi/5958481/. just added Signed-off-by and Fixes tags to #40057, and will test it in my next batch. |
|
@athanatos @tchaikov @neha-ojha Please look at the following.
At this point, the original code invokes kick_object_context_blocked() and then cop->cb->complete(), in that order. So I think we can avoid this and resolve the issue this PR addresses via #40067? |
|
@mychoxin just reverted this change in #40057. @myoungwon thanks for looking into this. at first glance, your analysis makes sense to me! but before your fix lands on master, i think we'd better address the test failure first. also, could you include this change in #40067, so we can test them at the same time? |
|
@tchaikov ok. |
osd: avoid two copy with same src cancel each other and run into dead loop
For a cache tier, suppose a head object has two snaps that share the same clone object,
and the clone object has been flushed/evicted from the cache pool. If a rollback request and
a snap read request target these two snaps at the same time, they generate two promote
requests for the same clone object, and those promote requests generate two copy ops with the
same src. The second copy op then cancels the first by calling cancel_copy and
kick_object_context_blocked. After kick_object_context_blocked is called, however, the promote
request behind the first copy op is restarted and generates a new copy op, which in turn
cancels the second copy op. The two promote requests therefore keep cancelling each other's
copy ops and run into a dead loop.
Fixes: https://tracker.ceph.com/issues/49409
Signed-off-by: YuanXin <yuanxin@didiglobal.com>
Signed-off-by: mychoxin <mychoxin@gmail.com>
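The cancel loop described above can be illustrated with a small toy model. This is only a sketch, not Ceph code: `Policy`, `CopyOp`, and `simulate` are hypothetical names, promote requests are reduced to integer ids, and a copy is just a countdown. `CancelExisting` mimics the behaviour in the description (a new copy op on the same src cancels the in-flight one and the cancelled request is kicked and restarted). `WaitForExisting` is an assumed alternative in which the second request simply waits, shown only for contrast, and is not necessarily what this PR implements.

```cpp
#include <cassert>
#include <deque>
#include <optional>

// Toy model only: none of these types exist in Ceph.
enum class Policy { CancelExisting, WaitForExisting };

struct CopyOp {
  int owner;       // id of the promote request that started this copy
  int ticks_left;  // steps of work remaining before the copy completes
};

// Returns the step at which the shared clone object is promoted, or
// std::nullopt if no copy completes within max_steps.
std::optional<int> simulate(Policy policy, int copy_duration, int max_steps) {
  assert(copy_duration > 0 && max_steps > 0);
  std::deque<int> runnable = {0, 1};  // two promote requests, same src
  std::deque<int> blocked;            // requests waiting on the in-flight copy
  std::optional<CopyOp> in_flight;

  for (int step = 1; step <= max_steps; ++step) {
    if (!runnable.empty()) {
      int req = runnable.front();
      runnable.pop_front();
      if (!in_flight) {
        in_flight = CopyOp{req, copy_duration};
      } else if (policy == Policy::CancelExisting) {
        // Cancel the in-flight copy and kick its owner, which restarts
        // and will later cancel the copy started here.
        runnable.push_back(in_flight->owner);
        in_flight = CopyOp{req, copy_duration};
      } else {
        // Assumed alternative: wait for the copy that is already running.
        blocked.push_back(req);
      }
    }
    // The in-flight copy makes one step of progress.
    if (in_flight && --in_flight->ticks_left == 0) {
      return step;  // object promoted; every pending request can proceed
    }
  }
  return std::nullopt;
}
```

In this model, whenever a copy needs more than one step, `simulate(Policy::CancelExisting, 3, 1000)` never completes because each restarted request resets the countdown, while `simulate(Policy::WaitForExisting, 3, 1000)` finishes at step 3.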