os/bluestore: Fix race condition in Onode:put()#48566
The race condition happens when an Onode is unpinned in one "put" thread and is trimmed right away (after the cache lock is released) by another thread.
Fixes: https://tracker.ceph.com/issues/57895
Signed-off-by: dongdong tao <dongdong.tao@canonical.com>
@taodd Thanks a ton for digging into this. You are right, this is fairly core code and this isn't the first time we've hit issues here. Added a couple of the relevant folks as potential reviewers.
@ifedo01 mentioned he has a (bigger) PR that may also fix this issue, but I haven't looked it over yet: #47702
    }
    auto pn = --put_nref;
    if (nref == 0 && pn == 0) {
    if (n == 0 && put_nref == 0) {
First of all, caching an onode is not mandatory, so two threads might each own the onode independently, with no references from c->onode_map.
So we might have two threads owning an onode with nref == 2. Finally both threads call put()...
Case 1:
Thread A makes nref == 1 and goes through the if (n == 1) block, but before it reaches --put_nref, thread B falls through and deletes the onode. At this point thread A operates on a released onode.
Case 2:
Thread A makes nref == 1 and reaches n = nref. At this point thread B makes nref == 0 and falls through to the return from put() without releasing the onode, because put_nref != 0. Then thread A continues and bypasses the delete as well, because n == 1. Hence the onode leaks.
Generally, my idea behind the put_nref increment/decrement is that it has to "wrap" the other manipulations on the onode within put(). So its increment has to be the first op in put() and its decrement the last one before the delete. I haven't achieved that completely, but it looks like you're moving even further from the original idea...
So we don't have a good enough fix for now :(
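To make the "wrap" idea concrete, here is a minimal single-threaded sketch. It is not the BlueStore code: `Node`, `in_map` and the `destroyed` hook are hypothetical names, and it assumes that the cache-removal step inside put() can itself drop a reference and so re-enter put(). It only shows why the put_nref increment comes first and the decrement last; it says nothing about the cross-thread interleavings in Case 1 and Case 2.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical, simplified model for illustration; not the BlueStore Onode.
struct Node {
  std::atomic<int> nref{0};      // number of owners of the node
  std::atomic<int> put_nref{0};  // number of put() calls currently in progress
  bool in_map = false;           // stand-in for a reference held by a cache map
  int* destroyed;                // test hook: counts destructor runs

  explicit Node(int* d) : destroyed(d) {}
  ~Node() { ++*destroyed; }

  void get() { ++nref; }

  void put() {
    ++put_nref;                  // first operation in put()
    int n = --nref;
    if (n == 1 && in_map) {
      // Assumption for this sketch: once only the map's reference is left,
      // the map drops it too, which re-enters put() on the same node.
      in_map = false;
      put();
    }
    int pn = --put_nref;         // last operation before the delete decision
    if (nref == 0 && pn == 0) {
      delete this;               // only the outermost put() gets here
    }
  }
};
```

In this model the nested put() sees put_nref == 1 after its own decrement and therefore leaves the delete to the outer call, which is still using the object.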
Thank you very much for your review.
I really did assume that caching an Onode is mandatory: an Onode had to be added to the cache before it could be referenced.
Could you please give me some examples where an Onode might be referenced without being added to the cache? (Is it deep-scrub?) Thanks a lot :)
@taodd - I don't have any real-life examples at hand at the moment. But generally:
a) it's possible with the current onode design.
b) it's bad practice to have a dependency between the Onode use case (e.g. whether we put it into the cache or not) and its life-cycle tracking (aka ref counting). The latter has to be completely use-case agnostic. Even if we don't actively use this mode at the moment, one can start using it in the future...
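For comparison, a use-case-agnostic count in its simplest form could look like the sketch below. This is plain intrusive reference counting, not what Onode::put() does, since it ignores the cache pin/unpin handling entirely; `RefCounted` and the `destroyed` hook are hypothetical names used only for illustration.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical type for illustration; not part of Ceph.
class RefCounted {
 public:
  explicit RefCounted(std::atomic<int>* destroyed) : destroyed_(destroyed) {}

  void get() { nref_.fetch_add(1, std::memory_order_relaxed); }

  void put() {
    // fetch_sub returns the value held before the decrement, so exactly one
    // caller observes 1. That caller alone deletes, and nobody touches the
    // object after their own decrement.
    if (nref_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
      delete this;
    }
  }

 private:
  // Private destructor: the object can only be destroyed through put().
  ~RefCounted() { destroyed_->fetch_add(1); }

  std::atomic<int> nref_{0};
  std::atomic<int>* destroyed_;  // test hook: counts destructor runs
};
```

The delete decision here depends only on the result of the atomic decrement, never on a later re-read of the counter, so it does not matter whether the object is also referenced from a cache.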
For example, an onode held by two threads while the onode itself has been trimmed from the cache.
@taodd I have a question: why doesn't Onode::put() use an onode lock to make the check and the delete a single atomic operation?
This PR can actually be closed. Please see #47702