os/bluestore: avoid race between split_cache and get/put pin/unpin#32665
os/bluestore: avoid race between split_cache and get/put pin/unpin#32665liewegas merged 2 commits intoceph:masterfrom
Conversation
markhpc
left a comment
There was a problem hiding this comment.
So this feels pretty hacky to me, but the only alternatives I can think of are:
- a new lock in the onode to fully protect s
- we make s write-once, don't allow a given onode to migrate between caches, and protect s via DCLP ala https://preshing.com/20130930/double-checked-locking-is-fixed-in-cpp11/
2 would be faster than 1 for reads but still slow for Onode cache insertion.
In the end Sage is probably right that this is the better approach though we should probably document that the onode's get/put methods are not thread-safe in the Onode struct. We may still want the slight optimization from #32536
Signed-off-by: Sage Weil <sage@redhat.com>
The Onode get() and put() methods may call pin() and unpin() when we transition between 1 and >1 ref. To do this they make use of the OSDShard *s pointer without taking any additional locks. This runs afoul of Collection::split_cache(), which moves an onode between shards. It would be very complicated to address this race head-on: we ultimately need to be under the protction of the OSDShard lock to do the pin/unpin, but if OSDShard *s is changing, we don't know which lock to take. And if it is null, what do we do? It might be null when we test but then get set by split_cache. And what if there is a put() followed by a get(), and they managed to acquire the appropriate lock(s), but the get() thread gets it first? And so on. We can avoid this whole mess by preventing a put() or get() from making this transition (and looking at OSDShard *s) at all. The only reason nref was *ever* < 2 is because the sequence was - remove from old collection onode_map - move onode to new shard - add to new collection onode_map The fix is to simply - remove from old colleciton onode_map - add to new collection onode_map - adjust onode shard That ensures that the onode's nref is >= 2 at all times. At the same time, improve this code so that we don't _rm and _add when the src and dest shard are the same. Fixes: https://tracker.ceph.com/issues/43147 Fixes: https://tracker.ceph.com/issues/43131 Signed-off-by: Sage Weil <sage@redhat.com>
After looking at this a bit more closely, the only reasdon nref ever was allows to drop below 2 in the first place was the subtle ordering of the update. Before, we would
The fix just changes this ordering to
and we now know that nref is always >= 2 while the osdshard fiddling is going on and the race is no longer possible. |
|
The problem is that in that PR, line 3739 needs to happen before _rm and _add calls. That's what keeps nref >= 2 at all times.
|
|
@liewegas I must be missing something, I think that's exactly what's in the other PR? Does the placement of the o->c = dest line matter here? ceph/src/os/bluestore/BlueStore.cc Lines 3734 to 3740 in 598208d vs ceph/src/os/bluestore/BlueStore.cc Lines 3739 to 3750 in 828da5b |
Hmm, yeah my reading is that the other PR at ceph/src/os/bluestore/BlueStore.cc Lines 3734 to 3740 in 598208d |
|
@liewegas Kefu's test run didn't show the original issue from https://tracker.ceph.com/issues/43147 after my PR, but did hit the assert here https://tracker.ceph.com/issues/43147 during thrash (especially EC thrash) tests. Presumably your PR should hit the same assert since it's almost identical. |
|
IMO pin logic can be totally simplified. |
* refs/pull/32665/head: os/bluestore: avoid race between split_cache and get/put pin/unpin os/bluestore: remove no-op line from split_cache Reviewed-by: Mark Nelson <mnelson@redhat.com>
The Onode get() and put() methods may call pin() and unpin() when we
transition between 1 and >1 ref. To do this they make use of the
OSDShard *s pointer without taking any additional locks. This runs afoul
of Collection::split_cache(), which moves an onode between shards.
It would be very complicated to address this race head-on: we ultimately
need to be under the protction of the OSDShard lock to do the pin/unpin,
but if OSDShard *s is changing, we don't know which lock to take. And if
it is null, what do we do? It might be null when we test but then get
set by split_cache. And what if there is a put() followed by a get(),
and they managed to acquire the appropriate lock(s), but the get() thread
gets it first? And so on.
We can avoid this whole mess by preventing a put() or get() from making
this transition (and looking at OSDShard *s) at all--we simply have to
take an additional ref in split_cache() so that we are certain nref >= 2.
Fixes: https://tracker.ceph.com/issues/43147
Fixes: https://tracker.ceph.com/issues/43131
Signed-off-by: Sage Weil sage@redhat.com