os/bluestore: avoid race between split_cache and get/put pin/unpin by liewegas · Pull Request #32665 · ceph/ceph

liewegas · 2020-01-15T22:43:01Z

The Onode get() and put() methods may call pin() and unpin() when we
transition between 1 and >1 ref. To do this they make use of the
OSDShard *s pointer without taking any additional locks. This runs afoul
of Collection::split_cache(), which moves an onode between shards.

It would be very complicated to address this race head-on: we ultimately
need to be under the protection of the OSDShard lock to do the pin/unpin,
but if OSDShard *s is changing, we don't know which lock to take. And if
it is null, what do we do? It might be null when we test but then get
set by split_cache. And what if there is a put() followed by a get(),
and they managed to acquire the appropriate lock(s), but the get() thread
gets it first? And so on.

We can avoid this whole mess by preventing a put() or get() from making
this transition (and looking at OSDShard *s) at all--we simply have to
take an additional ref in split_cache() so that we are certain nref >= 2.

Fixes: https://tracker.ceph.com/issues/43147
Fixes: https://tracker.ceph.com/issues/43131
Signed-off-by: Sage Weil sage@redhat.com

markhpc

So this feels pretty hacky to me, but the only alternatives I can think of are:

1. a new lock in the onode to fully protect s
2. we make s write-once, don't allow a given onode to migrate between caches, and protect s via DCLP ala https://preshing.com/20130930/double-checked-locking-is-fixed-in-cpp11/

Option 2 would be faster than option 1 for reads but still slow for Onode cache insertion.

In the end Sage is probably right that this is the better approach, though we should probably document in the Onode struct that the onode's get/put methods are not thread-safe. We may still want the slight optimization from #32536
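
For reference, the second alternative (a write-once s guarded by double-checked locking) might look roughly like the sketch below. It follows the C++11 pattern from the linked article; `OnodeSketch`, `Shard`, `init_lock`, and `shard_or_assign` are hypothetical names for illustration and are not BlueStore code.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical stand-in for OSDShard; not the real BlueStore type.
struct Shard { int id; };

// Hypothetical onode whose shard pointer is assigned at most once.
class OnodeSketch {
  std::atomic<Shard*> s{nullptr};
  std::mutex init_lock;  // only taken on the slow (first assignment) path

public:
  // Returns the shard this onode belongs to, assigning `candidate`
  // only if no shard has been set yet.
  Shard* shard_or_assign(Shard* candidate) {
    // Fast path: a single acquire load, no lock.
    Shard* cur = s.load(std::memory_order_acquire);
    if (cur == nullptr) {
      std::lock_guard<std::mutex> l(init_lock);
      // Second check under the lock: another thread may have won.
      cur = s.load(std::memory_order_relaxed);
      if (cur == nullptr) {
        cur = candidate;
        s.store(cur, std::memory_order_release);
      }
    }
    return cur;
  }
};
```

Readers pay only for the acquire load once s is set, which matches the "faster for reads" point; the first assignment still goes through the mutex.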

The Onode get() and put() methods may call pin() and unpin() when we
transition between 1 and >1 ref. To do this they make use of the
OSDShard *s pointer without taking any additional locks. This runs afoul
of Collection::split_cache(), which moves an onode between shards.

It would be very complicated to address this race head-on: we ultimately
need to be under the protection of the OSDShard lock to do the pin/unpin,
but if OSDShard *s is changing, we don't know which lock to take. And if
it is null, what do we do? It might be null when we test but then get
set by split_cache. And what if there is a put() followed by a get(),
and they managed to acquire the appropriate lock(s), but the get() thread
gets it first? And so on.

We can avoid this whole mess by preventing a put() or get() from making
this transition (and looking at OSDShard *s) at all.

The only reason nref was *ever* < 2 is because the sequence was

 - remove from old collection onode_map
 - move onode to new shard
 - add to new collection onode_map

The fix is to simply

 - remove from old collection onode_map
 - add to new collection onode_map
 - adjust onode shard

That ensures that the onode's nref is >= 2 at all times.

At the same time, improve this code so that we don't _rm and _add when
the src and dest shard are the same.

Fixes: https://tracker.ceph.com/issues/43147
Fixes: https://tracker.ceph.com/issues/43131
Signed-off-by: Sage Weil <sage@redhat.com>

liewegas · 2020-01-15T23:16:00Z

After looking at this a bit more closely, the only reason nref was ever allowed to drop below 2 in the first place was the subtle ordering of the update. Before, we would

remove from old collection onode_map
fiddle with osd shard
add to new collection onode_map

The fix just changes this ordering to

remove from old collection onode_map
add to new collection onode_map
fiddle with osd shard

and we now know that nref is always >= 2 while the osdshard fiddling is going on and the race is no longer possible.
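
The two orderings can be modelled in a few lines. This is only a single-threaded sketch under loud assumptions: `std::shared_ptr::use_count()` stands in for the onode's intrusive nref, and `Shard`, `Onode`, `OnodeMap`, `move_old_order`, and `move_new_order` are hypothetical stand-ins, not the real BlueStore types or split_cache() itself. Each function returns the reference count seen at the moment the shard pointer is changed.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-ins; not the real BlueStore types.
struct Shard { int id; };
struct Onode {
  std::string oid;
  Shard* s = nullptr;
};
using OnodeRef = std::shared_ptr<Onode>;  // use_count() models nref
using OnodeMap = std::map<std::string, OnodeRef>;

// Old ordering: the shard is changed between the erase and the insert,
// so only the local reference exists at that point.
long move_old_order(OnodeMap& src, OnodeMap& dst, Shard* dst_shard,
                    const std::string& oid) {
  OnodeRef o = src.at(oid);                 // local ref + src map ref
  src.erase(oid);                           // remove from old onode_map
  long refs_during_fiddle = o.use_count();  // only the local ref is left
  o->s = dst_shard;                         // fiddle with shard
  dst[oid] = o;                             // add to new onode_map
  return refs_during_fiddle;
}

// New ordering: the shard is changed after the insert, so the local
// reference and the destination map reference both exist.
long move_new_order(OnodeMap& src, OnodeMap& dst, Shard* dst_shard,
                    const std::string& oid) {
  OnodeRef o = src.at(oid);                 // local ref + src map ref
  src.erase(oid);                           // remove from old onode_map
  dst[oid] = o;                             // add to new onode_map
  long refs_during_fiddle = o.use_count();  // local ref + dst map ref
  o->s = dst_shard;                         // fiddle with shard
  return refs_during_fiddle;
}
```

In this model the old ordering changes the shard while the count is 1, so a concurrent get() could cross the 1 to 2 boundary and read the shard pointer mid-update; with the new ordering the count is already 2, so another thread's get() or put() cannot cross that boundary.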

markhpc · 2020-01-15T23:24:24Z

@liewegas So the new version of this PR looks almost exactly like #32536 but with o->c = dest moved to happen between _rm and _add. Is there something I'm missing where moving that line resolves the assert we hit with #32536?

liewegas · 2020-01-15T23:27:48Z

The problem is that in that PR, line 3739 needs to happen before _rm and _add calls. That's what keeps nref >= 2 at all times.

markhpc · 2020-01-15T23:33:02Z

@liewegas I must be missing something, I think that's exactly what's in the other PR? Does the placement of the o->c = dest line matter here?

ceph/src/os/bluestore/BlueStore.cc

Lines 3734 to 3740 in 598208d

    p = onode_map.onode_map.erase(p);
    o->c = dest;
    dest->onode_map.onode_map[o->oid] = o;
    if (dest->cache != cache) {
      onode_map.cache->_rm(p->second);
      dest->onode_map.cache->_add(o, 1);
    }

vs

ceph/src/os/bluestore/BlueStore.cc

Lines 3739 to 3750 in 828da5b

    p = onode_map.onode_map.erase(p);
    dest->onode_map.onode_map[o->oid] = o;
    if (onode_map.cache != dest->onode_map.cache) {
      // move onode to a different cache shard
      onode_map.cache->_rm(o);
      o->c = dest;
      dest->onode_map.cache->_add(o, 1);
    } else {
      // the onode is in the same cache shard, making our move simpler.
      o->c = dest;
    }

liewegas · 2020-01-16T00:40:37Z

Hmm, yeah my reading is that the other PR at

ceph/src/os/bluestore/BlueStore.cc

Lines 3734 to 3740 in 598208d

    p = onode_map.onode_map.erase(p);
    o->c = dest;
    dest->onode_map.onode_map[o->oid] = o;
    if (dest->cache != cache) {
      onode_map.cache->_rm(p->second);
      dest->onode_map.cache->_add(o, 1);
    }

should also have fixed it. Did it not?

markhpc · 2020-01-16T15:29:21Z

@liewegas Kefu's test run didn't show the original issue from https://tracker.ceph.com/issues/43147 after my PR, but did hit the assert here https://tracker.ceph.com/issues/43147 during thrash (especially EC thrash) tests. Presumably your PR should hit the same assert since it's almost identical.

ifed01 · 2020-01-24T13:54:55Z

IMO the pin logic can be greatly simplified.
As far as I understand, the primary goal of the pin/unpin logic is to bypass multi-referenced (nref > 1) onodes while trimming the Onode cache. Can we achieve the same by simply removing such onodes from the cache when nref becomes greater than 1 and tagging them with a pinned flag? The flag is likely unnecessary too, as the ref counter is a good marker for the 'pinned' state.
Keeping the shard ref in the onode isn't necessary in this case either: the put/get methods can use c->shard to do _add/_rm from the cache.
Let me try to implement this and submit a different PR.
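
A rough, single-threaded sketch of that idea, purely for illustration: `CacheShard`, `CollectionSketch`, and `OnodeSketch` are hypothetical names, the list-based LRU is an assumption, and the locking and atomic ref counting a real implementation would need are deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <list>

struct OnodeSketch;

// Hypothetical cache shard: a plain LRU list of trimmable onodes.
struct CacheShard {
  std::list<OnodeSketch*> lru;
  void add(OnodeSketch* o) { lru.push_front(o); }
  void rm(OnodeSketch* o) { lru.remove(o); }
  // Every listed entry is trimmable, because multi-referenced onodes
  // are never in the list. Returns the number of evicted entries.
  std::size_t trim(std::size_t max) {
    std::size_t evicted = 0;
    while (lru.size() > max) {
      lru.pop_back();
      ++evicted;
    }
    return evicted;
  }
};

// Hypothetical collection that knows its cache shard.
struct CollectionSketch { CacheShard* shard; };

struct OnodeSketch {
  CollectionSketch* c;
  int nref = 1;  // 1 models "only the onode map holds a reference"
  explicit OnodeSketch(CollectionSketch* c_) : c(c_) {}
  // nref > 1 is itself the 'pinned' marker: leave the cache on 1 -> 2.
  void get() { if (++nref == 2) c->shard->rm(this); }
  // Become trimmable again on 2 -> 1, reaching the cache via c->shard.
  void put() { if (--nref == 1) c->shard->add(this); }
};
```

With two onodes in the cache, calling get() on one leaves only the other visible to trim(), and a later put() makes the first trimmable again, with no separate pinned flag and no shard pointer stored in the onode.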

liewegas added bluestore bug-fix labels Jan 15, 2020

liewegas requested review from ifed01, jdurgin, neha-ojha and rzarzynski January 15, 2020 22:43

markhpc self-requested a review January 15, 2020 22:59

markhpc approved these changes Jan 15, 2020

os/bluestore: remove no-op line from split_cache

aa55ac8

Signed-off-by: Sage Weil <sage@redhat.com>

liewegas force-pushed the fix-43147 branch from a941adc to 04a0948 Compare January 15, 2020 23:09

liewegas force-pushed the fix-43147 branch from 04a0948 to 828da5b Compare January 15, 2020 23:14

liewegas added the needs-qa label Jan 15, 2020

markhpc self-requested a review January 15, 2020 23:27

liewegas added the wip-sage2-testing label Jan 16, 2020

liewegas added wip-sage-testing wip-sage3-testing and removed wip-sage-testing wip-sage3-testing labels Jan 20, 2020

liewegas added a commit that referenced this pull request Jan 24, 2020

Merge PR #32665 into master

05d6d3a

* refs/pull/32665/head:
  os/bluestore: avoid race between split_cache and get/put pin/unpin
  os/bluestore: remove no-op line from split_cache

Reviewed-by: Mark Nelson <mnelson@redhat.com>

liewegas merged commit 828da5b into ceph:master Jan 24, 2020

liewegas deleted the fix-43147 branch January 24, 2020 20:04
