Skip to content

os/bluestore: avoid race between split_cache and get/put pin/unpin#32665

Merged
liewegas merged 2 commits intoceph:masterfrom
liewegas:fix-43147
Jan 24, 2020
Merged

os/bluestore: avoid race between split_cache and get/put pin/unpin#32665
liewegas merged 2 commits intoceph:masterfrom
liewegas:fix-43147

Conversation

@liewegas
Copy link
Copy Markdown
Member

The Onode get() and put() methods may call pin() and unpin() when we
transition between 1 and >1 ref. To do this they make use of the
OSDShard *s pointer without taking any additional locks. This runs afoul
of Collection::split_cache(), which moves an onode between shards.

It would be very complicated to address this race head-on: we ultimately
need to be under the protction of the OSDShard lock to do the pin/unpin,
but if OSDShard *s is changing, we don't know which lock to take. And if
it is null, what do we do? It might be null when we test but then get
set by split_cache. And what if there is a put() followed by a get(),
and they managed to acquire the appropriate lock(s), but the get() thread
gets it first? And so on.

We can avoid this whole mess by preventing a put() or get() from making
this transition (and looking at OSDShard *s) at all--we simply have to
take an additional ref in split_cache() so that we are certain nref >= 2.

Fixes: https://tracker.ceph.com/issues/43147
Fixes: https://tracker.ceph.com/issues/43131
Signed-off-by: Sage Weil sage@redhat.com

Copy link
Copy Markdown
Member

@markhpc markhpc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this feels pretty hacky to me, but the only alternatives I can think of are:

  1. a new lock in the onode to fully protect s
  2. we make s write-once, don't allow a given onode to migrate between caches, and protect s via DCLP ala https://preshing.com/20130930/double-checked-locking-is-fixed-in-cpp11/

2 would be faster than 1 for reads but still slow for Onode cache insertion.

In the end Sage is probably right that this is the better approach though we should probably document that the onode's get/put methods are not thread-safe in the Onode struct. We may still want the slight optimization from #32536

Signed-off-by: Sage Weil <sage@redhat.com>
The Onode get() and put() methods may call pin() and unpin() when we
transition between 1 and >1 ref.  To do this they make use of the
OSDShard *s pointer without taking any additional locks.  This runs afoul
of Collection::split_cache(), which moves an onode between shards.

It would be very complicated to address this race head-on: we ultimately
need to be under the protction of the OSDShard lock to do the pin/unpin,
but if OSDShard *s is changing, we don't know which lock to take.  And if
it is null, what do we do?  It might be null when we test but then get
set by split_cache.  And what if there is a put() followed by a get(),
and they managed to acquire the appropriate lock(s), but the get() thread
gets it first?  And so on.

We can avoid this whole mess by preventing a put() or get() from making
this transition (and looking at OSDShard *s) at all.

The only reason nref was *ever* < 2 is because the sequence was
 - remove from old collection onode_map
 - move onode to new shard
 - add to new collection onode_map

The fix is to simply
 - remove from old colleciton onode_map
 - add to new collection onode_map
 - adjust onode shard

That ensures that the onode's nref is >= 2 at all times.

At the same time, improve this code so that we don't _rm and _add when
the src and dest shard are the same.

Fixes: https://tracker.ceph.com/issues/43147
Fixes: https://tracker.ceph.com/issues/43131
Signed-off-by: Sage Weil <sage@redhat.com>
@liewegas
Copy link
Copy Markdown
Member Author

So this feels pretty hacky to me, but the only alternatives I can think of are:

  1. a new lock in the onode to fully protect s
  2. we make s write-once, don't allow a given onode to migrate between caches, and protect s via DCLP ala https://preshing.com/20130930/double-checked-locking-is-fixed-in-cpp11/

2 would be faster than 1 for reads but still slow for Onode cache insertion.

In the end Sage is probably right that this is the better approach though we should probably document that the onode's get/put methods are not thread-safe in the Onode struct. We may still want the slight optimization from #32536

After looking at this a bit more closely, the only reasdon nref ever was allows to drop below 2 in the first place was the subtle ordering of the update. Before, we would

  1. remove from old collection onode_map
  2. fiddle with osd shard
  3. add to new collection onode_map

The fix just changes this ordering to

  1. remove from old collection onode_map
  2. add to new collection onode_map
  3. fiddle with osd shard

and we now know that nref is always >= 2 while the osdshard fiddling is going on and the race is no longer possible.

@markhpc
Copy link
Copy Markdown
Member

markhpc commented Jan 15, 2020

@liewegas So the new version of this PR looks almost exactly like #32536 but with o->c = dest moved to happen between _rm and _add. Is there something I'm missing where moving that line resolves the assert we hit with #32536?

@markhpc markhpc self-requested a review January 15, 2020 23:27
@liewegas
Copy link
Copy Markdown
Member Author

liewegas commented Jan 15, 2020 via email

@markhpc
Copy link
Copy Markdown
Member

markhpc commented Jan 15, 2020

@liewegas I must be missing something, I think that's exactly what's in the other PR? Does the placement of the o->c = dest line matter here?

p = onode_map.onode_map.erase(p);
o->c = dest;
dest->onode_map.onode_map[o->oid] = o;
if (dest->cache != cache) {
onode_map.cache->_rm(p->second);
dest->onode_map.cache->_add(o, 1);
}

vs

p = onode_map.onode_map.erase(p);
dest->onode_map.onode_map[o->oid] = o;
if (onode_map.cache != dest->onode_map.cache) {
// move onode to a different cache shard
onode_map.cache->_rm(o);
o->c = dest;
dest->onode_map.cache->_add(o, 1);
} else {
// the onode is in the same cache shard, making our move simpler.
o->c = dest;
}

@liewegas
Copy link
Copy Markdown
Member Author

@liewegas I must be missing something, I think that's exactly what's in the other PR? Does the placement of the o->c = dest line matter here?

p = onode_map.onode_map.erase(p);
o->c = dest;
dest->onode_map.onode_map[o->oid] = o;
if (dest->cache != cache) {
onode_map.cache->_rm(p->second);
dest->onode_map.cache->_add(o, 1);
}

vs

p = onode_map.onode_map.erase(p);
dest->onode_map.onode_map[o->oid] = o;
if (onode_map.cache != dest->onode_map.cache) {
// move onode to a different cache shard
onode_map.cache->_rm(o);
o->c = dest;
dest->onode_map.cache->_add(o, 1);
} else {
// the onode is in the same cache shard, making our move simpler.
o->c = dest;
}

Hmm, yeah my reading is that the other PR at

p = onode_map.onode_map.erase(p);
o->c = dest;
dest->onode_map.onode_map[o->oid] = o;
if (dest->cache != cache) {
onode_map.cache->_rm(p->second);
dest->onode_map.cache->_add(o, 1);
}
should also have fixed it. Did it not?

@markhpc
Copy link
Copy Markdown
Member

markhpc commented Jan 16, 2020

@liewegas Kefu's test run didn't show the original issue from https://tracker.ceph.com/issues/43147 after my PR, but did hit the assert here https://tracker.ceph.com/issues/43147 during thrash (especially EC thrash) tests. Presumably your PR should hit the same assert since it's almost identical.

@ifed01
Copy link
Copy Markdown
Contributor

ifed01 commented Jan 24, 2020

IMO pin logic can be totally simplified.
As far as I understand the primarily goal for pin/unpin logic is to bypass multireferenced (nref > 1) onodes while doing trimming in Onode cache. Can we achieve the same by simply removing such onodes from cache when nref becomes greater than 1 and tagging them with pinned flag. And likely the latter isn't necessary too as ref counter is the good marker for 'pinned' state.
Keeping shard ref in onode isn't necessary in this case too - put/get methods can use c->shard to do _add/_rm from the cache.
Let me try to implement this and submit a different PR

liewegas added a commit that referenced this pull request Jan 24, 2020
* refs/pull/32665/head:
	os/bluestore: avoid race between split_cache and get/put pin/unpin
	os/bluestore: remove no-op line from split_cache

Reviewed-by: Mark Nelson <mnelson@redhat.com>
@liewegas liewegas merged commit 828da5b into ceph:master Jan 24, 2020
@liewegas liewegas deleted the fix-43147 branch January 24, 2020 20:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants