os/bluestore: fix onode ref counting #47702
Conversation
c661edd to 39d61ae
|
what is missing here? "only" the review? We are affected by this bug (dozens of OSD crashes) on octopus. I understand octopus is EOL, but I hope at least for a backport to pacific so we can upgrade. |
|
@ifed01 Thanks for your patch, I reviewed the code today and I think I'm able to poke a hole here: in the end we have nref=2, pinned=true, which is a wrong state and would potentially cause the failure of onode trimming. Honestly speaking, doing nref += 2 breaks the definition of the self-explanatory Onode nref variable, which means nref would then only be meaningful for this particular bug, and no longer be the real onode reference count it was designed for. It would also make the code pretty hard to understand for other people if they don't know the history of the bugs. I also had #48566 opened for opinions, but I didn't have the courage to get rid of put_onode; the good thing is that it's simple enough (less risky). Would you consider this kind of patch? (I understand it's hard to get a perfect solution; e.g. "pinned" should really be controlled under ocs.lock, as it has a direct connection with the status of the onode shard cache) |
Agree :(
+1, will gladly remove such tricks if we could....
Will post my concerns in your PR, IMO it's not 100% safe either... |
39d61ae to c9519a4
src/os/bluestore/BlueStore.cc
Outdated
if (cached && need_unpin) {
}
bool was_pinned = pinned;
pinned = pin_nref >= 2;
Though pinned and pin_nref are atomic variables, this statement could be evaluated as the two steps below:
temp = pin_nref >= 2;
pinned = temp;
So the following case could result in pin_nref = 2 but pinned = false:
Suppose pin_nref = 2 and pinned = true. A put thread comes in and decreases pin_nref to 1, so pin_nref >= 2 evaluates to false; but before that result gets assigned to pinned, a get thread comes in, sets pin_nref back to 2, and exits because pinned is still true. After that, the put thread sets pinned to false and does the unpin. Finally, we'll have an unexpected state: pin_nref = 2 and pinned = false.
Should access to the "pinned" variable be restricted to inside the cache lock? I feel it's more like a member of the cache shard, as its purpose is to correctly reflect the cache status of the Onode.
@taodd - hopefully I resolved this issue by moving the pinning stuff to the onode cache's _trim_to() method. The latter acquires the cache lock anyway, which gets rid of the conflicts with Onode::put(). Not to mention that this allows proper pinning to be re-done again and again if unpinning goes wrong for some reason...
As a result, Onode::get() is trivial and now does nothing but increment ref counts.
Using the "pinned" var is replaced with calling the onode lru_item's is_linked() method, which now explicitly indicates whether the onode is in the cache LRU or not.
And I refactored the code to isolate all the pinning/unpinning logic in the cache shard rather than doing it in the onode.
So please take a look.
|
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved |
|
@ifed01 wondering if you see my review comments above, does it make sense to you? |
Hi @taodd - yeah, this makes sense indeed. Sorry, I was busy with different stuff (and disappointed by that new failure ;)) so I missed the response. Thinking/working on alternative approaches now. Generally hovering around two ideas:
|
c9519a4 to aaae0f4
2babefc to 85d6895
|
jenkins test make check |
|
Thanks Igor, that's a very clever way to simplify the logic, I'll spend more time to take a closer look next week. |
85d6895 to adbd24f
|
jenkins test make check |
1 similar comment
|
jenkins test make check |
|
I also encountered this problem in version 16.2.10 (45fa1a0). I tried to reproduce it but failed. Can someone tell me how to reproduce this problem? |
well we observe this mostly on EC pools with the latest octopus release under a decent workload (over 100 clients). This is our current number of OSD crashes with this crash signature:
root@ceph-mon1:~# i=0; for crash in $(ceph crash ls | awk '/osd/ { print $1 }'); do ceph crash info $crash | grep -qF 'ShardedThreadPool::WorkThreadSharded::entry()+0x10)' && i=$((i+1)); done; echo $i;
72
These crashes are occurring almost daily on a regular basis in our cluster, so we really didn't try to artificially reproduce the problem. The crashes are rarer when there is only light load on the cluster. HTH |
Do clients mean cephfs clients? We didn't find this problem even with stress testing using cosbench before deploying the cluster. By the way, I found some crash info is lost, like this. Did you meet a similar problem? |
|
We are running an EC pool with currently around 200 VMs on it; currently there are about 1-4 OSD crashes per week. |
|
Copying the mail that I sent to the users' mailing list:
For those who have been hitting this issue, I would like to share a workaround that could unblock you. During the investigation of this issue, I found this race condition always happens after the bluestore onode cache size becomes 0:
2022-10-25T00:47:26.562+0000 7f424f78e700 30 bluestore.MempoolThread(0x564a9dae2a68) _resize_shards max_shard_onodes: 0 max_shard_buffer: 8388608
This is apparently wrong, as it means the bluestore metadata cache is basically disabled. Continuing the investigation, it turns out the culprit for the 0-sized cache is a leak that happened in the bluestore_cache_other mempool.
By the way, I'm backporting the fix to ubuntu Octopus and Pacific through this SRU [5], so it will land in ubuntu's packages soon.
[1] https://tracker.ceph.com/issues/56382 |
aclamk
left a comment
I am surprised, but I am now convinced that this solution can work.
I have tried hard, but was unable to come up with a scenario that creates any of:
- use after free for Onode
- double delete for Onode
- leave Onode pinned
Excellent stuff!
if (p != onode_map.end()) {
// add entry or return existing one
auto p = onode_map.emplace(oid, o);
if (!p.second) {
+1 Excellent optimization, will reuse.
}
auto pn = --put_nref;
if (nref == 0 && pn == 0) {
if (--nref == 0) {
+1 I admire this refactor. Now there is no way we "use after free" for Onodes.
bool cached;             ///< Onode is logically in the cache
                         ///  (it can be pinned and hence physically out
                         ///  of it at the moment though)
std::atomic_bool pinned; ///< Onode is pinned
+1 for eradication of pinned bool flag.
// cache onodes on a per-collection basis to avoid lock
// contention.
OnodeSpace onode_map;
+1 I will copy this rename pattern in future.
}
OnodeRef add(const ghobject_t& oid, OnodeRef& o);
OnodeRef add_onode(const ghobject_t& oid, OnodeRef& o);
I do not think the rename from "add" to "add_onode" is needed here, especially since other functions do not follow a similar pattern.
My primary intention was to make reference lookups for this function easier - 'add' appears too frequently in the code, hence it's hard to search for without special code-browsing features in an editor.
And this improves code readability a bit...
TransContext *txc,
CollectionRef& c,
OnodeRef o,
OnodeRef& o,
I like the OnodeRef -> OnodeRef& optimization.
But would we not be better off having it in a different PR, for testing purposes?
Or is it somehow relevant to the pin/unpin logic?
Of course, the plus side of having it here is that with one backport we get the perf improvement too.
No, IIRC it's not relevant to ref counting. And this apparently deserves a standalone commit/PR. But let's leave it as-is for now - I strongly want this fix to be merged ASAP - there are plenty of user complaints about the bug.
// move onode within LRU
_rm(o);
_add(o, 1);
}
I think we could have a _touch() function that combines _rm/_add, since those two mostly cancel each other out.
This is a hot part of the code; it deserves optimization.
Updated. As there are no other users for that _touch() func, I just put the combined (and reduced) logic from _rm/_add directly in this place. @aclamk - please take another look.
to ease code reading. Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
readability. Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
This should eliminate duplicate onode releases that could happen before. Additionally onode pinning is performed during cache trimming not onode ref count increment. [Hopefully] fixes: https://tracker.ceph.com/issues/53002 Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
8b7b550 to 1d4abf7
|
jenkins test make check |
|
will it be merged? |
Hopefully yes, once it passes QA
|
Rados suite review: https://pulpito.ceph.com/?branch=wip-yuri2-testing-2023-01-26-1532 Failures, unrelated: Details: |
Will it be backported to release 16.2.12? |
|
@thenamehasbeentake, the bot created backport tickets for pacific & quincy |
|
@k0ste When is the next pacific release? |
🔮 |
OSD crashes due to apparently buggy onode ref counting/cache pinning logic are still happening on 17.2.0/16.2.9/15.2.16. This is apparently caused by a duplicate onode release under some conditions.
I presume this might happen when two concurrent threads are 'putting' an onode and they race at the put_nref decrement. See my comment in the code snippet below.
Fixes: https://tracker.ceph.com/issues/56382
Signed-off-by: Igor Fedotov igor.fedotov@croit.io
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "pacific"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
Checklist
Show available Jenkins commands
jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox
jenkins test windows