Skip to content

os/bluestore: avoid premature onode release.#43770

Merged
yuriw merged 1 commit intoceph:masterfrom
ifed01:wip-ifed-fix-53002
Dec 14, 2021
Merged

os/bluestore: avoid premature onode release.#43770
yuriw merged 1 commit intoceph:masterfrom
ifed01:wip-ifed-fix-53002

Conversation

@ifed01
Copy link
Contributor

@ifed01 ifed01 commented Nov 2, 2021

This was observed when onode's removal is followed by reading
and the latter causes object release before the removal is finalized.
The root cause is an improper 'pinned' state assessment in Onode::get

More detailed overview is:
At some point Onode::get() might face the case when nref == 2 and pinned = true
which means parallel incomplete put is running on the onode - ref count is
decremented but pinned state is still unmodified (and even lock hasn't been
acquired yet).
This might finally result in two puts racing over the same onode with nref == 2
which finally results in a premature onode release:
// nref =3, pinned = 1
// Thread 1 Thread 2
// o->put() o->get()
// --nref(n = 2, pinned=1)
// nref++ (n=3, pinned = 1)
// return
// ...
// o->put()
// --nref(n = 2)
// pinned = 0,
// --nref(n = 1)
// ocs->_unpin_and_rm(o) -> o->put()
// ...
// --nref(n = 0)
// release o
// o->c->get_onode_cache()
// FAULT!
//
The suggested fix is to introduce additional atomic counter tracking
running put() functions. And permit onode release when both regular
nref and put_nref are both equal to zero.

Fixes: https://tracker.ceph.com/issues/53002
Signed-off-by: Igor Fedotov igor.fedotov@croit.io

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@ifed01
Copy link
Contributor Author

ifed01 commented Nov 2, 2021

Looks good so far, still running QA hence marking WIP for now.

@ifed01 ifed01 requested review from aclamk and neha-ojha November 2, 2021 12:11
@ifed01 ifed01 force-pushed the wip-ifed-fix-53002 branch from ca2827e to b3cc049 Compare November 2, 2021 21:01
@ifed01
Copy link
Contributor Author

ifed01 commented Nov 3, 2021

jenkins test make check

@ifed01
Copy link
Contributor Author

ifed01 commented Nov 3, 2021

Looks good so far, still running QA hence marking WIP for now.

This has passed my nightly test run which had been failing in 30-60 mins before.
store_test passed as well.
Hence looks ready for review, removing WIP now.

@ifed01 ifed01 changed the title WIP,RFC: os/bluestore: avoid premature onode release. os/bluestore: avoid premature onode release. Nov 3, 2021
@ifed01
Copy link
Contributor Author

ifed01 commented Nov 4, 2021

jenkins test make check

Copy link
Contributor

@aclamk aclamk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is some excellent fix-coding.
I spent 1 day trying to poke holes in the solution, and I could not.

Minimalistic but successful. Love that.

The only (optional) improvement would be to sync-up comments with current implementation.

@@ -3642,11 +3643,12 @@ void BlueStore::Onode::put() {
// should be the last action since Onode can be released
// at any point after this decrement
if (need_unpin) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets update comments.
Onode can no longer be released at any point.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aclamk - could you please elaborate on the second comment - I can find anything related under the link....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aclamk - and it looks like we can now remove that trick with an additional increment for pinned entries. This patch makes that safe. I'm inclined to submit that in a different patch though.
Just asking for your thought for now

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ifed01 The comment I meant was "// remove will also decrement nref and delete Onode". And It seems I posted broken link, or rebase moved it or something...
In this new situation, there is no way Onode could be deleted there.
"// remove will decrement nref, but Onode will not be deleted yet (put_nref > 1)"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ifed01
Regarding removal of artificial reference counter for pinning.
I do not believe we can to that.
BUT:
I look forward to be wrong in this case!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ifed01 The comment I meant was "// remove will also decrement nref and delete Onode". And It seems I posted broken link, or rebase moved it or something... In this new situation, there is no way Onode could be deleted there. "// remove will decrement nref, but Onode will not be deleted yet (put_nref > 1)"

OK, thanks. Updated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ifed01 Regarding removal of artificial reference counter for pinning. I do not believe we can to that. BUT: I look forward to be wrong in this case!

See the last commit of #44311
Looks good so far for my overnight QA run.

This was observed when onode's removal is followed by reading
and the latter causes object release before the removal is finalized.
The root cause is an improper 'pinned' state assessment in Onode::get

More detailed overview is:
At some point Onode::get() might face the case when nref == 2 and pinned = true
which means parallel incomplete put is running on the onode - ref count is
decremented but pinned state is still unmodified (and even lock hasn't been
acquired yet).
This might finally result in two puts racing over the same onode with nref == 2
which finally results in a premature onode release:
  // nref =3, pinned = 1
  // Thread 1                   Thread 2
  //   o->put()                   o->get()
  //   --nref(n = 2, pinned=1)
  //                              nref++ (n=3, pinned = 1)
  //                              return
  //                              ...
  //                              o->put()
  //                              --nref(n = 2)
  //                              pinned = 0,
  //                              --nref(n = 1)
  //                              ocs->_unpin_and_rm(o) -> o->put()
  //                                ...
  //                                --nref(n = 0)
  //                                release o
  //  o->c->get_onode_cache()
  //  FAULT!
  //
The suggested fix is to introduce additional atomic counter tracking
running put() functions. And permit onode release when both regular
nref and put_nref are both equal to zero.

Fixes: https://tracker.ceph.com/issues/53002
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
@yuriw
Copy link
Contributor

yuriw commented Dec 14, 2021

Unrelated failures

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants