
os/bluestore: fix onode ref counting. #47702

Merged
ljflores merged 7 commits into ceph:main from ifed01:wip-ifed-more-pin-fix
Feb 8, 2023

Conversation

@ifed01
Contributor

@ifed01 ifed01 commented Aug 19, 2022

OSD crashes due to apparently buggy onode ref counting/cache pinning logic are still happening for 17.2.0/16.2.9/15.2.16. They are apparently caused by a duplicate onode release under some conditions.

I presume this might happen when two concurrent threads are 'putting' the onode and they race at the put_nref decrement. See my comment in the code snippet below.

void BlueStore::Onode::put() {
  auto cct = c->store->cct;
  ++put_nref;
  int n = --nref;
  if (n == 1) {
    OnodeCacheShard* ocs = c->get_onode_cache();
    ocs->lock.lock();
    while (ocs != c->get_onode_cache()) {
      ocs->lock.unlock();
      ocs = c->get_onode_cache();
      ocs->lock.lock();
    }
    bool need_unpin = pinned;
    pinned = pinned && nref >= 2;
    need_unpin = need_unpin && !pinned;
    if (cached && need_unpin) {
      if (exists) {
        ocs->_unpin(this);
      } else {
        ocs->_unpin_and_rm(this);
        // remove will also decrement nref
        c->onode_map._remove(oid);
      }
    }
    ocs->lock.unlock();
  }
  auto pn = --put_nref;

/*!!!!!!!!!
Suppose the first thread gets pn = 0 (while nref >= 1) at this point.

A second thread might call Onode::put() (and hence increase put_nref back to 1)
and decrease nref to zero before the first one reaches the 'if' statement below.
As a result, both threads see nref == 0 and pn == 0 and release this onode twice.
*/

  if (nref == 0 && pn == 0) {
    delete this;
  }
} 
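For contrast, the canonical intrusive refcount pattern has no such window, because fetch_sub returns the previous value and so exactly one thread can ever observe the 1 -> 0 transition. A minimal sketch (illustrative only, not the actual fix; `Node` and the `g_live` counter are hypothetical names added here for demonstration):

```cpp
#include <atomic>
#include <cassert>

static std::atomic<int> g_live{0};  // illustrative hook: counts live nodes

struct Node {
  std::atomic<int> nref{1};  // creator holds the first reference

  Node()  { g_live.fetch_add(1); }
  ~Node() { g_live.fetch_sub(1); }

  void get() { nref.fetch_add(1, std::memory_order_relaxed); }

  void put() {
    // fetch_sub returns the value *before* the decrement, so only the
    // caller that takes nref from 1 to 0 deletes; a double delete is
    // impossible no matter how concurrent put() calls interleave.
    if (nref.fetch_sub(1, std::memory_order_acq_rel) == 1)
      delete this;
  }
};
```

The buggy code above departs from this shape because it must also update pinning state between the decrement and the delete, which is what reopens the window.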

Fixes: https://tracker.ceph.com/issues/56382
Signed-off-by: Igor Fedotov igor.fedotov@croit.io

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@ghost

ghost commented Oct 24, 2022

What is missing here? "Only" the review? We are affected by this bug (dozens of OSD crashes) on Octopus.

I understand Octopus is EOL, but I hope at least for a backport to Pacific so we can upgrade.

@taodd
Contributor

taodd commented Oct 24, 2022

@ifed01 Thanks for your patch, I reviewed the code today and I think I'm able to poke a hole here:

put() (nref = 4, pinned = true)        
    if (--nref <=3 ) (nref=3,pinned=true) {                   
                                                                     get()  (nref=3,pinned=true)
                                                                           nref += 2  (nref=5, pinned=true)
                                                                            if (nref > 3 && !pinned)
                                                                             ###exit because of pinned=true
             ocs.lock()         
            ....                                               
              pinned = nref > 3;(nref=5,pinned=true)
              ## doesn't unpin
            ....
            ocs.unlock()
    }
                                                                     put()  (nref = 5, pinned = true)
                                                                           if (--nref <= 3){ (nref = 4, pinned = true){ #skip if body
                                                                             .....
                                                                            }
                                                                            if (--nref == 0) (nref=3, pinned=true)
                                                                            {delete}
                        
    if (--nref == 0)   (nref = 2, pinned = true)
    {delete}

In the end we have nref=2, pinned=true, which is a wrong state and would potentially cause onode trimming to fail.

Honestly speaking, doing nref += 2 breaks the definition of Onode's self-explanatory nref variable: nref would then only be meaningful for working around this particular bug, not the real onode reference count it was designed to be. It would also make the code pretty hard to understand for people who don't know the history of the bugs.

I also had #48566 opened for opinions, but I didn't have the courage to get rid of put_onode; the good thing is that it's simple enough (less risky). Would you consider this kind of patch? (I understand it's hard to get a perfect solution; e.g. 'pinned' should really be controlled under ocs.lock, as it has a direct connection with the status of the onode shard cache.)

@ifed01
Contributor Author

ifed01 commented Oct 25, 2022

(quoting the race scenario from @taodd's comment above)

Agree :(

Honestly speaking, doing nref += 2 breaks the definition of Onode's self-explanatory nref variable: nref would then only be meaningful for working around this particular bug, not the real onode reference count it was designed to be. It would also make the code pretty hard to understand for people who don't know the history of the bugs.

+1, will gladly remove such tricks if we could....

I also had #48566 opened for opinions, but I didn't have the courage to get rid of put_onode; the good thing is that it's simple enough (less risky). Would you consider this kind of patch? (I understand it's hard to get a perfect solution; e.g. 'pinned' should really be controlled under ocs.lock, as it has a direct connection with the status of the onode shard cache.)

Will post my concerns in your PR, IMO it's not 100% safe either...

@ifed01 ifed01 force-pushed the wip-ifed-more-pin-fix branch from 39d61ae to c9519a4 Compare October 28, 2022 16:42
@ifed01
Contributor Author

ifed01 commented Oct 28, 2022

@aclamk @taodd - here is another attempt, please take a look.

@aclamk - I tried to apply your patch from #47896 on top of this changeset, but the resulting code doesn't look nice/elegant IMO. Hence I omitted your commit for now...

if (cached && need_unpin) {
}
bool was_pinned = pinned;
pinned = pin_nref >= 2;
Contributor

Though pinned and pin_nref are atomic variables, this statement could be evaluated as the two below:
temp = pin_nref >= 2;
pinned = temp;

So the following case could result in pin_nref = 2 but pinned = false:
Suppose pin_nref = 2, pinned = true. A put thread comes in and decreases pin_nref to 1; then pin_nref >= 2 is evaluated as false, but before the result gets assigned to pinned, a get thread comes in, sets pin_nref to 2, and exits because pinned is still true. After that, the put thread sets pinned to false and does the unpin. Finally, we'll have an unexpected state: pin_nref = 2 and pinned = false.

Contributor

@taodd taodd Nov 9, 2022

Should access to the "pinned" variable be restricted to inside the cache lock? I feel it's more like a member of the cache shard, as its purpose is to correctly reflect the cache status of the Onode.

Contributor Author

@ifed01 ifed01 Nov 27, 2022

Though pinned and pin_nref are atomic variables, this statement could be evaluated as the two below: temp = pin_nref >= 2; pinned = temp;

So the following case could result in pin_nref = 2 but pinned = false: Suppose pin_nref = 2, pinned = true. A put thread comes in and decreases pin_nref to 1; then pin_nref >= 2 is evaluated as false, but before the result gets assigned to pinned, a get thread comes in, sets pin_nref to 2, and exits because pinned is still true. After that, the put thread sets pinned to false and does the unpin. Finally, we'll have an unexpected state: pin_nref = 2 and pinned = false.

@taodd - hopefully I resolved this issue by moving the pinning logic into the onode cache's _trim_to() method. The latter acquires the cache lock anyway, which eliminates the conflicts with Onode::put(). Not to mention that this allows re-doing proper pinning again and again if unpinning was done improperly for some reason...
As a result, Onode::get() is trivial now and does nothing but increment ref counts.
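A minimal model of that shape (hypothetical `MiniOnode`/`MiniShard` names, not the actual BlueStore classes): get()/put() only touch refcounts, and every eviction decision is made under the shard lock inside the trim, where a still-referenced entry is simply skipped and re-examined on the next pass:

```cpp
#include <atomic>
#include <cassert>
#include <list>
#include <mutex>

struct MiniOnode {
  std::atomic<int> nref{1};  // 1 == referenced only by the cache
};

struct MiniShard {
  std::mutex lock;
  std::list<MiniOnode*> lru;  // front = most recently used

  void add(MiniOnode* o) {
    std::lock_guard<std::mutex> l(lock);
    lru.push_front(o);
  }

  // Trim from the cold end. There is no 'pinned' flag to keep in sync:
  // an entry that is still referenced is left in place and will simply
  // be looked at again on a later trim pass.
  void trim_to(size_t max) {
    std::lock_guard<std::mutex> l(lock);
    auto it = lru.end();
    while (lru.size() > max && it != lru.begin()) {
      --it;
      if ((*it)->nref.load() > 1)
        continue;             // in use; retry on a later pass
      MiniOnode* o = *it;
      it = lru.erase(it);     // safe: we hold the shard lock
      delete o;
    }
  }
};
```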

Contributor Author

Should access to the "pinned" variable be restricted to inside the cache lock? I feel it's more like a member of the cache shard, as its purpose is to correctly reflect the cache status of the Onode.

Use of the "pinned" var is replaced with calls to the onode lru_item's is_linked() method, which now explicitly indicates whether the onode is in the cache LRU or not.
I also refactored the code to isolate all the pinning/unpinning logic in the cache shard rather than doing it in the onode.

So please take a look..

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@taodd
Contributor

taodd commented Nov 22, 2022

@ifed01 wondering if you've seen my review comments above - do they make sense to you?

@ifed01
Contributor Author

ifed01 commented Nov 22, 2022

@ifed01 wondering if you've seen my review comments above - do they make sense to you?

Hi @taodd - yeah, this makes sense indeed. Sorry, I was busy with different stuff (and disappointed by that new failure ;)) so I missed the response. Thinking/working on alternative approaches now... Generally hovering around two ideas:

  • use a straightforward mutex in Onode::put() and remove pinned entries when trimming the cache (which hopefully avoids the need for a mutex in Onode::get())
  • avoid temporary onode removal from the cache's LRU in the pinned state and simply move such an entry to the LRU's end once detected during cache trimming.

@ifed01 ifed01 force-pushed the wip-ifed-more-pin-fix branch from c9519a4 to aaae0f4 Compare November 27, 2022 22:45
@ifed01 ifed01 force-pushed the wip-ifed-more-pin-fix branch 2 times, most recently from 2babefc to 85d6895 Compare November 28, 2022 08:31
@ifed01
Contributor Author

ifed01 commented Nov 28, 2022

jenkins test make check

@taodd
Contributor

taodd commented Dec 2, 2022

Thanks Igor, that's a very clever way to simplify the logic. I'll spend more time taking a closer look next week.

@ifed01 ifed01 force-pushed the wip-ifed-more-pin-fix branch from 85d6895 to adbd24f Compare December 6, 2022 23:01
@ifed01
Contributor Author

ifed01 commented Dec 8, 2022

jenkins test make check

1 similar comment
@ifed01
Contributor Author

ifed01 commented Dec 11, 2022

jenkins test make check

@thenamehasbeentake

I also encountered this problem in version 16.2.10 (45fa1a0).
Here is the crash info:

[root@node01 /]# ceph crash info  2022-12-21T09:11:40.404816Z_c4ab0cc3-5213-44ce-b159-32f15a161df0
{
    "archived": "2022-12-21 10:00:02.049100",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12ce0) [0x7f5c112e0ce0]",
        "(BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x2353) [0x55e7fb241513]",
        "(BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x316) [0x55e7fb25c6d6]",
        "(non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x58) [0x55e7faea0878]",
        "(ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0xcad) [0x55e7fb08f61d]",
        "(PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0xcf0) [0x55e7fae0bb80]",
        "(PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x115d) [0x55e7fae67efd]",
        "(PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x2de5) [0x55e7fae70a85]",
        "(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xd1c) [0x55e7fae77dcc]",
        "(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x55e7facfefc9]",
        "(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x55e7faf5de78]",
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xc28) [0x55e7fad1c4c8]",
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x55e7fb3992a4]",
        "(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55e7fb39c184]",
        "/lib64/libpthread.so.0(+0x81ca) [0x7f5c112d61ca]",
        "clone()"
    ],
    "ceph_version": "16.2.10",
    "crash_id": "2022-12-21T09:11:40.404816Z_c4ab0cc3-5213-44ce-b159-32f15a161df0",
    "entity_name": "osd.76",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-osd",
    "stack_sig": "57bf76babfa5713dd2b536dcda981301df81c4c24dac803f2b39bd736c38330b",
    "timestamp": "2022-12-21T09:11:40.404816Z",
    "utsname_hostname": "node03",
    "utsname_machine": "x86_64",
    "utsname_release": "3.10.0-1160.45.1.el7.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Wed Oct 13 17:20:51 UTC 2021"
}

I tried to reproduce this problem but failed. Can someone tell me how to reproduce it?

@ghost

ghost commented Dec 23, 2022

(quoting @thenamehasbeentake's crash report above)

Well, we observe this mostly on EC pools with the latest Octopus release under a decent workload (over 100 clients).

This is our current number of OSD crashes with this crash signature:

root@ceph-mon1:~# i=0; for crash in $(ceph crash ls | awk '/osd/ { print $1 }'); do ceph crash info $crash | grep -qF 'ShardedThreadPool::WorkThreadSharded::entry()+0x10)' && i=$((i+1)); done; echo $i;
72

These crashes are occurring almost daily on a regular basis in our cluster, so we haven't really tried to artificially reproduce the problem.

The crashes are rarer when there is only light load on the cluster.

HTH

@thenamehasbeentake

(quoting the crash report and @ghost's reply above)

Do "clients" mean CephFS clients? We didn't find this problem even with stress testing using cosbench before deploying the cluster.

By the way, I found that some crash info got lost, like below. Did you encounter a similar problem?

2022-12-10T10:49:52.844+0000 7fecde18e700  4 rocksdb: [compaction/compaction_job.cc:1349] [P] [JOB 4065] Generated table #8875: 37949 keys, 8337300 bytes
2022-12-10T10:49:52.844+0000 7fecde18e700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1670669392844311, "cf_name": "P", "job": 4065, "event": "table_file_creation", "file_number": 8875, "file_size": 8337300, "table_properties": {"data_size": 8118651, "index_size": 122878, "index_partitions": 0, "top_level_index_size": 0, "index_key_is_user_key": 0, "index_value_is_delta_encoded": 0, "filter_size": 94917, "raw_key_size": 1851316, "raw_average_key_size": 48, "raw_value_size": 7499049, "raw_average_value_size": 197, "num_data_blocks": 2020, "num_entries": 37949, "num_deletions": 0, "num_merge_operands": 0, "num_range_deletions": 0, "format_version": 0, "fixed_key_len": 0, "filter_policy": "rocksdb.BuiltinBloomFilter", "column_family_name": "P", "column_family_id": 11, "comparator": "leveldb.BytewiseComparator", "merge_operator": "nullptr", "prefix_extractor_name": "nullptr", "property_collectors": "[]", "compression": "NoCompression", "compression_options": "window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; ", "creation_time": 1668234731, "oldest_key_time": 0, "file_creation_time": 1670669392}}
2022-12-10T10:49:52.847+0000 7fecde18e700  4 rocksdb: [compaction/compaction_job.cc:1415] [P] [JOB 4065] Compacted 2@0 + 2@1 files to L1 => 77245046 bytes
2022-12-10T10:49:52.847+0000 7fecde18e700  4 rocksdb: (Original Log Time 2022/12/10-10:49:52.847724) [compaction/compaction_job.cc:760] [P] compacted to: files[0 2 0 0 0 0 0] max score 0.29, MB/sec: 191.1 rd, 73.1 wr, level 1, files in(2, 2) out(2) MB in(119.0, 73.5) out(73.7), read-write-amplify(2.2) write-amplify(0.6) OK, records in: 2105194, records dropped: 1756421 output_compression: NoCompression

2022-12-10T10:49:52.847+0000 7fecde18e700  4 rocksdb: (Original Log Time 2022/12/10-10:49:52.847754) EVENT_LOG_v1 {"time_micros": 1670669392847740, "job": 4065, "event": "compaction_finished", "compaction_time_micros": 1056543, "compaction_time_cpu_micros": 806728, "output_level": 1, "num_output_files": 2, "total_output_size": 77245046, "num_input_records": 2105194, "num_output_records": 348773, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 2, 0, 0, 0, 0, 0]}
2022-12-10T10:49:52.847+0000 7fecde18e700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1670669392848059, "job": 4065, "event": "table_file_deletion", "file_number": 8873}
2022-12-10T10:49:52.848+0000 7fecde18e700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1670669392848280, "job": 4065, "event": "table_file_deletion", "file_number": 8863}
2022-12-10T10:49:52.848+0000 7fecde18e700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1670669392848454, "job": 4065, "event": "table_file_deletion", "file_number": 8856}
2022-12-10T10:49:52.848+0000 7fecde18e700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1670669392848654, "job": 4065, "event": "table_file_deletion", "file_number": 8855}
2022-12-10T10:52:55.294+0000 7ff6e3467200  0 set uid:gid to 167:167 (ceph:ceph)
2022-12-10T10:52:55.294+0000 7ff6e3467200  0 ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable), process ceph-osd, pid 42833
2022-12-10T10:52:55.294+0000 7ff6e3467200  0 pidfile_write: ignore empty --pid-file
2022-12-10T10:52:55.295+0000 7ff6e3467200  1 bdev(0x560a1e84a800 /var/lib/ceph/osd/ceph-87/block) open path /var/lib/ceph/osd/ceph-87/block
2022-12-10T10:52:55.295+0000 7ff6e3467200  0 bdev(0x560a1e84a800 /var/lib/ceph/osd/ceph-87/block) ioctl(F_SET_FILE_RW_HINT) on /var/lib/ceph/osd/ceph-87/block failed: (22) Invalid argument
2022-12-10T10:52:55.295+0000 7ff6e3467200  1 bdev(0x560a1e84a800 /var/lib/ceph/osd/ceph-87/block) open size 1920383410176 (0x1bf1fc56000, 1.7 TiB) block_size 4096 (4 KiB) non-rotational discard supported

@rene-bayer

We are running an EC pool with around 200 VMs on it; currently there are about 1-4 OSD crashes per week.
Hoping for a fix soon :-(

@taodd
Contributor

taodd commented Jan 6, 2023

Copying the mail that I sent to the user's mailing list:

For those who had been hitting this issue, I would like to share a workaround that could unblock you:

During the investigation of this issue, I found that this race condition always happens after the bluestore onode cache size becomes 0.
Setting debug_bluestore = 1/30 will allow you to see the cache size after the crash:

2022-10-25T00:47:26.562+0000 7f424f78e700 30 bluestore.MempoolThread(0x564a9dae2a68) _resize_shards max_shard_onodes: 0 max_shard_buffer: 8388608

This is apparently wrong, as it means the bluestore metadata cache is basically disabled,
but it goes a long way toward explaining why we hit the race condition so easily: an onode is trimmed right away after it's unpinned.

Continuing the investigation, it turns out the culprit for the 0-sized cache is a leak in the bluestore_cache_other mempool.
Please refer to the bug tracker [3], which has the details of the leak issue; it was already fixed by [4], and the next Pacific point release will have it.
But it was never backported to Octopus.
So if you are hitting the same issue:
For those on Octopus, you can manually backport this patch to fix the leak and prevent the race condition from happening.
For those on Pacific, you can wait for the next Pacific point release.

By the way, I'm backporting the fix to Ubuntu Octopus and Pacific through this SRU [5], so it will land in Ubuntu's packages soon.

[1] https://tracker.ceph.com/issues/56382
[2] #47702
[3] https://tracker.ceph.com/issues/56424
[4] #46911
[5] https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1996010

Contributor

@aclamk aclamk left a comment

I am surprised, but I am now convinced that this solution can work.
I have tried hard, but was unable to crank up a scenario that created any of:

  1. use after free for Onode
  2. double delete for Onode
  3. leaving an Onode pinned

Excellent stuff!

if (p != onode_map.end()) {
// add entry or return existing one
auto p = onode_map.emplace(oid, o);
if (!p.second) {
Contributor

+1 Excellent optimization, will reuse.

}
auto pn = --put_nref;
if (nref == 0 && pn == 0) {
if (--nref == 0) {
Contributor

+1 I admire this refactor. Now there is no way we "use after free" for Onodes.

bool cached; ///< Onode is logically in the cache
/// (it can be pinned and hence physically out
/// of it at the moment though)
std::atomic_bool pinned; ///< Onode is pinned
Contributor

+1 for eradication of pinned bool flag.


// cache onodes on a per-collection basis to avoid lock
// contention.
OnodeSpace onode_map;
Contributor

+1 I will copy this rename pattern in future.

}

OnodeRef add(const ghobject_t& oid, OnodeRef& o);
OnodeRef add_onode(const ghobject_t& oid, OnodeRef& o);
Contributor

I do not think the rename from "add" to "add_onode" is needed here, especially since other functions do not follow a similar pattern.

Contributor Author

My primary intention was to make lookups of references to this function easier: 'add' occurs too frequently in the code, so it's hard to search for without special code-browsing features in an editor.

Contributor Author

And this improves code readability a bit...

```diff
   TransContext *txc,
   CollectionRef& c,
-  OnodeRef o,
+  OnodeRef& o,
```
Contributor

I like the OnodeRef -> OnodeRef& optimization.
But wouldn't we be better off having it in a different PR, for testing purposes?
Or is it somehow relevant to the pin/unpin logic?

Of course, the plus side of having it here is that with one backport we get the perf improvement too.

Contributor Author

No, IIRC it's not relevant to ref counting. And this apparently deserves a standalone commit/PR. But let's leave it as-is for now: I strongly want this fix to be merged ASAP, as there are plenty of user complaints about the bug.

```cpp
  // move onode within LRU
  _rm(o);
  _add(o, 1);
}
```
Contributor

I think we could have a _touch() function that combines _rm/_add, since the two largely cancel each other out.
This is a hot path in the code; it deserves optimization.

Contributor Author

Updated. As there are no other users for such a _touch() function, I just put the combined (and reduced) logic from _rm/_add directly at this call site. @aclamk - please take another look.

ifed01 and others added 7 commits January 13, 2023 15:05
to ease code reading.

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
readability.

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
This should eliminate the duplicate onode releases that could happen before.
Additionally, onode pinning is now performed during cache trimming rather
than on onode ref count increment.

[Hopefully] fixes: https://tracker.ceph.com/issues/53002

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
@ifed01 ifed01 force-pushed the wip-ifed-more-pin-fix branch from 8b7b550 to 1d4abf7 on January 13, 2023 12:21
@ifed01
Contributor Author

ifed01 commented Jan 13, 2023

jenkins test make check

@zhucan

zhucan commented Feb 8, 2023

will it be merged?

@ifed01
Contributor Author

ifed01 commented Feb 8, 2023

will it be merged?

Hopefully yes, once it passes QA.

@ljflores
Member

ljflores commented Feb 8, 2023

Rados suite review: https://pulpito.ceph.com/?branch=wip-yuri2-testing-2023-01-26-1532

Failures, unrelated:
1. https://tracker.ceph.com/issues/58496 -- fix in progress
2. https://tracker.ceph.com/issues/58585
3. https://tracker.ceph.com/issues/58475
4. https://tracker.ceph.com/issues/57754
5. https://tracker.ceph.com/issues/49287
6. https://tracker.ceph.com/issues/57731
7. https://tracker.ceph.com/issues/54829
8. https://tracker.ceph.com/issues/52221

Details:
1. osd/PeeringState: FAILED ceph_assert(!acting_recovery_backfill.empty()) - Ceph - RADOS
2. rook: failed to pull kubelet image - Ceph - Orchestrator
3. test_dashboard_e2e.sh: Conflicting peer dependency: postcss@8.4.21 - Ceph - Mgr - Dashboard
4. test_envlibrados_for_rocksdb.sh: update-alternatives: error: alternative path /usr/bin/gcc-11 doesn't exist - Infrastructure
5. podman: setting cgroup config for procHooks process caused: Unit libpod-$hash.scope not found - Ceph - Orchestrator
6. Problem: package container-selinux conflicts with udica < 0.2.6-1 provided by udica-0.2.4-1 - Infrastructure
7. crash: void OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const: assert(num_down_in_osds <= num_in_osds) - Ceph - RADOS
8. crash: void OSD::handle_osd_map(MOSDMap*): assert(p != added_maps_bl.end()) - Ceph - RADOS

@ljflores ljflores merged commit 670a47a into ceph:main Feb 8, 2023
@thenamehasbeentake
Will it be backported to release 16.2.12?

@k0ste
Contributor

k0ste commented Feb 9, 2023

@thenamehasbeentake, the bot created backport tickets for pacific & quincy

@zhucan

zhucan commented Feb 9, 2023

@k0ste When is the next pacific release expected?

@k0ste
Contributor

k0ste commented Feb 9, 2023

@k0ste When is the next pacific release expected?

🔮

