
os/bluestore: fix onode ref counting. #47702

Merged
ljflores merged 7 commits into ceph:main from ifed01:wip-ifed-more-pin-fix
Feb 8, 2023

Conversation

@ifed01
Contributor

@ifed01 ifed01 commented Aug 19, 2022

OSD crashes due to apparently buggy onode ref counting/cache pinning logic are still happening for 17.2.0/16.2.9/15.2.16. They are apparently caused by a duplicate onode release under some conditions.

I presume this might happen when two concurrent threads are 'putting' the onode and they race at the put_nref decrement. See my comment in the code snippet below.

void BlueStore::Onode::put() {
  auto cct = c->store->cct;
  ++put_nref;
  int n = --nref;
  if (n == 1) {
    OnodeCacheShard* ocs = c->get_onode_cache();
    ocs->lock.lock();
    while (ocs != c->get_onode_cache()) {
      ocs->lock.unlock();
      ocs = c->get_onode_cache();
      ocs->lock.lock();
    }
    bool need_unpin = pinned;
    pinned = pinned && nref >= 2;
    need_unpin = need_unpin && !pinned;
    if (cached && need_unpin) {
      if (exists) {
        ocs->_unpin(this);
      } else {
        ocs->_unpin_and_rm(this);
        // remove will also decrement nref
        c->onode_map._remove(oid);
      }
    }
    ocs->lock.unlock();
  }
  auto pn = --put_nref;

/*!!!!!!!!!
Suppose the first thread gets pn = 0 (while nref >= 1) at this point.

A second thread might call Onode::put() (and hence increase put_nref back to 1)
and decrease nref to zero before the first one reaches the 'if' statement below.
As a result, both threads see nref == 0 and pn == 0 and release this onode twice.
*/

  if (nref == 0 && pn == 0) {
    delete this;
  }
} 
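For contrast, the canonical intrusive refcount pattern has no such window, because fetch_sub returns the previous value and so exactly one thread can ever observe the 1 -> 0 transition. A minimal sketch (illustrative only, not the actual fix; `Node` and the `g_live` counter are hypothetical names added here for demonstration):

```cpp
#include <atomic>
#include <cassert>

static std::atomic<int> g_live{0};  // illustrative hook: counts live nodes

struct Node {
  std::atomic<int> nref{1};  // creator holds the first reference

  Node()  { g_live.fetch_add(1); }
  ~Node() { g_live.fetch_sub(1); }

  void get() { nref.fetch_add(1, std::memory_order_relaxed); }

  void put() {
    // fetch_sub returns the value *before* the decrement, so only the
    // caller that takes nref from 1 to 0 deletes; a double delete is
    // impossible no matter how concurrent put() calls interleave.
    if (nref.fetch_sub(1, std::memory_order_acq_rel) == 1)
      delete this;
  }
};
```

The buggy code above departs from this shape because it must also update pinning state between the decrement and the delete, which is what reopens the window.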

Fixes: https://tracker.ceph.com/issues/56382
Signed-off-by: Igor Fedotov igor.fedotov@croit.io

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@ghost

ghost commented Oct 24, 2022

What is missing here? "Only" the review? We are affected by this bug (dozens of OSD crashes) on Octopus.

I understand Octopus is EOL, but I hope at least for a backport to Pacific so we can upgrade.

@taodd
Contributor

taodd commented Oct 24, 2022

@ifed01 Thanks for your patch, I reviewed the code today and I think I'm able to poke a hole here:

put() (nref = 4, pinned = true)        
    if (--nref <=3 ) (nref=3,pinned=true) {                   
                                                                     get()  (nref=3,pinned=true)
                                                                           nref += 2  (nref=5, pinned=true)
                                                                            if (nref > 3 && !pinned)
                                                                             ###exit because of pinned=true
             ocs.lock()         
            ....                                               
              pinned = nref > 3;(nref=5,pinned=true)
              ## doesn't unpin
            ....
            ocs.unlock()
    }
                                                                     put()  (nref = 5, pinned = true)
                                                                           if (--nref <= 3){ (nref = 4, pinned = true){ #skip if body
                                                                             .....
                                                                            }
                                                                            if (--nref == 0) (nref=3, pinned=true)
                                                                            {delete}
                        
    if (--nref == 0)   (nref = 2, pinned = true)
    {delete}

In the end we have nref=2, pinned=true, which is a wrong state and would potentially cause onode trimming to fail.

Honestly speaking, doing nref += 2 breaks the definition of Onode's self-explanatory nref variable: nref would then only be meaningful for working around this particular bug, not the real onode reference count it was designed to be. It would also make the code pretty hard to understand for people who don't know the history of the bugs.

I also had #48566 opened for opinions, but I didn't have the courage to get rid of put_onode; the good thing is that it's simple enough (less risky). Would you consider this kind of patch? (I understand it's hard to get a perfect solution; e.g. 'pinned' should really be controlled under ocs.lock, as it has a direct connection with the status of the onode shard cache.)

@ifed01
Contributor Author

ifed01 commented Oct 25, 2022

(quoting the race scenario from @taodd's comment above)

Agree :(

Honestly speaking, doing nref += 2 breaks the definition of Onode's self-explanatory nref variable: nref would then only be meaningful for working around this particular bug, not the real onode reference count it was designed to be. It would also make the code pretty hard to understand for people who don't know the history of the bugs.

+1, will gladly remove such tricks if we could....

I also had #48566 opened for opinions, but I didn't have the courage to get rid of put_onode; the good thing is that it's simple enough (less risky). Would you consider this kind of patch? (I understand it's hard to get a perfect solution; e.g. 'pinned' should really be controlled under ocs.lock, as it has a direct connection with the status of the onode shard cache.)

Will post my concerns in your PR, IMO it's not 100% safe either...

@ifed01 ifed01 force-pushed the wip-ifed-more-pin-fix branch from 39d61ae to c9519a4 Compare October 28, 2022 16:42
@ifed01
Contributor Author

ifed01 commented Oct 28, 2022

@aclamk @taodd - here is another attempt, please take a look.

@aclamk - I tried to apply your patch from #47896 on top of this changeset, but the resulting code doesn't look nice/elegant IMO. Hence I omitted your commit for now...

if (cached && need_unpin) {
}
bool was_pinned = pinned;
pinned = pin_nref >= 2;
Contributor

Though pinned and pin_nref are atomic variables, this statement could be evaluated as the two below:
temp = pin_nref >= 2;
pinned = temp;

So the following case could result in pin_nref = 2 but pinned = false:
Suppose pin_nref = 2, pinned = true. A put thread comes in and decreases pin_nref to 1; then pin_nref >= 2 is evaluated as false, but before the result gets assigned to pinned, a get thread comes in, sets pin_nref to 2, and exits because pinned is still true. After that, the put thread sets pinned to false and does the unpin. Finally, we'll have an unexpected state: pin_nref = 2 and pinned = false.

Contributor

@taodd taodd Nov 9, 2022

Should access to the "pinned" variable be restricted to inside the cache lock? I feel it's more like a member of the cache shard, as its purpose is to correctly reflect the cache status of the Onode.

Contributor Author

@ifed01 ifed01 Nov 27, 2022

Though pinned and pin_nref are atomic variables, this statement could be evaluated as the two below: temp = pin_nref >= 2; pinned = temp;

So the following case could result in pin_nref = 2 but pinned = false: Suppose pin_nref = 2, pinned = true. A put thread comes in and decreases pin_nref to 1; then pin_nref >= 2 is evaluated as false, but before the result gets assigned to pinned, a get thread comes in, sets pin_nref to 2, and exits because pinned is still true. After that, the put thread sets pinned to false and does the unpin. Finally, we'll have an unexpected state: pin_nref = 2 and pinned = false.

@taodd - hopefully I resolved this issue by moving the pinning logic into the onode cache's _trim_to() method. The latter acquires the cache lock anyway, which eliminates the conflicts with Onode::put(). Not to mention that this allows re-doing proper pinning again and again if unpinning was done improperly for some reason...
As a result, Onode::get() is trivial now and does nothing but increment ref counts.
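A minimal model of that shape (hypothetical `MiniOnode`/`MiniShard` names, not the actual BlueStore classes): get()/put() only touch refcounts, and every eviction decision is made under the shard lock inside the trim, where a still-referenced entry is simply skipped and re-examined on the next pass:

```cpp
#include <atomic>
#include <cassert>
#include <list>
#include <mutex>

struct MiniOnode {
  std::atomic<int> nref{1};  // 1 == referenced only by the cache
};

struct MiniShard {
  std::mutex lock;
  std::list<MiniOnode*> lru;  // front = most recently used

  void add(MiniOnode* o) {
    std::lock_guard<std::mutex> l(lock);
    lru.push_front(o);
  }

  // Trim from the cold end. There is no 'pinned' flag to keep in sync:
  // an entry that is still referenced is left in place and will simply
  // be looked at again on a later trim pass.
  void trim_to(size_t max) {
    std::lock_guard<std::mutex> l(lock);
    auto it = lru.end();
    while (lru.size() > max && it != lru.begin()) {
      --it;
      if ((*it)->nref.load() > 1)
        continue;             // in use; retry on a later pass
      MiniOnode* o = *it;
      it = lru.erase(it);     // safe: we hold the shard lock
      delete o;
    }
  }
};
```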

Contributor Author

Should access to the "pinned" variable be restricted to inside the cache lock? I feel it's more like a member of the cache shard, as its purpose is to correctly reflect the cache status of the Onode.

Use of the "pinned" var is replaced with calls to the onode lru_item's is_linked() method, which now explicitly indicates whether the onode is in the cache LRU or not.
I also refactored the code to isolate all the pinning/unpinning logic in the cache shard rather than doing it in the onode.

So please take a look..

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@taodd
Contributor

taodd commented Nov 22, 2022

@ifed01 wondering if you've seen my review comments above - do they make sense to you?

@ifed01
Contributor Author

ifed01 commented Nov 22, 2022

@ifed01 wondering if you've seen my review comments above - do they make sense to you?

Hi @taodd - yeah, this makes sense indeed. Sorry, I was busy with different stuff (and disappointed by that new failure ;)) so I missed the response. Thinking/working on alternative approaches now... Generally hovering around two ideas:

  • use a straightforward mutex in Onode::put() and remove pinned entries when trimming the cache (which hopefully avoids the need for a mutex in Onode::get())
  • avoid temporary onode removal from the cache's LRU in the pinned state and simply move such an entry to the LRU's end once detected during cache trimming.

@ifed01 ifed01 force-pushed the wip-ifed-more-pin-fix branch from c9519a4 to aaae0f4 Compare November 27, 2022 22:45
@ifed01 ifed01 force-pushed the wip-ifed-more-pin-fix branch 2 times, most recently from 2babefc to 85d6895 Compare November 28, 2022 08:31
@ifed01
Contributor Author

ifed01 commented Nov 28, 2022

jenkins test make check

@taodd
Contributor

taodd commented Dec 2, 2022

Thanks Igor, that's a very clever way to simplify the logic. I'll spend more time taking a closer look next week.

@ifed01 ifed01 force-pushed the wip-ifed-more-pin-fix branch from 85d6895 to adbd24f Compare December 6, 2022 23:01
@ifed01
Contributor Author

ifed01 commented Dec 8, 2022

jenkins test make check

1 similar comment
@ifed01
Contributor Author

ifed01 commented Dec 11, 2022

jenkins test make check

@thenamehasbeentake

I also encountered this problem in version 16.2.10 (45fa1a0).
Here is the crash info:

[root@node01 /]# ceph crash info  2022-12-21T09:11:40.404816Z_c4ab0cc3-5213-44ce-b159-32f15a161df0
{
    "archived": "2022-12-21 10:00:02.049100",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12ce0) [0x7f5c112e0ce0]",
        "(BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x2353) [0x55e7fb241513]",
        "(BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x316) [0x55e7fb25c6d6]",
        "(non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x58) [0x55e7faea0878]",
        "(ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0xcad) [0x55e7fb08f61d]",
        "(PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0xcf0) [0x55e7fae0bb80]",
        "(PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x115d) [0x55e7fae67efd]",
        "(PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x2de5) [0x55e7fae70a85]",
        "(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xd1c) [0x55e7fae77dcc]",
        "(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x55e7facfefc9]",
        "(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x55e7faf5de78]",
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xc28) [0x55e7fad1c4c8]",
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x55e7fb3992a4]",
        "(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55e7fb39c184]",
        "/lib64/libpthread.so.0(+0x81ca) [0x7f5c112d61ca]",
        "clone()"
    ],
    "ceph_version": "16.2.10",
    "crash_id": "2022-12-21T09:11:40.404816Z_c4ab0cc3-5213-44ce-b159-32f15a161df0",
    "entity_name": "osd.76",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-osd",
    "stack_sig": "57bf76babfa5713dd2b536dcda981301df81c4c24dac803f2b39bd736c38330b",
    "timestamp": "2022-12-21T09:11:40.404816Z",
    "utsname_hostname": "node03",
    "utsname_machine": "x86_64",
    "utsname_release": "3.10.0-1160.45.1.el7.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Wed Oct 13 17:20:51 UTC 2021"
}

I tried to reproduce this problem but failed. Can someone tell me how to reproduce it?

@ghost

ghost commented Dec 23, 2022

(quoting @thenamehasbeentake's crash report above)

Well, we observe this mostly on EC pools with the latest Octopus release under a decent workload (over 100 clients).

This is our current number of OSD crashes with this crash signature:

root@ceph-mon1:~# i=0; for crash in $(ceph crash ls | awk '/osd/ { print $1 }'); do ceph crash info $crash | grep -qF 'ShardedThreadPool::WorkThreadSharded::entry()+0x10)' && i=$((i+1)); done; echo $i;
72

These crashes are occurring almost daily on a regular basis in our cluster, so we haven't really tried to artificially reproduce the problem.

The crashes are rarer when there is only light load on the cluster.

HTH

@thenamehasbeentake

(quoting the crash report and @ghost's reply above)

Do "clients" mean CephFS clients? We didn't find this problem even with stress testing using cosbench before deploying the cluster.

By the way, I found that some crash info got lost, like below. Did you encounter a similar problem?

2022-12-10T10:49:52.844+0000 7fecde18e700  4 rocksdb: [compaction/compaction_job.cc:1349] [P] [JOB 4065] Generated table #8875: 37949 keys, 8337300 bytes
2022-12-10T10:49:52.844+0000 7fecde18e700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1670669392844311, "cf_name": "P", "job": 4065, "event": "table_file_creation", "file_number": 8875, "file_size": 8337300, "table_properties": {"data_size": 8118651, "index_size": 122878, "index_partitions": 0, "top_level_index_size": 0, "index_key_is_user_key": 0, "index_value_is_delta_encoded": 0, "filter_size": 94917, "raw_key_size": 1851316, "raw_average_key_size": 48, "raw_value_size": 7499049, "raw_average_value_size": 197, "num_data_blocks": 2020, "num_entries": 37949, "num_deletions": 0, "num_merge_operands": 0, "num_range_deletions": 0, "format_version": 0, "fixed_key_len": 0, "filter_policy": "rocksdb.BuiltinBloomFilter", "column_family_name": "P", "column_family_id": 11, "comparator": "leveldb.BytewiseComparator", "merge_operator": "nullptr", "prefix_extractor_name": "nullptr", "property_collectors": "[]", "compression": "NoCompression", "compression_options": "window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; ", "creation_time": 1668234731, "oldest_key_time": 0, "file_creation_time": 1670669392}}
2022-12-10T10:49:52.847+0000 7fecde18e700  4 rocksdb: [compaction/compaction_job.cc:1415] [P] [JOB 4065] Compacted 2@0 + 2@1 files to L1 => 77245046 bytes
2022-12-10T10:49:52.847+0000 7fecde18e700  4 rocksdb: (Original Log Time 2022/12/10-10:49:52.847724) [compaction/compaction_job.cc:760] [P] compacted to: files[0 2 0 0 0 0 0] max score 0.29, MB/sec: 191.1 rd, 73.1 wr, level 1, files in(2, 2) out(2) MB in(119.0, 73.5) out(73.7), read-write-amplify(2.2) write-amplify(0.6) OK, records in: 2105194, records dropped: 1756421 output_compression: NoCompression

2022-12-10T10:49:52.847+0000 7fecde18e700  4 rocksdb: (Original Log Time 2022/12/10-10:49:52.847754) EVENT_LOG_v1 {"time_micros": 1670669392847740, "job": 4065, "event": "compaction_finished", "compaction_time_micros": 1056543, "compaction_time_cpu_micros": 806728, "output_level": 1, "num_output_files": 2, "total_output_size": 77245046, "num_input_records": 2105194, "num_output_records": 348773, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 2, 0, 0, 0, 0, 0]}
2022-12-10T10:49:52.847+0000 7fecde18e700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1670669392848059, "job": 4065, "event": "table_file_deletion", "file_number": 8873}
2022-12-10T10:49:52.848+0000 7fecde18e700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1670669392848280, "job": 4065, "event": "table_file_deletion", "file_number": 8863}
2022-12-10T10:49:52.848+0000 7fecde18e700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1670669392848454, "job": 4065, "event": "table_file_deletion", "file_number": 8856}
2022-12-10T10:49:52.848+0000 7fecde18e700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1670669392848654, "job": 4065, "event": "table_file_deletion", "file_number": 8855}
2022-12-10T10:52:55.294+0000 7ff6e3467200  0 set uid:gid to 167:167 (ceph:ceph)
2022-12-10T10:52:55.294+0000 7ff6e3467200  0 ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable), process ceph-osd, pid 42833
2022-12-10T10:52:55.294+0000 7ff6e3467200  0 pidfile_write: ignore empty --pid-file
2022-12-10T10:52:55.295+0000 7ff6e3467200  1 bdev(0x560a1e84a800 /var/lib/ceph/osd/ceph-87/block) open path /var/lib/ceph/osd/ceph-87/block
2022-12-10T10:52:55.295+0000 7ff6e3467200  0 bdev(0x560a1e84a800 /var/lib/ceph/osd/ceph-87/block) ioctl(F_SET_FILE_RW_HINT) on /var/lib/ceph/osd/ceph-87/block failed: (22) Invalid argument
2022-12-10T10:52:55.295+0000 7ff6e3467200  1 bdev(0x560a1e84a800 /var/lib/ceph/osd/ceph-87/block) open size 1920383410176 (0x1bf1fc56000, 1.7 TiB) block_size 4096 (4 KiB) non-rotational discard supported

@rene-bayer

We are running an EC pool with around 200 VMs on it; currently there are about 1-4 OSD crashes per week.
Hoping for a fix soon :-(

@taodd
Contributor

taodd commented Jan 6, 2023

Copying the mail that I sent to the user's mailing list:

For those who had been hitting this issue, I would like to share a workaround that could unblock you:

During the investigation of this issue, I found that this race condition always happens after the bluestore onode cache size becomes 0.
Setting debug_bluestore = 1/30 will allow you to see the cache size after the crash:

2022-10-25T00:47:26.562+0000 7f424f78e700 30 bluestore.MempoolThread(0x564a9dae2a68) _resize_shards max_shard_onodes: 0 max_shard_buffer: 8388608

This is apparently wrong, as it means the bluestore metadata cache is basically disabled,
but it goes a long way toward explaining why we hit the race condition so easily: an onode is trimmed right away after it's unpinned.

Continuing the investigation, it turns out the culprit for the 0-sized cache is a leak in the bluestore_cache_other mempool.
Please refer to the bug tracker [3], which has the details of the leak issue; it was already fixed by [4], and the next Pacific point release will have it.
But it was never backported to Octopus.
So if you are hitting the same issue:
For those on Octopus, you can manually backport this patch to fix the leak and prevent the race condition from happening.
For those on Pacific, you can wait for the next Pacific point release.

By the way, I'm backporting the fix to Ubuntu Octopus and Pacific through this SRU [5], so it will land in Ubuntu's packages soon.

[1] https://tracker.ceph.com/issues/56382
[2] #47702
[3] https://tracker.ceph.com/issues/56424
[4] #46911
[5] https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1996010

Contributor

@aclamk aclamk left a comment

I am surprised, but I am now convinced that this solution can work.
I have tried hard, but was unable to crank up a scenario that created any of:

  1. use after free for Onode
  2. double delete for Onode
  3. leaving an Onode pinned

Excellent stuff!

if (p != onode_map.end()) {
// add entry or return existing one
auto p = onode_map.emplace(oid, o);
if (!p.second) {
Contributor

+1 Excellent optimization, will reuse.

}
auto pn = --put_nref;
if (nref == 0 && pn == 0) {
if (--nref == 0) {
Contributor

+1 I admire this refactor. Now there is no way we "use after free" for Onodes.

bool cached; ///< Onode is logically in the cache
/// (it can be pinned and hence physically out
/// of it at the moment though)
std::atomic_bool pinned; ///< Onode is pinned
Contributor

+1 for eradication of pinned bool flag.


// cache onodes on a per-collection basis to avoid lock
// contention.
OnodeSpace onode_map;
Contributor

+1 I will copy this rename pattern in future.

}

OnodeRef add(const ghobject_t& oid, OnodeRef& o);
OnodeRef add_onode(const ghobject_t& oid, OnodeRef& o);
Contributor

I do not think the rename from "add" to "add_onode" is needed here, especially since other functions do not follow a similar pattern.

Contributor Author

My primary intention was to make lookups of references to this function easier: 'add' occurs too frequently in the code, so it's hard to search for without special code-browsing features in an editor.

Contributor Author

And this improves code readability a bit...

```diff
   TransContext *txc,
   CollectionRef& c,
-  OnodeRef o,
+  OnodeRef& o,
```
Contributor

I like the OnodeRef -> OnodeRef& optimization.
But wouldn't we be better off having it in a different PR, for testing purposes?
Or is it somehow relevant to the pin/unpin logic?

Of course, the plus side of having it here is that with one backport we get the perf improvement too.

Contributor Author

No, IIRC it's not relevant to ref counting. And this apparently deserves a standalone commit/PR. But let's leave it as-is for now: I strongly want this fix to be merged ASAP, as there are plenty of user complaints about the bug.

```cpp
  // move onode within LRU
  _rm(o);
  _add(o, 1);
}
```
Contributor

I think we could have a _touch() function that combines _rm/_add, since the two largely cancel each other out.
This is a hot path in the code; it deserves optimization.

Contributor Author

Updated. As there are no other users for such a _touch() function, I just put the combined (and reduced) logic from _rm/_add directly at this call site. @aclamk - please take another look.

ifed01 and others added 7 commits January 13, 2023 15:05
to ease code reading.

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
readability.

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
This should eliminate the duplicate onode releases that could happen before.
Additionally, onode pinning is now performed during cache trimming rather
than on onode ref count increment.

[Hopefully] fixes: https://tracker.ceph.com/issues/53002

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
@ifed01 ifed01 force-pushed the wip-ifed-more-pin-fix branch from 8b7b550 to 1d4abf7 on January 13, 2023 12:21
@ifed01
Contributor Author

ifed01 commented Jan 13, 2023

jenkins test make check

@zhucan

zhucan commented Feb 8, 2023

will it be merged?

@ifed01
Contributor Author

ifed01 commented Feb 8, 2023

will it be merged?

Hopefully yes, once it passes QA.

@ljflores
Member

ljflores commented Feb 8, 2023

Rados suite review: https://pulpito.ceph.com/?branch=wip-yuri2-testing-2023-01-26-1532

Failures, unrelated:
1. https://tracker.ceph.com/issues/58496 -- fix in progress
2. https://tracker.ceph.com/issues/58585
3. https://tracker.ceph.com/issues/58475
4. https://tracker.ceph.com/issues/57754
5. https://tracker.ceph.com/issues/49287
6. https://tracker.ceph.com/issues/57731
7. https://tracker.ceph.com/issues/54829
8. https://tracker.ceph.com/issues/52221

Details:
1. osd/PeeringState: FAILED ceph_assert(!acting_recovery_backfill.empty()) - Ceph - RADOS
2. rook: failed to pull kubelet image - Ceph - Orchestrator
3. test_dashboard_e2e.sh: Conflicting peer dependency: postcss@8.4.21 - Ceph - Mgr - Dashboard
4. test_envlibrados_for_rocksdb.sh: update-alternatives: error: alternative path /usr/bin/gcc-11 doesn't exist - Infrastructure
5. podman: setting cgroup config for procHooks process caused: Unit libpod-$hash.scope not found - Ceph - Orchestrator
6. Problem: package container-selinux conflicts with udica < 0.2.6-1 provided by udica-0.2.4-1 - Infrastructure
7. crash: void OSDMap::check_health(ceph::common::CephContext*, health_check_map_t*) const: assert(num_down_in_osds <= num_in_osds) - Ceph - RADOS
8. crash: void OSD::handle_osd_map(MOSDMap*): assert(p != added_maps_bl.end()) - Ceph - RADOS

@ljflores ljflores merged commit 670a47a into ceph:main Feb 8, 2023
@thenamehasbeentake
Will it be backported to release 16.2.12?

@k0ste
Contributor

k0ste commented Feb 9, 2023

@thenamehasbeentake, the bot created backport tickets for pacific & quincy

@zhucan

zhucan commented Feb 9, 2023

@k0ste When is the next pacific release expected?

@k0ste
Contributor

k0ste commented Feb 9, 2023

@k0ste When is the next pacific release expected?

🔮

