Skip to content

os/bluestore: create the tail when first set FLAG_OMAP#27627

Merged
tchaikov merged 1 commit intoceph:masterfrom
cxytt:fix-add-omap-tail
Jun 25, 2019
Merged

os/bluestore: create the tail when first set FLAG_OMAP#27627
tchaikov merged 1 commit intoceph:masterfrom
cxytt:fix-add-omap-tail

Conversation

@cxytt
Copy link

@cxytt cxytt commented Apr 17, 2019

the omap iterator when listing omap use the tail of '~', when the iterator moved to the last key of the omapswe wanted, we will try to call extra next(), usually this will be another object's omap header(with '-'). IF there are some deleted key or tombstones, rocksdb will fall in the loop of FindNextUserEntryInternal until find a valid key, so it will travels all dead key in mid and read the sst file heavily.

Signed-off-by: ningtao ningtao@sangfor.com.cn

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

@cxytt
Copy link
Author

cxytt commented Apr 17, 2019

issues 36482 maybe the same problem

@cxytt
Copy link
Author

cxytt commented Apr 17, 2019

retest this please

@Aran85
Copy link
Contributor

Aran85 commented Apr 20, 2019

@markhpc @liewegas @gregsfortytwo This is the performance improve approach of list omap, ever discussed in devel mail list. please help to review if you feel it's convenient, thanks.

It would be better if there is a performance label.

it's that object foo has a bunch of omap items, but the object(s) that came after foo are all deleted, so that when we enumerate foo's omap, we have to skip over all the delete objects' sst files to reach the head for the next non-deleted object

@liewegas liewegas changed the title bluestore: create the tail when first set FLAG_OMAP os/bluestore: create the tail when first set FLAG_OMAP Apr 22, 2019
liewegas
liewegas previously approved these changes Apr 22, 2019
@gregsfortytwo
Copy link
Member

I've not done much with the KVStores so I'm not real clear on the impact of doing those deletes on the end iterator as well. Is it going to work with the monitor (...and FileStore) as well or does correct behavior depend on those BlueStore changes?

@liewegas liewegas removed the needs-qa label Apr 22, 2019
@liewegas
Copy link
Member

Ah, good point! I didn't notice those were in kv/ and not bluestore code.

I think the safer path here is to leave rm_range_keys() implementation unchanged, and just modify BlueStore::_do_omap_clear() to explicitly delete the end key.

Copy link
Member

@gregsfortytwo gregsfortytwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted, needs to not break monitor.

@cxytt
Copy link
Author

cxytt commented Apr 24, 2019

@liewegas
I've put detete tail on _do_omap_clear,And test found that the tail should not be used as a valid key, so modify the BlueStore: : OmapIteratorImpl: : valid (), the tail as invalid,please help to review

@cxytt
Copy link
Author

cxytt commented Apr 24, 2019

retest this please

@gregsfortytwo gregsfortytwo dismissed stale reviews from liewegas and themself April 24, 2019 18:18

Reimplemented

RWLock::RLocker l(c->lock);
bool r = o->onode.has_omap() && it && it->valid() &&
it->raw_key().second <= tail;
it->raw_key().second < tail && it->raw_key().second != tail;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why adding "it->raw_key().second != tail"? Isn't "it->raw_key.second < tail" enough?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are scenarios where an error occurs, such as read_log_and_missing,So here's what we did

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cxytt - I simply mean that last part of this expression (it->raw_key()second != tail) is redundant. True for it->raw_key().second < tail automatically results in true for ... != tail.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(─.─||)fixed

get_omap_header(id, &prefix);
get_omap_tail(id, &tail);
txc->t->rm_range_keys(omap_prefix, prefix, tail);
if (db->get(omap_prefix , tail, &tail_value) >= 0) {
Copy link
Contributor

@ifed01 ifed01 May 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need this get() call? May be unconditionally call txc->t->rmkey(omap_prefix, tail) only?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

break;
} else {
dout(30) << __func__ << " got header/data "
if (it->key() <= tail) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be it->key() < tail to avoid duplicate tail set - we've just set tail at line 13030

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

const string& prefix =
newo->onode.is_pgmeta_omap() ? PREFIX_PGMETA_OMAP : PREFIX_OMAP;
dout(20) << __func__ << " copying omap data" << dendl;
if (!newo->onode.has_omap()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check isn't needed as we unconditionally clear omap for newo above

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@ifed01
Copy link
Contributor

ifed01 commented May 3, 2019

This is most probably relevant to http://tracker.ceph.com/issues/36482
See my last comments there.

@cxytt
Copy link
Author

cxytt commented May 7, 2019

retest this please

@ifed01
Copy link
Contributor

ifed01 commented May 16, 2019

Found two more "slow" onodes at troublesome cluster - both are stuck at tail lookup.

@liewegas
Copy link
Member

fsck needs to be updated:

2019-05-24 17:39:49.539 7f6dd26747c0 -1 bluestore(/var/lib/ceph/osd/ceph-6) fsck error: found stray omap data on omap_head 1056
2019-05-24 17:39:49.539 7f6dd26747c0 -1 bluestore(/var/lib/ceph/osd/ceph-6) fsck error: found stray omap data on omap_head 1056
2019-05-24 17:39:49.539 7f6dd26747c0 -1 bluestore(/var/lib/ceph/osd/ceph-6) fsck error: found stray omap data on omap_head 1057

@liewegas
Copy link
Member

liewegas commented May 24, 2019

I also see a crash like this:

    -5> 2019-05-24 18:04:23.208 7fa4c87a7f00  0 osd.0 12 load_pgs
    -4> 2019-05-24 18:04:23.208 7fa4c87a7f00 20 osd.0 12 load_pgs pg_num_history pg_num_history(e12 pg_nums {1={6=4}} deleted_pools )
    -3> 2019-05-24 18:04:23.208 7fa4c87a7f00 10 osd.0 12 load_pgs ignoring unrecognized meta
    -2> 2019-05-24 18:04:23.208 7fa4c87a7f00 10 osd.0 12 pgid 1.3 coll 1.3_head
    -1> 2019-05-24 18:04:23.212 7fa4c87a7f00 -1 /build/ceph-15.0.0-1346-ge2071b9/src/osd/PG.cc: In function 'static int PG::peek_map_epoch(ObjectStore*, spg_t, epoch_t*)' thread 7fa4c87a7f00 time 2019-05-24 18:04:23.215356
/build/ceph-15.0.0-1346-ge2071b9/src/osd/PG.cc: 1019: FAILED ceph_assert(values.size() == 2)

 ceph version 15.0.0-1346-ge2071b9 (e2071b925acd1955f96cf7c2151a8381559679f5) octopus (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x556ee575527e]
 2: (()+0x4fb456) [0x556ee5755456]
 3: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*)+0x5a7) [0x556ee5971e17]
 4: (OSD::load_pgs()+0x830) [0x556ee59099a0]
 5: (OSD::init()+0x2b8d) [0x556ee590d20d]
 6: (main()+0x37ca) [0x556ee57b7d3a]

could the enumeration code be including the tail key in the result?
/a/sage-2019-05-24_16:59:44-rados-wip-sage-testing-2019-05-24-0906-distro-basic-smithi/3975790

this test you can reproduce locally with ../qa/run-standalone.sh misc/test-ceph-helpers.sh

@cxytt
Copy link
Author

cxytt commented May 27, 2019

@liewegas

  1. Reproduce as you suggest and set debug_bluestore=30. As seen from the logs, only key '_epoch' was retrieved, This does not look like a result of tail being set

    -15> 2019-05-27 16:32:41.974 7f8b1528bec0 10 osd.0 12 pgid 1.3 coll 1.3_head
    -14> 2019-05-27 16:32:41.974 7f8b1528bec0 15 bluestore(td/ceph-helpers/0) omap_get_values 1.3_head oid FreeBSD support. #1:c0000000::::head#
    -13> 2019-05-27 16:32:41.974 7f8b1528bec0 30 bluestore.OnodeSpace(0x55970a721ca8 in 0x55970aa79320) lookup
    -12> 2019-05-27 16:32:41.974 7f8b1528bec0 30 bluestore.OnodeSpace(0x55970a721ca8 in 0x55970aa79320) lookup FreeBSD support. #1:c0000000::::head# hit 0x55970b706f00
    -11> 2019-05-27 16:32:41.974 7f8b1528bec0 20 bluestore.onode(0x55970b706f00).flush flush done
    -10> 2019-05-27 16:32:41.974 7f8b1528bec0 30 bluestore(td/ceph-helpers/0) omap_get_values got 0x0000000000000410'._epoch' -> _epoch
    -9> 2019-05-27 16:32:41.974 7f8b1528bec0 10 bluestore(td/ceph-helpers/0) omap_get_values 1.3_head oid FreeBSD support. #1:c0000000::::head# = 0
    -8> 2019-05-27 16:32:41.986 7f8b1528bec0 -1 /home/nt/ceph.self/src/osd/PG.cc: In function 'static int PG::peek_map_epoch(ObjectStore*, spg_t, epoch_t*)' thread 7f8b1528bec0 time 2019-05-27 16:32:41.978774
    /home/nt/ceph.self/src/osd/PG.cc: 1019: FAILED ceph_assert(values.size() == 2)

  2. Tail is not used as a valid key, so 'fsck' should also not get tail

@liewegas
Copy link
Member

Ah, the problem is that the prefix is being set based on the flag that isn't set until a few lines later. this fixes it:

diff --git a/src/os/bluestore/BlueStore.cc b/src/os/bluestore/BlueStore.cc
index 1311040bfc4..f6ec5256f9e 100644
--- a/src/os/bluestore/BlueStore.cc
+++ b/src/os/bluestore/BlueStore.cc
@@ -12883,8 +12883,6 @@ int BlueStore::_omap_setkeys(TransContext *txc,
   int r;
   auto p = bl.cbegin();
   __u32 num;
-  const string& prefix =
-    o->onode.is_pgmeta_omap() ? PREFIX_PGMETA_OMAP : PREFIX_OMAP;
 
   if (!o->onode.has_omap()) {
     o->onode.set_omap_flag();
@@ -12893,6 +12891,8 @@ int BlueStore::_omap_setkeys(TransContext *txc,
     }
     txc->write_onode(o);
 
+    const string& prefix =
+      o->onode.is_pgmeta_omap() ? PREFIX_PGMETA_OMAP : PREFIX_OMAP;
     string key_tail;
     bufferlist tail;
     get_omap_tail(o->onode.nid, &key_tail);
@@ -12900,6 +12900,8 @@ int BlueStore::_omap_setkeys(TransContext *txc,
   } else {
     txc->note_modified_object(o);
   }
+  const string& prefix =
+    o->onode.is_pgmeta_omap() ? PREFIX_PGMETA_OMAP : PREFIX_OMAP;
   string final_key;
   _key_encode_u64(o->onode.nid, &final_key);
   final_key.push_back('.');
@@ -12928,8 +12930,6 @@ int BlueStore::_omap_setheader(TransContext *txc,
   dout(15) << __func__ << " " << c->cid << " " << o->oid << dendl;
   int r;
   string key;
-  const string& prefix =
-    o->onode.is_pgmeta_omap() ? PREFIX_PGMETA_OMAP : PREFIX_OMAP;
   if (!o->onode.has_omap()) {
     o->onode.set_omap_flag();
     if (o->oid.is_pgmeta()) {
@@ -12937,6 +12937,8 @@ int BlueStore::_omap_setheader(TransContext *txc,
     }
     txc->write_onode(o);
 
+    const string& prefix =
+      o->onode.is_pgmeta_omap() ? PREFIX_PGMETA_OMAP : PREFIX_OMAP;
     string key_tail;
     bufferlist tail;
     get_omap_tail(o->onode.nid, &key_tail);
@@ -12944,6 +12946,8 @@ int BlueStore::_omap_setheader(TransContext *txc,
   } else {
     txc->note_modified_object(o);
   }
+  const string& prefix =
+    o->onode.is_pgmeta_omap() ? PREFIX_PGMETA_OMAP : PREFIX_OMAP;
   get_omap_header(o->onode.nid, &key);
   txc->t->set(prefix, key, bl);
   r = 0;

the omap iterator when listing omap use the tail of '~', when the iterator moved to the last key of the omapswe wanted, we will try to call extra next(), usually this will be another object's omap header(with '-'). IF there are some deleted key or tombstones, rocksdb will fall in the loop of FindNextUserEntryInternal until find a valid key, so it will travels all dead key in mid and read the sst file heavily.

Signed-off-by: Tao Ning <ningtao@sangfor.com.cn>
@cxytt
Copy link
Author

cxytt commented May 29, 2019

retest this please

@ifed01
Copy link
Contributor

ifed01 commented Jun 7, 2019

jenkins retest this please

@ifed01
Copy link
Contributor

ifed01 commented Jun 21, 2019

@liewegas - do you think we can go ahead with this PR?

@liewegas
Copy link
Member

Yep, let's retest!

@tchaikov tchaikov merged commit b9be028 into ceph:master Jun 25, 2019
@cxytt cxytt deleted the fix-add-omap-tail branch June 25, 2019 08:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants