Skip to content

os/bluestore: make osd shard-thread do oncommits#22739

Merged
tchaikov merged 3 commits intoceph:masterfrom
majianpeng:osd-shardthread-do-bluestore-oncommits
Sep 20, 2018
Merged

os/bluestore: make osd shard-thread do oncommits#22739
tchaikov merged 3 commits intoceph:masterfrom
majianpeng:osd-shardthread-do-bluestore-oncommits

Conversation

@majianpeng
Copy link
Copy Markdown
Member

The benefits:
a)remove finisher threads
b)avoid racing of pg lock.
For some workload, this can decreased performance.

Signed-off-by: Jianpeng Ma jianpeng.ma@intel.com

@majianpeng
Copy link
Copy Markdown
Member Author

majianpeng commented Jun 27, 2018

My test host:
one osd(P3700) as osd/rocksdb. 10 rbd image(every 10G); fio+librbd; 4k randwrite.

4k-8osdthread-1finisher: IOPS=51K
4k-8osdthread-8finisher: IOPS=51.7K
4k-8osdthread-with-pr: IOPS=49.7K
4k-16osdthread-1finisher: IOPS=50.6K
4k-16osdthread-16finisher: IOPS=54.9K
4k-16osdthread-with-patch: IOPS=58.6k.

THe idea from seastar which make callback on the same core. From the results, when workload is relative lower, has additional finisher thread has better performance. But for heavy workload, this PR show better performance(58.6/50.6=115%) and remove the additional finisher threads.

@majianpeng
Copy link
Copy Markdown
Member Author

@markhpc @liewegas @ifed01 @tchaikov . please review.

@markhpc
Copy link
Copy Markdown
Member

markhpc commented Jun 27, 2018

@majianpeng Any chance I could get you to try using the gdb wallclock profiler on a before/after test case?

You can see it here:
https://github.com/markhpc/gdbpmp

Thanks!
Mark

++slot->num_running;
list<Context *> oncommits;
if (!sdata->context_queue.empty()) {
sdata->context_queue.swap(oncommits);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not finish() the contexts here?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at this point, it already lock(sdata->shard_lock)
a)context callback may take long time
b)context will get pg lock and will cause dead lock.

@liewegas
Copy link
Copy Markdown
Member

This is really interesting! A few initial thoughts:

  • We can't entirely remove the finishers because sometimes the store runs outside of the OSD.
  • We can come up with a cleaner way to (explicitly!) attach the context queues to the collections.
  • We need to be really careful about deadlocks. Usually they come from the throttling in queue_transaction, but I think this change won't affect anything because we release those throttles in kv_sync_thread, not the finishers.
  • I think we can make the change to the op wq simpler...

Any sense of why this is slower with only 8 worker threads?

@majianpeng majianpeng force-pushed the osd-shardthread-do-bluestore-oncommits branch 3 times, most recently from f5096e7 to 4951286 Compare July 4, 2018 01:58
@majianpeng
Copy link
Copy Markdown
Member Author

@liewegas . update by your suggestion. please review.

@liewegas liewegas requested a review from jdurgin July 6, 2018 22:03
@majianpeng
Copy link
Copy Markdown
Member Author

@jdurgin . ping. Thanks!

@majianpeng
Copy link
Copy Markdown
Member Author

@jdurgin . Have you time to review this PR? Thanks very much.

@jdurgin
Copy link
Copy Markdown
Member

jdurgin commented Jul 14, 2018

I like this idea - agree we need to be careful about deadlocks.

The diff --minimal is a little more readable - the op wq change is simple enough, the change in indentation level just makes the diff look bigger.

@majianpeng can you rebase? Then let's run this through the rados suite and see if any fallout emerges.

@majianpeng majianpeng force-pushed the osd-shardthread-do-bluestore-oncommits branch from 7967e6d to 0e04e66 Compare July 16, 2018 05:27
@majianpeng
Copy link
Copy Markdown
Member Author

@liewegas @jdurgin . update by your suggestions: a)rebase on master ; b)change OSD::ShardedOpWQ::_process which make looks readable. please review again. Thanks!

@majianpeng
Copy link
Copy Markdown
Member Author

@liewegas . Ping. Thanks

@majianpeng
Copy link
Copy Markdown
Member Author

@liewegas . Ping.

void swap(list<Context *>& ls) {
ls.clear();
std::lock_guard<std::mutex> l(q_mutex);
if(!q.empty()) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: whitespace

}
for (auto c : on_applied) {
finishers[osr->shard]->queue(c);
if(on_applied.size()) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: whitespace

src/osd/OSD.h Outdated
}

void handle_oncommits(list<Context*>& oncommits) {
if (!oncommits.empty()) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this 'if' actually help anything?

Copy link
Copy Markdown
Member

@liewegas liewegas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few nits, but otherwise this looks good! We need to test pretty thoroughly in QA tho

@liewegas liewegas changed the title [RFC]os/bluestore: make osd shard-thread do oncommits. os/bluestore: make osd shard-thread do oncommits Jul 20, 2018
@majianpeng majianpeng force-pushed the osd-shardthread-do-bluestore-oncommits branch from 0e04e66 to 4fe8118 Compare July 23, 2018 01:58
@majianpeng
Copy link
Copy Markdown
Member Author

retest this please.

@majianpeng
Copy link
Copy Markdown
Member Author

@liewegas . update by your suggestions. But it can't pass "make check". What's mean of "Pending — Build triggered for merge commit. "? In my dev-machine, it can pass all tests.

@liewegas
Copy link
Copy Markdown
Member

@majianpeng The lab is down, which probably include jenkins. SHould be fixed shortly!

@majianpeng
Copy link
Copy Markdown
Member Author

retest this please

@tchaikov
Copy link
Copy Markdown
Contributor

@majianpeng
Copy link
Copy Markdown
Member Author

@tchaikov . Still faied for the same reason:
2018-09-18T09:50:12.275 INFO:tasks.workunit.client.0.smithi154.stderr:mkfs failed: (22) Invalid argument
2018-09-18T09:50:12.277 INFO:tasks.workunit.client.0.smithi154.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/cephtool/test_kvstore_tool.sh:1: test_ceph_kvstorrm -fr /tmp/cephtool.q2a
2018-09-18T09:50:12.278 INFO:tasks.workunit:Stopping ['cephtool'] on client.0...
2018-09-18T09:50:12.278 INFO:teuthology.orchestra.run.smithi154:Running: 'rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0'
2018-09-18T09:50:12.432 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):

But for bluestore mkfs failed: (22) Invalid argument. THe code don't affect those code. @tchaikov , Could you give some suggestion? Thanks!

@tchaikov
Copy link
Copy Markdown
Contributor

tchaikov commented Sep 18, 2018

@liewegas
Copy link
Copy Markdown
Member

Have we repeated the original performance test recently? There have been a lot of changes to the code here and I would like to see that we are still seeing a sufficient benefit!

@majianpeng majianpeng force-pushed the osd-shardthread-do-bluestore-oncommits branch from ce496c9 to d5627af Compare September 19, 2018 05:43
@majianpeng
Copy link
Copy Markdown
Member Author

@tchaikov . In rhel it failed But for Ubuntu i can passed. So i create a new #24167 which add --debug for ceph-objectstore-tool to get detail message.

@tchaikov
Copy link
Copy Markdown
Contributor

@majianpeng before we devote more resources to this improvement, could you repeat the benchmark to address Sage's concern?

@majianpeng
Copy link
Copy Markdown
Member Author

@tchaikov . I'm doing now by sage's request.

@majianpeng
Copy link
Copy Markdown
Member Author

  4k-8osdthread-1finisher 4k-8osdthread-8finisher 4k-8osdthread-with-patch 4k-16osdthread-1finisher 4k-16osdthread-16finisher 4k-16osdthread-with-patch
IOPS(K)(P3700 + First version+PG=128) 51 51.7 49.7 50.6 54.9 58.6
IOPS(K)(P3700 + Current version+PG=128) 62.1 60.2 59 59.3 59.2 63.3
IOPS(K)(Optane + current Version+PG=128) 61.8 62.5 59.8 60.4 65.7 68.4
IOPS(K)(P3700 + Current version+PG=16) 31.7 57.9 59 26.6 57.2 64.4
IOPS(K)((Optane + Current version+PG=16) 37.3 62.7 60 37.2 66.8 68.4
  • Decrease in growth for 4k-16osdthread-with-patch/4k-16osdthread-1finisher: from firs version(58.6/50.6=1.15) to (current version) (63.3/59.3=1.07). But for Optane(faster device than P3700), there is still 13% performance improved. (68.4/60.4)

  • for PG=16, we see about double performance dropped for one finisher. But at this condition, this PR can improve performance(about 100%) than one-finisher and multi-finisher.

@liewegas From the results, i think this PR still deserves to be merged.

@liewegas
Copy link
Copy Markdown
Member

Sounds good! Let's proceed

@tchaikov
Copy link
Copy Markdown
Contributor

@tchaikov
Copy link
Copy Markdown
Contributor

@majianpeng turns out we were mkfsing on a tmpfs

2018-09-20T05:15:43.102 INFO:tasks.workunit.client.0.smithi160.stderr:2018-09-20 05:15:43.115 7f262bd9f600  1 bdev create path /tmp/cephtool.sS6/block type kernel
2018-09-20T05:15:43.102 INFO:tasks.workunit.client.0.smithi160.stderr:2018-09-20 05:15:43.115 7f262bd9f600  1 bdev(0x560cc462ea80 /tmp/cephtool.sS6/block) open path /tmp/cephtool.sS6/block
2018-09-20T05:15:43.102 INFO:tasks.workunit.client.0.smithi160.stderr:2018-09-20 05:15:43.115 7f262bd9f600 -1 bdev(0x560cc462ea80 /tmp/cephtool.sS6/block) open open got: (22) Invalid argument
2018-09-20T05:15:43.102 INFO:tasks.workunit.client.0.smithi160.stderr:mkfs failed: 2018-09-20 05:15:43.115 7f262bd9f600 -1 bluestore(/tmp/cephtool.sS6) mkfs failed, (22) Invalid argument(22) Invalid argument

and this test failure was addressed by #24143

@majianpeng
Copy link
Copy Markdown
Member Author

@tchaikov . Thanks very much. How to handle #24167. Close or?

Copy link
Copy Markdown
Contributor

@tchaikov tchaikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aside from the nits, lgtm.


void swap(list<Context *>& ls) {
ls.clear();
std::lock_guard<std::mutex> l(q_mutex);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use scoped_lock if you please.

std::lock_guard<std::mutex> l(txc->osr->qlock);
txc->state = TransContext::STATE_KV_DONE;
finishers[txc->osr->shard]->queue(txc->oncommits);
if (txc->ch->commit_queue)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for (auto c : on_applied) {
finishers[osr->shard]->queue(c);
if (!on_applied.empty()) {
if (c->commit_queue)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CollectionHandle open_collection(const coll_t &c) override;
CollectionHandle create_new_collection(const coll_t& cid) override;
void set_collection_commit_queue(const coll_t& cid,
class ContextQueue *commit_queue) override;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


//pool options
pool_opts_t pool_opts;
class ContextQueue *commit_queue;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CollectionHandle open_collection(const coll_t& c) override;
CollectionHandle create_new_collection(const coll_t& c) override;
void set_collection_commit_queue(const coll_t& cid,
class ContextQueue *commit_queue) override {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CollectionHandle open_collection(const coll_t& c) override;
CollectionHandle create_new_collection(const coll_t& c) override;
void set_collection_commit_queue(const coll_t& cid,
class ContextQueue *commit_queue) override {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CollectionHandle create_new_collection(const coll_t& c) override;

void set_collection_commit_queue(const coll_t& cid,
class ContextQueue *commit_queue) override {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src/osd/OSD.h Outdated
ceph_assert(sdata);
Mutex::Locker l(sdata->shard_lock);
return sdata->pqueue->empty();
if (thread_index < osd->num_shards)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tchaikov
Copy link
Copy Markdown
Contributor

How to handle #24167. Close or?

yeah, please close it.

@majianpeng majianpeng force-pushed the osd-shardthread-do-bluestore-oncommits branch 2 times, most recently from 41dc03c to 6c583fe Compare September 20, 2018 06:34
@majianpeng
Copy link
Copy Markdown
Member Author

@tchaikov . update by your comments. Thanks very much!

@tchaikov tchaikov merged commit f03dd73 into ceph:master Sep 20, 2018
Jianpeng Ma added 2 commits September 20, 2018 21:52
Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
@liewegas
Copy link
Copy Markdown
Member

Wahoo!

Now bluestore oncommit callback exec by osd op threads.
If there are multi threads of shard, it will cause out-of order.
For example, threads_per_shard=2
              Thread1                                 Thread2
    swap_oncommits(op1_oncommit)
                                            swap_oncommits(op2_oncommit)
    OpQueueItem.run(Op3)
                                            op2_oncommit.complete();
    op1_oncommit.complete()

This make oncommits out of order.
To avoiding this, we choose a fixed thread which has the smallest
thread_index of shard to do oncommit callback function.

Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
@majianpeng
Copy link
Copy Markdown
Member Author

@liewegas @tchaikov . Thanks very much for help make this be merged at finale.

@tchaikov
Copy link
Copy Markdown
Contributor

2018-09-20T13:28:39.490 INFO:teuthology.orchestra.run.smithi061.stdout:[ RUN      ] ObjectStore/StoreTest.BluestoreRepairTest/2
2018-09-20T13:28:40.620 INFO:teuthology.orchestra.run.smithi061.stderr:create collection + write
2018-09-20T13:28:41.234 INFO:teuthology.orchestra.run.smithi061.stderr:fix leaked pextents
2018-09-20T13:28:41.766 INFO:teuthology.orchestra.run.smithi061.stdout:/build/ceph-14.0.0-3418-gd4c8323/src/test/objectstore/store
_test.cc:7060: Failure
2018-09-20T13:28:41.766 INFO:teuthology.orchestra.run.smithi061.stdout:      Expected: bstore->fsck(false)
2018-09-20T13:28:41.766 INFO:teuthology.orchestra.run.smithi061.stdout:      Which is: 14
2018-09-20T13:28:41.766 INFO:teuthology.orchestra.run.smithi061.stdout:To be equal to: 0
2018-09-20T13:28:41.766 INFO:teuthology.orchestra.run.smithi061.stderr:/build/ceph-14.0.0-3418-gd4c8323/src/os/bluestore/BlueStore
.cc: In function 'virtual int BlueStore::umount()' thread 7ff9398f9b00 time 2018-09-20 13:28:41.766562
2018-09-20T13:28:41.767 INFO:teuthology.orchestra.run.smithi061.stderr:/build/ceph-14.0.0-3418-gd4c8323/src/os/bluestore/BlueStore.cc: 5875: FAILED ceph_assert(_kv_only || mounted)
2018-09-20T13:28:41.771 INFO:teuthology.orchestra.run.smithi061.stderr: ceph version 14.0.0-3418-gd4c8323 (d4c8323a70a8a3c53f3884e8c4b5eea64dd80e1c) nautilus (dev)
2018-09-20T13:28:41.771 INFO:teuthology.orchestra.run.smithi061.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7ff92f5bf995]
2018-09-20T13:28:41.771 INFO:teuthology.orchestra.run.smithi061.stderr: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x7ff92f5bfb72]
2018-09-20T13:28:41.771 INFO:teuthology.orchestra.run.smithi061.stderr: 3: (BlueStore::umount()+0x56a) [0x87ee4a]
2018-09-20T13:28:41.771 INFO:teuthology.orchestra.run.smithi061.stderr: 4: (StoreTestFixture::TearDown()+0x62) [0x7174a2]
2018-09-20T13:28:41.771 INFO:teuthology.orchestra.run.smithi061.stderr: 5: (void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x77) [0x9b0657]
2018-09-20T13:28:41.772 INFO:teuthology.orchestra.run.smithi061.stderr: 6: (testing::TestInfo::Run()+0xaf) [0x9a503f]
2018-09-20T13:28:41.772 INFO:teuthology.orchestra.run.smithi061.stderr: 7: (testing::TestCase::Run()+0xb5) [0x9a5175]

@majianpeng i have the failure above in http://pulpito.ceph.com/kchai-2018-09-20_12:40:38-rados-wip-kefu-testing-2018-09-19-1821-distro-basic-smithi/3046733/

in int BlueStore::_fsck(),

return errors - (int)repaired;

so we had 14 errors not fixed by fsck.

could you help take a look at this ?

@tchaikov
Copy link
Copy Markdown
Contributor

tracked by http://tracker.ceph.com/issues/36099

@majianpeng
Copy link
Copy Markdown
Member Author

@tchaikov . i'll check this now.

@liewegas
Copy link
Copy Markdown
Member

@rzarzynski started looking at this yesterday too

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants