os/bluestore: Actually wait until completion in write_sync#26909

Merged
tchaikov merged 1 commit into ceph:master from vitalif:wal-sync-fix
Jun 11, 2019

Conversation

@vitalif
Contributor

@vitalif vitalif commented Mar 12, 2019

This function is only used by RocksDB WAL writing so it must sync data.

This fixes #18338 and thus makes it possible to actually set `bluefs_preextend_wal_files` to true, roughly doubling (+100%) single-thread write IOPS in disk-bound (HDD or slow SSD) setups. To my knowledge it doesn't hurt performance in other cases. Test it yourself on any HDD with `fio -ioengine=rbd -direct=1 -bs=4k -iodepth=1`.

Issue #18338 is easily reproduced without this patch by issuing a `kill -9` to the OSD while running `fio -ioengine=rbd -direct=1 -bs=4M -iodepth=16`.

Fixes: https://tracker.ceph.com/issues/18338 https://tracker.ceph.com/issues/38559
Signed-off-by: Vitaliy Filippov vitalif@yourcmc.ru

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug
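The core contract the patch title refers to can be illustrated outside of Ceph. The sketch below is not Ceph code; it is a minimal Python illustration (the `write_sync` name and the temp-file path are assumptions for the example) of what a sync-write must guarantee: the call does not return until the data is durable on disk, so a `kill -9` immediately afterwards cannot lose the WAL record.

```python
import os
import tempfile

def write_sync(fd: int, data: bytes) -> None:
    """Write data and block until it is durable on disk."""
    os.write(fd, data)
    # Without this flush, a kill -9 (or power loss) right after returning
    # could lose the write -- the failure mode this PR fixes for the
    # RocksDB WAL path in BlueFS.
    os.fdatasync(fd)

path = os.path.join(tempfile.mkdtemp(), "wal")
fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
write_sync(fd, b"wal record")
os.close(fd)
```

The `fdatasync` variant is used rather than `fsync` because a WAL append only needs the data (and file size) to be stable, not all inode metadata.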

@liewegas liewegas changed the title osd/bluestore: Actually wait until completion in write_sync os/bluestore: Actually wait until completion in write_sync Mar 12, 2019
@liewegas
Member

Haven't looked at this closely yet, but a quick clarification: bluefs_preextend_wal_files is only useful in making writes faster right after the bluestore/OSD is created. Once you've written a little bit of data, we start recycling rocksdb log files, and this option has no effect. So there isn't much value in making it faster.

@vitalif
Contributor Author

vitalif commented Mar 12, 2019

Another case is just after compaction... anyway, isn't it better to have consistent performance regardless of whether the log files are being reused?

@vitalif
Contributor Author

vitalif commented Mar 12, 2019

I've just checked: yes, it does seem to reuse log files. However, I observe a VERY strange behaviour. The extra io_submit and fdatasync calls go away when I run `fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -pool=bench -rbdname=testimg` against the "bench" pool, which has 20 PGs (the "cluster" is super-small, only 5 HDD OSDs). But when I run it against another pool, "bench1", which has only 1 PG that lives on the OSD I'm strace'ing, the extra io_submit and fdatasync calls come back (!!!) and the fio result drops to ~20-30 IOPS. The Ceph version on that cluster is 13.2.4.
How can that be possible? O_o

@vitalif
Contributor Author

vitalif commented Mar 12, 2019

Could you test it yourself with a single OSD and `strace -f -p 40603 2>&1 | grep -P 'io_submit|pwrite|sync'`?

@rzarzynski
Contributor

jenkins retest this please

@stale

stale bot commented May 27, 2019

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@stale stale bot added the stale label May 27, 2019
@jdurgin
Member

jdurgin commented May 30, 2019

@liewegas

@stale stale bot removed the stale label May 30, 2019
@liewegas
Member

retest this please

@tchaikov
Contributor

@liewegas I just marked the corresponding ticket with "backport=mimic,nautilus"; please revert the change if I am wrong.

@tchaikov tchaikov merged commit 194b915 into ceph:master Jun 11, 2019
@vitalif
Contributor Author

vitalif commented Jul 18, 2019

I see it wasn't backported into Nautilus 14.2.2; is that OK?

@liewegas
Member

Yeah, it just didn't get backported in time.
