os/bluestore: introduce new io_uring IO engine #27392
Conversation
Super excited about this! Any chance you can test with the default min_alloc and blob sizes on a typical device? |
@markhpc Yes, sure, I will return to this topic. It would be great if you could provide the desired options, because it is not always clear to me how to set up bluestore in order to squeeze everything out of the performance. |
|
From @JeffMoyer: "I noticed that the pull request supports IO_URING_REGISTER_FILES, but not _BUFFERS. If there's a way to add the latter, that should provide another good speedup." Have you done any tests on NVMe devices (instead of ramdisks)? |
The low-level IO engine does not control incoming buffers, so I could not find a way to register buffers in advance.
Not yet, but I plan to do that. I decided to publish the RFC ASAP in order to get some early feedback or help with testing. |
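To illustrate the buffer-registration point: IORING_REGISTER_BUFFERS only helps if the engine owns a fixed set of buffers known up front, which is exactly what this engine does not have. A minimal sketch of what such a pre-registered pool could look like (the `FixedBufferPool` class and its method names are hypothetical, not part of the PR):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical pool of fixed buffers. The index returned by acquire() is
// what would be passed as buf_index in a *_FIXED io_uring sqe if the
// buffers were registered with IORING_REGISTER_BUFFERS at startup.
class FixedBufferPool {
  std::vector<std::vector<char>> bufs_;
  std::vector<bool> in_use_;
public:
  FixedBufferPool(size_t count, size_t size)
    : bufs_(count, std::vector<char>(size)), in_use_(count, false) {}

  // Returns the index of a free buffer, or -1 if the pool is exhausted.
  int acquire() {
    for (size_t i = 0; i < bufs_.size(); ++i)
      if (!in_use_[i]) { in_use_[i] = true; return (int)i; }
    return -1;
  }
  void release(int i) { in_use_[i] = false; }
  char *data(int i) { return bufs_[i].data(); }
  size_t size() const { return bufs_.empty() ? 0 : bufs_[0].size(); }
};
```

The engine would have to copy incoming IO buffers into (or hand out buffers from) such a pool, which is the extra plumbing the comment above alludes to.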
|
@rouming I think to start out just default options on NVMe would be fine. Mostly we want to see how the behavior changes. This is a big enough change that it may impact what things we need to tune in the future. |
|
retest this please |
|
@rouming after we did some testing in our test environment, we found 2 problems:
|
You mean
Hm, the messenger loop is separated from the IO loop. Do you experience an IO hang? Is it possible to get backtraces from all threads of the OSD which does not respond? |
|
Did some testing of this on Kernel 5.1. First hit what looked like it may have been a deadlock. Second attempt got farther, but as soon as reads started hit the following assert. I'm debugging now. |
Seems that io->iocb.aio_lio_opcode should be equal to IO_CMD_PREAD, not the vector variant. At least I do not see a preadv() variant in ceph_aio.h, only pread(). In my tests I did only random writes, so it may be that I missed that. The other thing which worries me is that you are not allowed to call ioring_getevents() or ioring_queue() from different threads. I could not prove to myself that requests are submitted from different threads (actually they are, but locks are taken by the caller, at least my shallow debugging showed that). So this question is still open: is the caller of aio (and of io_uring as well) responsible for all locking? |
|
@rouming yep, it's the same as @blacktear23 saw, IO_CMD_PREAD is not currently handled in init_io so we assert. I added a switch statement for the other stuff. I'll see if I can poke at KernelDevice.cc some more and make progress. |
|
@rouming for the first issue, below is the backtrace from a debug build, which should give you more information on what happened. For the second issue, yes, we are experiencing an IO hang in testing. By the way, I also did some code examination and am also wondering where IO_CMD_PREAD comes from. |
|
FWIW I'm working on another PR to convert KernelDevice/ceph_aio to use preadv. It's a bit broken now with random reads but works for sequential. Will work on it more tomorrow. |
As a quick fix for io_uring we can pretend we have an iov for the read path as well. @markhpc, what do you think? |
|
I updated the current PR with minor changes in pread() in order to pass an iov for both read and write paths. |
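The "pretend we have an iov" trick is easy to sketch: wrap the single pread() buffer into a one-element iovec and go through the vectored syscall, so both read and write take the readv/writev path. A minimal standalone illustration (the wrapper name is mine, not the PR's):

```cpp
#include <cstring>
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

// Wrap a plain pread() request into a 1-element iovec and issue it via
// preadv(), so the same vectored code path can serve both reads and writes.
ssize_t pread_as_preadv(int fd, void *buf, size_t len, off_t off) {
  struct iovec iov;
  iov.iov_base = buf;
  iov.iov_len = len;
  return preadv(fd, &iov, 1, off);
}
```

With this shape, the io_uring submission side only ever has to map to IORING_OP_READV/IORING_OP_WRITEV.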
|
I'm starting to wonder a bit if some of the KernelDevice and ceph_aio code could use a bigger refactor. FWIW, I tested the io_uring read path with my preadv change last night and the OSD was hanging again. Will need to dig in more to figure that out. |
That's similar to what I did, though I made it more like pwritev by moving some of that logic out into KernelDevice. |
Indeed, this part looks very hairy. Regarding the hang: should the low-level IO engine be thread-safe? Since we access the ring buffer from userspace and do that as fast as possible, I did not do any locking there and rely on the level above. So the question is: do we need to protect the submit_batch() and get_next_completed() calls? |
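If the answer turns out to be that the engine must protect itself, the simplest fix is to serialize submit_batch() and get_next_completed() behind a single mutex. A toy sketch of that policy, with a plain queue standing in for the real SQ/CQ ring pair (all names here are hypothetical, not the PR's API):

```cpp
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// Toy stand-in for the io_uring engine: one mutex covers both the submit
// and the reap path, so callers from different threads cannot corrupt the
// ring's head/tail indices.
class SerializedRing {
  std::mutex m_;
  std::deque<int> ring_;  // plays the role of the SQ/CQ pair
public:
  void submit_batch(const std::vector<int> &ios) {
    std::lock_guard<std::mutex> l(m_);
    for (int io : ios)
      ring_.push_back(io);
  }
  // Returns the next completed IO token, or -1 if nothing is pending.
  int get_next_completed() {
    std::lock_guard<std::mutex> l(m_);
    if (ring_.empty())
      return -1;
    int io = ring_.front();
    ring_.pop_front();
    return io;
  }
};
```

The trade-off is exactly the one raised above: the lock costs a few nanoseconds per call, versus relying on the layer above to guarantee single-threaded access.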
I made a separate PR for my changes. We'll want to test this independently of the io_uring work to make sure we don't introduce any regressions in the aio path. |
src/os/bluestore/io_uring.cc
Outdated
```cpp
r = ioring_cqring_reap(d, events, max, paio);
if (r) {
  events += r;
  continue;
```
Need to incorporate this fix from the fio code:
```cpp
if (actual_min != 0)
  actual_min -= r;
```
Ha, what will happen if r > actual_min? Not nice.
Seems it should be actual_min -= min(r, actual_min);
Also, in our case actual_min is always zero.
But yes, the idea is clear. Thanks for pointing that out.
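The clamped variant suggested above can be written as a tiny standalone helper (the function name is mine; in fio/this engine the subtraction happens inline). The point is that with an unsigned counter, `actual_min -= r` would wrap around when a reap returns more events than were asked for:

```cpp
#include <algorithm>

// Remaining minimum-events budget after a reap that returned r events.
// Clamping with min() prevents unsigned underflow when r > actual_min.
unsigned reap_budget(unsigned actual_min, unsigned r) {
  return actual_min - std::min(r, actual_min);
}
```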
Yep, I agree. Might want to point that out to the fio guys too.
```cpp
{
  struct io_uring_sqe *sqe = &d->sqes[index];
  ...
  if (io->iocb.aio_lio_opcode == IO_CMD_PWRITEV)
```
To debug the pread issue I changed this if/else block into a switch statement to handle the other cases:
```cpp
switch (io->iocb.aio_lio_opcode) {
case IO_CMD_PREAD:
  ceph_abort_msg("Don't know how to handle IO_CMD_PREAD with io_uring");
  break;
case IO_CMD_PWRITE:
  ceph_abort_msg("Don't know how to handle IO_CMD_PWRITE with io_uring");
  break;
case IO_CMD_FSYNC:
  sqe->opcode = IORING_OP_FSYNC;
  break;
case IO_CMD_FDSYNC:
  ceph_abort_msg("Don't know how to handle IO_CMD_FDSYNC yet");
  break;
case IO_CMD_POLL: /* Never implemented in mainline, see io_prep_poll */
  sqe->opcode = IORING_OP_NOP;
  break;
case IO_CMD_NOOP:
  sqe->opcode = IORING_OP_NOP;
  break;
case IO_CMD_PREADV:
  sqe->opcode = IORING_OP_READV;
  break;
case IO_CMD_PWRITEV:
  sqe->opcode = IORING_OP_WRITEV;
  break;
default:
  ceph_assert(0);
}
```
Why would you not print and crash in all the cases which are not readv/writev? Have you found places where something else can be issued?
Mostly I was just trying to clean up the non-readv/writev handling using ceph_abort_msg instead of assert, but it looks like we should be able to handle noop cleanly at least. POLL apparently was never implemented but you are right that we should probably abort there rather than nop. FDSYNC I'm aborting on and fsync passing through. I think if POLL is changed to abort, the only question then is fsync?
Since nothing other than readv/writev is ever sent, I would always print & abort. If someone introduces other request types, the io_uring engine has to be updated as well. This seems like more explicit behavior, which forces the developer to test whenever something new is introduced.
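The "abort on anything but readv/writev" policy being argued for can be sketched as a small translation function. The enums below are stand-ins for illustration only; the real IO_CMD_* constants come from <libaio.h> and the real IORING_OP_* values from the kernel's io_uring headers:

```cpp
#include <stdexcept>

// Stand-in enums; real code would use the IO_CMD_* / IORING_OP_* constants.
enum class AioCmd { PREAD, PWRITE, PREADV, PWRITEV, FSYNC, FDSYNC, POLL, NOOP };
enum class UringOp { READV, WRITEV };

// Only the vectored opcodes are translated; everything else is treated as a
// programming error, per the "print & abort" policy discussed above.
UringOp translate(AioCmd cmd) {
  switch (cmd) {
  case AioCmd::PREADV:
    return UringOp::READV;
  case AioCmd::PWRITEV:
    return UringOp::WRITEV;
  default:
    throw std::runtime_error("unsupported aio opcode for io_uring engine");
  }
}
```

In the actual engine the throw would be a ceph_abort_msg() with the offending opcode in the message, so a newly introduced request type fails loudly during testing.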
|
Mark Nelson <notifications@github.com> writes:
> @markhpc I tested with the 5.4.0-050400rc3 kernel on a 3-OSD vstart
> cluster. The 4k sequential write of the io_uring backend is almost the
> same as aio, while the 4k random read is much better.
>
> | backend | read bw (MiB/s) | read IOPS | read lat (us) | write bw (MiB/s) | write IOPS | write lat (s) |
> |---|---|---|---|---|---|---|
> | io_uring | 143.204 | 36660 | 433.094 | 0.22226 | 56 | 0.277773 |
> | aio | 13.6573 | 3496 | 4572.4 | 0.226194 | 57 | 0.275743 |
>
> So I guess it's good to merge once it passes the rados qa
> suite. What do you think? Probably we can remove the DNM label now?
I suspect there is something odd going on with those aio read numbers.
I can't imagine that aio was hurting that badly (4-5K us latency? How
much concurrency?)
Buffered I/O, perhaps?
…-Jeff
|
Hm, that should not be the case, since the descriptors from the "fd_directs" array (opened with O_DIRECT) are registered in io_uring. But the results are indeed strange; at least rbd.fio does not show a significant difference between the aio and io_uring engines for 4k reads. |
|
@markhpc is anything pending before we remove the DNM label? @wjwithagen I think you requested changes, but I still don't get what exactly the change was. |
```cpp
.set_description("Enforces specific hw profile settings")
.set_long_description("'hdd' enforces settings intended for BlueStore above a rotational drive. 'ssd' enforces settings intended for BlueStore above a solid drive. 'default' - using settings for the actual hardware."),
...
Option("bluestore_ioring", Option::TYPE_BOOL, Option::LEVEL_ADVANCED)
```
@markhpc we don't have an experimental flag anymore, but this setting is marked with "Option::LEVEL_ADVANCED" and is disabled by default. I think that's good enough; or do you have a specific idea how to mark it experimental?
I'm going to be practical about this... I'll take the FreeBSD fallout in my posix-aio work as it comes, and work from there. So feel free to commit. |
|
@tchaikov - I wonder what the CPU usage is - if there's any difference per IOP or MBps. |
Nothing on my end. I think we may want to refactor some day, but that day isn't today! 👍 |
|
@markhpc thanks. Will rebase the change and rerun the perf test using fio. |
|
@tchaikov Do you plan to return to this PR? |
|
@rouming definitely yes. i am still debugging my own vstart cluster. stuck by another issue. |
|
@rouming hello, I tested using a ramdisk (60G; the engine is libaio or io_uring), but random 4k write is only 1.6k IOPS. My config is in ceph-bluestore.conf. Can you give me some advice? Many thanks |
|
@junxfl Probably you forgot to specify bluestore_block_path=/dev/ram0. Also, in order not to spoil the fio output with ceph logs, you can disable them: debug bluestore = 0/0 |
|
@rouming Hi rouming, I tested again with "bluestore_block_path=/dev/ram0" added, and random 4k write is still 1.6k IOPS. My config is in ceph-bluestore.conf. Can you send me your configuration file? Thanks. |
|
@junxfl it is not enough to have only the block device on ram via bluestore_block_path=/dev/ram0; rocksdb should also be on ram. |
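For a fully-in-RAM test, a hedged sketch of the relevant conf fragment follows. The device paths are examples; bluestore_block_db_path and bluestore_block_wal_path are the standard BlueStore options for placing rocksdb's DB and WAL, and the exact device split is an assumption, not taken from this thread:

```ini
[global]
# Example ramdisk layout for benchmarking; adjust devices/sizes to taste.
bluestore_block_path = /dev/ram0      ; main data device on ram
bluestore_block_db_path = /dev/ram1   ; rocksdb DB must also live on ram
bluestore_block_wal_path = /dev/ram2  ; and the rocksdb WAL
bluestore_ioring = true               ; enable the io_uring engine from this PR
debug_bluestore = 0/0                 ; keep ceph logs out of the fio numbers
```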
In the next patch the new io_uring API will be used instead of libaio, so this prepares the abstract interface. Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
This implements a low-level IO engine which utilizes the brand-new io_uring interface: https://lwn.net/Articles/776428/ By default libaio is used. If bluestore_ioring=true is set but the kernel does not support io_uring, or the architecture is not x86-64, libaio will be used instead. In the current patch the liburing library is used in order not to open-code everything. In order to compile with liburing, WITH_LIBURING=ON should be specified. Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
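The fallback behavior described in the commit message (use libaio when the running kernel lacks io_uring) implies a runtime probe. One way to sketch such a probe, assuming the x86-64 syscall number 425 when the build headers are too old to define __NR_io_uring_setup; this is an illustration, not the PR's actual detection code:

```cpp
#include <cerrno>
#include <cstddef>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_io_uring_setup
#define __NR_io_uring_setup 425  // x86-64; matches the PR's x86-64-only gate
#endif

// Probe whether the running kernel knows io_uring at all. We call
// io_uring_setup(2) with deliberately invalid arguments: ENOSYS means the
// syscall does not exist (fall back to libaio); any other error (EINVAL,
// EFAULT, ...) means the kernel implements it.
bool kernel_supports_io_uring() {
  long r = syscall(__NR_io_uring_setup, 0u, nullptr);
  return !(r < 0 && errno == ENOSYS);
}
```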
Cc: @ifed01