
os/bluestore: introduce new io_uring IO engine #27392

Merged
tchaikov merged 2 commits into ceph:master from rouming:bluestore-iouring
Dec 10, 2019

Conversation

@rouming

@rouming rouming commented Apr 5, 2019

This implements a low-level IO engine which utilizes the brand-new
io_uring IO interface: https://lwn.net/Articles/776428/

The following basic config was used for fio_ceph_objectstore:

  rw=randwrite
  iodepth=16
  nr_files=1
  numjobs=1
  size=256m

  bluestore_min_alloc_size = 4096
  bluestore_max_blob_size  = 65536

  bluestore_block_path     = /dev/ram0
  bluestore_block_db_path  = /dev/ram1
  bluestore_block_wal_path = /dev/ram2

bluestore_iouring=false

   4k  IOPS=25.5k, BW=99.8MiB/s, Lat=0.374ms
   8k  IOPS=21.5k, BW=168MiB/s,  Lat=0.441ms
  16k  IOPS=17.2k, BW=268MiB/s,  Lat=0.544ms
  32k  IOPS=12.3k, BW=383MiB/s,  Lat=0.753ms
  64k  IOPS=8358,  BW=522MiB/s,  Lat=1.083ms
 128k  IOPS=4724,  BW=591MiB/s,  Lat=1.906ms

bluestore_iouring=true

   4k  IOPS=29.2k, BW=114MiB/s,  Lat=0.331ms
   8k  IOPS=30.7k, BW=240MiB/s,  Lat=0.319ms
  16k  IOPS=27.4k, BW=428MiB/s,  Lat=0.368ms
  32k  IOPS=22.7k, BW=709MiB/s,  Lat=0.475ms
  64k  IOPS=15.6k, BW=978MiB/s,  Lat=0.754ms
 128k  IOPS=9572,  BW=1197MiB/s, Lat=1.223ms

Overall IOPS increase is the following:

   4k  +14%
   8k  +42%
  16k  +59%
  32k  +89%
  64k  +85%
 128k  +102%

By default libaio is used.  If bluestore_iouring=true is set but the
kernel does not support io_uring, or the architecture is not x86-64,
libaio will be used instead.

Cc: @ifed01

@batrick
Member

batrick commented Apr 5, 2019

needs rebase

@rouming rouming force-pushed the bluestore-iouring branch from 323016c to d761727 Compare April 8, 2019 16:29
@rouming rouming force-pushed the bluestore-iouring branch 3 times, most recently from e058747 to 4c89f13 Compare April 9, 2019 19:05
@markhpc
Copy link
Copy Markdown
Member

markhpc commented Apr 11, 2019

Super excited about this! Any chance you can test with the default min_alloc and blob sizes on a typical device?

@rouming
Author

rouming commented Apr 11, 2019

Super excited about this! Any chance you can test with the default min_alloc and blob sizes on a typical device?

@markhpc Yes, sure, I will return to this topic. It would be great if you could provide the desired options, because it is not always clear to me how to set up bluestore in order to squeeze out maximum performance.

@liewegas
Member

From @JeffMoyer: "I noticed that the pull request supports IO_URING_REGISTER_FILES, but not _BUFFERS. If there's a way to add the latter, that should provide another good speedup."

Have you done any tests on NVMe devices (instead of ramdisks)?

@rouming
Author

rouming commented Apr 12, 2019

@liewegas

From @JeffMoyer: "I noticed that the pull request supports IO_URING_REGISTER_FILES, but not _BUFFERS. If there's a way to add the latter, that should provide another good speedup."

The low-level IO engine does not control the incoming buffers, so I could not find a way to register buffers in advance.

Have you done any tests on NVMe devices (instead of ramdisks)?

Not yet, but I plan to do that. I decided to publish the RFC ASAP in order to get some early feedback or help with testing.

@markhpc
Member

markhpc commented Apr 18, 2019

@rouming I think to start out just default options on NVMe would be fine. Mostly we want to see how the behavior changes. This is a big enough change that it may impact what things we need to tune in the future.

@markhpc
Member

markhpc commented May 6, 2019

retest this please

@blacktear23

@rouming after we did some testing in our test environment, we found 2 problems:

  1. If the cluster is in recovery mode, the OSD may crash because iocb.aio_lio_opcode is set to IO_CMD_PREAD (0).
  2. If we start a rados benchmark, it may cause some OSDs to be unable to respond to heartbeats to the monitor, so the OSD is marked down.

@rouming
Author

rouming commented May 7, 2019

@blacktear23

  1. If the cluster is in recovery mode, the OSD may crash because iocb.aio_lio_opcode is set to IO_CMD_PREAD (0).

You mean sqe->opcode = IORING_OP_READV write access? Do you have an explicit backtrace of a crash?

  2. If we start a rados benchmark, it may cause some OSDs to be unable to respond to heartbeats to the monitor, so the OSD is marked down.

Hm, the messenger loop is separate from the IO loop. Do you experience an IO hang? Is it possible to get backtraces from all threads of the OSD which does not respond?

@markhpc
Member

markhpc commented May 7, 2019

Did some testing of this on kernel 5.1. First I hit what looked like it may have been a deadlock. The second attempt got farther, but as soon as reads started it hit the following assert. I'm debugging now.

2019-05-07 14:28:54.719 7efdd7db5700 -1 /home/ubuntu/src/markhpc/ceph/src/os/bluestore/io_uring.cc: In function 'void init_io(ioring_data*, unsigned int, aio_t*)' thread 7efddddc1700 time 2019-05-07 14:28:54.716314
/home/ubuntu/src/markhpc/ceph/src/os/bluestore/io_uring.cc: 201: FAILED ceph_assert(0)

 ceph version 10.0.4-49321-g1e9907b (1e9907bcfff6a9ba011a299354e061d5dbdcb7a5) octopus (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x55951d2416dc]
 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x55951d2418aa]
 3: (()+0xb681cc) [0x55951d9531cc]
 4: (KernelDevice::aio_submit(IOContext*)+0x16b) [0x55951d94a67b]
 5: (BlueStore::_do_read(BlueStore::Collection*, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::v14_2_0::list&, unsigned int, unsigned long)+0x14bd) [0x55951d7f99bd]
 6: (BlueStore::read(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::v14_2_0::list&, unsigned int)+0x342) [0x55951d7fc7e2]
 7: (ReplicatedBackend::objects_read_sync(hobject_t const&, unsigned long, unsigned long, unsigned int, ceph::buffer::v14_2_0::list*)+0x9d) [0x55951d68624d]
 8: (PrimaryLogPG::do_sparse_read(PrimaryLogPG::OpContext*, OSDOp&)+0x444) [0x55951d4c5624]
 9: (PrimaryLogPG::do_osd_ops(PrimaryLogPG::OpContext*, std::vector<OSDOp, std::allocator<OSDOp> >&)+0x68c1) [0x55951d4d6801]
 10: (PrimaryLogPG::prepare_transaction(PrimaryLogPG::OpContext*)+0x97) [0x55951d4e4257]
 11: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x2fd) [0x55951d4e4a1d]
 12: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x391a) [0x55951d4e998a]
 13: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xc15) [0x55951d4eb475]
 14: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1ac) [0x55951d381a0c]
 15: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x55951d5c5262]
 16: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x9fb) [0x55951d39e93b]
 17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) [0x55951d9af643]
 18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55951d9b26f0]
 19: (()+0x7e25) [0x7efdfd771e25]
 20: (clone()+0x6d) [0x7efdfc85334d]

@rouming
Author

rouming commented May 7, 2019

Did some testing of this on kernel 5.1. First I hit what looked like it may have been a deadlock. The second attempt got farther, but as soon as reads started it hit the following assert. I'm debugging now.

It seems that io->iocb.aio_lio_opcode can indeed be equal to IO_CMD_PREAD, not only the vectored variant. At least I do not see a preadv() variant in ceph_aio.h, only pread(). In my tests I did only random writes, so it can be that I missed that.

The other thing which worries me is that you are not allowed to call ioring_getevents() or ioring_queue() from different threads. I could not prove to myself that requests are submitted from different threads (actually they are, but locks are taken by the caller, at least my shallow debugging showed that). So this question is still open: is the caller of aio (and io_uring as well) responsible for all locking?

@markhpc
Member

markhpc commented May 7, 2019

@rouming yep, it's the same as @blacktear23 saw, IO_CMD_PREAD is not currently handled in init_io so we assert. I added a switch statement for the other stuff. I'll see if I can poke at KernelDevice.cc some more and make progress.

@blacktear23

blacktear23 commented May 8, 2019

@rouming for the first issue, below is the backtrace from a debug build. That should give you more information on what happened.

#0  raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00005589efec2d47 in reraise_fatal (signum=6) at /data/ceph/src/global/signal_handler.cc:81
#2  0x00005589efec3e14 in handle_fatal_signal (signum=6) at /data/ceph/src/global/signal_handler.cc:326
#3  <signal handler called>
#4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#5  0x00007f982011d801 in __GI_abort () at abort.c:79
#6  0x00005589eff2db43 in ceph::__ceph_assert_fail (assertion=0x5589f0b921ba "0",
    file=0x5589f0b921c0 "/data/ceph/src/os/bluestore/io_uring.cc", line=201,
    func=0x5589f0b92260 <init_io(ioring_data*, unsigned int, aio_t*)::__PRETTY_FUNCTION__> "void init_io(ioring_data*, unsigned int, aio_t*)") at /data/ceph/src/common/assert.cc:73
#7  0x00005589eff2dbf6 in ceph::__ceph_assert_fail (ctx=...) at /data/ceph/src/common/assert.cc:78
#8  0x00005589efea85e7 in init_io (d=0x5589fc15a908, index=12, io=0x5589fc1841d0)
    at /data/ceph/src/os/bluestore/io_uring.cc:201
#9  0x00005589efea87ba in ioring_queue (d=0x5589fc15a908, priv=0x7f9805340a80,
        beg={iocb = {data = 0x0, key = 0, __pad2 = 0, aio_lio_opcode = 0, aio_reqprio = 0, aio_fildes = 30, ...}, priv = 0x7f9805340a80, fd = 30, offset = 8611233792, length = 524288, rval = -1000, ...}, end=...)
    at /data/ceph/src/os/bluestore/io_uring.cc:249
#10 0x00005589efea8bd2 in ioring_queue_t::submit_batch (this=0x5589fc15a900, beg=..., end=..., aios_size=8, priv=0x7f9805340a80, retries=0x7f9805340588)
    at /data/ceph/src/os/bluestore/io_uring.cc:340
#11 0x00005589efe9c597 in KernelDevice::aio_submit (this=0x5589fd04c400, ioc=0x7f9805340a80)
    at /data/ceph/src/os/bluestore/KernelDevice.cc:760
#12 0x00005589efcc806e in BlueStore::_do_read (this=0x5589fd082000, c=0x5589fcf42880, o=..., offset=0,
    length=4194304, bl=..., op_flags=32, retry_count=0) at /data/ceph/src/os/bluestore/BlueStore.cc:8696
#13 0x00005589efcc50dd in BlueStore::read (this=0x5589fd082000, c_=..., oid=..., offset=0, length=4194304, bl=...,
    op_flags=32) at /data/ceph/src/os/bluestore/BlueStore.cc:8398
#14 0x00005589efadfe5b in ReplicatedBackend::build_push_op (this=0x5589fdbc2000, recovery_info=..., progress=...,
    out_progress=0x7f9805340f10, out_op=0x5589fec30000, stat=0x0, cache_dont_need=true)
    at /data/ceph/src/osd/ReplicatedBackend.cc:1979
#15 0x00005589efae1bf7 in ReplicatedBackend::handle_pull (this=0x5589fdbc2000, peer=..., op=...,
    reply=0x5589fec30000) at /data/ceph/src/osd/ReplicatedBackend.cc:2143
#16 0x00005589efad4acc in ReplicatedBackend::do_pull (this=0x5589fdbc2000, op=...)
    at /data/ceph/src/osd/ReplicatedBackend.cc:866
#17 0x00005589efacf4ed in ReplicatedBackend::_handle_message (this=0x5589fdbc2000, op=...)
    at /data/ceph/src/osd/ReplicatedBackend.cc:196
#18 0x00005589ef907732 in PGBackend::handle_message (this=0x5589fdbc2000, op=...)
    at /data/ceph/src/osd/PGBackend.cc:114
#19 0x00005589ef79d7c5 in PrimaryLogPG::do_request (this=0x5589fdbb7400, op=..., handle=...)
    at /data/ceph/src/osd/PrimaryLogPG.cc:1848
#20 0x00005589ef4f0e0b in OSD::dequeue_op (this=0x5589fcf7c000, pg=..., op=..., handle=...)
    at /data/ceph/src/osd/OSD.cc:9763
#21 0x00005589efa49724 in PGOpItem::run (this=0x5589fe1be6f0, osd=0x5589fcf7c000, sdata=0x5589fc17c700, pg=...,
    handle=...) at /data/ceph/src/osd/OpQueueItem.cc:24
#22 0x00005589ef51f6ed in OpQueueItem::run (this=0x7f9805342380, osd=0x5589fcf7c000, sdata=0x5589fc17c700, pg=...,
    handle=...) at /data/ceph/src/osd/OpQueueItem.h:134
#23 0x00005589ef4fe45d in OSD::ShardedOpWQ::_process (this=0x5589fcf7d2e8, thread_index=0, hb=0x5589fe1b50c0)
    at /data/ceph/src/osd/OSD.cc:10947
#24 0x00005589eff1d230 in ShardedThreadPool::shardedthreadpool_worker (this=0x5589fcf7caf8, thread_index=0)
    at /data/ceph/src/common/WorkQueue.cc:311
#25 0x00005589eff1ec49 in ShardedThreadPool::WorkThreadSharded::entry (this=0x5589fdf61530)
    at /data/ceph/src/common/WorkQueue.h:704
#26 0x00005589eff09634 in Thread::entry_wrapper (this=0x5589fdf61530) at /data/ceph/src/common/Thread.cc:84
#27 0x00005589eff095b2 in Thread::_entry_func (arg=0x5589fdf61530) at /data/ceph/src/common/Thread.cc:71
#28 0x00007f982145e6db in start_thread (arg=0x7f9805345700) at pthread_create.c:463
#29 0x00007f98201fe88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

For second issue, yes we are experiencing IO hang in testing.

By the way, I also did some code examination and was also wondering where IO_CMD_PREAD comes from.

@markhpc
Member

markhpc commented May 8, 2019

FWIW I'm working on another PR to convert KernelDevice/ceph_aio to use preadv. It's a bit broken now with random reads but works for sequential. Will work on it more tomorrow.

@rouming
Author

rouming commented May 8, 2019

FWIW I'm working on another PR to convert KernelDevice/ceph_aio to use preadv. It's a bit broken now with random reads but works for sequential. Will work on it more tomorrow.

As a quick fix for io_uring, we can pretend we have an iov for the read path as well:

diff --git a/src/os/bluestore/ceph_aio.h b/src/os/bluestore/ceph_aio.h
index 24fd72e359be..d8e67668cf6f 100644
--- a/src/os/bluestore/ceph_aio.h
+++ b/src/os/bluestore/ceph_aio.h
@@ -63,8 +63,10 @@ struct aio_t {
     offset = _offset;
     length = len;
     bufferptr p = buffer::create_small_page_aligned(length);
+    bl.append(std::move(p));
 #if defined(HAVE_LIBAIO)
-    io_prep_pread(&iocb, fd, p.c_str(), length, offset);
+    bl.prepare_iov(&iov);
+    io_prep_preadv(&iocb, fd, &iov[0], 1, offset);
 #elif defined(HAVE_POSIXAIO)
     n_aiocb = 1;
     aio.aiocb.aio_fildes = fd;
@@ -72,7 +74,6 @@ struct aio_t {
     aio.aiocb.aio_nbytes = length;
     aio.aiocb.aio_offset = offset;
 #endif
-    bl.append(std::move(p));
   }
 

@markhpc, what do you think?

@rouming rouming force-pushed the bluestore-iouring branch from 4c89f13 to 43d028d Compare May 8, 2019 10:01
@rouming
Author

rouming commented May 8, 2019

I updated the current PR with minor changes in pread() in order to pass an iov for both the read and write paths.
Strangely enough, I can't test the read path in the fio_ceph_objectstore fio engine: even if I do only reads, io_uring does not receive them at all. So it will take time to run this in a real environment.

@markhpc
Member

markhpc commented May 8, 2019

I'm starting to wonder a bit if some of the KernelDevice and ceph_aio code could use a bigger refactor. FWIW, I tested the io_uring read path with my preadv change last night and the OSD was hanging again. Will need to dig in more to figure that out.

@markhpc
Member

markhpc commented May 8, 2019

FWIW I'm working on another PR to convert KernelDevice/ceph_aio to use preadv. It's a bit broken now with random reads but works for sequential. Will work on it more tomorrow.

As a quick fix for io_uring, we can pretend we have an iov for the read path as well:

diff --git a/src/os/bluestore/ceph_aio.h b/src/os/bluestore/ceph_aio.h
index 24fd72e359be..d8e67668cf6f 100644
--- a/src/os/bluestore/ceph_aio.h
+++ b/src/os/bluestore/ceph_aio.h
@@ -63,8 +63,10 @@ struct aio_t {
     offset = _offset;
     length = len;
     bufferptr p = buffer::create_small_page_aligned(length);
+    bl.append(std::move(p));
 #if defined(HAVE_LIBAIO)
-    io_prep_pread(&iocb, fd, p.c_str(), length, offset);
+    bl.prepare_iov(&iov);
+    io_prep_preadv(&iocb, fd, &iov[0], 1, offset);
 #elif defined(HAVE_POSIXAIO)
     n_aiocb = 1;
     aio.aiocb.aio_fildes = fd;
@@ -72,7 +74,6 @@ struct aio_t {
     aio.aiocb.aio_nbytes = length;
     aio.aiocb.aio_offset = offset;
 #endif
-    bl.append(std::move(p));
   }
 

@markhpc, what do you think?

That's similar to what I did, though I made it more like pwritev by moving some of that logic out into KernelDevice.

@rouming
Author

rouming commented May 8, 2019

I'm starting to wonder a bit if some of the KernelDevice and ceph_aio code could use a bigger refactor. FWIW, I tested the io_uring read path with my preadv change last night and the OSD was hanging again. Will need to dig in more to figure that out.

Indeed, this part looks very hairy.

Regarding the hang: should the low-level IO engine be thread-safe? Since we access the ring buffer from userspace, and do that as fast as possible, I did not do any locking there and rely on the level above. So the question is: do we need to protect the submit_batch() and get_next_completed() calls?

@markhpc
Member

markhpc commented May 8, 2019

I updated the current PR with minor changes in pread() in order to pass an iov for both the read and write paths.
Strangely enough, I can't test the read path in the fio_ceph_objectstore fio engine: even if I do only reads, io_uring does not receive them at all. So it will take time to run this in a real environment.

I made a separate PR for my changes. We'll want to test this independently of the io_uring work to make sure we don't introduce any regressions in the aio path.

r = ioring_cqring_reap(d, events, max, paio);
if (r) {
events += r;
continue;
Member


Need to incorporate this fix from the fio code:

		if (actual_min != 0)
			actual_min -= r;

axboe/fio@f7cbbbf#diff-f23fe137e0491aa4a3958612529b0334

Author


Ha, what will happen if r > actual_min? Not nice.
It seems it should be actual_min -= min(r, actual_min);
Also, in our case actual_min is always zero.
But yes, the idea is clear. Thanks for pointing that out.

Member


Yep, I agree. Might want to point that out to the fio guys too.

{
struct io_uring_sqe *sqe = &d->sqes[index];

if (io->iocb.aio_lio_opcode == IO_CMD_PWRITEV)
Member


To debug the pread issue I changed this if/else block into a switch statement to handle the other cases:

  switch (io->iocb.aio_lio_opcode) {
  case IO_CMD_PREAD:
    ceph_abort_msg("Don't know how to handle IO_CMD_PREAD with io_uring");
    break;
  case IO_CMD_PWRITE:
    ceph_abort_msg("Don't know how to handle IO_CMD_PWRITE with io_uring");
    break;
  case IO_CMD_FSYNC:
    sqe->opcode = IORING_OP_FSYNC;
    break;
  case IO_CMD_FDSYNC:
    ceph_abort_msg("Don't know how to handle IO_CMD_FDSYNC yet");
    break;
  case IO_CMD_POLL: /* Never implemented in mainline, see io_prep_poll */
    sqe->opcode = IORING_OP_NOP;
    break;
  case IO_CMD_NOOP:
    sqe->opcode = IORING_OP_NOP;
    break;
  case IO_CMD_PREADV:
    sqe->opcode = IORING_OP_READV;
    break;
  case IO_CMD_PWRITEV:
    sqe->opcode = IORING_OP_WRITEV;
    break;
  default:
    ceph_assert(0);
  }

Author

Why not print and crash in all cases other than readv/writev? Have you found places where something else can be issued?

Member

Mostly I was just trying to clean up the non-readv/writev handling using ceph_abort_msg instead of assert, but it looks like we should at least be able to handle noop cleanly. POLL apparently was never implemented, but you are right that we should probably abort there rather than nop. FDSYNC I'm aborting on, and fsync is passing through. If POLL is changed to abort, I think the only remaining question is fsync.

Author

Since nobody sends anything other than write/read, I would always print & abort. If someone introduces other request types, the io_uring engine has to be updated as well. This seems like more explicit behavior, which forces the developer to test the engine if something new is introduced.
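A self-contained sketch of the behavior agreed on here: only the opcodes BlueStore actually issues are translated, and everything else prints and aborts. The enum values below are stand-ins for IO_CMD_* (libaio.h) and IORING_OP_* (the kernel uapi headers), so this is illustrative rather than the real mapping.

```cpp
#include <cstdio>
#include <cstdlib>

// Stand-in enums; the real code uses IO_CMD_* from libaio and IORING_OP_*
// from linux/io_uring.h.
enum AioOpcode { CMD_PREADV, CMD_PWRITEV, CMD_FSYNC, CMD_POLL, CMD_NOOP };
enum UringOpcode { OP_READV, OP_WRITEV };

// Translate only readv/writev; print & abort on anything else so that a
// newly introduced request type fails loudly (ceph_abort_msg() in the real
// code plays the role of the abort here).
UringOpcode to_uring_opcode(AioOpcode op) {
  switch (op) {
  case CMD_PREADV:
    return OP_READV;
  case CMD_PWRITEV:
    return OP_WRITEV;
  default:
    fprintf(stderr, "opcode %d not supported by the io_uring engine\n", op);
    abort();
  }
}
```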

@JeffMoyer

JeffMoyer commented Oct 17, 2019 via email

@rouming
Author

rouming commented Oct 17, 2019

Buffered I/O, perhaps?

Hm, that should not be the case, since the descriptors from the "fd_directs" array (opened with O_DIRECT) are registered in io_uring. But the results are indeed strange; at least rbd.fio does not show a significant difference between the aio and io_uring engines for 4k reads.

@tchaikov
Contributor

@markhpc is anything still pending before removing the DNM label?

@wjwithagen I think you requested changes, but I still don't get what exactly the change was.

.set_description("Enforces specific hw profile settings")
.set_long_description("'hdd' enforces settings intended for BlueStore above a rotational drive. 'ssd' enforces settings intended for BlueStore above a solid drive. 'default' - using settings for the actual hardware."),

Option("bluestore_ioring", Option::TYPE_BOOL, Option::LEVEL_ADVANCED)
Contributor

@markhpc we don't have an experimental flag anymore. But this setting is marked with "Option::LEVEL_ADVANCED" and is disabled by default, so I think that's good enough. Or do you have a specific idea of how to mark it as experimental?

@wjwithagen
Contributor

@wjwithagen I think you requested changes, but I still don't get what exactly the change was.

I'm going to be practical about this: I'll take the FreeBSD fallout from my posix-aio work as it comes, and work from there. So feel free to commit.

@wjwithagen wjwithagen self-requested a review October 23, 2019 09:08
@mykaul

mykaul commented Oct 28, 2019

@tchaikov - I wonder what the CPU usage is, and whether there's any difference per IOP or MB/s.

@markhpc
Member

markhpc commented Oct 31, 2019

@markhpc is anything still pending before removing the DNM label?

@wjwithagen I think you requested changes, but I still don't get what exactly the change was.

Nothing on my end. I think we may want to refactor some day, but that day isn't today! 👍

@tchaikov tchaikov removed the DNM label Oct 31, 2019
@tchaikov
Contributor

@markhpc thanks. I will rebase the change and rerun the perf test using fio.

@rouming
Author

rouming commented Nov 13, 2019

@tchaikov Do you plan to return to this PR?

@tchaikov
Contributor

@rouming definitely yes. I am still debugging my own vstart cluster; I'm stuck on another issue.

@junxfl

junxfl commented Nov 29, 2019

@rouming hello, I tested using a ramdisk (60G; the engine is libaio or io_uring), but random 4k write gets only 1.6k IOPS. My config is:
ceph-bluestore.fio:
rw=randwrite
iodepth=16
bs=4k
time_based=1
runtime=60s
[bluestore]
nr_files=1
numjobs=1
size=256m

ceph-bluestore.conf:
[global]
osd pool default pg num = 16
osd op num shards = 5
[osd]
osd objectstore = bluestore
bluestore ioring = true
bluestore min alloc size = 4096
bluestore max blob size = 65536
osd data = /home/test/src/ceph-build/master/ceph/build/dev/osd1
log file = /home/test/src/ceph-build/master/ceph/build/out/log

Can you give me some advice? Many thanks.

@rouming
Author

rouming commented Nov 29, 2019

@junxfl You probably forgot to specify:

bluestore_block_path=/dev/ram0
bluestore_block_db_path=/dev/ram1
bluestore_block_wal_path=/dev/ram2

Also, in order not to spoil the fio output with Ceph logs, you can disable them:

debug bluestore = 0/0
debug bluefs = 0/0
debug bdev = 0/0
debug rocksdb = 0/0

@junxfl

junxfl commented Nov 29, 2019

@rouming Hi rouming, I tested again with "bluestore_block_path=/dev/ram0" added, but random 4k write is still 1.6k IOPS. My config is:
ceph-bluestore.fio:

[global]
ioengine=libfio_ceph_objectstore.so # must be found in your LD_LIBRARY_PATH
conf=ceph-bluestore.conf1 # must point to a valid ceph configuration file
directory=/home/yangj/src/ceph-build/master/ceph/build/dev/osd1
rw=randwrite
iodepth=128
bs=4k
time_based=1
runtime=60s
[bluestore]
nr_files=1
numjobs=1
size=256m

ceph-bluestore.conf:

[global]
osd pool default pg num = 16
osd op num shards = 5
debug bluestore = 0/0
debug bluefs = 0/0
debug bdev = 0/0
debug rocksdb = 0/0
[osd]
osd objectstore = bluestore
bluestore ioring = true
bluestore min alloc size = 4096
bluestore max blob size = 65536
bluestore_block_path = /dev/ram0
osd data = /home/yangj/src/ceph-build/master/ceph/build/dev/osd1
log file = /home/yangj/src/ceph-build/master/ceph/build/out/log

[screenshots of fio output]

Can you send me your configuration file? Thanks.

@rouming
Author

rouming commented Nov 29, 2019

@junxfl it is not enough to have only the block device on a ramdisk; RocksDB should also be on ram, so:

bluestore_block_path=/dev/ram0
bluestore_block_db_path=/dev/ram1
bluestore_block_wal_path=/dev/ram2
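Combining the settings from this thread, a complete ceph-bluestore.conf sketch for a ramdisk run would look like the following (the device paths are examples and must match your actual ramdisks):

```ini
[global]
osd pool default pg num = 16
osd op num shards = 5
debug bluestore = 0/0
debug bluefs = 0/0
debug bdev = 0/0
debug rocksdb = 0/0

[osd]
osd objectstore = bluestore
bluestore ioring = true
bluestore min alloc size = 4096
bluestore max blob size = 65536
bluestore_block_path = /dev/ram0
bluestore_block_db_path = /dev/ram1
bluestore_block_wal_path = /dev/ram2
```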

@tchaikov
Contributor

tchaikov commented Dec 5, 2019

Roman Penyaev added 2 commits December 10, 2019 17:21
In the next patch the new io_uring API will be used instead of libaio,
so this prepares the abstract interface.

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
This implements a low-level IO engine, which utilizes the brand-new
io_uring IO interface: https://lwn.net/Articles/776428/

By default libaio is used.  If bluestore_ioring=true is set but kernel
does not support io_uring or architecture is not x86-64, libaio will be
used instead.

In the current patch the liburing library is used in order not to
open-code everything.

In order to compile with liburing, WITH_LIBURING=ON should be specified.

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
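The fallback described in the commit message can be sketched as follows. `choose_engine` and its `supported` flag are hypothetical names, not the actual code; the real probe would attempt io_uring setup and check for the x86-64 architecture.

```cpp
// Hypothetical sketch of the engine-selection fallback: io_uring is used
// only when explicitly enabled AND actually available; otherwise BlueStore
// falls back to libaio.
enum class IoEngine { LIBAIO, IO_URING };

// `supported` stands in for the runtime probe (io_uring availability plus
// the architecture check mentioned in the commit message).
IoEngine choose_engine(bool bluestore_ioring, bool supported) {
  if (bluestore_ioring && supported)
    return IoEngine::IO_URING;
  return IoEngine::LIBAIO; // the default engine, and the fallback
}
```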