blk/aio: allocate iocb using small_vector (#35952)
Conversation
There is a chance that the batch size is so large that we get a stack overflow when allocating the piocb array using a VLA. In this change, piocb is allocated using small_vector<>. In most circumstances the default object size is 4M for CephFS, RGW and RBD, so if the client sends 4K / 8K aligned reads / writes, that translates into at most 1024 IO requests per object; using 1024 as the inline size of the small_vector therefore suffices. Also, 1024 is the default of bdev_aio_max_queue_depth, so this inline capacity is able to hold a full io_submit() batch. Fixes: https://tracker.ceph.com/issues/46366 Signed-off-by: Kefu Chai <kchai@redhat.com>
@tchaikov Thanks for the patch. It looks like the tests are failing because of unavailable space in the Jenkins test system; other tests are failing because of this as well.
@wvh-github hi Wout, thanks for looking into these test failures. yeah, i am contacting our lab administrator to fix this. we've been suffering from this for almost two weeks now.
Hey, I don't think this patch will solve the segfaults. I replaced the piocb VLA with a malloc/free combination and the process would still segfault, and occasionally trip the assert after the loop comparing left and aios_size. I suspect the input to submit_batch gets corrupted upstream somewhere: the aios_size values were extremely large (50k+) and the list structure being passed appeared to be invalid.
@RobinGeuze could you enable debug-osd=20 and debug-bluestore=20, paste the log to https://tracker.ceph.com/issues/46366, and answer the query in https://tracker.ceph.com/issues/46366#note-6? i am also interested in the workload.
@tchaikov I want to get our system back to full stability before I start fiddling with debug settings again. We run RBD using 4+2 erasure code with a separate data pool. Maybe to give a bit of context:
At this point we had tried a bunch of different patches. I attached GDB but could get very little useful info. That led me to use malloc for allocating piocb, which allowed me to get the proper stack trace shared in the bug report. Based on that stack trace, a colleague of mine noticed 74be30c8b7c. We then tested a build with that patch reverted, which solved the crashes, and our cluster is now running properly on that patched binary. I only reverted that one patch.
@RobinGeuze thanks for your detailed explanation! so the number of iocbs should have been much less than "118712", which was found in the backtrace you uploaded in the ticket, as a 4M RBD object being recovered should not have that many fragments.
@tchaikov yes I think the list being handed into submit_batch is corrupt, which just caused the loop to go on endlessly. In all dumps I made, aios_size was smaller than left, which means that KernelDevice::aio_submit somehow took more elements from IOContext::pending_aios than the value of IOContext::num_pending, which, as far as I can tell, should not be possible. Hence my theory that the list is somehow corrupted.
@tchaikov the logs of a crashing OSD can be found at:
I'm also wondering if you can confirm the concern of @RobinGeuze regarding this patch in relation to the problems we saw this weekend. |
@wvh-github thank you, Wout! i am still downloading the log files. will start with the smaller one.. not sure if i can finish downloading the larger one in my lifetime though... |
Thanks for the update. I'm not sure the larger one will provide more insights compared to the smaller one.
Hi @tchaikov, I was wondering if you have had a chance to go over the logs. If so, have you been able to confirm the bug @RobinGeuze described?
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
Hi @tchaikov, have you been able to look into Robin's comment?
This PR says it fixes https://tracker.ceph.com/issues/46366. However, given Robin's comment, that doesn't seem to be the case; something else appears to be the root cause of the issue above. @xiexingguo can you comment on this, and mainly on the issue linked at the start of this PR?
xiexingguo left a comment:
@wido I withdraw my approval as the root cause is still not clear to me
This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!