blk/aio: allocate iocb using small_vector (#35952)
Conversation
There is a chance that the batch size is so large that we get a stack overflow when allocating the piocb array using a VLA. In this change, piocb is allocated using small_vector<>. In most circumstances the default object size is 4M for CephFS, RGW and RBD, so if the client sends 4K / 8K aligned reads / writes, that translates into at most 1024 IO requests per object; using 1024 as the inline size of the small_vector therefore suffices. Also, 1024 is the default of bdev_aio_max_queue_depth, so this inline capacity is able to hold a full io_submit() batch. Fixes: https://tracker.ceph.com/issues/46366 Signed-off-by: Kefu Chai <kchai@redhat.com>
@tchaikov Thanks for the patch. It looks like the tests are failing because of unavailable space in the Jenkins test system; other tests are failing because of this as well.
@wvh-github hi Wout, thanks for looking into these test failures. yeah, i am contacting our lab administrator to fix this. we've been suffering from this for almost two weeks now.
Hey, I don't think this patch will solve the segfaults. I replaced the piocb VLA with a malloc/free combination and the process would still segfault, and occasionally trip the assert after the loop comparing left and aios_size. I suspect the input to submit_batch gets corrupted upstream somewhere: the aios_size values were extremely large (50k+) and the list structure being passed appeared to be invalid.
@RobinGeuze could you enable debug-osd=20 and debug-bluestore=20, paste the log to https://tracker.ceph.com/issues/46366, and answer the query in https://tracker.ceph.com/issues/46366#note-6? i am also interested in the workload.
@tchaikov I want to get our system back to full stability before I start fiddling with debug settings again. We run RBD using 4+2 erasure code with a separate data pool. Maybe to give a bit of context:
At this point we had tried a bunch of different patches. I attached GDB but could get very little useful info. That led me to use malloc for allocating piocb, which allowed me to get the proper stack trace shared in the bug report. Based on that stack trace, a colleague of mine noticed 74be30c8b7c. We then tested a build with that patch reverted, which solved the crashes, and our cluster is now running properly on that patched binary. I only reverted that one patch.
@RobinGeuze thanks for your detailed explanation! so the number of iocbs should have been much less than "118712", which was found in the backtrace you uploaded in the ticket, as a 4M RBD object being recovered should not have that many fragments.
@tchaikov yes I think the list being handed into submit_batch is corrupt, which just caused the loop to go on endlessly. In all dumps I made, aios_size was smaller than left, which means that KernelDevice::aio_submit somehow took more elements from IOContext::pending_aios than the value of IOContext::num_pending, which, as far as I can tell, should not be possible. Hence my theory that the list is somehow corrupted.
@tchaikov the logs of a crashing OSD can be found at:
I'm also wondering if you can confirm the concern of @RobinGeuze regarding this patch in relation to the problems we saw this weekend. |
@wvh-github thank you, Wout! i am still downloading the log files. will start with the smaller one.. not sure if i can finish downloading the larger one in my lifetime though... |
Thanks for the update. I'm not sure the larger one will provide more insights compared to the smaller one.
Hi @tchaikov, I was wondering if you have had a chance to go over the logs. If so, have you been able to confirm the bug @RobinGeuze described?
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
Hi @tchaikov, have you been able to look into Robin's comment?
This PR says it fixes https://tracker.ceph.com/issues/46366. However, given Robin's comment, that doesn't seem to be the case; something else appears to be the root cause of the issue above. @xiexingguo can you comment on this, and mainly on the issue linked at the start of this PR?
xiexingguo left a comment:
@wido I withdraw my approval as the root cause is still not clear to me
This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!