
blk/aio: allocate iocb using small_vector #35952

Closed
tchaikov wants to merge 1 commit into ceph:main from tchaikov:wip-46366

Conversation

@tchaikov
Contributor

@tchaikov tchaikov commented Jul 7, 2020

There is a chance that the batch size is so large that allocating the
piocb array as a VLA overflows the stack. With this change, piocb is
allocated using small_vector<>. In most circumstances the default object
size for CephFS, RGW and RBD is 4M, so a client sending 4K / 8K aligned
reads / writes translates into at most 1024 IO requests per object;
an inline size of 1024 for the small_vector therefore suffices.

Also, 1024 is the default of bdev_aio_max_queue_depth, so this inline
capacity is able to feed a full io_submit() batch.

Fixes: https://tracker.ceph.com/issues/46366
Signed-off-by: Kefu Chai <kchai@redhat.com>

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug


Member

@xiexingguo xiexingguo left a comment


lgtm

@tchaikov
Contributor Author

tchaikov commented Jul 7, 2020

changelog

  • fixed the syntax error in the comment.

@wvh-github

@tchaikov Thanks for the patch.

It looks like the tests are failing because of insufficient disk space on the Jenkins test system. Other tests are also failing because of this:

/home/jenkins-build/build/workspace/ceph-dashboard-pr-backend/build/bin/ceph-mon -i a -c /home/jenkins-build/build/workspace/ceph-dashboard-pr-backend/build/ceph.conf
2020-07-07T06:31:40.457+0000 7fbc9729f700 -1 error: monitor data filesystem reached concerning levels of available storage space (available: 3% 14 GiB) you may adjust 'mon data avail crit' to a lower value to make this go away (default: 5%)

@tchaikov
Contributor Author

tchaikov commented Jul 7, 2020

@wvh-github Hi Wout, thanks for looking into these test failures. Yeah, I am contacting our lab administrator to fix this; we've been suffering from it for almost two weeks.

@RobinGeuze

RobinGeuze commented Jul 7, 2020

Hey, I don't think this patch will solve the segfaults. I replaced the piocb VLA with a malloc/free combination and the process would still segfault, and occasionally trip the assert after the loop comparing left and aios_size.

I suspect the input to submit_batch gets corrupted somewhere upstream: the aios_size values were extremely big (50k+) and the list structure being passed appeared to be invalid.

@tchaikov tchaikov added the DNM label Jul 7, 2020
@tchaikov
Contributor Author

tchaikov commented Jul 7, 2020

@RobinGeuze could you enable debug-osd=20 and debug-bluestore=20, paste the log to https://tracker.ceph.com/issues/46366, and answer the query in https://tracker.ceph.com/issues/46366#note-6? I am also interested in the workload.

@RobinGeuze

@tchaikov I want to get our system back to full stability before I start fiddling with debug settings again. We run RBD with 4+2 erasure coding and a separate data pool. To give a bit of context:

  • We had a quite catastrophic cluster failure, caused by improperly following the upgrade guide for Nautilus (we found this out eventually).
  • In the process of trying to fix this we upgraded to Octopus.
  • Once we finally had all OSDs back online after fixing the Nautilus upgrade issue we re-enabled recovery. At this point the cluster was paused.
  • About 10-15 OSDs started crashing with the segfault this bug report is about.
  • We took these OSDs out and manually moved their inactive PGs to other OSDs. This seemed to fix it for a while.
  • After about 4-6 hours, though, other OSDs started crashing.

At this point we tried a bunch of different patches. I attached GDB but could get very little useful info. This led me to use malloc for allocating piocb, which allowed me to get the proper stacktrace shared in the bug report. Based on that stack trace a colleague of mine eventually noted 74be30c8b7c. We then tested a build with that patch reverted, which stopped the crashes; our cluster is now running properly on that patched binary. I reverted only that one patch.

@tchaikov
Contributor Author

tchaikov commented Jul 7, 2020

@RobinGeuze thanks for your detailed explanation!

So the number of iocbs should have been much smaller than the 118712 found in the backtrace you uploaded to the ticket, as a 4M RBD object being recovered should not have so many fragments.

@RobinGeuze

RobinGeuze commented Jul 7, 2020

@tchaikov Yes, I think the list being handed into submit_batch is corrupt, which caused the loop to go on endlessly. In all the dumps I made, aios_size was smaller than left, which means that KernelDevice::aio_submit somehow took more elements from IOContext::pending_aios than the value of IOContext::num_pending, which, as far as I can tell, should not be possible. Hence my theory that the list is somehow corrupted.

@wvh-github

@tchaikov the logs of a crashing OSD can be found at:

I'm also wondering if you can confirm the concern of @RobinGeuze regarding this patch in relation to the problems we saw this weekend.

@tchaikov
Contributor Author

tchaikov commented Jul 9, 2020

@wvh-github Thank you, Wout! I am still downloading the log files and will start with the smaller one. Not sure I can finish downloading the larger one in my lifetime though...

@wvh-github

wvh-github commented Jul 9, 2020

> @wvh-github Thank you, Wout! I am still downloading the log files and will start with the smaller one. Not sure I can finish downloading the larger one in my lifetime though...

Thanks for the update, I'm not sure the larger one will provide more insights compared to the smaller one.
Is there some other place I can upload it so you can grab it faster?

@wvh-github

> @wvh-github Thank you, Wout! I am still downloading the log files and will start with the smaller one. Not sure I can finish downloading the larger one in my lifetime though...

Hi @tchaikov ,

I was wondering if you have had a chance to go over the logs. If so, have you been able to confirm the bug @RobinGeuze has described?

@stale

stale bot commented Sep 27, 2020

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@stale stale bot added the stale label Sep 27, 2020
@wvh-github

Hi @tchaikov

Have you been able to look into Robin's comment?

> @tchaikov Yes, I think the list being handed into submit_batch is corrupt, which caused the loop to go on endlessly. In all the dumps I made, aios_size was smaller than left, which means that KernelDevice::aio_submit somehow took more elements from IOContext::pending_aios than the value of IOContext::num_pending, which, as far as I can tell, should not be possible. Hence my theory that the list is somehow corrupted.

@stale stale bot removed the stale label Dec 16, 2020
@wido
Member

wido commented Dec 21, 2020

This PR says it fixes: https://tracker.ceph.com/issues/46366

However, given Robin's comment, that doesn't seem to be the case; something else appears to be the root cause of the issue above.

@xiexingguo can you comment on this and mainly on the issue linked in the start of this PR?

Member

@xiexingguo xiexingguo left a comment


@wido I withdraw my approval as the root cause is still not clear to me

@xiexingguo xiexingguo dismissed their stale review December 22, 2020 02:32

I withdraw my approval as the root cause is still not clear to me


@stale stale bot added the stale label Jul 21, 2021
@djgalloway djgalloway changed the base branch from master to main July 9, 2022 00:00
@djgalloway djgalloway requested a review from a team as a code owner July 9, 2022 00:00
@github-actions github-actions bot removed the stale label Jul 15, 2022

@github-actions github-actions bot added the stale label Sep 13, 2022
@github-actions

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Oct 28, 2022
@tchaikov tchaikov deleted the wip-46366 branch March 21, 2024 00:34