osd: Remove aios_size argument from submit_batch#44065
osd: Remove aios_size argument from submit_batch#44065RobinGeuze wants to merge 1 commit intoceph:mainfrom
Conversation
Due to aios_size being a uint16 and the source value for the actual call being an int there was a possible overflow. This was "fixed" with an assert, however that still causes a crash. This commit removes the need for aios_size completely by iterating over the list and submitting it in max_iodepth batches. Fixes: https://tracker.ceph.com/issues/46366 Signed-off-by: Robin Geuze <robin.geuze@nl.team.blue>
|
@tchaikov could you take a look at this? I think its a proper fix for https://tracker.ceph.com/issues/46366 |
|
We managed to reproduce the crash described in the issue (and confirmed that it still occurs with master) and verified that this patch fixes it succesfully. |
|
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days. |
ljflores
left a comment
There was a problem hiding this comment.
Hey @RobinGeuze, can I ask what you did to verify the fix?
Also, per @sebastian-philipp, it looks like this fix may need a test (perhaps a unit test in test/objectstore/test_bdev.cc would work).
|
Hey @ljflores, We tested this by first reproducing the issue. We set up a minimal ceph cluster (eg 3 OSD's), write some data to it to make sure there is something there. Once that is done we kill one of the OSD's. Then we create an object with a very large number of extends, for example using the following python code: If you then bring the OSD back up it will start recovery and crash once it gets to that object. If this patch is applied it does not crash. I was trying to figure out how to test this properly but I was unable to get teuthology to work locally. Writing a unit test could work though, I might be able to take a look at that later this week. |
|
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days. |
|
This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution! |
Due to aios_size being a uint16 and the source value for the actual
call being an int there was a possible overflow. This was "fixed"
with an assert, however that still causes a crash.
This pull removes the need for aios_size completely by iterating
over the list and submitting it in max_iodepth batches.
Checklist
Show available Jenkins commands
jenkins retest this pleasejenkins test classic perfjenkins test crimson perfjenkins test signedjenkins test make checkjenkins test make check arm64jenkins test submodulesjenkins test dashboardjenkins test dashboard cephadmjenkins test apijenkins test docsjenkins render docsjenkins test ceph-volume alljenkins test ceph-volume tox