osd: Remove aios_size argument from submit_batch #44065

Closed
RobinGeuze wants to merge 1 commit into ceph:main from RobinGeuze:fixSubmitBatch

@RobinGeuze

Because aios_size is a uint16 while the value passed at the call site is
an int, the argument could overflow. This was "fixed" with an assert, but
a failing assert still causes a crash.

This pull request removes the need for aios_size completely by iterating
over the list and submitting it in batches of max_iodepth.
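
To make the two ideas in the description concrete, here is a minimal Python sketch. It is not the Ceph code: `to_uint16` and `submit_in_batches` are hypothetical names introduced only for illustration, and `submit` stands in for whatever call actually queues the I/O. The first helper shows how a count larger than 65535 is silently changed when narrowed to 16 bits; the second shows the batching approach, which never needs to carry the total count in a narrow type.

```python
import ctypes


def to_uint16(n):
    """Mimic storing an int in a 16-bit unsigned variable (value wraps modulo 65536)."""
    return ctypes.c_uint16(n).value


def submit_in_batches(aios, max_iodepth, submit):
    """Hypothetical helper: pass `aios` to `submit` in chunks of at most max_iodepth.

    Returns the total number of items handed to `submit`.
    """
    if max_iodepth <= 0:
        raise ValueError("max_iodepth must be positive")
    submitted = 0
    for start in range(0, len(aios), max_iodepth):
        batch = aios[start:start + max_iodepth]
        submit(batch)
        submitted += len(batch)
    return submitted
```

For example, `to_uint16(70000)` yields 4464 rather than 70000, while `submit_in_batches(list(range(10)), 4, print)` hands over three batches of sizes 4, 4 and 2.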

Fixes: https://tracker.ceph.com/issues/46366
Signed-off-by: Robin Geuze <robin.geuze@nl.team.blue>
@github-actions github-actions bot added the core label Nov 23, 2021
@RobinGeuze
Author

@tchaikov could you take a look at this? I think it's a proper fix for https://tracker.ceph.com/issues/46366

@RobinGeuze
Author

We managed to reproduce the crash described in the issue (and confirmed that it still occurs with master) and verified that this patch fixes it successfully.

@djgalloway djgalloway changed the base branch from master to main June 2, 2022 21:25
@djgalloway djgalloway requested a review from a team as a code owner June 2, 2022 21:25
@github-actions

github-actions bot commented Aug 1, 2022

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Aug 1, 2022
Member

@ljflores ljflores left a comment

Hey @RobinGeuze, can I ask what you did to verify the fix?

Also, per @sebastian-philipp, it looks like this fix may need a test (perhaps a unit test in test/objectstore/test_bdev.cc would work).

@ljflores ljflores requested review from ifed01 and rzarzynski August 1, 2022 22:25
@github-actions github-actions bot removed the stale label Aug 1, 2022
@RobinGeuze
Author

Hey @ljflores,

We tested this by first reproducing the issue. We set up a minimal Ceph cluster (e.g. 3 OSDs) and wrote some data to it so that it was not empty.

Once that was done, we killed one of the OSDs. Then we created an object with a very large number of extents, for example using the following Python code:

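import rados

# Connect to the cluster using the default configuration file.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# The pool 'test' must already exist.
ioctx = cluster.open_ioctx('test')

# Write a single byte at every second offset so that the object
# ends up with a very large number of extents.
for i in range(0, 1024 * 1024, 2):
    print(i)
    ioctx.write('beep', b'a', i)

print(ioctx.read('beep'))

ioctx.close()
cluster.shutdown()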

If you then bring the OSD back up, it will start recovery and crash once it gets to that object. With this patch applied, it does not crash.
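
As a rough sanity check on why this reproducer is large enough, the arithmetic can be worked through in Python. This is only a sketch under an assumption the thread implies but does not spell out: that each of the isolated one-byte writes ends up as its own extent, and that recovery then issues roughly one I/O per extent. `UINT16_MAX` and the variable names are illustrative.

```python
UINT16_MAX = 0xFFFF  # largest value a uint16 can hold (65535)

# Same offsets as the reproducer loop: every second byte of the first 1 MiB.
offsets = range(0, 1024 * 1024, 2)
num_writes = len(offsets)

# Assumption: one extent (and one I/O during recovery) per write.
exceeds_uint16 = num_writes > UINT16_MAX

# What an unchecked narrowing to 16 bits would leave behind.
wrapped = num_writes & UINT16_MAX

print(num_writes, exceeds_uint16, wrapped)
```

Under that assumption the loop performs 524288 writes, eight times more than a uint16 can represent, so a 16-bit count either trips the assert or wraps around.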

I was trying to figure out how to test this properly, but I was unable to get teuthology to work locally. Writing a unit test could work, though; I might be able to take a look at that later this week.

@github-actions github-actions bot added the stale label Dec 28, 2022
@github-actions

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!
