Skip to content

common: use boost::shared_mutex on Windows#46989

Merged
tchaikov merged 1 commit intoceph:mainfrom
petrutlucian94:boost_shared_mutex
Aug 6, 2022
Merged

common: use boost::shared_mutex on Windows#46989
tchaikov merged 1 commit intoceph:mainfrom
petrutlucian94:boost_shared_mutex

Conversation

@petrutlucian94
Copy link
Contributor

@petrutlucian94 petrutlucian94 commented Jul 6, 2022

common: use boost::shared_mutex on Windows

The winpthreads shared mutex implementation causes deadlocks on
Windows [1][2]. Specifically, async RBD IO calls are hanging. This
also prevents the images from being unmounted.

For this reason, we're switching to boost::shared_mutex when using
MinGW.

[1] cloudbase/wnbd#63 (comment)
[2] msys2/MINGW-packages#3319
Trace: https://pastebin.com/raw/i3jpTyS3

Fixes: https://tracker.ceph.com/issues/56480
Signed-off-by: Lucian Petrut lpetrut@cloudbasesolutions.com

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@petrutlucian94 petrutlucian94 marked this pull request as draft July 6, 2022 10:40
@petrutlucian94
Copy link
Contributor Author

petrutlucian94 commented Jul 6, 2022

We used this simple test to reproduce this issue and validate the fix: https://github.com/ceph/ceph/blob/main/qa/workunits/windows/test_rbd_wnbd.py

Preliminary test results look fine: we're no longer experiencing deadlocks after switching to boost::shared_mutex. We'll run some more perf tests to ensure that the performance is not affected by this patch before removing the "draft" label.

@petrutlucian94 petrutlucian94 added the win32 Specifix changes for the windows platform label Jul 8, 2022
@ionutbalutoiu
Copy link

I tested this patch on top of Quincy (with more than 15k RBD volumes tested), and the deadlocks never reproduced. So, I can confirm that the fix is working.

Moreover, I tried to benchmark the performance impact of this change, and here are the results:

imgpsh_fullsize_anim

There are 5 consecutive runs of 1k RBD volumes. The results seems to be the same (excepting a small difference, which is not likely due to this patch). So, I would say that there is no performance impact with this fix.

@petrutlucian94 petrutlucian94 marked this pull request as ready for review July 14, 2022 10:15
@djgalloway djgalloway requested a review from idryomov July 14, 2022 13:37
@idryomov idryomov requested a review from tchaikov July 14, 2022 19:26
The winpthreads shared mutex implementation causes deadlocks on
Windows [1][2]. Specifically, async RBD IO calls are hanging. This
also prevents the images from being unmounted.

For this reason, we're switching to boost::shared_mutex when using
MinGW.

[1] cloudbase/wnbd#63 (comment)
[2] msys2/MINGW-packages#3319
Trace: https://pastebin.com/raw/i3jpTyS3

Fixes: https://tracker.ceph.com/issues/56480
Signed-off-by: Lucian Petrut <lpetrut@cloudbasesolutions.com>
@petrutlucian94
Copy link
Contributor Author

jenkins test windows

@idryomov
Copy link
Contributor

jenkins test make check

@petrutlucian94
Copy link
Contributor Author

petrutlucian94 commented Jul 27, 2022

jenkins test windows

The windows job failed with a different error now, a test timeout. It's probably just a flaky test:

[2022-07-27T12:46:51.000Z] [googletest] unittest_fault_injector.exe failed. Error: Command timed out (300s): "cmd /c 'C:\ceph\unittest_fault_injector.exe --

@idryomov
Copy link
Contributor

jenkins test windows

@petrutlucian94
Copy link
Contributor Author

jenkins test make check arm64

1 similar comment
@petrutlucian94
Copy link
Contributor Author

jenkins test make check arm64

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

build/ops common needs-qa win32 Specifix changes for the windows platform

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants