Skip to content

common: use close_range on Linux#50622

Merged
yuriw merged 1 commit intoceph:mainfrom
cofractal:close-range
Mar 21, 2024
Merged

common: use close_range on Linux#50622
yuriw merged 1 commit intoceph:mainfrom
cofractal:close-range

Conversation

@edef1c
Copy link
Contributor

@edef1c edef1c commented Mar 22, 2023

Fix rook/rook#10110, which occurs when _SC_OPEN_MAX/RLIMIT_NOFILE is set to very large values (2^30), leaving fork_function pegging a core busylooping.

The glibc wrappers closefrom(3)/close_range(3) are not available before glibc 2.34, so we invoke the syscall directly. When glibc 2.34 is old enough to be a reasonable hard minimum dependency, we should switch to using closefrom.

If we're not running on (recent enough) Linux, we fall back to the existing approach.

Checklist

  • Tracker (select at least one)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@travisn travisn requested a review from a team April 21, 2023 07:48
@salfers
Copy link

salfers commented Jul 14, 2023

We just tried to boostrap a new Ceph cluster in our company via cephadm using Ceph Quincy, EL9 and Docker and this broke our cluster every time we tried the process from scratch.
If not prioritizing this bug fix the Ceph team should at least edit the documentation to state clearly that EL9 + Docker do not work with cephadm, to avoid other people wasting their time with it.

ping @travisn

@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Sep 17, 2023
@edef1c
Copy link
Contributor Author

edef1c commented Oct 3, 2023

I believe this PR is still relevant, but I'm not sure what I can do to move it along.

@idryomov idryomov removed the stale label Oct 3, 2023
@idryomov
Copy link
Contributor

idryomov commented Oct 3, 2023

@rzarzynski @ljflores Looks like this is waiting for a review from Core.

@rzarzynski rzarzynski self-requested a review October 3, 2023 10:28
@github-actions
Copy link

github-actions bot commented Dec 2, 2023

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Dec 2, 2023
@heliochronix
Copy link

Is there anything blocking this getting merged?

@github-actions github-actions bot removed the stale label Dec 2, 2023
@rzarzynski
Copy link
Contributor

All we need is a QA run.

@ljflores, @yuriw: could you please take this into testing?

@ljflores
Copy link
Member

ljflores commented Jan 3, 2024

@rzarzynski it's in a batch!

@yuriw
Copy link
Contributor

yuriw commented Jan 4, 2024

@edef1c this a suspect to have broken the build:

https://jenkins.ceph.com/job/ceph-dev-new-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=focal,DIST=focal,MACHINE_SIZE=gigantic/75440//consoleFull

19.0.0-466-g15215d01/src/common/SubProcess.cc
[ 60%] Building CXX object src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_cast_nested.cc.o
/build/ceph-19.0.0-466-g15215d01/src/common/SubProcess.cc:9:10: fatal error: linux/close_range.h: No such file or directory
9 | #include <linux/close_range.h>
| ^~~~~~~~~~~~~~~~~~~~~
compilation terminated.
make[3]: *** [src/common/CMakeFiles/common-common-objs.dir/build.make:1263: src/common/CMakeFiles/common-common-objs.dir/SubProcess.cc.o] Error 1
make[3]: *** Waiting for unfinished jobs....

@ljflores fyi

@ljflores
Copy link
Member

ljflores commented Jan 4, 2024

@yuriw yeah, it is breaking focal.

@edef1c this compile error seems to happen in ubuntu focal if you need to reproduce it.

@edef1c
Copy link
Contributor Author

edef1c commented Jan 17, 2024

@edef1c this compile error seems to happen in ubuntu focal if you need to reproduce it.

Hmm, okay. Seems like focal's linux-libc-dev package doesn't include close_range.h. We don't actually need the header (it just supplies the CLOSE_RANGE_{UNSHARE,CLOEXEC} constants), so we can just leave it out.

@edef1c
Copy link
Contributor Author

edef1c commented Jan 17, 2024

I've dropped the header include. I think things should be fine now unless SYS_close_range isn't available somewhere, in which case we can just #ifdef a little more.

Fix rook/rook#10110, which occurs when _SC_OPEN_MAX/RLIMIT_NOFILE is
set to very large values (2^30), leaving fork_function pegging a core
busylooping.

The glibc wrappers closefrom(3)/close_range(3) are not available before
glibc 2.34, so we invoke the syscall directly. When glibc 2.34 is old
enough to be a reasonable hard minimum dependency, we should switch to
using closefrom.

If we're not running on (recent enough) Linux, we fall back to the
existing approach.

Fixes: https://tracker.ceph.com/issues/59125
Signed-off-by: edef <edef@edef.eu>
@edef1c
Copy link
Contributor Author

edef1c commented Feb 5, 2024

I ended up taking a closer look, and found that neither SYS_close_range nor __NR_close_range are available on focal. I've amended the change to just check if SYS_close_range is defined, so nothing should change on systems where it's not available from the headers.

@ljflores @yuriw Could we get another run? It should be fully baked wrt backward compat at this point, as far as I can tell.

@edef1c
Copy link
Contributor Author

edef1c commented Feb 15, 2024

The failed API tests seem to go to a Jenkins 404 page, so I'm not sure what's up there.

@tchaikov
Copy link
Contributor

jenkins test api

@heliochronix
Copy link

It seems like this is done and approved. Is this waiting for some milestone to merge?

@rzarzynski
Copy link
Contributor

We need rerun the QA per #50622 (comment).

Ping @yuriw, @ljflores.

@amathuria
Copy link
Contributor

@yuriw yuriw merged commit d61ec16 into ceph:main Mar 21, 2024
@edef1c edef1c deleted the close-range branch April 16, 2024 17:32
@prazumovsky
Copy link

See the same on Reef v18.2.4, definitely required backport to reef

@idryomov
Copy link
Contributor

idryomov commented Feb 3, 2025

See the same on Reef v18.2.4, definitely required backport to reef

Hrm, this hasn't been backported to squid either. @prazumovsky Can you create a squid backport as well?

@idryomov
Copy link
Contributor

idryomov commented Feb 3, 2025

@prazumovsky Can you create a squid backport as well?

It would go under https://tracker.ceph.com/issues/69778.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ceph monitors in crash loop

10 participants