Preserve O_CLOEXEC when converting Unix.file_descr to a CRT fd on Windows#13921
Merged
dra27 merged 3 commits intoocaml:trunkfrom Apr 24, 2025
Merged
Preserve O_CLOEXEC when converting Unix.file_descr to a CRT fd on Windows#13921dra27 merged 3 commits intoocaml:trunkfrom
O_CLOEXEC when converting Unix.file_descr to a CRT fd on Windows#13921dra27 merged 3 commits intoocaml:trunkfrom
Conversation
On slower machines, it's possible for the main part of the test (in fdstatus_main.ml) to complete before the cloexec.ml has actually terminated as part of `Unix.execv` (this is an artefact of how the underlying CRT `execv` call is implemented on Windows). Strictly on native Windows, cloexec.ml creates an additional dummy file and takes a write lock on it (with `Unix.lockf`). Before running, if this file exists, fdstatus_main.ml will also try to take a write lock on it, providing a way of fdstatus_main.ml being sure that its ancestor has fully terminated without actually having to know its PID.
Particularly on slow VMs, on Windows it's possible to end up seeing a "permission denied" error trying to delete tmp.txt at the end of the test. The issue is that Windows Defender (or other virus scanners) are backlogged during a testsuite run scanning executables which are being compiled. When tmp.txt is closed, it gets added to Windows Defender's scan queue, but it can take long enough that it hasn't actually closed the file by the time fdstatus_main.ml tries to delete it. The solution here is instead to open tmp.txt with O_SHARE_DELETE and then use an additional C stub to open the file with FILE_FLAG_DELETE_ON_CLOSE. This way, when fdstatus_main.ml terminates, the file is automatically by Windows "when it's ready". ocamltest merrily waits for the entire process group to terminate, as it usually does.
Member
Author
|
See also precheck#1038. If any one of these Windows tests fails this time, I may actually sob at my desk...... |
Contributor
|
Incidentally, 1. looks very similar to ocaml/dune#11437. |
Member
Author
|
There may be something else interesting to dig into "at some point": mingw64 on AppVeyor: msvc64 on precheck: mingw64 on precheck: msvc32 on precheck: mingw32 on precheck: |
Contributor
|
@dra27 Feel free to merge if you are happy with the current state of the PR. |
On Windows, the Unix library performs most operations on files using the
Windows API directly, but the Unix.file_descr structure has the option
for using an underlying CRT fd. In particular, these are needed for the
{in,out}_channel_of_descr functions.
For some reason, the underlying _open_osfhandle call doesn't check for
inheritability of the OS handle itself (it's slightly odd, because it
does check other properties of it). By doing this ourselves, the
resulting CRT fd now correctly propagates O_CLOEXEC.
Member
Author
|
Thanks, @nojb - sorry for taking quite so long to push the revised |
dra27
added a commit
that referenced
this pull request
Apr 24, 2025
Preserve `O_CLOEXEC` when converting `Unix.file_descr` to a CRT fd on Windows (cherry picked from commit 23852cc)
Merged
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Follow-on from #13879, which has been sporadically failing in Jenkins.
There are two kinds of failure which have been showing up:
Sys_errorexception reporting"tmp.txt: Permission denied"I've investigated this further (in large part, because one of the bug-fixes in Relocatable OCaml made both these issues much more reproducible than they were before) and have determined:
tmp.txtare closed by L6 in:ocaml/testsuite/tests/lib-unix/common/fdstatus_main.ml
Lines 3 to 11 in 76ec239
but by the time we reach the
Sys.removeon L11, Windows Defender has opened the file ready for "scanning", but not actually done it. As a result, theSys.removefails.Unix.execvin Restore lib-unix/common/cloexec.ml test on Windows #13879, we still end up seeing the same issue that caused the test to be disabled previously because of some missing code incaml_win32_CRT_fd_of_filedescr. The various mechanisms tested incloexec.mlall correctly useSetHandleInformationorSECURITY_ATTRIBUTES, but whencaml_win32_CRT_fd_of_filedescrconverts them to CRT fds, we don't pass theO_NOINHERITflag for close-on-exec handles. As a result all of the handles will be passed to the exec'd process - but the ones which are meant to be close-on-exec will just be invalid. However, what we are then sporadically seeing is that the Win32 handle value has been reused, so the CRT fd appears to be open (but it's in fact a different handle). This is what was being seen before when we were passing the Win32 handle values directly 🤦Finally, while debugging this, I noted and have worked around an additional quirk of the way the
execfamily are implemented on Windows. Again, on an overloaded system, it's just possible that the exec'd program runs for quite some time, or even to completion, before the process which actually exec'd it (let's call it the "ancestor" rather than "parent", as it's not supposed to exist) has reached the_exit(0)call in the CRT. In this race condition, the handles may still be open when theSys.removeis reached. That's obviously not a situation which can happen on a POSIX system, where the ancestor still running while the exec'd image is running is a major misimplementation!This PR is therefore three commits:
cloexec.mlcreates a filelock.txt(withO_CLOEXEC) and locks the entire file for reading and writing. Again, on Windows only,fdstatus_main.mlopens thislock.txtand also attempts to lock the file for reading and writing. This will only succeed once the "ancestor" has completely terminated - in other words,fdstatus_main.mlis then able to block untilcloexec.mlhas definitely terminated.tmp.txtwithFILE_SHARE_DELETEand then adding an additional C stub which is used at the start offdstatus_main.mlto opentmp.txtwithFILE_FLAG_DELETE_ON_CLOSE. The Windows handle from this is then intentionally leaked - the point is that the handle will be automatically closed by Windows when the process exits and when Windows Defender has finally scanned it, the kernel will dutifully delete the file for us. Enough processes get held open in the process group while this happens that ocamltest waits around as normal for the test to conclude. This is a well-known problem on Windows, but this a less-common solution to it. The more common approach is to spin for up to a few seconds, retrying the deletion every 100ms or so. The solution I propose here is technically neater - the file is temporary, and we're using an (old and) actual Windows technology for dealing with the removal of temporary files, but also the value of the timeout we'd need for the spin is hard to determine - in one run on precheck, the delay is 10 seconds...!GetHandleInformationto determine if the Win32 handle has been marked as inheritable. If it is not, then we passO_NOINHERITto_open_osf_handle, which causes the CRT fd book-keeping to be corrected.Note that while the fix to
caml_win32_CRT_fd_of_filedescris a bug-fix, it is not as critical a bug-fix as it might look on first glance. Handles marked as close-on-exec were not being passed to any exec'd processes - the issue is that the Win32 handle values were being passed in fds, but the descriptors fail on first use. This combined slightly unfortunately in the cloexec.ml test with the fact that Windows uses HANDLE values for multiple things and meant that the fdstatus_main.ml might re-use the value of a HANDLE with the result that an fd appeared open - but it would not be a handle to the same thing as in cloexec.ml (i.e. the bug is book-keeping/correctness and not security). That said, Jenkins is relatively frequently suffering from this, so it'd be good to fix the test (again)...