[Core] Use fd instead of handle for windows log redirection#53852
[Core] Use fd instead of handle for windows log redirection#53852jjyao merged 5 commits intoray-project:masterfrom
Conversation
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
| if (handle == INVALID_HANDLE_VALUE) { | ||
| return Status::IOError("") << "Fails to get file handle for flushing"; | ||
| } | ||
| if (!FlushFileBuffers(handle)) { |
There was a problem hiding this comment.
For flush, there is no corresponding C API so we need to covert fd back to HANDLE and use the WinAPI
There was a problem hiding this comment.
will mixing the APIs in this way cause issues like the one this PR intends to fix?
There was a problem hiding this comment.
From the testing, it didn't cause issues.
I actually didn't fully understand the root cause. I initially started the PR just trying to simplify things by uniformly using fd and then realized that it fixed the issue.
One thing I noticed is that converting from HANDLE to fd transfer the ownership. But getting a HANDLE back from fd (_get_osfhandle) should be safe as long as we don't close the handle directly. All these are based on my very limited Windows knowledge.
The _open_osfhandle call transfers ownership of the Win32 file handle to the file descriptor. To close a file opened by using _open_osfhandle, call [_close](https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/close?view=msvc-170). The underlying OS file handle is also closed by a call to _close. Don't call the Win32 function CloseHandle on the original handle. If the file descriptor is owned by a FILE * stream, then a call to [fclose](https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/fclose-fcloseall?view=msvc-170) closes both the file descriptor and the underlying handle. In this case, don't call _close on the file descriptor or CloseHandle on the original handle.
There was a problem hiding this comment.
I don't pretend to understand this and the PR is merged, but could you please leave a comment in compat.h that describes your current understanding and can aid in future debugging? This might break again in the future.
|
Have you verified that this fixes the reported issue? If so, can you detail how you tested it? |
|
I tried it and it works, (ray-dev) PS C:\Users\czgdp1807\Anyscale\ray\python> ray start --head
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
Local node IP: 127.0.0.1
--------------------
Ray runtime started.
--------------------
Next steps
To connect to this Ray cluster:
import ray
ray.init()
To terminate the Ray runtime, run
ray stop
To view the status of the cluster, use
ray status
(ray-dev) PS C:\Users\czgdp1807\Anyscale\ray\python> ray status
======== Autoscaler status: 2025-06-17 17:50:13.618875 ========
Node status
---------------------------------------------------------------
Active:
1 node_1f5d161361f94b4077b0d9ee34c485990b85c77bd55bd6db82d8766a
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Total Usage:
0.0/4.0 CPU
0B/7.21GiB memory
0B/3.09GiB object_store_memory
Total Constraints:
(no request_resources() constraints)
Total Demands:
(no resource demands) |
czgdp1807
left a comment
There was a problem hiding this comment.
Seems like fds are bing directly used in this PR and that's why its working on Windows 11.
|
Could we get a cherry-pick for that PR to be integrated in recently published 2.47 release? |
|
@PhilippWillms we will release it as 2.48 which should be released in 2-3 weeks. Would that work? |
israbbani
left a comment
There was a problem hiding this comment.
I don't understand the change fully, but it might be worth documenting this in the code to aid future debugging.
Thanks for debugging and fixing this!
| if (handle == INVALID_HANDLE_VALUE) { | ||
| return Status::IOError("") << "Fails to get file handle for flushing"; | ||
| } | ||
| if (!FlushFileBuffers(handle)) { |
There was a problem hiding this comment.
I don't pretend to understand this and the PR is merged, but could you please leave a comment in compat.h that describes your current understanding and can aid in future debugging? This might break again in the future.
| Status CompleteWrite(int fd, const char *data, size_t len) { | ||
| const int ret = _write(fd, data, len); | ||
| if (ret == -1) { | ||
| return Status::IOError("") << "Fails to write to file because " << strerror(errno); |
| Status CompleteWrite(int fd, const char *data, size_t len) { | ||
| const int ret = _write(fd, data, len); | ||
| if (ret == -1) { | ||
| return Status::IOError("") << "Fails to write to file because " << strerror(errno); |
| } | ||
| return Status::OK(); | ||
| } | ||
| Status Close(MEMFD_TYPE_NON_UNIQUE fd) { |
There was a problem hiding this comment.
Why was this MEMFD_TYPE_NON_UNIQUE before and isn't that just an alias for int anyway?
| } | ||
| if ((DWORD)len != bytes_written) { | ||
| if (ret != static_cast<int>(len)) { | ||
| return Status::IOError("") << "Fails to write all requested bytes, requests to write " |
There was a problem hiding this comment.
s/Fails/Failed/
s/write/wrote/
| auto ostream = | ||
| std::make_shared<boost::iostreams::stream<boost::iostreams::file_descriptor_sink>>( | ||
| std::move(fd_sink)); | ||
| #if defined(__APPLE__) || defined(__linux__) |
There was a problem hiding this comment.
Probably also deserves a comment that explains the edge case that caused this issue.
Ok, fine for me. |
…ect#53852) Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Why are these changes needed?
Windows has two sets of APIs for working with files. One is the WinAPI which works with
HANDLE. The other is C API which works withfd(just like Unix). Currently in master we are mix using both and seems have caused issues like #52739. In this PR, we use the C API instead which is more consistent with the Unix code path.ray startpreviously failed on my windows machine now passes with this PR.Related issue number
Closes #52739
Checks
git commit -s) in this PR.scripts/format.shto lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.