Skip to content

[dashboard] Fix retrieving IP address from the GPUProfilingManager on the dashboard agent#53807

Merged
justinvyu merged 3 commits intoray-project:masterfrom
justinvyu:fix_ip_mac_issue
Jun 13, 2025
Merged

[dashboard] Fix retrieving IP address from the GPUProfilingManager on the dashboard agent#53807
justinvyu merged 3 commits intoray-project:masterfrom
justinvyu:fix_ip_mac_issue

Conversation

@justinvyu
Copy link
Copy Markdown
Contributor

@justinvyu justinvyu commented Jun 13, 2025

Summary

The current code to get the current IP address errors for some versions of MacOS. Instead, we can pass in the IP address from the dashboard agent directly.

hostname = socket.gethostname()
self._ip_address = socket.gethostbyname(hostname)

This code flakily fails on some Mac versions with this error:

Traceback (most recent call last):
  File "/Users/eoakes/code/ray/python/ray/dashboard/agent.py", line 451, in <module>
    loop.run_until_complete(agent.run())
  File "/Users/eoakes/miniconda3/envs/ray/lib/python3.12/asyncio/base_events.py", line 686, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/Users/eoakes/code/ray/python/ray/dashboard/agent.py", line 166, in run
    modules = self._load_modules()
              ^^^^^^^^^^^^^^^^^^^^
  File "/Users/eoakes/code/ray/python/ray/dashboard/agent.py", line 146, in _load_modules
    c = cls(self)
        ^^^^^^^^^
  File "/Users/eoakes/code/ray/python/ray/dashboard/modules/reporter/reporter_agent.py", line 441, in __init__
    self._gpu_profiling_manager = GpuProfilingManager(self._log_dir)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/eoakes/code/ray/python/ray/dashboard/modules/reporter/gpu_profile_manager.py", line 73, in __init__
    self._ip_address = socket.gethostbyname(hostname)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
socket.gaierror: [Errno 8] nodename nor servname provided, or not known

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Copilot AI review requested due to automatic review settings June 13, 2025 17:12
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR resolves an issue with retrieving the IP address on MacOS by changing the implementation to pass the IP address directly from the dashboard agent instead of determining it locally. Key changes include:

  • Passing a new ip_address parameter to GpuProfilingManager in the tests.
  • Updating reporter_agent.py to construct GpuProfilingManager with the ip_address.
  • Removing the socket-based IP address resolution logic from gpu_profile_manager.py in favor of the passed ip_address.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
python/ray/dashboard/modules/reporter/tests/test_gpu_profiler_manager.py Tests updated to pass a mock IP address instead of relying on local resolution.
python/ray/dashboard/modules/reporter/reporter_agent.py GpuProfilingManager construction updated to include the ip_address parameter.
python/ray/dashboard/modules/reporter/gpu_profile_manager.py init modified to accept ip_address and removed socket usage for IP retrieval.
Comments suppressed due to low confidence (1)

python/ray/dashboard/modules/reporter/gpu_profile_manager.py:63

  • [nitpick] Update the constructor's docstring to document the new ip_address parameter for improved clarity for future maintainers.
def __init__(self, profile_dir_path: str, ip_address: str):

Comment on lines +40 to +41
MOCK_IP_ADDRESS = "127.0.0.1"

Copy link

Copilot AI Jun 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Consider refactoring the use of MOCK_IP_ADDRESS into a dedicated fixture to reduce repetition and make the tests more maintainable.

Suggested change
MOCK_IP_ADDRESS = "127.0.0.1"
@pytest.fixture
def mock_ip_address():
yield "127.0.0.1"

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@justinvyu separately, the naming in these files is sloppy/inconsistent, suggest to clean it up:

  • Should be GPUProfilingManager, not GpuProfilingManager because GPU is an initialism
  • The file is called gpu_profile_manager.py, the class is GpuProfilingManager, and the test is gpu_profiler_manager.py. Stick to "profile" or "profiling" and make it consistent.

@edoakes
Copy link
Copy Markdown
Collaborator

edoakes commented Jun 13, 2025

@justinvyu please also include details about the error this fixes (including the traceback) and any manual testing steps you took to verify that it fixes the issue.

@justinvyu
Copy link
Copy Markdown
Contributor Author

@edoakes I'm not able to reproduce this on my machine -- I tried the test_multiprocessing.py repro you mention in your PR with master, but it has been consistently passing.

@edoakes
Copy link
Copy Markdown
Collaborator

edoakes commented Jun 13, 2025

@edoakes I'm not able to reproduce this on my machine -- I tried the test_multiprocessing.py repro you mention in your PR with master, but it has been consistently passing.

Ah I'm realizing now I was running it against the version of that test in this branch: #53802

Not sure the difference, but it only failed on that branch for me as well.

@edoakes
Copy link
Copy Markdown
Collaborator

edoakes commented Jun 13, 2025

Prior to this change, it failed ~1/10 times with this traceback:

Traceback (most recent call last):
  File "/Users/eoakes/code/ray/python/ray/dashboard/agent.py", line 451, in <module>
    loop.run_until_complete(agent.run())
  File "/Users/eoakes/miniconda3/envs/ray/lib/python3.12/asyncio/base_events.py", line 686, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/Users/eoakes/code/ray/python/ray/dashboard/agent.py", line 166, in run
    modules = self._load_modules()
              ^^^^^^^^^^^^^^^^^^^^
  File "/Users/eoakes/code/ray/python/ray/dashboard/agent.py", line 146, in _load_modules
    c = cls(self)
        ^^^^^^^^^
  File "/Users/eoakes/code/ray/python/ray/dashboard/modules/reporter/reporter_agent.py", line 441, in __init__
    self._gpu_profiling_manager = GpuProfilingManager(self._log_dir)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/eoakes/code/ray/python/ray/dashboard/modules/reporter/gpu_profile_manager.py", line 73, in __init__
    self._ip_address = socket.gethostbyname(hostname)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
socket.gaierror: [Errno 8] nodename nor servname provided, or not known

With the change in this PR, I haven't observed any failures after >100 runs.

Copy link
Copy Markdown
Collaborator

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀



class GpuProfilingManager:
class GPUProfilingManager:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this renaming seems to not related to the bug fixing?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it's unrelated to make some naming more conventional

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe consider splitting the rename to another PR? or at least when cherrypicking, only cherrypick the fix without the renaming?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

cc @edoakes Going to revert the naming changes for this PR and do them in a followup to have a minimal fix for cherry-pick.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, better PR hygiene to split behavior changes and refactoring

@justinvyu justinvyu enabled auto-merge (squash) June 13, 2025 20:47
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Jun 13, 2025
@justinvyu justinvyu disabled auto-merge June 13, 2025 20:56
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@justinvyu justinvyu enabled auto-merge (squash) June 13, 2025 21:48
@justinvyu justinvyu merged commit d5c6beb into ray-project:master Jun 13, 2025
6 checks passed
@justinvyu justinvyu deleted the fix_ip_mac_issue branch June 13, 2025 22:31
justinvyu added a commit to justinvyu/ray that referenced this pull request Jun 13, 2025
…on the dashboard agent (ray-project#53807)

The current code to get the current IP address errors for some versions
of MacOS. Instead, we can pass in the IP address from the dashboard
agent directly.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
aslonnie pushed a commit that referenced this pull request Jun 14, 2025
…ingManager on the dashboard agent (#53817)

This is a cherry-pick of #53807

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Jun 18, 2025
…on the dashboard agent (#53807)

The current code to get the current IP address errors for some versions
of MacOS. Instead, we can pass in the IP address from the dashboard
agent directly.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025
…on the dashboard agent (#53807)

The current code to get the current IP address errors for some versions
of MacOS. Instead, we can pass in the IP address from the dashboard
agent directly.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
weiquanlee pushed a commit to antgroup/ant-ray that referenced this pull request Aug 5, 2025
…ingManager on the dashboard agent (ray-project#53817)

This is a cherry-pick of ray-project#53807

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants