Closed
Labels
bug: Something that is supposed to be working; but isn't
triage: Needs triage (eg: priority, bug/not-bug, and owning component)
Description
What is the problem?
The long-running distributed release test (pytorch_pbt_failure) fails after about 10 minutes with the following error:
2021-02-04 17:58:07,590 INFO commands.py:283 -- Checking AWS environment settings
2021-02-04 17:58:08,874 INFO commands.py:431 -- A random node will be killed. Confirm [y/N]: y [automatic, due to --yes]
2021-02-04 17:58:09,027 INFO commands.py:441 -- Shutdown i-03aa1f3b86602ada0
2021-02-04 17:58:09,028 INFO command_runner.py:356 -- Fetched IP: 52.36.104.14
2021-02-04 17:58:09,028 INFO log_timer.py:27 -- NodeUpdater: i-03aa1f3b86602ada0: Got IP [LogTimer=0ms]
Warning: Permanently added '52.36.104.14' (ECDSA) to the list of known hosts.
Error: No such container: ray_container
Shared connection to 52.36.104.14 closed.
2021-02-04 17:59:20,400 WARNING util.py:152 -- The `callbacks.on_step_begin` operation took 72.837 s, which may be a performance bottleneck.
Traceback (most recent call last):
  File "/home/ray/pytorch_pbt_failure.py", line 136, in <module>
    stop={"training_iteration": 1} if args.smoke_test else None)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/tune.py", line 421, in run
    runner.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 360, in step
    iteration=self._iteration, trials=self._trials)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/callback.py", line 172, in on_step_begin
    callback.on_step_begin(**info)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/mock.py", line 122, in on_step_begin
    override_cluster_name=None)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 460, in kill_node
    _exec(updater, "ray stop", False, False)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 912, in _exec
    shutdown_after_run=shutdown_after_run)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 627, in run
    ssh_options_override_ssh_key=ssh_options_override_ssh_key)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 519, in run
    final_cmd, with_output, exit_on_fail, silent=silent)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 445, in _run_helper
    "Command failed:\n\n {}\n".format(joined_cmd)) from None
click.exceptions.ClickException: Command failed:
ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_070dd72385/3d9ed41da7/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@52.36.104.14 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker exec -it ray_container /bin/bash -c '"'"'bash --login -c -i '"'"'"'"'"'"'"'"'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (ray stop)'"'"'"'"'"'"'"'"''"'"' )'
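The log shows docker exec failing with "No such container: ray_container", and the traceback shows the resulting ClickException propagating from kill_node through on_step_begin up to tune.run, which ends the test. Below is a minimal sketch of one possible mitigation, assuming it is acceptable for the test to skip a kill that fails. The names CommandFailed, try_kill_node, and failing_kill are hypothetical; CommandFailed stands in for click.exceptions.ClickException so the snippet runs without Ray or click installed. This is not necessarily how the issue was resolved in Ray.

```python
import logging

logger = logging.getLogger(__name__)


class CommandFailed(Exception):
    """Hypothetical stand-in for click.exceptions.ClickException."""


def try_kill_node(kill_fn, expected=(CommandFailed,)):
    """Call kill_fn(); return True on success, False if an expected error is raised.

    Exceptions that are not listed in `expected` still propagate to the caller.
    """
    try:
        kill_fn()
    except expected as exc:
        # Log and continue instead of letting the error abort the training loop.
        logger.warning("Node kill failed, continuing: %s", exc)
        return False
    return True


def failing_kill():
    # Simulates the failure above: the remote command exits non-zero.
    raise CommandFailed("Command failed: docker exec ... (ray stop)")


print(try_kill_node(failing_kill))   # False
print(try_kill_node(lambda: None))   # True
```

In the real callback, kill_fn would wrap the kill_node call seen in the traceback; whether swallowing the error is the right behavior depends on what the release test is meant to verify.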
Ray version and other system information (Python version, TensorFlow version, OS):
Reproduction (REQUIRED)
Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have no external library dependencies (i.e., use fake or mock data / environments):
If the code snippet cannot be run by itself, the issue will be closed with "needs-repro-script".
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.