Long running distributed test fails #13923

@wuisawesome

What is the problem?

The long running distributed release test (pytorch_pbt_failure) fails after around 10 minutes with the following error:

2021-02-04 17:58:07,590 INFO commands.py:283 -- Checking AWS environment settings
2021-02-04 17:58:08,874 INFO commands.py:431 -- A random node will be killed. Confirm [y/N]: y [automatic, due to --yes]
2021-02-04 17:58:09,027 INFO commands.py:441 -- Shutdown i-03aa1f3b86602ada0
2021-02-04 17:58:09,028 INFO command_runner.py:356 -- Fetched IP: 52.36.104.14
2021-02-04 17:58:09,028 INFO log_timer.py:27 -- NodeUpdater: i-03aa1f3b86602ada0: Got IP  [LogTimer=0ms]
Warning: Permanently added '52.36.104.14' (ECDSA) to the list of known hosts.
Error: No such container: ray_container
Shared connection to 52.36.104.14 closed.
2021-02-04 17:59:20,400 WARNING util.py:152 -- The `callbacks.on_step_begin` operation took 72.837 s, which may be a performance bottleneck.
Traceback (most recent call last):
  File "/home/ray/pytorch_pbt_failure.py", line 136, in <module>
    stop={"training_iteration": 1} if args.smoke_test else None)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/tune.py", line 421, in run
    runner.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 360, in step
    iteration=self._iteration, trials=self._trials)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/callback.py", line 172, in on_step_begin
    callback.on_step_begin(**info)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/mock.py", line 122, in on_step_begin
    override_cluster_name=None)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 460, in kill_node
    _exec(updater, "ray stop", False, False)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 912, in _exec
    shutdown_after_run=shutdown_after_run)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 627, in run
    ssh_options_override_ssh_key=ssh_options_override_ssh_key)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 519, in run
    final_cmd, with_output, exit_on_fail, silent=silent)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 445, in _run_helper
    "Command failed:\n\n  {}\n".format(joined_cmd)) from None
click.exceptions.ClickException: Command failed:

  ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_070dd72385/3d9ed41da7/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@52.36.104.14 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker exec -it  ray_container /bin/bash -c '"'"'bash --login -c -i '"'"'"'"'"'"'"'"'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (ray stop)'"'"'"'"'"'"'"'"''"'"' )'
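
The traceback shows the exception from the failed `ray stop` command propagating out of the `on_step_begin` callback and ending the whole Tune run. As a minimal sketch of one possible mitigation (not a confirmed fix, and not Ray's actual code), the failure-injection step could treat a failed kill as non-fatal and log it instead. `CommandFailed` and `inject_failure` are hypothetical names introduced for illustration; `CommandFailed` stands in for the `click.exceptions.ClickException` raised above.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


class CommandFailed(Exception):
    """Hypothetical stand-in for click.exceptions.ClickException."""


def inject_failure(kill_node: Callable[[], str]) -> Optional[str]:
    """Call kill_node() and return its result, or None if the command failed.

    kill_node is any zero-argument callable that kills a node and returns
    its IP (an assumption made for this sketch). A failure such as
    "No such container" is logged rather than re-raised, so the caller's
    training loop keeps running.
    """
    try:
        return kill_node()
    except CommandFailed as exc:
        logger.warning("Node kill failed, continuing without it: %s", exc)
        return None
```

Swallowing the error keeps the long running test alive, but it also hides a kill that never happened, so a real fix would still need to address why `ray_container` was missing on the chosen node.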

Ray version and other system information (Python version, TensorFlow version, OS):

Reproduction (REQUIRED)

Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have no external library dependencies (i.e., use fake or mock data / environments):

If the code snippet cannot be run by itself, the issue will be closed with "needs-repro-script".

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

    Labels

    bug: Something that is supposed to be working; but isn't
    triage: Needs triage (eg: priority, bug/not-bug, and owning component)
