[autoscaler/tune] Inconsistent state when downscaling #1899

@richardliaw

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
  • Ray installed from (source or binary):
  • Ray version: 0.4
  • Python version:
  • Exact command to reproduce:

Describe the problem

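The autoscaler appears to enter an inconsistent state once it starts downscaling: right after terminating idle nodes it launches replacements to get back to the target count, and `launch_new_node` eventually fails with `AssertionError: Num nodes failed to increase after creating a new node`.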
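
To make the failure mode concrete, here is a minimal sketch of how a before/after node-count check can trip. It is only a hypothesis, not a confirmed root cause: it assumes the provider's node listing lags behind node creation. `LaggingProvider`, `refresh()`, and this `launch_new_node` are hypothetical stand-ins modeled on the assertion message in the traceback, not Ray's actual code.

```python
class LaggingProvider:
    """Hypothetical provider: new nodes are listed only after refresh()."""

    def __init__(self, node_ids):
        self._visible = set(node_ids)
        self._pending = set()

    def nodes(self):
        return set(self._visible)

    def create_node(self, node_id):
        # The node exists, but the listing does not show it yet (assumption).
        self._pending.add(node_id)

    def refresh(self):
        # Models the listing catching up some time later.
        self._visible |= self._pending
        self._pending.clear()


def launch_new_node(provider, node_id):
    """Count-based check modeled on the assertion message in the traceback."""
    num_before = len(provider.nodes())
    provider.create_node(node_id)
    assert len(provider.nodes()) > num_before, (
        "Num nodes failed to increase after creating a new node")


provider = LaggingProvider(["node-%d" % i for i in range(8)])
try:
    launch_new_node(provider, "node-new")
    failed = False
except AssertionError:
    failed = True

provider.refresh()
print(failed, len(provider.nodes()))  # prints "True 9"
```

In this toy model the assertion fails even though the node was created, which would be consistent with the log going from `8/9` to the error and then to `9/9` on the next status line. Whether that is what actually happens here still needs to be verified.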

Source code / logs

(ray) C02TX1VXHTDD:ec2 rliaw$ cat /Users/rliaw/Downloads/raylogs/monitor-2018-04-11_18-37-25-07373.out | grep Traceback -C 50
 - NumNodesConnected: 10
 - NumNodesUsed: 7.0
 - ResourceUsage: 0.0/720.0 CPU, 63.0/90.0 GPU
StandardAutoscaler: Terminating idle node: i-00eb871722ab3ef53
StandardAutoscaler: Terminating idle node: i-03a1a9e59d316c57c
StandardAutoscaler [2018-04-12 00:27:38.671127]: 7/9 target nodes
 - NodeIdleSeconds: Min=1 Mean=89 Max=305
 - NumNodesConnected: 10
 - NumNodesUsed: 7.0
 - ResourceUsage: 0.0/720.0 CPU, 63.0/90.0 GPU
StandardAutoscaler: Launching 2 new nodes
StandardAutoscaler [2018-04-12 00:27:40.652004]: 9/9 target nodes
 - NodeIdleSeconds: Min=3 Mean=91 Max=307
 - NumNodesConnected: 10
 - NumNodesUsed: 7.0
 - ResourceUsage: 0.0/720.0 CPU, 63.0/90.0 GPU
StandardAutoscaler [2018-04-12 00:27:43.596031]: 9/9 target nodes
 - NodeIdleSeconds: Min=2 Mean=92 Max=309
 - NumNodesConnected: 10
 - NumNodesUsed: 7.0
 - ResourceUsage: 0.0/576.0 CPU, 63.0/72.0 GPU
Removed 2 stale ip mappings: {'172.31.24.15', '172.31.23.76'} not in {'172.31.27.74', '172.31.18.42', '172.31.28.247', '172.31.16.73', '172.31.28.188', '172.31.30.218', '172.31.16.192', '172.31.26.25', '172.31.29.215', '172.31.30.85'}
Removed 2 stale ip mappings: {'172.31.24.15', '172.31.23.76'} not in {'172.31.27.74', '172.31.18.42', '172.31.28.247', '172.31.16.73', '172.31.28.188', '172.31.30.218', '172.31.16.192', '172.31.26.25', '172.31.29.215', '172.31.30.85'}
Removed 2 stale ip mappings: {'172.31.24.15', '172.31.23.76'} not in {'172.31.27.74', '172.31.18.42', '172.31.28.247', '172.31.16.73', '172.31.28.188', '172.31.30.218', '172.31.16.192', '172.31.26.25', '172.31.29.215', '172.31.30.85'}
StandardAutoscaler [2018-04-12 00:27:48.907662]: 9/9 target nodes
 - NodeIdleSeconds: Min=0 Mean=37 Max=292
 - NumNodesConnected: 8
 - NumNodesUsed: 7.0
 - ResourceUsage: 0.0/576.0 CPU, 63.0/72.0 GPU
StandardAutoscaler: i-01b4588cb28aa093d has runtime state None, want 7fcab4984a7206ef0a66161eed19d8b36597788c
StandardAutoscaler: i-00d0a6d166f459c7e has runtime state None, want 7fcab4984a7206ef0a66161eed19d8b36597788c
NodeUpdater: Updating i-01b4588cb28aa093d to 7fcab4984a7206ef0a66161eed19d8b36597788c, logging to /tmp/node-updater-6d36g0kv
NodeUpdater: Updating i-00d0a6d166f459c7e to 7fcab4984a7206ef0a66161eed19d8b36597788c, logging to /tmp/node-updater-olw0oad2
StandardAutoscaler [2018-04-12 00:27:53.830742]: 9/9 target nodes (2 updating)
 - NodeIdleSeconds: Min=0 Mean=37 Max=297
 - NumNodesConnected: 8
 - NumNodesUsed: 7.0
 - ResourceUsage: 0.0/576.0 CPU, 63.0/72.0 GPU
StandardAutoscaler [2018-04-12 00:27:58.864586]: 9/9 target nodes (2 updating)
 - NodeIdleSeconds: Min=0 Mean=38 Max=302
 - NumNodesConnected: 8
 - NumNodesUsed: 7.0
 - ResourceUsage: 0.0/576.0 CPU, 63.0/72.0 GPU
StandardAutoscaler: Terminating idle node: i-0797e33d0134d26bd
StandardAutoscaler [2018-04-12 00:27:59.727026]: 8/9 target nodes (2 updating)
 - NodeIdleSeconds: Min=1 Mean=38 Max=303
 - NumNodesConnected: 8
 - NumNodesUsed: 7.0
 - ResourceUsage: 0.0/576.0 CPU, 63.0/72.0 GPU
StandardAutoscaler: Launching 1 new nodes
StandardAutoscaler: Error during autoscaling: {} Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/autoscaler/autoscaler.py", line 246, in update
    self._update()
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/autoscaler/autoscaler.py", line 308, in _update
    min(self.max_concurrent_launches, target_num - len(nodes)))
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/autoscaler/autoscaler.py", line 458, in launch_new_node
    "Num nodes failed to increase after creating a new node"
AssertionError: Num nodes failed to increase after creating a new node

StandardAutoscaler [2018-04-12 00:28:04.632214]: 9/9 target nodes (2 updating)
 - NodeIdleSeconds: Min=2 Mean=41 Max=308
 - NumNodesConnected: 8
 - NumNodesUsed: 7.0
 - ResourceUsage: 0.0/504.0 CPU, 63.0/63.0 GPU
Removed 1 stale ip mappings: {'172.31.29.215'} not in {'172.31.27.74', '172.31.18.42', '172.31.16.73', '172.31.28.247', '172.31.28.188', '172.31.30.218', '172.31.16.192', '172.31.26.25', '172.31.17.181', '172.31.30.85'}
Removed 1 stale ip mappings: {'172.31.29.215'} not in {'172.31.27.74', '172.31.18.42', '172.31.16.73', '172.31.28.247', '172.31.28.188', '172.31.30.218', '172.31.16.192', '172.31.26.25', '172.31.17.181', '172.31.30.85'}
Removed 1 stale ip mappings: {'172.31.29.215'} not in {'172.31.27.74', '172.31.18.42', '172.31.16.73', '172.31.28.247', '172.31.28.188', '172.31.30.218', '172.31.16.192', '172.31.26.25', '172.31.17.181', '172.31.30.85'}
StandardAutoscaler [2018-04-12 00:28:09.637022]: 9/9 target nodes (2 updating)
 - NodeIdleSeconds: Min=0 Mean=0 Max=0
 - NumNodesConnected: 7
 - NumNodesUsed: 7.0
 - ResourceUsage: 0.0/504.0 CPU, 63.0/63.0 GPU
StandardAutoscaler: i-0f50dfb903178000f has runtime state None, want 7fcab4984a7206ef0a66161eed19d8b36597788c
NodeUpdater: Updating i-0f50dfb903178000f to 7fcab4984a7206ef0a66161eed19d8b36597788c, logging to /tmp/node-updater-1khuufqg
StandardAutoscaler [2018-04-12 00:28:14.782816]: 9/9 target nodes (3 updating)
 - NodeIdleSeconds: Min=0 Mean=0 Max=0
 - NumNodesConnected: 7
 - NumNodesUsed: 7.0
 - ResourceUsage: 0.0/504.0 CPU, 63.0/63.0 GPU
StandardAutoscaler [2018-04-12 00:28:19.808195]: 9/9 target nodes (3 updating)
 - NodeIdleSeconds: Min=0 Mean=0 Max=0
 - NumNodesConnected: 7
 - NumNodesUsed: 7.0
 - ResourceUsage: 0.0/504.0 CPU, 63.0/63.0 GPU
StandardAutoscaler [2018-04-12 00:28:24.845424]: 9/9 target nodes (3 updating)
 - NodeIdleSeconds: Min=0 Mean=0 Max=0
 - NumNodesConnected: 7
 - NumNodesUsed: 7.0
 - ResourceUsage: 0.0/504.0 CPU, 63.0/63.0 GPU
StandardAutoscaler [2018-04-12 00:28:29.874830]: 9/9 target nodes (3 updating)
 - NodeIdleSeconds: Min=0 Mean=0 Max=0
 - NumNodesConnected: 7
 - NumNodesUsed: 7.0
 - ResourceUsage: 0.0/504.0 CPU, 63.0/63.0 GPU
StandardAutoscaler [2018-04-12 00:28:35.665921]: 9/9 target nodes (3 updating)
 - NodeIdleSeconds: Min=0 Mean=0 Max=0
 - NumNodesConnected: 7
 - NumNodesUsed: 7.0
 - ResourceUsage: 0.0/504.0 CPU, 63.0/63.0 GPU
StandardAutoscaler [2018-04-12 00:28:40.068353]: 9/9 target nodes (3 updating)
 - NodeIdleSeconds: Min=0 Mean=0 Max=0
--

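A second problem shows up in the same logs: after the same AssertionError, the node count drops to 3/8 and the monitor repeatedly warns that it cannot find IPs for clients: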

--
 - NumNodesUsed: 5.67
 - ResourceUsage: 0.0/648.0 CPU, 51.0/81.0 GPU
StandardAutoscaler [2018-04-12 08:30:48.135179]: 8/8 target nodes
 - NodeIdleSeconds: Min=0 Mean=59 Max=266
 - NumNodesConnected: 9
 - NumNodesUsed: 5.67
 - ResourceUsage: 0.0/648.0 CPU, 51.0/81.0 GPU
StandardAutoscaler [2018-04-12 08:30:53.172798]: 8/8 target nodes
 - NodeIdleSeconds: Min=0 Mean=60 Max=271
 - NumNodesConnected: 9
 - NumNodesUsed: 5.67
 - ResourceUsage: 0.0/648.0 CPU, 51.0/81.0 GPU
StandardAutoscaler [2018-04-12 08:30:58.382863]: 8/8 target nodes
 - NodeIdleSeconds: Min=0 Mean=61 Max=277
 - NumNodesConnected: 9
 - NumNodesUsed: 5.67
 - ResourceUsage: 0.0/648.0 CPU, 51.0/81.0 GPU
StandardAutoscaler [2018-04-12 08:31:03.363217]: 8/8 target nodes
 - NodeIdleSeconds: Min=0 Mean=62 Max=282
 - NumNodesConnected: 9
 - NumNodesUsed: 5.67
 - ResourceUsage: 0.0/648.0 CPU, 51.0/81.0 GPU
StandardAutoscaler [2018-04-12 08:31:08.471393]: 8/8 target nodes
 - NodeIdleSeconds: Min=0 Mean=63 Max=287
 - NumNodesConnected: 9
 - NumNodesUsed: 5.67
 - ResourceUsage: 0.0/648.0 CPU, 51.0/81.0 GPU
StandardAutoscaler [2018-04-12 08:31:13.574426]: 8/8 target nodes
 - NodeIdleSeconds: Min=0 Mean=64 Max=292
 - NumNodesConnected: 9
 - NumNodesUsed: 5.67
 - ResourceUsage: 0.0/648.0 CPU, 51.0/81.0 GPU
StandardAutoscaler [2018-04-12 08:31:18.571424]: 8/8 target nodes
 - NodeIdleSeconds: Min=0 Mean=65 Max=297
 - NumNodesConnected: 9
 - NumNodesUsed: 5.67
 - ResourceUsage: 0.0/648.0 CPU, 51.0/81.0 GPU
StandardAutoscaler [2018-04-12 08:31:23.713460]: 8/8 target nodes
 - NodeIdleSeconds: Min=0 Mean=67 Max=302
 - NumNodesConnected: 9
 - NumNodesUsed: 5.67
 - ResourceUsage: 0.0/648.0 CPU, 51.0/81.0 GPU
StandardAutoscaler: Terminating idle node: i-0057848d7c2279b1f
StandardAutoscaler: Terminating idle node: i-016a30fe73374594f
StandardAutoscaler [2018-04-12 08:31:24.528971]: 6/8 target nodes
 - NodeIdleSeconds: Min=1 Mean=67 Max=303
 - NumNodesConnected: 9
 - NumNodesUsed: 5.67
 - ResourceUsage: 0.0/648.0 CPU, 51.0/81.0 GPU
StandardAutoscaler: Launching 2 new nodes
StandardAutoscaler: Error during autoscaling: {} Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/autoscaler/autoscaler.py", line 246, in update
    self._update()
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/autoscaler/autoscaler.py", line 308, in _update
    min(self.max_concurrent_launches, target_num - len(nodes)))
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/autoscaler/autoscaler.py", line 458, in launch_new_node
    "Num nodes failed to increase after creating a new node"
AssertionError: Num nodes failed to increase after creating a new node

StandardAutoscaler [2018-04-12 08:42:00.481457]: 3/8 target nodes
 - NodeIdleSeconds: Min=5 Mean=212 Max=939
 - NumNodesConnected: 9
 - NumNodesUsed: 5.67
 - ResourceUsage: 0.0/504.0 CPU, 51.0/63.0 GPU
Removed 7 stale ip mappings: {'172.31.27.74', '172.31.22.168', '172.31.18.42', '172.31.16.73', '172.31.28.188', '172.31.18.120', '172.31.30.85'} not in {'172.31.24.65', '172.31.21.3', '172.31.26.25', '172.31.30.218'}
Removed 7 stale ip mappings: {'172.31.27.74', '172.31.22.168', '172.31.18.42', '172.31.16.73', '172.31.28.188', '172.31.18.120', '172.31.30.85'} not in {'172.31.24.65', '172.31.21.3', '172.31.26.25', '172.31.30.218'}
Removed 7 stale ip mappings: {'172.31.27.74', '172.31.22.168', '172.31.18.42', '172.31.16.73', '172.31.28.188', '172.31.18.120', '172.31.30.85'} not in {'172.31.24.65', '172.31.21.3', '172.31.26.25', '172.31.30.218'}
StandardAutoscaler: i-056b298d20ffb7d16 has runtime state None, want 7fcab4984a7206ef0a66161eed19d8b36597788c
NodeUpdater: Updating i-056b298d20ffb7d16 to 7fcab4984a7206ef0a66161eed19d8b36597788c, logging to /tmp/node-updater-gj5ktv7o
Warning: could not find ip for client 26fc174ef12ac7c6b430db8821a984be61c9df42.
Warning: could not find ip for client 144919cdb0d5cdeb6cdb1828f96d1372d472f9d8.
Warning: could not find ip for client 4e88ecf12e7cb9751f7d95566d0eb25268d89456.
Warning: could not find ip for client dc53f86651004cd1371f943b1bc964317b1f2ac8.
Warning: could not find ip for client 32485125b848396d1f3ebdc91b31b2ae7023a9cb.
Warning: could not find ip for client 26fc174ef12ac7c6b430db8821a984be61c9df42.
Warning: could not find ip for client 144919cdb0d5cdeb6cdb1828f96d1372d472f9d8.
Warning: could not find ip for client 4e88ecf12e7cb9751f7d95566d0eb25268d89456.
Warning: could not find ip for client dc53f86651004cd1371f943b1bc964317b1f2ac8.
Warning: could not find ip for client 32485125b848396d1f3ebdc91b31b2ae7023a9cb.
Warning: could not find ip for client 26fc174ef12ac7c6b430db8821a984be61c9df42.
Warning: could not find ip for client 144919cdb0d5cdeb6cdb1828f96d1372d472f9d8.
Warning: could not find ip for client 4e88ecf12e7cb9751f7d95566d0eb25268d89456.
Warning: could not find ip for client dc53f86651004cd1371f943b1bc964317b1f2ac8.
Warning: could not find ip for client 32485125b848396d1f3ebdc91b31b2ae7023a9cb.
Warning: could not find ip for client 26fc174ef12ac7c6b430db8821a984be61c9df42.
Warning: could not find ip for client 144919cdb0d5cdeb6cdb1828f96d1372d472f9d8.
Warning: could not find ip for client 4e88ecf12e7cb9751f7d95566d0eb25268d89456.
Warning: could not find ip for client dc53f86651004cd1371f943b1bc964317b1f2ac8.
Warning: could not find ip for client 32485125b848396d1f3ebdc91b31b2ae7023a9cb.
Warning: could not find ip for client 26fc174ef12ac7c6b430db8821a984be61c9df42.
Warning: could not find ip for client 144919cdb0d5cdeb6cdb1828f96d1372d472f9d8.
Warning: could not find ip for client 4e88ecf12e7cb9751f7d95566d0eb25268d89456.
Warning: could not find ip for client dc53f86651004cd1371f943b1bc964317b1f2ac8.
Warning: could not find ip for client 32485125b848396d1f3ebdc91b31b2ae7023a9cb.
Warning: could not find ip for client 26fc174ef12ac7c6b430db8821a984be61c9df42.
Warning: could not find ip for client 144919cdb0d5cdeb6cdb1828f96d1372d472f9d8.
Warning: could not find ip for client 4e88ecf12e7cb9751f7d95566d0eb25268d89456.
Warning: could not find ip for client dc53f86651004cd1371f943b1bc964317b1f2ac8.
Warning: could not find ip for client 32485125b848396d1f3ebdc91b31b2ae7023a9cb.
Warning: could not find ip for client 26fc174ef12ac7c6b430db8821a984be61c9df42.
Warning: could not find ip for client 144919cdb0d5cdeb6cdb1828f96d1372d472f9d8.
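
For triaging longer monitor logs, a small script along these lines can pull out the relevant events and flag every launch that directly follows a termination. It is a rough sketch that assumes the line formats seen in the excerpts above; `summarize` and `find_churn` are hypothetical helper names.

```python
import re

STATUS_RE = re.compile(
    r"StandardAutoscaler \[(?P<ts>[^\]]+)\]: (?P<have>\d+)/(?P<want>\d+) target nodes")


def summarize(lines):
    """Turn raw monitor log lines into a list of event tuples."""
    events = []
    for line in lines:
        line = line.strip()
        m = STATUS_RE.match(line)
        if m:
            events.append(("status", m.group("ts"),
                           int(m.group("have")), int(m.group("want"))))
        elif line.startswith("StandardAutoscaler: Terminating idle node:"):
            events.append(("terminate", line.rsplit(" ", 1)[-1]))
        elif line.startswith("StandardAutoscaler: Launching"):
            events.append(("launch", int(line.split()[2])))
        elif line.startswith("AssertionError:"):
            events.append(("error", line))
    return events


def find_churn(events):
    """Indices of launch events preceded by a terminate with no launch in between."""
    churn = []
    saw_terminate = False
    for i, ev in enumerate(events):
        if ev[0] == "terminate":
            saw_terminate = True
        elif ev[0] == "launch":
            if saw_terminate:
                churn.append(i)
            saw_terminate = False
    return churn


# Lines copied from the second excerpt above (traceback frames omitted).
sample = """\
StandardAutoscaler: Terminating idle node: i-0057848d7c2279b1f
StandardAutoscaler: Terminating idle node: i-016a30fe73374594f
StandardAutoscaler [2018-04-12 08:31:24.528971]: 6/8 target nodes
StandardAutoscaler: Launching 2 new nodes
StandardAutoscaler: Error during autoscaling: {} Traceback (most recent call last):
AssertionError: Num nodes failed to increase after creating a new node
"""

events = summarize(sample.splitlines())
for i in find_churn(events):
    print("launch right after terminate:", events[i])
```

Running it against the full monitor `.out` file (read with `open(path).readlines()`) should show how often the terminate-then-relaunch pattern occurs and whether each occurrence is followed by the assertion error.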
