Closed
Description
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
- Ray installed from (source or binary):
- Ray version: 0.4
- Python version:
- Exact command to reproduce:
Describe the problem
The autoscaler appears to start downscaling, enter an inconsistent state, and eventually fail with the AssertionError shown in the logs below.
Source code / logs
(ray) C02TX1VXHTDD:ec2 rliaw$ cat /Users/rliaw/Downloads/raylogs/monitor-2018-04-11_18-37-25-07373.out | grep Traceback -C 50
- NumNodesConnected: 10
- NumNodesUsed: 7.0
- ResourceUsage: 0.0/720.0 CPU, 63.0/90.0 GPU
StandardAutoscaler: Terminating idle node: i-00eb871722ab3ef53
StandardAutoscaler: Terminating idle node: i-03a1a9e59d316c57c
StandardAutoscaler [2018-04-12 00:27:38.671127]: 7/9 target nodes
- NodeIdleSeconds: Min=1 Mean=89 Max=305
- NumNodesConnected: 10
- NumNodesUsed: 7.0
- ResourceUsage: 0.0/720.0 CPU, 63.0/90.0 GPU
StandardAutoscaler: Launching 2 new nodes
StandardAutoscaler [2018-04-12 00:27:40.652004]: 9/9 target nodes
- NodeIdleSeconds: Min=3 Mean=91 Max=307
- NumNodesConnected: 10
- NumNodesUsed: 7.0
- ResourceUsage: 0.0/720.0 CPU, 63.0/90.0 GPU
StandardAutoscaler [2018-04-12 00:27:43.596031]: 9/9 target nodes
- NodeIdleSeconds: Min=2 Mean=92 Max=309
- NumNodesConnected: 10
- NumNodesUsed: 7.0
- ResourceUsage: 0.0/576.0 CPU, 63.0/72.0 GPU
Removed 2 stale ip mappings: {'172.31.24.15', '172.31.23.76'} not in {'172.31.27.74', '172.31.18.42', '172.31.28.247', '172.31.16.73', '172.31.28.188', '172.31.30.218', '172.31.16.192', '172.31.26.25', '172.31.29.215', '172.31.30.85'}
Removed 2 stale ip mappings: {'172.31.24.15', '172.31.23.76'} not in {'172.31.27.74', '172.31.18.42', '172.31.28.247', '172.31.16.73', '172.31.28.188', '172.31.30.218', '172.31.16.192', '172.31.26.25', '172.31.29.215', '172.31.30.85'}
Removed 2 stale ip mappings: {'172.31.24.15', '172.31.23.76'} not in {'172.31.27.74', '172.31.18.42', '172.31.28.247', '172.31.16.73', '172.31.28.188', '172.31.30.218', '172.31.16.192', '172.31.26.25', '172.31.29.215', '172.31.30.85'}
StandardAutoscaler [2018-04-12 00:27:48.907662]: 9/9 target nodes
- NodeIdleSeconds: Min=0 Mean=37 Max=292
- NumNodesConnected: 8
- NumNodesUsed: 7.0
- ResourceUsage: 0.0/576.0 CPU, 63.0/72.0 GPU
StandardAutoscaler: i-01b4588cb28aa093d has runtime state None, want 7fcab4984a7206ef0a66161eed19d8b36597788c
StandardAutoscaler: i-00d0a6d166f459c7e has runtime state None, want 7fcab4984a7206ef0a66161eed19d8b36597788c
NodeUpdater: Updating i-01b4588cb28aa093d to 7fcab4984a7206ef0a66161eed19d8b36597788c, logging to /tmp/node-updater-6d36g0kv
NodeUpdater: Updating i-00d0a6d166f459c7e to 7fcab4984a7206ef0a66161eed19d8b36597788c, logging to /tmp/node-updater-olw0oad2
StandardAutoscaler [2018-04-12 00:27:53.830742]: 9/9 target nodes (2 updating)
- NodeIdleSeconds: Min=0 Mean=37 Max=297
- NumNodesConnected: 8
- NumNodesUsed: 7.0
- ResourceUsage: 0.0/576.0 CPU, 63.0/72.0 GPU
StandardAutoscaler [2018-04-12 00:27:58.864586]: 9/9 target nodes (2 updating)
- NodeIdleSeconds: Min=0 Mean=38 Max=302
- NumNodesConnected: 8
- NumNodesUsed: 7.0
- ResourceUsage: 0.0/576.0 CPU, 63.0/72.0 GPU
StandardAutoscaler: Terminating idle node: i-0797e33d0134d26bd
StandardAutoscaler [2018-04-12 00:27:59.727026]: 8/9 target nodes (2 updating)
- NodeIdleSeconds: Min=1 Mean=38 Max=303
- NumNodesConnected: 8
- NumNodesUsed: 7.0
- ResourceUsage: 0.0/576.0 CPU, 63.0/72.0 GPU
StandardAutoscaler: Launching 1 new nodes
StandardAutoscaler: Error during autoscaling: {} Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/autoscaler/autoscaler.py", line 246, in update
self._update()
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/autoscaler/autoscaler.py", line 308, in _update
min(self.max_concurrent_launches, target_num - len(nodes)))
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/autoscaler/autoscaler.py", line 458, in launch_new_node
"Num nodes failed to increase after creating a new node"
AssertionError: Num nodes failed to increase after creating a new node
StandardAutoscaler [2018-04-12 00:28:04.632214]: 9/9 target nodes (2 updating)
- NodeIdleSeconds: Min=2 Mean=41 Max=308
- NumNodesConnected: 8
- NumNodesUsed: 7.0
- ResourceUsage: 0.0/504.0 CPU, 63.0/63.0 GPU
Removed 1 stale ip mappings: {'172.31.29.215'} not in {'172.31.27.74', '172.31.18.42', '172.31.16.73', '172.31.28.247', '172.31.28.188', '172.31.30.218', '172.31.16.192', '172.31.26.25', '172.31.17.181', '172.31.30.85'}
Removed 1 stale ip mappings: {'172.31.29.215'} not in {'172.31.27.74', '172.31.18.42', '172.31.16.73', '172.31.28.247', '172.31.28.188', '172.31.30.218', '172.31.16.192', '172.31.26.25', '172.31.17.181', '172.31.30.85'}
Removed 1 stale ip mappings: {'172.31.29.215'} not in {'172.31.27.74', '172.31.18.42', '172.31.16.73', '172.31.28.247', '172.31.28.188', '172.31.30.218', '172.31.16.192', '172.31.26.25', '172.31.17.181', '172.31.30.85'}
StandardAutoscaler [2018-04-12 00:28:09.637022]: 9/9 target nodes (2 updating)
- NodeIdleSeconds: Min=0 Mean=0 Max=0
- NumNodesConnected: 7
- NumNodesUsed: 7.0
- ResourceUsage: 0.0/504.0 CPU, 63.0/63.0 GPU
StandardAutoscaler: i-0f50dfb903178000f has runtime state None, want 7fcab4984a7206ef0a66161eed19d8b36597788c
NodeUpdater: Updating i-0f50dfb903178000f to 7fcab4984a7206ef0a66161eed19d8b36597788c, logging to /tmp/node-updater-1khuufqg
StandardAutoscaler [2018-04-12 00:28:14.782816]: 9/9 target nodes (3 updating)
- NodeIdleSeconds: Min=0 Mean=0 Max=0
- NumNodesConnected: 7
- NumNodesUsed: 7.0
- ResourceUsage: 0.0/504.0 CPU, 63.0/63.0 GPU
StandardAutoscaler [2018-04-12 00:28:19.808195]: 9/9 target nodes (3 updating)
- NodeIdleSeconds: Min=0 Mean=0 Max=0
- NumNodesConnected: 7
- NumNodesUsed: 7.0
- ResourceUsage: 0.0/504.0 CPU, 63.0/63.0 GPU
StandardAutoscaler [2018-04-12 00:28:24.845424]: 9/9 target nodes (3 updating)
- NodeIdleSeconds: Min=0 Mean=0 Max=0
- NumNodesConnected: 7
- NumNodesUsed: 7.0
- ResourceUsage: 0.0/504.0 CPU, 63.0/63.0 GPU
StandardAutoscaler [2018-04-12 00:28:29.874830]: 9/9 target nodes (3 updating)
- NodeIdleSeconds: Min=0 Mean=0 Max=0
- NumNodesConnected: 7
- NumNodesUsed: 7.0
- ResourceUsage: 0.0/504.0 CPU, 63.0/63.0 GPU
StandardAutoscaler [2018-04-12 00:28:35.665921]: 9/9 target nodes (3 updating)
- NodeIdleSeconds: Min=0 Mean=0 Max=0
- NumNodesConnected: 7
- NumNodesUsed: 7.0
- ResourceUsage: 0.0/504.0 CPU, 63.0/63.0 GPU
StandardAutoscaler [2018-04-12 00:28:40.068353]: 9/9 target nodes (3 updating)
- NodeIdleSeconds: Min=0 Mean=0 Max=0
--
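To make the failure mode easier to discuss, here is a minimal sketch of the kind of before/after count check that the assertion message in the traceback suggests. It is not Ray's code: FakeProvider, list_nodes, create_node, and try_launch are hypothetical names. The idea that just-terminated nodes are still listed before the launch and drop out during it is only one possible explanation, and the logs above do not confirm it.

```python
class FakeProvider:
    """Hypothetical stand-in for a cloud node provider (not Ray's API)."""

    def __init__(self, running, shutting_down):
        self.running = set(running)
        # Nodes already asked to terminate but still visible in listings.
        self.shutting_down = set(shutting_down)
        self._next_id = 0

    def list_nodes(self):
        return self.running | self.shutting_down

    def create_node(self):
        self._next_id += 1
        self.running.add("new-{}".format(self._next_id))
        # Assumption: terminated nodes disappear from the listing around now.
        self.shutting_down.clear()


def launch_new_node(provider):
    # Count nodes, create one, then require that the count went up.
    num_before = len(provider.list_nodes())
    provider.create_node()
    num_after = len(provider.list_nodes())
    assert num_after > num_before, (
        "Num nodes failed to increase after creating a new node")


def try_launch(provider):
    """Return "ok" on success, or the assertion message on failure."""
    try:
        launch_new_node(provider)
        return "ok"
    except AssertionError as exc:
        return str(exc)
```

With no node shutting down, the check passes. With one node still listed while shutting down, the count goes from 3 to 3 and the check fails, even though a node was created. That would match the "Terminating idle node" lines appearing right before "Launching ... new nodes" in the log.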
The other issue that shows up here is this:
--
- NumNodesUsed: 5.67
- ResourceUsage: 0.0/648.0 CPU, 51.0/81.0 GPU
StandardAutoscaler [2018-04-12 08:30:48.135179]: 8/8 target nodes
- NodeIdleSeconds: Min=0 Mean=59 Max=266
- NumNodesConnected: 9
- NumNodesUsed: 5.67
- ResourceUsage: 0.0/648.0 CPU, 51.0/81.0 GPU
StandardAutoscaler [2018-04-12 08:30:53.172798]: 8/8 target nodes
- NodeIdleSeconds: Min=0 Mean=60 Max=271
- NumNodesConnected: 9
- NumNodesUsed: 5.67
- ResourceUsage: 0.0/648.0 CPU, 51.0/81.0 GPU
StandardAutoscaler [2018-04-12 08:30:58.382863]: 8/8 target nodes
- NodeIdleSeconds: Min=0 Mean=61 Max=277
- NumNodesConnected: 9
- NumNodesUsed: 5.67
- ResourceUsage: 0.0/648.0 CPU, 51.0/81.0 GPU
StandardAutoscaler [2018-04-12 08:31:03.363217]: 8/8 target nodes
- NodeIdleSeconds: Min=0 Mean=62 Max=282
- NumNodesConnected: 9
- NumNodesUsed: 5.67
- ResourceUsage: 0.0/648.0 CPU, 51.0/81.0 GPU
StandardAutoscaler [2018-04-12 08:31:08.471393]: 8/8 target nodes
- NodeIdleSeconds: Min=0 Mean=63 Max=287
- NumNodesConnected: 9
- NumNodesUsed: 5.67
- ResourceUsage: 0.0/648.0 CPU, 51.0/81.0 GPU
StandardAutoscaler [2018-04-12 08:31:13.574426]: 8/8 target nodes
- NodeIdleSeconds: Min=0 Mean=64 Max=292
- NumNodesConnected: 9
- NumNodesUsed: 5.67
- ResourceUsage: 0.0/648.0 CPU, 51.0/81.0 GPU
StandardAutoscaler [2018-04-12 08:31:18.571424]: 8/8 target nodes
- NodeIdleSeconds: Min=0 Mean=65 Max=297
- NumNodesConnected: 9
- NumNodesUsed: 5.67
- ResourceUsage: 0.0/648.0 CPU, 51.0/81.0 GPU
StandardAutoscaler [2018-04-12 08:31:23.713460]: 8/8 target nodes
- NodeIdleSeconds: Min=0 Mean=67 Max=302
- NumNodesConnected: 9
- NumNodesUsed: 5.67
- ResourceUsage: 0.0/648.0 CPU, 51.0/81.0 GPU
StandardAutoscaler: Terminating idle node: i-0057848d7c2279b1f
StandardAutoscaler: Terminating idle node: i-016a30fe73374594f
StandardAutoscaler [2018-04-12 08:31:24.528971]: 6/8 target nodes
- NodeIdleSeconds: Min=1 Mean=67 Max=303
- NumNodesConnected: 9
- NumNodesUsed: 5.67
- ResourceUsage: 0.0/648.0 CPU, 51.0/81.0 GPU
StandardAutoscaler: Launching 2 new nodes
StandardAutoscaler: Error during autoscaling: {} Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/autoscaler/autoscaler.py", line 246, in update
self._update()
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/autoscaler/autoscaler.py", line 308, in _update
min(self.max_concurrent_launches, target_num - len(nodes)))
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/autoscaler/autoscaler.py", line 458, in launch_new_node
"Num nodes failed to increase after creating a new node"
AssertionError: Num nodes failed to increase after creating a new node
StandardAutoscaler [2018-04-12 08:42:00.481457]: 3/8 target nodes
- NodeIdleSeconds: Min=5 Mean=212 Max=939
- NumNodesConnected: 9
- NumNodesUsed: 5.67
- ResourceUsage: 0.0/504.0 CPU, 51.0/63.0 GPU
Removed 7 stale ip mappings: {'172.31.27.74', '172.31.22.168', '172.31.18.42', '172.31.16.73', '172.31.28.188', '172.31.18.120', '172.31.30.85'} not in {'172.31.24.65', '172.31.21.3', '172.31.26.25', '172.31.30.218'}
Removed 7 stale ip mappings: {'172.31.27.74', '172.31.22.168', '172.31.18.42', '172.31.16.73', '172.31.28.188', '172.31.18.120', '172.31.30.85'} not in {'172.31.24.65', '172.31.21.3', '172.31.26.25', '172.31.30.218'}
Removed 7 stale ip mappings: {'172.31.27.74', '172.31.22.168', '172.31.18.42', '172.31.16.73', '172.31.28.188', '172.31.18.120', '172.31.30.85'} not in {'172.31.24.65', '172.31.21.3', '172.31.26.25', '172.31.30.218'}
StandardAutoscaler: i-056b298d20ffb7d16 has runtime state None, want 7fcab4984a7206ef0a66161eed19d8b36597788c
NodeUpdater: Updating i-056b298d20ffb7d16 to 7fcab4984a7206ef0a66161eed19d8b36597788c, logging to /tmp/node-updater-gj5ktv7o
Warning: could not find ip for client 26fc174ef12ac7c6b430db8821a984be61c9df42.
Warning: could not find ip for client 144919cdb0d5cdeb6cdb1828f96d1372d472f9d8.
Warning: could not find ip for client 4e88ecf12e7cb9751f7d95566d0eb25268d89456.
Warning: could not find ip for client dc53f86651004cd1371f943b1bc964317b1f2ac8.
Warning: could not find ip for client 32485125b848396d1f3ebdc91b31b2ae7023a9cb.
Warning: could not find ip for client 26fc174ef12ac7c6b430db8821a984be61c9df42.
Warning: could not find ip for client 144919cdb0d5cdeb6cdb1828f96d1372d472f9d8.
Warning: could not find ip for client 4e88ecf12e7cb9751f7d95566d0eb25268d89456.
Warning: could not find ip for client dc53f86651004cd1371f943b1bc964317b1f2ac8.
Warning: could not find ip for client 32485125b848396d1f3ebdc91b31b2ae7023a9cb.
Warning: could not find ip for client 26fc174ef12ac7c6b430db8821a984be61c9df42.
Warning: could not find ip for client 144919cdb0d5cdeb6cdb1828f96d1372d472f9d8.
Warning: could not find ip for client 4e88ecf12e7cb9751f7d95566d0eb25268d89456.
Warning: could not find ip for client dc53f86651004cd1371f943b1bc964317b1f2ac8.
Warning: could not find ip for client 32485125b848396d1f3ebdc91b31b2ae7023a9cb.
Warning: could not find ip for client 26fc174ef12ac7c6b430db8821a984be61c9df42.
Warning: could not find ip for client 144919cdb0d5cdeb6cdb1828f96d1372d472f9d8.
Warning: could not find ip for client 4e88ecf12e7cb9751f7d95566d0eb25268d89456.
Warning: could not find ip for client dc53f86651004cd1371f943b1bc964317b1f2ac8.
Warning: could not find ip for client 32485125b848396d1f3ebdc91b31b2ae7023a9cb.
Warning: could not find ip for client 26fc174ef12ac7c6b430db8821a984be61c9df42.
Warning: could not find ip for client 144919cdb0d5cdeb6cdb1828f96d1372d472f9d8.
Warning: could not find ip for client 4e88ecf12e7cb9751f7d95566d0eb25268d89456.
Warning: could not find ip for client dc53f86651004cd1371f943b1bc964317b1f2ac8.
Warning: could not find ip for client 32485125b848396d1f3ebdc91b31b2ae7023a9cb.
Warning: could not find ip for client 26fc174ef12ac7c6b430db8821a984be61c9df42.
Warning: could not find ip for client 144919cdb0d5cdeb6cdb1828f96d1372d472f9d8.
Warning: could not find ip for client 4e88ecf12e7cb9751f7d95566d0eb25268d89456.
Warning: could not find ip for client dc53f86651004cd1371f943b1bc964317b1f2ac8.
Warning: could not find ip for client 32485125b848396d1f3ebdc91b31b2ae7023a9cb.
Warning: could not find ip for client 26fc174ef12ac7c6b430db8821a984be61c9df42.
Warning: could not find ip for client 144919cdb0d5cdeb6cdb1828f96d1372d472f9d8.
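For anyone triaging similar monitor logs, the following is a rough sketch of a helper that pulls the status snapshots out of the text and reports the gap between NumNodesConnected and the managed node count. The function names and the dictionary layout are assumptions made for illustration. The regular expressions are based only on the line formats visible in the excerpts above. Whether NumNodesConnected includes the head node is not stated in the logs, so the helper reports the raw difference and leaves interpretation to the reader.

```python
import re

# Matches e.g. "StandardAutoscaler [2018-04-12 00:27:38.671127]: 7/9 target nodes"
STATUS_RE = re.compile(
    r"StandardAutoscaler \[(?P<ts>[^\]]+)\]: "
    r"(?P<current>\d+)/(?P<target>\d+) target nodes")
CONNECTED_RE = re.compile(r"NumNodesConnected:\s*(\d+)")


def parse_snapshots(lines):
    """Collect one dict per status line, with its NumNodesConnected value."""
    snapshots = []
    for line in lines:
        status = STATUS_RE.search(line)
        if status:
            snapshots.append({
                "ts": status.group("ts"),
                "current": int(status.group("current")),
                "target": int(status.group("target")),
                "connected": None,
            })
            continue
        connected = CONNECTED_RE.search(line)
        # Attach the count to the most recent status line, if there is one.
        if connected and snapshots:
            snapshots[-1]["connected"] = int(connected.group(1))
    return snapshots


def connection_gaps(snapshots):
    """Return (timestamp, connected - current) for snapshots with both values."""
    return [(s["ts"], s["connected"] - s["current"])
            for s in snapshots if s["connected"] is not None]
```

A typical use would be `connection_gaps(parse_snapshots(open(path)))`, where `path` points at a monitor log file. A gap that changes right after a "Terminating idle node" line is a possible place to start looking.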