[autoscaler] fix the autoscaling bug for continuously launching failed nodes by AmeerHajAli · Pull Request #11714 · ray-project/ray

AmeerHajAli · 2020-10-30T01:30:49Z

Previously, we classified nodes as pending or running based on node tags rather than on the connected nodes reported by load metrics. This can be problematic because node tags are not reliable. This PR defines running nodes as the connected nodes from LoadMetrics and pending nodes as the remaining nodes (non_terminated_nodes - connected nodes + pending queue). Furthermore, this PR treats nodes that failed to launch as pending, so the number of failed nodes that get launched is bounded by the maximum allowed pending nodes (enforced in the code that handles the max launch concurrency).
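The counting rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names `count_running_and_pending` and `launch_budget`, their parameters, and the `max_pending` cap are all hypothetical, and intersecting the connected set with the non-terminated set is a simplifying assumption.

```python
from typing import Dict, Iterable, Set


def count_running_and_pending(
    non_terminated_nodes: Iterable[str],
    connected_nodes: Set[str],
    pending_launches: int,
) -> Dict[str, int]:
    """Hypothetical helper: classify nodes by connectivity, not by tags."""
    non_terminated = set(non_terminated_nodes)
    # Running: nodes that the load metrics report as connected.
    # (Assumption: only count those the provider still lists as non-terminated.)
    running = non_terminated & connected_nodes
    # Pending: every other non-terminated node, which includes nodes that
    # failed to launch, plus launches still waiting in the pending queue.
    pending = len(non_terminated - connected_nodes) + pending_launches
    return {"running": len(running), "pending": pending}


def launch_budget(pending: int, max_pending: int) -> int:
    """Hypothetical helper: how many more nodes may be launched right now."""
    # Failed nodes count as pending, so they consume this budget and
    # cannot be relaunched without bound.
    return max(0, max_pending - pending)


counts = count_running_and_pending(
    non_terminated_nodes=["a", "b", "c", "d"],
    connected_nodes={"a", "b"},
    pending_launches=1,
)
budget = launch_budget(counts["pending"], max_pending=5)
```

In this sketch, nodes `c` and `d` never connected, so they stay in the pending count together with the one queued launch and reduce the remaining launch budget.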

Unblocks #11615 (merge after merging this please).

Related issue number

Checks

- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/

Testing Strategy

- Unit tests
- Release tests
- This PR is not tested :(

Ameer Haj Ali and others added 30 commits September 24, 2020 00:26

prepare for head node

7248cf9

move command runner interface outside _private

bc43e46

Merge github.com:ray-project/ray

14be0a1

remove space

ab660a8

Eric

1ea0c1f

flake

dad31ae

Merge github.com:ray-project/ray

49bcf56

min_workers in multi node type

16d736d

Merge github.com:ray-project/ray

06911df

fixing edge cases

0d8dddb

eric not idle

fe69ce3

fix target_workers to consider min_workers of node types

35832ed

idle timeout

ca0be53

minor

c9518bd

minor fix

5452d39

test

9e904cd

lint

f9edcbe

eric v2

cb02267

eric 3

5e5d403

min_workers constraint before bin packing

4d44cd8

Merge github.com:ray-project/ray

614abbf

Update resource_demand_scheduler.py

818a63a

Revert "Update resource_demand_scheduler.py"

539b29c

This reverts commit 818a63a.

reducing diff

9a63866

Merge branch 'master' of github.com:AmeerHajAli/ray

7501623

make get_nodes_to_launch return a dict

b4edd21

Merge github.com:ray-project/ray

fc48725

merge

0aef789

weird merge fix

39245a8

auto fill instance types for AWS

c7eb4ad

Ameer Haj Ali and others added 23 commits October 3, 2020 20:15

lets see

556eec8

edward

6e4b290

Limit max launch concurrency

4cab21c

Merge github.com:ray-project/ray

7c0d81c

commenting frac TODO

a49ca55

move to resource demand scheduler

3f41c74

Merge github.com:ray-project/ray

1b7c06d

use STATUS UP TO DATE

be89c0c

Eric

a4032c1

Merge github.com:ray-project/ray

ab93d3b

Merge github.com:ray-project/ray

0f382e7

make logger of gc freed refs debug instead of info

b2b4cc0

Merge github.com:ray-project/ray

95ea031

add cluster name to docker mount prefix directory

1f000e5

grrR

d4253a2

fix tests

a195d6e

moving docker directory to sdk

7f05e7e

move the import to prevent circular dependency

2e16221

smallf fix

64595ec

ian

abaf4c7

Merge github.com:ray-project/ray

668898c

fix max launch concurrency bug to assume failing nodes as pending and…

4a3ea6a

… consider only load_metric's connected nodes as running

small fix

1a2eb0b

AmeerHajAli requested review from ericl and wuisawesome October 30, 2020 01:30

AmeerHajAli assigned ericl and wuisawesome Oct 30, 2020

ericl approved these changes Oct 30, 2020

View reviewed changes

ericl added the tests-ok label ("The tagger certifies test failures are unrelated and assumes personal liability.") Oct 30, 2020

ericl merged commit 7aade46 into ray-project:master Oct 30, 2020

ericl merged 68 commits into ray-project:master from AmeerHajAli:master
