Take into account queue length in autoscaling#5684

Merged
ericl merged 5 commits into ray-project:master from ericl:fix-autoscaler-load
Sep 11, 2019
@ericl ericl commented Sep 11, 2019

Why are these changes needed?

Related issue number

Checks

@ericl ericl changed the title [WIP] Take into account queue length in autoscaling Take into account queue length in autoscaling Sep 11, 2019
        try:
            self._run()
        except Exception:
            logger.exception("Error in monitor loop")
Contributor Author

If you don't do this, the error message is lost forever trying to terminate nodes.

Collaborator

Does this mean nodes probably weren't cleaned up? If so, would be good to print that in the error message.

Contributor Author

That's done below actually
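
For illustration, a minimal sketch of the pattern discussed in this thread: log the exception before attempting cleanup, so the original traceback is recorded even if cleanup itself fails. The class name `MonitorSketch`, the `cleanup_fn` callback, the `events` list, and the re-raise at the end are assumptions made for this example, not the code in this PR.

```python
import logging

logger = logging.getLogger("monitor_sketch")


class MonitorSketch:
    """Hypothetical stand-in for a monitor process; not Ray's actual class."""

    def __init__(self, run_fn, cleanup_fn):
        self._run = run_fn          # main loop body, may raise
        self._cleanup = cleanup_fn  # e.g. terminating worker nodes, may also raise
        self.events = []            # records what happened, for inspection

    def run(self):
        try:
            self._run()
        except Exception:
            # Log first: if cleanup below raises too, the original
            # traceback has already been written to the log.
            logger.exception("Error in monitor loop")
            self.events.append("logged")
            try:
                self._cleanup()
                self.events.append("cleaned")
            except Exception:
                logger.exception("Error during cleanup")
                self.events.append("cleanup_failed")
            # Re-raise the original error from the monitor loop.
            raise
```

With this ordering, a failing cleanup step produces a second log entry instead of replacing the first one.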

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16964/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16966/

Collaborator

@edoakes edoakes left a comment

LGTM

     def testUpdate(self):
         lm = LoadMetrics()
-        lm.update("1.1.1.1", {"CPU": 2}, {"CPU": 1})
+        lm.update("1.1.1.1", {"CPU": 2}, {"CPU": 1}, {})
Collaborator

Add some test cases for the new metric

Contributor Author

Yeah there are a couple entries below in testLoadMessages
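
As a rough illustration of what taking queue length into account could look like, here is a toy model in which `update` accepts a fourth argument for the resources requested by queued tasks. `ToyLoadMetrics` and its `utilization` method are hypothetical names for this sketch. The formula is an assumption and does not claim to match the `LoadMetrics` implementation in this PR.

```python
class ToyLoadMetrics:
    """Toy model, not Ray's LoadMetrics: per-node resources plus queued demand."""

    def __init__(self):
        self.static = {}   # ip -> total resources, e.g. {"CPU": 2}
        self.dynamic = {}  # ip -> currently available resources
        self.load = {}     # ip -> resources requested by queued tasks

    def update(self, ip, static_resources, dynamic_resources, resource_load):
        self.static[ip] = static_resources
        self.dynamic[ip] = dynamic_resources
        self.load[ip] = resource_load

    def utilization(self, resource):
        """Return (used + queued) / total for one resource across all nodes.

        The result can exceed 1.0 when queued demand is larger than capacity,
        which is the signal a scaler could use to add nodes.
        """
        total = sum(r.get(resource, 0) for r in self.static.values())
        available = sum(r.get(resource, 0) for r in self.dynamic.values())
        queued = sum(r.get(resource, 0) for r in self.load.values())
        if total == 0:
            return 0.0
        return (total - available + queued) / total
```

In this model, a node whose CPUs are all busy reports a utilization of 1.0 when nothing is queued, and a higher value once tasks start waiting. Usage alone cannot make that distinction.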

        try:
            self._run()
        except Exception:
            logger.exception("Error in monitor loop")
Collaborator

Suggested change
-            logger.exception("Error in monitor loop")
+            logger.exception("Error in monitor loop.")

Contributor Author

I don't believe in periods in log messages

@ericl ericl merged commit 2fdefe1 into ray-project:master Sep 11, 2019

ericl commented Sep 11, 2019

Merging so I can test this for real on compiled wheels -- had some issues trying to set it up before.

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16978/

ericl commented Sep 12, 2019

I just tested this on a real cluster and it works as expected.
