[Flaky tests] Fix test fork #5671

Merged
simon-mo merged 2 commits into ray-project:master from simon-mo:flaky-tests/test-fork on Sep 10, 2019

@simon-mo commented Sep 9, 2019

No description provided.

Maybe the queue actor takes too long to initialize. That would explain
why we are seeing "Many python processes started", since most of the
Python tasks are blocked on ray.get.

@simon-mo (Contributor, Author) commented:

The reason for the flakiness is as follows:

  • Travis has limited resources (2 cores).
  • When the Queue actor initialization takes too long, all the enqueue tasks are blocked behind the initialization task. Workers running ray.get on enqueue tasks are blocked, so the raylet marks their resources as available and starts more workers (up to 100).
  • When many Python processes start, the operating system begins multiplexing them, and the actor initialization might never be able to make progress.
  • Eventually, we run out of memory from starting so many Python processes.

Fix:

  • Make sure the queue actor is initialized before sending enqueue tasks.
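
The pattern behind this fix can be sketched without Ray, using only the Python standard library. This is an analogy under stated assumptions, not the code from this PR: `QueueActor`, `wait_until_ready`, `run_producers`, and the delay values are all hypothetical names chosen for illustration. The point is that the caller blocks once on a readiness signal before submitting any enqueue work, so producers do not pile up behind a slow initialization.

```python
import queue
import threading
import time


class QueueActor:
    """Hypothetical stand-in for the test's queue actor: slow to start, then usable."""

    def __init__(self, init_delay=0.2):
        self._items = queue.Queue()
        self._ready = threading.Event()
        self._init_delay = init_delay
        # Initialization runs in the background, like a remote actor starting up.
        threading.Thread(target=self._initialize, daemon=True).start()

    def _initialize(self):
        time.sleep(self._init_delay)  # simulate slow actor start-up
        self._ready.set()

    def wait_until_ready(self, timeout=None):
        """Block until initialization finishes; returns False on timeout."""
        return self._ready.wait(timeout)

    def enqueue(self, item):
        if not self._ready.is_set():
            raise RuntimeError("enqueue called before the actor was initialized")
        self._items.put(item)

    def size(self):
        return self._items.qsize()


def run_producers(actor, num_producers=8):
    # Wait once, up front, instead of letting every producer block separately.
    if not actor.wait_until_ready(timeout=5):
        raise TimeoutError("queue actor did not initialize in time")
    threads = [
        threading.Thread(target=actor.enqueue, args=(i,))
        for i in range(num_producers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return actor.size()


if __name__ == "__main__":
    print(run_producers(QueueActor(init_delay=0.05)))
```

In Ray itself, a common way to get the same effect is to call a cheap method on the actor and block on its result with ray.get before submitting the dependent tasks. Whether this PR does exactly that is not shown in this conversation.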

@AmplabJenkins commented:

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16911/

@simon-mo (Contributor, Author) commented:

Reran 3 times on Travis; test_fork passed every time.

@simon-mo merged commit 147e7d4 into ray-project:master on Sep 10, 2019