Fix test_actor_lifetime_load_balancing #5463

Merged
raulchen merged 1 commit into ray-project:master from antgroup:fix_test on Aug 17, 2019

Conversation

@raulchen (Contributor) commented Aug 16, 2019

Why are these changes needed?

Increase the timeout because the test is slow in CI.

Linter

  • I've run scripts/format.sh to lint the changes in this PR.

@stephanie-wang (Contributor) left a comment

Thanks!

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16327/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16328/

Collaborator

What is the purpose of the timing measurement in this test? We can probably just remove it, right?

Contributor

Hmm true, I forgot that the point of this test was that it used to hang.

Maybe we should just increase the timeout instead.
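For reference, a minimal sketch of what bumping the per-test limit would look like. The `pytest.mark.timeout` marker is provided by the pytest-timeout plugin; the 60-second value and the test body here are illustrative placeholders, not the values from this PR:

```python
import pytest

# Hypothetical sketch: raising the per-test timeout for slow CI machines.
# The marker only takes effect when the pytest-timeout plugin is installed;
# 60 seconds is an illustrative value, not the one used in the PR.
@pytest.mark.timeout(60)
def test_actor_lifetime_load_balancing_sketch():
    # The real test would build a Ray cluster and create actors here.
    pass
```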

Contributor Author

@stephanie-wang do you want to just increase the total timeout? I saw that the PR title was "add regression test for actor creation" and thought your purpose was to make sure that creating actors wouldn't be too slow.

Collaborator

In general, I think it's acceptable if a bug in a PR causes a test to hang (then we'll notice and fix the bug before it gets merged), so I'd be ok with no timeout, but if either of you feel strongly we can keep the timeout and make it longer.

Contributor Author

OK, fixed per your comment.

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16345/

@raulchen raulchen merged commit 657ce4b into ray-project:master Aug 17, 2019
@raulchen raulchen deleted the fix_test branch August 17, 2019 09:25
@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16362/

@robertnishihara (Collaborator) commented Aug 18, 2019

@raulchen I just saw this test fail again in https://travis-ci.com/ray-project/ray/jobs/226080214.

Looks like it timed out after 20 seconds. I posted the failure below.

It's very surprising that it would take that long. Nevertheless, perhaps we should just get rid of the timeout.

______________________ test_actor_lifetime_load_balancing ______________________

ray_start_cluster = <ray.tests.cluster_utils.Cluster object at 0x7f44cc3dbe48>

    @pytest.mark.skipif(
        pytest_timeout is None,
        reason="Timeout package not installed; skipping test that may hang.")
    @pytest.mark.timeout(20)
    def test_actor_lifetime_load_balancing(ray_start_cluster):
        cluster = ray_start_cluster
        cluster.add_node(num_cpus=0)
        num_nodes = 3
        for i in range(num_nodes):
            cluster.add_node(num_cpus=1)
        ray.init(redis_address=cluster.redis_address)

        @ray.remote(num_cpus=1)
        class Actor(object):
            def __init__(self):
                pass

            def ping(self):
                return

        actors = [Actor.remote() for _ in range(num_nodes)]
>       ray.get([actor.ping.remote() for actor in actors])

python/ray/tests/test_actor.py:929:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
python/ray/worker.py:2277: in get
    values = worker.get_object(object_ids)
python/ray/worker.py:586: in get_object
    int(0.01 * len(unready_ids)),
python/ray/worker.py:456: in retrieve_and_deserialize
    with_meta=True,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
args = ([ObjectID(d133b594d3f91a6cc2eb0100000000c001000000), ObjectID(8af3aad452ac88b2e2550100000000c001000000), ObjectID(0ef84f11f157ad5f8c7f0100000000c001000000)], 1000)
kwargs = {'with_meta': True}

    @functools.wraps(orig_attr)
    def _wrapper(*args, **kwargs):
        with self.lock:
>           return orig_attr(*args, **kwargs)
E           Failed: Timeout >20.0s

python/ray/utils.py:519: Failed
----------------------------- Captured stderr call -----------------------------
+++++++++++++++++++++++++++++++++++ Timeout ++++++++++++++++++++++++++++++++++++
~~~~~~~~~~ Stack of ray_push_profiling_information (139933494851328) ~~~~~~~~~~~
  File "/home/travis/miniconda/lib/python3.6/threading.py", line 884, in _bootstrap
    self._bootstrap_inner()
  File "/home/travis/miniconda/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/travis/miniconda/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/travis/build/ray-project/ray/python/ray/profiling.py", line 104, in _periodically_flush_profile_events
    self.threads_stopped.wait(timeout=1)
  File "/home/travis/miniconda/lib/python3.6/threading.py", line 551, in wait
    signaled = self._cond.wait(timeout)
  File "/home/travis/miniconda/lib/python3.6/threading.py", line 299, in wait
    gotit = waiter.acquire(True, timeout)
~~~~~~~~~~~~~~~~~~ Stack of ray_print_logs (139934395983616) ~~~~~~~~~~~~~~~~~~~
  File "/home/travis/miniconda/lib/python3.6/threading.py", line 884, in _bootstrap
    self._bootstrap_inner()
  File "/home/travis/miniconda/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/travis/miniconda/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/travis/build/ray-project/ray/python/ray/worker.py", line 1619, in print_logs
    threads_stopped.wait(timeout=0.01)
  File "/home/travis/miniconda/lib/python3.6/threading.py", line 551, in wait
    signaled = self._cond.wait(timeout)
  File "/home/travis/miniconda/lib/python3.6/threading.py", line 299, in wait
    gotit = waiter.acquire(True, timeout)
~~~~~~~~~~~~~ Stack of ray_print_error_messages (139933469673216) ~~~~~~~~~~~~~~
  File "/home/travis/miniconda/lib/python3.6/threading.py", line 884, in _bootstrap
    self._bootstrap_inner()
  File "/home/travis/miniconda/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/travis/miniconda/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/travis/build/ray-project/ray/python/ray/worker.py", line 1675, in print_error_messages_raylet
    threads_stopped.wait(timeout=0.01)
  File "/home/travis/miniconda/lib/python3.6/threading.py", line 551, in wait
    signaled = self._cond.wait(timeout)
  File "/home/travis/miniconda/lib/python3.6/threading.py", line 299, in wait
    gotit = waiter.acquire(True, timeout)
~~~~~~~~~~~~~ Stack of ray_listen_error_messages (139933486458624) ~~~~~~~~~~~~~
  File "/home/travis/miniconda/lib/python3.6/threading.py", line 884, in _bootstrap
    self._bootstrap_inner()
  File "/home/travis/miniconda/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/travis/miniconda/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/travis/build/ray-project/ray/python/ray/worker.py", line 1729, in listen_error_messages_raylet
    threads_stopped.wait(timeout=0.01)
  File "/home/travis/miniconda/lib/python3.6/threading.py", line 551, in wait
    signaled = self._cond.wait(timeout)
  File "/home/travis/miniconda/lib/python3.6/threading.py", line 299, in wait
    gotit = waiter.acquire(True, timeout)
~~~~~~~~~~~~~~~~~ Stack of ray_import_thread (139933478065920) ~~~~~~~~~~~~~~~~~
  File "/home/travis/miniconda/lib/python3.6/threading.py", line 884, in _bootstrap
    self._bootstrap_inner()
  File "/home/travis/miniconda/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/travis/miniconda/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/travis/build/ray-project/ray/python/ray/import_thread.py", line 76, in _run
    self.threads_stopped.wait(timeout=0.01)
  File "/home/travis/miniconda/lib/python3.6/threading.py", line 551, in wait
    signaled = self._cond.wait(timeout)
  File "/home/travis/miniconda/lib/python3.6/threading.py", line 299, in wait
    gotit = waiter.acquire(True, timeout)
+++++++++++++++++++++++++++++++++++ Timeout ++++++++++++++++++++++++++++++++++++

@raulchen
Contributor Author

Well, I verified that this test passed in all 4 CI jobs before merging this PR.
Maybe we should remove the timeout and see whether this test is actually hanging, or simply too slow.
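One way to tell those two cases apart is to drop the hard deadline and instead record how long the workload takes, so CI logs show "slow" versus "hung". A minimal, Ray-free sketch of that measurement pattern (the `run_and_time` helper is hypothetical, not part of the Ray test suite):

```python
import time

def run_and_time(fn):
    """Run fn and return (result, elapsed_seconds).

    Instead of hard-killing a test at a fixed deadline, the elapsed time
    is reported, so logs reveal whether a run was merely slow.
    """
    start = time.monotonic()
    result = fn()
    return result, time.monotonic() - start

# Example: time a trivial workload in place of the actor-creation test.
result, elapsed = run_and_time(lambda: sum(range(1000)))
print(f"result={result}, elapsed={elapsed:.3f}s")
```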

