[Metrics] Cache metrics ports in a file at each node#13501

Merged
rkooo567 merged 5 commits intoray-project:masterfrom
architkulkarni:fix-export-metrics
Jan 22, 2021

Conversation

@architkulkarni (Contributor):

Why are these changes needed?

Different drivers started on the same node need to use the same metrics port. Before this PR, each subsequent driver script after the first would choose a new unused port for the metrics agent, so metrics defined in these driver scripts would not display.

Related issue number

Closes #13372

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

"ports_by_node.json")

# Maps a Node.unique_id to a dict that maps port names to port numbers.
ports_by_node: Dict[str, Dict[str, int]] = defaultdict(dict)
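For readers following along, here is a minimal, self-contained sketch of the caching idea in the diff above (the function name and signature are illustrative, not Ray's actual implementation): ports chosen for a node are persisted to a ports_by_node.json file, so a later driver on the same node reuses the cached port instead of picking a new unused one.

```python
# Sketch only: persist chosen ports per node in a JSON file so that
# subsequent drivers on the same node reuse them. Names are hypothetical.
import json
import os
from collections import defaultdict
from typing import Dict


def get_cached_port(file_path: str, node_id: str, port_name: str,
                    default_port: int) -> int:
    # Maps a node's unique id to a dict that maps port names to port numbers.
    ports_by_node: Dict[str, Dict[str, int]] = defaultdict(dict)
    if os.path.exists(file_path):
        with open(file_path) as f:
            ports_by_node.update(json.load(f))
    # Reuse the port if one was already cached for this node.
    if port_name in ports_by_node[node_id]:
        return ports_by_node[node_id][port_name]
    # Otherwise cache the default port and return it.
    ports_by_node[node_id][port_name] = default_port
    with open(file_path, "w") as f:
        json.dump(ports_by_node, f)
    return default_port
```

A second call with a different default returns the originally cached port, which is the behavior this PR needs for subsequent drivers.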
Contributor:

This syntax is not available in python 3.6 I think?

Contributor Author:

My Dev environment is 3.6 so I think it's okay!

@rkooo567 (Contributor), Jan 16, 2021:

Hmm, I'm not sure what the issue was before, but I definitely remember this caused some issues... Let's see if CI says something.

self._metrics_export_port = self._get_cached_port(
"metrics_export_port", default_port=ray_params.metrics_export_port)

self._metrics_export_port = ray_params.metrics_export_port
Contributor:

We don't need this block right?

Contributor Author:

Oh, didn't we say that whatever approach we do for metrics_agent_port we should do for metrics_export_port too? I'm not fully familiar with the second one yet so it's your call. Also, I'll do the rename of metrics_agent to dashboard_agent in a separate PR so that this diff is easier to read.

Contributor:

You already assigned this port on line 190, right? Am I missing something?

Contributor:

Lines 193-195 will make line 190 meaningless, IIUC?

Contributor Author:

Ah I see, you're totally right; I forgot to remove lines 193-195. I'll do that.

@architkulkarni (Contributor Author):

pytest test_metrics.py fails on test_metrics_export_end_to_end, but pytest test_metrics.py::test_metrics_export_end_to_end succeeds... investigating

@architkulkarni (Contributor Author):

Yeah, something is not working as expected in either ray.shutdown() or cluster.shutdown(), not sure exactly what yet. On this branch the following code succeeds for num_iterations = 1 but fails for num_iterations = 2 on the second iteration:

import ray
from ray.cluster_utils import Cluster
from ray.util.metrics import Count
from ray.test_utils import wait_for_condition, fetch_prometheus

num_iterations = 2
counters = [None] * num_iterations

for i in range(num_iterations):
    # Start a cluster, record and test a metric, and shut it down.

    print("STARTING UP")
    cluster = Cluster()
    cluster.add_node()
    cluster.wait_for_nodes()
    ray.init(address=cluster.address)

    counter_name = f"test_driver_counter_{i}"
    counters[i] = Count(counter_name, description="desc")
    counters[i].record(i)

    def test():
        node_info_list = ray.nodes()
        prom_addresses = []
        for node_info in node_info_list:
            metrics_export_port = node_info["MetricsExportPort"]
            addr = node_info["NodeManagerAddress"]
            prom_addresses.append(f"{addr}:{metrics_export_port}")

        components_dict, metric_names, metric_samples = fetch_prometheus(
                prom_addresses)
        return any(counter_name in full_name for full_name in metric_names)

    wait_for_condition(test, timeout=30)

    print("SHUTTING DOWN")
    ray.shutdown()
    cluster.shutdown()

@architkulkarni (Contributor Author), Jan 18, 2021:

It seems the issue is just related to ray.shutdown(). Removing all the cluster_utils stuff and just using ray.init() doesn't change anything; it still works with one iteration but not two.

So to summarize, calling ray.init() and ray.shutdown() before calling ray.init() again causes custom metrics defined after the second ray.init() to not be exported. This happens on master and on this branch.

@edoakes (Collaborator) left a comment:

I'm a little bit confused -- is this for the address="auto" case or for the "new driver" case? If it's for the address="auto" case the most logical solution would be to store the port mapping in the GCS and fetch it on startup. I'm not sure we want this behavior for the "new driver" case at all.

I think you mentioned there's some reason not to do that, could you remind me what that is?

@architkulkarni (Contributor Author):

No, I think I might be confused--what's the difference between the two cases? This PR is for when more than one driver script runs on the same Ray cluster and on the same node, so there's a new driver that calls address="auto". Maybe I used the term "job" incorrectly in the title.

Or maybe the source of confusion is that this PR currently fails tests because of a separate issue #13532 which involves the stats not shutting down correctly (in particular, in between different test functions when running pytest)

@edoakes (Collaborator), Jan 19, 2021:

The difference is that in the address=auto case, the metrics agent is started only once (when ray start is called) but in the ray.init() case it's started twice -- in the second scenario we probably don't want this caching behavior.

Assuming this is just solving the first case, the solution seems reasonable but ideally we would store this in the GCS instead of a local file for consistency.

@architkulkarni (Contributor Author), Jan 20, 2021:

Ah, I think I see what you're saying. The cache is keyed by session, so in the case of calling ray.init(), ray.shutdown(), and then ray.init(), the ports from the first ray.init() are not read by the second ray.init() (it's a new session, so it has a separate cache). Does that address the concern?
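To illustrate why keying the cache by session addresses the concern above (the path layout here is hypothetical, not Ray's actual directory structure): if the cache file lives under the session directory, a fresh ray.init() creates a new session and therefore starts with an empty cache, so ports from a previous session are never reused.

```python
# Sketch: each session resolves to its own cache file, so a new session
# after ray.shutdown() cannot see ports cached by the previous one.
import os


def ports_file_for_session(ray_temp_dir: str, session_name: str) -> str:
    # e.g. <temp_dir>/<session_name>/ports_by_node.json (illustrative path)
    return os.path.join(ray_temp_dir, session_name, "ports_by_node.json")


# Two different sessions resolve to two different cache files.
assert (ports_file_for_session("/tmp/ray", "session_A")
        != ports_file_for_session("/tmp/ray", "session_B"))
```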

@rkooo567 (Contributor):

Btw, @allenyin55 said this should be fixed ASAP. Is there anything else we should handle in this PR?

@architkulkarni (Contributor Author), Jan 20, 2021:

Before merging this PR, #13565 needs to be merged so that tests on the current PR pass (since shutdown() is called between tests).

After that, this PR is ready to be merged.

@rkooo567 (Contributor):

Is #13565 a regression? Otherwise, we can just merge this PR first, right?

@edoakes (Collaborator), Jan 20, 2021:

Relevant tests failing @architkulkarni

@architkulkarni (Contributor Author), Jan 20, 2021:

Yes, once #13565 is merged this PR will pass tests. The relevant test currently passes in isolation but not in CI because ray.shutdown() is called in between tests. #13565 fixes the ray.shutdown metrics issue.

@rkooo567 (Contributor), Jan 22, 2021:

cc @edoakes

@architkulkarni Can you just refactor the test to not do ray.init & ray.shutdown again and again and just merge the PR first? (so that we can unblock Allen). Lingxuan also thinks the current fix can cause some issues, so it'd take some time to resolve the blocker.

@architkulkarni (Contributor Author):

Good idea. I moved the relevant test to the beginning of the file so that it runs first, before any ray.shutdown() calls.

@architkulkarni architkulkarni added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Jan 22, 2021
@architkulkarni (Contributor Author):

Only gcs_server_test and test_object_spilling are failing now, and I think those are unrelated and known to be flaky. So this should be ready to merge.

@rkooo567 rkooo567 merged commit da59283 into ray-project:master Jan 22, 2021
@architkulkarni architkulkarni deleted the fix-export-metrics branch January 22, 2021 18:00
fishbone pushed a commit to fishbone/ray that referenced this pull request Feb 16, 2021
* cache metric ports in a file at each node

* remove old assignment of export port

* lint

* lint

* move e2e test to top of file to avoid shutdown bug
fishbone added a commit to fishbone/ray that referenced this pull request Feb 16, 2021

Labels

tests-ok The tagger certifies test failures are unrelated and assumes personal liability.

Development

Successfully merging this pull request may close these issues.

[Metrics] Metrics are not exported when using different jobs

3 participants