[Metrics] Cache metrics ports in a file at each node#13501

Merged
rkooo567 merged 5 commits intoray-project:masterfrom
architkulkarni:fix-export-metrics
Jan 22, 2021

Conversation

@architkulkarni (Contributor):

Why are these changes needed?

Different drivers started on the same node need to use the same metrics port. Before this PR, each subsequent driver script after the first would choose a new unused port for the metrics agent, so metrics defined in these driver scripts would not display.

Related issue number

Closes #13372

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

"ports_by_node.json")

# Maps a Node.unique_id to a dict that maps port names to port numbers.
ports_by_node: Dict[str, Dict[str, int]] = defaultdict(dict)
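For readers following along, here is a minimal, self-contained sketch of the caching idea in the diff above (the function name and signature are illustrative, not Ray's actual implementation): ports chosen for a node are persisted to a ports_by_node.json file, so a later driver on the same node reuses the cached port instead of picking a new unused one.

```python
# Sketch only: persist chosen ports per node in a JSON file so that
# subsequent drivers on the same node reuse them. Names are hypothetical.
import json
import os
from collections import defaultdict
from typing import Dict


def get_cached_port(file_path: str, node_id: str, port_name: str,
                    default_port: int) -> int:
    # Maps a node's unique id to a dict that maps port names to port numbers.
    ports_by_node: Dict[str, Dict[str, int]] = defaultdict(dict)
    if os.path.exists(file_path):
        with open(file_path) as f:
            ports_by_node.update(json.load(f))
    # Reuse the port if one was already cached for this node.
    if port_name in ports_by_node[node_id]:
        return ports_by_node[node_id][port_name]
    # Otherwise cache the default port and return it.
    ports_by_node[node_id][port_name] = default_port
    with open(file_path, "w") as f:
        json.dump(ports_by_node, f)
    return default_port
```

A second call with a different default returns the originally cached port, which is the behavior this PR needs for subsequent drivers.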
Contributor:

This syntax is not available in python 3.6 I think?

Contributor Author:

My Dev environment is 3.6 so I think it's okay!

@rkooo567 (Contributor), Jan 16, 2021:

Hmm, I'm not sure what the issue was before, but I definitely remember this caused some issues... Let's see if CI says something.

self._metrics_export_port = self._get_cached_port(
"metrics_export_port", default_port=ray_params.metrics_export_port)

self._metrics_export_port = ray_params.metrics_export_port
Contributor:

We don't need this block right?

Contributor Author:

Oh, didn't we say that whatever approach we do for metrics_agent_port we should do for metrics_export_port too? I'm not fully familiar with the second one yet so it's your call. Also, I'll do the rename of metrics_agent to dashboard_agent in a separate PR so that this diff is easier to read.

Contributor:

You already assigned this port on line 190, right? Am I missing something?

Contributor:

Lines 193-195 will make line 190 meaningless, IIUC?

Contributor Author:

Ah I see, you're totally right; I forgot to remove lines 193-195. I'll do that.

@architkulkarni (Contributor Author):

pytest test_metrics.py fails on test_metrics_export_end_to_end, but pytest test_metrics.py::test_metrics_export_end_to_end succeeds... investigating

@architkulkarni (Contributor Author):

Yeah, something is not working as expected in either ray.shutdown() or cluster.shutdown(), not sure exactly what yet. On this branch the following code succeeds for num_iterations = 1 but fails for num_iterations = 2 on the second iteration:

import ray
from ray.cluster_utils import Cluster
from ray.util.metrics import Count
from ray.test_utils import wait_for_condition, fetch_prometheus

num_iterations = 2
counters = [None] * num_iterations

for i in range(num_iterations):
    # Start a cluster, record and test a metric, and shut it down.

    print("STARTING UP")
    cluster = Cluster()
    cluster.add_node()
    cluster.wait_for_nodes()
    ray.init(address=cluster.address)

    counter_name = f"test_driver_counter_{i}"
    counters[i] = Count(counter_name, description="desc")
    counters[i].record(i)

    def test():
        node_info_list = ray.nodes()
        prom_addresses = []
        for node_info in node_info_list:
            metrics_export_port = node_info["MetricsExportPort"]
            addr = node_info["NodeManagerAddress"]
            prom_addresses.append(f"{addr}:{metrics_export_port}")

        components_dict, metric_names, metric_samples = fetch_prometheus(
                prom_addresses)
        return any(counter_name in full_name for full_name in metric_names)

    wait_for_condition(test, timeout=30)

    print("SHUTTING DOWN")
    ray.shutdown()
    cluster.shutdown()

@architkulkarni (Contributor Author), Jan 18, 2021:

It seems the issue is just related to ray.shutdown(). Removing all the cluster_utils stuff and just using ray.init() doesn't change anything; it still works with one iteration but not two.

So to summarize, calling ray.init() and ray.shutdown() before calling ray.init() again causes custom metrics defined after the second ray.init() to not be exported. This happens on master and on this branch.

@edoakes (Collaborator) left a comment:

I'm a little bit confused -- is this for the address="auto" case or for the "new driver" case? If it's for the address="auto" case the most logical solution would be to store the port mapping in the GCS and fetch it on startup. I'm not sure we want this behavior for the "new driver" case at all.

I think you mentioned there's some reason not to do that, could you remind me what that is?

@architkulkarni (Contributor Author):

No, I think I might be confused--what's the difference between the two cases? This PR is for when more than one driver script runs on the same Ray cluster and on the same node, so there's a new driver that calls address="auto". Maybe I used the term "job" incorrectly in the title.

Or maybe the source of confusion is that this PR currently fails tests because of a separate issue #13532 which involves the stats not shutting down correctly (in particular, in between different test functions when running pytest)

@edoakes (Collaborator), Jan 19, 2021:

The difference is that in the address=auto case, the metrics agent is started only once (when ray start is called) but in the ray.init() case it's started twice -- in the second scenario we probably don't want this caching behavior.

Assuming this is just solving the first case, the solution seems reasonable but ideally we would store this in the GCS instead of a local file for consistency.

@architkulkarni (Contributor Author), Jan 20, 2021:

Ah, I think I see what you're saying. The cache is keyed by session, so in the case of calling ray.init(), ray.shutdown(), and then ray.init(), the ports from the first ray.init() are not read by the second ray.init() (it's a new session, so it has a separate cache). Does that address the concern?
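To illustrate why keying the cache by session addresses the concern above (the path layout here is hypothetical, not Ray's actual directory structure): if the cache file lives under the session directory, a fresh ray.init() creates a new session and therefore starts with an empty cache, so ports from a previous session are never reused.

```python
# Sketch: each session resolves to its own cache file, so a new session
# after ray.shutdown() cannot see ports cached by the previous one.
import os


def ports_file_for_session(ray_temp_dir: str, session_name: str) -> str:
    # e.g. <temp_dir>/<session_name>/ports_by_node.json (illustrative path)
    return os.path.join(ray_temp_dir, session_name, "ports_by_node.json")


# Two different sessions resolve to two different cache files.
assert (ports_file_for_session("/tmp/ray", "session_A")
        != ports_file_for_session("/tmp/ray", "session_B"))
```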

@rkooo567 (Contributor):

Btw, @allenyin55 said this should be fixed ASAP. Is there anything else we should handle in this PR?

@architkulkarni (Contributor Author), Jan 20, 2021:

Before merging this PR, #13565 needs to be merged so that tests on the current PR pass (since shutdown() is called between tests).

After that, this PR is ready to be merged.

@rkooo567 (Contributor):

Is #13565 a regression? Otherwise, we can just merge this PR first, right?

@edoakes (Collaborator), Jan 20, 2021:

Relevant tests failing @architkulkarni

@architkulkarni (Contributor Author), Jan 20, 2021:

Yes, once #13565 is merged this PR will pass tests. The relevant test currently passes in isolation but not in CI because ray.shutdown() is called in between tests. #13565 fixes the ray.shutdown metrics issue.

@rkooo567 (Contributor), Jan 22, 2021:

cc @edoakes

@architkulkarni Can you just refactor the test to not do ray.init & ray.shutdown again and again and just merge the PR first? (so that we can unblock Allen). Lingxuan also thinks the current fix can cause some issues, so it'd take some time to resolve the blocker.

@architkulkarni (Contributor Author):

Good idea. I moved the relevant test to the beginning of the file so that it runs first, before any ray.shutdown() calls.

@architkulkarni architkulkarni added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Jan 22, 2021
@architkulkarni (Contributor Author):

Only gcs_server_test and test_object_spilling are failing now, and I think those are unrelated and known to be flaky. So this should be ready to merge.

@rkooo567 rkooo567 merged commit da59283 into ray-project:master Jan 22, 2021
@architkulkarni architkulkarni deleted the fix-export-metrics branch January 22, 2021 18:00
fishbone pushed a commit to fishbone/ray that referenced this pull request Feb 16, 2021
* cache metric ports in a file at each node

* remove old assignment of export port

* lint

* lint

* move e2e test to top of file to avoid shutdown bug
fishbone added a commit to fishbone/ray that referenced this pull request Feb 16, 2021

Labels

tests-ok The tagger certifies test failures are unrelated and assumes personal liability.

Development

Successfully merging this pull request may close these issues.

[Metrics] Metrics are not exported when using different jobs

3 participants