[Dashboard] Turn on New Dashboard by Default#11321
[Dashboard] Turn on New Dashboard by Default#11321edoakes merged 25 commits intoray-project:masterfrom
Conversation
…t. This will likely fail and I will need to remove some testing associated with the old dashboard.
…cs. Test code that has been removed has been added to the dashboard sub tests
… dashboard does not run the reporter process on its head node
…t it is still failing for some reason
…tirely sure why the new_dashboard dir fails.
…test suites, cross platform. I removed it from test-wheels because it seemed to inexplicably hit import errors where it does not in other tests. I think it should have no impact removing from here.
|
Looks like it passes all tests it should at this point. |
dashboard/agent.py
Outdated
| type=int, | ||
| help="The port to expose metrics through Prometheus.") | ||
| parser.add_argument( | ||
| "--metrics-agent-port", |
There was a problem hiding this comment.
I think it's better to call it --dashboard-agent-port, because many agent grpc services will share this port. For example, reporter agent, job agent, ...
There was a problem hiding this comment.
Good point, thank you
dashboard/agent.py
Outdated
| self.server = aiogrpc.server(options=(("grpc.so_reuseport", 0), )) | ||
| self.grpc_port = self.server.add_insecure_port("[::]:0") | ||
| self.grpc_port = self.server.add_insecure_port( | ||
| f"[::]:{self.metrics_agent_port}") |
There was a problem hiding this comment.
Also, it's better to rename it to dashboard_agent_port.
| import psutil | ||
|
|
||
| logger = logging.getLogger(__name__) | ||
| routes = dashboard_utils.ClassMethodRouteTable |
There was a problem hiding this comment.
routes is for RESTful APIs, but i can't find any routes usages in reporter_agent.
There was a problem hiding this comment.
Good catch, it shouldn't be there, thanks.
There was a problem hiding this comment.
So bizarre, I must have pushed these changes to a different branch or something, I remember addressing these this morning.
| async def ReportMetrics(self, request, context): | ||
| # NOTE: Exceptions are not propagated properly | ||
| # when we don't catch them here. | ||
| async def ReportOCMetrics(self, request, context): |
There was a problem hiding this comment.
Can you add a comment for this method? It's hard to get what OC means.
There was a problem hiding this comment.
ping on adding a comment. what's OC?
python/ray/node.py
Outdated
| stdout_file, stderr_file = self.get_log_file_handles( | ||
| "dashboard", unique=True) | ||
| stdout_file, stderr_file = self.get_log_file_handles( | ||
| "dashboard", unique=True) |
There was a problem hiding this comment.
Dashboard use a file logger by default (the dashboard.log), it's not necessary to redirect the stdout and stderr.
There was a problem hiding this comment.
stdout redirection can neither rotate log files nor fragment log files. It is not good for long running cluster. Three log files may confuse users. (dashboard.out, dashboard.err, dashboard.log)
There was a problem hiding this comment.
Hmm having a separate .err is useful though. Is there any way we can have .out (rotated) and .err?
There was a problem hiding this comment.
Another way to have a seperate .err with rotate feature (not tested):
- stdout_file, stderr_file = None, None
- Two
RotatingFileHandlerinstances,- One level is
WARNING, filename isdashboard.err. - The other level is
INFO, filename isdashboard.out,--logging-levelonly affects this one.
- One level is
The drawback is, the log with level >= WARNING will be printed twice, it will be slower than one dashboard.log.
There was a problem hiding this comment.
I see. I am okay with mixed logs because I agree we should make all logs rotational. @mfitton Can you just revert back to the original (@fyrestone)'s implementation?
…ng the dashboard.
…of building the old dashboard.
|
Found an issue, which was that the wheels being built were not including the compiled new_dashboard. I switched the dir in the build-wheels shell scripts to rectify. |
| dashboard_dir = "dashboard" | ||
| logdir = None | ||
|
|
||
| dashboard_dir = "new_dashboard" |
There was a problem hiding this comment.
I assume we'll remove the "new" prefix in a future PR
There was a problem hiding this comment.
Yeah, that'll be done when we remove the old dashboard, which we decided not to do just yet in case reverting is needed.
| async def ReportMetrics(self, request, context): | ||
| # NOTE: Exceptions are not propagated properly | ||
| # when we don't catch them here. | ||
| async def ReportOCMetrics(self, request, context): |
There was a problem hiding this comment.
ping on adding a comment. what's OC?
| import psutil | ||
|
|
||
| logger = logging.getLogger(__name__) | ||
| routes = dashboard_utils.ClassMethodRouteTable |
|
|
||
| agent = DashboardAgent( | ||
| args.redis_address, | ||
| args.dashboard_agent_port, |
| assert wait_until_server_available(webui_url) | ||
| webui_url = format_web_url(webui_url) | ||
|
|
||
| def actor_killed(PID): |
There was a problem hiding this comment.
Why is the argument uppercase?
There was a problem hiding this comment.
I copied it over. Happy to change.
| actor = actor_groups["Actor"]["entries"][0] | ||
| return actor | ||
|
|
||
| def kill_actor(actor): |
There was a problem hiding this comment.
| def kill_actor(actor): | |
| def kill_actor_using_dashboard(actor): |
| except (KeyError, AssertionError) as e: | ||
| last_exc = e | ||
| time.sleep(.1) | ||
| assert last_exc is None |
There was a problem hiding this comment.
| assert last_exc is None | |
| assert last_exc is None, (f"last exception: {last_exc}") |
There was a problem hiding this comment.
If I assert last_exc is None and last_exc is not None, it already prints an assertion error that says what last_exc was.
There was a problem hiding this comment.
oh but if last_exc is not None, this assertion will fail, and we'd like to print a better message right?
| async def ReportMetrics(self, request, context): | ||
| # NOTE: Exceptions are not propagated properly | ||
| # when we don't catch them here. | ||
| async def ReportOCMetrics(self, request, context): |
|
|
||
| # Build the dashboard so its static assets can be included in the wheel. | ||
| pushd python/ray/dashboard/client | ||
| pushd python/ray/new_dashboard/client |
There was a problem hiding this comment.
Add TODO here to revert it back to dashboard when we delete the old dashboard?
| dashboard_dependencies_present = False | ||
| warning_message = ( | ||
| "Failed to start the dashboard. The dashboard requires Python 3 " | ||
| "as well as 'pip install aiohttp grpcio'.") |
There was a problem hiding this comment.
Is this warning message still valid?
| "--raylet-name={}".format(raylet_name), | ||
| "--temp-dir={}".format(temp_dir), | ||
| f"--redis-address={redis_address}", | ||
| f"--metrics-export-port={metrics_export_port}", |
There was a problem hiding this comment.
Do you mind changing this name to dashboard_agent_port?
There was a problem hiding this comment.
They're different in this case, right? The agent port is where other processes send their metrics whereas the metrics export port is where prometheus actually scrapes the states from.
There was a problem hiding this comment.
Yeah I meant to change metrics_agenet_port. My bad that I was writing a comment in the wrong line.
There was a problem hiding this comment.
Since metrics agent is just the dashboard agent now
python/ray/node.py
Outdated
| stdout_file, stderr_file = self.get_log_file_handles( | ||
| "dashboard", unique=True) | ||
| stdout_file, stderr_file = self.get_log_file_handles( | ||
| "dashboard", unique=True) |
There was a problem hiding this comment.
Hmm having a separate .err is useful though. Is there any way we can have .out (rotated) and .err?
| import pytest | ||
| import requests | ||
|
|
||
| import ray |
There was a problem hiding this comment.
We probably need to delete this from BUILD.bazel. Same for test_metrics. Also, this test is run as a sanity check for the wheel builds. I think we can do the same thing for the new dashboard.
There was a problem hiding this comment.
I removed both of them from BUILD.bazel
As far as running it as a sanity check for the wheels, that makes sense. I was having trouble running a test inside the dashboard directory in this case. There was an import error importing things from conftest. Given how long the cycle time is in trying to fix something like that over CI, I decided it probably wasn't worth the effort. What do you think?
|
@fyrestone Somehow as an effect, tests that rely on
Ray starts up correctly (I can access the dashboard and see the processes running on my laptop and even connect and spawn actors manually); however, somehow the output never gets successfully read. Do you know how the two could be related? I must be missing something. I'd really appreciate your help digging into this, as I've spent a couple hours on it, and this is the reason I made the change in the first place. I also noticed we use a private function |
|
|
@mfitton About hanging on The process tree is,
A solution is to redirect the stdout & stderr to devnull: antgroup@59e3250 Ref: https://docs.python.org/3/library/subprocess.html?highlight=popen#subprocess.Popen
|
|
@fyrestone many thanks for your work digging into that. I'll cherry pick the commit onto this branch. |
4dcd921 to
af75260
Compare
Why are these changes needed?
This PR turns on the new dashboard backend by default.
Related issue number
Checks
scripts/format.shto lint the changes in this PR.