Skip to content

[Dashboard] Turn on New Dashboard by Default#11321

Merged
edoakes merged 25 commits intoray-project:masterfrom
mfitton:turn-on-new-dashboard-smh
Oct 19, 2020
Merged

[Dashboard] Turn on New Dashboard by Default#11321
edoakes merged 25 commits intoray-project:masterfrom
mfitton:turn-on-new-dashboard-smh

Conversation

@mfitton
Copy link
Copy Markdown
Contributor

@mfitton mfitton commented Oct 9, 2020

Why are these changes needed?

This PR turns on the new dashboard backend by default.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Max Fitton added 18 commits October 5, 2020 15:46
…t. This will likely fail and I will need to remove some testing associated with the old dashboard.
…cs. Test code that has been removed has been added to the dashboard sub tests
… dashboard does not run the reporter process on its head node
…tirely sure why the new_dashboard dir fails.
…test suites, cross platform. I removed it from test-wheels because it seemed to inexplicably hit import errors where it does not in other tests. I think it should have no impact removing from here.
@mfitton mfitton changed the title [WIP] New Dashboard [Dashboard] Turn on New Dashboard by Default Oct 13, 2020
@mfitton mfitton requested review from fyrestone and rkooo567 October 13, 2020 00:56
@mfitton
Copy link
Copy Markdown
Contributor Author

mfitton commented Oct 13, 2020

Looks like it passes all tests it should at this point.

type=int,
help="The port to expose metrics through Prometheus.")
parser.add_argument(
"--metrics-agent-port",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to call it --dashboard-agent-port, because many agent grpc services will share this port. For example, reporter agent, job agent, ...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, thank you

self.server = aiogrpc.server(options=(("grpc.so_reuseport", 0), ))
self.grpc_port = self.server.add_insecure_port("[::]:0")
self.grpc_port = self.server.add_insecure_port(
f"[::]:{self.metrics_agent_port}")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it's better to rename it to dashboard_agent_port.

import psutil

logger = logging.getLogger(__name__)
routes = dashboard_utils.ClassMethodRouteTable
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

routes is for RESTful APIs, but i can't find any routes usages in reporter_agent.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, it shouldn't be there, thanks.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So bizarre, I must have pushed these changes to a different branch or something, I remember addressing these this morning.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

async def ReportMetrics(self, request, context):
# NOTE: Exceptions are not propagated properly
# when we don't catch them here.
async def ReportOCMetrics(self, request, context):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment for this method? It's hard to get what OC means.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I will.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping on adding a comment. what's OC?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is OpenCensus

stdout_file, stderr_file = self.get_log_file_handles(
"dashboard", unique=True)
stdout_file, stderr_file = self.get_log_file_handles(
"dashboard", unique=True)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dashboard use a file logger by default (the dashboard.log), it's not necessary to redirect the stdout and stderr.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I switched it back to the old behavior because a bunch of tests broke in other parts of the codebase. Plus, in this case error and non-error logs go to different places, which I think is desirable.

@rkooo567 @edoakes do either of you have opinions here?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stdout redirection can neither rotate log files nor fragment log files. It is not good for long running cluster. Three log files may confuse users. (dashboard.out, dashboard.err, dashboard.log)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm having a separate .err is useful though. Is there any way we can have .out (rotated) and .err?

Copy link
Copy Markdown
Contributor

@fyrestone fyrestone Oct 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another way to have a seperate .err with rotate feature (not tested):

  • stdout_file, stderr_file = None, None
  • Two RotatingFileHandler instances,
    • One level is WARNING, filename is dashboard.err.
    • The other level is INFO, filename is dashboard.out, --logging-level only affects this one.

The drawback is, the log with level >= WARNING will be printed twice, it will be slower than one dashboard.log.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I am okay with mixed logs because I agree we should make all logs rotational. @mfitton Can you just revert back to the original (@fyrestone)'s implementation?

@mfitton
Copy link
Copy Markdown
Contributor Author

mfitton commented Oct 13, 2020

Found an issue, which was that the wheels being built were not including the compiled new_dashboard. I switched the dir in the build-wheels shell scripts to rectify.

@zhe-thoughts zhe-thoughts linked an issue Oct 13, 2020 that may be closed by this pull request
dashboard_dir = "dashboard"
logdir = None

dashboard_dir = "new_dashboard"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume we'll remove the "new" prefix in a future PR

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that'll be done when we remove the old dashboard, which we decided not to do just yet in case reverting is needed.

async def ReportMetrics(self, request, context):
# NOTE: Exceptions are not propagated properly
# when we don't catch them here.
async def ReportOCMetrics(self, request, context):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping on adding a comment. what's OC?

import psutil

logger = logging.getLogger(__name__)
routes = dashboard_utils.ClassMethodRouteTable
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping


agent = DashboardAgent(
args.redis_address,
args.dashboard_agent_port,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice :)

assert wait_until_server_available(webui_url)
webui_url = format_web_url(webui_url)

def actor_killed(PID):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the argument uppercase?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied it over. Happy to change.

actor = actor_groups["Actor"]["entries"][0]
return actor

def kill_actor(actor):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def kill_actor(actor):
def kill_actor_using_dashboard(actor):

except (KeyError, AssertionError) as e:
last_exc = e
time.sleep(.1)
assert last_exc is None
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert last_exc is None
assert last_exc is None, (f"last exception: {last_exc}")

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I assert last_exc is None and last_exc is not None, it already prints an assertion error that says what last_exc was.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh but if last_exc is not None, this assertion will fail, and we'd like to print a better message right?

async def ReportMetrics(self, request, context):
# NOTE: Exceptions are not propagated properly
# when we don't catch them here.
async def ReportOCMetrics(self, request, context):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is OpenCensus


# Build the dashboard so its static assets can be included in the wheel.
pushd python/ray/dashboard/client
pushd python/ray/new_dashboard/client
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add TODO here to revert it back to dashboard when we delete the old dashboard?

dashboard_dependencies_present = False
warning_message = (
"Failed to start the dashboard. The dashboard requires Python 3 "
"as well as 'pip install aiohttp grpcio'.")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this warning message still valid?

"--raylet-name={}".format(raylet_name),
"--temp-dir={}".format(temp_dir),
f"--redis-address={redis_address}",
f"--metrics-export-port={metrics_export_port}",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind changing this name to dashboard_agent_port?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're different in this case, right? The agent port is where other processes send their metrics whereas the metrics export port is where prometheus actually scrapes the states from.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I meant to change metrics_agenet_port. My bad that I was writing a comment in the wrong line.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since metrics agent is just the dashboard agent now

stdout_file, stderr_file = self.get_log_file_handles(
"dashboard", unique=True)
stdout_file, stderr_file = self.get_log_file_handles(
"dashboard", unique=True)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm having a separate .err is useful though. Is there any way we can have .out (rotated) and .err?

import pytest
import requests

import ray
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably need to delete this from BUILD.bazel. Same for test_metrics. Also, this test is run as a sanity check for the wheel builds. I think we can do the same thing for the new dashboard.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed both of them from BUILD.bazel
As far as running it as a sanity check for the wheels, that makes sense. I was having trouble running a test inside the dashboard directory in this case. There was an import error importing things from conftest. Given how long the cycle time is in trying to fix something like that over CI, I decided it probably wasn't worth the effort. What do you think?

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 14, 2020
@mfitton
Copy link
Copy Markdown
Contributor Author

mfitton commented Oct 16, 2020

@fyrestone
I'm hitting errors in several tests when making the change you recommended of not redirecting stdout and stderr for the dashboard process.

Somehow as an effect, tests that rely on call_ray_start in conftest.py hang indefinitely on this line:

out = ray.utils.decode(subprocess.check_output(command_args, stderr=subprocess.STDOUT))

Ray starts up correctly (I can access the dashboard and see the processes running on my laptop and even connect and spawn actors manually); however, somehow the output never gets successfully read. Do you know how the two could be related? I must be missing something. I'd really appreciate your help digging into this, as I've spent a couple hours on it, and this is the reason I made the change in the first place.

I also noticed we use a private function _run_app instead of the exposed run_app in aiohttp. Why do we do this?

web_server = aiohttp.web._run_app(
            app, host=self.http_host, port=self.http_port)

@fyrestone
Copy link
Copy Markdown
Contributor

http_port

  1. Let me find out why the check_output hangs.

  2. About the _run_app.

    • The run_app not only creates the http server coroutine but also waits the coroutine finished, it means that the run_app will block on the http server exit.
    • The _run_app just create a http server coroutine, so we can run the http server in our event loop.

    There is a way to avoid using private _run_app, just like the agent http server creation,

    • Create a aiohttp.web.Application()
    • Add routes to app.
    • Create a aiohttp.web.AppRunner on app
    • Setup runner
    • Create a aiohttp.web.TCPSite on runner
    • Start site
    • ...

    The _run_app is a shortcut of above steps. Dashboard agent use above steps is to enable CORS. So, I choose the shortcut _run_app for dashboard head.

@fyrestone
Copy link
Copy Markdown
Contributor

fyrestone commented Oct 19, 2020

@mfitton About hanging on check_output. The child (dashboard) inherit fds from parent (ray.scripts.scripts). The dashboard process is long running, it prints stdout / stderr to the parent's fd. So, the check_output can't get the EOF until dashboard exit.

The process tree is,

  • check_output, waiting for the EOF of stdout from [1]
    • [1] ray start ... (ray.scripts.scripts), become a zombie because it's stdout / stderr is not closed.
      • [2] dashboard.py, inherits stdout from [1]

A solution is to redirect the stdout & stderr to devnull: antgroup@59e3250

Ref: https://docs.python.org/3/library/subprocess.html?highlight=popen#subprocess.Popen

With the default settings of None, no redirection will occur; the child’s file handles will be inherited from the parent.

@mfitton
Copy link
Copy Markdown
Contributor Author

mfitton commented Oct 19, 2020

@fyrestone many thanks for your work digging into that. I'll cherry pick the commit onto this branch.

@mfitton mfitton force-pushed the turn-on-new-dashboard-smh branch from 4dcd921 to af75260 Compare October 19, 2020 15:05
@mfitton mfitton removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 19, 2020
@edoakes edoakes merged commit f500292 into ray-project:master Oct 19, 2020
@mfitton mfitton deleted the turn-on-new-dashboard-smh branch October 19, 2020 20:59
mfitton added a commit that referenced this pull request Oct 20, 2020
edoakes pushed a commit that referenced this pull request Oct 20, 2020
mfitton added a commit that referenced this pull request Oct 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[New Dashboard] Turn it on by default.

4 participants