Skip to content

Add new CLI that is extensible#9283

Merged
jrbourbeau merged 41 commits intodask:mainfrom
douglasdavis:new-cli
Oct 14, 2022
Merged

Add new CLI that is extensible#9283
jrbourbeau merged 41 commits intodask:mainfrom
douglasdavis:new-cli

Conversation

@douglasdavis
Copy link
Member

@douglasdavis douglasdavis commented Jul 16, 2022

New CLI discussed at SciPy sprint

This creates a new dask command line tool that can be augmented by external packages (e.g. distributed) with a [dask_cli] entry point. The new tool can be used with either

$ dask

or

$ python -m dask

Adding an entry point is pretty straightforward; we'll use distributed as an example:

In an external-to-dask module at the location: distributed.cli.dask_scheduler with a function called main we can add to pyproject.toml (PEP-621):

[project.entry-points."dask_cli"]
scheduler = "distributed.cli.dask_scheduler.main"

or an "old school" setup.py or setup.cfg

[options.entry_points]
dask_cli =
    scheduler = distributed.cli.dask_scheduler:main

The function needs to be a click command:

# in a distributed/distributed/cli/dask_scheduler.py file

@click.command(name="scheduler")
@click.option(...)
@click.option(...)
def main():
    ...

Since the scheduler = ... entry point under the dask_cli section is defined in the distributed package; upon import of dask.cli the run_cli function will discover the external scheduler command. This allows use of

$ dask scheduler

or

$ python -m dask scheduler

Related PRs in distributed:

douglasdavis added a commit to douglasdavis/distributed that referenced this pull request Jul 16, 2022
@github-actions github-actions bot added the documentation Improve or add to documentation label Jul 16, 2022
@douglasdavis douglasdavis added feature Something is missing and removed documentation Improve or add to documentation labels Jul 17, 2022
======================

Dask provides a command line tool (``dask``) to accomplish a number of
tasks.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want to change this to reflect the current state of this work. Maybe say that it's experimental and include the version info for when it was added?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this stage I don't think that we should have an experimental label in docs for something top-level and that is this important. At least not if it's going to be in the index. I think that we should invest in this page by adding dask scheduler and whatnot and then not think of it as experimental.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree that is an even better idea!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll third this-- I think the docs will be a large (if not the most significant) part of this PR

@douglasdavis
Copy link
Member Author

douglasdavis commented Jul 18, 2022

pkg_resources is part of setuptools (news to me!) and its use is "discouraged" in favor of importlib from the stdlib. (so developers can use whatever packaging library they prefer, not locked into setuptools).

@github-actions github-actions bot added the documentation Improve or add to documentation label Jul 18, 2022
Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @douglasdavis! Could you provide a bit more context around the motivation for adding this new feature? For example, was this based on user feedback?

Do we want to make click a hard dependency?

I don't have a strong opinion. I view click as a lightweight dependency that already being installed in a lot of cases (e.g. pip install dask[complete], conda install dask), so if it makes other things more straightforward, I wouldn't object to adding it as a dependency

@douglasdavis
Copy link
Member Author

douglasdavis commented Jul 19, 2022

Hey @jrbourbeau! The motivation is a bit multidimensional. In general a group of Dask contributors that were at the SciPy sprint last week discussed improving CLI based on some feedback and a desire to bring CLI tooling under one extensible umbrella. @jacobtomlinson adds some detail on a PR in distributed here: dask/distributed#6738 (comment)

@mrocklin
Copy link
Member

From my perspective this needs docs, and then we should drop anything that isn't implemented (info). I'm good to merge after that.

@douglasdavis
Copy link
Member Author

douglasdavis commented Jul 20, 2022 via email

from dask.cli import cli


def test_version():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be good to also add test for other commands being added (e.g. dask version) and register_third_party

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored the registering code a bit to make registering commands somewhat test-able

@douglasdavis
Copy link
Member Author

we should drop anything that isn't implemented (info).

Just for reference, info is actually being used as a group that can have command children. So far we are using it for

$ dask info versions

(which is calling dask.utils.show_versions). We can add more commands under the info umbrella, and so can other libraries/apps with

from dask.cli import info
import click

@click.command(name="fun")
def fun():
    print(42)
    
info.add_command(fun)

to get

$ dask info fun
42

douglasdavis added a commit to douglasdavis/distributed that referenced this pull request Jul 22, 2022
@jacobtomlinson
Copy link
Member

jacobtomlinson commented Jul 25, 2022

Do we want to make click a hard dependency?

I don't think we should. I'd be interested to explore how much work it would be to also support existing tools that just use argparse, is there some kind of type check we can do and somehow wrap things if they aren't click? It'd be nice for subprojects to choose whatever CLI tool they like, they might want to try typer for example.

pkg_resources is part of setuptools (news to me!) and its use is "discouraged" in favor of importlib from the stdlib. (so developers can use whatever packaging library they prefer, not locked into setuptools).

Ooh this is news to me too! We should definitely switch out pkg_resources in dask-ctl too.

@douglasdavis
Copy link
Member Author

I'd be interested to explore how much work it would be to also support existing tools that just use argparse, is there some kind of type check we can do and somehow wrap things if they aren't click? It'd be nice for subprojects to choose whatever CLI tool they like, they might want to try typer for example.

Typer apps can be used in Click apps, but I haven't been able to find anything that connects argparse and Click. Since the existing tools already use Click, I'm inclined to keep that as the foundation. If we don't make Click a hard dependency, perhaps we can make $ dask usage without click just print a you need to pip or conda install click message.

douglasdavis added a commit to douglasdavis/distributed that referenced this pull request Aug 22, 2022
@douglasdavis douglasdavis marked this pull request as ready for review August 23, 2022 02:49
@douglasdavis
Copy link
Member Author

Alright I think this is ready for review. The partner PR (dask/distributed#6735) should go in with this one, but it relies on installing my dask branch associated with this PR (I've changed all of the CI to use dask at version douglasdavis/dask@new-cli). I'm guessing this situation requires:

  1. This PR is approved and ready to go (don't merge but also don't make any more changes)
  2. Then review Use of new dask CLI distributed#6735
  3. Remove Use of new dask CLI distributed#6735 dependence on this PR's branch
  4. Merge both PRs together

Copy link
Member

@jacobtomlinson jacobtomlinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is generally looking great, thanks @douglasdavis.

I had a go at migrating dask-ctl over to this and it was pretty easy. dask-contrib/dask-ctl#50

I left a couple of small comments but happy to coordinate the merging of this and dask/distributed#6735 when ready. Ping me on Slack if you want to sync up to do it.

Comment on lines +65 to +69
if not isinstance(command, (click.Command, click.Group)):
raise TypeError(
"entry points in 'dask_cli' must be instances of "
"click.Command or click.Group"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this should be a warning instead? If a package registers a bad entrypoint it breaks the whole CLI.

$ dask --help   
Traceback (most recent call last):
  File "/home/jtomlinson/miniconda3/envs/dask/bin/dask", line 33, in <module>
    sys.exit(load_entry_point('dask', 'console_scripts', 'dask')())
  File "/home/jtomlinson/Projects/dask/dask/dask/__main__.py", line 23, in main
    run_cli()
  File "/home/jtomlinson/Projects/dask/dask/dask/cli.py", line 79, in run_cli
    _register_command_ep(cli, ep)
  File "/home/jtomlinson/Projects/dask/dask/dask/cli.py", line 66, in _register_command_ep
    raise TypeError(
TypeError: entry points in 'dask_cli' must be instances of click.Command or click.Group

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes I think a warning would be better, just made that change

"entry points in 'dask_cli' must be instances of "
"click.Command or click.Group"
)
interface.add_command(command)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens in the case of a name conflict?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this out and found that a repeated name will just cause an overwrite. At first I added a commit to raise if this happens, but based on your other comment I decided to do another warning.

@douglasdavis
Copy link
Member Author

thanks @jacobtomlinson! Added a few commits to address those comments

sub-group in `interface`.

"""
command = entry_point.load()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to note that this dynamic registration approach allows third party packages to register arbitrary code to be executed when the dask command is run.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for raising this, it's a fair point.

I don't think we should worry about this though, I agree that it allows nasty people to register an entrypoint that would cause the dask CLI command to run arbitrary code. But they could equally set a console_scripts entrypoint that just overrides the dask command altogether. If multiple packages register the same CLI command they just get overridden, so any package who's name starts with a letter greater than d can override the dask command.

Entrypoints as a concept allows arbitrary packages to register code that will be executed by some other package. Typically we have to trust that everything installed in the Python environment is trusted. There isn't really any other option.

@jrbourbeau jrbourbeau mentioned this pull request Oct 12, 2022
7 tasks
Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, all lingering review comments have been addressed. I think this PR, and the accompanying PR in distributed (xref dask/distributed#6735), are both good to go. Thanks for your work here @douglasdavis

@jrbourbeau jrbourbeau merged commit 0532bb3 into dask:main Oct 14, 2022
rapids-bot bot pushed a commit to rapidsai/dask-cuda that referenced this pull request Jan 19, 2023
In dask/dask#9283 we are adding a new top level `dask` CLI command which can be extended by other modules using entry points. A primary motivation here is to improve discoverability by uniting everything under one tool and allowing folks to run `dask --help` and `dask <subcommand> --help` to learn more about the various tools.

This PR adds a new `click` group called `cuda` and moves the `dask-cuda-worker` command under that group with the name `worker`. This means the `dask-cuda-worker` becomes `dask cuda worker` in the new CLI tool.

I haven't made any changes to the existing `dask-cuda-worker` console script so that will still continue to work, but maybe we should add a deprecation warning to it?

I went with this name rather than `dask cuda-worker` because I think it is more readable and also leaves us open to adding more subcommands in the future without cluttering up the top-level `dask` namespace.

```console
$ dask --help             
Usage: dask [OPTIONS] COMMAND [ARGS]...

  Dask command line interface.

Options:
  --version   Show the version and exit.
  -h, --help  Show this message and exit.

Commands:
  cluster    Manage dask clusters.
  cuda       GPU subcommands.
  docs       Open Dask documentation (https://docs.dask.org/) in a web browser.
  info       Information about your dask installation.
  scheduler  Launch a distributed scheduler.
  ssh        Launch a distributed cluster over SSH.
  worker     Launch a distributed worker attached to an existing SCHEDULER.
```

```console
$ dask cuda --help        
Usage: dask cuda [OPTIONS] COMMAND [ARGS]...

  GPU subcommands.

Options:
  -h, --help  Show this message and exit.

Commands:
  worker  Launch a distributed worker with GPUs attached to an existing SCHEDULER.
```

```console
$ dask cuda worker --help  
Usage: dask cuda worker [OPTIONS] [SCHEDULER] [PRELOAD_ARGV]...

  Launch a distributed worker with GPUs attached to an existing SCHEDULER.

  See https://docs.rapids.ai/api/dask-cuda/stable/quickstart.html#dask-cuda-worker for
  info.

Options:
  --host TEXT                     IP address of serving host; should be visible to the
                                  scheduler and other workers. Can be a string (like
                                  ``"127.0.0.1"``) or ``None`` to fall back on the
                                  address of the interface specified by
                                  ``--interface`` or the default interface.
  --nthreads INTEGER              Number of threads to be used for each Dask worker
                                  process.  [default: 1]
...
```

The CLI PR needs to be merged and released before this can be merged.

Fixes #1038

Authors:
  - Jacob Tomlinson (https://github.com/jacobtomlinson)
  - Ray Douglass (https://github.com/raydouglass)
  - Peter Andreas Entschev (https://github.com/pentschev)
  - Charles Blackmon-Luca (https://github.com/charlesbluca)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)
  - https://github.com/jakirkham
  - Peter Andreas Entschev (https://github.com/pentschev)

URL: #981
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improve or add to documentation feature Something is missing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants