Skip to content

Deprecate --nprocs option for dask-worker CLI#5641

Merged
jrbourbeau merged 18 commits intodask:mainfrom
bryanwweber:deprecate-nprocs
Feb 2, 2022
Merged

Deprecate --nprocs option for dask-worker CLI#5641
jrbourbeau merged 18 commits intodask:mainfrom
bryanwweber:deprecate-nprocs

Conversation

@bryanwweber
Copy link
Contributor

@bryanwweber bryanwweber commented Jan 6, 2022

Per #2471, nprocs is potentially confusing and should be replaced by num-workers.

Per #2471, nprocs is confusing and should be replaced by num-workers.
@GPUtester
Copy link
Collaborator

Can one of the admins verify this patch?

@jrbourbeau
Copy link
Member

add to allowlist

@bryanwweber bryanwweber marked this pull request as draft January 6, 2022 17:21
@jrbourbeau
Copy link
Member

cc @guillaumeeb for visibility (as this change will impact dask-jobqueue)

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @bryanwweber! I know this is still a draft PR, but I still wanted to leave some early feedback.

Also, unfortunately distributed has several known flaky tests at the moment. The test failures popping up here are all unrelated to the changes in this PR.

@bryanwweber
Copy link
Contributor Author

@jrbourbeau Sure, thanks for the review! I'll take a look tomorrow. I also noticed that edge case, but wanted to follow the pre-existing convention for the bokeh/dashboard options. I'll be happy to update.

I just pushed 3 more commits to catch the few other places where nprocs was used. I'm not sure if I should also update the old_ssh interface, so I put those changes into a separate commit to make it easy to drop.

@bryanwweber bryanwweber marked this pull request as ready for review January 7, 2022 17:39
@guillaumeeb
Copy link
Member

cc @guillaumeeb for visibility (as this change will impact dask-jobqueue)

Thanks @jrbourbeau, I'll try to follow the evolution here!

@jrbourbeau
Copy link
Member

@jacobtomlinson if you get a moment, would you mind taking a look at the SSHCluster-related changes here?

@bryanwweber
Copy link
Contributor Author

@jrbourbeau I can't reproduce this test failure locally: https://github.com/dask/distributed/runs/4791743011?check_suite_focus=true#step:12:182 possibly because I'm on a mac and this is a Windows failure?

jacobtomlinson
jacobtomlinson previously approved these changes Jan 12, 2022
Copy link
Member

@jacobtomlinson jacobtomlinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SSH changes look reasonable to me.

This will be breaking for old_ssh but perhaps we should think about removing that now anyway.

@jacobtomlinson jacobtomlinson dismissed their stale review January 12, 2022 16:37

Further information

@jacobtomlinson
Copy link
Member

Sorry for the confusion with the review dismissal. I'm happy with this change.

We should remove old_ssh at some point, but that requires migrating the CLI over to use it.

@bryanwweber
Copy link
Contributor Author

bryanwweber commented Jan 18, 2022

@jrbourbeau I tried to use Client as we discussed to test dask_ssh but the test always times out. I wonder if this is the cause:

# Monitor the output of remote processes. This blocks until the user issues a KeyboardInterrupt.
c.monitor_remote_processes()

I suspect this might be why I needed to issue the SIGINT to get any output, if that function is blocking until SIGINT is sent?

@graingert
Copy link
Member

@jrbourbeau I tried to use Client as we discussed to test dask_ssh but the test always times out. I wonder if this is the cause:

# Monitor the output of remote processes. This blocks until the user issues a KeyboardInterrupt.
c.monitor_remote_processes()

I suspect this might be why I needed to issue the SIGINT to get any output, if that function is blocking until SIGINT is sent?

I suspect this is because the stdout stream is block-buffered by default (when non-interactive) you could set PYTHONUNBUFFERED if you want the output as soon as possible.

https://docs.python.org/3/library/sys.html#sys.stderr

@bryanwweber
Copy link
Contributor Author

@graingert Thanks for the thought! I don't think that's the problem here, for a few reasons. The modified test code with Client looks like:

def test_ssh_cli_nprocs_renamed_to_num_workers(loop):
    num_workers = 2
    with popen(
        ["dask-ssh", f"--nprocs={num_workers}", "--nohost", "localhost"]
    ) as cluster:
        with Client("tcp://127.0.0.1:8786", loop=loop) as c:
            c.wait_for_workers(num_workers, timeout="15 seconds")
  1. The timeout failure happens due to pytest-timeout, not because the Client has timed out. I'm running pytest as pytest -s distributed/cli/tests/test_dask_ssh.py::test_ssh_cli_nprocs_renamed_to_num_workers --timeout=30, and the timeout only happens after 30 seconds, not 15.
  2. The -s flag to pytest should result in unbuffered output, if I understand correctly
  3. I also tried modifying the bufsize kwarg to popen to set it to 0, which also did not result in any output
  4. The modified test is not actually testing or waiting on any output to stdout or stderr, but it still times out

@jsignell
Copy link
Member

Sorry to be super late to the conversation, but is there a good reason to use num_ rather than n or n_ as the prefix? And if there is, can we consolidate nthreads and n_workers to this prefix?

@bryanwweber
Copy link
Contributor Author

bryanwweber commented Jan 19, 2022

Sorry to be super late to the conversation, but is there a good reason to use num_ rather than n or n_ as the prefix? And if there is, can we consolidate nthreads and n_workers to this prefix?

No problem @jsignell! num_workers was suggested in the linked issue, #2471 (comment). n_workers would be fine with me 😄

I believe nthreads and num_workers are orthogonal. nthreads sets the number of threads per process, and num_workers sets the number of processes. But I'm not totally sure about that last part, specifically whether num_workers sets the number of processes 😃

@jsignell
Copy link
Member

Yeah, I just meant consolidate on a prefix :)

@bryanwweber
Copy link
Contributor Author

Thanks @jsignell! Do you prefer nworkers and nthreads or with the underscore? The advantage of nworkers is that it won't require a deprecation cycle with nthreads, but my personal preference is for the underscores.

@jsignell
Copy link
Member

Thanks @jsignell! Do you prefer nworkers and nthreads or with the underscore? The advantage of nworkers is that it won't require a deprecation cycle with nthreads, but my personal preference is for the underscores.

No preference. There is an n_workers already elsewhere (at least in LocalCluster, but possibly other places). Not sure if we should match or be different for greppability (probably match).

@bryanwweber
Copy link
Contributor Author

Whoops, sorry @jacobtomlinson I hit the wrong button there. Feel free to dismiss the review request (or do a review if you want to 😄 )

@bryanwweber
Copy link
Contributor Author

@jsignell Based on a conversation with @jrbourbeau we agreed that using --nworkers for CLI flags and n_workers in Python code would be the most self-consistent approach.

The client only has 5 seconds to connect to the scheduler by default, so
it seems to sometimes time out on CI.
@bryanwweber
Copy link
Contributor Author

@jrbourbeau I think I tracked down the SSH failures to the Client not having enough time on the CI machines to connect to the scheduler. I bumped that timeout, hopefully it will fix it 🤞

@bryanwweber
Copy link
Contributor Author

@jrbourbeau The test failing on Ubuntu/3.9/no ci1 doesn't look like it's due to my change, but I can't quite tell. Do you have any insight? https://github.com/dask/distributed/runs/4968972824?check_suite_focus=true#step:12:1434 Otherwise, waiting on macOS like some other PRs

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all your efforts on this @bryanwweber. Agreed those test failures are unrelated to the changes here. I left a few final review comments, but this looks good overall

nanny_port=None,
remote_dask_worker="distributed.cli.dask_worker",
local_directory=None,
**kwargs,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding **kwargs here will cause invalid keyword arguments to go unnoticed. For example, today a typo of memory_limt instead of memory_limit will raise an error, while with this change as is such a typo will be silently ignored.

Instead I'd rather us either keep an explicit nprocs= keyword around, but still raise a ValueError / FutureWarning accordingly, or do something like (below is pseudocode)

nprocs = kwargs.pop("nprocs")
if kwargs:
    raise TypeError("...")
<existing ValueError / FutureWarning logic>

elif n_workers is None:
n_workers = 1

self.n_workers = n_workers
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a small self.nprocs property which redirects users to self.n_workers instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to add this. However, setting nprocs or n_workers as an attribute (after __init__() runs) doesn't actually change anything about the cluster, unless the user subsequently runs add_worker(). For that matter, it's not all that clear to me what the behavior should be... do all the worker nodes need to be updated with the number of worker processes? Is there a method to do that (there doesn't seem to be from my quick read of the code)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After an offline discussion, we decided that although these attributes are not necessarily meant to be set, a belt-and-suspenders approach to backward compatibility means we should add a property to warn instead of an AttributeError if nprocs is not defined. This is particularly true for the getting case

)
self.n_workers = self.kwargs.pop("nprocs")
else:
self.n_workers = self.kwargs.pop("n_workers", 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment here about adding a self.nprocs property redirecting to self.n_workers

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar response as above 😄

"renamed to n_workers.",
FutureWarning,
)
self.n_workers = self.kwargs.pop("nprocs")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this is needed or not

Suggested change
self.n_workers = self.kwargs.pop("nprocs")
self.n_workers = self.kwargs.pop("nprocs", 1)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't be needed, but wouldn't hurt to add it.

@github-actions
Copy link
Contributor

github-actions bot commented Feb 2, 2022

Unit Test Results

       18 files  +       1         18 suites  +1   9h 48m 14s ⏱️ + 20m 49s
  2 572 tests +     11    2 494 ✔️ +     15       78 💤 ±    0  0  - 4 
23 014 runs  +1 432  21 428 ✔️ +1 316  1 586 💤 +120  0  - 4 

Results for commit 353d5ce. ± Comparison against base commit 713204b.

@jrbourbeau jrbourbeau changed the title Deprecate nprocs keyword from dask-worker Deprecate --nprocs option for dask-worker CLI Feb 2, 2022
Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @bryanwweber! This is in

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Remove --nprocs keyword from dask-worker command?

7 participants