Scaling distributed dask to run 27,648 dask workers on the Summit supercomputer

This github distributed dask issue is to track progress and resolve problems associated with scaling dask to run on all the nodes on Oak Ridge National Laboratory (ORNL)’s [Summit](https://www.olcf.ornl.gov/olcf-resources/compute-systems/summit/) supercomputer.  Benjamin Zaitlen of Nvidia suggested that I create this GitHub issue on the RAPIDS-GoAi Slack when I first shared what we were trying to do there; he communicated with the dask development team, and they posed this approach of creating this issue in the project’s GitHub issue tracker.

At ORNL we are using distributed dask to evaluate unique driving scenarios using the [CARLA](http://carla.org/) driving simulator to improve a deep-learner model for driving a virtual autonomous vehicle.  As part of this effort, we are stress testing dask on Summit by gradually adding more Summit nodes to submitted jobs, fixing problems at the point where a given number of nodes has caused issues, and then increasing the number of nodes to address the next set of problems.  

As of April 9th, the maximum number of Summit nodes from which we’ve gotten a successful run is 402 nodes with six dask workers each, for a total of 2,412 dask workers.  We have a job pending with 921 nodes that will push the total number of dask workers to 5,526 dask workers.  Eventually we’d like dask running on all 4,608 Summit nodes, which would have 27,648 dask workers.

In parallel, we’re developing an open-source evolutionary algorithm framework, [LEAP](https://github.com/AureumChaos/LEAP/tree/develop), with colleagues at George Mason University and MITRE that uses dask to concurrently evaluate individuals.  We have used LEAP as the vehicle for stress testing dask on Summit.  The first test problem was a MAX ONES instance where the fitness was the total number of ones in a given bit string.  Since this problem was too trivial for stress testing, we created a new problem that had a single integer as the individual's phenome (that is, genes that have been decoded into something meaningful to the problem, in this case an integer) that would correspond to how many seconds to sleep during evaluation.  This means an individual that had a genome of 10 bits of all ones would cause the dask worker to sleep for about 17 minutes during evaluation.  Moreover, this was a maximization problem, so the longer the sleep periods, the fitter the individual.  (We could probably call this test the UNDERGRAD problem. 😉 )

In the comments below, I’ll detail what I’ve done so far, and relate some of the problems I’ve encountered on this journey.  I welcome comments, tips, suggestions, criticisms, and help with the next inevitable set of problems associated with this non-trivial task.  🙂 

Cheers,

Mark Coletti,
ORNL

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Scaling distributed dask to run 27,648 dask workers on the Summit supercomputer #3691

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Scaling distributed dask to run 27,648 dask workers on the Summit supercomputer #3691

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions