
Failing test_climatic_mean  #253

@gjoseph92

Description


Copied from #243 (comment). test_climatic_mean is currently skipped since it fails in every CI run, but this should be investigated and the underlying cause fixed.

When I run this test on its own against coiled-runtime 0.0.4, it passes. The dashboard looks as it should and tasks complete quickly. There's a little bit of spillage, but not much.

However, it always fails in the full CI job. After a lot of pain watching GitHub CI logs and clicking at random on clusters in the Coiled dashboard, I managed to find the cluster that was running the test and watch it. The dashboard looked considerably worse: more data spilled to disk, and workers kept running out of memory and restarting. Progress was extremely slow and rolled back every time a worker died.

[Screenshot (2022-08-16): dashboard showing heavy spilling and worker restarts]

Theories for why it's failing:

  • On distributed==2022.6.0, MALLOC_TRIM_THRESHOLD_ hasn't been set yet by default. That might make the difference. Note though that the test passes even without it being set, if it's run on a fresh cluster. So that's clearly not the only problem. Plus, we're client.restart()-ing the workers before every test, so the workers should be in the same brand-new state regardless of whether the test is run on its own, or after others. However, client.restart() doesn't do that much to the scheduler, so maybe that's where the problem is.
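    As a concrete sketch of ruling this theory out: `MALLOC_TRIM_THRESHOLD_` is a glibc malloc tunable read from the environment at process start, so it has to be set before the worker processes spawn. The value 65536 below is the default distributed later adopted; treat both it and the `distributed.nanny.environ` route as assumptions to verify against your version.

    ```python
    import os

    # MALLOC_TRIM_THRESHOLD_ (note the trailing underscore -- that's the real
    # glibc env var name) must be in the environment *before* the worker
    # process starts, or glibc won't pick it up. 64 KiB is the value
    # distributed later set by default.
    os.environ["MALLOC_TRIM_THRESHOLD_"] = "65536"

    # With distributed installed, one way to have the Nanny inject it into
    # each spawned worker process instead (hypothetical usage, verify the
    # config key for your version):
    # import dask
    # dask.config.set({"distributed.nanny.environ": {"MALLOC_TRIM_THRESHOLD_": "65536"}})
    ```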

  • We know that every subsequent time you submit a workload to the scheduler, it runs slower and slower, and scheduler memory grows and grows: see "Are reference cycles a performance problem?" dask/distributed#4987 (comment). (There's no reason to think things have changed since that research last year.)

    As the scheduler gets sluggish, it becomes slower both to tell workers about data-consuming downstream tasks to run (instead of the data-producing root tasks they've already been told to run), and to allow them to delete keys that are completed and no longer needed. Note that just because a worker runs a downstream task (like writing a chunk to zarr) doesn't mean it gets to release the upstream data immediately—it must be explicitly told by the scheduler to do so. If the scheduler is slow, the worker will load even more data into memory while keeping around chunks that have already been written to zarr and should have been released.

    Thus we see the double whammy of root task overproduction: as soon as the delicate balance of scheduler latency is thrown off, workers simultaneously produce memory faster than they should and release memory slower than they should.
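    The release half of this dynamic can be sketched with a toy model (pure Python, nothing to do with distributed's actual internals): a worker loads one root chunk per tick and runs its downstream task immediately, but the scheduler's permission to drop the chunk only arrives `release_latency` ticks later. Peak memory then grows linearly with scheduler latency even though throughput is unchanged.

    ```python
    from collections import deque

    def peak_memory(n_chunks: int, release_latency: int) -> int:
        """Toy model: peak number of chunks held in worker memory.

        Each tick the worker loads one root chunk and immediately runs its
        downstream task (e.g. writing to zarr). The chunk can only be
        dropped once the scheduler's release message arrives,
        `release_latency` ticks later.
        """
        in_memory = 0
        peak = 0
        pending_releases = deque()  # ticks at which release messages arrive
        for tick in range(n_chunks + release_latency + 1):
            # process release messages that have arrived by now
            while pending_releases and pending_releases[0] <= tick:
                pending_releases.popleft()
                in_memory -= 1
            if tick < n_chunks:
                in_memory += 1  # load one more root chunk
                pending_releases.append(tick + release_latency)
            peak = max(peak, in_memory)
        return peak

    # A responsive scheduler keeps memory flat; a sluggish one lets it pile up:
    print(peak_memory(100, release_latency=1))   # -> 1
    print(peak_memory(100, release_latency=10))  # -> 10
    ```

    In the toy model, peak memory is simply the release latency (capped at the number of chunks), which is the intuition behind why the same workload fits in memory on a fresh cluster but OOMs once the scheduler has slowed down.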

    Basically, I think this will only be fixed by "Withhold root tasks [no co assignment]" dask/distributed#6614, or by understanding and fixing whatever's causing the scheduler to slow down (which is further out; see "Are reference cycles a performance problem?" dask/distributed#4987).
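    For reference, the root-task-withholding work in dask/distributed#6614 eventually shipped as scheduler-side queuing, controlled by a config knob in distributed releases newer than this issue (the exact key and default value below are assumptions worth verifying against your version):

    ```yaml
    # dask config fragment (e.g. ~/.config/dask/distributed.yaml)
    # Assumes a distributed version new enough to support root task queuing.
    distributed:
      scheduler:
        worker-saturation: 1.1  # queue root tasks; keep ~1.1x nthreads in flight per worker
    ```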
