Description
Copied from #243 (comment). `test_climatic_mean` is currently skipped because it fails in every CI run; the underlying cause should be investigated and fixed.
When I run this test on its own against coiled-runtime 0.0.4, it passes: the dashboard looks as it should, tasks complete quickly, and there's only a little spillage.
However, it consistently fails in the full CI job. After much pain watching GitHub CI logs and clicking at random on clusters in the Coiled dashboard, I managed to find the cluster that was running the test and watch it. The dashboard looked worse: more data was spilled to disk, and workers kept running out of memory and restarting. So progress was extremely slow, and kept rolling back every time a worker died.
Theories for why it's failing:
- On distributed==2022.6.0, `MALLOC_TRIM_THRESHOLD_` hasn't been set yet by default. That might make the difference. Note, though, that the test passes even without it being set if it's run on a fresh cluster, so that's clearly not the only problem. Plus, we're `client.restart()`-ing the workers before every test, so the workers should be in the same brand-new state regardless of whether the test is run on its own or after others. However, `client.restart()` doesn't do that much to the scheduler, so maybe that's where the problem is.
- We know that every subsequent time you submit a workload to the scheduler, it runs slower and slower, and scheduler memory grows and grows: Are reference cycles a performance problem? dask/distributed#4987 (comment). (There's no reason to think things have changed since that research last year.)
As the scheduler gets sluggish, it is slower both to tell workers about data-consuming downstream tasks to run (instead of the data-producing root tasks they've already been told to run) and to allow them to delete keys that are completed and no longer needed. Note that just because a worker runs a downstream task (like writing a chunk to zarr) doesn't mean the worker gets to immediately release the upstream data: it must be explicitly told by the scheduler to do so. If the scheduler is slow, the worker will load even more data into memory while keeping around the chunks that have already been written to zarr and should have been released.
Thus we see the double whammy of root task overproduction: as soon as the delicate balance of scheduler latency is thrown off, workers simultaneously produce memory faster than they should and release it slower than they should:
- Workers run twice as many root tasks as they should, causing memory pressure dask/distributed#5223
- [Idea] Could workers sometimes know when to release keys on their own? dask/distributed#5114
Basically, I think this will only be fixed by Withhold root tasks [no co assignment] dask/distributed#6614, or by understanding and fixing whatever is causing the scheduler to slow down (which is further out): Are reference cycles a performance problem? dask/distributed#4987.
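For reference, the `MALLOC_TRIM_THRESHOLD_` theory above can be exercised by setting the glibc knob explicitly in the environment of worker processes. This is only a sketch: the value 65536 and the plain subprocess standing in for a worker are assumptions for illustration, not what CI actually does.

```python
import os
import subprocess
import sys

# Sketch: launch a child process (standing in for a dask worker) with
# MALLOC_TRIM_THRESHOLD_ set. This glibc knob makes free() return memory
# to the OS more eagerly; distributed==2022.6.0 does not set it by default.
# The value 65536 is an assumption, chosen here just for illustration.
env = dict(os.environ, MALLOC_TRIM_THRESHOLD_="65536")
out = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['MALLOC_TRIM_THRESHOLD_'])"],
    env=env,
    capture_output=True,
    text=True,
)
print(out.stdout.strip())  # -> 65536
```

In a real cluster the equivalent would be to inject the variable into the worker (nanny) environment via dask's configuration rather than a raw subprocess, but the mechanism — the variable must be present before the worker process starts — is the same.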
