Make AMM memory measure configurable#7062
Conversation
| with warnings.catch_warnings(): | ||
| warnings.simplefilter("ignore") | ||
| b = (a @ a.T).sum().round(3) | ||
| assert await c.compute(b) == 245.394 |
There was a problem hiding this comment.
Interestingly, (20, 20) would take forever or hang on my PC with Worker=Worker, but took as little as 5s with Worker=Nanny. On CI, Worker=Nanny hangs because of #5371.
Unit Test ResultsSee test report for an extended history of previous test failures. This is useful for diagnosing flaky tests. 15 files ± 0 15 suites ±0 5h 59m 36s ⏱️ - 7m 26s For more details on these failures, see this check. Results for commit d8c9586. ± Comparison against base commit b40c03d. ♻️ This comment has been updated with latest results. |
6f87b5b to
1955f92
Compare
6662655 to
d8c9586
Compare
| x = c.submit(lambda: 123, key="x", workers=[w1.address]) | ||
| await wait(x) | ||
| # Fill w2 with dummy data so that it's got the highest memory usage | ||
| clutter = await c.scatter(456, workers=[w2.address]) |
There was a problem hiding this comment.
| clutter = await c.scatter(456, workers=[w2.address]) | |
| clutter = await c.scatter("c" * 10, workers=[w2.address]) |
Small note, I would expect 123 and 456 to have the same memory usage. So the comment above is slightly misleading. Something like this would make it definitively larger.
There was a problem hiding this comment.
w2 has got the highest memory usage among the workers that aren't being retired, meaning w2 and w3. I updated the comment to clarify.
Allow the Active Memory Manager to use a measure other than optimistic memory (managed + unmanaged that appeared more than 30s ago) in its heuristics.
This is particularly useful on MacOSX, where memory deallocation is not as responsive as on Windows or Linux, and on Linux when allocators other than malloc are being used.
This also allows to write the AMM unit tests using Worker instead of Nanny and make them much more robust and fast.
This PR finally enables all AMM stress tests in CI.
Stress test evidence: https://github.com/crusaderky/distributed/actions/runs/3114670981/jobs/5050785452#step:18:1674
There was one failure which I don't believe to be attributed to AMM. Follow-up: #7063