Fix regression in test_weakref_cache #6033
Conversation
distributed/tests/test_spill.py
```python
# Surprisingly, even on CPython this is needed to ensure that the object is garbage
# collected, even if there are no obvious circular references going on
gc.collect()
```
This does concern me a bit. I don't think the GC should be required here.
I can reproduce this on py3.10. Maybe there is a CPython bug? Below is what objgraph tells me
```python
import weakref

ref = weakref.ref(x)
del x

import objgraph

obj = ref()
if obj:
    objgraph.show_backrefs(obj)
```

cc @graingert in case something rings a bell.
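As a side note, the same backref hunt can be done with just the stdlib - a sketch continuing from the snippet above, where `gc.get_referrers` lists the objects that currently hold references to its argument:

```python
import gc

obj = ref()
if obj is not None:
    # Print whatever still holds a reference to the object that we
    # expected to be collected after `del x`.
    for referrer in gc.get_referrers(obj):
        print(type(referrer), repr(referrer)[:120])
```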
My mistake: at this point x is supposed to be in memory, since the spill has not been triggered yet; that only happens a few lines below.
I cannot reproduce this any longer.
I ran into similar issues in dask-ms a while back and found I could never rely on the `del x`. If it's an issue of "encouraging" the decref/garbage collector, you might try allocating x inside an inner function, like so:
```python
# Run this test twice:
# - x is smaller than target and is evicted by y;
# - x is individually larger than target and it never touches fast
def _inner():
    x = cls(size)
    buf["x"] = x
    if size < 100:
        buf["y"] = cls(60)  # spill x
    assert "x" in buf.slow
    # Test that we update the weakref cache on setitem
    assert (buf["x"] is x) == expect_cached
    # Do not use id_x = id(x), as in CPython ids are C memory addresses and are
    # reused by PyMalloc when you descope objects, so a brand new object might
    # end up having the same id as a deleted one
    id_x = x.id
    del x
    return id_x

id_x = _inner()
```
This test has been failing since 2022-03-17, i.e. before #6029: https://dask.org/distributed/test_report.html. I suggest not merging this, since the gc.collect() is not helping and may confuse us more than it helps.
The test passes for me locally with the GC disabled: distributed/tests/test_spill.py, lines 327 to 367 at 4da3b8c.
Shutting down the profile threads before running test_weakref_cache seems to fix it for me locally: distributed/profile.py, lines 278 to 341, and distributed/tests/test_spill.py, line 347, at a896a01. Edit: seems to have fixed it on CI too: https://github.com/graingert/distributed/actions/runs/2081825815
Commenting out distributed/profile.py, lines 297 to 302 at 5f2ffab: https://github.com/graingert/distributed/runs/5806180032?check_suite_focus=true#step:11:5355
I've commented the various tests that call gc.collect() to reflect your latest findings.
I don't think this is correct - I think a sleep will be just as good as a gc.collect().
Turning off profile threads during testing seems reasonable to me.
Having a whole extra thread that runs at all times in production and tampers with references, but is just not there during testing - this seems quite dangerous to me.
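To illustrate the concern, here is a minimal, self-contained sketch of the mechanism (an illustration, not the profiler's actual code): any thread that stores frame objects obtained from `sys._current_frames()` keeps those frames' locals alive, so a weakref to a "deleted" local stays resolvable until the sampler drops its snapshot:

```python
import sys
import weakref


class Big:
    """Stand-in for the spilled value in test_weakref_cache."""


holder = []  # plays the role of the profiler's sampled-frame storage


def inner():
    x = Big()
    # A sampling profiler grabs the currently-executing frames; holding
    # on to them keeps every local in those frames (including x) alive,
    # even after this function returns.
    holder.extend(sys._current_frames().values())
    return weakref.ref(x)


ref = inner()
print(ref() is not None)  # True: the sampled frame still references x
holder.clear()
print(ref() is not None)  # False: dropping the frames releases x, no gc.collect() needed
```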
These profile threads aren't actually supposed to be running; they're sticking around because some other test is leaking threads.
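A sketch of how a test-suite check for this could look (a hypothetical helper; the filter assumes the watch threads have "Profile" in their thread name):

```python
import threading


def assert_no_leaked_profile_threads():
    # Fail fast if a previous test left a profiler watch thread behind.
    leaked = [t.name for t in threading.enumerate() if "Profile" in t.name]
    assert not leaked, f"leaked profile threads: {leaked}"
```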
I was not talking about test_weakref_cache in particular, but in general about all those tests that look like:

```python
del future
gc.collect()
```

I've reworked the PR - let's see if it's stable or not.
The patch I added:

```diff
--- a/distributed/utils.py
+++ b/distributed/utils.py
@@ -459,7 +459,10 @@ class LoopRunner:
             finally:
                 done_evt.set()
 
-        thread = threading.Thread(target=run_loop, name="IO loop")
+        thread = threading.Thread(
+            target=run_loop,
+            name=f"IO loop for {os.environ.get('PYTEST_CURRENT_TEST')}",
+        )
         thread.daemon = True
         thread.start()
```

which lets me see where the loops are staying alive from:
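With that patch in place, a plain-stdlib way to dump the offenders (a sketch; the name prefix matches the patch above):

```python
import threading

# Each leaked IO loop now carries the PYTEST_CURRENT_TEST value of the
# test that started it, so printing thread names points at the culprit.
for t in threading.enumerate():
    if t.name.startswith("IO loop for"):
        print(t.name)
```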
@graingert that list of threads is an extremely useful thing to have permanently - mind opening a separate PR to get it in, as well as proper cleanup for LocalCluster? I think this PR should continue in the direction of resiliency to the profiler regardless of your cleanup. Also, even after the cleanup, all tests that have a cluster running will still have the issue.
```python
while _watch_running:
    sleep(0.0001)
```
I'm not entirely sure how I feel about this. It is not thread safe, so while this function is returning, another profile thread might again sample the frames. At the same time, if a profile thread is actually running, this thing will burn CPU hard.
For this specific use case that's likely not a problem, but I'm wondering whether a sleep(0.1) would not be equally effective, without us putting this kind of instrumentation in.
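For comparison, a sketch of a non-spinning variant (hypothetical names, not what the PR implements; `threading.Event` is stdlib): the watch thread clears an event while it samples and sets it when idle, so waiters block instead of burning CPU:

```python
import threading

_watch_idle = threading.Event()
_watch_idle.set()  # no sampling cycle in progress initially


def _sample_cycle(sample):
    # Called by the watch thread around each sampling pass.
    _watch_idle.clear()
    try:
        sample()
    finally:
        _watch_idle.set()


def wait_profiler(timeout: float = 2.0) -> bool:
    """Block until no sampling pass is in progress, or the timeout expires."""
    return _watch_idle.wait(timeout)
```

This has the same inherent race noted above - another cycle can start right after `wait()` returns - but it avoids the busy loop.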
I agree - this is OK only as long as its usage is limited to unit tests.
I don't like this watcher, but I don't want to block this PR any longer. I opened #6075 to discuss the future of the profiler in our test suite.
@graingert is that a different type of failure, or what you would have expected this to fix? I haven't looked at this PR at all, so I'm not sure.
The new failure has nothing to do with test_weakref_cache; the log shows that the problem is that profile_lock remained acquired by some other test that ran before it.
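A hypothetical guard one could put in test setup to catch this earlier - assuming the lock the log refers to is the module-level `threading.Lock` in distributed/profile.py (an assumption, not verified here):

```python
from distributed import profile


def assert_profile_lock_free():
    # Lock.locked() reports whether the lock is currently held; if it is
    # held here while no profiler is sampling, a previous test leaked it.
    assert not profile.lock.locked(), "profile lock left acquired by a previous test"
```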
Fix regression introduced in #6029:
https://github.com/dask/distributed/runs/5768355142?check_suite_focus=true