Fix regression in test_weakref_cache #6033
Conversation
distributed/tests/test_spill.py
```python
# Surprisingly, even on CPython this is needed to ensure that the object is garbage
# collected, even if there are no obvious circular references going on
gc.collect()
```
This does concern me a bit. I don't think the GC should be required here.
I can reproduce this on py3.10. Maybe there is a CPython bug? Below is what objgraph tells me
```python
import weakref

ref = weakref.ref(x)
del x

import objgraph

obj = ref()
if obj:
    objgraph.show_backrefs(obj)
```

cc @graingert in case something rings a bell.
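As a side note, the same backref hunt can be done with just the stdlib - a sketch continuing from the snippet above, where `gc.get_referrers` lists the objects that currently hold references to its argument:

```python
import gc

obj = ref()
if obj is not None:
    # Print whatever still holds a reference to the object that we
    # expected to be collected after `del x`.
    for referrer in gc.get_referrers(obj):
        print(type(referrer), repr(referrer)[:120])
```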
My mistake: at this point x is supposed to be in memory, since the spill has not been triggered yet; that only happens a few lines below.
I cannot reproduce this any longer.
I ran into similar issues in dask-ms a while back and found I could never rely on the `del x`. If it's an issue of "encouraging" the decref/garbage collector, you might try allocating x inside an inner function, like so:
```python
# Run this test twice:
# - x is smaller than target and is evicted by y;
# - x is individually larger than target and it never touches fast
def _inner():
    x = cls(size)
    buf["x"] = x
    if size < 100:
        buf["y"] = cls(60)  # spill x
    assert "x" in buf.slow
    # Test that we update the weakref cache on setitem
    assert (buf["x"] is x) == expect_cached
    # Do not use id_x = id(x), as in CPython ids are C memory addresses and are
    # reused by PyMalloc when you descope objects, so a brand new object might
    # end up having the same id as a deleted one
    id_x = x.id
    del x
    return id_x

id_x = _inner()
```
This test has been failing since 2022-03-17, i.e. before #6029: https://dask.org/distributed/test_report.html. I suggest not merging this, since the gc.collect() is not helping and may confuse us more than it helps.
The test passes for me locally with the GC disabled: distributed/tests/test_spill.py, lines 327 to 367 at 4da3b8c.
Shutting down the profile threads before running test_weakref_cache seems to fix it for me locally: distributed/profile.py, lines 278 to 341, and distributed/tests/test_spill.py, line 347, at a896a01. Edit: seems to have fixed it on CI too: https://github.com/graingert/distributed/actions/runs/2081825815
Commenting out distributed/profile.py, lines 297 to 302 at 5f2ffab: https://github.com/graingert/distributed/runs/5806180032?check_suite_focus=true#step:11:5355
I've commented the various tests that call gc.collect() to reflect your latest findings.
I don't think this is correct - I think a sleep will be just as good as a gc.collect().
Turning off profile threads during testing seems reasonable to me.
Having a whole extra thread that runs at all times in production and tampers with references, but is just not there during testing - this seems quite dangerous to me.
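To illustrate the concern, here is a minimal, self-contained sketch of the mechanism (an illustration, not the profiler's actual code): any thread that stores frame objects obtained from `sys._current_frames()` keeps those frames' locals alive, so a weakref to a "deleted" local stays resolvable until the sampler drops its snapshot:

```python
import sys
import weakref


class Big:
    """Stand-in for the spilled value in test_weakref_cache."""


holder = []  # plays the role of the profiler's sampled-frame storage


def inner():
    x = Big()
    # A sampling profiler grabs the currently-executing frames; holding
    # on to them keeps every local in those frames (including x) alive,
    # even after this function returns.
    holder.extend(sys._current_frames().values())
    return weakref.ref(x)


ref = inner()
print(ref() is not None)  # True: the sampled frame still references x
holder.clear()
print(ref() is not None)  # False: dropping the frames releases x, no gc.collect() needed
```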
These profile threads aren't actually supposed to be running; they're sticking around because some other test is leaking threads.
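A sketch of how a test-suite check for this could look (a hypothetical helper; the filter assumes the watch threads have "Profile" in their thread name):

```python
import threading


def assert_no_leaked_profile_threads():
    # Fail fast if a previous test left a profiler watch thread behind.
    leaked = [t.name for t in threading.enumerate() if "Profile" in t.name]
    assert not leaked, f"leaked profile threads: {leaked}"
```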
I was not talking about test_weakref_cache in particular, but in general about all those tests that look like:

```python
del future
gc.collect()
```

I've reworked the PR - let's see if it's stable or not.
The patch I added:

```diff
--- a/distributed/utils.py
+++ b/distributed/utils.py
@@ -459,7 +459,10 @@ class LoopRunner:
             finally:
                 done_evt.set()
 
-        thread = threading.Thread(target=run_loop, name="IO loop")
+        thread = threading.Thread(
+            target=run_loop,
+            name=f"IO loop for {os.environ.get('PYTEST_CURRENT_TEST')}",
+        )
         thread.daemon = True
         thread.start()
```

which lets me see where the loops are staying alive from:
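With that patch in place, a plain-stdlib way to dump the offenders (a sketch; the name prefix matches the patch above):

```python
import threading

# Each leaked IO loop now carries the PYTEST_CURRENT_TEST value of the
# test that started it, so printing thread names points at the culprit.
for t in threading.enumerate():
    if t.name.startswith("IO loop for"):
        print(t.name)
```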
@graingert that list of threads is an extremely useful thing to have permanently - mind opening a separate PR to get it in, as well as proper cleanup for LocalCluster? I think this PR should continue in the direction of resiliency to the profiler regardless of your cleanup. Also, even after the cleanup, all tests that have a cluster running will still have the issue.
```python
while _watch_running:
    sleep(0.0001)
```
I'm not entirely sure how I feel about this. It is not thread safe, so while this function is returning, another profile thread might again sample the frames. At the same time, if a profile thread is actually running, this thing will burn CPU hard.
For this specific use case that's likely not a problem, but I'm wondering whether a sleep(0.1) would not be equally effective, without us putting this kind of instrumentation in.
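For comparison, a sketch of a non-spinning variant (hypothetical names, not what the PR implements; `threading.Event` is stdlib): the watch thread clears an event while it samples and sets it when idle, so waiters block instead of burning CPU:

```python
import threading

_watch_idle = threading.Event()
_watch_idle.set()  # no sampling cycle in progress initially


def _sample_cycle(sample):
    # Called by the watch thread around each sampling pass.
    _watch_idle.clear()
    try:
        sample()
    finally:
        _watch_idle.set()


def wait_profiler(timeout: float = 2.0) -> bool:
    """Block until no sampling pass is in progress, or the timeout expires."""
    return _watch_idle.wait(timeout)
```

This has the same inherent race noted above - another cycle can start right after `wait()` returns - but it avoids the busy loop.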
I agree - this is OK only as long as its usage is limited to unit tests.
I don't like this watcher, but I don't want to block this PR any longer. I opened #6075 to discuss the future of the profiler in our test suite.
@graingert is that a different type of failure, or what you would have expected this to fix? I haven't looked at this PR at all, so I'm not sure.
The new failure has nothing to do with test_weakref_cache; the log shows that the problem is that profile_lock remained acquired by some other test that ran before it.
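A hypothetical guard one could put in test setup to catch this earlier - assuming the lock the log refers to is the module-level `threading.Lock` in distributed/profile.py (an assumption, not verified here):

```python
from distributed import profile


def assert_profile_lock_free():
    # Lock.locked() reports whether the lock is currently held; if it is
    # held here while no profiler is sampling, a previous test leaked it.
    assert not profile.lock.locked(), "profile lock left acquired by a previous test"
```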
Fix regression introduced in #6029:
https://github.com/dask/distributed/runs/5768355142?check_suite_focus=true