Description
@gjoseph92 noticed that, under some profiling conditions, turning off garbage collection had a significant impact on scheduler performance. I'm including some of his notes in the summary below.
Notes from Gabe
See #4825 for initial discussion of the problem. It also comes up on #4881 (comment).
I've also run these with GC debug mode on (gjoseph92/dask-profiling-coiled@c0ea2aa1) and looked at the GC logs. Interestingly, GC debug mode generally reports GC as taking zero time:
gc: done, 0 unreachable, 0 uncollectable, 0.0000s elapsed
Some of those logs are here: https://rawcdn.githack.com/gjoseph92/dask-profiling-coiled/61fc875173a5b2f9195346f2a523cb1d876c48ad/results/cython-shuffle-gc-debug-noprofiling-ecs-prod-nopyspy.txt?raw=true
The types of objects being listed as collectable are interesting (cells, frames, tracebacks, asyncio Futures/Tasks, SelectorKey), since those are the sorts of things you might expect to create cycles. It's also interesting that there are already ~150k objects in the oldest generation (generation 2 in CPython's zero-based numbering) before the computation has even started, and ~300k (and growing) once it's been running for a little bit.
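For anyone who wants to reproduce this kind of inspection locally, here is a minimal sketch using only the standard `gc` module. It is not the script used for the logs above; the helper name `generation_sizes` is made up for illustration, and it assumes a CPython version where `gc.get_objects` accepts a `generation` argument (3.8+).

```python
import gc


def generation_sizes():
    """Return the number of GC-tracked objects in generations 0, 1 and 2.

    Hypothetical helper for illustration; not part of distributed.
    """
    return [len(gc.get_objects(generation=g)) for g in range(3)]


# DEBUG_STATS makes the collector print statistics to stderr for each
# collection, including the "gc: done, ... elapsed" summary line.
gc.set_debug(gc.DEBUG_STATS)
gc.collect()
gc.set_debug(0)  # turn debug output back off

print("tracked objects per generation:", generation_sizes())
print("allocation counters:", gc.get_count())
print("collection thresholds:", gc.get_threshold())
```

Calling `generation_sizes()` before and during a computation should show whether the oldest generation keeps growing.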
I've also tried:
- turning off statistical profiling
- turning off the bokeh dashboard
- using uvloop instead of native asyncio
But none of those affected the issue.
What I wanted to do next was use refcycle or objgraph or a similar tool to try to see what's causing the cycles. Or possibly use tracemalloc + GC hooks to try to log where the objects that were being collected were initially created.
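The tracemalloc + GC hooks idea could look roughly like the sketch below. This is an assumption about how one might wire it up, not something that has been run against the scheduler: `_gc_hook`, `pauses`, and `allocation_sites_of_garbage` are hypothetical names, and it relies on `gc.DEBUG_SAVEALL` to keep unreachable objects in `gc.garbage` long enough to ask tracemalloc where they were allocated.

```python
import gc
import time
import tracemalloc
from collections import Counter

# Keep up to 5 frames per allocation. Only objects allocated after this
# call can be traced back to their creation site.
tracemalloc.start(5)

_start = None
pauses = []  # (generation, seconds, collected) for each collection


def _gc_hook(phase, info):
    """Time every collection via gc.callbacks (hypothetical helper)."""
    global _start
    if phase == "start":
        _start = time.perf_counter()
    elif phase == "stop" and _start is not None:
        elapsed = time.perf_counter() - _start
        pauses.append((info["generation"], elapsed, info["collected"]))


gc.callbacks.append(_gc_hook)


def allocation_sites_of_garbage():
    """Run one full collection and count where cyclic garbage was created."""
    old_flags = gc.get_debug()
    # DEBUG_SAVEALL stores unreachable objects in gc.garbage instead of
    # freeing them, so they can still be inspected.
    gc.set_debug(old_flags | gc.DEBUG_SAVEALL)
    try:
        gc.collect()
        sites = Counter()
        for obj in gc.garbage:
            tb = tracemalloc.get_object_traceback(obj)
            if tb is not None:
                frame = tb[-1]  # most recent frame, i.e. the allocation site
                sites[(frame.filename, frame.lineno)] += 1
        return sites
    finally:
        gc.set_debug(old_flags)
        gc.garbage.clear()  # released objects are freed by a later collection


# Tiny demonstration: build one reference cycle and look for it.
class Node:
    pass


a, b = Node(), Node()
a.other, b.other = b, a
del a, b

print(allocation_sites_of_garbage().most_common(5))
print(pauses[-1])
```

Both tracemalloc and `DEBUG_SAVEALL` add overhead of their own, so timings gathered this way are probably only useful for finding where the cycles come from, not for measuring their cost.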
I notice that we have reference cycles in our scheduler state:
In [1]: from dask.distributed import Client
In [2]: client = Client()
In [3]: import dask.array as da
In [4]: x = da.random.random((1000, 1000)).sum().persist()
In [5]: s = client.cluster.scheduler
In [6]: a, b = s.tasks.values()
In [7]: a
Out[7]: <TaskState "('sum-aggregate-832c859ad539eafe39d0e7207de9f1e7',)" memory>
In [8]: b
Out[8]: <TaskState "('random_sample-sum-sum-aggregate-832c859ad539eafe39d0e7207de9f1e7',)" released>
In [9]: a in b.dependents
Out[9]: True
In [10]: b in a.dependencies
Out[10]: True

Should we be concerned about our use of reference cycles?
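To make the question concrete, here is a minimal sketch of why a dependencies/dependents pair matters to the collector. The `Task` class is a hypothetical stand-in, not the real `TaskState`; it only mirrors the two attributes shown in the session above.

```python
import gc
import weakref


class Task:
    """Hypothetical stand-in for the scheduler's TaskState."""

    def __init__(self, key):
        self.key = key
        self.dependencies = set()
        self.dependents = set()


def demo():
    gc.disable()  # keep automatic collections from interfering with the demo
    try:
        b = Task("random_sample-sum")
        a = Task("sum-aggregate")
        a.dependencies.add(b)  # a -> b
        b.dependents.add(a)    # b -> a, which closes the cycle

        probe = weakref.ref(a)
        del a, b
        # Reference counting alone cannot free the pair: each object is
        # still referenced by the other one's set.
        alive_after_del = probe() is not None

        found = gc.collect()  # only the cyclic collector reclaims them
        alive_after_collect = probe() is not None
        return alive_after_del, alive_after_collect, found
    finally:
        gc.enable()


print(demo())
```

Every task linked this way can only be reclaimed by the cyclic collector, never by reference counting alone, which may be one reason so many objects end up in the older generations.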