@@ -107,34 +107,31 @@ become troublesome, both because overhead is now in the 10 minutes to hours
 range, and also because the overhead of dealing with such a large graph can
 start to overwhelm the scheduler.
 
-There are a few things you can do to address this:
-
-- Build smaller graphs. You can do this by:
-
-  - **Increasing your chunk size:** If you have a 1000 GB of data and are using
-    10 MB chunks, then you have 100,000 partitions. Every operation on such
-    a collection will generate at least 100,000 tasks.
-
-    However if you increase your chunksize to 1 GB or even a few GB then you
-    reduce the overhead by orders of magnitude. This requires that your
-    workers have much more than 1 GB of memory, but that's typical for larger
-    workloads.
-
-  - **Fusing operations together:** Dask will do a bit of this on its own, but you
-    can help it. If you have a very complex operation with dozens of
-    sub-operations, maybe you can pack that into a single Python function
-    and use a function like ``da.map_blocks`` or ``dd.map_partitions``.
-
-    In general, the more administrative work you can move into your functions
-    the better. That way the Dask scheduler doesn't need to think about all
-    of the fine-grained operations.
-
-  - **Breaking up your computation:** For very large workloads you may also want to
-    try sending smaller chunks to Dask at a time. For example if you're
-    processing a petabyte of data but find that Dask is only happy with 100
-    TB, maybe you can break up your computation into ten pieces and submit
-    them one after the other.
-
+You can build smaller graphs by:
+
+- **Increasing your chunk size:** If you have 1,000 GB of data and are using
+  10 MB chunks, then you have 100,000 partitions. Every operation on such
+  a collection will generate at least 100,000 tasks.
+
+  However, if you increase your chunk size to 1 GB or even a few GB, then you
+  reduce the overhead by orders of magnitude. This requires that your
+  workers have much more than 1 GB of memory, but that's typical for larger
+  workloads (see the first sketch after this list).
+
+- **Fusing operations together:** Dask will do a bit of this on its own, but you
+  can help it. If you have a very complex operation with dozens of
+  sub-operations, you may be able to pack it into a single Python function
+  and use a function like ``da.map_blocks`` or ``dd.map_partitions`` (see the
+  second sketch after this list).
+
+  In general, the more administrative work you can move into your functions,
+  the better. That way the Dask scheduler doesn't need to think about all
+  of the fine-grained operations.
+
+- **Breaking up your computation:** For very large workloads you may also want
+  to try sending smaller pieces of work to Dask at a time. For example, if
+  you're processing a petabyte of data but find that Dask is only happy with
+  100 TB, you can break up your computation into ten pieces and submit them
+  one after the other (see the third sketch after this list).
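+
+As a first sketch of the chunk-size idea, consider a synthetic array (the
+sizes below are illustrative, not a real workload). Rechunking 1,000 GB of
+float64 values from ~10 MB chunks to ~1 GB chunks shrinks the partition
+count, and so the per-operation task count, from 100,000 to 1,000:
+
+.. code-block:: python
+
+   import dask.array as da
+
+   # ~1,000 GB of float64 values (8 bytes each) in 1.25-million-element
+   # (~10 MB) chunks: 100,000 partitions, so >= 100,000 tasks per operation.
+   x = da.ones(125_000_000_000, chunks=1_250_000)
+   assert x.npartitions == 100_000
+
+   # Rechunk to ~1 GB chunks: two orders of magnitude fewer tasks.
+   y = x.rechunk(125_000_000)
+   assert y.npartitions == 1_000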
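+
+A second sketch shows fusing. The ``process`` function below is a made-up
+stand-in for your own complex operation, but the pattern is real: because
+the sub-operations live inside one Python function, ``map_blocks`` produces
+a single task per chunk rather than one task per sub-operation:
+
+.. code-block:: python
+
+   import numpy as np
+   import dask.array as da
+
+   def process(block):
+       # Several NumPy sub-operations packed into one function, so the
+       # scheduler sees one task per block instead of three.
+       out = np.clip(block, 0, None)
+       out = np.log1p(out)
+       return out - out.mean()
+
+   x = da.random.random((100_000, 1_000), chunks=(10_000, 1_000))
+
+   # Providing dtype up front lets Dask skip inferring it from fake data.
+   result = x.map_blocks(process, dtype=x.dtype)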
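+
+Finally, a sketch of submitting work in pieces. The paths and the ``value``
+column here are hypothetical; the point is that each piece is computed to
+completion before the next graph is built, so the scheduler only ever sees
+one tenth of the workload at a time:
+
+.. code-block:: python
+
+   import dask.dataframe as dd
+
+   # Hypothetical layout: one big dataset split across ten directories.
+   paths = [f"s3://bucket/data/part-{i}/*.parquet" for i in range(10)]
+
+   partial_sums = []
+   for path in paths:
+       df = dd.read_parquet(path)  # build a graph for one piece only
+       partial_sums.append(df["value"].sum().compute())
+
+   total = sum(partial_sums)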
 
 Learn Techniques For Customization
 ----------------------------------