Commit b7bbdf9

Sarah Charlotte Johnson authored
Fix indentation in Best Practices (#9196)
1 parent a62c008 commit b7bbdf9

1 file changed: docs/source/best-practices.rst

Lines changed: 25 additions & 28 deletions
@@ -107,34 +107,31 @@ become troublesome, both because overhead is now in the 10 minutes to hours
 range, and also because the overhead of dealing with such a large graph can
 start to overwhelm the scheduler.
 
-There are a few things you can do to address this:
-
-- Build smaller graphs. You can do this by:
-
-- **Increasing your chunk size:** If you have a 1000 GB of data and are using
-  10 MB chunks, then you have 100,000 partitions. Every operation on such
-  a collection will generate at least 100,000 tasks.
-
-  However if you increase your chunksize to 1 GB or even a few GB then you
-  reduce the overhead by orders of magnitude. This requires that your
-  workers have much more than 1 GB of memory, but that's typical for larger
-  workloads.
-
-- **Fusing operations together:** Dask will do a bit of this on its own, but you
-  can help it. If you have a very complex operation with dozens of
-  sub-operations, maybe you can pack that into a single Python function
-  and use a function like ``da.map_blocks`` or ``dd.map_partitions``.
-
-  In general, the more administrative work you can move into your functions
-  the better. That way the Dask scheduler doesn't need to think about all
-  of the fine-grained operations.
-
-- **Breaking up your computation:** For very large workloads you may also want to
-  try sending smaller chunks to Dask at a time. For example if you're
-  processing a petabyte of data but find that Dask is only happy with 100
-  TB, maybe you can break up your computation into ten pieces and submit
-  them one after the other.
-
+You can build smaller graphs by:
+
+- **Increasing your chunk size:** If you have a 1,000 GB of data and are using
+  10 MB chunks, then you have 100,000 partitions. Every operation on such
+  a collection will generate at least 100,000 tasks.
+
+  However if you increase your chunksize to 1 GB or even a few GB then you
+  reduce the overhead by orders of magnitude. This requires that your
+  workers have much more than 1 GB of memory, but that's typical for larger
+  workloads.
+
+- **Fusing operations together:** Dask will do a bit of this on its own, but you
+  can help it. If you have a very complex operation with dozens of
+  sub-operations, maybe you can pack that into a single Python function
+  and use a function like ``da.map_blocks`` or ``dd.map_partitions``.
+
+  In general, the more administrative work you can move into your functions
+  the better. That way the Dask scheduler doesn't need to think about all
+  of the fine-grained operations.
+
+- **Breaking up your computation:** For very large workloads you may also want to
+  try sending smaller chunks to Dask at a time. For example if you're
+  processing a petabyte of data but find that Dask is only happy with 100
+  TB, maybe you can break up your computation into ten pieces and submit
+  them one after the other.
 
 Learn Techniques For Customization
 ----------------------------------
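As a rough sketch of the chunk-size arithmetic in the diff above, here is what the bullet's numbers look like with ``dask.array`` (the array shape and chunk sizes are illustrative, not taken from the docs):

    import dask.array as da

    # A hypothetical ~1 TB float64 array in ~10 MB chunks: 100,000 blocks,
    # so every operation on it adds at least 100,000 tasks to the graph.
    x = da.ones((125_000, 1_000_000), chunks=(125, 10_000))
    print(x.numblocks)   # (1000, 100) -> 100,000 chunks of ~10 MB each

    # Rechunking to ~1 GB chunks shrinks the graph by two orders of magnitude,
    # provided each worker has comfortably more than 1 GB of memory.
    y = x.rechunk((12_500, 10_000))
    print(y.numblocks)   # (10, 100) -> 1,000 chunks of ~1 GB each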

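The fusing bullet points at ``da.map_blocks`` and ``dd.map_partitions``; a minimal sketch of that idea with ``da.map_blocks``, where the ``preprocess`` function and its sub-steps are made up for illustration:

    import numpy as np
    import dask.array as da

    def preprocess(block):
        # Several small sub-operations packed into one Python function, so the
        # scheduler sees a single task per chunk instead of one task per step.
        block = np.clip(block, 0, None)
        block = np.log1p(block)
        return block - block.mean()

    x = da.random.random((100_000, 1_000), chunks=(10_000, 1_000))
    y = x.map_blocks(preprocess)   # one fused task per chunk
    total = y.sum().compute()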
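Similarly, a sketch of the "breaking up your computation" bullet, submitting one piece at a time rather than the whole dataset at once (the Parquet layout, column names, and aggregation are hypothetical):

    import dask.dataframe as dd

    # Assume the large dataset is stored as ten directories of Parquet files,
    # data/part-0/ ... data/part-9/, each small enough for Dask to handle
    # comfortably on its own.
    for i in range(10):
        df = dd.read_parquet(f"data/part-{i}/")
        # Placeholder computation; compute() finishes this piece before the
        # next one is submitted, keeping the scheduler's graph small.
        piece_result = df.groupby("key")["value"].mean().compute()
        piece_result.to_csv(f"results/part-{i}.csv")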