.. image:: images/dataset-pipeline-3.svg

You can also set the size of each window with ``bytes_per_window``. In this mode, Datasets determines the number of blocks per window from the target byte size: each window receives at least one block, but otherwise its total size does not exceed the target bytes per window. This is useful for bounding the memory usage of a pipeline. As a rule of thumb, cluster memory should be at least 2-5x the window size to avoid spilling to disk.

.. code-block:: python

    import ray
    from ray.data.dataset_pipeline import DatasetPipeline

    # Create a DatasetPipeline with up to 10GB of data per window.
    pipe: DatasetPipeline = ray.data \
        .read_binary_files("s3://bucket/image-dir") \
        .window(bytes_per_window=10e9)
    # -> INFO -- Created DatasetPipeline with 73 windows: 9120MiB min, 9431MiB max, 9287MiB mean
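
The windowing rule above can be sketched as a single greedy packing pass over the block sizes. The snippet below is a minimal illustrative sketch, not the actual Datasets implementation; ``split_into_windows`` and its parameters are hypothetical names introduced here for illustration.

.. code-block:: python

    from typing import List

    def split_into_windows(block_sizes: List[int],
                           bytes_per_window: int) -> List[List[int]]:
        """Greedily pack blocks into windows: each window gets at least one
        block, and otherwise stays at or under the target byte size."""
        windows: List[List[int]] = []
        current: List[int] = []
        current_bytes = 0
        for size in block_sizes:
            # Close the current window if this block would push it past the
            # target -- unless the window is still empty, since every window
            # must contain at least one block.
            if current and current_bytes + size > bytes_per_window:
                windows.append(current)
                current, current_bytes = [], 0
            current.append(size)
            current_bytes += size
        if current:
            windows.append(current)
        return windows

    # Five 2GB blocks with a 5GB target pack as [2, 2], [2, 2], [2].
    print(split_into_windows([2, 2, 2, 2, 2], bytes_per_window=5))

Note that a single block larger than the target still forms its own window, which matches the "at least 1 block per window" guarantee described above.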