Commit e0049dc

[docs/data] Add download to key user journeys in documentation (#59417)
Shows users how to use `download` to download from URI tables.

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
1 parent a9b7fc3 commit e0049dc

4 files changed, 50 additions and 2 deletions

doc/source/data/benchmark.md

Lines changed: 1 addition & 1 deletion
@@ -60,7 +60,7 @@ All benchmark results are taken from an average/std across 4 runs. A warmup was
 - Code
 - - **Image Classification**
 - 800k images from ImageNet
-- s3://ray-example-data/imagenet/metadata_file
+- s3://ray-example-data/imagenet/metadata_file.parquet
 - 1 head / 8 workers of varying instance types
 - [Link](https://github.com/ray-project/ray/tree/master/release/nightly_tests/multimodal_inference_benchmarks/image_classification)
 - - **Document Embedding**

doc/source/data/loading-data.rst

Lines changed: 27 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -338,6 +338,33 @@ You can use any `codec supported by Arrow <https://arrow.apache.org/docs/python/
338338
arrow_open_stream_args={"compression": "gzip"},
339339
)
340340

341+
342+
Downloading files from URIs
343+
~~~~~~~~~~~~~~~~~~~~~~~~~~~
344+
345+
Sometimes you may have a metadata table with a column of URIs and you want to download the files referenced by the URIs.
346+
347+
You can download data in bulk by leveraging the :func:`~ray.data.Dataset.with_column` method together with the :func:`~ray.data.expressions.download` expression. This approach lets the system handle the parallel downloading of files referenced by URLs in your dataset, without needing to manage async code within your own transformations.
348+
349+
The following example shows how to download a batch of images from URLs listed in a Parquet file:
350+
351+
.. testcode::
352+
353+
import ray
354+
from ray.data.expressions import download
355+
356+
# Read a Parquet file containing a column of image URLs
357+
ds = ray.data.read_parquet("s3://anonymous@ray-example-data/imagenet/metadata_file.parquet")
358+
359+
# Use `with_column` and `download` to download the images in parallel.
360+
# This creates a new column 'bytes' with the downloaded file contents.
361+
ds = ds.with_column(
362+
"bytes",
363+
download("image_url"),
364+
)
365+
366+
ds.take(1)
367+
341368
Loading data from other libraries
342369
=================================
343370
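For intuition, the bulk download that `with_column` plus `download` performs is conceptually a parallel fetch mapped over the URI column. The following is a hedged, stand-alone sketch of that idea in plain Python (no Ray), using `concurrent.futures` and local `file://` URIs in place of `s3://` paths; the function and column names here are illustrative, not part of the Ray Data API:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(uri):
    # Download the bytes behind one URI.
    with urlopen(uri) as f:
        return f.read()


def add_bytes_column(rows, uri_key="image_url", out_key="bytes", max_workers=8):
    # Conceptual analogue of ds.with_column("bytes", download("image_url")):
    # fetch every URI in parallel and attach the payload as a new column.
    uris = [row[uri_key] for row in rows]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        payloads = list(pool.map(fetch, uris))
    return [{**row, out_key: payload} for row, payload in zip(rows, payloads)]


# Demo with local file:// URIs standing in for remote object-store paths.
tmpdir = tempfile.mkdtemp()
rows = []
for i in range(3):
    path = os.path.join(tmpdir, f"img_{i}.bin")
    with open(path, "wb") as f:
        f.write(bytes([i]) * 4)  # each fake "image" is 4 bytes
    rows.append({"image_url": "file://" + path})

result = add_bytes_column(rows)
print([len(r["bytes"]) for r in result])  # each payload is 4 bytes
```

The real expression additionally handles retries, credentials, and backpressure inside the Ray Data execution engine; this sketch only captures the fan-out shape of the work.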

doc/source/data/working-with-images.rst

Lines changed: 21 additions & 0 deletions
@@ -49,6 +49,27 @@ To view the full list of supported file formats, see the
            ------ ----
            image  ArrowTensorTypeV2(shape=(32, 32, 3), dtype=uint8)
 
+    .. tab-item:: Images from Dataset of URIs
+
+        To load images from a dataset of URIs, use the :func:`~ray.data.Dataset.with_column` method together with the :func:`~ray.data.expressions.download` expression.
+
+        .. testcode::
+
+            import ray
+            from ray.data.expressions import download
+
+            ds = ray.data.read_parquet("s3://anonymous@ray-example-data/imagenet/metadata_file.parquet")
+            ds = ds.with_column("bytes", download("image_url"))
+
+            print(ds.schema())
+
+        .. testoutput::
+
+            Column     Type
+            ------     ----
+            image_url  string
+            bytes      null
+
     .. tab-item:: NumPy
 
        To load images stored in NumPy format, call :func:`~ray.data.read_numpy`.
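Once the `bytes` column is populated by `download`, a typical next step is mapping a decode function over the payloads. As a hedged, stand-alone illustration (plain Python, no Ray, and a hand-written decoder for the tiny binary PPM format instead of a real image library; `decode_ppm` is a hypothetical helper, not a Ray Data API):

```python
def decode_ppm(payload: bytes):
    # Minimal decoder for binary PPM (P6): the header is
    # "P6\n<width> <height>\n255\n", followed by raw RGB bytes.
    header, _, pixels = payload.partition(b"\n255\n")
    magic, dims = header.split(b"\n", 1)
    assert magic == b"P6"
    w, h = (int(x) for x in dims.split())
    # Return the image as rows of (r, g, b) tuples.
    it = iter(pixels[: w * h * 3])
    return [
        [(next(it), next(it), next(it)) for _ in range(w)]
        for _ in range(h)
    ]


# A 2x1 image: one red pixel, then one blue pixel.
payload = b"P6\n2 1\n255\n" + bytes([255, 0, 0, 0, 0, 255])
img = decode_ppm(payload)
print(img)  # [[(255, 0, 0), (0, 0, 255)]]
```

In a real pipeline you would apply such a decoder with `Dataset.map` (using a library like Pillow for JPEG or PNG payloads) after the download step shown above.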

release/nightly_tests/multimodal_inference_benchmarks/image_classification/ray_data_main.py

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@
 
 NUM_GPU_NODES = 8
-INPUT_PATH = "s3://anonymous@ray-example-data/imagenet/metadata_file"
+INPUT_PATH = "s3://anonymous@ray-example-data/imagenet/metadata_file.parquet"
 OUTPUT_PATH = f"s3://ray-data-write-benchmark/{uuid.uuid4().hex}"
 BATCH_SIZE = 100
