Commit e0049dc

[docs/data] Add download to key user journeys in documentation (#59417)
Shows users how to use `download` to download from URI tables.

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
1 parent a9b7fc3 commit e0049dc

4 files changed, 50 additions and 2 deletions

doc/source/data/benchmark.md

Lines changed: 1 addition & 1 deletion
@@ -60,7 +60,7 @@ All benchmark results are taken from an average/std across 4 runs. A warmup was
 - Code
 - - **Image Classification**
 - 800k images from ImageNet
-- s3://ray-example-data/imagenet/metadata_file
+- s3://ray-example-data/imagenet/metadata_file.parquet
 - 1 head / 8 workers of varying instance types
 - [Link](https://github.com/ray-project/ray/tree/master/release/nightly_tests/multimodal_inference_benchmarks/image_classification)
 - - **Document Embedding**

doc/source/data/loading-data.rst

Lines changed: 27 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -338,6 +338,33 @@ You can use any `codec supported by Arrow <https://arrow.apache.org/docs/python/
338338
arrow_open_stream_args={"compression": "gzip"},
339339
)
340340

341+
342+
Downloading files from URIs
343+
~~~~~~~~~~~~~~~~~~~~~~~~~~~
344+
345+
Sometimes you may have a metadata table with a column of URIs and you want to download the files referenced by the URIs.
346+
347+
You can download data in bulk by leveraging the :func:`~ray.data.Dataset.with_column` method together with the :func:`~ray.data.expressions.download` expression. This approach lets the system handle the parallel downloading of files referenced by URLs in your dataset, without needing to manage async code within your own transformations.
348+
349+
The following example shows how to download a batch of images from URLs listed in a Parquet file:
350+
351+
.. testcode::
352+
353+
import ray
354+
from ray.data.expressions import download
355+
356+
# Read a Parquet file containing a column of image URLs
357+
ds = ray.data.read_parquet("s3://anonymous@ray-example-data/imagenet/metadata_file.parquet")
358+
359+
# Use `with_column` and `download` to download the images in parallel.
360+
# This creates a new column 'bytes' with the downloaded file contents.
361+
ds = ds.with_column(
362+
"bytes",
363+
download("image_url"),
364+
)
365+
366+
ds.take(1)
367+
341368
Loading data from other libraries
342369
=================================
343370
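For intuition, the bulk download that `with_column` plus `download` performs is conceptually a parallel fetch mapped over the URI column. The following is a hedged, stand-alone sketch of that idea in plain Python (no Ray), using `concurrent.futures` and local `file://` URIs in place of `s3://` paths; the function and column names here are illustrative, not part of the Ray Data API:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(uri):
    # Download the bytes behind one URI.
    with urlopen(uri) as f:
        return f.read()


def add_bytes_column(rows, uri_key="image_url", out_key="bytes", max_workers=8):
    # Conceptual analogue of ds.with_column("bytes", download("image_url")):
    # fetch every URI in parallel and attach the payload as a new column.
    uris = [row[uri_key] for row in rows]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        payloads = list(pool.map(fetch, uris))
    return [{**row, out_key: payload} for row, payload in zip(rows, payloads)]


# Demo with local file:// URIs standing in for remote object-store paths.
tmpdir = tempfile.mkdtemp()
rows = []
for i in range(3):
    path = os.path.join(tmpdir, f"img_{i}.bin")
    with open(path, "wb") as f:
        f.write(bytes([i]) * 4)  # each fake "image" is 4 bytes
    rows.append({"image_url": "file://" + path})

result = add_bytes_column(rows)
print([len(r["bytes"]) for r in result])  # each payload is 4 bytes
```

The real expression additionally handles retries, credentials, and backpressure inside the Ray Data execution engine; this sketch only captures the fan-out shape of the work.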

doc/source/data/working-with-images.rst

Lines changed: 21 additions & 0 deletions
@@ -49,6 +49,27 @@ To view the full list of supported file formats, see the
            ------ ----
            image  ArrowTensorTypeV2(shape=(32, 32, 3), dtype=uint8)
 
+    .. tab-item:: Images from Dataset of URIs
+
+        To load images from a dataset of URIs, use the :func:`~ray.data.Dataset.with_column` method together with the :func:`~ray.data.expressions.download` expression.
+
+        .. testcode::
+
+            import ray
+            from ray.data.expressions import download
+
+            ds = ray.data.read_parquet("s3://anonymous@ray-example-data/imagenet/metadata_file.parquet")
+            ds = ds.with_column("bytes", download("image_url"))
+
+            print(ds.schema())
+
+        .. testoutput::
+
+            Column     Type
+            ------     ----
+            image_url  string
+            bytes      null
+
     .. tab-item:: NumPy
 
        To load images stored in NumPy format, call :func:`~ray.data.read_numpy`.
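Once the `bytes` column is populated by `download`, a typical next step is mapping a decode function over the payloads. As a hedged, stand-alone illustration (plain Python, no Ray, and a hand-written decoder for the tiny binary PPM format instead of a real image library; `decode_ppm` is a hypothetical helper, not a Ray Data API):

```python
def decode_ppm(payload: bytes):
    # Minimal decoder for binary PPM (P6): the header is
    # "P6\n<width> <height>\n255\n", followed by raw RGB bytes.
    header, _, pixels = payload.partition(b"\n255\n")
    magic, dims = header.split(b"\n", 1)
    assert magic == b"P6"
    w, h = (int(x) for x in dims.split())
    # Return the image as rows of (r, g, b) tuples.
    it = iter(pixels[: w * h * 3])
    return [
        [(next(it), next(it), next(it)) for _ in range(w)]
        for _ in range(h)
    ]


# A 2x1 image: one red pixel, then one blue pixel.
payload = b"P6\n2 1\n255\n" + bytes([255, 0, 0, 0, 0, 255])
img = decode_ppm(payload)
print(img)  # [[(255, 0, 0), (0, 0, 255)]]
```

In a real pipeline you would apply such a decoder with `Dataset.map` (using a library like Pillow for JPEG or PNG payloads) after the download step shown above.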

release/nightly_tests/multimodal_inference_benchmarks/image_classification/ray_data_main.py

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@
 
 NUM_GPU_NODES = 8
-INPUT_PATH = "s3://anonymous@ray-example-data/imagenet/metadata_file"
+INPUT_PATH = "s3://anonymous@ray-example-data/imagenet/metadata_file.parquet"
 OUTPUT_PATH = f"s3://ray-data-write-benchmark/{uuid.uuid4().hex}"
 BATCH_SIZE = 100
