Implement preferred_chunks for netcdf 4 backends by mraspaud · Pull Request #7948 · pydata/xarray

mraspaud · 2023-06-28T08:43:30Z

According to the open_dataset documentation, using chunks="auto" or chunks={} should yield datasets with variables chunked according to the preferred chunks of the backend. However, neither the netcdf4 nor the h5netcdf backend seems to implement the preferred_chunks encoding attribute needed for this to work.

This PR adds this attribute to the encoding upon data reading. As a result, chunks="auto" in open_dataset returns variables whose chunk sizes are multiples of the chunks in the nc file, and chunks={} returns variables with the exact nc chunk sizes.
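As a rough illustration of the attribute described above, the sketch below builds a preferred_chunks-style mapping from a variable's dimension names and its on-disk chunk sizes. The helper name `preferred_chunks_from_storage` is hypothetical and not taken from this PR; the sketch only assumes that a backend reports either a sequence of chunk sizes or the string "contiguous" for unchunked variables, as netCDF4's `Variable.chunking()` does.

```python
from __future__ import annotations

from typing import Sequence, Union


def preferred_chunks_from_storage(
    dims: Sequence[str],
    chunking: Union[Sequence[int], str, None],
) -> dict[str, int]:
    """Map dimension names to on-disk chunk sizes (hypothetical helper)."""
    # Unchunked (contiguous) variables have no preferred chunking to report.
    if chunking is None or isinstance(chunking, str):
        return {}
    if len(dims) != len(chunking):
        raise ValueError("need exactly one chunk size per dimension")
    return dict(zip(dims, chunking))


# Illustrative values only: a 3-D variable stored in 1 x 256 x 256 chunks.
encoding = {
    "preferred_chunks": preferred_chunks_from_storage(
        ("time", "y", "x"), (1, 256, 256)
    )
}
print(encoding)
```

Keying the mapping by dimension name rather than by axis position keeps it meaningful even if the variable's dimensions are later reordered.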

Closes #1440 (If a NetCDF file is chunked on disk, open it with compatible dask chunks)
Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst
New functions/methods are listed in api.rst

mraspaud · 2023-06-28T08:45:35Z

Hope this is up to the standards! Please tell me if there is anything I can do to improve this PR.

djhoese · 2023-06-28T13:47:15Z

Is the resulting chunk size a multiple of the on-disk file chunk size, or is it exactly the file chunk size? So if I open a file with a file chunk size of 256, do I get a dask array with chunk size 256 or some larger chunk size that is a multiple of 256? By itself, 256 would perform worse than one large chunk for a lot of array sizes just from the overhead.
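The overhead concern can be made concrete with a little arithmetic: the number of chunks, and hence roughly the number of dask tasks per operation, grows quickly as the chunk size shrinks. The sketch below uses made-up array and chunk sizes, and `count_chunks` is a hypothetical helper.

```python
from __future__ import annotations

import math


def count_chunks(shape: tuple[int, ...], chunks: tuple[int, ...]) -> int:
    """Number of chunks needed to tile `shape` with blocks of size `chunks`."""
    return math.prod(math.ceil(n / c) for n, c in zip(shape, chunks))


shape = (8192, 8192)                       # illustrative array size
small = count_chunks(shape, (256, 256))    # on-disk chunk size used as-is
large = count_chunks(shape, (2048, 2048))  # a multiple of the on-disk size
print(small, large)  # 1024 16
```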

mraspaud · 2023-06-28T14:15:24Z

@djhoese If chunks={}, the same chunk size as the on-disk file will be used. If chunks="auto", a multiple of the on-disk chunk size will be used to accommodate the dask chunk size preferences. This is in line with what the open_dataset docstring says, as I understand it.
For our case, using chunks="auto" with this patch resulted in a significant performance boost in processing the data as compared to the main branch.
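One way to picture the chunks="auto" behaviour described here is to round a target chunk length down to a whole multiple of the on-disk chunk length. The sketch below is not dask's or xarray's actual algorithm; `align_to_disk_chunk` and the numbers are assumptions chosen for illustration.

```python
def align_to_disk_chunk(disk_chunk: int, target: int, dim_size: int) -> int:
    """Round `target` down to a multiple of `disk_chunk` (at least one
    on-disk chunk), capped at `dim_size`."""
    if disk_chunk <= 0 or target <= 0 or dim_size <= 0:
        raise ValueError("all sizes must be positive")
    # Never go below one on-disk chunk, so reads stay aligned.
    multiple = max(1, target // disk_chunk)
    # A chunk cannot be longer than the dimension itself.
    return min(multiple * disk_chunk, dim_size)


print(align_to_disk_chunk(256, 1000, 4096))  # 768
print(align_to_disk_chunk(256, 100, 4096))   # 256
print(align_to_disk_chunk(256, 1000, 500))   # 500
```

Keeping chunk boundaries on multiples of the stored chunk size means a single on-disk chunk never has to be read for two different dask chunks.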

jhamman

Thanks for taking this on @mraspaud. This has been something folks have wanted for some time now!

xarray/tests/test_backends.py

dcherian · 2023-07-26T03:21:02Z

Found 5 errors in 1 file (checked 142 source files)
xarray/tests/test_backends.py:1568: error: Need type annotation for "tmp_file"  [var-annotated]
xarray/tests/test_backends.py:1577: error: Argument 1 to "contextmanager" has incompatible type "Callable[[NetCDF4Base, Any, Any], None]"; expected "Callable[[NetCDF4Base, Any, Any], Iterator[<nothing>]]"  [arg-type]
xarray/tests/test_backends.py:1578: error: The return type of a generator function should be "Generator" or one of its supertypes  [misc]
xarray/tests/test_backends.py:1599: error: Need type annotation for "tmp_file"  [var-annotated]

@Illviljan @headtr1ck help!

headtr1ck · 2023-07-26T05:12:57Z

xarray/tests/test_backends.py

+                    assert all(np.asanyarray(chunksizes) == expected)
+
+    @contextlib.contextmanager
+    def create_chunked_file(self, array_shape, chunk_sizes) -> None:


Suggested change

-    def create_chunked_file(self, array_shape, chunk_sizes) -> None:
+    def create_chunked_file(self, array_shape: tuple[int, int, int], chunk_sizes: tuple[int, int, int]) -> Generator[str, None, None]:

I haven't checked it though, but it should point you in the right direction.
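For context, a function decorated with contextlib.contextmanager is a generator, so mypy expects its return annotation to describe what it yields rather than None. The sketch below is a standalone illustration of that pattern, not the test helper from this PR; `temporary_path` and the file suffix are hypothetical.

```python
from __future__ import annotations

import contextlib
import os
import tempfile
from collections.abc import Generator


@contextlib.contextmanager
def temporary_path(suffix: str = ".nc") -> Generator[str, None, None]:
    """Yield a temporary file path and remove the file on exit."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # only the path is needed, so close the open descriptor
    try:
        yield path  # the yielded type (str) matches Generator[str, None, None]
    finally:
        if os.path.exists(path):
            os.remove(path)


with temporary_path() as tmp_file:
    # mypy infers `tmp_file: str` from the annotation above
    print(os.path.exists(tmp_file))
```

The annotation has to name whatever the function actually yields: if the helper yielded an opened dataset instead of a path, the first type parameter would need to change accordingly.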

mraspaud · 2023-08-14T07:17:13Z

I attempted to fix the type annotation problem, please tell me if there is more.

mraspaud · 2023-08-21T07:08:45Z

@dcherian @jhamman anything more I can do on this?

dcherian

LGTM! Thanks for your patience here @mraspaud

Can you add a note to whats-new to advertise this great improvement please?

xarray/tests/test_backends.py

mraspaud · 2023-08-31T07:47:25Z

What's new is now updated.

dcherian · 2023-09-08T15:37:38Z

The mypy failures are related to these changes, I think:

xarray/tests/test_backends.py:1559: error: "str" has no attribute "data"  [attr-defined]
xarray/tests/test_backends.py:1559: error: Invalid index type "str" for "str"; expected type "SupportsIndex | slice"  [index]
xarray/tests/test_backends.py:1576: error: "str" has no attribute "data"  [attr-defined]
xarray/tests/test_backends.py:1576: error: Invalid index type "str" for "str"; expected type "SupportsIndex | slice"  [index]
xarray/tests/test_backends.py:1610: error: "str" has no attribute "encoding"  [attr-defined]
xarray/tests/test_backends.py:1610: error: Invalid index type "str" for "str"; expected type "SupportsIndex | slice"  [index]

@Illviljan can you take a look please?

xarray/tests/test_backends.py

Illviljan · 2023-09-11T23:06:30Z

Thanks, @mraspaud!

mraspaud · 2023-09-12T09:01:02Z

My pleasure! Thanks for merging!

mraspaud added 2 commits June 28, 2023 10:30

Write failing test

ca0ebea

Add preferred chunks to netcdf 4 backends

27d6693

github-actions bot added topic-backends io labels Jun 28, 2023

mraspaud changed the title ~~Implement peferred_chunks for netcdf 4 backend~~ Implement preferred_chunks for netcdf 4 backend Jun 28, 2023

mraspaud changed the title ~~Implement preferred_chunks for netcdf 4 backend~~ Implement preferred_chunks for netcdf 4 backends Jun 28, 2023

ghiggi mentioned this pull request Jun 28, 2023

open_dataset with chunks="auto" fails when a netCDF4 variables/coordinates is encoded as NC_STRING #7868

Closed

jhamman reviewed Jun 28, 2023

xarray/tests/test_backends.py (resolved)

Illviljan added the run-benchmark Run the ASV benchmark workflow label Jun 28, 2023

mraspaud added 3 commits June 29, 2023 13:16

Add unit tests for preferred chunking

049b0f1

Fix formatting

dfbdbf7

Require dask for a couple of chunking tests

0e89353

jhamman reviewed Jun 30, 2023

xarray/tests/test_backends.py (outdated, resolved)

Use xarray's interface to create a test chunked nc file

29c93b8

dcherian requested a review from jhamman July 3, 2023 15:40

dcherian and others added 3 commits July 5, 2023 09:47

Merge branch 'main' into feature-nc-preferred-chunks

d46896c

[pre-commit.ci] auto fixes from pre-commit.com hooks

996ac29

for more information, see https://pre-commit.ci

Merge branch 'main' into feature-nc-preferred-chunks

d0f3f92

headtr1ck reviewed Jul 26, 2023

jhamman mentioned this pull request Jul 26, 2023

Specify chunks in bytes #8021

Open

mraspaud and others added 4 commits August 14, 2023 09:02

Fix type annotations

a6d922a

[pre-commit.ci] auto fixes from pre-commit.com hooks

46ee947

for more information, see https://pre-commit.ci

Import Generator

c011c53

[pre-commit.ci] auto fixes from pre-commit.com hooks

c835a81

for more information, see https://pre-commit.ci

dcherian approved these changes Aug 31, 2023

dcherian reviewed Aug 31, 2023

xarray/tests/test_backends.py (outdated, resolved)

dcherian added the plan to merge Final call for comments label Aug 31, 2023

mraspaud added 2 commits August 31, 2023 09:44

Use roundtrip

2c643da

Add news about the new feature

4903dec

Merge branch 'main' into feature-nc-preferred-chunks

4b81a76

dcherian removed the plan to merge Final call for comments label Sep 9, 2023

Illviljan reviewed Sep 11, 2023

xarray/tests/test_backends.py (outdated, resolved)

Update xarray/tests/test_backends.py

6ee5ace

Illviljan reviewed Sep 11, 2023

xarray/tests/test_backends.py (outdated, resolved)

Illviljan added 4 commits September 11, 2023 18:51

Update xarray/tests/test_backends.py

463cf60

Merge branch 'main' into pr/7948

af27333

Move whats new line

8864c11

Merge branch 'main' into pr/7948

8d1a140

Illviljan enabled auto-merge (squash) September 11, 2023 18:30

Illviljan disabled auto-merge September 11, 2023 23:05

Illviljan merged commit de66dae into pydata:main Sep 11, 2023

mraspaud deleted the feature-nc-preferred-chunks branch September 12, 2023 09:00

This was referenced Sep 12, 2023

Parallel read with MPI #6919

Closed

Memory Leak open_mfdataset #5585

Closed

dougiesquire mentioned this pull request Oct 12, 2023

Add 3rd method for computing meridional heat transport using velocities and temperature fields COSIMA/cosima-recipes#285

Merged

dave-andersen mentioned this pull request Nov 29, 2023

performance regression 2023.08 -> 2023.09 to_zarr from netcdf4 open_mfdataset #8490

Closed

5 tasks

weiji14 mentioned this pull request Jan 31, 2024

xarray.open_dataset with chunks={} returns a single chunk and not engine (h5netcdf) preferred chunks #8691

Closed

5 tasks

charles-turner-1 mentioned this pull request Apr 8, 2025

[Catalog utility functions] find_chunking_info ACCESS-NRI/access-nri-intake-catalog#218

Closed

7 participants
