[Refactor] Improve the performance of temporal group averaging #689

tomvothecoder merged 8 commits into `main` from `refactor/688-temp-api-perf`
Conversation
Force-pushed from 7594df4 to 0d56ed5
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@           Coverage Diff            @@
##              main     #689   +/-   ##
=========================================
  Coverage   100.00%  100.00%
=========================================
  Files           15       15
  Lines         1544     1546       +2
=========================================
+ Hits          1544     1546       +2
```

View full report in Codecov by Sentry.
Replace `.load()` with `.astype("timedelta64[ns]")` for clarity
tomvothecoder left a comment:

My initial self-review. The GH Actions build is passing.
```diff
-# 5. Calculate the departures for the data variable.
-# ----------------------------------------------------------------------
-# This step allows us to perform xarray's grouped arithmetic to
-# calculate departures.
-dv_obs = ds_obs[data_var].copy()
-self._labeled_time = self._label_time_coords(dv_obs[self.dim])
-dv_obs_grouped = self._group_data(dv_obs)
-
-# 5. Align time dimension names using the labeled time dimension name.
-# ----------------------------------------------------------------------
-# The climatology's time dimension is renamed to the labeled time
-# dimension in step #4 above (e.g., "time" -> "season"). xarray requires
-# dimension names to be aligned to perform grouped arithmetic, which we
-# use for calculating departures in step #5. Otherwise, this error is
-# raised: "`ValueError: incompatible dimensions for a grouped binary
-# operation: the group variable '<FREQ ARG>' is not a dimension on the
-# other argument`".
-dv_climo = ds_climo[data_var]
-dv_climo = dv_climo.rename({self.dim: self._labeled_time.name})
-
-# 6. Calculate the departures for the data variable.
-# ----------------------------------------------------------------------
-# departures = observation - climatology
-with xr.set_options(keep_attrs=True):
-    dv_departs = dv_obs_grouped - dv_climo
-
-dv_departs = self._add_operation_attrs(dv_departs)
-ds_obs[data_var] = dv_departs
+ds_departs = self._calculate_departures(ds_obs, ds_climo, data_var)
```
Refactored this block of code into `self._calculate_departures()` for readability.
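As context for the grouped-arithmetic approach used in the departures calculation, here is a minimal, self-contained sketch on hypothetical data (a toy monthly series, not xcdat's internals) showing how xarray computes departures as observation minus a per-group climatology:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy monthly series spanning two years (stand-in for the observational data).
time = pd.date_range("2000-01-01", periods=24, freq="MS")
obs = xr.DataArray(np.arange(24.0), coords={"time": time}, dims="time")

# Monthly climatology: the mean of each calendar month across both years.
climo = obs.groupby("time.month").mean()

# departures = observation - climatology, via xarray's grouped binary
# arithmetic (each time step is matched against its month's climatology).
departs = obs.groupby("time.month") - climo
```

For January, the two values 0.0 and 12.0 have a climatology of 6.0, so the departures are -6.0 and +6.0.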
```python
self._labeled_time = self._label_time_coords(dv[self.dim])
dv = dv.assign_coords({self.dim: self._labeled_time})
```
Addresses bottleneck #1 from the PR description: replace the time coordinates with the labeled time coordinates directly for grouping, rather than adding the labeled time coordinates as auxiliary coordinates on the time dimension (which slows things down in xarray for some reason; I need to ask the xarray forum).
```diff
 # warning please use the scalar types `np.float64`, or string notation.`
-if isinstance(time_lengths.data, Array):
-    time_lengths.load()
+time_lengths = time_lengths.astype("timedelta64[ns]")
```
Addresses bottleneck #2 from the PR description.
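A minimal sketch of the casting approach on made-up data (a standalone example, not the actual `_get_weights()` code): casting to `"timedelta64[ns]"` and then `float64` stays lazy for Dask-backed arrays, instead of forcing an eager `.load()`:

```python
import numpy as np
import xarray as xr

# Toy month lengths (stand-in for the differences of the time bounds).
time_lengths = xr.DataArray(
    np.array([31, 28, 31], dtype="timedelta64[D]"),
    dims="time",
    name="time_lengths",
)

# Cast to nanosecond precision first (sidesteps NumPy's warning about
# ambiguous timedelta units), then to float64 to enable arithmetic.
time_lengths = time_lengths.astype("timedelta64[ns]").astype(np.float64)
weights = time_lengths / time_lengths.sum()
```

Both `astype` calls are elementwise and dtype-only, so on a Dask-backed `DataArray` they build lazy tasks rather than pulling the data into memory.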
```python
dv = dv.assign_coords({self.dim: self._labeled_time})
dv_gb = dv.groupby(self.dim)
```
Addresses bottleneck #1 from the PR description: replace the time coordinates with the labeled time coordinates directly for grouping, rather than adding the labeled time coordinates as auxiliary coordinates on the time dimension (which slows things down in xarray for some reason; I need to ask the xarray forum).
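A small illustration of this "fast path" on hypothetical data (season labels stand in for the labeled time coordinates produced by `_label_time_coords()`): the dimension coordinate itself is overwritten with the group labels, and grouping happens on the dimension rather than on an auxiliary coordinate:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Six monthly time steps with values 0..5.
time = pd.date_range("2000-01-01", periods=6, freq="MS")
da = xr.DataArray(np.arange(6.0), coords={"time": time}, dims="time")

# Group labels: the season of each time step ("DJF", "MAM", "JJA").
labels = da["time"].dt.season.values

# Overwrite the "time" dimension coordinate with the labels, then group
# by the dimension itself instead of an auxiliary coordinate.
da_labeled = da.assign_coords(time=labels)
result = da_labeled.groupby("time").mean()
```

Jan/Feb fall in "DJF" (mean 0.5), Mar/Apr/May in "MAM" (mean 3.0), and Jun in "JJA" (5.0).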
```diff
 time_grouped = xr.DataArray(
-    name="_".join(df_dt_components.columns),
+    name=self.dim,
     data=dt_objects,
-    coords={self.dim: time_coords[self.dim]},
+    coords={self.dim: dt_objects},
     dims=[self.dim],
     attrs=time_coords[self.dim].attrs,
 )
```
Addresses bottleneck #1 from the PR description.
```diff
-if self._mode in ["group_average", "climatology"]:
-    self._weights = self._weights.rename({self.dim: f"{self.dim}_original"})
-    # Only keep the original time coordinates, not the ones labeled
-    # by group.
-    self._weights = self._weights.drop_vars(self._labeled_time.name)
+weights = self._weights.assign_coords({self.dim: self._dataset[self.dim]})
+weights = weights.rename({self.dim: f"{self.dim}_original"})

-ds[self._weights.name] = self._weights
+ds[weights.name] = weights
```
Reassign the original, unlabeled time coordinates back to the weights `xr.DataArray`, then rename its dimension to `"time_original"` to avoid conflicting with the labeled time coordinates (now called `"time"`).
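A sketch of that renaming step on toy data (the `time_wts` and `time_original` names follow the PR's convention; the values are made up):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Weights defined on the original (unlabeled) time steps.
time = pd.date_range("2000-01-01", periods=4, freq="MS")
weights = xr.DataArray(
    np.full(4, 0.25), coords={"time": time}, dims="time", name="time_wts"
)

# Rename the dimension so the weights can live alongside the grouped
# output, whose labeled dimension is now called "time".
weights = weights.rename({"time": "time_original"})
ds = weights.to_dataset()
```

After the rename, `ds` carries the weights on `time_original` with no `time` dimension to clash with the labeled one.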
```python
dv_departs = dv_departs.assign_coords({self.dim: ds_obs[self.dim]})
ds_departs[data_var] = dv_departs
```
Reassign the grouped, unlabeled time coordinates back to the final departures' time coordinates (since the labeled, grouped time coordinates sometimes drop the year from the time coordinates).
Hi @chengzhuzhang, this PR is ready for review. After refactoring, I managed to cut down the runtime as follows:
I also performed a regression test using the same e3sm_diags dataset between the two branches.

Benchmarking script:

```python
# %%
import xarray as xr
import xcdat as xc

### 1. Using temporal.climatology from xcdat
file_path = "/global/cfs/cdirs/e3sm/e3sm_diags/postprocessed_e3sm_v2_data_for_e3sm_diags/20221103.v2.LR.amip.NGD_v3atm.chrysalis/arm-diags-data/PRECT_sgpc1_198501_201412.nc"
ds = xc.open_dataset(file_path)
branch = "dev"

# %%
# 1. Calculate annual climatology
# -------------------------------
ds_annual_cycle = ds.temporal.climatology("PRECT", "month", keep_weights=True)
ds_annual_cycle.to_netcdf(f"temporal_climatology_{branch}.nc")

"""
main
--------------------------
CPU times: user 33 s, sys: 2.41 s, total: 35.4 s
Wall time: 35.4 s

refactor/688-temp-api-perf
--------------------------
CPU times: user 5.85 s, sys: 2.88 s, total: 8.72 s
Wall time: 8.78 s
"""

# %%
# 2. Calculate annual departures
# ------------------------------
ds_annual_cycle_anom = ds.temporal.departures("PRECT", "month", keep_weights=True)
ds_annual_cycle_anom.to_netcdf(f"temporal_departures_{branch}.nc")

"""
main
--------------------------
CPU times: user 1min 9s, sys: 4.8 s, total: 1min 14s
Wall time: 1min 14s

refactor/688-temp-api-perf
--------------------------
CPU times: user 11.6 s, sys: 4.32 s, total: 15.9 s
Wall time: 15.9 s
"""

# %%
# 3. Calculate monthly group averages
# -----------------------------------
ds_annual_avg = ds.temporal.group_average("PRECT", "month", keep_weights=True)
ds_annual_avg.to_netcdf(f"temporal_group_average_{branch}.nc")

"""
main
--------------------------
CPU times: user 33.5 s, sys: 2.27 s, total: 35.8 s
Wall time: 35.9 s

refactor/688-temp-api-perf
--------------------------
CPU times: user 5.59 s, sys: 2.06 s, total: 7.65 s
Wall time: 7.65 s
"""
```

Regression testing script:

```python
import glob

import xarray as xr

# Get the filepaths for the dev and main branches
dev_filepaths = sorted(glob.glob("qa/issue-688/dev/*.nc"))
main_filepaths = sorted(glob.glob("qa/issue-688/main/*.nc"))

for fp, mp in zip(dev_filepaths, main_filepaths):
    print(f"Comparing {fp} and {mp}")

    # Load the datasets
    dev_ds = xr.open_dataset(fp)
    main_ds = xr.open_dataset(mp)

    # Compare the datasets
    try:
        xr.testing.assert_identical(dev_ds, main_ds)
    except AssertionError as e:
        print(f"Datasets are not identical: {e}")
    else:
        print("Datasets are identical")
```

Next step
```python
weights: xr.DataArray = grouped_time_lengths / grouped_time_lengths.sum()
weights.name = f"{self.dim}_wts"

# Validate the sum of weights for each group is 1.0.
```
Checking that the sum of each weight group equals 1.0 seems like a good feature to have, but if it degrades performance a lot, we can exclude it. Maybe this check can be implemented in testing instead (if it is not included there yet). Also, the `_get_weights()` description needs to be updated to reflect that the sum is no longer validated.
We should expect the logic of `_get_weights()` to be correct, so this assertion should not be necessary at runtime (especially with the performance hit).

I like your suggestion of making it a unit test instead. I will push a commit with this change soon.
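A sketch of what such a unit test might look like (the test name and the season-based grouping are illustrative, not xcdat's actual test code):

```python
import numpy as np
import pandas as pd
import xarray as xr


def test_group_weight_sums_equal_one():
    # Month lengths for one year, to be weighted within each season group.
    time = pd.date_range("2000-01-01", periods=12, freq="MS")
    days = time.days_in_month.to_numpy().astype(np.float64)
    time_lengths = xr.DataArray(days, coords={"time": time}, dims="time")

    # Normalize each time step's length by its group's total length.
    grouped = time_lengths.groupby("time.season")
    weights = grouped / grouped.sum()

    # The weights within every group should sum to 1.0.
    sums = weights.groupby("time.season").sum()
    np.testing.assert_allclose(sums.values, np.ones(sums.size))


test_group_weight_sums_equal_one()
```

This keeps the correctness guarantee in the test suite while removing the per-call assertion from the hot path.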
xcdat/temporal.py
```diff
-if weighted and keep_weights:
+if keep_weights:
     self._weights = ds_climo.time_wts
     ds_obs = self._keep_weights(ds_obs)
```
I noticed this if statement changed from `if weighted and keep_weights`; should it be kept the same?
Thank you for catching this. I reverted the conditional.
chengzhuzhang left a comment:

Hi Tom, thank you for the PR! I think it looks great; I just have minor comments for you to consider.
- Check if sum of each weight group equals 1.0
- Update `_get_weights()` docs to remove validation portion
Description
TODO:
- In `_get_weights()`, loading time lengths into memory is slow (lines) -- replace with casting to `"timedelta64[ns]"` then `float64`
- In `_get_weights()`, performing validation to check that the sums of weights for each group add up to 1 is slow (lines) -- remove this unnecessary assertion
- Identify performance optimizations -- I don't think this is necessary right now
- Compare `groupby` with vs. without the `flox` package
- main

Checklist
If applicable: