Recommended way to extend xarray Datasets using accessors? #2473

@TomNicholas

Hi,

I'm now regularly using xarray (& dask) for organising and analysing the output of the simulation code I use (BOUT++), and it's very helpful, thank you!

However, my current approach is quite clunky when dealing with the extra information and functionality that's specific to the simulation code I'm using, and I have questions about what the recommended way to extend the xarray Dataset class is. This seems like a general enough problem that I thought I would make an issue for it.

Desired

What I ideally want to do is extend the xarray.Dataset class to accommodate extra attributes and methods, while retaining as much xarray functionality as possible and avoiding reimplementing any of the API. This might not be possible, but ideally I want to make a BoutDataset class which has extra attributes to hold information about the run that doesn't naturally fit into the xarray data model, and extra methods to perform analysis/plotting that only users of this code would require, while still being usable with xarray-specific methods and top-level functions:

bd = BoutDataset('/path/to/data')

ds = bd.data  # access the wrapped xarray dataset
extra_data = bd.extra_data  # access the BOUT-specific data

bd.isel(time=-1)  # use xarray dataset methods

bd2 = BoutDataset('/path/to/other/data')
concatenated_bd = xr.concat([bd, bd2])  # apply top-level xarray functions to the data

bd.plot_tokamak()  # methods implementing bout-specific functionality

Problems with my current approach

I have read the documentation about extending xarray, and the issue threads about subclassing Datasets (#706) and accessors (#1080), but I wanted to check that what I'm doing is the recommended approach.

Right now I'm trying to do something like

@xr.register_dataset_accessor('bout')
class BoutDataset:
    def __init__(self, path):
        self.data = collect_data(path)  # collect all my numerical data from output files
        self.extra_data = read_extra_data(path)  # collect extra data about the simulation 

    def plot_tokamak(self):
        plot_in_bout_specific_way(self.data, self.extra_data)

which works in the sense that I can do

bd = BoutDataset('/path/to/data')

ds = bd.bout.data  # access the wrapped xarray dataset
extra_data = bd.bout.extra_data  # access the BOUT-specific data
bd.bout.plot_tokamak()  # methods implementing bout-specific functionality

but not so well with

bd.isel(time=-1)  # AttributeError: 'BoutDataset' object has no attribute 'isel'
bd.bout.data.isel(time=-1)  # have to do this instead, but this returns an xr.Dataset not a BoutDataset

concatenated_bd = xr.concat([bd1, bd2])  # TypeError: can only concatenate xarray Dataset and DataArray objects, got <class 'BoutDataset'>
concatenated_ds = xr.concat([bd1.bout.data, bd2.bout.data])  # again have to do this instead, which again returns an xr.Dataset not a BoutDataset

If I have to reimplement the API for methods like .isel() and top-level functions like concat(), then why should I not just subclass xr.Dataset?

There aren't very many top-level xarray functions, so reimplementing them would be okay, but there are loads of Dataset methods. However, I think I know how I want my BoutDataset class to behave when an xr.Dataset method is called on it: I want it to call that method on the underlying dataset and return the full BoutDataset with the extra data and attributes still attached.

Is it possible to do something like:
"if calling an xr.Dataset method on an instance of BoutDataset, call the corresponding method on the wrapped dataset and return a BoutDataset that has the extra BOUT-specific data propagated through"?
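
One way the behaviour described above could be sketched is with Python's `__getattr__` hook, which is only called when normal attribute lookup fails. This is a rough illustration under stated assumptions, not a recommended xarray pattern: the `BoutDataset` constructor signature used here (an already-loaded dataset plus `extra_data`) is hypothetical, and `FakeDataset` is a made-up stand-in for `xr.Dataset` so that the example runs without any data files.

```python
class BoutDataset:
    """Hypothetical wrapper that forwards unknown attributes to the wrapped dataset."""

    def __init__(self, data, extra_data=None):
        self.data = data              # the wrapped dataset (e.g. an xr.Dataset)
        self.extra_data = extra_data  # BOUT-specific information

    def __getattr__(self, name):
        # Only reached when normal lookup fails, so BoutDataset's own
        # attributes and methods always take precedence.
        if name in ("data", "extra_data"):
            # Avoid infinite recursion if these are not set yet (e.g. during copying).
            raise AttributeError(name)
        attr = getattr(self.data, name)
        if not callable(attr):
            return attr

        def wrapper(*args, **kwargs):
            result = attr(*args, **kwargs)
            # Re-wrap results that have the same type as the wrapped dataset,
            # carrying the extra data along; pass anything else through unchanged.
            if isinstance(result, type(self.data)):
                return type(self)(result, self.extra_data)
            return result

        return wrapper


class FakeDataset:
    """Made-up stand-in for xr.Dataset, used only to keep the sketch self-contained."""

    def __init__(self, values):
        self.values = list(values)

    def isel(self, time):
        return FakeDataset([self.values[time]])

    def mean(self):
        return sum(self.values) / len(self.values)


bd = BoutDataset(FakeDataset([1.0, 2.0, 3.0]), extra_data={"grid": "tokamak"})
last = bd.isel(time=-1)   # delegated to the wrapped object, then re-wrapped
average = bd.mean()       # result is not a dataset, so it is returned as-is
```

In this sketch `last` is again a `BoutDataset` with the same `extra_data` attached. Delegation of this kind only covers method calls made on the wrapper itself; top-level functions such as `xr.concat` would still need the wrapped datasets to be passed in explicitly and the result re-wrapped by hand.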

Thanks in advance, and apologies if this is either impossible or relatively trivial; I just thought other xarray users might have the same questions.
