
Derived variables save all data to disk before preprocessing #377

@ledm

Description


When running the derivation, the derive preprocessor saves the loaded data to disk as a netCDF file in the preproc directory. This isn't ideal: writing to disk is extremely slow, and it also results in huge, bloated preproc directories. Furthermore, since the data isn't realised, I don't see why we can't keep it lazy and avoid saving it to disk.

For instance, I'm running a multi-model comparison of @tillku's derived variable Ocean Heat Content (ohc.py) between the years 1960 and 2014 (or 2005 for CMIP5). For each model in my recipe, the derivation preprocessor loads a thetao variable, applies the relevant fix, then saves the data as a netCDF file. Each of these files is on the order of 20 GB and takes ~10 minutes to write on JASMIN. A small run would be around 5 models, so an hour of waiting and 100 GB of disk space. A real publishable run would be closer to 300 GB and 2-3 hours.

The frustration is compounded by the fact that the preprocessor loads all the thetao files first (which all work), then tries to load the volcello files (which don't work due to several issues; ask @valeriupredoi). That part is fixed by iris=3.

It seems that this saving to disk is unnecessary. Is there any way that we can avoid it? We don't need to save to disk at every other preprocessing stage, so why do we need to do it here?
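To illustrate the idea (this is a hypothetical sketch, not ESMValCore or iris code; the file names and the derivation are placeholders), a lazy pipeline defers all I/O and computation until the final result is actually needed, so no intermediate file ever has to be written:

```python
# Hypothetical sketch of lazy derivation: "lazy" data is modelled as a
# zero-argument callable that only produces the array when called, much
# like dask-backed iris cubes only read data when realised.

def lazy_load(path):
    # Stand-in for a lazy cube load: building this does no I/O.
    def compute():
        # The expensive read would happen here, on realisation only.
        return [1.0, 2.0, 3.0]  # placeholder for the real field
    return compute

def lazy_derive(*inputs):
    # Chain the derivation on top of the lazy inputs; still no I/O,
    # and crucially no intermediate save of each input variable.
    def compute():
        data = [f() for f in inputs]
        return [sum(vals) for vals in zip(*data)]
    return compute

thetao = lazy_load("thetao.nc")      # hypothetical input files
volcello = lazy_load("volcello.nc")
ohc = lazy_derive(thetao, volcello)  # placeholder derivation

# Only this final step realises the data; a single save at the end
# would replace the per-variable intermediate netCDF writes.
result = ohc()
print(result)  # [2.0, 4.0, 6.0]
```

The point is only that realisation can be postponed to the one place where output is genuinely required, rather than at each derivation input.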

This continues a conversation with @bouweandela in issue #42.
