
Derived variables save all data to disk before preprocessing #377

@ledm

Description


When running the derivation, the derive preprocessor saves the loaded data to disk as a netCDF file in the preproc directory. This isn't ideal: writing to disk is extremely slow, and it also results in huge, bloated preproc directories. Furthermore, since the data isn't realised, I don't see why we can't keep it lazy and avoid saving it to disk.

For instance, I'm running a multi-model comparison of @tillku's derived variable Ocean Heat Content (ohc.py) between the years 1960 and 2014 (or 2005 for CMIP5). For each model in my recipe, the derivation preprocessor loads a thetao variable, applies the relevant fix, then saves the data as a netCDF file. Each of these files is on the order of 20 GB and takes ~10 minutes to write on JASMIN. A small run would be around 5 models, so an hour of waiting and 100 GB of disk space. A real publishable run would be closer to 300 GB and 2-3 hours.

The frustration is compounded by the fact that the preprocessor loads all the thetao files first (which all work), then tries to load the volcello files (which don't work due to several issues; ask @valeriupredoi). That part is fixed by iris=3.

It seems that this saving to disk is unnecessary. Is there any way that we can avoid it? We don't need to save to disk at every other preprocessing stage, so why do we need to do it here?
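To illustrate the idea (this is a hypothetical sketch, not ESMValCore or iris code; the file names and the derivation are placeholders), a lazy pipeline defers all I/O and computation until the final result is actually needed, so no intermediate file ever has to be written:

```python
# Hypothetical sketch of lazy derivation: "lazy" data is modelled as a
# zero-argument callable that only produces the array when called, much
# like dask-backed iris cubes only read data when realised.

def lazy_load(path):
    # Stand-in for a lazy cube load: building this does no I/O.
    def compute():
        # The expensive read would happen here, on realisation only.
        return [1.0, 2.0, 3.0]  # placeholder for the real field
    return compute

def lazy_derive(*inputs):
    # Chain the derivation on top of the lazy inputs; still no I/O,
    # and crucially no intermediate save of each input variable.
    def compute():
        data = [f() for f in inputs]
        return [sum(vals) for vals in zip(*data)]
    return compute

thetao = lazy_load("thetao.nc")      # hypothetical input files
volcello = lazy_load("volcello.nc")
ohc = lazy_derive(thetao, volcello)  # placeholder derivation

# Only this final step realises the data; a single save at the end
# would replace the per-variable intermediate netCDF writes.
result = ohc()
print(result)  # [2.0, 4.0, 6.0]
```

The point is only that realisation can be postponed to the one place where output is genuinely required, rather than at each derivation input.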

This continues a conversation with @bouweandela in issue #42.
