Conversation

@tovogt (Collaborator) commented Jan 18, 2022

For Hazard and Exposure objects, CLIMADA comes with handy methods to read from and write to HDF5 files. For TCTracks objects, users so far have to either use pickle (which has its own issues) or the write_netcdf functionality.

The problem with the NetCDF IO is that all tracks are stored in separate files. On many setups, especially with network file systems, access times for many small files can be extremely poor.

That's why this PR adds IO methods write_hdf and from_hdf for TCTracks, in line with the corresponding functionality for Hazard and Exposure objects. All tracks are stored in a single HDF5 file.
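A minimal usage sketch (the method names write_hdf and from_hdf are the ones added in this PR; the IBTrACS reader call and file name are only illustrative, and exact signatures may differ):

```python
from climada.hazard import TCTracks

# Load some tracks, e.g. from IBTrACS (illustrative call, not part of this PR):
tracks = TCTracks.from_ibtracs_netcdf(year_range=(2000, 2001))

# Write all tracks into a single HDF5 file, then restore them from it:
tracks.write_hdf("tc_tracks.hdf5")
tracks_restored = TCTracks.from_hdf("tc_tracks.hdf5")
```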

@tovogt requested a review from chahank, January 18, 2022 10:59
@chahank (Member) left a comment

Excellent addition!

Note for future: such a method would also be very useful for other CLIMADA objects, in particular the Impact and CostBenefit objects.

@tovogt (Collaborator, Author) commented Jan 18, 2022

Thanks for the quick feedback! I think I addressed all of your comments in my latest commit.

@tovogt changed the title from "HDF5 file IO for TCTracks" to "WIP: HDF5 file IO for TCTracks", Jan 19, 2022
@tovogt (Collaborator, Author) commented Jan 19, 2022

I have to mark this as WIP because I found that there are alternative implementations (data formats) and I'm not sure which one to choose:

  1. Before I posted this PR here for review, I had a version based on pandas' HDF5 IO capabilities (be78053). However, I found that this produces files that are way too large. Further, it would use pickle internally to serialize strings, even if we have a fixed-length string variable like "basin". And pickle is really bad for backwards compatibility.

  2. My initial proposal (when posting this PR) was to encode each track separately as a NetCDF3 file and then store its (compressed) byte representation in the HDF5 file (a sketch of this idea follows further below). This is not very elegant since it doesn't really use the HDF5 structure.

  3. An alternative file format is in bf878c8, which produces NetCDF4-compliant HDF5 files that can also be inspected with NetCDF tools like ncdump (see the sketch right after this list). However, the resulting HDF5 file is much larger (by a factor of 4 or 5!), even with zlib compression enabled. That's because it contains huge amounts of metadata, and metadata is not compressed. In the initially proposed format (2), the NetCDF headers (metadata) are compressed as well.
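For illustration, format (3) boils down to writing each track into its own group of a single NetCDF4/HDF5 file. A minimal sketch of that idea (not the code from bf878c8; the dummy datasets and variable name merely stand in for the per-track datasets of a TCTracks object):

```python
import numpy as np
import xarray as xr

# Dummy stand-ins for the per-track xarray datasets:
tracks = [
    xr.Dataset(
        {"max_sustained_wind": ("time", np.random.rand(4))},
        coords={"time": np.arange(4)},
    )
    for _ in range(3)
]

# Append each track as a separate group of one NetCDF4 (= HDF5) file,
# enabling zlib compression for all data variables:
mode = "w"
for i, track in enumerate(tracks):
    track.to_netcdf(
        "tc_tracks.hdf5",
        mode=mode,
        group=f"track{i:06d}",
        engine="netcdf4",
        encoding={var: {"zlib": True} for var in track.data_vars},
    )
    mode = "a"
```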

When gzipping the files in format (2), the size problem is more than solved. However, it's not possible to read and write gzipped HDF5 files directly with xarray. We would first need to write the uncompressed data to disk and then gzip it. Similarly, when reading the file, we would need to decompress the data on disk (which requires a lot of space) and then read it from there.
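For reference, the core idea of format (2) could be sketched as follows (hypothetical helper names, not the code of this PR): each track is serialized to NetCDF3 bytes in memory and stored as a gzip-compressed opaque dataset in one HDF5 file, which is why the NetCDF headers end up compressed along with the data.

```python
import h5py
import numpy as np
import xarray as xr

def write_tracks_as_blobs(tracks, path):
    """Store each track's in-memory NetCDF3 serialization as a
    gzip-compressed byte array in a single HDF5 file."""
    with h5py.File(path, "w") as hf:
        for i, track in enumerate(tracks):
            raw = np.frombuffer(track.to_netcdf(), dtype=np.uint8)
            hf.create_dataset(f"track{i:06d}", data=raw, compression="gzip")

def read_tracks_from_blobs(path):
    """Re-open every stored byte blob as an xarray.Dataset."""
    with h5py.File(path, "r") as hf:
        return [xr.open_dataset(hf[key][()].tobytes()) for key in sorted(hf)]
```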

@tovogt changed the title from "WIP: HDF5 file IO for TCTracks" to "HDF5 file IO for TCTracks", Jan 19, 2022
@tovogt (Collaborator, Author) commented Jan 19, 2022

@chahank I'm sorry, but I have to ask again for your review. I didn't change the API or the tests, but I changed the file format and the implementation. The produced HDF5 file is now completely NetCDF4-compliant and can be read with external NetCDF tools like ncdump. This is much more elegant for most purposes, but requires 4-5 times more space on disk than the data format you reviewed at first. For example, storing the approx. 3000 tracks from the global IBTrACS data set in 1980-2019 requires 87 MB on disk [*]. This doesn't sound like a lot, but when generating 100 synthetic random-walk tracks for each event, you will end up with almost 9 GB of data.

However, users who have problems with disk space can still manually gzip their files, which will usually save 90% of the disk space.
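For instance, a plain round-trip with the Python standard library (file names are placeholders):

```python
import gzip
import shutil

# Compress the HDF5 file after writing it:
with open("tc_tracks.hdf5", "rb") as src, gzip.open("tc_tracks.hdf5.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)

# ...and decompress it to disk again before reading, since HDF5 readers
# cannot open the gzipped file directly:
with gzip.open("tc_tracks.hdf5.gz", "rb") as src, open("tc_tracks.hdf5", "wb") as dst:
    shutil.copyfileobj(src, dst)
```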

[*] For comparison, when storing the 3000 tracks in separate NetCDF files using the TCTracks.write_netcdf functionality, the total disk space required is 105 MB.

@chahank (Member) commented Jan 19, 2022

Thanks @tovogt. I went through the code and made a few remarks.

I think having a method that makes using the files easier and more compatible, at the cost of disk space, is the right choice. And, as you said, users for whom disk space is critical can still apply extra compression themselves.

@chahank (Member) commented Jan 19, 2022

Looks good to go to me, thanks for the updates!

@tovogt merged commit 24d9e26 into develop, Jan 20, 2022
@tovogt deleted the feature/tc_tracks_hdf branch, January 20, 2022 10:03
@tovogt mentioned this pull request, Jun 6, 2023