Conversation

@tovogt (Collaborator) commented Jan 18, 2022

For Hazard and Exposure objects, CLIMADA comes with handy methods to read from and write to HDF5 files. For TCTracks objects, users so far have to either use pickle (which has its own issues) or the write_netcdf functionality.

The problem with the NetCDF IO is that all tracks are stored in separate files. On many setups, especially with network file systems, access times for many small files can be extremely poor.

That's why this PR adds IO methods write_hdf and from_hdf for TCTracks, in line with the corresponding functionality for Hazard and Exposure objects. All tracks are stored in a single HDF5 file.
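A minimal usage sketch (the method names write_hdf and from_hdf are the ones added in this PR; the IBTrACS reader call and file name are only illustrative, and exact signatures may differ):

```python
from climada.hazard import TCTracks

# Load some tracks, e.g. from IBTrACS (illustrative call, not part of this PR):
tracks = TCTracks.from_ibtracs_netcdf(year_range=(2000, 2001))

# Write all tracks into a single HDF5 file, then restore them from it:
tracks.write_hdf("tc_tracks.hdf5")
tracks_restored = TCTracks.from_hdf("tc_tracks.hdf5")
```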

@tovogt requested a review from chahank, January 18, 2022 10:59
@chahank (Member) left a comment

Excellent addition!

Note for future: such a method would also be very useful for other CLIMADA objects, in particular the Impact and CostBenefit objects.

@tovogt (Collaborator, Author) commented Jan 18, 2022

Thanks for the quick feedback! I think I addressed all of your comments in my latest commit.

@tovogt changed the title from "HDF5 file IO for TCTracks" to "WIP: HDF5 file IO for TCTracks", Jan 19, 2022
@tovogt (Collaborator, Author) commented Jan 19, 2022

I have to mark this as WIP because I found that there are alternative implementations (data formats) and I'm not sure which one to choose:

  1. Before I posted this PR here for review, I had a version based on pandas' HDF5 IO capabilities (be78053). However, I found that this produces files that are way too large. Further, it would use pickle internally to serialize strings, even if we have a fixed-length string variable like "basin". And pickle is really bad for backwards compatibility.

  2. My initial proposal (when posting this PR) was to encode each track separately as a NetCDF3 file and then store its (compressed) byte representation in the HDF5 file (a sketch of this idea follows further below). This is not very elegant since it doesn't really use the HDF5 structure.

  3. An alternative file format is in bf878c8, which produces NetCDF4-compliant HDF5 files that can also be inspected with NetCDF tools like ncdump (see the sketch right after this list). However, the resulting HDF5 file is much larger (by a factor of 4 or 5!), even with zlib compression enabled. That's because it contains huge amounts of metadata, and metadata is not compressed. In the initially proposed format (2), the NetCDF headers (metadata) are compressed as well.
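For illustration, format (3) boils down to writing each track into its own group of a single NetCDF4/HDF5 file. A minimal sketch of that idea (not the code from bf878c8; the dummy datasets and variable name merely stand in for the per-track datasets of a TCTracks object):

```python
import numpy as np
import xarray as xr

# Dummy stand-ins for the per-track xarray datasets:
tracks = [
    xr.Dataset(
        {"max_sustained_wind": ("time", np.random.rand(4))},
        coords={"time": np.arange(4)},
    )
    for _ in range(3)
]

# Append each track as a separate group of one NetCDF4 (= HDF5) file,
# enabling zlib compression for all data variables:
mode = "w"
for i, track in enumerate(tracks):
    track.to_netcdf(
        "tc_tracks.hdf5",
        mode=mode,
        group=f"track{i:06d}",
        engine="netcdf4",
        encoding={var: {"zlib": True} for var in track.data_vars},
    )
    mode = "a"
```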

When gzipping the files in format (2), the size problem is more than solved. However, it's not possible to read and write gzipped HDF5 files directly with xarray. We would first need to write the uncompressed data to disk and then gzip it. Similarly, when reading the file, we would need to decompress the data on disk (which requires a lot of space) and then read it from there.
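For reference, the core idea of format (2) could be sketched as follows (hypothetical helper names, not the code of this PR): each track is serialized to NetCDF3 bytes in memory and stored as a gzip-compressed opaque dataset in one HDF5 file, which is why the NetCDF headers end up compressed along with the data.

```python
import h5py
import numpy as np
import xarray as xr

def write_tracks_as_blobs(tracks, path):
    """Store each track's in-memory NetCDF3 serialization as a
    gzip-compressed byte array in a single HDF5 file."""
    with h5py.File(path, "w") as hf:
        for i, track in enumerate(tracks):
            raw = np.frombuffer(track.to_netcdf(), dtype=np.uint8)
            hf.create_dataset(f"track{i:06d}", data=raw, compression="gzip")

def read_tracks_from_blobs(path):
    """Re-open every stored byte blob as an xarray.Dataset."""
    with h5py.File(path, "r") as hf:
        return [xr.open_dataset(hf[key][()].tobytes()) for key in sorted(hf)]
```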

@tovogt changed the title from "WIP: HDF5 file IO for TCTracks" to "HDF5 file IO for TCTracks", Jan 19, 2022
@tovogt (Collaborator, Author) commented Jan 19, 2022

@chahank I'm sorry, but I have to ask again for your review. I didn't change the API or the tests, but I changed the file format and the implementation. The produced HDF5 file is now completely NetCDF4-compliant and can be read with external NetCDF tools like ncdump. This is much more elegant for most purposes, but requires 4-5 times more space on disk than the data format you reviewed at first. For example, storing the approx. 3000 tracks from the global IBTrACS data set in 1980-2019 requires 87 MB on disk [*]. This doesn't sound like a lot, but when generating 100 synthetic random-walk tracks for each event, you will end up with almost 9 GB of data.

However, users who have problems with disk space can still manually gzip their files, which will usually save 90% of the disk space.
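For instance, a plain round-trip with the Python standard library (file names are placeholders):

```python
import gzip
import shutil

# Compress the HDF5 file after writing it:
with open("tc_tracks.hdf5", "rb") as src, gzip.open("tc_tracks.hdf5.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)

# ...and decompress it to disk again before reading, since HDF5 readers
# cannot open the gzipped file directly:
with gzip.open("tc_tracks.hdf5.gz", "rb") as src, open("tc_tracks.hdf5", "wb") as dst:
    shutil.copyfileobj(src, dst)
```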

[*] For comparison, when storing the 3000 tracks in separate NetCDF files using the TCTracks.write_netcdf functionality, the total disk space required is 105 MB.

@chahank (Member) commented Jan 19, 2022

Thanks @tovogt. I went through the code and made a few remarks.

I think having a method that makes using the files easier and more compatible, at the cost of disk space, is the right choice. And, as you said, users for whom disk space is critical can still apply extra compression themselves.

@chahank (Member) commented Jan 19, 2022

Looks good to go to me, thanks for the updates!

@tovogt merged commit 24d9e26 into develop, Jan 20, 2022
@tovogt deleted the feature/tc_tracks_hdf branch, January 20, 2022 10:03
@tovogt mentioned this pull request, Jun 6, 2023