pickle free centroids hdf #1076
Conversation
**chahank** left a comment:
Good changes, although some tests are failing now.
As a note, centroids are currently never supposed to be anything other than points, as the impact calculations cannot deal with other geometries in the centroids at the moment. But I think it does not hurt to have the code ready.
**peanutfun** left a comment:
Thank you for tackling this issue! I have some concerns regarding the readability and performance of the code, especially because the entire data needs to be copied. I also think the tests need to be adapted a bit. See my comments.
```python
store.close()
xycols = []
wkbcols = []
store = pd.HDFStore(file_name, mode=mode, complevel=9)
```
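The `wkbcols` list apparently collects geometry columns that get serialized to WKB rather than pickled. As a hedged illustration of why WKB is a convenient pickle-free interchange format, here is a minimal hand-rolled WKB round trip for a 2-D point using only the standard library (the real code would use `shapely.wkb` or similar; the helper names below are hypothetical):

```python
import struct

def point_to_wkb(x: float, y: float) -> bytes:
    """Encode a 2-D point as little-endian WKB: byte order 1, geometry type 1 (Point)."""
    return struct.pack("<BIdd", 1, 1, x, y)

def wkb_to_point(buf: bytes) -> tuple:
    """Decode a little-endian WKB point back into an (x, y) tuple."""
    byte_order, geom_type, x, y = struct.unpack("<BIdd", buf)
    assert byte_order == 1 and geom_type == 1, "only little-endian points handled here"
    return (x, y)

wkb = point_to_wkb(8.55, 47.37)
print(len(wkb))           # 21 bytes: 1 (order) + 4 (type) + 8 + 8 (coords)
print(wkb_to_point(wkb))  # (8.55, 47.37)
```

Because WKB is a plain byte string, it can be stored in an HDF5 column without any of the code-execution risks that come with unpickling.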
For zlib, it looks like high compression levels only slightly reduce the file size while costing much performance. A lower value seems more advisable to me. See https://www.pytables.org/usersguide/optimization.html#compression-issues
```diff
- store = pd.HDFStore(file_name, mode=mode, complevel=9)
+ store = pd.HDFStore(file_name, mode=mode, complevel=3)
```
I leave it as it is: in an arbitrary test, the CPU-time decrease was 0.4 seconds (about 15%) and the size increase was 2 MB (about 10%). From a cost point of view, complevel 9 seems justified.
15% decrease vs. 10% increase seems like an argument for a lower complevel, from my point of view. But I guess it's not that relevant. In case we run into some issues, we might consider making this a method kwarg in the future.
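The tradeoff debated above can be sanity-checked directly with `zlib` (the library behind HDF5's default compressor); the numbers depend entirely on the data at hand, so this is only an illustration, not a benchmark of the PR's actual files:

```python
import time
import zlib

# Moderately compressible dummy payload (repetitive text, about 1 MB).
data = b"centroids lat lon geometry " * 40_000

for level in (1, 3, 9):
    t0 = time.perf_counter()
    out = zlib.compress(data, level)
    dt = time.perf_counter() - t0
    print(f"level {level}: {len(out):>8} bytes, {dt * 1000:.1f} ms")

# Typically the size gain from level 3 to level 9 is small relative to
# the extra CPU time, which is the point made in the PyTables docs.
```

Running this on one's own data is a quick way to decide whether a `complevel` keyword argument is worth exposing.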
```python
for col in pandas_df.columns:
    if str(pandas_df[col].dtype) == "geometry":
```
Make clear that you do not want to iterate over all columns:

```diff
- for col in pandas_df.columns:
-     if str(pandas_df[col].dtype) == "geometry":
+ for col in filter(lambda c: str(pandas_df[c].dtype) == "geometry", pandas_df.columns):
```
(Suggestion won't work because the following code needs to be indented less)
Elegant suggestion, but I leave it as it is; it's more "climada style" like that.
Then please add a comment, what you call "climada style" confused me quite a bit 😕
```diff
+ # Iterate over geometry columns (only)
  for col in pandas_df.columns:
      if str(pandas_df[col].dtype) == "geometry":
```
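For readers following the exchange: the pattern under discussion is selecting columns by dtype name before looping. A self-contained sketch with plain pandas (the "geometry" dtype comes from geopandas, which may not be installed, so a generic `object` column stands in for it here):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "lat": [47.4, 46.9],
        "lon": [8.5, 7.4],
        "region": ["ZH", "BE"],  # non-numeric stand-in for a geometry column
    }
)

# Selecting matching columns up front keeps the loop body flat,
# at the cost of one extra pass over the column index.
target = "object"  # in the PR this would be "geometry"
special_cols = [col for col in df.columns if str(df[col].dtype) == target]
print(special_cols)  # ['region']
```

Either style works; the comprehension just makes the intent ("only geometry columns") visible at the top of the loop.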
climada/hazard/centroids/centr.py (outdated)
```diff
- crs = metadata.get("crs")
- gdf = gpd.GeoDataFrame(store["centroids"], crs=crs)
+ gdf = gpd.GeoDataFrame(store["centroids"])
  for xycol in metadata.get("xy_columns", []):
```
Please also add a test with multiple xy_columns/wkb_columns to be stored and read.
Co-authored-by: Lukas Riedel <34276446+peanutfun@users.noreply.github.com>
```python
)
centroids_w.write_hdf5(tmpfile)
centroids_r = Centroids.from_hdf5(tmpfile)
self.assertTrue(centroids_w == centroids_r)
```
```diff
- self.assertTrue(centroids_w == centroids_r)
+ self.assertEqual(centroids_w, centroids_r)
```
(This was actually done on purpose - the idea was to make sure the overridden equality operator does what it ought to do, regardless of what exactly happens inside assertEqual, about which I have no clue.)
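For context on the exchange above: `assertEqual(a, b)` also ends up evaluating `a == b` (unless a type-specific comparison is registered), but `assertTrue(a == b)` makes the call to the overridden `__eq__` explicit, at the cost of a less informative failure message. A minimal illustration with a hypothetical class:

```python
import unittest

class Box:
    """Toy class with an overridden equality operator."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return isinstance(other, Box) and self.value == other.value

class TestBoxEquality(unittest.TestCase):
    def test_eq_explicit(self):
        # Explicitly exercises Box.__eq__, as in the PR's test.
        self.assertTrue(Box(1) == Box(1))

    def test_eq_via_assert_equal(self):
        # Also calls __eq__, but reports both operands on failure.
        self.assertEqual(Box(1), Box(1))

if __name__ == "__main__":
    unittest.main()
```

Both assertions pass; they differ mainly in how a failure is reported.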
Changes proposed in this PR:
This PR is a pragmatic workaround for the problem described in #1055.