Conversation

@NicolasColombi
Collaborator

@NicolasColombi NicolasColombi commented Mar 23, 2025

Changes proposed in this PR:

This PR adds tropical cyclone basin bounds as a dictionary named BASINS_BOUNDS in 0–360 coordinates, following STORM dataset bounds definitions.

This addition can be used to determine whether a track originated in a given basin simply by using shapely functionality:

from shapely.geometry import Point

point = Point(lon, lat)  # shapely expects (x, y), i.e. (lon, lat)
is_in_basin = polygon.contains(point)

Particularly useful to decrease the memory overhead during global wind field computations by computing fields basin-wide.
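The same check can be sketched in a self-contained way without shapely, using plain bounding boxes. The basin names and box values below are illustrative only, not the actual STORM bounds:

```python
# Illustrative basin bounding boxes in 0-360 longitudes:
# (lon_min, lon_max, lat_min, lat_max). NOT the actual STORM values.
BASINS_BOUNDS = {
    "NA": (260.0, 360.0, 0.0, 60.0),  # North Atlantic (made up)
    "WP": (100.0, 180.0, 0.0, 60.0),  # Western Pacific (made up)
}

def basin_of(lon, lat):
    """Return the first basin whose bounding box contains the point."""
    lon = lon % 360.0  # normalise longitude to 0-360 to match the bounds
    for name, (lon_min, lon_max, lat_min, lat_max) in BASINS_BOUNDS.items():
        if lon_min <= lon <= lon_max and lat_min <= lat <= lat_max:
            return name
    return None

print(basin_of(-75.0, 25.0))  # -75 wraps to 285, inside the "NA" box
```

In the real code the boxes would be shapely polygons and the check `polygon.contains(point)`, but the containment logic is the same.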

Preview of the bounds:


PR Author Checklist

PR Reviewer Checklist

@NicolasColombi NicolasColombi changed the title add bounds Add tropical cyclones basin bounds Mar 23, 2025
@NicolasColombi NicolasColombi marked this pull request as ready for review March 23, 2025 18:30
@chahank
Member

chahank commented Mar 24, 2025

Simple yet useful addition!

I would suggest two things:
1- make it a data class instead
2- make it clear that these are the STORM basin definitions. These can vary substantially, and hence cannot be defined as a definitive default

Also, I am not sure I understand the point "Particularly useful to decrease the memory overload during global wind fields computations by computing fields basin-wide."?

In what way does this work beyond what is already possible?

@NicolasColombi
Collaborator Author

> Simple yet useful addition!
>
> I would suggest two things: 1- make it a data class instead 2- make it clear that these are the STORM basin definitions. These can vary substantially, and hence cannot be defined as a definitive default
>
> Also, I am not sure to understand the point "Particularly useful to decrease the memory overload during global wind fields computations by computing fields basin-wide."?
>
> In what way does this work beyond what is already possible?

Thanks Chahan, I will do so. Tracks from Emanuel simulations, for example, often come without basin information, so these bounds would allow you to classify tracks by their origin basin. This would decrease the amount of memory needed: if you compute wind fields globally, the centroids near every track are loaded into memory, so computing wind fields for a track in the NA also loads centroids in the NI. If instead you perform the computation basin-wide, the centroid overlap between tracks is much larger, so fewer distinct centroids need to be held in memory. This is my understanding of the discussion I had with @spjuhel; I might be misinterpreting though.

@chahank
Member

chahank commented Mar 24, 2025

Thanks for the explanation! I am a bit confused though, because for each track only the centroids near the track are considered (and they are loaded one by one into memory).

Could you maybe post an example of a code that shows this increase in memory efficiency? That would certainly be useful to see.

@NicolasColombi
Collaborator Author

I see, then I might have misinterpreted Sam's words. I haven't tested yet whether this indeed increases efficiency (it was my hypothesis given the above), but I will try soon. Also, if you already know areas in the wind field computation that could be optimized memory-wise, let me know, as my memory usage is currently exploding: with global exposure at 450 arcsec and only 200 tracks it exceeds 50 GB on Euler, and 150 GB at 150 arcsec resolution.

@NicolasColombi
Collaborator Author

@chahank I converted it to a dataclass (unsure whether this is how you intended it) and added a method split_by_basin(), which creates 6 TCTracks instances (one per basin) from a single TCTracks instance.

@chahank
Member

chahank commented Mar 24, 2025

Thanks! Although the idea of the data class is precisely to not have a macro variable like BASIN; the idea is to have the data in that class directly. Also, the idea is that you can then clearly explain that this is an arbitrary border for basins based on one particular publication.

I am not sure I understand the purpose of the method split_by_basin(). Can you please provide an example?

@NicolasColombi
Collaborator Author

I am not sure I understand exactly... is your goal to have a basin class whose instances' attributes are the basin boundaries?

The idea of split_by_basin() is that you can load worldwide tracks (e.g. Emanuel's, which you cannot select by basin) and then split this one global TCTracks instance into six new TCTracks instances, each containing tracks of only one basin. This is currently not possible if you have a global dataset that does not provide the basin information (e.g. Emanuel's, FAST), and it is very handy even with datasets that do provide that information.
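The splitting logic can be sketched as follows, with tracks mocked as plain dicts rather than TCTracks objects. All names, the data layout, and the bounds here are illustrative, not the actual implementation:

```python
# Hypothetical sketch of split_by_basin(): each mock track carries its
# genesis point; the real method operates on TCTracks/xarray objects.
def split_by_basin(tracks, basin_bounds):
    """Group tracks into one list per basin, keyed by basin name."""
    by_basin = {name: [] for name in basin_bounds}
    for track in tracks:
        lon, lat = track["genesis"]  # first point of the track
        for name, (lon_min, lon_max, lat_min, lat_max) in basin_bounds.items():
            if lon_min <= lon <= lon_max and lat_min <= lat <= lat_max:
                by_basin[name].append(track)
                break  # a track belongs to the basin where it originated
    return by_basin

# Made-up bounds (lon_min, lon_max, lat_min, lat_max) and mock tracks:
bounds = {"NA": (-100.0, 0.0, 0.0, 60.0), "SP": (135.0, 180.0, -60.0, 0.0)}
tracks = [{"id": 1, "genesis": (-75.0, 25.0)},
          {"id": 2, "genesis": (160.0, -15.0)}]
print({k: [t["id"] for t in v] for k, v in split_by_basin(tracks, bounds).items()})
# {'NA': [1], 'SP': [2]}
```

Each output entry could then be wrapped back into its own TCTracks instance, one per basin.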

@chahank
Member

chahank commented Mar 24, 2025

Exactly. The data class is there to contain the data. A global variable like BASIN is hard to manage and document, and is usually quite obscure to the user as it does not appear in the documentation.

@emanuel-schmid
Collaborator

emanuel-schmid commented Mar 24, 2025

@chahank - do you really mean dataclass and not enum? Because if it's about managing some structured kind of constants, an enum seems to be the better match and the benefits of a dataclass (free __init__ and __eq__) don't help a lot. 🤔
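The enum variant could look like the following minimal sketch. The class name BasinBounds, the contains helper, and the box values are all illustrative, not the names or values that ended up in the PR:

```python
from enum import Enum

class BasinBounds(Enum):
    """Each member carries its bounding box (lon_min, lon_max, lat_min, lat_max).
    Values are made up for illustration, NOT the STORM definitions."""
    NA = (-100.0, 0.0, 0.0, 60.0)
    WP = (100.0, 180.0, 0.0, 60.0)

    def contains(self, lon, lat):
        """Check whether a (lon, lat) point falls inside this basin's box."""
        lon_min, lon_max, lat_min, lat_max = self.value
        return lon_min <= lon <= lon_max and lat_min <= lat <= lat_max

print(BasinBounds.NA.contains(-75.0, 25.0))  # True
print(BasinBounds["WP"].name)                # 'WP', lookup by basin name
```

Compared to a dataclass, the enum gives a fixed, documented set of named constants with no free `__init__`, which matches the "structured constants" use case.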

@ValentinGebhart
Collaborator

Thanks for the addition! I will take a closer look, but just from a quick read I noticed that there might be problems with the coordinate range (and the antimeridian). There is no convention in CLIMADA about whether longitude is in -180 to 180 or 0 to 360 or other ranges, but if you use the wrong convention, polygon.contains no longer works. See this code:

from climada.hazard.tc_tracks import BASINS
from shapely import Point
SP = BASINS["SP"].polygon
EP = BASINS["EP"].polygon
print(SP.contains(Point(-170, -30))) # False, longitude in -180 to 180 
print(SP.contains(Point(-170 + 360, -30))) # True, longitude in 0 to 360
print(EP.contains(Point(-170, 30))) # True, longitude in -180 to 180 
print(EP.contains(Point(-170 + 360, 30))) # False, longitude in 0 to 360

Maybe this is not a problem when connected to the STORM algorithm; just noting it in case we want to use these definitions more generally.

@chahank
Member

chahank commented Mar 24, 2025

@emanuel-schmid aaa no of course, sorry for the confusion! I meant an Enum class.

@NicolasColombi
Collaborator Author

@emanuel-schmid makes more sense now! Let me know if it is implemented as intended. For consistency, we may also consider adding the basin pressure to that class?

Thanks @ValentinGebhart. I agree; I switched to -180 to 180 as it is consistent with the tracks' coordinate range. I had to split the "SP" basin into two polygons to avoid issues with the antimeridian (the original reason why I chose longitudes between 0 and 360).
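The antimeridian split can be sketched with two longitude boxes in -180 to 180 coordinates: a point is in the basin if it falls in either box. The bounds below are made up, not the actual SP definition:

```python
# Illustrative sketch: a basin crossing the antimeridian represented as
# two boxes (lon_min, lon_max, lat_min, lat_max) in -180..180 longitudes.
SP_PARTS = [
    (135.0, 180.0, -60.0, 0.0),    # western part, up to the antimeridian
    (-180.0, -120.0, -60.0, 0.0),  # eastern part, from the antimeridian
]

def in_sp(lon, lat):
    """True if the point falls in either half of the split basin."""
    return any(
        lon_min <= lon <= lon_max and lat_min <= lat <= lat_max
        for lon_min, lon_max, lat_min, lat_max in SP_PARTS
    )

print(in_sp(170.0, -20.0))   # True: west of the antimeridian
print(in_sp(-170.0, -20.0))  # True: east of the antimeridian
print(in_sp(0.0, -20.0))     # False: outside both boxes
```

With shapely, the same idea would be a MultiPolygon of the two parts, so a single contains call covers both sides of the antimeridian.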

Before I write the test for subset_by_basin(), let me know if you have comments on the function.

@emanuel-schmid
Collaborator

emanuel-schmid commented Mar 27, 2025

@NicolasColombi many thanks. I've got two questions about the subset_by_basin method

  1. wouldn't it be preferable to return an array with the basins instead of a dictionary with lists?
  2. supposedly there are much more performant ways to do this than to loop over the data - is speed irrelevant here?
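Point 2 could, for instance, replace the per-track Python loop with numpy boolean masks. This is only a sketch under the assumption that genesis coordinates are available as arrays; names and bounds are illustrative:

```python
import numpy as np

# Hypothetical genesis coordinates of three tracks.
lons = np.array([-75.0, 160.0, -75.0])
lats = np.array([25.0, -15.0, -50.0])

# One boolean mask per basin instead of looping over tracks in Python.
# Made-up "NA" box: (lon_min, lon_max, lat_min, lat_max).
lon_min, lon_max, lat_min, lat_max = -100.0, 0.0, 0.0, 60.0
in_na = (lons >= lon_min) & (lons <= lon_max) & \
        (lats >= lat_min) & (lats <= lat_max)

print(in_na)                  # [ True False False]
print(np.flatnonzero(in_na))  # indices of the tracks in this basin -> [0]
```

The mask indices can then be used to select the corresponding tracks in one vectorised step per basin.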

@NicolasColombi
Collaborator Author

> @NicolasColombi many thanks. I've got two questions about the subset_by_basin method
>
> 1. wouldn't it be preferable to return an array with the basins instead of a dictionary with lists?
> 2. supposedly there are much more performant ways to do this than to loop over the data - is speed irrelevant here?

@emanuel-schmid
For 1., don't you lose the information about which basin the TCTracks object belongs to with an array? I just picked a format where you can access the data by calling the basin name, but maybe there is a better option 🙃

For 2., speed is not so much of an issue here, but if you know a more efficient way I am absolutely keen to modify the function. Can you advise on how to optimize it?

@NicolasColombi
Collaborator Author

> > @NicolasColombi many thanks. I've got two questions about the subset_by_basin method
> >
> > 1. wouldn't it be preferable to return an array with the basins instead of a dictionary with lists?
> > 2. supposedly there are much more performant ways to do this than to loop over the data - is speed irrelevant here?
>
> @emanuel-schmid For 1., don't you lose the information about which basin the TCTracks object belongs to with an array? I just picked a format where you can access the data by calling the basin name, but maybe there is a better option 🙃
>
> For 2., speed is not so much of an issue here, but if you know a more efficient way I am absolutely keen to modify the function. Can you advise on how to optimize it?

@emanuel-schmid Any updates?

Comment on lines 289 to 291
BASINS_GDF = gpd.GeoDataFrame(
{"basin": b, "geometry": b.value} for b in Basin_bounds_storm
)
Member

@chahank chahank Aug 28, 2025


This defeats the purpose of the enum class a bit. Please do without it if possible (I know this was proposed by @emanuel-schmid, but this is just a very minor change and allows for future flexibility). This should be part of the method get_basins. The latter should also have a parameter source or similar, which at the moment can only have the value STORM. The method subset_by_basin should also have that.

Collaborator Author

@NicolasColombi NicolasColombi Sep 19, 2025


I am not sure I understand: do you want to remove this or move it under get_basins? Can you clarify? For the moment, I moved it.

Member


I still think the naming is not well chosen because it is only one way to define basins, but let's keep it as is for the moment. Thanks!

@chahank
Member

chahank commented Sep 22, 2025

If you agree with the small cosmetic changes, please merge them. After completion of the test as discussed we can merge!

NicolasColombi and others added 5 commits September 23, 2025 10:24
Co-authored-by: Chahan M. Kropf <chahan.kropf@usys.ethz.ch>
Co-authored-by: Chahan M. Kropf <chahan.kropf@usys.ethz.ch>
Co-authored-by: Chahan M. Kropf <chahan.kropf@usys.ethz.ch>
Co-authored-by: Chahan M. Kropf <chahan.kropf@usys.ethz.ch>
Co-authored-by: Chahan M. Kropf <chahan.kropf@usys.ethz.ch>
@NicolasColombi
Collaborator Author

> If you agree with the small cosmetic changes, please merge them. After completion of the test as discussed we can merge!

Yes, thanks! I just found a pretty annoying bug while reviewing the test: when track coordinates are in the 0-360 format you can miss tracks, as our bounds are in -180 to 180. Unfortunately, track coordinates are not uniform across sources (Emanuel uses -180 to 180, for instance), so our function must be able to deal with both.

@chahank do you have an idea of how to detect the coordinate format? Conversion from one to the other is easy, as long as you know the format... It does not seem to be an attribute of the tracks.

@NicolasColombi
Collaborator Author

> > If you agree with the small cosmetic changes, please merge them. After completion of the test as discussed we can merge!
>
> Yes, thanks! I just found a pretty annoying bug while reviewing the test: when track coordinates are in the 0-360 format you can miss tracks, as our bounds are in -180 to 180. Unfortunately, track coordinates are not uniform across sources (Emanuel uses -180 to 180, for instance), so our function must be able to deal with both.
>
> @chahank do you have an idea of how to detect the coordinate format? Conversion from one to the other is easy, as long as you know the format... It does not seem to be an attribute of the tracks.

I think the easiest option would be to require the format type as an argument. The user should know the coordinate format of their tracks. I do not see an easy way to guess the format...

Having homogeneous coordinate formats in climada at some point is probably part of a bigger plan.

@chahank
Member

chahank commented Sep 23, 2025

> > > If you agree with the small cosmetic changes, please merge them. After completion of the test as discussed we can merge!
> >
> > Yes, thanks! I just found a pretty annoying bug while reviewing the test: when track coordinates are in the 0-360 format you can miss tracks, as our bounds are in -180 to 180. Unfortunately, track coordinates are not uniform across sources (Emanuel uses -180 to 180, for instance), so our function must be able to deal with both.
> >
> > @chahank do you have an idea of how to detect the coordinate format? Conversion from one to the other is easy, as long as you know the format... It does not seem to be an attribute of the tracks.
>
> I think the easiest option would be to require the format type as an argument. The user should know the coordinate format of their tracks. I do not see an easy way to guess the format...
>
> Having homogeneous coordinate formats in climada at some point is probably part of a bigger plan.

I think the best option is to use the util.coordinates.lon_normalize and util.coordinates.lon_bounds methods to always get consistent boundaries.

You could still have problems if the position of the central meridian is changed.
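What such a normalisation does can be sketched with a scalar stand-in. This is only the modulo arithmetic behind the idea, not CLIMADA's actual implementation (which, as suggested above, works on arrays via util.coordinates.lon_normalize):

```python
# Scalar sketch of longitude normalisation around a chosen central meridian.
def lon_normalize(lon, center=0.0):
    """Map a longitude into the range [center - 180, center + 180)."""
    return (lon - (center - 180.0)) % 360.0 + (center - 180.0)

print(lon_normalize(270.0))          # -90.0: a 0-360 input mapped to -180..180
print(lon_normalize(-170.0, 180.0))  # 190.0: mapped into the 0..360 range
```

With every track normalised to the same range as the basin bounds before the containment test, tracks can no longer be missed because of a mismatched coordinate convention.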

@NicolasColombi
Collaborator Author

> > > > If you agree with the small cosmetic changes, please merge them. After completion of the test as discussed we can merge!
> > >
> > > Yes, thanks! I just found a pretty annoying bug while reviewing the test: when track coordinates are in the 0-360 format you can miss tracks, as our bounds are in -180 to 180. Unfortunately, track coordinates are not uniform across sources (Emanuel uses -180 to 180, for instance), so our function must be able to deal with both.
> > >
> > > @chahank do you have an idea of how to detect the coordinate format? Conversion from one to the other is easy, as long as you know the format... It does not seem to be an attribute of the tracks.
> >
> > I think the easiest option would be to require the format type as an argument. The user should know the coordinate format of their tracks. I do not see an easy way to guess the format...
> >
> > Having homogeneous coordinate formats in climada at some point is probably part of a bigger plan.
>
> I think the best option is to use the util.coordinates.lon_normalize and util.coordinates.lon_bounds methods to always get consistent boundaries.
>
> You could still have problems if the position of the central meridian is changed.

Done, the test is updated too.

@chahank
Member

chahank commented Sep 23, 2025

Excellent work, I think we can merge!

@NicolasColombi
Collaborator Author

> Excellent work, I think we can merge!

Great! 🤩

@chahank chahank merged commit 4b93d1d into develop Sep 23, 2025
18 of 19 checks passed
@emanuel-schmid emanuel-schmid deleted the feature/basins-bounds branch December 11, 2025 08:44