
Conversation

@ValentinGebhart (Collaborator) commented Jul 17, 2025

While checking the impact yearsets tutorial, I noticed a few potential improvements to the yearset module. The main conceptual point was that the impact events were sampled without replacement. This seemed to work well if all events had the same frequency, but resulted in strange behavior if the frequencies varied a lot (see the example in the comment below). Also, the frequency of the final impact object was unclear.

Changes proposed in this PR:

  • When sampling without replacement (the default behavior), a warning is now issued if the frequencies of the different events are not equal.
  • The yearly impact object now has a frequency of 1/n_years, where n_years is the number of sampled years.
  • The yearly impact object used to be a deep copy of the original impact object with a few parameters changed (e.g., the impact matrix was transferred, which I think does not make sense). Now the yearly impact object is a fresh impact object filled with the relevant attributes only.
  • The correction factor is now applied by multiplication rather than division, so that the yearly impact aai_agg equals the aai_agg of the original impact object (see the sketch after this list).
  • Minor changes in the tutorial.
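
To illustrate the correction-factor point, here is a minimal sketch (with illustrative names, not the module's exact code) of computing the factor from the original aai_agg and applying it by multiplication:

import numpy as np

def apply_correction(yearly_at_event, original_aai_agg):
    # average annual impact of the sampled years (each year has frequency 1 / n_years)
    sampled_aai = np.mean(yearly_at_event)
    # multiplying by original_aai_agg / sampled_aai rescales the samples so that
    # the aai_agg of the year set equals the aai_agg of the original impact object
    correction_fac = original_aai_agg / sampled_aai
    return yearly_at_event * correction_fac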

PR Author Checklist

PR Reviewer Checklist

@ValentinGebhart (Collaborator Author) commented Jul 17, 2025

This is an example for which the earlier sampling without replacement was not reasonable. One of the impact events has a significantly larger frequency than the others. Once it was drawn, the remaining events were drawn until none were left. With replacement, the high-frequency event would be drawn more often.

import numpy as np
import climada.util.yearsets as yearsets
from climada.engine import Impact

imp = Impact()
imp.at_event = np.arange(10, 110, 10)
imp.frequency = np.concatenate([np.array([1.]), np.ones(9) * 0.01]) # highly unbalanced frequencies, first event 100 times more likely

# the number of years to sample impacts for (len(yimp.at_event) == sampled_years)
sampled_years = 10

# sample number of events per sampled year
lam = np.sum(imp.frequency)
events_per_year = yearsets.sample_from_poisson(sampled_years, lam, seed=123)
sampling_vect = yearsets.sample_events(events_per_year, imp.frequency, seed=123)
sampling_vect
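
The first result below was presumably obtained by additionally passing with_replacement=False to sample_events; the keyword name is taken from the warning message, and attaching it to sample_events is an assumption:

# reproduce the old behavior: draw the events without replacement
sampling_vect_no_repl = yearsets.sample_events(
    events_per_year, imp.frequency, seed=123, with_replacement=False
)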

Result when setting the optional parameter with_replacement=False, similar to the old version

2025-07-22 16:59:12,925 - climada.util.yearsets - WARNING - The frequencies of the different events are not equal. This can lead to distorted sampling if the frequencies vary significantly. To avoid this, please set with_replacement=True to sample with replacement instead.
[array([0]),
 array([], dtype=int64),
 array([7, 1]),
 array([6, 2]),
 array([8]),
 array([5]),
 array([], dtype=int64),
 array([9]),
 array([], dtype=int64),
 array([], dtype=int64)]

Result in the new version

[array([0]),
 array([], dtype=int64),
 array([0, 0]),
 array([0, 0]),
 array([0]),
 array([1]),
 array([], dtype=int64),
 array([0]),
 array([], dtype=int64),
 array([], dtype=int64)]

@ValentinGebhart ValentinGebhart requested review from carmensteinmann and removed request for emanuel-schmid and peanutfun July 17, 2025 14:08

@chahank (Member) left a comment

Good catch! I think the implementation goes in the right direction, but I would improve the way it is presented to the user a bit.

Also, the other improvements look good to me. I might look at the code in detail later to be sure it does what it should do if you want.

Comment on lines 242 to 246

if with_replacement is False:
    LOGGER.warning(
        "Sampling without replacement not implemented for events with varying "
        "frequencies. Events are sampled with replacement."
    )

Member

This is a very confusing way to implement things. Having parameter combinations that work and others that do not should be avoided if possible.

Also, I am a bit confused here. This is just a warning, while I think it should be an error, since you want the code not to run for this combination of input parameters. Silently changing inputs during calculations is also to be avoided.

But more generally, why raise an error if the frequencies are not equal? Would it not make sense to have a default behaviour (unequal frequencies with replacement, equal frequencies without replacement) that the user can overwrite if they want? Or, even more simply, always sample with replacement?

Collaborator Author

Thanks for the feedback :)

I see your point. Actually, I also thought the easiest way would be to just always sample with replacement, but maybe there was a reason to implement sampling without replacement that I am not aware of.

I suggest this implementation:

  • default: sample with replacement
  • Optional parameter with_replacement can be set to False to sample without replacement. If frequencies of the events are not equal, we raise an error.

What do you think?

Member

Optional parameter with_replacement can be set to False to sample without replacement. If frequencies of the events are not equal, we raise an error.

Would you say that it is always wrong to sample without replacement when the frequencies are unequal? I understood that it is only a problem in certain cases. So maybe rather raise a warning and explain the potential problem?

Collaborator Author

Hm, good question. I think that, strictly speaking, sampling without replacement (as it was implemented) is always wrong if the frequencies are unequal. However, I guess the error is small if the frequencies are not too different, and it only becomes significant in cases like the one above.
For me it would be okay to just raise a warning, as you suggest. Should I implement that?

Member

Sounds like a good compromise. Anyway, the default will be with replacement, and thus most users won't have problems.

Collaborator Author

I have now changed the default to sample with replacement, and a warning is issued when sampling without replacement with unequal event frequencies.
I also changed the default to not apply the correction factor, so that by default the yearly impacts are actual samples of the original impact object.
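
A minimal sketch of what this check could look like (function and variable names are assumed, not the exact PR code); the warning text mirrors the log message shown above:

import logging
import numpy as np

LOGGER = logging.getLogger(__name__)

def warn_if_unequal_frequencies(frequencies, with_replacement):
    # warn when sampling without replacement although event frequencies differ,
    # since this can distort the sampling
    if not with_replacement and not np.allclose(frequencies, frequencies[0]):
        LOGGER.warning(
            "The frequencies of the different events are not equal. This can lead "
            "to distorted sampling if the frequencies vary significantly. To avoid "
            "this, please set with_replacement=True to sample with replacement "
            "instead."
        )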

Member

Great! Just so that we do not introduce a confusing default change, can @carmensteinmann please confirm that setting the correction factor application to false by default is a good idea?


def impact_yearset_from_sampling_vect(
-    imp, sampled_years, sampling_vect, correction_fac=True
+    imp, sampled_years, sampling_vect, correction_fac=False

Member

If we keep it as is (pending confirmation by @carmensteinmann), we should adjust the docstring. I unfortunately cannot suggest a change here directly in the comment.

@chahank (Member) commented Jul 23, 2025

So that we do not forget, the changelog needs to be updated once done.

    event_id=np.arange(1, len(sampled_years) + 1),
    unit=imp.unit,
    haz_type=imp.haz_type,
    frequency=np.ones(len(sampled_years)) / len(sampled_years),

@chahank (Member) Sep 10, 2025

Suggested change:
-    frequency=np.ones(len(sampled_years)) / len(sampled_years),
+    frequency=np.ones(len(sampled_years)) / len(sampled_years),
+    frequency_unit='1/year'
+    aai_agg=np.sum(yimp.frequency * yimp.at_event)

Member

And eai_exp is not defined?

Collaborator Author

To your suggestion: I think yimp is not yet defined at that point, so I cannot use it in its own initialization (when defining aai_agg), right? I could define it in the initialization using aai_agg=np.sum(imp_per_year) / len(sampled_years); would that be better? Then the frequency would be hard-coded to be constant, though.

To eai_exp: I believe the impact year set does not have spatial information anymore: it builds upon imp.at_event of the original impact object, so it is aggregated over the locations. We could add eai_exp with one entry equal to the aai_agg, if you think that would be useful?

Member

Ah, correct. Yes, just use the frequency definition instead, so

Suggested change:
-    frequency=np.ones(len(sampled_years)) / len(sampled_years),
+    frequency=frequency,
+    aai_agg=np.sum(frequency * imp_per_year)

and somewhere above frequency=np.ones(len(sampled_years)) / len(sampled_years).
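
Putting the two suggestions together, the initialization could look roughly like this (a sketch only; at_event=imp_per_year and the variable names around it are assumed from the snippets above):

frequency = np.ones(len(sampled_years)) / len(sampled_years)

yimp = Impact(
    event_id=np.arange(1, len(sampled_years) + 1),
    unit=imp.unit,
    haz_type=imp.haz_type,
    at_event=imp_per_year,  # assumed: aggregated impact per sampled year
    frequency=frequency,
    frequency_unit="1/year",
    aai_agg=np.sum(frequency * imp_per_year),
)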

Member

In the previous implementation, eai_exp remained whatever was in imp.eai_exp. Not that this was necessarily a good idea, but the current implementation changes the output.

Collaborator Author

Yes, in the previous implementation, yimp was simply a copy of the original impact object with just a few attributes changed. E.g., the impact matrix was also transferred if the original impact object had one. I think this results in a quite confusing output object (a kind of hybrid between original events and sampled yearly impacts). That is why I changed it so that yimp only has attributes that correspond to the yearly impacts, and yes, that output can look quite different than before.

Does it make sense to change the output in this way? Or are you saying that it is problematic to change the output?

Collaborator Author

Or are you saying that it would be valuable to additionally transfer just imp.eai_exp, as there is still some spatial info in it? Even though the numbers of imp.eai_exp would then not add up to imp.aai_agg if we do not use the correction factor.

Member

I see, it is much clearer now without carrying over the attributes from before, I agree. Maybe mention it in the changelog and we are good.

The eai_exp should be computable from the impact matrix and the sampling vector, right? But this would probably be a separate PR.
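
For reference, a hypothetical sketch of how eai_exp could be derived from the original impact matrix and the sampling vector (not part of this PR; imp_mat is assumed to be an (n_events x n_exposure_points) matrix, and each sampled year is assumed to carry frequency 1/n_years):

import numpy as np

def yearly_eai_exp(imp_mat, sampling_vect):
    # sum the per-location impacts of the events drawn in each sampled year,
    # then average over the years (each year carries frequency 1 / n_years)
    n_years = len(sampling_vect)
    total = np.zeros(imp_mat.shape[1])
    for events in sampling_vect:
        if len(events) > 0:
            total += np.asarray(imp_mat[events, :].sum(axis=0)).ravel()
    return total / n_years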

@ValentinGebhart (Collaborator Author) Sep 11, 2025

Perfect, I'll adapt the changelog.

Yes, using the original impact matrix and the sampling vector, one could compute the eai_exp of the sampled impacts. Still, the yimp does not have any location information anymore (e.g. centroids), so maybe it would also be confusing to save the eai_exp in the yimp object.
A different way would be to compute a spatially explicit yearly impact set, i.e., one where the sampled yearly impacts are saved per centroid (and not aggregated). There the eai_exp would make a lot of sense. If we think this would be interesting, we can do it in a separate PR :)

Member

I think this would be interesting in the context of the multi-hazard impact module #1077. I would suggest focusing on the multi-hazard impact module instead of tweaking the existing yearset util methods.

    frequency=np.ones(len(sampled_years)) / len(sampled_years),
)

yimp.aai_agg = np.sum(yimp.frequency * yimp.at_event)

Member

Suggested change (remove the line):
-yimp.aai_agg = np.sum(yimp.frequency * yimp.at_event)

@chahank (Member) commented Sep 10, 2025

Small cosmetic suggestion and a question regarding the change in the eai_exp value compared to the previous implementation.

Once this last question is resolved, good to go!

@ValentinGebhart (Collaborator Author)

After discussion with Carmen, we decided to keep the default parameters as before:

  • sampling WITHOUT replacement: better inclusion of all original events in the sampling, hoping that situations with very unbalanced event frequencies are "uncommon";
  • applying the correction factor: the aai_agg of the year set and of the original impact object are equal, even though the sampled impacts have been changed.

I adapted the docstrings, tutorial and changelog accordingly (see the sketch below).
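
For completeness, a hedged usage sketch continuing the example from the first comment; the function name and the correction_fac parameter appear in the diff above, while the year labels and the aai_agg assignment on the toy impact are assumptions:

# the correction factor compares against the original impact's aai_agg, so set
# it on the toy object from above (an assumption about what the function needs)
imp.aai_agg = np.sum(imp.frequency * imp.at_event)

years = list(range(2001, 2001 + sampled_years))  # assumed: a list of year labels
yimp = yearsets.impact_yearset_from_sampling_vect(
    imp, years, sampling_vect, correction_fac=True  # defaults kept as before
)
print(yimp.at_event, yimp.frequency, yimp.aai_agg)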

@ValentinGebhart ValentinGebhart merged commit 16a5ca7 into develop Sep 23, 2025
17 of 19 checks passed
@ValentinGebhart ValentinGebhart deleted the feature/review_yearsets_tutorial branch September 23, 2025 07:52