Updates to impact yearset module and review of tutorial #1075
Conversation
This is an example for which the earlier sampling without replacement was not reasonable. One of the impact events has a significantly larger frequency. Once it was drawn in the sampling, the other events were drawn until no events were left. With replacement, the high-frequency event would be drawn more often.

```python
import numpy as np
import climada.util.yearsets as yearsets
from climada.engine import Impact

imp = Impact()
imp.at_event = np.arange(10, 110, 10)
# highly unbalanced frequencies, first event 100 times more likely
imp.frequency = np.concatenate([np.array([1.]), np.ones(9) * 0.01])

# the number of years to sample impacts for (len(yimp.at_event) = sampled_years)
sampled_years = 10

# sample number of events per sampled year
lam = np.sum(imp.frequency)
events_per_year = yearsets.sample_from_poisson(sampled_years, lam, seed=123)
sampling_vect = yearsets.sample_events(events_per_year, imp.frequency, seed=123)
sampling_vect
```

[Result when setting the optional parameter `with_replacement=False`, similar to the old version]

[Result in the new version]
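To see why this matters, the effect can be mimicked directly with plain `numpy` (a minimal sketch for illustration only, not the module's internal code; normalising the frequencies to probabilities is an assumption made here):

```python
import numpy as np

rng = np.random.default_rng(123)
frequency = np.concatenate([np.array([1.0]), np.ones(9) * 0.01])
probs = frequency / frequency.sum()  # normalise frequencies to sampling probabilities

# With replacement: the high-frequency event (index 0) dominates the draws, as expected.
with_repl = rng.choice(10, size=10, replace=True, p=probs)

# Without replacement: every event can be drawn at most once, so with 10 draws each
# event appears exactly once, regardless of how unbalanced the frequencies are.
without_repl = rng.choice(10, size=10, replace=False, p=probs)

print(with_repl)
print(without_repl)
```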
chahank left a comment
Good catch! I think the implementation goes in the right direction, but I would improve the way it is presented to the user a bit.
Also, the other improvements look good to me. I can look at the code in detail later to make sure it does what it should, if you want.
climada/util/yearsets.py
Outdated
```python
if with_replacement is False:
    LOGGER.warning(
        "Sampling without replacement not implemented for events with varying "
        "frequencies. Events are sampled with replacement."
    )
```
This is a very confusing way to implement things. Having parameter combinations that work and others which do not is to be avoided if possible.
Also, I am a bit confused here. This is just a warning, while I think it should be an error since you want the code to not run for this combination of input parameters. Silently changing inputs upon calculations is also to be avoided.
But more generally, why raise an error if the frequencies are not equal? Would it not make sense to have a default behaviour (unequal frequencies with replacement, equal frequencies without replacement), and the user can overwrite this if they want? Or even more simply, always sample with replacement?
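For illustration, the suggested default dispatch could look roughly like this (a hypothetical helper, not existing CLIMADA code):

```python
import numpy as np

def _resolve_with_replacement(frequencies, with_replacement=None):
    """Hypothetical helper: pick a sampling mode if the user did not specify one."""
    equal_freqs = np.allclose(frequencies, frequencies[0])
    if with_replacement is None:
        # default: unequal frequencies -> with replacement, equal -> without
        return not equal_freqs
    # an explicit user choice overrides the default
    return with_replacement
```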
Thanks for the feedback :)
I see your point. Actually, I also thought the easiest way is just always to sample with replacement but maybe there was a reason to implement sampling without replacement that I do not realize.
I suggest this implementation:
- default: sample with replacement
- optional parameter `with_replacement` can be set to `False` to sample without replacement. If frequencies of the events are not equal, we raise an error.
What do you think?
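A minimal sketch of that proposal (hypothetical and simplified; the real `sample_events` also does per-year bookkeeping that is omitted here):

```python
import numpy as np

def sample_events(events_per_year, freqs_orig, seed=None, with_replacement=True):
    """Sketch of the proposed interface, not the actual CLIMADA implementation."""
    freqs = np.asarray(freqs_orig, dtype=float)
    if not with_replacement and not np.allclose(freqs, freqs[0]):
        raise ValueError(
            "Sampling without replacement requires equal event frequencies; "
            "use with_replacement=True instead."
        )
    rng = np.random.default_rng(seed)
    probs = freqs / freqs.sum()
    # simplified per-year draw, only to make the sketch self-contained
    return [
        rng.choice(freqs.size, size=n_events, replace=with_replacement, p=probs)
        for n_events in events_per_year
    ]
```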
> Optional parameter `with_replacement` can be set to `False` to sample without replacement. If frequencies of the events are not equal, we raise an error.
Would you say that it is always wrong to sample without replacement when the frequencies are unequal? My understanding was that it is only a problem in certain cases. So maybe rather raise a warning and explain the potential problem?
Hm, good question. I think strictly speaking, sampling without replacement (as it is currently implemented) is always wrong if the frequencies are unequal. However, I guess the error is small if the frequencies are not very different, and it only becomes significant in cases like the one above.
For me it would be okay to just raise a warning, as you suggest. Should I implement that?
Sounds like a good compromise. Anyway, the default will be with replacement, and thus most users won't have problems.
I now changed the default to sample with replacement, and a warning is issued when sampling without replacement with unequal event frequencies.
I also changed the default to not apply the correction factor, so that by default the yearly impacts are actual samples of the original impact object.
Great! Just so that we do not introduce a confusing default change, can @carmensteinmann please confirm that setting the correction factor application to false by default is a good idea?
climada/util/yearsets.py
Outdated
```diff
 def impact_yearset_from_sampling_vect(
-    imp, sampled_years, sampling_vect, correction_fac=True
+    imp, sampled_years, sampling_vect, correction_fac=False
```
If we keep it as is (pending confirmation by @carmensteinmann), we should adjust the docstring. I unfortunately cannot suggest a change here directly in the comment.
So that we do not forget, the changelog needs to be updated once done.
climada/util/yearsets.py
Outdated
```python
event_id=np.arange(1, len(sampled_years) + 1),
unit=imp.unit,
haz_type=imp.haz_type,
frequency=np.ones(len(sampled_years)) / len(sampled_years),
```
Suggested change:

```diff
 frequency=np.ones(len(sampled_years)) / len(sampled_years),
+frequency_unit='1/year'
+aai_agg=np.sum(yimp.frequency * yimp.at_event)
```
And eai_exp is not defined?
To your suggestion: I think `yimp` is not yet defined, so I cannot use it in its own initialization (when defining `aai_agg`), right? I could define it in the initialization using `aai_agg=np.sum(imp_per_year) / len(sampled_years)`, would that be better? Then the frequency would be hard-coded to be constant, though.
To `eai_exp`: I believe the impact yearset does not have spatial information anymore: it builds upon `imp.at_event` of the original impact object, so it is aggregated over the locations. We could add `eai_exp` with one entry equal to the `aai_agg`, if you think that would be useful?
Ah, correct. Yes, just use the frequency definition instead, so:

```diff
-frequency=np.ones(len(sampled_years)) / len(sampled_years),
+frequency=frequency,
+aai_agg=np.sum(frequency * imp_per_year)
```

and somewhere above `frequency = np.ones(len(sampled_years)) / len(sampled_years)`.
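Putting the pieces together, the constructor call could then look roughly like this (a sketch combining the suggestions above; passing `at_event=imp_per_year` and the exact keyword set are assumptions, not the final code):

```python
frequency = np.ones(len(sampled_years)) / len(sampled_years)

yimp = Impact(
    event_id=np.arange(1, len(sampled_years) + 1),
    unit=imp.unit,
    haz_type=imp.haz_type,
    at_event=imp_per_year,
    frequency=frequency,
    frequency_unit='1/year',
    aai_agg=np.sum(frequency * imp_per_year),
)
```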
In the previous implementation, eai_exp remains whatever was in imp.eai_exp. Not that this was a good idea necessarily, but the current implementation changes the output.
Yes, in the previous implementation, `yimp` was simply a copy of the original impact object, and just a few attributes were changed. E.g., the impact matrix was also transferred if the original impact object had one. I think this results in a quite confusing output object (a kind of hybrid between original events and sampled yearly impacts). That is why I changed it such that `yimp` only has attributes that correspond to the yearly impacts, and yes, that output can look quite different from before.
Does it make sense to change the output in this way? Or are you saying that it is problematic to change the output?
Or are you saying that it would be valuable to additionally transfer just `imp.eai_exp`, as there is still some spatial info in there? Even though the numbers of `imp.eai_exp` would then not add up to `imp.aai_agg` if we do not use the correction factor.
I see, it is much clearer now without carrying over the arguments from before, I agree. Maybe mention it in the Changelog and we are good.
The eai_exp should be computable from the impact matrix and the sampling vector, right? But this would probably be a separate PR.
Perfect, I'll adapt the changelog.
Yes, using the original impact matrix and the sampling vector one could compute the `eai_exp` of the sampled impacts. Still, the `yimp` does not have any location information anymore (e.g. centroids), so maybe it would also be confusing to save the `eai_exp` in the `yimp` object.
A different way would be to compute a spatially explicit yearly impact set, i.e., one where the sampled yearly impacts are saved per centroid (and not aggregated). There the `eai_exp` would make a lot of sense. If we think this would be interesting, we can do it in a separate PR :)
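If someone picks this up later, one possible sketch (assuming `imp_mat` is the original event-by-centroid impact matrix as a `scipy.sparse` matrix and `sampling_vect` is the list of per-year arrays of sampled event indices; the helper name and return layout are made up for illustration):

```python
import numpy as np
from scipy import sparse

def yearly_eai_exp(imp_mat, sampling_vect):
    """Sketch: per-year impacts per centroid and the resulting eai_exp."""
    n_years = len(sampling_vect)
    # per year, sum the impact-matrix rows of all sampled events
    yearly_rows = [
        np.asarray(imp_mat[events, :].sum(axis=0)).ravel()
        if len(events) > 0 else np.zeros(imp_mat.shape[1])
        for events in sampling_vect
    ]
    yearly_imp_mat = sparse.csr_matrix(np.vstack(yearly_rows))
    # each sampled year gets frequency 1 / n_years
    eai_exp = np.asarray(yearly_imp_mat.sum(axis=0)).ravel() / n_years
    return yearly_imp_mat, eai_exp
```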
I think this would be interesting in the context of the multi-hazard impact module #1077 . I would suggest focusing on the multi-hazard impact module instead of tweaking the existing yearset util methods.
climada/util/yearsets.py
Outdated
```python
    frequency=np.ones(len(sampled_years)) / len(sampled_years),
)

yimp.aai_agg = np.sum(yimp.frequency * yimp.at_event)
```
Suggested change:

```diff
-yimp.aai_agg = np.sum(yimp.frequency * yimp.at_event)
```
Small cosmetic suggestion and a question regarding the change in the […]. Once this last question is resolved, good to go!
After discussion with Carmen, we decided to keep the default parameters as before.
I adapted the docstrings, tutorial, and changelog accordingly.
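For reference, a hedged end-to-end usage sketch with the kept defaults (the year labels are arbitrary, and it is assumed here that `impact_yearset_from_sampling_vect` accepts a minimal `Impact` with only `at_event` and `frequency` set, as in the example at the top of this thread, and returns the yearly impact object):

```python
import numpy as np
import climada.util.yearsets as yearsets
from climada.engine import Impact

imp = Impact()
imp.at_event = np.arange(10, 110, 10)
imp.frequency = np.concatenate([np.array([1.0]), np.ones(9) * 0.01])

sampled_years = list(range(2001, 2011))  # ten arbitrary year labels
lam = np.sum(imp.frequency)              # expected number of events per year
events_per_year = yearsets.sample_from_poisson(len(sampled_years), lam, seed=123)
sampling_vect = yearsets.sample_events(events_per_year, imp.frequency, seed=123)

# build the yearly impact object; with_replacement and correction_fac are left at
# their (kept) defaults
yimp = yearsets.impact_yearset_from_sampling_vect(imp, sampled_years, sampling_vect)
```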
While checking the impact yearsets tutorial, I noticed a few potential improvements to the yearset module. The main conceptual point was that the impact events were sampled without replacement. This seemed to work well if all events had the same frequency, but resulted in strange behavior if the frequencies varied a lot (see example in the comment below). Also, the frequency of the final impact object was unclear.
Changes proposed in this PR: