Add classes for calibrating average ensemble and ensemble of tragedies #1048
Conversation
Commits:
- … # Conflicts: script/jenkins/branches/Jenkinsfile, tests_runner.py
- Use negative cost function as target function in BayesianOptimizer (Co-authored-by: Thomas Vogt <57705593+tovogt@users.noreply.github.com>)
- …DA-project/climada_python into calibrate-impact-functions
- …pact-functions # Conflicts: doc/user-guide/climada_util_calibrate.ipynb
emanuel-schmid left a comment:
Awesome amendment! Very well written module I'd say.
I just wish the pydoc strings were more elaborate and I wonder about the use of dataclasses.
climada/util/calibrate/ensemble.py (outdated)
```python
from ...engine.unsequa.input_var import InputVar
from ...entity.impact_funcs import ImpactFunc, ImpactFuncSet
from ..coordinates import country_to_iso
from .base import Input, Optimizer, Output
```
PEP 8: "Absolute imports are recommended, as they are usually more readable and tend to be better behaved (or at least give better error messages)"
Imo, this is probably true for `..`-modules. For siblings, relative imports are OK.
Suggested change:
```python
from climada.engine.unsequa.input_var import InputVar
from climada.entity.impact_funcs import ImpactFunc, ImpactFuncSet
from climada.util.coordinates import country_to_iso
from .base import Input, Optimizer, Output
```
```python
@dataclass
class EnsembleOptimizerOutput:
    """The collective output of an ensemble optimization"""

    data: pd.DataFrame
```
In my view, this stretches the concept of a dataclass quite a lot. For instance, you can't even do this:

```python
EnsembleOptimizerOutput(df1) == EnsembleOptimizerOutput(df2)
```

Is there a reason why it has to be a dataclass? Wouldn't it be more natural to have an ordinary class that uses `from_outputs` as `__init__`?
I am using `dataclass` to avoid boilerplate code. In my view, the `__init__` of the class should be as permissive as possible. Therefore, a usual init would look like this:

```python
class MyClass:
    def __init__(self, attr1, attr2):
        self.attr1 = attr1
        self.attr2 = attr2
```

I think this code is trivial, and using `dataclass` to generate it seems reasonable to me. It improves readability, maintainability, and extensibility.
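With `dataclass`, the same class reduces to its field declarations; a minimal sketch with placeholder field names and types:

```python
from dataclasses import dataclass


@dataclass
class MyClass:
    """Equivalent to the handwritten __init__ above."""

    attr1: int
    attr2: int
```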
> Wouldn't it be more natural to have an ordinary class that uses `from_outputs` as `__init__`?

I think `from_outputs` is too restrictive to make it the default initialization (although I use it exclusively throughout my code). See my comment above.
> You can't even do this:

A good point! I will implement a custom `__eq__`. But I don't think that's a reason to abandon `dataclass` in this case.
I implemented a custom `__eq__` for `EnsembleOptimizerOutput`.
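For illustration, such a custom `__eq__` could look roughly like this; this is a sketch assuming `data` is the only field, not necessarily the exact implementation:

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class EnsembleOptimizerOutput:
    """The collective output of an ensemble optimization"""

    data: pd.DataFrame

    def __eq__(self, other):
        # The default dataclass __eq__ does not work here: comparing two
        # DataFrames with `==` returns a DataFrame, not a bool.
        if not isinstance(other, EnsembleOptimizerOutput):
            return NotImplemented
        return self.data.equals(other.data)
```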
```python
        return ax


@dataclass
```
`Input` has no `__eq__`, hence this class has none either. Here too, I'm not sure why it needs to be a dataclass.
Is it about the implicit immutability of a dataclass object? If so, maybe we should even make it explicit:

Suggested change:
```diff
-@dataclass
+@dataclass(frozen=True)
```
Although this would require dropping the `__post_init__` methods and `InitVar`s and replacing them with something like

```python
@classmethod
def from_initvars(cls, x, y):
    return cls(something(x, y))
```

Even so, it would be nice to allow comparing two such objects meaningfully, even if that means we have to implement `__eq__` for a dataclass. Otherwise I honestly don't see the point of calling it a dataclass.
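For illustration, a minimal sketch of the frozen variant with a classmethod constructor instead of `InitVar`/`__post_init__` (toy names, unrelated to the actual module):

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """Toy frozen dataclass: fields cannot be reassigned after creation."""

    x: float
    y: float

    @classmethod
    def from_polar(cls, r: float, phi: float) -> "Point":
        # Derived values are computed here instead of in __post_init__
        return cls(r * math.cos(phi), r * math.sin(phi))


p = Point.from_polar(1.0, 0.0)
assert p == Point(1.0, 0.0)  # frozen dataclasses still generate __eq__
```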
Again, I use `dataclass` mostly to avoid a trivial init. This is especially useful for derived classes, where you might easily miss that you also need to initialize the base class.
I did not check it, but a default `__eq__` should be created by `dataclass`. I did not really consider freezing the instances, as I don't think it's necessary. `Input` is probably the only sensible candidate for freezing, as results are distorted if it is changed.
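For illustration, a derived dataclass automatically gets an `__init__` (and `__eq__`) that also covers the base-class fields, so the base class cannot be forgotten (toy names, not the actual classes):

```python
from dataclasses import dataclass


@dataclass
class Base:
    threshold: float


@dataclass
class Derived(Base):
    n_iterations: int = 10


# The generated __init__ takes the inherited field first, then the new one,
# and the generated __eq__ compares all fields of both classes.
d = Derived(threshold=0.5, n_iterations=20)
assert d == Derived(0.5, 20)
```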
Not really convinced, given the need to implement `__eq__` and `__post_init__`, but I guess it doesn't matter much. 😎
climada/util/calibrate/ensemble.py (outdated)
```python
@dataclass
class TragedyEnsembleOptimizer(EnsembleOptimizer):
    """An optimizer for the ensemble of tragedies
```
A very short summary that doesn't give much of a clue about what this implementation of the EnsembleOptimizer does.
Should be better now
```python
        self.samples = rng.choice(self.samples, ensemble_size, replace=False)

    def input_from_sample(self, sample: list[tuple[int, int]]):
        """Subselect all input"""
```
🤔
I agree that this does not become clear in the docs. The idea is the following: each `EnsembleOptimizer` only stores a list of samples, which contains the indices of the data points to be used in each optimization. The concrete optimizer then defines in `input_from_sample` how a new `Input` object is generated for each sample and thus for each optimization. In `TragedyEnsembleOptimizer`, this is used to drastically reduce the amount of hazard data each optimization has to process. I'll try to clarify.
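For illustration, the pattern is roughly the following (a simplified sketch with made-up signatures, not the actual code):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class EnsembleOptimizerSketch(ABC):
    """Simplified stand-in for the actual EnsembleOptimizer."""

    # Each entry is one sample: the indices of the data points that a single
    # optimization run should use.
    samples: list = field(default_factory=list)

    @abstractmethod
    def input_from_sample(self, sample):
        """Build the Input object for a single optimization from one sample."""

    def run(self, optimize_single):
        # One optimization per sample; the collected outputs form the ensemble.
        return [optimize_single(self.input_from_sample(s)) for s in self.samples]
```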
I hope I clarified
climada/util/calibrate/ensemble.py (outdated)
```python
@dataclass
class AverageEnsembleOptimizer(EnsembleOptimizer):
    """An optimizer for the average ensemble
```
🤔 Hard to understand. I wish the description were more elaborate; see below.
I hope I clarified
climada/util/calibrate/ensemble.py (outdated)
```python
@dataclass
class EnsembleOptimizer(ABC):
    """Abstract base class for defining an ensemble optimizer
```
What is an ensemble optimizer?
Will clarify in the docs. An optimizer yields a single set of optimal parameters (or impact functions), whereas an ensemble optimizer yields an ensemble of them.
Done!
- Use weights for sampling with replacement in AverageEnsembleOptimizer.
- Update tests
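For reference, weighted sampling with replacement can be done with NumPy's generator API; a minimal sketch with placeholder indices and weights (the actual weights come from the calibration input):

```python
import numpy as np

rng = np.random.default_rng(seed=42)
samples = np.arange(8)        # placeholder sample indices
weights = np.full(8, 1 / 8)   # placeholder weights, must sum to 1
drawn = rng.choice(samples, size=4, replace=True, p=weights)
```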
@emanuel-schmid I hope the docs are clearer now! I also found a bug in the implementation where the …
…-project/climada_python into cross-calibrate-impact-functions
emanuel-schmid left a comment:
Thanks for improving the docs! Sorry for taking so long.
It all looks good to me, apart from a few minor, mainly cosmetic, concerns.
Happy to merge.
```diff
         return -self.input.cost_func(data, predicted, weights)

-    def run(self, controller: BayesianOptimizerController) -> BayesianOptimizerOutput:
+    def run(self, **opt_kwargs) -> BayesianOptimizerOutput:
```
Why is the signature not just like the pydoc describes it?
Same argument as we had in the discussion about the unsequa abstract base class:
The `Optimizer` abstract base class defines an interface that should be kept by all derived classes. However, the derived classes can, or in this case must, take additional, class-specific arguments. The only elegant (?) way I see of doing this is to go with a variadic kwargs argument and then document the method as if it had the actual arguments needed here.
tl;dr: The linter would otherwise complain: https://pylint.pycqa.org/en/latest/user_guide/messages/warning/arguments-differ.html
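For illustration, the pattern looks roughly like this (a simplified sketch, not the actual class bodies):

```python
from abc import ABC, abstractmethod


class OptimizerSketch(ABC):
    """Interface kept identical across all derived optimizers."""

    @abstractmethod
    def run(self, **opt_kwargs):
        """Run the optimization. Derived classes document their kwargs."""


class BayesianOptimizerSketch(OptimizerSketch):
    def run(self, **opt_kwargs):
        # Same signature as the base class; the class-specific argument is
        # documented in the docstring and retrieved from the kwargs instead.
        try:
            controller = opt_kwargs.pop("controller")
        except KeyError as err:
            raise RuntimeError(
                "run requires 'controller' as keyword argument"
            ) from err
        # ... use `controller` and the remaining opt_kwargs here
        return controller
```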
Note that we do the same thing in the `ScipyOptimizer`:

```python
    def run(self, **opt_kwargs) -> ScipyMinimizeOptimizerOutput:
```
🤦 aah, right! 👍
```python
        # Take the controller
        try:
            controller = opt_kwargs.pop("controller")
        except KeyError as err:
            raise RuntimeError(
                "BayesianOptimizer.run requires 'controller' as keyword argument"
            ) from err
```
All of this wouldn't be necessary with a signature as described in the pydoc.
Yes, but see above.
```python
LOGGER = logging.getLogger(__name__)


def sample_data(data: pd.DataFrame, sample: list[tuple[int, int]]):
```
The function could be private if we wanted (with regard to possible extensions, better not though).
```python
    return data_sampled


def sample_weights(weights: pd.DataFrame, sample: list[tuple[int, int]]):
```
Ditto.
```python
    return weights_sampled


def event_info_from_input(inp: Input) -> dict[str, Any]:
```
Ditto.
Co-authored-by: Emanuel Schmid <51439563+emanuel-schmid@users.noreply.github.com>
@peanutfun this is ready for merging, isn't it?
Changes proposed in this PR:

PR Author Checklist

PR Reviewer Checklist