Conversation

@peanutfun
Member

@peanutfun peanutfun commented Apr 17, 2025

Changes proposed in this PR:

  • Add optimizers for calibrating "average ensembles" and "ensembles of tragedies" of impact functions
  • Add option to assign weights to each calibration data point

PR Author Checklist

PR Reviewer Checklist

peanutfun and others added 30 commits March 31, 2023 16:23
# Conflicts:
#	script/jenkins/branches/Jenkinsfile
#	tests_runner.py
Use negative cost function as target function in BayesianOptimizer

Co-authored-by: Thomas Vogt <57705593+tovogt@users.noreply.github.com>
Collaborator

@emanuel-schmid emanuel-schmid left a comment

Awesome amendment! A very well-written module, I'd say.
I just wish the pydoc strings were more elaborate, and I wonder about the use of dataclasses.

Comment on lines 37 to 40
from ...engine.unsequa.input_var import InputVar
from ...entity.impact_funcs import ImpactFunc, ImpactFuncSet
from ..coordinates import country_to_iso
from .base import Input, Optimizer, Output
Collaborator

PEP 8: "Absolute imports are recommended, as they are usually more readable and tend to be better behaved (or at least give better error messages)"

Imo, this is probably true for ..-modules. For siblings, relative imports are OK.

Suggested change
from ...engine.unsequa.input_var import InputVar
from ...entity.impact_funcs import ImpactFunc, ImpactFuncSet
from ..coordinates import country_to_iso
from .base import Input, Optimizer, Output
from climada.engine.unsequa.input_var import InputVar
from climada.entity.impact_funcs import ImpactFunc, ImpactFuncSet
from climada.util.coordinates import country_to_iso
from .base import Input, Optimizer, Output

Comment on lines +155 to +159
@dataclass
class EnsembleOptimizerOutput:
"""The collective output of an ensemble optimization"""

data: pd.DataFrame
Collaborator

This is stretching the concept of a dataclass quite a lot, as I see it. For instance, you can't even do this:

EnsembleOptimizerOutput(df1) == EnsembleOptimizerOutput(df2)

Wouldn't it be more natural to have an ordinary class that uses from_outputs as __init__?

Member Author

I am using dataclass to avoid boilerplate code. In my view, the __init__ of the class should be as permissive as possible. Therefore, a usual init would look like this:

class MyClass:

    def __init__(self, attr1, attr2):
        self.attr1 = attr1
        self.attr2 = attr2

I think this code is trivial, and using dataclass to generate it seems reasonable to me. It improves readability, maintainability, and extensibility.

Wouldn't it be more natural to have an ordinary class that uses from_outputs as __init__?

I think from_outputs is too restrictive for making it the default initialization (although I use it exclusively throughout my code). See my comment above.

You can't even do this:

A good point! I will implement a custom __eq__. But I don't think that's a reason to abandon dataclass in this case.

Member Author

I implemented a custom __eq__ for EnsembleOptimizerOutput
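A sketch of why this matters, and what such a custom __eq__ could look like: a DataFrame attribute breaks the generated dataclass __eq__, because DataFrame.__eq__ compares elementwise and the resulting frame has no single truth value. The class below is an illustrative stand-in, not the actual CLIMADA implementation; it suppresses the generated __eq__ and compares with DataFrame.equals instead.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass(eq=False)  # suppress the generated __eq__, which would raise on DataFrames
class EnsembleOptimizerOutput:
    """Collective output of an ensemble optimization (illustrative sketch)."""

    data: pd.DataFrame

    def __eq__(self, other):
        # DataFrame.equals returns a single bool, unlike DataFrame.__eq__
        if not isinstance(other, EnsembleOptimizerOutput):
            return NotImplemented
        return self.data.equals(other.data)
```

With this, `EnsembleOptimizerOutput(df1) == EnsembleOptimizerOutput(df2)` evaluates to a plain bool as the reviewer requested.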

return ax


@dataclass
Collaborator

Input has no __eq__, hence this class has none either. Here too, I'm not sure why it needs to be a dataclass.
Is it about the implicit immutability of a dataclass object? If so, maybe we should even make it explicit:

Suggested change
@dataclass
@dataclass(frozen=True)

Although this would require dropping the __post_init__ methods and InitVars, replacing them with something like

@classmethod
def from_initvars(cls, x, y):
    return cls(something(x,y))

Even so, it would be nice to allow comparing two such objects meaningfully, even if it means we have to implement __eq__ for a dataclass ourselves. Otherwise I honestly don't see the point of calling it a dataclass.

Member Author

Again, I use dataclass mostly to avoid a trivial init. This is especially useful for derived classes, where you might easily miss that you also need to initialize the base class.

I did not check it, but a default __eq__ should be created by dataclass. I did not really consider freezing the instances, as I think it's not necessary. Input is probably the only sensible candidate for freezing, as results are distorted if one changes it.
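The "avoid a trivial init in derived classes" argument can be sketched as follows. All names here are hypothetical stand-ins: with dataclass inheritance, the generated __init__ of the subclass initializes the base fields first, so a derived class cannot forget to call the base initializer.

```python
from dataclasses import dataclass


@dataclass
class BaseOptimizer:
    """Hypothetical base with fields every optimizer needs."""

    input_data: dict
    verbose: bool = False


@dataclass
class DerivedOptimizer(BaseOptimizer):
    """The generated __init__ covers base fields before derived ones,
    so base-class initialization cannot be missed."""

    num_samples: int = 10
```

Usage: `DerivedOptimizer({"haz": 1}, num_samples=5)` sets both the inherited and the new fields, and the generated __eq__ compares all of them.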

Collaborator

Not really convinced, given the need to implement __eq__ and __post_init__, but I guess it doesn't matter much. 😎


@dataclass
class TragedyEnsembleOptimizer(EnsembleOptimizer):
"""An optimizer for the ensemble of tragedies
Collaborator

A very short summary that doesn't give much of a clue what this implementation of the EnsembleOptimizer does.

Member Author

Should be better now

self.samples = rng.choice(self.samples, ensemble_size, replace=False)

def input_from_sample(self, sample: list[tuple[int, int]]):
"""Subselect all input"""
Collaborator

🤔

Member Author

I agree that this does not become clear in the docs. The idea is the following: each EnsembleOptimizer only stores a list of samples, which contains the indices of the data points to be used in each optimization. The concrete optimizer then defines in input_from_sample how a new Input object is generated for each sample, and thus for each optimization. In TragedyEnsembleOptimizer, this is used to drastically reduce the amount of hazard data that needs to be processed by each optimization. I'll try to clarify.
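One plausible reading of that mechanism, as a sketch: each sample is a list of (row, column) index pairs into the calibration DataFrame, and subselecting keeps only those cells. The function below is a hypothetical stand-in for the module's sample_data helper, not its actual implementation.

```python
import numpy as np
import pandas as pd


def sample_data(data: pd.DataFrame, sample: list[tuple[int, int]]) -> pd.DataFrame:
    """Keep only the (row, column) cells listed in 'sample'.

    All other cells become NaN, so the optimization built from this
    sample only ever sees its own subselection of the data.
    """
    data_sampled = pd.DataFrame(
        np.nan, index=data.index, columns=data.columns, dtype=float
    )
    for irow, icol in sample:
        data_sampled.iloc[irow, icol] = data.iloc[irow, icol]
    return data_sampled
```

Under this reading, a concrete input_from_sample would wrap such a subselected frame (plus matching weights) into a fresh Input for each optimization run.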

Member Author

I hope I clarified


@dataclass
class AverageEnsembleOptimizer(EnsembleOptimizer):
"""An optimizer for the average ensemble
Collaborator

🤔 Hard to understand. I wish the description were more elaborate. See below.

Member Author

I hope I clarified


@dataclass
class EnsembleOptimizer(ABC):
"""Abstract base class for defining an ensemble optimizer
Collaborator

What is an ensemble optimizer?

Member Author

Will clarify in the docs. An optimizer yields a single set of optimal parameters (or impact functions), whereas an ensemble optimizer yields an ensemble of them.
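That distinction can be sketched in a few lines. This is a toy illustration only: run_ensemble and optimize_one are hypothetical stand-ins, and the lambda is a dummy "optimizer", not an actual calibration.

```python
def run_ensemble(samples, optimize_one):
    """A plain optimizer returns one parameter set; an ensemble optimizer
    repeats the optimization once per sample and collects every result."""
    return [optimize_one(sample) for sample in samples]


ensemble = run_ensemble(
    samples=[[0, 1], [1, 2], [0, 2]],           # toy data-point subselections
    optimize_one=lambda s: {"scale": sum(s) / len(s)},  # dummy stand-in optimizer
)
# 'ensemble' now holds three parameter sets instead of a single optimum
```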

Member Author

Done!

@peanutfun
Member Author

@emanuel-schmid I hope the docs are clearer now!

I also found a bug in the implementation where the replace parameter in AverageEnsembleOptimizer would not have any effect, because the data structure only allowed adding a sample once. I fixed this by adding an additional data structure data_weights to Input, along with some tests. Effectively, drawing a sample more than once in AverageEnsembleOptimizer multiplies the weights of that sample. Other than that, users are now free to weight their data as they see fit (although we did not consider that necessary before).

@peanutfun peanutfun requested a review from emanuel-schmid July 9, 2025 07:48
Collaborator

@emanuel-schmid emanuel-schmid left a comment

Thanks for improving the docs! Sorry for taking so long.
It all looks good to me, apart from a few minor, mainly cosmetic, concerns.
Happy to merge.

return -self.input.cost_func(data, predicted, weights)

def run(self, controller: BayesianOptimizerController) -> BayesianOptimizerOutput:
def run(self, **opt_kwargs) -> BayesianOptimizerOutput:
Collaborator

Why is the signature not just like the pydoc describes it?

Member Author

Same argument as we had in the discussion about the unsequa abstract base class:

The Optimizer abstract base class defines an interface that should be kept for all derived classes. However, the derived classes can, or in this case must, take additional, class-specific arguments. The only elegant (?) way I see for doing this is simply going with a variadic kwargs argument and then documenting the method as if it had the actual arguments that are needed here.

tldr: The linter would complain: https://pylint.pycqa.org/en/latest/user_guide/messages/warning/arguments-differ.html
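The pattern being described could look roughly like this. DemoBayesianOptimizer is a hypothetical name for illustration; the base class keeps the variadic signature so derived classes don't trigger pylint's arguments-differ warning, and the derived class pops its required keyword argument itself.

```python
from abc import ABC, abstractmethod


class Optimizer(ABC):
    """Base interface: run() stays variadic so derived classes can take
    class-specific keyword arguments without changing the signature."""

    @abstractmethod
    def run(self, **opt_kwargs):
        ...


class DemoBayesianOptimizer(Optimizer):
    def run(self, **opt_kwargs):
        # Pop the required argument and fail loudly if it is missing.
        try:
            controller = opt_kwargs.pop("controller")
        except KeyError as err:
            raise RuntimeError(
                "run requires 'controller' as keyword argument"
            ) from err
        return controller
```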

Member Author

Note that we do the same thing in the ScipyOptimizer:

def run(self, **opt_kwargs) -> ScipyMinimizeOptimizerOutput:

Collaborator

🤦 aah, right! 👍

Comment on lines +647 to +654
# Take the controller
try:
controller = opt_kwargs.pop("controller")
except KeyError as err:
raise RuntimeError(
"BayesianOptimizer.run requires 'controller' as keyword argument"
) from err

Collaborator

All of this wouldn't be necessary with a signature as described in the pydoc.

Member Author

Yes, but see above.


LOGGER = logging.getLogger(__name__)


def sample_data(data: pd.DataFrame, sample: list[tuple[int, int]]):
Collaborator

The function could be private if we wanted (though with regard to possible extensions, better not).

return data_sampled


def sample_weights(weights: pd.DataFrame, sample: list[tuple[int, int]]):
Collaborator

Ditto.

return weights_sampled


def event_info_from_input(inp: Input) -> dict[str, Any]:
Collaborator

Ditto.

peanutfun and others added 2 commits September 3, 2025 14:31
Co-authored-by: Emanuel Schmid <51439563+emanuel-schmid@users.noreply.github.com>
@emanuel-schmid
Collaborator

@peanutfun this is ready for merging, isn't it?

@emanuel-schmid emanuel-schmid merged commit 593ebf0 into develop Sep 29, 2025
19 checks passed
@emanuel-schmid emanuel-schmid deleted the cross-calibrate-impact-functions branch September 29, 2025 15:59