fix: curve fit nonscalar initial guess and bound #3452
src/scipp/curve_fit.py
        b[1].to(unit=unit, dtype=float).value,
    )
    return b
le, ri = (v if isinstance(v, Variable) else scalar(v) for v in b)
This is odd. Previously we required that either both bounds are variables or both are numbers. Now they can be mixed. Normally, we don't allow this. See, e.g., linspace and arange which can take either variables or numbers, but never a mixture.
I think curve_fit should not deviate from this.
What's more, scipp.scipy.optimize.curve_fit does not allow this. It will only lead to confusion when the two functions deviate from each other.
I agree it's better to have them both be scalars or both be variables.
There's also a bug here: if inf is passed as a bound, you will get a unit error unless the associated parameter is dimensionless.
I wouldn't call this a bug. This is consistent with how we treat raw numbers elsewhere. They are treated as dimensionless.
Yeah, you're right.
def _as_scalar(obj, unit):
    if unit == default_unit:
        return obj
    return scalar(value=obj, unit=unit)
Related to the discussion about mixed variable / number bounds: Deviating in the behaviour of params from scipp.scipy.optimize.curve_fit will be confusing.
Also, this is potentially in conflict with the signature of the fit model. Given a case like this
def model(x: sc.Variable, a: sc.Variable, b: float) -> sc.Variable: ...
curve_fit(['x'], model, da, p0={'a': sc.scalar(1.0), 'b': 2.0})
curve_fit will call this with b = sc.scalar(2.0). This violates the expectations of the author of model. And it effectively limits the signatures of fit models to only allow Scipp variables as arguments.
It is possible to make that limitation. But then it should be clearly documented that fit models are only allowed to take variables as arguments. And again, this should then also be done for scipp.scipy.optimize.curve_fit.
The main argument for always converting parameters and bounds to variables is that it reduces complexity: with only scipp variables as parameters and bounds, there are fewer combinations of argument types that we need to test for.
I don't really see the benefit of allowing non-scipp parameter values. Do you have a situation in mind where that would be useful?
If the motivation is that we want to avoid having to wrap the fit function, I think that's not very useful, because
1. wrapping a function is very easy to do in python, and
2. wrapping will have to be done in most situations anyway, because only rarely will your coordinate names and parameter names match the argument names of the function you are calling.
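Point 2 can be illustrated with a small example (the names `gaussian`, `tof`, `a`, `mu`, `sigma` are made up for illustration, not taken from the PR):

```python
import math


def gaussian(x, amplitude, center, width):
    """Existing function with its own argument names."""
    return amplitude * math.exp(-(((x - center) / width) ** 2))


# Thin wrapper mapping the fit's coordinate name ('tof') and parameter
# names ('a', 'mu', 'sigma') onto the existing function's arguments.
def model(tof, a, mu, sigma):
    return gaussian(tof, amplitude=a, center=mu, width=sigma)
```

It is `model`, not `gaussian`, that would be passed to the fitting call, which is the kind of trivial wrapping the comment refers to.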
Agreed about the documentation.
About scipp.scipy.optimize.curve_fit: maybe we should consider removing it to avoid having to keep the two in sync. We wouldn't even have to remove it from the API; it should be possible to implement scipp.scipy.optimize.curve_fit in terms of scipp.curve_fit.
I mainly brought this up to make sure that you considered this issue and that the interface is clearly defined. I don't know of a concrete use case. Fit parameters will always be floats, so wrapping them in a variable is always possible.
Concerning removing scipp.scipy.optimize.curve_fit, I'd be very careful because
- Its interface is simpler and sufficient in most applications.
- You need to make sure that curve_fit is not significantly slower. Bear in mind that applications like vanadium peak removal need to invoke the fit function many times. And scipp.scipy.optimize.curve_fit is already almost 2x slower than using scipy directly.
> And scipp.scipy.optimize.curve_fit is already almost 2x slower than using scipy directly.
Yes, performance is an issue for small fits.
The overhead for a very small 1d fit is (on my computer):
* 920us for scipp.curve_fit
* 510us for scipp.optimize.curve_fit
* 230us for scipy.optimize.curve_fit
import scipp as sc
from scipp.scipy import optimize
import numpy as np
import scipy as sp
t = sc.linspace('t', 0, 10, 10)
da = sc.DataArray(t**2 + t + 1, coords=dict(t=t))
%timeit sc.curve_fit(['t'], lambda t, a, b: t**2 + a*t + b, da)
%timeit optimize.curve_fit(lambda t, *, a, b: t**2 + a*t + b, da)
%timeit sp.optimize.curve_fit(lambda t, a, b: t**2 + a*t + b, t.values, da.data.values)
> Its interface is simpler and sufficient in most applications.
I think it is marginally simpler for 1d cases, since you don't have to pass the name of the coordinate that is used in the fit. But I don't think it is so much simpler that that alone is a good argument for keeping it.
I think having two functions with the same name doing similar things is a bigger source of confusion.
But I agree that the performance difference for small fits might on its own be a sufficient argument for keeping it.
Is the peak finding a significant portion of the runtime in the diffraction workflow?
> Is the peak finding a significant portion of the runtime in the diffraction workflow?
Probably not. The current plan is to run it offline and store the results. It was meant as an example to remind ourselves that the overhead of calling the fit can be relevant.
I'd argue that a rare 2x slowdown is not worth complicating things over.
(Rare in the sense "rarely the factor determining total runtime")
Co-authored-by: Jan-Lukas Wynen <jan-lukas.wynen@ess.eu>
Fixes #3437