-
Notifications
You must be signed in to change notification settings - Fork 21
Avoid making many dimension labels #3748
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| _USED_AUX_DIMS = [] | ||
|
|
||
|
|
||
| def new_dim_for(*data: Variable | DataArray) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of uuid, can we simply use something like str(x.dims)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, but what does that gain us? If the same dim gets reused for another variable later on, str(x.dims) for the previous x no longer has any meaning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but what does that gain us
Avoid having to store _USED_AUX_DIMS?
As far as I understand, then new dims are created locally and thrown away immediately after? (maybe I missed something?)
If the same dim gets reused for another variable later on, str(x.dims) for the previous x no longer has any meaning.
I didn't get that part. Can you explain maybe with an example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My implementation stores all labels that it generates and tries to reuse them whenever possible. So the dims are not thrown away but kept for later, sort of like the underlying mechanism keeps the dim ids around as well.
If we use str(x.dims), then we actually throw away the dim after using it. But the id will stick around. That is fine if we generate the dim repeatedly for inputs that have the same dims. But if the input dims change, then we generate a new dim. My proposed implementation does not.
I didn't get that part. Can you explain maybe with an example?
I thought you wanted to keep the result of str(x.dims) in _USED_AUX_DIMS and reuse it. In that case, you might, e.g., generate the dim from a var with ['x', 'y'] and get 'x-y' but then later use that dim for another variable with dims ['t']. So the temporary label would bear no relation to that new variable.
I initially thought that you wanted to do this to get more understandable dim labels in which case, this could lead to confusion. But ultimately, it does not matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought you wanted to keep the result of str(x.dims) in _USED_AUX_DIMS and reuse it.
Yeah no I was thnking of doing the same kind of fix I did in Tof.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I understood now. But I think that this still ultimately leads to more allocated dim labels. Not 2^16 but still more than needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the approach to store and reuse all the uuids, but maybe that is just because I had the same idea. 😬
Can we document and make this part of the public API? Downstream packages may want to and should use this, if we put in the effort to have a proven and tested solution.
This breaks an import cycle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make a few unit tests (e.g. that it works with scalar, Variable, and DataArray)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
Include |
src/scipp/core/dimensions.py
Outdated
| """Return a dimension label that is not in the input's dimensions. | ||
| The returned label is intended for temporarily reshaping an array and should | ||
| not become visible to users. | ||
| The label is guaranteed to not be present in the input, but it may be used in | ||
| other variables. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should explain here why using uuid or other generated labels is problematic, i.e., why one should use this function?
src/scipp/core/dimensions.py
Outdated
| : | ||
| A dimension label that is not in any variable or data array in ``data``. | ||
| """ | ||
| used = {*(x.dims for x in data)} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this.
From what I can tell:
usedis a set of tuples_USED_AUX_DIMSis a list of strings- therefore
dim not in usedwill always evaluate toTrue
I imagine that is not the intended behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, this does not unpack the dims properly. I updated it to use chain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a test that checks if the problem is resolved? Maybe something like the experiment Celine did
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. How would you do that? We can make 2**16 variables using
def test_new_dim_reuses_labels() -> None:
# Scipp supports up to 2**16 different dim labels.
# Make more arrays than that to test that new_dim_for reuses labels.
for _ in range(2**16 + 10):
dim = sc.new_dim_for(sc.arange('x', 1000))
# This should not raise:
sc.array(dims=[dim], values=[1, 2, 3])But that test alone takes 7s with a debug build.
We cannot simply check that _USED_AUX_DIMS does not grow during a given test because it may be used concurrently by a test on a different thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can get it down to 3s by pulling sc.arange('x', 1000) out of the loop. Is this acceptable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, _USED_AUX_DIMS is not thread save, right? Does this mean that, e.g., using Scipp with Dask can break badly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can get it down to 3s ... Is this acceptable?
Yeah, seems acceptable to me. But if we're confident the implementation works as expected and is unlikely to break we don't necessarily need to run it every test run. I think it's fine to only run expensive tests more rarely.
This still generates new dimension labels. But for it to lead to using too many labels, someone would have to pass the same variable through 2^16 nested calls of functions using
new_dim_for. That seems highly unlikely.Fixes #3747.