Skip to content

Conversation

@jl-wynen
Copy link
Member

@jl-wynen jl-wynen commented Aug 20, 2025

This still generates new dimension labels. But for it to lead to using too many labels, someone would have to pass the same variable through 2^16 nested calls of functions using new_dim_for. That seems highly unlikely.

Fixes #3747.

@jl-wynen jl-wynen requested a review from nvaytet August 20, 2025 06:52
_USED_AUX_DIMS = []


def new_dim_for(*data: Variable | DataArray) -> str:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of uuid, can we simply use something like str(x.dims)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but what does that gain us? If the same dim gets reused for another variable later on, str(x.dims) for the previous x no longer has any meaning.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but what does that gain us

Avoid having to store _USED_AUX_DIMS?
As far as I understand, then new dims are created locally and thrown away immediately after? (maybe I missed something?)

If the same dim gets reused for another variable later on, str(x.dims) for the previous x no longer has any meaning.

I didn't get that part. Can you explain maybe with an example?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My implementation stores all labels that it generates and tries to reuse them whenever possible. So the dims are not thrown away but kept for later, sort of like the underlying mechanism keeps the dim ids around as well.

If we use str(x.dims), then we actually throw away the dim after using it. But the id will stick around. That is fine if we generate the dim repeatedly for inputs that have the same dims. But if the input dims change, then we generate a new dim. My proposed implementation does not.

I didn't get that part. Can you explain maybe with an example?

I thought you wanted to keep the result of str(x.dims) in _USED_AUX_DIMS and reuse it. In that case, you might, e.g., generate the dim from a var with ['x', 'y'] and get 'x-y' but then later use that dim for another variable with dims ['t']. So the temporary label would bear no relation to that new variable.
I initially thought that you wanted to do this to get more understandable dim labels in which case, this could lead to confusion. But ultimately, it does not matter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought you wanted to keep the result of str(x.dims) in _USED_AUX_DIMS and reuse it.

Yeah no I was thnking of doing the same kind of fix I did in Tof.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I understood now. But I think that this still ultimately leads to more allocated dim labels. Not 2^16 but still more than needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the approach to store and reuse all the uuids, but maybe that is just because I had the same idea. 😬

Can we document and make this part of the public API? Downstream packages may want to and should use this, if we put in the effort to have a proven and tested solution.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make a few unit tests (e.g. that it works with scalar, Variable, and DataArray)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@SimonHeybrock
Copy link
Member

Include new_dim_for in the API docs?

Comment on lines 183 to 188
"""Return a dimension label that is not in the input's dimensions.
The returned label is intended for temporarily reshaping an array and should
not become visible to users.
The label is guaranteed to not be present in the input, but it may be used in
other variables.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should explain here why using uuid or other generated labels is problematic, i.e., why one should use this function?

:
A dimension label that is not in any variable or data array in ``data``.
"""
used = {*(x.dims for x in data)}
Copy link
Contributor

@jokasimr jokasimr Aug 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this.
From what I can tell:

  • used is a set of tuples
  • _USED_AUX_DIMS is a list of strings
  • therefore dim not in used will always evaluate to True

I imagine that is not the intended behavior?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, this does not unpack the dims properly. I updated it to use chain.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a test that checks if the problem is resolved? Maybe something like the experiment Celine did

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. How would you do that? We can make 2**16 variables using

def test_new_dim_reuses_labels() -> None:
    # Scipp supports up to 2**16 different dim labels.
    # Make more arrays than that to test that new_dim_for reuses labels.
    for _ in range(2**16 + 10):
        dim = sc.new_dim_for(sc.arange('x', 1000))
        # This should not raise:
        sc.array(dims=[dim], values=[1, 2, 3])

But that test alone takes 7s with a debug build.

We cannot simply check that _USED_AUX_DIMS does not grow during a given test because it may be used concurrently by a test on a different thread.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can get it down to 3s by pulling sc.arange('x', 1000) out of the loop. Is this acceptable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, _USED_AUX_DIMS is not thread save, right? Does this mean that, e.g., using Scipp with Dask can break badly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can get it down to 3s ... Is this acceptable?

Yeah, seems acceptable to me. But if we're confident the implementation works as expected and is unlikely to break we don't necessarily need to run it every test run. I think it's fine to only run expensive tests more rarely.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Errors from using uuid for creating dimension labels

5 participants