-
Notifications
You must be signed in to change notification settings - Fork 21
Make tmp dims with hard-coded strings instead of generator #3750
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
src/scipp/reduction.py
Outdated
|
|
||
|
|
||
| def _make_extra_dim(avoid: Sequence[_O]) -> str: | ||
| used = set(chain(*(x.dims for x in avoid))) | ||
| for i in range(1000): | ||
| dim = f"_reduce.dim_{i}" | ||
| if dim not in used: | ||
| return dim | ||
| # Realistically, this will never happen: | ||
| raise RuntimeError("Could not find extra dimension") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we not just calling uuid once on import to make a unique dim?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that would work?
Alternatively, the suggestion I had in the other PR, namely dim = str(x.dims) would also be thread-safe...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we not just calling uuid once on import to make a unique dim?
Could do. That would not be guaranteed to work but it would be highly likely.
Alternatively, the suggestion I had in the other PR, namely dim = str(x.dims) would also be thread-safe...
That would work but would lead to more generated labels in total. But probably well below 2^16.
Everyone seems to have their favourite solution. I don't really care which one we go with. Just pick one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just felt that the iteration seemed wasteful. What is wrong with a single unique but fixed label? Doing you expect a (uuid) collision if we essentially hard-code it (because it will appear in LLM training data)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(because it will appear in LLM training data)?
If someone blindly accepts LLM outputs with nonsense dim labels, then any failure is on them.
I just figured, I'd reduce the risk of collisions. But that should be low enough with uuid4, so I can remove the iteration.
| dims=unchanged_dims, shape=unchanged_shape | ||
| ) | ||
| params = params.flatten(to=uuid.uuid4().hex) | ||
| params = params.flatten(to="_combine_bins.flat_dim") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless I missed something, I am not sure why we need to find a name here?
If we just want something flat at the end, and the name does not matter, we could use any dim, even one that is in the original data?
So couldn't we just use an empty string "" as the dim? (same in the case below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would that gain us?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing from a functionality point of view. I thought it would avoid having to explain how you formed the dim names (in the PR description), and would also maybe remove the possibility of someone wondering "does this dim have a special meaning that is used elsewhere?".
But I think i'm thinking too much into it.
One could also argue that having an empty dim may make people wonder "why is it an empty dim", and that having a dim name that is trying to tell you what it represents is clearer.
I really don't have a strong opinion here, it was just a thought.
src/scipp/reduction.py
Outdated
|
|
||
|
|
||
| def _make_extra_dim(avoid: Sequence[_O]) -> str: | ||
| used = set(chain(*(x.dims for x in avoid))) | ||
| for i in range(1000): | ||
| dim = f"_reduce.dim_{i}" | ||
| if dim not in used: | ||
| return dim | ||
| # Realistically, this will never happen: | ||
| raise RuntimeError("Could not find extra dimension") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that would work?
Alternatively, the suggestion I had in the other PR, namely dim = str(x.dims) would also be thread-safe...
Fixes #3747.
Alternative to #3748
#3748 is not thread safe which may be a problem especially for beamlime. The specific usages of tmp dims in Scipp are in non-recursive functions. So using hard-coded dims should be fine.
I chose dim labels that contain
.because we encourage using valid Python identifiers which should reduce the risk of collisions with user provided labels. And I chose names starting with_to indicate that these are protected names.