Skip to content

Conversation

@SimonHeybrock
Copy link
Member

Fixes #3277.

Reviewer: Please consider the logic very carefully, I think it is probably not enough to look at the diff. Please also verify that we can indeed skip the bin-index validation.

Add support for operations on slices of binned data, which raised previously.

Aside from these changes, I am avoiding overhead in the setup logic of binned variables (tracing down this overhead is what actually led me to discover the linked bug). For few bins this is not so relevant, but it becomes significant when we have many bins. This may also impact some benchmarks @YooSunYoung is running for data-streaming applications.

Examples:

# binned data array, with data-array as content
da.copy()
da * 1
da.bins.concat()

For 50M bins and 100M events in da, times reduce from around 2 seconds to 0.7 seconds, i.e., we removed 1.3 seconds overhead. For 10M bins I see around 250 ms gain, i.e., we can see this is roughly proportional to the total bin count.

Copy link
Member

@jl-wynen jl-wynen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation looks good to me.

const auto indices2d =
makeVariable<scipp::index_pair>(Dims{Dim::X, Dim::Y}, Shape{2, 2},
Values{std::pair{0, 1}, std::pair{1, 2},
std::pair{2, 4}, std::pair{4, 5}});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add an empty bin?

auto data_buffer =
variable::variableFactory().create(type, dims, unit, variances);
// If the buffer size is unchanged and input indices match output indices we
// can use a cheap and simply copy of the buffers coords and masks.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// can use a cheap and simply copy of the buffers coords and masks.
// can use a cheap and simple copy of the buffer's coords and masks.

return make_bins(copy(indices), dim,
variableFactory().create(type, dims, unit, variances));
return make_bins_no_validate(
indices, dim, variableFactory().create(type, dims, unit, variances));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this now lead to sharing indices? Or is this function only called with a fresh set of indices that the user has no handle of?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell it is always fresh, but please double check! The intention is that this does not lead to sharing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it is only called in one place and that passes in freshly created indices.

@SimonHeybrock SimonHeybrock force-pushed the remove-binned-data-operation-overhead branch from 9addc8f to 5a56d47 Compare October 11, 2023 08:50
@SimonHeybrock SimonHeybrock merged commit 351eefd into main Oct 11, 2023
@SimonHeybrock SimonHeybrock deleted the remove-binned-data-operation-overhead branch October 11, 2023 10:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Operations on transposed (or similar) binned data return incorrect values in bin-element coords/masks/attrs

3 participants