Require that all dataset items have the same sizes #3199
Conversation
c42ce4b to 4ef9965
SimonHeybrock left a comment:
Partial review:
```cpp
Sizes m_sizes;
holder_type m_items;
bool m_readonly{false};
bool m_sizes_are_set{false};
```
Remind me why we need this mechanism, is it so we can create an empty dataset?
I now wonder if this (the mechanism and Dataset) could be implemented a bit simpler: Now (in contrast to when Dataset was first written) we have a shared ownership mechanism. Can we simply insert coords into all the data-array's coords that are in the dataset? Basically, Dataset is just a simple map<str, DataArray>, with a mechanism where setting coords inserts into all (and inserting new items will add links to existing coords, etc.).
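To make the suggestion concrete, a rough sketch (hypothetical class and method names such as `MapDataset`; ownership details glossed over; not the current implementation):

```cpp
// Hypothetical sketch of the map-based design. It relies on Variable's
// shared ownership: setting a coord inserts the same underlying buffer
// into every item instead of copying it.
class MapDataset {
public:
  void setCoord(const Dim dim, const Variable &coord) {
    m_coords.insert_or_assign(dim, coord); // cheap: Variables share buffers
    for (auto &[name, da] : m_items)
      da.coords().set(dim, coord); // insert a link into every item
  }
  void setData(const std::string &name, DataArray da) {
    for (const auto &[dim, coord] : m_coords)
      da.coords().set(dim, coord); // new items get links to existing coords
    m_items.insert_or_assign(name, std::move(da));
  }

private:
  std::map<Dim, Variable> m_coords; // dataset-level coords, shared with items
  std::map<std::string, DataArray> m_items;
};
```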
> Remind me why we need this mechanism, is it so we can create an empty dataset?
Yes
> I now wonder if this (the mechanism and Dataset) could be implemented a bit simpler [...]
I don't see any immediate argument against this. But we should do it separately; this PR is already complicated enough.
```cpp
ASSERT_EQ(coords.sizes(), Sizes());
ASSERT_FALSE(coords.sizes_are_set());
```
Leaving aside how one would implement this on the C++ side, I would feel much better if Dataset().coords would return None if there is no data. This whole sizes_are_set mechanism looks like it was no fun at all to implement, risks some odd bugs and is hard to think about. What is your view on this?
Would you still return this even when coords have been assigned? E.g.
```python
ds = sc.Dataset(coords={'x': sc.scalar(1)})
```
No, then they are not None.
Then this would no longer work:
```python
ds = sc.Dataset(coords={'x': sc.scalar(1)})
ds['a'] = sc.arange('x', 6)
```

Here, inserting a scalar coord does not set the sizes. So we can insert an array data item. To support this, we need a way to track whether the dataset is supposed to contain scalars or whether the coord does not relate to the data's dims.
I think that is fine? The sizes of the coords are explicitly {}, I do not see a real problem?
It is conceptually a bit strange because normally, scalar coords don't relate to the size of the dataset / data array. But I don't think this is a strong reason for supporting it here.
One last argument against your suggestion: It would make type checking annoying. Every use of ds.coords[key] would be flagged as a bad union access because None doesn't support the operation. There is no easy way around it for the user. They would have to either write lengthy code to check the type, cast to Mapping[str, sc.Variable], or suppress the check in this line.
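For illustration, a sketch of what user code would look like under that proposal (assuming `coords` were annotated as returning `Coords | None`):

```python
from typing import Mapping, cast

import scipp as sc

ds = sc.Dataset(coords={'x': sc.scalar(1)})

x = ds.coords['x']  # mypy would flag this: "Coords | None" is not indexable

# The user's options, none of them pleasant:
assert ds.coords is not None           # 1. narrow the type at every use
x = ds.coords['x']

coords = cast(Mapping[str, sc.Variable], ds.coords)  # 2. cast away the union
x = coords['x']

x = ds.coords['x']  # type: ignore[index]            # 3. suppress the check
```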
Type checking is a good argument. I suppose we have the same problem for, among others, da.unit and da.bins. Not sure what to do about all these?
Hmm, but can't you apply the same argument to uninitialized sizes (unless you represent them as empty dict)?
They are currently represented as an empty dict. That's not great 🙁
To avoid these problems (and avoid mypy annoyances), we would have to disallow creating empty, unsized datasets. I can't quite remember why we wanted support for those in the first place.
What I'm saying is that, once you create a Dataset, it has a fixed size. E.g., sc.Dataset() is scalar and so is sc.Dataset(coords={'x': sc.scalar(1)}).
Discussion summary:

- Revert addition of `sizes` arg to `Dataset.__init__`.
- Require either data or coords passed to `Dataset.__init__`, such that we can always define sizes: `Dataset({})` -> `sizes={}`, `Dataset()` -> exception, `Dataset(coords={})` -> `sizes={}`.
- If sizes are unknown, use a `dict` or `DataGroup` and construct the `Dataset` after adding items (see the sketch below).
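A minimal sketch of the last point, assuming the agreed Python API:

```python
import scipp as sc

# While the sizes are unknown, collect items in a plain dict (or sc.DataGroup)...
items = {}
items['a'] = sc.arange('x', 6)
items['b'] = sc.arange('x', 6) ** 2

# ...and construct the Dataset only once all items exist, so its sizes
# are well-defined from the start.
ds = sc.Dataset(data=items)
```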
4ef9965 to a8b9893
The previous approach is no longer required because all dataset items have the same dims now. Also, it was broken, see issue #3191.
# Conflicts:
#   docs/about/release-notes.rst
#   docs/tutorials/solar_flares.ipynb

# Conflicts:
#   docs/about/release-notes.rst
#   tests/coords/transform_coords_test.py
SimonHeybrock left a comment:
Before going further with the review, can we discuss the following comments? I think they show the setDataInit approach may not be safe enough.
```diff
 template <class Op, class A>
 auto apply_with_broadcast(const Op &op, const A &a, const DataArray &b) {
   Dataset res;
   for (const auto &item : a)
-    res.setData(item.name(), op(item, b));
-  return res;
+    res.setDataInit(item.name(), op(item, b));
+  return res.is_valid() ? res : Dataset({}, {});
 }

 template <class Op, class B>
 auto apply_with_broadcast(const Op &op, const DataArray &a, const B &b) {
   Dataset res;
   for (const auto &item : b)
-    res.setData(item.name(), op(a, item));
+    res.setDataInit(item.name(), op(a, item));
   return res;
 }
```
There is a discrepancy here. The latter can return an invalid dataset if the input has no items?
In general, is it the correct approach to return Dataset({}, {})? I was first wondering if we should just keep the coords, but that is of course not what the old implementation did. Would it make sense to create a dataset with the same dims? Conceptually:
```cpp
Dataset res(merge(a.dims(), b.dims())); // not necessarily this syntax, defines dims
for (const auto &item : a)
  res.setData(item.name(), ...); // no need for `setDataInit`
return res; // no risk of returning invalid
```
Didn't we conclude that we shouldn't have a constructor that takes sizes?
But this approach seems reasonable provided that we can always predict the output shape of op(a, item). Is it guaranteed to produce the merged dims of the inputs?
> Didn't we conclude that we shouldn't have a constructor that takes sizes?
I think we did (at least on the Python side), but I cannot quite remember why. But maybe we need it in C++?
> But this approach seems reasonable provided that we can always predict the output shape of `op(a, item)`. Is it guaranteed to produce the merged dims of the inputs?
I think so, that is what transform does?
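For elementwise binary operations this does hold, since they broadcast to the merged dims:

```python
import scipp as sc

a = sc.arange('x', 3)
b = sc.arange('y', 4)
print((a * b).dims)  # ('x', 'y'): the output carries the merged dims
```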
There are cases that are more complicated. E.g.
```cpp
template <class Func, class... Args>
Dataset apply_to_items(const Dataset &d, Func func, Args &&...args) {
  Dataset result;
  for (const auto &data : d)
    result.setDataInit(data.name(), func(data, std::forward<Args>(args)...));
  return result;
}
```

where we only know the size after applying func. So avoiding the current approach would mean handling the first iteration of the loop differently from the rest (see the sketch below). setDataInit was my attempt at keeping the loops tidy.
Also, in this case, what would be the output shape if the input has no items? Simply returning the input shape could lead to problems downstream if the caller expects a change in dims.
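For comparison, avoiding setDataInit here would look roughly like this (hypothetical sketch, syntax approximate; forwarding omitted to avoid double-moves):

```cpp
// Hypothetical: the first item defines the output sizes, so the first loop
// iteration must be handled separately, and empty input still needs a branch.
template <class Func, class... Args>
Dataset apply_to_items_no_init(const Dataset &d, Func func, Args &&...args) {
  if (d.empty())
    return Dataset({}, {}); // no item to derive sizes from
  auto it = d.begin();
  Dataset result({{it->name(), func(*it, args...)}}, {}); // sizes set here
  for (++it; it != d.end(); ++it)
    result.setData(it->name(), func(*it, args...)); // plain setData from now on
  return result;
}
```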
Another tricky example is concat, because the dim-merging logic lives in the concat function for variables and is intertwined with the actual concatenation.
So, how can we avoid returning an invalid Dataset?
```diff
 Dataset apply_to_items(const Dataset &d, Func func, Args &&...args) {
   Dataset result;
   for (const auto &data : d)
-    result.setData(data.name(), func(data, std::forward<Args>(args)...));
+    result.setDataInit(data.name(), func(data, std::forward<Args>(args)...));
   return result;
 }
```
Will return invalid dataset if input is valid but has no items? See comment above... should we init with dims instead?
lib/dataset/shape.cpp (Outdated)
```cpp
if (dss.front().empty())
  return Dataset({}, Coords(concat(map(dss, get_sizes), dim),
                            concat_maps(map(dss, get_coords), dim)));
return result;
```
Won't this return an invalid dataset if the std::all_of above is always false?
Why? It would still merge and return the coords.
But if there are no items to merge?
Do you mean when the inputs don't have any overlapping data items and none of them have coords? Then yes, it would return an invalid dataset and that is bad.
Correct me if I am wrong, but even if there are coords (but no overlapping data items), you will get an invalid dataset?
Which code?
The one you highlighted. It gets all coords of all datasets and merges them. This seems independent of whether the dss contain data or not.
To me the diff looks like that code is only called when the first item is not empty? This is my point in the original comment (5 days ago).
Right. This should check `result` instead of `dss`.
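That is, roughly (sketch of the fix, reusing the code above):

```cpp
if (result.empty()) // check the output: covers non-empty inputs whose items
                    // do not overlap, not just an empty first input
  return Dataset({}, Coords(concat(map(dss, get_sizes), dim),
                            concat_maps(map(dss, get_coords), dim)));
return result;
```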
... which brings us back to my earlier point: It feels like the "invalid dataset" solution is way too easy to mess up, even for you as the original author?
Seeing how difficult it is to agree on things here, do we even need to keep Dataset?

The whole reason for this PR was that we want to use datasets in binned data. We are far away from making that work with data groups. Also, in a call, Simon and I basically agreed to go with the mechanism of invalid datasets. We're not happy with it, but it seems to get the job done.
lib/dataset/dataset.cpp (Outdated)

```cpp
/// It can be used for creating a new dataset and filling it step by step.
///
/// When using this, always make sure to ultimately produce a valid dataset.
/// setDataInit is often called in a look.
```

Suggested change:

```diff
-/// setDataInit is often called in a look.
+/// setDataInit is often called in a loop.
```
SimonHeybrock left a comment:
Great effort, in particular refactoring old tests which were using Dataset as a hack and adding many new tests 👍
I think this is close to ready now; a few remaining comments and questions:
```cpp
[[nodiscard]] Dataset or_empty() && {
  if (is_valid())
    return std::move(*this);
  return Dataset({}, {});
}
```
👍
lib/dataset/dataset.cpp (Outdated)
```diff
-Dataset out({}, m_coords.rename_dims(names));
+Dataset out;
 for (const auto &[name, da] : m_data)
-  out.setData(name, da.rename_dims(names, false));
-return out;
+  out.setDataInit(name, da.rename_dims(names, false));
+if (out.is_valid()) {
+  out.setCoords(m_coords.rename_dims(names));
+  return out;
+}
+// out is invalid because no data has been set.
+return Dataset({}, m_coords.rename_dims(names));
```
Why can't we init from coords first, as before? Don't they carry the dimensions? I also think I saw a solution like that for copy, so this should work?
Good point! Maybe I changed this function so many times that the original solution got lost.
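For reference, the coords-first shape of the code (essentially the old version from the diff above):

```cpp
// The coords carry the dims, so they can size the dataset up front;
// no setDataInit or validity check needed.
Dataset out({}, m_coords.rename_dims(names));
for (const auto &[name, da] : m_data)
  out.setData(name, da.rename_dims(names, false));
return out;
```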
```cpp
TEST(ConcatenateTest, dataset_with_no_data_items) {
  Dataset ds;
  ds.setCoord(Dim::X,
              makeVariable<double>(Dims{Dim::X}, Shape{4}, Values{1, 2, 3, 4}));
  ds.setCoord(Dim("points"), makeVariable<double>(Dims{Dim::X}, Shape{4},
                                                  Values{.1, .2, .3, .4}));
  EXPECT_EQ(concat2(ds.slice({Dim::X, 0, 2}), ds.slice({Dim::X, 2, 4}), Dim::X),
            ds);
}

TEST(ConcatenateTest, dataset_with_no_data_items_histogram) {
  Dataset ds;
  ds.setCoord(Dim("histogram"), makeVariable<double>(Dims{Dim::X}, Shape{4},
                                                     Values{.1, .2, .3, .4}));
  ds.setCoord(Dim::X, makeVariable<double>(Dims{Dim::X}, Shape{5},
                                           Values{1, 2, 3, 4, 5}));
  EXPECT_EQ(concat2(ds.slice({Dim::X, 0, 2}), ds.slice({Dim::X, 2, 4}), Dim::X),
            ds);
}
```
Removed?
Right, such datasets were not allowed at some point. I have since added TEST_F(Concatenate1DTest, empty_dataset) further up in the same file. I'm also going to add one for histograms.
```cpp
Random rand;
rand.seed(78847891);
RandomBool rand_bool;
rand_bool.seed(93481);
da = DataArray(makeVariable<double>(Dims{Dim::X, Dim::Y}, Shape{3, 4},
                                    Values(rand(3 * 4))),
               {{Dim::X, makeVariable<double>(Dims{Dim::X}, Shape{3},
                                              Values(rand(3)))},
                {Dim::Y, makeVariable<double>(Dims{Dim::Y}, Shape{4},
                                              Values(rand(4)))}},
               {{"mask", makeVariable<bool>(Dims{Dim::X}, Shape{3},
                                            Values(rand_bool(3)))}});
```
👍
```diff
 for i in range(4):
     a.coords['1d'].values[i] = sc.DataArray(float(i) * sc.units.m)
-a.coords['dataset'] = sc.scalar(sc.Dataset(data={'a': array_1d, 'b': array_2d}))
+a.coords['dataset'] = sc.scalar(sc.Dataset(data={'a': array_1d}))
```
Are we simply going to fail loading legacy files? Or should we return a DataGroup instead?
Good question! I'd rather fail because

- Returning a DataGroup would be tricky to implement (fallback mechanism like scippnexus).
- It's unlikely that downstream code still works.
- Our HDF5 files are not meant for long-term storage, so the few files that could cause problems for users can be fixed manually.
I suppose loading individual items as data arrays from such files would be a workaround that users can employ?
Could be done. But requires knowing the file structure.
# Conflicts:
#   docs/about/release-notes.rst
Fixes #3189