
Conversation

jl-wynen (Member) commented Aug 9, 2023:

Fixes #3189

jl-wynen requested a review from SimonHeybrock on August 9, 2023 12:51.
jl-wynen force-pushed the dataset-require-matching-sizes branch from c42ce4b to 4ef9965 on August 9, 2023 14:42.
SimonHeybrock (Member) left a comment:

Partial review:

  Sizes m_sizes;
  holder_type m_items;
  bool m_readonly{false};
  bool m_sizes_are_set{false};
SimonHeybrock (Member) commented:

Remind me why we need this mechanism, is it so we can create an empty dataset?

I now wonder if this (the mechanism and Dataset) could be implemented a bit simpler: Now (in contrast to when Dataset was first written) we have a shared ownership mechanism. Can we simply insert coords into all the data-array's coords that are in the dataset? Basically, Dataset is just a simple map<str, DataArray>, with a mechanism where setting coords inserts into all (and inserting new items will add links to existing coords, etc.).
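As an illustration of the idea (a conceptual sketch, not scipp code; the class and its methods are hypothetical, and the values are assumed to be sc.DataArray objects):

    class SimpleDataset(dict):
        """Maps name -> data array; coords are shared across all items."""

        def set_coord(self, name, coord):
            # Shared ownership: every item references the same coord object.
            for da in self.values():
                da.coords[name] = coord

        def insert(self, name, da):
            # Inserting a new item links it to the coords of existing items.
            if self:
                first = next(iter(self.values()))
                for cname in first.coords:
                    if cname not in da.coords:
                        da.coords[cname] = first.coords[cname]
            self[name] = da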

jl-wynen (Member, Author) replied:

> Remind me why we need this mechanism, is it so we can create an empty dataset?

Yes

> I now wonder if this (the mechanism and Dataset) could be implemented a bit simpler [...]

I don't see any immediate argument against this. But we should do it separately; this PR is already complicated enough.

Comment on lines 133 to 134:

  ASSERT_EQ(coords.sizes(), Sizes());
  ASSERT_FALSE(coords.sizes_are_set());
SimonHeybrock (Member) commented:

Leaving aside how one would implement this on the C++ side, I would feel much better if Dataset().coords would return None if there is no data. This whole sizes_are_set mechanism looks like it was no fun at all to implement, risks some odd bugs and is hard to think about. What is your view on this?

jl-wynen (Member, Author) replied:

Would you still return None even when coords have been assigned? E.g.

ds = sc.Dataset(coords={'x': sc.scalar(1)})

SimonHeybrock (Member) replied:

No, then they are not None.

jl-wynen (Member, Author) replied:

Then this would no longer work:

ds = sc.Dataset(coords={'x': sc.scalar(1)})
ds['a'] = sc.arange('x', 6)

Here, inserting a scalar coord does not set the sizes. So we can insert an array data item. To support this, we need a way to track whether the dataset is supposed to contain scalars or whether the coord does not relate to the data's dims.
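Concretely, with the public API (a sketch; the printed sizes assume the behavior described above):

    import scipp as sc

    ds = sc.Dataset(coords={'x': sc.scalar(1)})
    print(ds.sizes)  # {} -- ambiguous: a scalar dataset, or sizes simply not set yet?
    ds['a'] = sc.arange('x', 6)
    print(ds.sizes)  # {'x': 6} -- the sizes only become known at this point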

SimonHeybrock (Member) replied:

I think that is fine? The sizes of the coords are explicitly {}; I do not see a real problem?

jl-wynen (Member, Author) replied:

It is conceptually a bit strange because normally, scalar coords don't relate to the size of the dataset / data array. But I don't think this is a strong reason against supporting it here.

One last argument against your suggestion: it would make type checking annoying. Every use of ds.coords[key] would be flagged as a bad union access because None doesn't support the operation. There is no easy way around it for the user. They would have to either write lengthy code to check the type, cast to Mapping[str, sc.Variable], or suppress the check on this line.
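For concreteness, the options would look roughly like this (a hypothetical sketch: it assumes coords were annotated as Optional, which is not the actual API):

    from typing import Mapping, cast
    import scipp as sc

    ds = sc.Dataset({'a': sc.arange('x', 3)},
                    coords={'x': sc.arange('x', 3)})

    # 1. A lengthy explicit check on every access:
    if ds.coords is not None:
        x = ds.coords['x']
    # 2. Cast away the Optional:
    x = cast(Mapping[str, sc.Variable], ds.coords)['x']
    # 3. Suppress the checker on this line:
    x = ds.coords['x']  # type: ignore[index]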

SimonHeybrock (Member) replied:

Type checking is a good argument. I suppose we have the same problem for, among others, da.unit and da.bins. Not sure what to do about all these?

SimonHeybrock (Member) added:

Hmm, but can't you apply the same argument to uninitialized sizes (unless you represent them as empty dict)?

jl-wynen (Member, Author) replied:

They are currently represented as an empty dict. That's not great 🙁

To avoid these problems (and avoid mypy annoyances), we would have to disallow creating empty, unsized datasets. I can't quite remember why we wanted support for those in the first place.
What I'm saying is that, once you create a Dataset, it has a fixed size. E.g., sc.Dataset() is scalar and so is sc.Dataset(coords={'x': sc.scalar(1)}).

SimonHeybrock (Member) replied:

Discussion summary:

  • Revert addition of sizes arg to Dataset.__init__
  • Require either data or coords passed to Dataset.__init__, such that we can always define sizes.
    • Dataset({}) -> sizes={}
    • Dataset() -> exception
    • Dataset(coords={}) -> sizes={}
  • If unknown sizes, use dict or DataGroup and construct Dataset after adding items.
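In Python terms, the agreed behavior would be (a sketch based on this summary; the exact exception type is an assumption):

    import scipp as sc

    sc.Dataset({})                        # ok: sizes == {}
    sc.Dataset(coords={})                 # ok: sizes == {}
    sc.Dataset({'a': sc.arange('x', 6)})  # ok: sizes == {'x': 6}

    try:
        sc.Dataset()                      # neither data nor coords: sizes undefined
    except ValueError:                    # assumed exception type
        print('cannot construct a Dataset without data or coords')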

jl-wynen and others added 2 commits on October 10, 2023 09:25:
Co-authored-by: Simon Heybrock <12912489+SimonHeybrock@users.noreply.github.com>
# Conflicts:
#	docs/about/release-notes.rst
#	tests/coords/transform_coords_test.py
SimonHeybrock (Member) left a comment:

Before going further with the review, can we discuss the following comments? I think it shows the setDataInit approach may not be safe enough.

Comment on lines 95 to 109:

  template <class Op, class A>
  auto apply_with_broadcast(const Op &op, const A &a, const DataArray &b) {
    Dataset res;
    for (const auto &item : a)
-     res.setData(item.name(), op(item, b));
-   return res;
+     res.setDataInit(item.name(), op(item, b));
+   return res.is_valid() ? res : Dataset({}, {});
  }

  template <class Op, class B>
  auto apply_with_broadcast(const Op &op, const DataArray &a, const B &b) {
    Dataset res;
    for (const auto &item : b)
-     res.setData(item.name(), op(a, item));
+     res.setDataInit(item.name(), op(a, item));
    return res;
  }
SimonHeybrock (Member) commented:

There is a discrepancy here. The latter can return an invalid dataset if the input has no items?

In general, is it the correct approach to return Dataset({}, {})? I was first wondering if we should just keep the coords, but that is of course not what the old implementation did. Would it make sense to create a dataset with the same dims? Conceptually:

  Dataset res(merge(a.dims(), b.dims())); // not necessarily this syntax, defines dims
  for (const auto &item : a)
    res.setData(item.name(), ...);  // no need for `setDataInit`
  return res;  // no risk of returning invalid

jl-wynen (Member, Author) replied:

Didn't we conclude that we shouldn't have a constructor that takes sizes?

But this approach seems reasonable provided that we can always predict the output shape of op(a, item). Is it guaranteed to produce the merged dims of the inputs?

SimonHeybrock (Member) replied:

> Didn't we conclude that we shouldn't have a constructor that takes sizes?

I think we did (at least on the Python side), but I cannot quite remember why. But maybe we need it in C++?

> But this approach seems reasonable provided that we can always predict the output shape of op(a, item). Is it guaranteed to produce the merged dims of the inputs?

I think so, that is what transform does?

jl-wynen (Member, Author) replied:

There are cases that are more complicated. E.g.

  template <class Func, class... Args>
  Dataset apply_to_items(const Dataset &d, Func func, Args &&...args) {
    Dataset result;
    for (const auto &data : d)
      result.setDataInit(data.name(), func(data, std::forward<Args>(args)...));
    return result;
  }

where we only know the size after applying func. So avoiding the current approach would mean handling the first iteration of the loop differently from the rest. setDataInit was my attempt at keeping the loops tidy.

Also, in this case, what would be the output shape if the input has no items? Simply returning the input shape could lead to problems downstream if the caller expects a change in dims.
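For comparison, the special-cased first iteration would look roughly like this (an illustrative Python sketch, not code from this PR):

    import scipp as sc

    def apply_to_items_no_init(ds: sc.Dataset, func) -> sc.Dataset:
        result = None
        for name in ds:
            da = func(ds[name])
            if result is None:
                result = sc.Dataset({name: da})  # first item defines the sizes
            else:
                result[name] = da
        if result is None:
            # Input had no items: the output sizes are genuinely unknown here.
            raise ValueError('cannot deduce sizes from an empty dataset')
        return result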

jl-wynen (Member, Author) added:

Another tricky example is concat, because the dim-merging logic for variables lives in the concat function itself and is intertwined with the actual concatenation.

SimonHeybrock (Member) replied:

So, how can we avoid returning invalid Dataset?

Comment on lines 78 to 83:

  Dataset apply_to_items(const Dataset &d, Func func, Args &&...args) {
    Dataset result;
    for (const auto &data : d)
-     result.setData(data.name(), func(data, std::forward<Args>(args)...));
+     result.setDataInit(data.name(), func(data, std::forward<Args>(args)...));
    return result;
  }
SimonHeybrock (Member) commented:

Will this return an invalid dataset if the input is valid but has no items? See comment above... should we init with dims instead?

Comment on lines 140 to 143:

  if (dss.front().empty())
    return Dataset({}, Coords(concat(map(dss, get_sizes), dim),
                              concat_maps(map(dss, get_coords), dim)));
  return result;
SimonHeybrock (Member) commented:

Won't this return an invalid dataset if the std::all_of above is always false?

jl-wynen (Member, Author) replied:

Why? It would still merge and return the coords.

SimonHeybrock (Member) replied:

But if there are no items to merge?

jl-wynen (Member, Author) replied:

Do you mean when the inputs don't have any overlapping data items and none of them have coords? Then yes, it would return an invalid dataset and that is bad.

SimonHeybrock (Member) replied:

Correct me if I am wrong, but even if there are coords (but no overlapping data items) you will get invalid?

SimonHeybrock (Member) asked:

Which code?

jl-wynen (Member, Author) replied:

The one you highlighted. It gets all coords of all datasets and merges them. This seems independent of whether the dss contain data or not.

SimonHeybrock (Member) replied:

To me the diff looks like that code is only called when the first item is not empty? This is my point in the original comment (5 days ago).

jl-wynen (Member, Author) replied:

Right. This should check result instead of dss.

SimonHeybrock (Member) replied:

... which brings us back to my earlier point: It feels like the "invalid dataset" solution is way too easy to mess up, even for you as the original author?

nvaytet (Member) commented Oct 16, 2023:

Seeing how difficult it is to agree on things here, do we even need to keep Dataset or can we remove it and make do with DataGroup?

jl-wynen (Member, Author) replied:

The whole reason for this PR was that we want to use datasets in binned data. We are far away from making that work with data groups.

Also, in a call, Simon and I basically agreed to go with the mechanism of invalid datasets. We're not happy with it but it seems to get the job done.

/// It can be used for creating a new dataset and filling it step by step.
///
/// When using this, always make sure to ultimately produce a valid dataset.
/// setDataInit is often called in a look.
SimonHeybrock (Member) commented:

Suggested change:
- /// setDataInit is often called in a look.
+ /// setDataInit is often called in a loop.

SimonHeybrock (Member) left a comment:

Great effort, in particular also refactoring old tests which were using Dataset as a hack and adding many new tests 👍

I think this is close to ready now, a few remaining comments and questions:

Comment on lines +227 to +231:

  [[nodiscard]] Dataset or_empty() && {
    if (is_valid())
      return std::move(*this);
    return Dataset({}, {});
  }
SimonHeybrock (Member) commented:

👍

Comment on lines 235 to 276:

- Dataset out({}, m_coords.rename_dims(names));
+ Dataset out;
  for (const auto &[name, da] : m_data)
-   out.setData(name, da.rename_dims(names, false));
- return out;
+   out.setDataInit(name, da.rename_dims(names, false));
+ if (out.is_valid()) {
+   out.setCoords(m_coords.rename_dims(names));
+   return out;
+ }
+ // out is invalid because no data has been set.
+ return Dataset({}, m_coords.rename_dims(names));
SimonHeybrock (Member) commented Oct 17, 2023:

Why can't we init from coords first, as before? Don't they carry the dimensions? I also think I saw a solution like that for copy, so this should work?

jl-wynen (Member, Author) replied:

Good point! Maybe I changed this function so many times that the original solution got lost.

Comment on lines -290 to -309 (removed):

  TEST(ConcatenateTest, dataset_with_no_data_items) {
    Dataset ds;
    ds.setCoord(Dim::X,
                makeVariable<double>(Dims{Dim::X}, Shape{4}, Values{1, 2, 3, 4}));
    ds.setCoord(Dim("points"), makeVariable<double>(Dims{Dim::X}, Shape{4},
                                                    Values{.1, .2, .3, .4}));
    EXPECT_EQ(concat2(ds.slice({Dim::X, 0, 2}), ds.slice({Dim::X, 2, 4}), Dim::X),
              ds);
  }

  TEST(ConcatenateTest, dataset_with_no_data_items_histogram) {
    Dataset ds;
    ds.setCoord(Dim("histogram"), makeVariable<double>(Dims{Dim::X}, Shape{4},
                                                       Values{.1, .2, .3, .4}));
    ds.setCoord(Dim::X, makeVariable<double>(Dims{Dim::X}, Shape{5},
                                             Values{1, 2, 3, 4, 5}));
    EXPECT_EQ(concat2(ds.slice({Dim::X, 0, 2}), ds.slice({Dim::X, 2, 4}), Dim::X),
              ds);
  }

SimonHeybrock (Member) commented:

Removed?

jl-wynen (Member, Author) replied:

Right, such datasets were not allowed at some point. I have since added TEST_F(Concatenate1DTest, empty_dataset) further up in the same file. I'm also going to add one for histograms.

Comment on lines +42 to +53:

  Random rand;
  rand.seed(78847891);
  RandomBool rand_bool;
  rand_bool.seed(93481);
  da = DataArray(makeVariable<double>(Dims{Dim::X, Dim::Y}, Shape{3, 4},
                                      Values(rand(3 * 4))),
                 {{Dim::X, makeVariable<double>(Dims{Dim::X}, Shape{3},
                                                Values(rand(3)))},
                  {Dim::Y, makeVariable<double>(Dims{Dim::Y}, Shape{4},
                                                Values(rand(4)))}},
                 {{"mask", makeVariable<bool>(Dims{Dim::X}, Shape{3},
                                              Values(rand_bool(3)))}});
SimonHeybrock (Member) commented:

👍

  for i in range(4):
      a.coords['1d'].values[i] = sc.DataArray(float(i) * sc.units.m)
- a.coords['dataset'] = sc.scalar(sc.Dataset(data={'a': array_1d, 'b': array_2d}))
+ a.coords['dataset'] = sc.scalar(sc.Dataset(data={'a': array_1d}))
SimonHeybrock (Member) commented:

Are we simply going to fail loading legacy files? Or should we return a DataGroup instead?

jl-wynen (Member, Author) replied:

Good question! I'd rather fail because

  1. Returning a data group would be tricky to implement (a fallback mechanism like scippnexus).
  2. It's unlikely that downstream code would still work.
  3. Our HDF5 files are not meant for long-term storage, so the few files that could cause problems for us can be fixed manually.

SimonHeybrock (Member) replied:

I suppose loading individual items as data arrays from such files would be a workaround that users can employ?

jl-wynen (Member, Author) replied:

Could be done, but it requires knowing the file structure.

# Conflicts:
#	docs/about/release-notes.rst
jl-wynen enabled auto-merge on October 17, 2023 09:32.
jl-wynen merged commit 2a4a270 into main on October 17, 2023.
jl-wynen deleted the dataset-require-matching-sizes branch on October 17, 2023 12:48.
Successfully merging this pull request may close the issue: Restrict Dataset to items with matching dimensionality

4 participants