
Conversation

jl-wynen (Member) commented Aug 9, 2023:

Fixes #3189

jl-wynen requested a review from SimonHeybrock on August 9, 2023 12:51.
jl-wynen force-pushed the dataset-require-matching-sizes branch from c42ce4b to 4ef9965 on August 9, 2023 14:42.
SimonHeybrock (Member) left a comment:

Partial review:

  Sizes m_sizes;
  holder_type m_items;
  bool m_readonly{false};
  bool m_sizes_are_set{false};
SimonHeybrock (Member) commented:

Remind me why we need this mechanism, is it so we can create an empty dataset?

I now wonder if this (the mechanism and Dataset) could be implemented a bit simpler: Now (in contrast to when Dataset was first written) we have a shared ownership mechanism. Can we simply insert coords into all the data-array's coords that are in the dataset? Basically, Dataset is just a simple map<str, DataArray>, with a mechanism where setting coords inserts into all (and inserting new items will add links to existing coords, etc.).
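As an illustration of the idea (a conceptual sketch, not scipp code; the class and its methods are hypothetical, and the values are assumed to be sc.DataArray objects):

    class SimpleDataset(dict):
        """Maps name -> data array; coords are shared across all items."""

        def set_coord(self, name, coord):
            # Shared ownership: every item references the same coord object.
            for da in self.values():
                da.coords[name] = coord

        def insert(self, name, da):
            # Inserting a new item links it to the coords of existing items.
            if self:
                first = next(iter(self.values()))
                for cname in first.coords:
                    if cname not in da.coords:
                        da.coords[cname] = first.coords[cname]
            self[name] = da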

jl-wynen (Member, Author) replied:

> Remind me why we need this mechanism, is it so we can create an empty dataset?

Yes

> I now wonder if this (the mechanism and Dataset) could be implemented a bit simpler [...]

I don't see any immediate argument against this. But we should do it separately; this PR is already complicated enough.

Comment on lines 133 to 134:

  ASSERT_EQ(coords.sizes(), Sizes());
  ASSERT_FALSE(coords.sizes_are_set());
SimonHeybrock (Member) commented:

Leaving aside how one would implement this on the C++ side, I would feel much better if Dataset().coords would return None if there is no data. This whole sizes_are_set mechanism looks like it was no fun at all to implement, risks some odd bugs and is hard to think about. What is your view on this?

jl-wynen (Member, Author) replied:

Would you still return None even when coords have been assigned? E.g.

ds = sc.Dataset(coords={'x': sc.scalar(1)})

SimonHeybrock (Member) replied:

No, then they are not None.

jl-wynen (Member, Author) replied:

Then this would no longer work:

ds = sc.Dataset(coords={'x': sc.scalar(1)})
ds['a'] = sc.arange('x', 6)

Here, inserting a scalar coord does not set the sizes. So we can insert an array data item. To support this, we need a way to track whether the dataset is supposed to contain scalars or whether the coord does not relate to the data's dims.
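Concretely, with the public API (a sketch; the printed sizes assume the behavior described above):

    import scipp as sc

    ds = sc.Dataset(coords={'x': sc.scalar(1)})
    print(ds.sizes)  # {} -- ambiguous: a scalar dataset, or sizes simply not set yet?
    ds['a'] = sc.arange('x', 6)
    print(ds.sizes)  # {'x': 6} -- the sizes only become known at this point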

SimonHeybrock (Member) replied:

I think that is fine? The sizes of the coords are explicitly {}; I do not see a real problem?

jl-wynen (Member, Author) replied:

It is conceptually a bit strange because normally, scalar coords don't relate to the size of the dataset / data array. But I don't think this is a strong reason against supporting it here.

One last argument against your suggestion: it would make type checking annoying. Every use of ds.coords[key] would be flagged as a bad union access because None doesn't support the operation. There is no easy way around it for the user. They would have to either write lengthy code to check the type, cast to Mapping[str, sc.Variable], or suppress the check on this line.
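For concreteness, the options would look roughly like this (a hypothetical sketch: it assumes coords were annotated as Optional, which is not the actual API):

    from typing import Mapping, cast
    import scipp as sc

    ds = sc.Dataset({'a': sc.arange('x', 3)},
                    coords={'x': sc.arange('x', 3)})

    # 1. A lengthy explicit check on every access:
    if ds.coords is not None:
        x = ds.coords['x']
    # 2. Cast away the Optional:
    x = cast(Mapping[str, sc.Variable], ds.coords)['x']
    # 3. Suppress the checker on this line:
    x = ds.coords['x']  # type: ignore[index]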

SimonHeybrock (Member) replied:

Type checking is a good argument. I suppose we have the same problem for, among others, da.unit and da.bins. Not sure what to do about all these?

SimonHeybrock (Member) added:

Hmm, but can't you apply the same argument to uninitialized sizes (unless you represent them as empty dict)?

jl-wynen (Member, Author) replied:

They are currently represented as an empty dict. That's not great 🙁

To avoid these problems (and avoid mypy annoyances), we would have to disallow creating empty, unsized datasets. I can't quite remember why we wanted support for those in the first place.
What I'm saying is that, once you create a Dataset, it has a fixed size. E.g., sc.Dataset() is scalar and so is sc.Dataset(coords={'x': sc.scalar(1)}).

SimonHeybrock (Member) replied:

Discussion summary:

  • Revert addition of sizes arg to Dataset.__init__
  • Require either data or coords passed to Dataset.__init__, such that we can always define sizes.
    • Dataset({}) -> sizes={}
    • Dataset() -> exception
    • Dataset(coords={}) -> sizes={}
  • If unknown sizes, use dict or DataGroup and construct Dataset after adding items.
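In Python terms, the agreed behavior would be (a sketch based on this summary; the exact exception type is an assumption):

    import scipp as sc

    sc.Dataset({})                        # ok: sizes == {}
    sc.Dataset(coords={})                 # ok: sizes == {}
    sc.Dataset({'a': sc.arange('x', 6)})  # ok: sizes == {'x': 6}

    try:
        sc.Dataset()                      # neither data nor coords: sizes undefined
    except ValueError:                    # assumed exception type
        print('cannot construct a Dataset without data or coords')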

jl-wynen and others added 2 commits on October 10, 2023 09:25:
Co-authored-by: Simon Heybrock <12912489+SimonHeybrock@users.noreply.github.com>
# Conflicts:
#	docs/about/release-notes.rst
#	tests/coords/transform_coords_test.py
SimonHeybrock (Member) left a comment:

Before going further with the review, can we discuss the following comments? I think it shows the setDataInit approach may not be safe enough.

Comment on lines 95 to 109:

  template <class Op, class A>
  auto apply_with_broadcast(const Op &op, const A &a, const DataArray &b) {
    Dataset res;
    for (const auto &item : a)
-     res.setData(item.name(), op(item, b));
-   return res;
+     res.setDataInit(item.name(), op(item, b));
+   return res.is_valid() ? res : Dataset({}, {});
  }

  template <class Op, class B>
  auto apply_with_broadcast(const Op &op, const DataArray &a, const B &b) {
    Dataset res;
    for (const auto &item : b)
-     res.setData(item.name(), op(a, item));
+     res.setDataInit(item.name(), op(a, item));
    return res;
  }
SimonHeybrock (Member) commented:

There is a discrepancy here. The latter can return an invalid dataset if the input has no items?

In general, is it the correct approach to return Dataset({}, {})? I was first wondering if we should just keep the coords, but that is of course not what the old implementation did. Would it make sense to create a dataset with the same dims? Conceptually:

  Dataset res(merge(a.dims(), b.dims())); // not necessarily this syntax, defines dims
  for (const auto &item : a)
    res.setData(item.name(), ...);  // no need for `setDataInit`
  return res;  // no risk of returning invalid

jl-wynen (Member, Author) replied:

Didn't we conclude that we shouldn't have a constructor that takes sizes?

But this approach seems reasonable provided that we can always predict the output shape of op(a, item). Is it guaranteed to produce the merged dims of the inputs?

SimonHeybrock (Member) replied:

> Didn't we conclude that we shouldn't have a constructor that takes sizes?

I think we did (at least on the Python side), but I cannot quite remember why. But maybe we need it in C++?

> But this approach seems reasonable provided that we can always predict the output shape of op(a, item). Is it guaranteed to produce the merged dims of the inputs?

I think so, that is what transform does?

jl-wynen (Member, Author) replied:

There are cases that are more complicated. E.g.

  template <class Func, class... Args>
  Dataset apply_to_items(const Dataset &d, Func func, Args &&...args) {
    Dataset result;
    for (const auto &data : d)
      result.setDataInit(data.name(), func(data, std::forward<Args>(args)...));
    return result;
  }

where we only know the size after applying func. So avoiding the current approach would mean handling the first iteration of the loop differently from the rest. setDataInit was my attempt at keeping the loops tidy.

Also, in this case, what would be the output shape if the input has no items? Simply returning the input shape could lead to problems downstream if the caller expects a change in dims.
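For comparison, the special-cased first iteration would look roughly like this (an illustrative Python sketch, not code from this PR):

    import scipp as sc

    def apply_to_items_no_init(ds: sc.Dataset, func) -> sc.Dataset:
        result = None
        for name in ds:
            da = func(ds[name])
            if result is None:
                result = sc.Dataset({name: da})  # first item defines the sizes
            else:
                result[name] = da
        if result is None:
            # Input had no items: the output sizes are genuinely unknown here.
            raise ValueError('cannot deduce sizes from an empty dataset')
        return result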

jl-wynen (Member, Author) added:

Another tricky example is concat, because the dim-merging logic for variables lives in the concat function itself and is intertwined with the actual concatenation.

SimonHeybrock (Member) replied:

So, how can we avoid returning invalid Dataset?

Comment on lines 78 to 83:

  Dataset apply_to_items(const Dataset &d, Func func, Args &&...args) {
    Dataset result;
    for (const auto &data : d)
-     result.setData(data.name(), func(data, std::forward<Args>(args)...));
+     result.setDataInit(data.name(), func(data, std::forward<Args>(args)...));
    return result;
  }
SimonHeybrock (Member) commented:

Will this return an invalid dataset if the input is valid but has no items? See comment above... should we init with dims instead?

Comment on lines 140 to 143:

  if (dss.front().empty())
    return Dataset({}, Coords(concat(map(dss, get_sizes), dim),
                              concat_maps(map(dss, get_coords), dim)));
  return result;
SimonHeybrock (Member) commented:

Won't this return an invalid dataset if the std::all_of above is always false?

jl-wynen (Member, Author) replied:

Why? It would still merge and return the coords.

SimonHeybrock (Member) replied:

But if there are no items to merge?

jl-wynen (Member, Author) replied:

Do you mean when the inputs don't have any overlapping data items and none of them have coords? Then yes, it would return an invalid dataset and that is bad.

SimonHeybrock (Member) replied:

Correct me if I am wrong, but even if there are coords (but no overlapping data items) you will get invalid?

SimonHeybrock (Member) asked:

Which code?

jl-wynen (Member, Author) replied:

The one you highlighted. It gets all coords of all datasets and merges them. This seems independent of whether the dss contain data or not.

SimonHeybrock (Member) replied:

To me the diff looks like that code is only called when the first item is not empty? This is my point in the original comment (5 days ago).

jl-wynen (Member, Author) replied:

Right. This should check result instead of dss.

SimonHeybrock (Member) replied:

... which brings us back to my earlier point: It feels like the "invalid dataset" solution is way too easy to mess up, even for you as the original author?

nvaytet (Member) commented Oct 16, 2023:

Seeing how difficult it is to agree on things here, do we even need to keep Dataset or can we remove it and make do with DataGroup?

jl-wynen (Member, Author) replied:

The whole reason for this PR was that we want to use datasets in binned data. We are far away from making that work with data groups.

Also, in a call, Simon and I basically agreed to go with the mechanism of invalid datasets. We're not happy with it but it seems to get the job done.

/// It can be used for creating a new dataset and filling it step by step.
///
/// When using this, always make sure to ultimately produce a valid dataset.
/// setDataInit is often called in a look.
SimonHeybrock (Member) commented:

Suggested change:
- /// setDataInit is often called in a look.
+ /// setDataInit is often called in a loop.

SimonHeybrock (Member) left a comment:

Great effort, in particular also refactoring old tests which were using Dataset as a hack and adding many new tests 👍

I think this is close to ready now, a few remaining comments and questions:

Comment on lines +227 to +231:

  [[nodiscard]] Dataset or_empty() && {
    if (is_valid())
      return std::move(*this);
    return Dataset({}, {});
  }
SimonHeybrock (Member) commented:

👍

Comment on lines 235 to 276:

- Dataset out({}, m_coords.rename_dims(names));
+ Dataset out;
  for (const auto &[name, da] : m_data)
-   out.setData(name, da.rename_dims(names, false));
- return out;
+   out.setDataInit(name, da.rename_dims(names, false));
+ if (out.is_valid()) {
+   out.setCoords(m_coords.rename_dims(names));
+   return out;
+ }
+ // out is invalid because no data has been set.
+ return Dataset({}, m_coords.rename_dims(names));
SimonHeybrock (Member) commented Oct 17, 2023:

Why can't we init from coords first, as before? Don't they carry the dimensions? I also think I saw a solution like that for copy, so this should work?

jl-wynen (Member, Author) replied:

Good point! Maybe I changed this function so many times that the original solution got lost.

Comment on lines -290 to -309 (removed):

  TEST(ConcatenateTest, dataset_with_no_data_items) {
    Dataset ds;
    ds.setCoord(Dim::X,
                makeVariable<double>(Dims{Dim::X}, Shape{4}, Values{1, 2, 3, 4}));
    ds.setCoord(Dim("points"), makeVariable<double>(Dims{Dim::X}, Shape{4},
                                                    Values{.1, .2, .3, .4}));
    EXPECT_EQ(concat2(ds.slice({Dim::X, 0, 2}), ds.slice({Dim::X, 2, 4}), Dim::X),
              ds);
  }

  TEST(ConcatenateTest, dataset_with_no_data_items_histogram) {
    Dataset ds;
    ds.setCoord(Dim("histogram"), makeVariable<double>(Dims{Dim::X}, Shape{4},
                                                       Values{.1, .2, .3, .4}));
    ds.setCoord(Dim::X, makeVariable<double>(Dims{Dim::X}, Shape{5},
                                             Values{1, 2, 3, 4, 5}));
    EXPECT_EQ(concat2(ds.slice({Dim::X, 0, 2}), ds.slice({Dim::X, 2, 4}), Dim::X),
              ds);
  }

SimonHeybrock (Member) commented:

Removed?

jl-wynen (Member, Author) replied:

Right, such datasets were not allowed at some point. I have since added TEST_F(Concatenate1DTest, empty_dataset) further up in the same file. I'm also going to add one for histograms.

Comment on lines +42 to +53:

  Random rand;
  rand.seed(78847891);
  RandomBool rand_bool;
  rand_bool.seed(93481);
  da = DataArray(makeVariable<double>(Dims{Dim::X, Dim::Y}, Shape{3, 4},
                                      Values(rand(3 * 4))),
                 {{Dim::X, makeVariable<double>(Dims{Dim::X}, Shape{3},
                                                Values(rand(3)))},
                  {Dim::Y, makeVariable<double>(Dims{Dim::Y}, Shape{4},
                                                Values(rand(4)))}},
                 {{"mask", makeVariable<bool>(Dims{Dim::X}, Shape{3},
                                              Values(rand_bool(3)))}});
SimonHeybrock (Member) commented:

👍

  for i in range(4):
      a.coords['1d'].values[i] = sc.DataArray(float(i) * sc.units.m)
- a.coords['dataset'] = sc.scalar(sc.Dataset(data={'a': array_1d, 'b': array_2d}))
+ a.coords['dataset'] = sc.scalar(sc.Dataset(data={'a': array_1d}))
SimonHeybrock (Member) commented:

Are we simply going to fail loading legacy files? Or should we return a DataGroup instead?

jl-wynen (Member, Author) replied:

Good question! I'd rather fail because

  1. Returning a data group would be tricky to implement (a fallback mechanism like scippnexus).
  2. It's unlikely that downstream code would still work.
  3. Our HDF5 files are not meant for long-term storage, so the few files that could cause problems for us can be fixed manually.

SimonHeybrock (Member) replied:

I suppose loading individual items as data arrays from such files would be a workaround that users can employ?

jl-wynen (Member, Author) replied:

Could be done, but it requires knowing the file structure.

# Conflicts:
#	docs/about/release-notes.rst
jl-wynen enabled auto-merge on October 17, 2023 09:32.
jl-wynen merged commit 2a4a270 into main on October 17, 2023.
jl-wynen deleted the dataset-require-matching-sizes branch on October 17, 2023 12:48.
Successfully merging this pull request may close the issue: Restrict Dataset to items with matching dimensionality

4 participants