Skip to content

Conversation

@SimonHeybrock
Copy link
Member

No description provided.

- ``Dataset`` will no longer be able to represent certain types of data.
Users will need to resort to ``DataGroup`` instead, which has other limitations, such as requiring to duplicate coordinates.
Another option would be to replicate data values of the lower-dimensional items to match the dimensionality of the higher-dimensional items.
This would reuqire more memory, but would force the users to be explicit about the meaning of data they want to represent.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be done with a broadcast which would only increase memory use when you do an operation with it.

SimonHeybrock and others added 2 commits July 4, 2023 08:07
Co-authored-by: Jan-Lukas Wynen <jan-lukas.wynen@ess.eu>
Recently we have introduced ``DataGroup``, which drops the restriction of compatible dimension extents but, unlike ``Dataset``, does in turn not provide support for joint coordinates.
The addition of ``DataGroup`` was triggered by a long series of cases where we realized that ``Dataset`` is not useful and flexbile enough in real applications.
This is not to say that ``Dataset`` is entirely useless, but it is not as useful as we had hoped.
One key area that is not covered by ``DataGroup`` is the representation of table-like data (or multi-dimensional generalizations thereof), in a manner similar to ``pandas.DataFrame``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the name Dataset was inspired by the Xarray Dataset which can have members with different dimensionality, should we consider renaming our Dataset to DataFrame?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually had a paragraph with that consideration in an earlier draft, and then removed it :D

I now think that DataFrame is too loaded by Pandas knowledge:

  • Always 1-D
  • Indices, no coordinates
  • No dimension labels
  • Big well known API that we do not plan to adopt

~~~~~~~~

There are two possible ways of reasoning about ``Dataset``.
Firstly, we may argue that while technically complex, the work has already been done, and the problems detailed below are encountered only in edge cases.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we finally going to refactor the C++ implementation of DataArray to not use Dataset but its own dedicated implementation? (at the moment, it almost feels like we already have legacy code in the C++ that no one wants to touch because it 'works')
(and maybe the same with the 'buckets' which should really be called 'bins'?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already did that several years ago.

@SimonHeybrock SimonHeybrock merged commit adec01d into main Jul 6, 2023
@SimonHeybrock SimonHeybrock deleted the adr-dataset-item-dimensionality branch July 6, 2023 07:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants