Description of new feature
@philippemiron is converting data from NetCDF4 files into Awkward Arrays, and one of the features we lack is a place to put attributes. These can be descriptions, units, meanings of flags, etc., and a single field can have more than one of these (e.g. so that "units" are programmatically accessible). In general, any JSON-encodable data.
This applies at two levels: globally for a whole array, in such a way that all derived arrays pass on the attributes, and per-record field. The per-record field attributes should only be passed on if the meaning of the field is not changed, and should not be counted as part of the Content node's type. Also, these should be read and written to files as metadata wherever possible.
Why not use parameter __doc__?
This would work as a per-record field attribute that is passed on whenever the meaning of the field is not changed, and doesn't count as part of the Content node's type. However, it has to be a string, since this is what goes into the Python __doc__ property (and therefore IPython and Jupyter help). The attributes have to be general JSON-encodable metadata.
Also, this only encodes per-record field attributes, not global attributes.
Why not use a new parameter?
It would only be accessible through idioms like array.layout.parameters["attrs"]. This is high-level data analyst information, and they shouldn't have to go through layout, which is mid-level, for library developers. Also, such an idiom would fail if the layout ever gets buried in another Content node (i.e. the user has to be aware of layouts and how they change, which is not a high-level view!). For example, rearranging records in a RecordArray nests the RecordArray within an IndexedArray for performance reasons; so after certain kinds of slices, array.layout.parameters["attrs"] becomes array.layout.content.parameters["attrs"], probably unexpectedly.
Why not use behavior?
It's an odd thing to do, but behavior is passed from an ak.Array to any new ak.Array derived from it. Any keys in behavior that aren't recognized are ignored. However, this would only encode global attributes, not per-record field attributes, and behavior (which typically contains function objects and class objects) is not serialized when writing files, or filled when reading files.
What instead?
This is really two new features:
- Per-record field metadata, as a parameter, that follows all the same rules as
__doc__, but doesn't have to be a string and is exposed at high-level in a different way (one that allows non-strings). Parameters are already serializable.
- Global metadata that follows all the same rules as
behavior, but is serializable.
They should have names like attrs, following xarray's convention, but the per-record field attrs should be distinguishable somehow from the global attrs.
Perhaps both of these should be dicts (JSON objects) and when we're looking at one field, the two dicts are overlaid, with the per-field keys taking precedence over the global keys? That sounds natural and would minimize the use of names, but it sounds like just the sort of thing that would break something in the future.
What does xarray do?
>>> import xarray as xr
>>> dataarray = xr.DataArray([1, 2, 3], attrs={"wow": "look!"})
>>> dataset = xr.Dataset({"x": dataarray}, attrs={"wow": "wee!", "another": "thing"})
>>> dataset.attrs
{'wow': 'wee!', 'another': 'thing'}
>>> dataset.x.attrs
{'wow': 'look!'}
xarray never has this issue because DataArray and Dataset are different types and can never be confused. In Awkward, an ak.Array is an ak.Array is an ak.Array, regardless of whether it's an array of lists of records, an array of just the records, or an array of one of the fields of those records. The global attrs can propagate down when extracting the array of records from the array of lists of records, but when you get to the array of one of the fields of those records, there's a conflict. They should probably be named differently, since "per-field attributes" and "global attributes" are different concepts, but which one gets the exalted name of "attrs"? And what would the other one be called?
Description of new feature
@philippemiron is converting data from NetCDF4 files into Awkward Arrays, and one of the features we lack is a place to put attributes. These can be descriptions, units, meanings of flags, etc., and a single field can have more than one of these (e.g. so that "units" are programmatically accessible). In general, any JSON-encodable data.
This applies at two levels: globally for a whole array, in such a way that all derived arrays pass on the attributes, and per-record field. The per-record field attributes should only be passed on if the meaning of the field is not changed, and should not be counted as part of the Content node's type. Also, these should be read and written to files as metadata wherever possible.
Why not use parameter
__doc__?This would work as a per-record field attribute that is passed on whenever the meaning of the field is not changed, and doesn't count as part of the Content node's type. However, it has to be a string, since this is what goes into the Python
__doc__property (and therefore IPython and Jupyter help). The attributes have to be general JSON-encodable metadata.Also, this only encodes per-record field attributes, not global attributes.
Why not use a new parameter?
It would only be accessible through idioms like
array.layout.parameters["attrs"]. This is high-level data analyst information, and they shouldn't have to go throughlayout, which is mid-level, for library developers. Also, such an idiom would fail if the layout ever gets buried in another Content node (i.e. the user has to be aware of layouts and how they change, which is not a high-level view!). For example, rearranging records in a RecordArray nests the RecordArray within an IndexedArray for performance reasons; so after certain kinds of slices,array.layout.parameters["attrs"]becomesarray.layout.content.parameters["attrs"], probably unexpectedly.Why not use behavior?
It's an odd thing to do, but
behavioris passed from an ak.Array to any new ak.Array derived from it. Any keys inbehaviorthat aren't recognized are ignored. However, this would only encode global attributes, not per-record field attributes, andbehavior(which typically contains function objects and class objects) is not serialized when writing files, or filled when reading files.What instead?
This is really two new features:
__doc__, but doesn't have to be a string and is exposed at high-level in a different way (one that allows non-strings). Parameters are already serializable.behavior, but is serializable.They should have names like
attrs, following xarray's convention, but the per-record fieldattrsshould be distinguishable somehow from the globalattrs.Perhaps both of these should be dicts (JSON objects) and when we're looking at one field, the two dicts are overlaid, with the per-field keys taking precedence over the global keys? That sounds natural and would minimize the use of names, but it sounds like just the sort of thing that would break something in the future.
What does xarray do?
xarray never has this issue because DataArray and Dataset are different types and can never be confused. In Awkward, an ak.Array is an ak.Array is an ak.Array, regardless of whether it's an array of lists of records, an array of just the records, or an array of one of the fields of those records. The global attrs can propagate down when extracting the array of records from the array of lists of records, but when you get to the array of one of the fields of those records, there's a conflict. They should probably be named differently, since "per-field attributes" and "global attributes" are different concepts, but which one gets the exalted name of "
attrs"? And what would the other one be called?