Specifying and validating an array of arrays of ...
Describe the bug
The data source I'm parsing represents dates as an array of arrays of integers, i.e.
"date": [
[ 2022, 8, 8 ]
]
I am looking for a way to represent and validate this using linkml, but without a list or array type, I'm not sure how to do it.
In JSON Schema, it would be represented as:
"date": {
  "type": "array",
  "items": {
    "type": "array",
    "items": {
      "type": "integer"
    }
  }
}
To Reproduce
Source data, list_of_lists_data.json
{
"thing": {
"list_of_lists": [
[ 2020, 10, 8 ]
]
}
}
linkml schema, list_of_lists.yaml
id: https://www.example/com/list_of_lists
name: list_of_lists
description: linkml spec containing a list of lists
prefixes:
  linkml: https://w3id.org/linkml/
imports:
  - linkml:types
default_range: string
slots:
  list_of_lists_of_ints:
    examples:
      - value: [[2020, 10, 8]]
    multivalued: true
classes:
  RootClass:
    tree_root: true
    attributes:
      thing:
        range: Thing
  Thing:
    slots:
      - list_of_lists_of_ints
As-is, the data file passes validation but there is no validation of the inner array of list_of_lists. If you add range: string to the definition of list_of_lists_of_ints, i.e.
slots:
  list_of_lists_of_ints:
    examples:
      - value: [[2020, 10, 8]]
    multivalued: true
    range: string
...it passes validation, even though the contents of the field should not be parsed as a string.
Creating custom types
Since the default validation doesn't work, I tried creating a specific type.
slots:
  list_of_lists_of_ints:
    examples:
      - value: [[2020, 10, 8]]
    multivalued: true
    range: ListOfIntsType
...
types:
  ListOfIntsType: ## representing [2020, 10, 8]
    uri: rdf:List
    base: list # `base`: python base type that implements this type definition
    multivalued: true
    range: integer
...but there's no way to describe the type further as multivalued and range are not valid fields.
OK, how about creating a custom class?
slots:
  list_of_lists_of_ints:
    examples:
      - value: [[2020, 10, 8]]
    multivalued: true
    range: ListOfIntsClass ## this is going to be representing [2020, 10, 8]
...
classes:
  ListOfIntsClass:
    attributes:
      '': # try to represent an anonymous array
        multivalued: true
        range: integer
Doesn't work (not surprisingly).
From the python perspective, it should be possible to validate an array of arrays of ints, but how does one specify this in linkML?
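For reference, the plain-Python check alluded to here is short. The following is only a sketch of that check outside LinkML; the function name `is_list_of_int_lists` is made up for illustration and is not part of any LinkML API.

```python
from typing import Any


def is_list_of_int_lists(value: Any) -> bool:
    """Return True if value is a list whose items are all lists of ints."""
    if not isinstance(value, list):
        return False
    for inner in value:
        if not isinstance(inner, list):
            return False
        for item in inner:
            # bool is a subclass of int in Python, so reject it explicitly
            if isinstance(item, bool) or not isinstance(item, int):
                return False
    return True


print(is_list_of_int_lists([[2020, 10, 8]]))    # True
print(is_list_of_int_lists([["2020", 10, 8]]))  # False
print(is_list_of_int_lists([2020, 10, 8]))      # False
```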
Great detective work and valiant attempts, but indeed this is not supported, largely by intention. LinkML tries to be unopinionated, but doing things such as modeling dates as lists does go against the grain.
Having said that, we should have some kind of support. I envision 3 alternate approaches:
- Multidimensional data approach
- Fixed column serializations
- Add an explicit list type
I'll attempt to explain each.
Multidimensional
For the multidimensional approach, it helps to think of a use case with multidimensional data. This is the kind of thing the HDMF format excels at. Let's say I have a dataset of temperatures at points on the globe at different time points.
The more LinkMLesque way to model this is something like:
Dataset:
  attributes:
    observations:
      range: Observation
      multivalued: true
Observation:
  attributes:
    lat:
      range: decimal
    long:
      range: decimal
    height_in_meters:
      range: float
      unit:
        ucum_code: m
    temp_in_kelvin:
      range: float
      unit:
        ucum_code: K
Assume all fields are required. Optionally, we can add a compound unique key of (lat, long, height).
You can also imagine adding a time dimension here.
There isn't any need for LoLs in our modeling. However, the default way of storing this as json/yaml may be inefficient, and data scientists may prefer to manipulate multidimensional arrays.
There are a few different ways of serializing, including a flat list with accompanying metadata on the dimensions:

But this could also be serialized as a LoLoL:
observations:
  _dimensions:
    lat: [100, 120, 140]
    long: [100, 120, 140]
    height:
      - 0
      - 20
      - 40
  _data:
    - - - 292
        - 293
        - 292
      - - 292
        - 290
        - 294
      - - 293
        - 293
        - 291
    - - - 296
        - 295
        - 293
      - - 296
        - 296
        - 296
      - - 296
        - 296
        - 296
    - - - 296
        - 298
        - 298
      - - 298
        - 298
        - 296
      - - 297
        - 297
        - 298
A specific example of this in our domain is biom format:
- https://github.com/biocore/biom-format/blob/master/examples/min_sparse_otu_table.biom
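To make the relationship between the nested `_data` layout and the flat `Observation` records concrete, here is a rough Python sketch that expands one into the other. It assumes the nesting order is lat, then long, then height (outermost to innermost), which the example does not state explicitly; the `expand` helper is hypothetical and not part of LinkML.

```python
from itertools import product

# Dimension labels and nested values copied from the example above.
dimensions = {
    "lat": [100, 120, 140],
    "long": [100, 120, 140],
    "height": [0, 20, 40],
}
data = [
    [[292, 293, 292], [292, 290, 294], [293, 293, 291]],
    [[296, 295, 293], [296, 296, 296], [296, 296, 296]],
    [[296, 298, 298], [298, 298, 296], [297, 297, 298]],
]


def expand(dimensions, data, value_key="temp_in_kelvin"):
    """Turn a nested array plus dimension labels into one record per cell.

    Assumption: nesting depth i of `data` corresponds to the i-th key
    of `dimensions`.
    """
    names = list(dimensions)
    sizes = [len(dimensions[name]) for name in names]
    records = []
    for index in product(*(range(size) for size in sizes)):
        cell = data
        for i in index:
            cell = cell[i]  # walk one nesting level per dimension
        record = {name: dimensions[name][i] for name, i in zip(names, index)}
        record[value_key] = cell
        records.append(record)
    return records


records = expand(dimensions, data)
print(len(records))  # 27
print(records[0])    # {'lat': 100, 'long': 100, 'height': 0, 'temp_in_kelvin': 292}
```

Going the other way (records to nested array) is the same walk in reverse, which is why the object model and the compact serialization can coexist.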
Object-as-list Serialization
Here the basic idea is that you give each slot a rank and add an annotation stating that the object should be serialized as a list rather than as key-value pairs.
The internal representation would still be an object, though.
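A minimal Python sketch of this idea, purely illustrative: the `rank` metadata key, the `DateParts` class, and the `to_list`/`from_list` helpers are all invented here and do not correspond to an existing LinkML annotation or generator.

```python
from dataclasses import dataclass, field, fields


@dataclass
class DateParts:
    # "rank" is a hypothetical annotation giving each slot its list position
    year: int = field(metadata={"rank": 1})
    month: int = field(metadata={"rank": 2})
    day: int = field(metadata={"rank": 3})


def ranked_fields(cls):
    """Return the dataclass fields ordered by their rank annotation."""
    return sorted(fields(cls), key=lambda f: f.metadata["rank"])


def to_list(obj):
    """Serialize the object as a list of slot values in rank order."""
    return [getattr(obj, f.name) for f in ranked_fields(type(obj))]


def from_list(cls, values):
    """Parse a list back into an object, matching positions to ranks."""
    names = [f.name for f in ranked_fields(cls)]
    if len(values) != len(names):
        raise ValueError(f"expected {len(names)} values, got {len(values)}")
    return cls(**dict(zip(names, values)))


d = from_list(DateParts, [2022, 8, 8])
print(d)           # DateParts(year=2022, month=8, day=8)
print(to_list(d))  # [2022, 8, 8]
```

The object keeps named slots internally, so validation of `year`, `month`, and `day` can stay ordinary slot-level validation; only the wire format changes.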
List types
We would have a builtin class called List (possibly other types too) that would have a single slot for members, with a range defaulting to Any, but this could be subclassed with different constraints on members. We would need to extend existing parsers and serializers such that members is hidden and just the direct yaml/json structure is used. It's not clear what the best internal python implementation would be; we'd have to think about how this would work with dataclasses, pydantic, etc.
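One possible shape for this in plain Python, offered only as a sketch: the class names `ListOf`, `ListOfInts`, and `ListOfListsOfInts`, and the `to_plain`/`from_plain` methods, are hypothetical stand-ins for the builtin List class and the parser/serializer changes described above.

```python
from typing import ClassVar


class ListOf:
    """Hypothetical builtin list class with a single `members` slot."""

    member_type: ClassVar[type] = object  # plays the role of range Any

    def __init__(self, members):
        members = list(members)
        for member in members:
            if not isinstance(member, self.member_type):
                raise TypeError(
                    f"{member!r} is not a {self.member_type.__name__}"
                )
        self.members = members

    def to_plain(self):
        """Serialize without exposing `members`: emit the bare list."""
        return [m.to_plain() if isinstance(m, ListOf) else m for m in self.members]

    @classmethod
    def from_plain(cls, raw):
        """Parse a bare list, recursing when members are themselves lists."""
        if not isinstance(raw, list):
            raise TypeError(f"expected a list, got {type(raw).__name__}")
        if issubclass(cls.member_type, ListOf):
            return cls(cls.member_type.from_plain(item) for item in raw)
        return cls(raw)


class ListOfInts(ListOf):
    member_type = int  # subclass constrains the members


class ListOfListsOfInts(ListOf):
    member_type = ListOfInts


dates = ListOfListsOfInts.from_plain([[2020, 10, 8]])
print(dates.to_plain())  # [[2020, 10, 8]]
```

Whether such a class would map cleanly onto dataclasses or pydantic models is exactly the open question raised above; this sketch sidesteps it by using an ordinary class.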
Which to choose?
The multidimensional data approach is part of a less specified longer term roadmap for LinkML, in collaboration with our LBNL colleagues who develop HDMF.
I suspect the other two would fit your use case better. I think the object-as-list serialization is likely easiest.
For the short term your best bet is an initial transform of the LoL data into a LoObjects, which may not be wholly satisfactory...
I would imagine that suggestions that the data provider change their date representations to something more sensible would get zero traction, unfortunately. It's a little frustrating as they provide YYYY-MM-DD and epoch timestamps elsewhere.
Due to other quirks with the data source, JSONschema, and linkML, it looks like I will have to do some transformations on the source data before it can be loaded, so converting the LoLs into LoOs (😆 -> 🚽) would be the most pragmatic approach.
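A pre-load transform along these lines could be sketched as follows. The key names `year`, `month`, and `day` and the helper `lol_to_loo` are assumptions for illustration; the source data does not name the three positions.

```python
import json
from datetime import date


def lol_to_loo(rows):
    """Convert [[y, m, d], ...] into [{"year": y, "month": m, "day": d}, ...]."""
    objects = []
    for row in rows:
        if len(row) != 3:
            raise ValueError(f"expected [year, month, day], got {row!r}")
        obj = dict(zip(("year", "month", "day"), row))
        date(**obj)  # raises ValueError for impossible dates such as month 13
        objects.append(obj)
    return objects


source = json.loads('{"thing": {"list_of_lists": [[2020, 10, 8]]}}')
source["thing"]["list_of_lists"] = lol_to_loo(source["thing"]["list_of_lists"])
print(json.dumps(source))
# {"thing": {"list_of_lists": [{"year": 2020, "month": 10, "day": 8}]}}
```

After this step the slot can be given a class range with three integer attributes, which LinkML can already validate.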
I hadn't looked at JSONschema for a while, but it seems the newer drafts (post v7) have much more powerful ways of specifying data structures involving arrays. Worth bearing in mind if/when linkml is extended to cover nested arrays?
You win the internet for that emoji chain!
Good call with the JSON Schema link. In fact, I think conceiving of Object-as-list Serialization simply as tuples is the most elegant solution to your date representation and other analogous issues, and it would work well with an internal tuple representation in most languages, e.g.
>>> import yaml
>>> print(yaml.safe_dump((2022,8,9)))
- 2022
- 8
- 9
@ialarmedalien, we finally have first-class array support in the metamodel.
We are now opening more targeted issues, e.g.:
- #1888
- #1889