BEP032: NWB + NIX minimal content validation #369
Description
related to BEP032
Requested by @rwblair and @effigies
The major validators for NWB or Neo+NIX are written entirely or mostly in Python
A compromise for achieving basic validation of these file types in this web validator is to simply check minimal headers
This may not guarantee full schema compliance of all file contents, but that is out of scope for this integration
The header checks differ depending on the backend
Backend filetypes
- NWB can be HDF5 or Zarr
- NIX can be HDF5
HDF5
HDF5 can typically be identified by its first 8 magic bytes, which are followed by a superblock version byte (00 for version 0, 01 for version 1+): https://www.loc.gov/preservation/digital/formats/fdd/fdd000229.shtml
Signature (all versions):
Hex: 89 48 44 46 0d 0a 1a 0a
ASCII: \211 H D F \r \n \032 \n
HDF5 v0: signature followed by 00
HDF5 v1: signature followed by 01
Code suggestions
In Python we have used the following snippet, IDK if that is helpful to you:

```python
def is_hdf5(filename):
    with open(filename, 'rb') as f:
        file_signature = f.read(8)
    return file_signature == b'\x89HDF\r\n\x1a\n'
```

Claude suggests something along the lines of:
```javascript
function isHDF5(file) {
  return new Promise((resolve, reject) => {
    const reader = new FileReader();
    const blob = file.slice(0, 8); // Only read the first 8 bytes
    reader.onload = function (e) {
      const bytes = new Uint8Array(e.target.result);
      const HDF5_SIGNATURE = [0x89, 0x48, 0x44, 0x46, 0x0d, 0x0a, 0x1a, 0x0a];
      const match = HDF5_SIGNATURE.every((byte, i) => bytes[i] === byte);
      resolve(match);
    };
    reader.onerror = reject;
    reader.readAsArrayBuffer(blob);
  });
}
```

Credit to @h-mayorquin for the original discovery of this years ago (even prior to the age of high-quality AI)
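If the validator also wants to distinguish superblock v0 from v1+, the byte immediately after the 8-byte signature can be read in the same pass. A minimal Python sketch (the function name is illustrative, not from any existing validator):

```python
def hdf5_superblock_version(filename):
    """Return the HDF5 superblock version (0, 1, ...) or None if not HDF5."""
    with open(filename, "rb") as f:
        header = f.read(9)  # 8-byte signature + 1 superblock version byte
    if len(header) < 9 or header[:8] != b"\x89HDF\r\n\x1a\n":
        return None
    return header[8]
```

For the basic validation described here, a simple boolean check on the signature alone is probably sufficient; the version byte is only useful if the validator ever needs to branch on superblock layout.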
Testing asset (small): https://dandiarchive.s3.amazonaws.com/blobs/ca4/dfe/ca4dfea1-fdae-48b5-a798-cc9d453f307d
Testing asset (large): https://dandiarchive.s3.amazonaws.com/blobs/8d7/b49/8d7b49de-8a84-48b2-9e0e-81ccf2ec22b6
Zarr
Unfortunately, detection here can depend on the specific storage configuration (primarily whether the store is zipped, which version of Python performed the zipping, and whether there is consolidated metadata), and there doesn't seem to be a 'magic byte' per se
| Format | How to detect |
|---|---|
| Zarr directory (v2) | Contains a .zarray, .zgroup, or .zattrs file at the root |
| Zarr directory (v3) | Contains a zarr.json file at the root |
| Zarr in a ZIP (.zarr.zip) | A ZIP file containing the above metadata files at particular byte offsets |
So for the most part I'd just have a really basic attempt to see if it has the 'zarr-like' directory structure in terms of those structured 'hidden' files for v2 or the consolidated JSON for v3
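For the unzipped directory case, the table above boils down to a filename check at the store root. A minimal Python sketch (assuming the validator already has the list of root-level entry names; the function name is illustrative):

```python
def detect_zarr_version(root_entries):
    """Return 'v3', 'v2', or None given filenames at the candidate store root."""
    if "zarr.json" in root_entries:  # consolidated JSON metadata (v3)
        return "v3"
    if any(name in (".zarray", ".zgroup", ".zattrs") for name in root_entries):
        return "v2"  # hidden metadata files (v2)
    return None
```

The v3 check comes first since `zarr.json` is the more specific marker; a store matching neither pattern is simply not treated as Zarr.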
Code suggestions
Claude suggests something along the lines of:

```javascript
async function detectZarrVersion(file) {
  const header = new Uint8Array(await file.slice(0, 4).arrayBuffer());
  const isZip = [0x50, 0x4b, 0x03, 0x04].every((b, i) => header[i] === b);
  if (!isZip) return null;

  const JSZip = (await import('jszip')).default;
  const zip = await JSZip.loadAsync(file);
  const filenames = Object.keys(zip.files);

  // Check for zarr.json (v3) first; it's more specific
  if (filenames.some(f => f === 'zarr.json' || f.endsWith('/zarr.json'))) {
    // Optionally parse to confirm
    const zarrJson = filenames.find(f => f === 'zarr.json' || f.endsWith('/zarr.json'));
    const content = JSON.parse(await zip.files[zarrJson].async('string'));
    if (content.zarr_format === 3) return 'v3';
  }

  // Check for .zarray/.zgroup/.zattrs (v2)
  if (filenames.some(f => /\.(zarray|zgroup|zattrs)$/.test(f))) {
    return 'v2';
  }

  return null; // It's a ZIP but not Zarr
}
```

Testing asset (v2): https://dandiarchive.s3.amazonaws.com/zarr/d097af6b-8fd8-4d83-b649-fc6518e95d25/
Testing asset (v3): I do not know of any v3 Zarr NWB files; v3 has long been speculated to be the more optimized option, but I have not seen one in practice
I've also never tried a ZIP store (it doesn't quite fit the usual way we package NWB contents, and would only apply to small files anyway), but it might be possible as an option through HDMF-Zarr... let me know if you would like me to delve deeper into that
File formats
NWB
If a file is determined to be HDF5 as above, it can be identified as NWB-formatted as described by @bendichter in a follow-up comment
Similarly, if it is determined to have a Zarr backend, you can be fairly sure it is NWB rather than NIX, but feel free to take the extra step of determining the intended NWB version from the metadata JSON file, which @bendichter will describe in their follow-up comment
(I will try to come back to the top here and edit this with any details as the conversation continues)
NIX
If a file is determined to be HDF5 as above, it can be identified as having a Neo-compliant structure based on the description to be provided by @twachtler and @ree-gupta in a follow-up comment