
BEP032: NWB + NIX minimal content validation #369

@CodyCBakerPhD

Description


related to BEP032

Requested by @rwblair and @effigies

The major validators for NWB or Neo+NIX are written entirely or mostly in Python

A compromise for achieving basic validation of these file types in this web validator is to simply check minimal headers

This may not guarantee entire schematic compliance of all file contents, but that is left as out of scope for this integration

Which headers to check depends on the storage backend

Backend filetypes

  • NWB can be HDF5 or Zarr
  • NIX can be HDF5

HDF5

HDF5 can typically be identified by the presence of its first 8 magic bytes; the 9th byte encodes the superblock version (0 or 1+): https://www.loc.gov/preservation/digital/formats/fdd/fdd000229.shtml

HDF5 superblock v0:

    Hex: 89 48 44 46 0d 0a 1a 0a 00
    ASCII: \211 H D F \r \n \032 \n \000

HDF5 superblock v1:

    Hex: 89 48 44 46 0d 0a 1a 0a 01
    ASCII: \211 H D F \r \n \032 \n \001

(The first 8 bytes are the signature shared by all versions; the 9th byte is the superblock version)
Code suggestions

In Python we have used the following snippet, IDK if that is helpful to you:

def is_hdf5(filename):
    # Compare the first 8 bytes against the HDF5 signature
    with open(filename, 'rb') as f:
        file_signature = f.read(8)
    return file_signature == b'\x89HDF\r\n\x1a\n'

Claude suggests something along the lines of:

function isHDF5(file) {
  return new Promise((resolve, reject) => {
    const reader = new FileReader();
    const blob = file.slice(0, 8); // Only read the first 8 bytes

    reader.onload = function (e) {
      const bytes = new Uint8Array(e.target.result);
      const HDF5_SIGNATURE = [0x89, 0x48, 0x44, 0x46, 0x0D, 0x0A, 0x1A, 0x0A];

      const match = HDF5_SIGNATURE.every((byte, i) => bytes[i] === byte);
      resolve(match);
    };

    reader.onerror = reject;
    reader.readAsArrayBuffer(blob);
  });
}
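
If the superblock version is useful downstream (e.g. to report v0 vs v1+), the same read can be extended to 9 bytes. A minimal sketch operating on raw bytes; the function name and return shape are my own, not part of any existing API:

```javascript
// The 8-byte signature is identical across HDF5 versions;
// byte 9 (index 8) holds the superblock version number.
const HDF5_SIGNATURE = [0x89, 0x48, 0x44, 0x46, 0x0d, 0x0a, 0x1a, 0x0a];

function classifyHDF5(bytes) {
  const isHDF5 =
    bytes.length >= 8 && HDF5_SIGNATURE.every((b, i) => bytes[i] === b);
  return {
    isHDF5,
    // Superblock version is only meaningful if the signature matched
    superblockVersion: isHDF5 && bytes.length > 8 ? bytes[8] : null,
  };
}
```

This could be fed from the same `file.slice(0, 9)` read as the Promise-based check above.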

Credit to @h-mayorquin for the original discovery of this years ago (even prior to the age of high-quality AI)

Testing asset (small): https://dandiarchive.s3.amazonaws.com/blobs/ca4/dfe/ca4dfea1-fdae-48b5-a798-cc9d453f307d
Testing asset (large): https://dandiarchive.s3.amazonaws.com/blobs/8d7/b49/8d7b49de-8a84-48b2-9e0e-81ccf2ec22b6

Zarr

Unfortunately, detection seems to depend on the specific storage configuration (primarily whether the store is zipped, which version of Python performed the zip, and whether there is consolidated metadata), and there doesn't seem to be a 'magic byte' per se

| Format | How to detect |
| --- | --- |
| Zarr directory (v2) | Contains a `.zarray`, `.zgroup`, or `.zattrs` file at the root |
| Zarr directory (v3) | Contains a `zarr.json` file at the root |
| Zarr in a ZIP (`.zarr.zip`) | A ZIP file containing the above metadata files at particular bytes |

So for the most part I'd just make a basic attempt to see if it has the 'zarr-like' directory structure: the structured 'hidden' files for v2, or the `zarr.json` for v3

Code suggestions

Claude suggests something along the lines of:

async function detectZarrVersion(file) {
  const header = new Uint8Array(await file.slice(0, 4).arrayBuffer());
  const isZip = [0x50, 0x4B, 0x03, 0x04].every((b, i) => header[i] === b);

  if (!isZip) return null;

  const JSZip = (await import('jszip')).default;
  const zip = await JSZip.loadAsync(file);
  const filenames = Object.keys(zip.files);

  // Check for zarr.json (v3) first — it's more specific
  if (filenames.some(f => f === 'zarr.json' || f.endsWith('/zarr.json'))) {
    // Optionally parse to confirm
    const zarrJson = filenames.find(f => f === 'zarr.json' || f.endsWith('/zarr.json'));
    const content = JSON.parse(await zip.files[zarrJson].async('string'));
    if (content.zarr_format === 3) return 'v3';
  }

  // Check for .zarray/.zgroup (v2)
  if (filenames.some(f => /\.(zarray|zgroup|zattrs)$/.test(f))) {
    return 'v2';
  }

  return null; // It's a ZIP but not Zarr
}
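
The snippet above only covers the ZIP case. For the unzipped directory case, here is a minimal sketch assuming the validator already has the relative paths of entries under the selected directory (how those paths are obtained is up to the validator's file-tree handling):

```javascript
// Given relative paths under the store root, look for the v3
// zarr.json or the v2 dot-files at the root, per the table above.
function detectZarrDirVersion(relativePaths) {
  if (relativePaths.includes('zarr.json')) return 'v3';
  if (['.zarray', '.zgroup', '.zattrs'].some(f => relativePaths.includes(f))) {
    return 'v2';
  }
  return null; // not recognizably Zarr at the root
}
```

Checking only the root (rather than any nested `.zarray`) keeps false positives down for directories that merely contain a Zarr store somewhere inside.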

Testing asset (v2): https://dandiarchive.s3.amazonaws.com/zarr/d097af6b-8fd8-4d83-b649-fc6518e95d25/

Testing asset (v3): I do not know of any v3 Zarr NWB files, though v3 has been speculated for some time to be the more optimized option

I've also never tried a zip store (it doesn't quite make sense with the usual way we package NWB contents, and would only apply to small files anyway), but it might be possible as an option through HDMF-Zarr... let me know if you would like me to delve deeper into that

File formats

NWB

If a file is determined to be HDF5 from above, it can be determined to be NWB-formatted as described by @bendichter in a follow-up comment

Similarly, if it is determined to have a Zarr backend, you can be fairly sure that it is NWB rather than NIX, but feel free to take the extra step of determining the intended NWB version from the metadata JSON file, which @bendichter will describe in their follow-up comment
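
Pending that follow-up, one plausible sketch for the Zarr v2 case: parse the root `.zattrs` and look for an `nwb_version` attribute. Both the attribute name and its presence at the root are assumptions on my part here, not confirmed by this issue:

```javascript
// Hypothetical check: extract an NWB version string from the text of
// a root-level .zattrs document. The `nwb_version` key is an
// assumption pending the follow-up comment.
function nwbVersionFromZattrs(zattrsText) {
  try {
    const attrs = JSON.parse(zattrsText);
    return typeof attrs.nwb_version === 'string' ? attrs.nwb_version : null;
  } catch (err) {
    return null; // missing or malformed JSON => not identifiable as NWB
  }
}
```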

(I will try to come back to the top here and edit this with any details as the conversation continues)

NIX

If a file is determined to be HDF5 from above, it can be checked for Neo-compliant structure per the description to be provided by @twachtler and @ree-gupta in a follow-up comment
