BEP032: NWB + NIX minimal content validation #369
Description
related to BEP032
Requested by @rwblair and @effigies
The major validators for NWB or Neo+NIX are written entirely or mostly in Python
A compromise for achieving basic validation of these file types in this web validator is to simply check minimal headers
This may not guarantee full schema compliance of all file contents, but that is out of scope for this integration
The header checks differ depending on the backend
Backend filetypes
- NWB can be HDF5 or Zarr
- NIX can be HDF5
HDF5
HDF5 can typically be identified by its first 8 magic bytes, which are followed by a superblock version byte (00 for version 0, 01 for version 1+): https://www.loc.gov/preservation/digital/formats/fdd/fdd000229.shtml
Signature (all versions):
Hex: 89 48 44 46 0d 0a 1a 0a
ASCII: \211 H D F \r \n \032 \n
HDF5 v0: signature followed by 00
HDF5 v1: signature followed by 01
Code suggestions
In Python we have used the following snippet, IDK if that is helpful to you:

```python
def is_hdf5(filename):
    with open(filename, 'rb') as f:
        file_signature = f.read(8)
    return file_signature == b'\x89HDF\r\n\x1a\n'
```

Claude suggests something along the lines of:
```javascript
function isHDF5(file) {
  return new Promise((resolve, reject) => {
    const reader = new FileReader();
    const blob = file.slice(0, 8); // Only read the first 8 bytes
    reader.onload = function (e) {
      const bytes = new Uint8Array(e.target.result);
      const HDF5_SIGNATURE = [0x89, 0x48, 0x44, 0x46, 0x0d, 0x0a, 0x1a, 0x0a];
      const match = HDF5_SIGNATURE.every((byte, i) => bytes[i] === byte);
      resolve(match);
    };
    reader.onerror = reject;
    reader.readAsArrayBuffer(blob);
  });
}
```

Credit to @h-mayorquin for the original discovery of this years ago (even prior to the age of high-quality AI)
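If the validator also wants to distinguish superblock v0 from v1+, the byte immediately after the 8-byte signature can be read in the same pass. A minimal Python sketch (the function name is illustrative, not from any existing validator):

```python
def hdf5_superblock_version(filename):
    """Return the HDF5 superblock version (0, 1, ...) or None if not HDF5."""
    with open(filename, "rb") as f:
        header = f.read(9)  # 8-byte signature + 1 superblock version byte
    if len(header) < 9 or header[:8] != b"\x89HDF\r\n\x1a\n":
        return None
    return header[8]
```

For the basic validation described here, a simple boolean check on the signature alone is probably sufficient; the version byte is only useful if the validator ever needs to branch on superblock layout.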
Testing asset (small): https://dandiarchive.s3.amazonaws.com/blobs/ca4/dfe/ca4dfea1-fdae-48b5-a798-cc9d453f307d
Testing asset (large): https://dandiarchive.s3.amazonaws.com/blobs/8d7/b49/8d7b49de-8a84-48b2-9e0e-81ccf2ec22b6
Zarr
Unfortunately, detection here can depend on the specific storage configuration (primarily whether the store is zipped, which version of Python performed the zipping, and whether there is consolidated metadata), and there doesn't seem to be a 'magic byte' per se
| Format | How to detect |
|---|---|
| Zarr directory (v2) | Contains a .zarray, .zgroup, or .zattrs file at the root |
| Zarr directory (v3) | Contains a zarr.json file at the root |
| Zarr in a ZIP (.zarr.zip) | A ZIP file containing the above metadata files at particular byte offsets |
So for the most part I'd just have a really basic attempt to see if it has the 'zarr-like' directory structure in terms of those structured 'hidden' files for v2 or the consolidated JSON for v3
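For the unzipped directory case, the table above boils down to a filename check at the store root. A minimal Python sketch (assuming the validator already has the list of root-level entry names; the function name is illustrative):

```python
def detect_zarr_version(root_entries):
    """Return 'v3', 'v2', or None given filenames at the candidate store root."""
    if "zarr.json" in root_entries:  # consolidated JSON metadata (v3)
        return "v3"
    if any(name in (".zarray", ".zgroup", ".zattrs") for name in root_entries):
        return "v2"  # hidden metadata files (v2)
    return None
```

The v3 check comes first since `zarr.json` is the more specific marker; a store matching neither pattern is simply not treated as Zarr.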
Code suggestions
Claude suggests something along the lines of:

```javascript
async function detectZarrVersion(file) {
  const header = new Uint8Array(await file.slice(0, 4).arrayBuffer());
  const isZip = [0x50, 0x4b, 0x03, 0x04].every((b, i) => header[i] === b);
  if (!isZip) return null;

  const JSZip = (await import('jszip')).default;
  const zip = await JSZip.loadAsync(file);
  const filenames = Object.keys(zip.files);

  // Check for zarr.json (v3) first; it's more specific
  if (filenames.some(f => f === 'zarr.json' || f.endsWith('/zarr.json'))) {
    // Optionally parse to confirm
    const zarrJson = filenames.find(f => f === 'zarr.json' || f.endsWith('/zarr.json'));
    const content = JSON.parse(await zip.files[zarrJson].async('string'));
    if (content.zarr_format === 3) return 'v3';
  }

  // Check for .zarray/.zgroup/.zattrs (v2)
  if (filenames.some(f => /\.(zarray|zgroup|zattrs)$/.test(f))) {
    return 'v2';
  }

  return null; // It's a ZIP but not Zarr
}
```

Testing asset (v2): https://dandiarchive.s3.amazonaws.com/zarr/d097af6b-8fd8-4d83-b649-fc6518e95d25/
Testing asset (v3): I do not know of any v3 Zarr NWB files; v3 has long been speculated to be the more optimized option, but I have not seen one in practice
I've also never tried a ZIP store (it doesn't quite fit the usual way we package NWB contents, and would only apply to small files anyway), but it might be possible as an option through HDMF-Zarr... let me know if you would like me to delve deeper into that
File formats
NWB
If a file is determined to be HDF5 as above, it can be identified as NWB-formatted as described by @bendichter in a follow-up comment
Similarly, if it is determined to have a Zarr backend, you can be fairly sure it is NWB rather than NIX, but feel free to take the extra step of determining the intended NWB version from the metadata JSON file, which @bendichter will describe in their follow-up comment
(I will try to come back to the top here and edit this with any details as the conversation continues)
NIX
If a file is determined to be HDF5 as above, it can be identified as having a Neo-compliant structure based on the description to be provided by @twachtler and @ree-gupta in a follow-up comment