Variable length data types by LDeakin · Pull Request #40 · zarrs/zarrs

LDeakin · 2024-07-17T12:12:16Z

Resolves #21.

This is a substantial change that adds support for variable length data types to zarrs.
There were some breaking changes necessary to support this:

Array store/retrieve "bytes" methods now take/return ArrayBytes, which can represent fixed or variable length bytes, rather than just a slice-like type
Array store/retrieve "elements" variants use new Element[Owned] traits, with better validation
Encoded bytes are aliased to RawBytes
Codec traits have had some changes to accommodate the distinction between ArrayBytes and RawBytes
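
To make the fixed/variable distinction concrete, here is a minimal sketch. `BytesSketch`, `from_strs` and `validate` are hypothetical names made up for illustration and are not the zarrs definitions; the sketch only assumes that variable length bytes are stored as concatenated element bytes plus `n + 1` offsets.

```rust
/// Hypothetical stand-in for the idea behind `ArrayBytes` (NOT the zarrs type).
#[derive(Debug, PartialEq)]
enum BytesSketch {
    /// Fixed length elements: the element size comes from the data type.
    Fixed(Vec<u8>),
    /// Variable length elements: concatenated bytes plus `n + 1` offsets.
    Variable { bytes: Vec<u8>, offsets: Vec<usize> },
}

impl BytesSketch {
    /// Build the variable length form from a list of strings.
    fn from_strs(elements: &[&str]) -> Self {
        let mut bytes = Vec::new();
        let mut offsets = vec![0];
        for s in elements {
            bytes.extend_from_slice(s.as_bytes());
            offsets.push(bytes.len());
        }
        BytesSketch::Variable { bytes, offsets }
    }

    /// Check that the bytes are consistent with `num_elements`.
    /// `element_size` is only consulted for the fixed form.
    fn validate(&self, num_elements: usize, element_size: usize) -> bool {
        match self {
            BytesSketch::Fixed(bytes) => bytes.len() == num_elements * element_size,
            BytesSketch::Variable { bytes, offsets } => {
                offsets.len() == num_elements + 1
                    && offsets.first() == Some(&0)
                    && offsets.windows(2).all(|w| w[0] <= w[1])
                    && offsets.last() == Some(&bytes.len())
            }
        }
    }
}
```

The offsets check (monotonic, ending at the byte length) is the kind of validation a slice alone cannot express, which is one reason a dedicated type is useful here.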

Data types

String (utf-8)
Binary

Codecs

`vlen`

{
  "name": "vlen",
  "configuration": {
    "data_codecs": [{"name": "bytes"},{"name": "blosc","configuration": {"cname": "zstd", "clevel":5,"shuffle": "bitshuffle", "typesize":1,"blocksize":0}}],
    "index_codecs": [{"name": "bytes","configuration": { "endian": "little" }},{"name": "blosc","configuration":{"cname": "zstd", "clevel":5,"shuffle": "shuffle", "typesize":4,"blocksize":0}}],
    "index_data_type": "uint32"
  }
}

Based on zarr-developers/zeps#47 (comment).

Structure:

a little-endian uint64 representing the size in bytes of the encoded index,
index data structured using the Apache Arrow variable-size binary layout with the validity bitmap elided (https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout) and encoded with index_codecs,
contiguous element bytes encoded with data_codecs.

The encoded index size is necessary to support index compression and partial decoding. Without it, the index could not use a bytes-to-bytes compression codec, because a decoder would not know where the encoded index ends and the encoded data begins. A bytes-to-bytes compression codec could instead follow vlen, but then "data" potentially runs through a compression codec twice.
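
The layout above can be sketched as follows. This is an illustration only, not the zarrs implementation: it assumes `index_data_type` is `uint32`, that both `index_codecs` and `data_codecs` leave the bytes unchanged (little-endian, no compression), and the function names are hypothetical.

```rust
use std::convert::{TryFrom, TryInto};

/// Encode as: [u64 LE index size][index: n + 1 u32 LE offsets][element bytes].
/// Assumes identity codec chains, so "encoded" index/data are the raw bytes.
fn encode_vlen_sketch(elements: &[&[u8]]) -> Vec<u8> {
    let mut index = Vec::with_capacity((elements.len() + 1) * 4);
    let mut data = Vec::new();
    index.extend_from_slice(&0u32.to_le_bytes()); // first offset is always 0
    for e in elements {
        data.extend_from_slice(e);
        let offset = u32::try_from(data.len()).expect("offset exceeds u32");
        index.extend_from_slice(&offset.to_le_bytes());
    }
    let mut out = Vec::with_capacity(8 + index.len() + data.len());
    out.extend_from_slice(&(index.len() as u64).to_le_bytes());
    out.extend_from_slice(&index);
    out.extend_from_slice(&data);
    out
}

/// Decode the layout above; returns `None` if the input is malformed.
fn decode_vlen_sketch(encoded: &[u8]) -> Option<Vec<Vec<u8>>> {
    let header: [u8; 8] = encoded.get(..8)?.try_into().ok()?;
    let index_len = usize::try_from(u64::from_le_bytes(header)).ok()?;
    let index_end = 8usize.checked_add(index_len)?;
    // The size header is what lets us split index from data.
    let index = encoded.get(8..index_end)?;
    let data = encoded.get(index_end..)?;
    if index.len() < 4 || index.len() % 4 != 0 {
        return None;
    }
    let offsets: Vec<usize> = index
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]) as usize)
        .collect();
    // Each adjacent pair of offsets delimits one element.
    offsets
        .windows(2)
        .map(|w| data.get(w[0]..w[1]).map(<[u8]>::to_vec))
        .collect()
}
```

With real codec chains, the index and data segments would each be passed through their own chain before being concatenated, and the header would record the size of the encoded (for example, compressed) index.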

`vlen_v2`

{
  "name": "vlen_v2"
}

This matches Zarr V2 style interleaved encoding, which is implemented by numcodecs vlen-utf8, vlen-bytes, and vlen-array. These are all essentially the same codec, with data type-dependent behaviour. It makes sense to standardise a single codec for Zarr V3 to support Zarr V2 vlen-utf8/bytes/array encoded data without reencoding chunks.
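
For comparison, a minimal sketch of an interleaved layout in the style of the numcodecs vlen codecs. It assumes a little-endian u32 element count followed by, for each element, a little-endian u32 length and then the element bytes; the function names are hypothetical and this is not the zarrs code.

```rust
use std::convert::{TryFrom, TryInto};

/// Encode as: [u32 LE count] then, per element, [u32 LE length][bytes].
fn encode_interleaved_sketch(elements: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new();
    let count = u32::try_from(elements.len()).expect("too many elements");
    out.extend_from_slice(&count.to_le_bytes());
    for e in elements {
        let len = u32::try_from(e.len()).expect("element too long");
        out.extend_from_slice(&len.to_le_bytes());
        out.extend_from_slice(e);
    }
    out
}

/// Read a little-endian u32 at `pos`, or `None` if out of bounds.
fn read_u32_le(bytes: &[u8], pos: usize) -> Option<u32> {
    let end = pos.checked_add(4)?;
    let raw: [u8; 4] = bytes.get(pos..end)?.try_into().ok()?;
    Some(u32::from_le_bytes(raw))
}

/// Decode the interleaved layout; returns `None` if the input is truncated.
fn decode_interleaved_sketch(encoded: &[u8]) -> Option<Vec<Vec<u8>>> {
    let count = read_u32_le(encoded, 0)? as usize;
    let mut pos = 4;
    let mut out = Vec::new();
    for _ in 0..count {
        let len = read_u32_le(encoded, pos)? as usize;
        pos += 4;
        let end = pos.checked_add(len)?;
        out.push(encoded.get(pos..end)?.to_vec());
        pos = end;
    }
    Some(out)
}
```

Because lengths and element bytes are interleaved, finding element `i` requires walking all preceding length prefixes, whereas a separate index of offsets allows direct lookup once the index is decoded.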

Encoding Efficiency (32-bit index)

Sum of chunk sizes (in bytes) on "city" column of zarr-developers/zarr-python#2036 (comment).

https://github.com/LDeakin/zarrs/blob/variable_length_data_types/tests/cities.rs.

encoding	compression	size (bytes)
vlen_v2	none	642196
vlen_v2	zstd 5	362626
vlen	none	642580
vlen	zstd 5	346950
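
The relative sizes can be worked out from the figures above with a small helper (the constant and function names are illustrative only):

```rust
// Sizes in bytes, copied from the table above.
const VLEN_V2: u64 = 642_196;
const VLEN_V2_ZSTD: u64 = 362_626;
const VLEN: u64 = 642_580;
const VLEN_ZSTD: u64 = 346_950;

/// `size` expressed as a percentage of `baseline`.
fn percent_of(size: u64, baseline: u64) -> f64 {
    100.0 * size as f64 / baseline as f64
}

/// Extra bytes `vlen` needs over `vlen_v2` without compression.
fn uncompressed_overhead() -> u64 {
    VLEN - VLEN_V2
}

/// Bytes saved by `vlen` over `vlen_v2` when both use zstd level 5.
fn compressed_saving() -> u64 {
    VLEN_V2_ZSTD - VLEN_ZSTD
}
```

Uncompressed, `vlen` is 384 bytes larger than `vlen_v2`. With zstd level 5, `vlen` is 15676 bytes smaller, which is roughly 4.3% less than the compressed `vlen_v2` output.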

codecov · 2024-07-17T12:24:30Z

Codecov Report

Attention: Patch coverage is 85.78135% with 424 lines in your changes missing coverage. Please review.

Project coverage is 81.33%. Comparing base (d54b89d) to head (71b9d78).

Files	Patch %	Lines
src/array/element.rs	65.41%	46 Missing ⚠️
...rray_to_bytes/sharding/sharding_partial_decoder.rs	88.28%	39 Missing ⚠️
src/array/array_sync_sharded_readable_ext.rs	63.95%	31 Missing ⚠️
...en_interleaved/vlen_interleaved_partial_decoder.rs	53.03%	31 Missing ⚠️
src/array/codec/array_to_bytes/vlen.rs	71.13%	28 Missing ⚠️
src/array/array_bytes.rs	93.63%	25 Missing ⚠️
...o_bytes/vlen_interleaved/vlen_interleaved_codec.rs	78.72%	20 Missing ⚠️
src/array/array_representation.rs	56.75%	16 Missing ⚠️
src/array/codec/array_to_bytes/vlen_interleaved.rs	68.00%	16 Missing ⚠️
src/array/array_async_readable_writable.rs	68.08%	15 Missing ⚠️
... and 27 more

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #40      +/-   ##
==========================================
+ Coverage   79.56%   81.33%   +1.76%     
==========================================
  Files         142      152      +10     
  Lines       19544    20837    +1293     
==========================================
+ Hits        15550    16947    +1397     
+ Misses       3994     3890     -104


Data type sizes are now represented by `DataTypeSize` instead of usize. Also adds `ArraySize`. Both are `Fixed` only for now. Need to support `Variable` throughout the codebase.
Change codec API in prep for variable sized data types
Enable `{Array,DataType}Size::Variable`
Implement `CowArrayBytes::validate()` and add `CodecError::InvalidVariableSizedArrayOffsets`
Use `CowArrayBytes::validate()`
impl `From` for `CowArrayBytes` for various types
Array `_element` methods now use `T: Element`
Add `vlen` codec metadata
Fix codecs bench
Implement an experimental vlen codec
Use `impl Into<ArrayBytesCow<'a>>` in array methods
Use `RawBytesCow` consistently
Remove various vlen todo's
Cleanup `ArrayBytes`
Use `ArrayError::InvalidElementValue` for invalid string encodings
Add `ArraySubset::contains()`
Add `FillValue::new_empty()`
Add remaining vlen support to array `store_` methods and improve vlen validation
Add remaining vlen support to array `retrieve_` methods
Partial decoding in the vlen filter
Fix async vlen errors
Sharding codec vlen support
Add vlen support to sharding partial decoder
vlen support for sharded_readable_ext
`offsets_u64_to_usize` handle 32-bit system
Minor FillValue doc update
Remove unused ArraySubset methods and add related convenience functions
Add cities test
Add `Arrow32` vlen encoding
Add support for Interleave32 (Zarr V2) vlen encoding
fmt
clippy
Set minimum version for num-complex
Fix `ArrayBytes` from `&[u8; N]` for rust < 1.77
Add `binary` data type
Vlen improve docs and test various encodings. Fix `cities.csv` encoding.
`vlen` change encoding names
Validate `vlen` codec `length32` encoding against `zarr-python` v2
Don't store `zarrs` metadata in cities test output
Split `vlen` into `vlen` and `vlen_interleaved`
Vlen supports separate index/data encoding with full codec chains.
Fix typesize in vlen `index_codecs` metadata
Add support for `String` fill value metadata
Add `FillValueMetadata::Unsupported`. `ArrayMetadata` can be serialised and deserialised with an unsupported `fill_value`, but `Array` creation will fail.
vlen cleanup
Change vlen codec identifiers given they are experimental
Move duplicate `extract_decoded_regions` fn into `array_bytes` + other minor changes
Minor vlen_partial_decoder cleanup
Add support for `zarr-python` nonconformant `|O` V2 data type
Support conversion of Zarr V2 arrays with `vlen-*` codecs to V3
Update root docs for new vlen related codecs/data types
Cleanup `get_vlen_bytes_and_offsets`

LDeakin force-pushed the variable_length_data_types branch 4 times, most recently from 79c61cc to ac7ebc7, July 18, 2024 01:28

LDeakin mentioned this pull request Jul 19, 2024

Draft ZEP 0007: Strings zarr-developers/zeps#47

Open

LDeakin force-pushed the variable_length_data_types branch 2 times, most recently from 5bc59ea to b3110ea, July 25, 2024 01:24

LDeakin force-pushed the variable_length_data_types branch from 77e9e4e to 0c114bf, July 25, 2024 01:35

LDeakin added 5 commits July 25, 2024 13:41

Fix store value truncation

495473f

Add ArraySize::new, fix ArrayBytes::new_fill_value, fix `FillValue::equals_all` with vlen data

222d149

Add array_write_read_string example

71b9d78

Rename vlen_interleaved to vlen_v2

36c6e80

Fmt pass

16b6c2f

LDeakin marked this pull request as ready for review July 25, 2024 04:28

LDeakin merged commit 649abe1 into main Jul 25, 2024

LDeakin deleted the variable_length_data_types branch July 25, 2024 04:29

LDeakin merged 6 commits into main from variable_length_data_types
