Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #40      +/-   ##
==========================================
+ Coverage   79.56%   81.33%   +1.76%
==========================================
  Files         142      152      +10
  Lines       19544    20837    +1293
==========================================
+ Hits        15550    16947    +1397
+ Misses       3994     3890     -104
Data type sizes are now represented by `DataTypeSize` instead of usize
Also adds `ArraySize`.
Both are `Fixed` only for now. Need to support `Variable` throughout the codebase.
Change codec API in prep for variable sized data types
Enable `{Array,DataType}Size::Variable`
Implement `CowArrayBytes::validate()` and add `CodecError::InvalidVariableSizedArrayOffsets`
Use `CowArrayBytes::validate()`
impl `From` for `CowArrayBytes` for various types
Array `_element` methods now use `T: Element`
Add `vlen` codec metadata
Fix codecs bench
Implement an experimental vlen codec
Use `impl Into<ArrayBytesCow<'a>>` in array methods
Use `RawBytesCow` consistently
Remove various vlen todo's
Cleanup `ArrayBytes`
Use `ArrayError::InvalidElementValue` for invalid string encodings
Add `ArraySubset::contains()`
Add `FillValue::new_empty()`
Add remaining vlen support to array `store_` methods and improve vlen validation
Add remaining vlen support to array `retrieve_` methods
Partial decoding in the vlen filter
Fix async vlen errors
Sharding codec vlen support
Add vlen support to sharding partial decoder
vlen support for sharded_readable_ext
`offsets_u64_to_usize` handle 32-bit system
Minor FillValue doc update
Remove unused ArraySubset methods and add related convenience functions
Add cities test
Add `Arrow32` vlen encoding
Add support for Interleave32 (Zarr V2) vlen encoding
fmt
clippy
Set minimum version for num-complex
Fix `ArrayBytes` from `&[u8; N]` for rust < 1.77
Add `binary` data type
Vlen improve docs and test various encodings.
Fix `cities.csv` encoding.
`vlen` change encoding names
Validate `vlen` codec `length32` encoding against `zarr-python` v2
Don't store `zarrs` metadata in cities test output
Split `vlen` into `vlen` and `vlen_interleaved`
Vlen supports separate index/data encoding with full codec chains.
Fix typesize in vlen `index_codecs` metadata
Add support for `String` fill value metadata
Add `FillValueMetadata::Unsupported`
`ArrayMetadata` can be serialised and deserialised with an unsupported `fill_value`, but `Array` creation will fail.
vlen cleanup
Change vlen codec identifiers given they are experimental
Move duplicate `extract_decoded_regions` fn into `array_bytes`
+ other minor changes
Minor vlen_partial_decoder cleanup
Add support for `zarr-python` nonconformant `|O` V2 data type
Support conversion of Zarr V2 arrays with `vlen-*` codecs to V3
Update root docs for new vlen related codecs/data types
Cleanup `get_vlen_bytes_and_offsets`
Resolves #21.
This is a substantial change that adds support for variable length data types to zarrs.
There were some breaking changes necessary to support this:
- `ArrayBytes` which can represent fixed or variable length bytes, rather than just a slice-like
- `Element[Owned]` traits, with better validation
- `RawBytes` `ArrayBytes` and `RawBytes`
Data types
Codecs
vlen
{ "name": "vlen", "configuration": { "data_codecs": [{ "name": "bytes" }, { "name": "blosc", "configuration": { "cname": "zstd", "clevel": 5, "shuffle": "bitshuffle", "typesize": 1, "blocksize": 0 } }], "index_codecs": [{ "name": "bytes", "configuration": { "endian": "little" } }, { "name": "blosc", "configuration": { "cname": "zstd", "clevel": 5, "shuffle": "shuffle", "typesize": 4, "blocksize": 0 } }], "index_data_type": "uint32" } }
Based on zarr-developers/zeps#47 (comment).
Structure:
- `uint64` representing the size in bytes of the encoded index,
- `index_codecs`,
- `data_codecs`.
The encoded index size is necessary to support index compression and partial decoding. If this were not available, the index could not use a bytes-to-bytes compression codec. A bytes-to-bytes compression codec could follow `vlen`, but then the "data" would potentially run through a compression codec twice.
vlen_v2
{ "name": "vlen_v2" }
This matches the Zarr V2 style interleaved encoding, which is implemented by the numcodecs `vlen-utf8`, `vlen-bytes`, and `vlen-array` codecs. These are all essentially the same codec, with data type-dependent behaviour. It makes sense to standardise a single codec for Zarr V3 to support Zarr V2 `vlen-utf8/bytes/array` encoded data without reencoding chunks.
Encoding Efficiency (32-bit index)
Sum of chunk sizes (in bytes) on "city" column of zarr-developers/zarr-python#2036 (comment).
https://github.com/LDeakin/zarrs/blob/variable_length_data_types/tests/cities.rs.
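To make the two layouts discussed above concrete, here is a minimal Rust sketch. It is not the zarrs implementation, and the function names `encode_separate`, `decode_separate`, and `encode_interleaved` are hypothetical. For the `vlen` layout it assumes a little-endian `uint64` header, an index of n + 1 little-endian `uint32` offsets, and identity codec chains (no `bytes`/`blosc` step). For the interleaved layout it assumes the commonly described numcodecs `vlen-bytes` framing: a little-endian `uint32` item count, then a `uint32` length before each item.

```rust
use std::convert::TryFrom;

/// Separate index/data layout, following the "Structure" list:
/// [u64: encoded index size in bytes][index][data].
/// ASSUMPTIONS (not stated in the PR text): little-endian header, an index
/// of n + 1 little-endian u32 offsets, and no index/data codecs applied.
fn encode_separate(elements: &[&[u8]]) -> Vec<u8> {
    let mut offsets: Vec<u32> = vec![0];
    let mut data: Vec<u8> = Vec::new();
    for element in elements {
        data.extend_from_slice(element);
        // A real encoder would return an error instead of panicking here.
        offsets.push(u32::try_from(data.len()).expect("data exceeds a 32-bit index"));
    }
    let index: Vec<u8> = offsets.iter().flat_map(|o| o.to_le_bytes()).collect();

    let mut out = Vec::with_capacity(8 + index.len() + data.len());
    out.extend_from_slice(&(index.len() as u64).to_le_bytes());
    out.extend_from_slice(&index);
    out.extend_from_slice(&data);
    out
}

/// Inverse of `encode_separate`. Returns `None` on truncated input or
/// inconsistent offsets, under the same assumptions as the encoder.
fn decode_separate(bytes: &[u8]) -> Option<Vec<Vec<u8>>> {
    let mut header = [0u8; 8];
    header.copy_from_slice(bytes.get(..8)?);
    let index_len = usize::try_from(u64::from_le_bytes(header)).ok()?;
    let index_end = index_len.checked_add(8)?;
    // Knowing the index size up front is what lets the index be located
    // (and, in a real codec, decompressed) independently of the data.
    let index = bytes.get(8..index_end)?;
    let data = bytes.get(index_end..)?;
    if index.len() % 4 != 0 {
        return None;
    }
    let offsets: Vec<usize> = index
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]) as usize)
        .collect();
    // Offsets must start at 0 and end at the data length.
    if offsets.first() != Some(&0) || offsets.last() != Some(&data.len()) {
        return None;
    }
    // `get` returns None for a decreasing pair, rejecting unordered offsets.
    offsets
        .windows(2)
        .map(|w| data.get(w[0]..w[1]).map(|s| s.to_vec()))
        .collect()
}

/// Interleaved (Zarr V2 style) layout. ASSUMPTION: [u32 item count], then
/// per item [u32 length][bytes], all little-endian.
fn encode_interleaved(elements: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(elements.len() as u32).to_le_bytes());
    for element in elements {
        out.extend_from_slice(&(element.len() as u32).to_le_bytes());
        out.extend_from_slice(element);
    }
    out
}
```

With the three elements `b"Paris"`, `b"Rome"`, and `b""`, the sketch produces 33 bytes for the separate layout and 25 bytes for the interleaved one before any compression. These toy sizes say nothing about the compressed chunk sizes measured in the cities test, where the index and data of the separate layout each pass through their own codec chain.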