
Zarr Dask Table #1599

@mrocklin

Description


We would like a persistence format for Dask.dataframes that can scale to multiple machines. Solutions like HDF5, BColz, and Castra are single-file-system only and have their own issues besides. Solutions like CSV scale out but are inefficient. Solutions like Parquet are currently poorly supported in Python and are complex even on a single machine.

Zarr is an interesting library that implements a sane subset of the HDF5 model (regularly chunked ndarrays, groups, metadata) on MutableMappings (memory, disk, s3/hdfs). I'm curious what a modern tabular format would look like built on top of Zarr and what the performance would be. This could be a stop-gap until Parquet support, or it could be a long-term competitor that also scales down nicely, or extends out to the nd-array and grouped cases.
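To make the MutableMapping point concrete, here is a toy stdlib sketch of the idea: a store is just a dict-like mapping from keys to bytes, with array metadata and compressed chunks living side by side. The key names and metadata fields below are illustrative assumptions, not the real Zarr spec.

```python
# Toy sketch: a zarr-style store is just a MutableMapping from keys to bytes.
# Key names ("0", "1", ".zarray") and metadata fields mimic the flavor of
# zarr's layout but are illustrative only -- see the zarr spec for the real format.
import json
import zlib

store = {}  # could equally be a mapping backed by disk files, S3, or HDFS

# a metadata entry describing a 1-D array split into two regular chunks
store[".zarray"] = json.dumps({"shape": [8], "chunks": [4]}).encode()

# each chunk is compressed and stored under its own key
data = list(range(8))
for i in range(0, 8, 4):
    chunk = bytes(data[i:i + 4])
    store[str(i // 4)] = zlib.compress(chunk)

# reading a chunk back is a plain key lookup plus decompression
restored = list(zlib.decompress(store["1"]))
assert restored == [4, 5, 6, 7]
```

Because the store is only a mapping, swapping memory for disk or S3 is a matter of swapping the dict for another MutableMapping implementation.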

Some things Zarr does well

  • Scales from memory, to single-file disk, to multi-file disk, to S3/HDFS, and to other backends
  • Tuned performance
  • Compression
  • Sane and simple design with a published spec, active and funded maintainer, and good test coverage
  • Extensible with metadata and groups

Some things we would need to figure out

  • How do we efficiently encode text data?
  • How do we efficiently encode categorical data?
  • How do we encode partition information?
  • How do we deal with the fact that partitions aren't regularly sized, or that we don't even know their sizes ahead of time?
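For the categorical question, one common answer elsewhere (a hedged sketch here, not a proposal for the Zarr spec) is dictionary encoding: store small integer codes alongside a lookup table of categories, each of which could live as its own array.

```python
# Sketch: dictionary-encode a categorical column as integer codes plus a
# category lookup table. The two pieces could then be stored as two arrays.
values = ["red", "blue", "red", "green", "blue", "red"]

categories = sorted(set(values))           # ["blue", "green", "red"]
index = {c: i for i, c in enumerate(categories)}
codes = [index[v] for v in values]         # compact integers (e.g. uint8)

# round-trip: codes + categories reconstruct the original column
decoded = [categories[c] for c in codes]
assert decoded == values
```

The codes compress well and are fixed-width, so only the (usually small) category table needs variable-length text handling.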

Motivation

I think that this could be useful for Dask, fulfilling a need that we have. I also think that it could be a fun experiment for Zarr, to see how it responds to a new use case with different constraints.

I don't think that this replaces efforts towards Parquet Python support, which remains a dominant storage format with many of the above questions already answered well.

Thoughts on partitions

I see two options here:

  1. We use one partition per zarr-array, arranging them into a group
  2. We push on Zarr to support unknown chunk-sized arrays that don't support slicing, but do support picking out particular chunks

If using Zarr with many single-chunk arrays organized into groups is not particularly slow, then I say we stick with that. Otherwise I'd be curious what providing a chunks=None option would look like for Zarr, and whether the added complexity is worth it.
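Option 1 above can be sketched with plain dicts: one single-chunk "array" per partition inside a group, with group-level metadata recording the (irregular) partition lengths so a reader can pick out particular partitions without global slicing. The layout and key names here are assumptions for illustration, not an actual dask/zarr format.

```python
# Sketch of option 1: one array per partition, arranged in a group.
import json

group = {}  # stands in for a zarr group over any MutableMapping

partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # irregular sizes are fine
for i, part in enumerate(partitions):
    # each partition is a single-chunk array; sizes need not match
    group[f"part-{i}"] = bytes(part)

# group-level metadata records the partition count and lengths, so readers
# can address individual partitions without knowing a global chunk grid
group[".zattrs"] = json.dumps(
    {"npartitions": len(partitions), "lengths": [len(p) for p in partitions]}
).encode()

meta = json.loads(group[".zattrs"])
assert meta["lengths"] == [3, 2, 4]
assert group["part-1"] == bytes([4, 5])
```

Unknown partition sizes fit naturally: the lengths list can simply be filled in (or left null) as partitions are written.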

Thoughts on Text

I've been using msgpack to encode lists of text lately, which seems to do a good job in terms of maturity and performance. I'm curious whether there is any appetite within Zarr to expand the spec to include a special text dtype. This is a clear deviation from the idea that "Zarr's model is just NumPy's model", but text is an important case that NumPy doesn't appear likely to handle well in the moderate future.
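The essence of what such a text codec must provide can be shown with a toy stdlib sketch: msgpack does this (and much more) efficiently, but at bottom a list of strings round-trips through a flat buffer by length-prefixing UTF-8 payloads. The helper names below are made up for illustration.

```python
# Toy sketch of a variable-length text codec: 4-byte little-endian length
# prefix followed by UTF-8 payload, concatenated into one flat buffer.
import struct

def encode_text(strings):
    out = bytearray()
    for s in strings:
        b = s.encode("utf-8")
        out += struct.pack("<I", len(b)) + b  # length, then payload
    return bytes(out)

def decode_text(buf):
    strings, pos = [], 0
    while pos < len(buf):
        (n,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        strings.append(buf[pos:pos + n].decode("utf-8"))
        pos += n
    return strings

column = ["alpha", "β", "gamma"]
assert decode_text(encode_text(column)) == column
```

The flat buffer is just bytes, so it could be stored and compressed like any other chunk; the open spec question is how a text dtype would be declared in array metadata.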

cc @alimanfoo @jcrist @martindurant @hussainsultan @shoyer
