
Zarr Dask Table #1599

@mrocklin

Description


We would like a persistence format for Dask.dataframes that can scale to multiple machines. Solutions like HDF5, BColz, and Castra are single-file-system only and have their own issues besides. Solutions like CSV scale out but are inefficient. Solutions like Parquet are currently poorly supported in Python and are complex even on a single machine.

Zarr is an interesting library that implements a sane subset of the HDF5 model (regularly chunked ndarrays, groups, metadata) on MutableMappings (memory, disk, s3/hdfs). I'm curious what a modern tabular format would look like built on top of Zarr and what the performance would be. This could be a stop-gap until Parquet support, or it could be a long-term competitor that also scales down nicely, or extends out to the nd-array and grouped cases.
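To make the MutableMapping point concrete, here is a toy stdlib sketch of the idea: a store is just a dict-like mapping from keys to bytes, with array metadata and compressed chunks living side by side. The key names and metadata fields below are illustrative assumptions, not the real Zarr spec.

```python
# Toy sketch: a zarr-style store is just a MutableMapping from keys to bytes.
# Key names ("0", "1", ".zarray") and metadata fields mimic the flavor of
# zarr's layout but are illustrative only -- see the zarr spec for the real format.
import json
import zlib

store = {}  # could equally be a mapping backed by disk files, S3, or HDFS

# a metadata entry describing a 1-D array split into two regular chunks
store[".zarray"] = json.dumps({"shape": [8], "chunks": [4]}).encode()

# each chunk is compressed and stored under its own key
data = list(range(8))
for i in range(0, 8, 4):
    chunk = bytes(data[i:i + 4])
    store[str(i // 4)] = zlib.compress(chunk)

# reading a chunk back is a plain key lookup plus decompression
restored = list(zlib.decompress(store["1"]))
assert restored == [4, 5, 6, 7]
```

Because the store is only a mapping, swapping memory for disk or S3 is a matter of swapping the dict for another MutableMapping implementation.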

Some things Zarr does well

  • Scales from memory, to single-file disk, to multi-file disk, to S3/HDFS, and to other backends
  • Tuned performance
  • Compression
  • Sane and simple design with a published spec, active and funded maintainer, and good test coverage
  • Extensible with metadata and groups

Some things we would need to figure out

  • How do we efficiently encode text data?
  • How do we efficiently encode categorical data?
  • How do we encode partition information?
  • How do we deal with the fact that partitions aren't regularly sized, or that we don't even know their sizes ahead of time?
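For the categorical question, one common answer elsewhere (a hedged sketch here, not a proposal for the Zarr spec) is dictionary encoding: store small integer codes alongside a lookup table of categories, each of which could live as its own array.

```python
# Sketch: dictionary-encode a categorical column as integer codes plus a
# category lookup table. The two pieces could then be stored as two arrays.
values = ["red", "blue", "red", "green", "blue", "red"]

categories = sorted(set(values))           # ["blue", "green", "red"]
index = {c: i for i, c in enumerate(categories)}
codes = [index[v] for v in values]         # compact integers (e.g. uint8)

# round-trip: codes + categories reconstruct the original column
decoded = [categories[c] for c in codes]
assert decoded == values
```

The codes compress well and are fixed-width, so only the (usually small) category table needs variable-length text handling.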

Motivation

I think that this could be useful for Dask, fulfilling a need that we have. I also think that it could be a fun experiment for Zarr, to see how it responds to a new use case with different constraints.

I don't think that this replaces efforts towards Parquet Python support, which remains a dominant storage format with many of the above questions already answered well.

Thoughts on partitions

I see two options here:

  1. We use one partition per zarr-array, arranging them into a group
  2. We push on Zarr to support unknown chunk-sized arrays that don't support slicing, but do support picking out particular chunks

If using Zarr with many single-chunk arrays organized into groups is not particularly slow, then I say we stick with that. Otherwise I'd be curious what providing a chunks=None option would look like for Zarr, and whether the added complexity is worth it.
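Option 1 above can be sketched with plain dicts: one single-chunk "array" per partition inside a group, with group-level metadata recording the (irregular) partition lengths so a reader can pick out particular partitions without global slicing. The layout and key names here are assumptions for illustration, not an actual dask/zarr format.

```python
# Sketch of option 1: one array per partition, arranged in a group.
import json

group = {}  # stands in for a zarr group over any MutableMapping

partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # irregular sizes are fine
for i, part in enumerate(partitions):
    # each partition is a single-chunk array; sizes need not match
    group[f"part-{i}"] = bytes(part)

# group-level metadata records the partition count and lengths, so readers
# can address individual partitions without knowing a global chunk grid
group[".zattrs"] = json.dumps(
    {"npartitions": len(partitions), "lengths": [len(p) for p in partitions]}
).encode()

meta = json.loads(group[".zattrs"])
assert meta["lengths"] == [3, 2, 4]
assert group["part-1"] == bytes([4, 5])
```

Unknown partition sizes fit naturally: the lengths list can simply be filled in (or left null) as partitions are written.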

Thoughts on Text

I've been using msgpack to encode lists of text lately, which seems to do a good job in terms of maturity and performance. I'm curious whether there is any appetite within Zarr to expand the spec to include a special text dtype. This is a clear deviation from the idea that "Zarr's model is just NumPy's model", but text is an important case that NumPy doesn't appear likely to handle well in the moderate future.
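The essence of what such a text codec must provide can be shown with a toy stdlib sketch: msgpack does this (and much more) efficiently, but at bottom a list of strings round-trips through a flat buffer by length-prefixing UTF-8 payloads. The helper names below are made up for illustration.

```python
# Toy sketch of a variable-length text codec: 4-byte little-endian length
# prefix followed by UTF-8 payload, concatenated into one flat buffer.
import struct

def encode_text(strings):
    out = bytearray()
    for s in strings:
        b = s.encode("utf-8")
        out += struct.pack("<I", len(b)) + b  # length, then payload
    return bytes(out)

def decode_text(buf):
    strings, pos = [], 0
    while pos < len(buf):
        (n,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        strings.append(buf[pos:pos + n].decode("utf-8"))
        pos += n
    return strings

column = ["alpha", "β", "gamma"]
assert decode_text(encode_text(column)) == column
```

The flat buffer is just bytes, so it could be stored and compressed like any other chunk; the open spec question is how a text dtype would be declared in array metadata.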

cc @alimanfoo @jcrist @martindurant @hussainsultan @shoyer
