Import/Export capability #353

@chrisbennight

Description

This capability might be part of geowave, or might be rolled as a separate project.

The base need is the ability to export a geowave dataset (or a subset of a dataset) to a single file, and to import that same file back into geowave.

The use case here is two-fold:

  1. Provide a mechanism to back up geowave data sets
    • This means that by default (i.e. with no additional arguments) an export -> import cycle should result in a data set that's functionally and semantically equivalent to the original
  2. Provide a mechanism to snapshot data to simplify persistence format changes (between versions)
    • This pushes a desire to make the serialization format somewhat independent of persistence formats where possible (it's a balance between duplicating code vs. duplicating dependencies). With full re-use (i.e. we could just export r-files) we lose the ability to make the serialization format independent of persistence changes.
  • Export Functionality
    • Takes a geowave namespace as an input
    • Exports (to HDFS) a serialized form of the dataset as a single file which contains the following ("contains" here means the item can be derived from the file)
      • Feature type definition
      • Original index definition
      • Original namespace name
      • Data values (including visibility)
    • Stretch functionality (not required in initial version, might be moved to separate enhancement ticket)
      • Ability to provide a CQL filter which subsets the data being exported
  • Import Functionality
    • Takes a serialized data file (single file) and imports it back into a geowave instance
    • By default pulls namespace name, feature type definition, and index configuration from serialized file
      • Allows user to optionally override namespace and index configuration
    • Stretch functionality
      • Allows user to override feature type
        • Need to further specify functionality - how are feature type mappings expressed in this case?
        • Merge capability (key off feature id?)
  • Serialization Details
    • File size is relevant
    • Avro is what we have leveraged in other places for this; we should use it here unless there's a strong reason not to.
    • Consider creating a feature collection concept so the feature names don't have to be duplicated in every feature instance.
    • Some structure required - we want to roll index definitions, feature definitions, etc. into a single file with all the features (don't want to deal with multiple files).
    • One of the metadata fields should be a hash of the data collection
      • Stretch goal: optional parity/ecc support
    • Data adapters: SimpleFeature support is what's immediately required, but some thought/design is needed on handling other data adapters. Might have a pluggable serialization capability that keys off the data adapter class? (If we re-use the data adapter directly, that might prevent us from using this capability to mitigate persistence changes)
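
To make the single-file layout above more concrete, here is a minimal sketch of one way it could look. It is not an Avro implementation and not geowave code: it uses newline-delimited JSON from the Python standard library as a stand-in for Avro, and every name in it (`export_dataset`, `import_dataset`, the header keys, the sample feature type) is hypothetical. It illustrates three of the points listed: a header that carries the namespace, feature type, and index definition once; attribute names stored once per collection rather than per feature; and a hash of the data stored as a metadata field and checked on import.

```python
import hashlib
import io
import json


def export_dataset(namespace, feature_type, index, features, out):
    """Write one header line followed by one line per feature (hypothetical layout)."""
    # Attribute names live only in the header (the "feature collection" idea);
    # each row then stores bare values in that same order.
    field_names = [name for name, _ in feature_type["attributes"]]
    digest = hashlib.sha256()
    rows = []
    for feature in features:
        row = json.dumps(
            {
                "id": feature["id"],
                "visibility": feature["visibility"],
                "values": [feature["values"][n] for n in field_names],
            },
            sort_keys=True,
        ).encode("utf-8") + b"\n"
        digest.update(row)
        rows.append(row)
    # Rows are buffered here so the hash can go in the header; a real exporter
    # would more likely stream the rows and write the hash in a trailer.
    header = {
        "namespace": namespace,
        "feature_type": feature_type,
        "index": index,
        "count": len(rows),
        "sha256": digest.hexdigest(),
    }
    out.write(json.dumps(header, sort_keys=True).encode("utf-8") + b"\n")
    for row in rows:
        out.write(row)


def import_dataset(inp, namespace=None, index=None):
    """Read the file back; namespace and index may optionally be overridden."""
    header = json.loads(inp.readline())
    field_names = [name for name, _ in header["feature_type"]["attributes"]]
    digest = hashlib.sha256()
    features = []
    for line in inp:
        digest.update(line)
        record = json.loads(line)
        features.append(
            {
                "id": record["id"],
                "visibility": record["visibility"],
                "values": dict(zip(field_names, record["values"])),
            }
        )
    if digest.hexdigest() != header["sha256"] or len(features) != header["count"]:
        raise ValueError("data does not match the hash/count stored in the header")
    return {
        "namespace": namespace or header["namespace"],
        "feature_type": header["feature_type"],
        "index": index or header["index"],
        "features": features,
    }


# Round trip through an in-memory buffer standing in for the file on HDFS.
feature_type = {"name": "poi", "attributes": [["geom", "Point"], ["name", "String"]]}
index = {"type": "spatial"}
features = [
    {"id": "f1", "visibility": "A&B", "values": {"geom": "POINT (1 2)", "name": "alpha"}},
    {"id": "f2", "visibility": "", "values": {"geom": "POINT (3 4)", "name": "beta"}},
]
buf = io.BytesIO()
export_dataset("ns1", feature_type, index, features, buf)
buf.seek(0)
restored = import_dataset(buf)
```

With no overrides, `restored` carries the original namespace, feature type, index definition, and data values (including visibility), which is the default export -> import equivalence described in the use case. Passing `namespace=` or `index=` to `import_dataset` corresponds to the optional override on import. In an Avro version the header fields would presumably map to file metadata or a wrapper record, but that mapping is left open here.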
