This capability might be part of geowave, or might be rolled as a separate project.
The base need is to provide the ability to export a geowave dataset (or subset of a dataset) to a single file, and import that same file back into geowave.
The use case here is two-fold:
- Provide a mechanism to back up geowave data sets
- This means by default (i.e. no additional arguments) an export -> import cycle should result in a data set that's functionally and semantically equivalent to the original
- Provide a mechanism to snapshot data to simplify persistence format changes (between versions)
- This pushes a desire to make the serialization format somewhat independent from persistence formats where possible (it's a balance between duplicating code vs. duplicating dependencies). With full re-use (e.g. just exporting r-files) we lose the ability to make the serialization format independent of persistence changes.
- Export Functionality
- Takes a geowave namespace as an input
- Exports (to HDFS) a serialized form of the dataset as a single file which contains the following ("contains" here means it can be derived from the file)
- Feature type definition
- Original index definition
- Original namespace name
- Data values (including visibility)
- Stretch functionality (not required in initial version, might be moved to separate enhancement ticket)
- Ability to provide a CQL filter which subsets the data being exported
- Import Functionality
- Takes a serialized data file (single file) and imports it back into a geowave instance
- By default pulls namespace name, feature type definition, and index configuration from serialized file
- Allows user to optionally override namespace and index configuration
- Stretch functionality
- Allows user to override feature type
- Need to further specify functionality - how are feature type mappings expressed in this case?
- Merge capability (key off feature id?)
- Serialization Details
- File size is relevant
- Avro is what we have leveraged in other places for this, so we should use it here unless there's a strong reason not to.
- Consider creating a feature collection concept so the feature names don't have to be duplicated in every feature instance.
- Some structure required - we want to roll index definitions, feature definitions, etc. into a single file with all the features (don't want to deal with multiple files).
- One of the metadata fields should be a hash of the data collection
- Stretch goal: optional parity/ecc support
- Data adapters: SimpleFeature support is what's immediately required, but some thought/design is needed on handling other data adapters. Might have a pluggable serialization capability that keys off the data adapter class? (If we re-use the data adapter directly that might prevent us from using this capability to mitigate persistence changes)
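
To make the serialization ideas above a bit more concrete, here is a rough sketch of the "feature collection" idea with a data hash in the metadata. It is only an illustration: it uses Python and JSON from the standard library as a stand-in for Avro, and every name in it (`FeatureCollection`, `export_collection`, `import_collection`, the `data_sha256` field) is hypothetical and not part of geowave. Attribute names are stored once per collection, the metadata and features live in one blob, and import optionally overrides the namespace and index configuration.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class FeatureCollection:
    """Hypothetical container: metadata stored once, followed by feature rows."""
    namespace: str
    index_definition: str
    # Attribute names and types are stored once for the whole collection,
    # not repeated in every feature instance.
    attribute_names: list
    attribute_types: list
    # Each row is [feature_id, visibility, [values in attribute_names order]].
    rows: list = field(default_factory=list)

    def add(self, feature_id, visibility, values):
        if len(values) != len(self.attribute_names):
            raise ValueError("value count does not match the feature type")
        self.rows.append([feature_id, visibility, list(values)])


def _data_hash(rows):
    # Canonical JSON encoding so identical rows always give the same digest.
    payload = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def export_collection(collection):
    """Serialize metadata, data hash, and rows into a single blob."""
    doc = {
        "metadata": {
            "namespace": collection.namespace,
            "index_definition": collection.index_definition,
            "attribute_names": collection.attribute_names,
            "attribute_types": collection.attribute_types,
            "data_sha256": _data_hash(collection.rows),
        },
        "rows": collection.rows,
    }
    return json.dumps(doc).encode("utf-8")


def import_collection(blob, namespace=None, index_definition=None):
    """Rebuild a collection; namespace and index may be overridden by the caller."""
    doc = json.loads(blob.decode("utf-8"))
    meta = doc["metadata"]
    if _data_hash(doc["rows"]) != meta["data_sha256"]:
        raise ValueError("data hash mismatch: file is corrupt or was modified")
    return FeatureCollection(
        namespace=namespace or meta["namespace"],
        index_definition=index_definition or meta["index_definition"],
        attribute_names=meta["attribute_names"],
        attribute_types=meta["attribute_types"],
        rows=doc["rows"],
    )
```

With no overrides, `import_collection(export_collection(c))` should compare equal to `c`, which is the default backup behaviour described above. A real implementation would presumably express the same layout as an Avro schema and would need an encoding for geometry and non-SimpleFeature adapters, which this sketch does not attempt.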