Add _source-only snapshot repository#32844
This change adds a `_source` only snapshot repository that can wrap any existing repository as a _backend_ to snapshot only the `_source` part, including live docs markers. Snapshots taken with the `source` repository won't include any index structures. The snapshot will be reduced in size and functionality such that it requires in-place reindexing during restore. The restore process will copy the `_source` data locally and reindex all data during the recovery-from-snapshot phase. Users have 2 options for re-indexing:
* full reindex: the data will be reindexed with the original mapping
* minimal reindex: the data will be reindexed with a disabled mapping, which results in an index that can only be accessed via `_id`.

Both options allow using and updating the index, while the latter is mainly for scan/scroll purposes and re-indexing after the fact. This feature is aimed mainly at disaster recovery use-cases where snapshot size is a concern or where time to restore is less of an issue.
| @@ -0,0 +1,39 @@ | |||
| [[repository-src-only]] | |||
@clintongormley @debadair I'd love to get input on where this should be linked from and where it should be located. At this point it's stand-alone.
| import static org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.FIELDS_EXTENSION; | ||
| import static org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.FIELDS_INDEX_EXTENSION; | ||
|
|
||
| public final class SourceOnlySnapshot { |
this could be generally useful and moved to lucene land. I will do this after the fact.
| public static final Setting<Boolean> RESTORE_MINIMAL = Setting.boolSetting("restore_minimal", | ||
| false, Setting.Property.NodeScope); | ||
|
|
||
| public static final String SNAPSHOT_DIR_NAME = "_snapshot"; |
@bleskes we use tmp dirs for restore and snapshot. I wonder if that is ok or if there are any concerns. Our shard deletion mechanism should take care of cleaning up.
| BytesReference source = rootFieldsVisitor.source(); | ||
| if (source != null) { // nested fields don't have source. in this case we should be fine. | ||
| // TODO we should have a dedicated origin for this LOCAL_TRANSLOG_RECOVERY is misleading. | ||
| Engine.Result result = shard.applyTranslogOperation(new Translog.Index(uid.type(), uid.id(), |
|
I did some initial benchmarks using our
the snapshots are all taken to a local disk ie. no network involved here. I will follow up with restore times which I expect to be much better for full backups (
|
…ssary, skip translog and use append only optimization
|
here are some updated numbers:
|
| * Queries other than `match_all` will return no results. | ||
|
|
||
| * `_get` requests are not supported. | ||
| * Queries other than `match_all` and `_get` requests are not supported. |
this reads in two ways - you can also read this as if _get works (if you don't understand that _get is not what we see as a query)
|
|
||
| @Override | ||
| public Terms terms(String field) { | ||
| throw new UnsupportedOperationException("_source only indices can't be searched or filtered"); |
This change adds a `_source` only snapshot repository that can wrap any existing repository as a _backend_ to snapshot only the `_source` part, including live docs markers. Snapshots taken with the `source` repository won't include any indices, doc-values or points. The snapshot will be reduced in size and functionality such that it requires full re-indexing after it's successfully restored. The restore process will copy the `_source` data locally and start a special shard and engine to allow `match_all` scrolls and searches. Any other query or get call will fail with an unsupported operation exception. The restored index is also marked as read-only. This feature is aimed mainly at disaster recovery use-cases where snapshot size is a concern or where time to restore is less of an issue. **NOTE**: The snapshot produced by this repository is still a valid lucene index. This change doesn't allow for any longer retention policies, which is out of scope for this change.
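To make the restored behavior concrete, here is a minimal toy model of what the description promises: only the `_source` of live documents survives, `match_all` works, and every other read or write is rejected. This is a sketch under stated assumptions, not the code in this PR. The class `SourceOnlyIndexSketch` and all of its methods are hypothetical names invented for illustration, and an in-memory map stands in for the stored fields of a real Lucene index.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: none of these names are Elasticsearch or Lucene APIs.
final class SourceOnlyIndexSketch {
    private final Map<String, String> sourceById = new LinkedHashMap<>();

    private SourceOnlyIndexSketch() {}

    // "Snapshot + restore": keep the _source of live documents only.
    static SourceOnlyIndexSketch restoreFrom(Map<String, String> sources, Set<String> deletedIds) {
        SourceOnlyIndexSketch index = new SourceOnlyIndexSketch();
        for (Map.Entry<String, String> doc : sources.entrySet()) {
            if (!deletedIds.contains(doc.getKey())) {
                index.sourceById.put(doc.getKey(), doc.getValue());
            }
        }
        return index;
    }

    // match_all is the only supported read: it returns every stored _source.
    List<String> matchAll() {
        return new ArrayList<>(sourceById.values());
    }

    // Any other query needs index structures that were never snapshotted.
    List<String> termQuery(String field, String value) {
        throw new UnsupportedOperationException("_source only indices can't be searched or filtered");
    }

    // Get calls are rejected as well, per the description above.
    String get(String id) {
        throw new UnsupportedOperationException("_source only indices don't support get");
    }

    // The restored index is read-only until it is re-indexed into a new index.
    void index(String id, String source) {
        throw new UnsupportedOperationException("restored _source only index is read-only");
    }
}
```

A caller would scroll over `matchAll()` and feed each document into a regular index to regain full search functionality.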
|
@s1monw What's the meaning of minimal vs full reindex in this comment?
|
@jhalterman that was an early version of the change that I reverted. These numbers are meaningless now.
We can't rely on the leaf reader ordinal in a wrapped reader since it might not correspond to the ordinal in the SegmentInfos for its SegmentCommitInfo. Relates to elastic#32844 Closes elastic#33689
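The ordinal mismatch can be illustrated with a small sketch. It assumes, purely for illustration, that a wrapped reader exposes fewer leaves than the commit has segments, and that each leaf can be matched to its commit entry by segment name. Whether the actual fix resolves segments this way is not stated in this conversation. `SegmentLookupSketch` and `commitOrdinals` are hypothetical names, and plain strings stand in for Lucene's segment metadata.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: strings stand in for segment metadata; no Lucene APIs are used.
final class SegmentLookupSketch {
    private SegmentLookupSketch() {}

    // Resolve each leaf's position in the commit by segment name, not by leaf ordinal.
    static int[] commitOrdinals(List<String> commitSegmentNames, List<String> leafSegmentNames) {
        Map<String, Integer> ordinalByName = new HashMap<>();
        for (int i = 0; i < commitSegmentNames.size(); i++) {
            ordinalByName.put(commitSegmentNames.get(i), i);
        }
        int[] result = new int[leafSegmentNames.size()];
        for (int leaf = 0; leaf < leafSegmentNames.size(); leaf++) {
            Integer ordinal = ordinalByName.get(leafSegmentNames.get(leaf));
            if (ordinal == null) {
                throw new IllegalStateException("leaf segment not in commit: " + leafSegmentNames.get(leaf));
            }
            result[leaf] = ordinal;
        }
        return result;
    }
}
```

With a commit holding segments `_0`, `_1`, `_2` and a wrapped reader exposing only `_0` and `_2`, leaf ordinal 1 belongs to commit ordinal 2, so indexing the commit by leaf ordinal would pick the wrong segment.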