[Feature Proposal] Enhancement of Repository Plugin #6354

@ashking94

Description

The purpose of this issue is to gather community feedback on a proposal to enhance the repository plugin.

Today, the OpenSearch repository plugin provides stream-based transfer to a remote store. It lets users store index data in off-cluster external repositories such as Amazon S3, Google Cloud Storage, or a shared filesystem, in addition to the local disk of the OpenSearch cluster. With the repository plugin, OpenSearch users can use the Snapshot feature to back up and restore data, which protects against data loss, enables disaster recovery, and makes it possible to create replicas of data for testing and development. With remote-backed storage, users can additionally protect against data loss by automatically creating continuous backups of all index transactions and sending them to remote storage. This lets OpenSearch users achieve request-level durability.

Problem Statement

Today, the OpenSearch repository plugin exposes interfaces such as writeBlob, which transfer a file through a single InputStream. The file behind that InputStream must therefore be processed serially: the underlying plugin transfers the file's buffered content one buffer at a time, and reads and transfers the next buffer only after the previous one has been processed successfully. Parallel processing of multiple parts of a file is therefore not possible, so use cases such as downloading or uploading multiple parts of a file in parallel cannot be supported.

The S3 repository plugin, for instance, supports multi-part upload. However, because of the single-InputStream restriction in the base plugin, parts are uploaded serially, even though S3 supports uploading the individual parts of a file in parallel.
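To make the idea of independent parts concrete, here is a minimal sketch of how a file of known length could be split into byte ranges that could each be transferred on their own. `PartRange` and `PartPlanner` are hypothetical names used only for illustration; they are not part of OpenSearch or of any S3 SDK.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical value type: the byte range covered by one part of a file. */
final class PartRange {
    final long offset;
    final long length;

    PartRange(long offset, long length) {
        this.offset = offset;
        this.length = length;
    }
}

/** Hypothetical helper, for illustration only. */
final class PartPlanner {
    /** Splits [0, fileLength) into consecutive ranges of at most partSize bytes. */
    static List<PartRange> plan(long fileLength, long partSize) {
        if (fileLength < 0 || partSize <= 0) {
            throw new IllegalArgumentException("fileLength must be >= 0 and partSize must be > 0");
        }
        List<PartRange> parts = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += partSize) {
            // The last part may be shorter than partSize.
            parts.add(new PartRange(offset, Math.min(partSize, fileLength - offset)));
        }
        return parts;
    }
}
```

For example, `PartPlanner.plan(10, 4)` yields three ranges starting at offsets 0, 4 and 8, the last one 2 bytes long. With a single InputStream these ranges can only be consumed one after another; if each range had its own stream, they could be transferred concurrently.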

Proposed Solution

We propose to enhance the existing repository plugin with support for multiple stream suppliers, so that underlying vendor plugins can optionally provide multi-stream implementations for remote transfers. Providing multiple streams, rather than abstracting the transfer as a file-based operation, keeps control in core OpenSearch code: it can pre-process buffered content with multiple stream wrappers after the content is read and before the transfer takes place. Using stream suppliers instead of concrete streams additionally allows stream creation to be deferred until the remote transfer starts. We propose abstractions along the following lines to provide the required support:

  1. A check for whether blob upload is supported.
  2. A blob upload that takes an upload context, which can consist of an ordered collection of stream suppliers along with per-stream metadata such as length, headers, etc.
  3. A check for whether blob download is supported.
  4. A blob download that takes a download context, which can consist of an ordered collection of stream appliers to be applied on top of the SDK input stream before data is persisted to disk. Each applier can carry the metadata it needs.
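One possible shape for the four abstractions above is sketched below. This is only an illustration under stated assumptions, not the proposed API: `StreamPart`, `MultiStreamBlobContainer`, `InMemoryBlobContainer` and all method names are hypothetical, the contexts are reduced to plain lists, and an in-memory map stands in for a vendor plugin and its SDK.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;
import java.util.function.UnaryOperator;

/** Hypothetical: one part of a blob. The stream is opened only when the supplier is invoked. */
final class StreamPart {
    final Supplier<InputStream> supplier;
    final long length; // per-stream metadata; headers etc. could sit alongside

    StreamPart(Supplier<InputStream> supplier, long length) {
        this.supplier = supplier;
        this.length = length;
    }
}

/** Hypothetical opt-in contract that a vendor plugin could implement. */
interface MultiStreamBlobContainer {
    boolean isUploadSupported();
    void uploadBlob(String blobName, List<StreamPart> orderedParts) throws IOException;
    boolean isDownloadSupported();
    InputStream downloadBlob(String blobName, List<UnaryOperator<InputStream>> orderedAppliers)
        throws IOException;
}

/** Toy in-memory stand-in for a vendor plugin; not a real repository implementation. */
final class InMemoryBlobContainer implements MultiStreamBlobContainer {
    private final Map<String, byte[]> blobs = new ConcurrentHashMap<>();

    public boolean isUploadSupported() { return true; }
    public boolean isDownloadSupported() { return true; }

    public void uploadBlob(String blobName, List<StreamPart> orderedParts) throws IOException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            // Each part is read in its own task, so parts can be transferred concurrently.
            List<Future<byte[]>> pending = new ArrayList<>();
            for (StreamPart part : orderedParts) {
                pending.add(pool.submit(() -> readFully(part.supplier.get())));
            }
            // Reassemble in the supplied order, regardless of completion order.
            ByteArrayOutputStream assembled = new ByteArrayOutputStream();
            for (Future<byte[]> result : pending) {
                assembled.write(result.get());
            }
            blobs.put(blobName, assembled.toByteArray());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("upload interrupted", e);
        } catch (ExecutionException e) {
            throw new IOException("part upload failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public InputStream downloadBlob(String blobName, List<UnaryOperator<InputStream>> orderedAppliers)
        throws IOException {
        byte[] stored = blobs.get(blobName);
        if (stored == null) {
            throw new IOException("no such blob: " + blobName);
        }
        InputStream stream = new ByteArrayInputStream(stored); // stands in for the SDK input stream
        for (UnaryOperator<InputStream> applier : orderedAppliers) {
            stream = applier.apply(stream); // wrap in the supplied order
        }
        return stream;
    }

    /** Reads the stream to the end and closes it. */
    static byte[] readFully(InputStream in) throws IOException {
        try (InputStream stream = in) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int read;
            while ((read = stream.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            return out.toByteArray();
        }
    }
}
```

In this sketch a caller would first check `isUploadSupported()`, then pass one `StreamPart` per file range, for instance with a supplier such as `() -> new ByteArrayInputStream(data, offset, length)`. Because suppliers are passed instead of open streams, no stream exists until the transfer task actually runs. On the download side, an applier such as `in -> new BufferedInputStream(in)` shows where core code could wrap the SDK stream before the data is written to disk.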

Credits - @vikasvb90, @ashking94

Labels

discuss, enhancement, feature