feat!(clp-package): Add support for providing multiple S3 object keys as input for compression.#1383
feat!(clp-package): Add support for providing multiple S3 object keys as input for compression.#1383LinZhihao-723 wants to merge 7 commits into
Conversation
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the ✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
| args_parser.add_argument( | ||
| "--s3-single-prefix", | ||
| action="store_true", | ||
| help=( | ||
| "Treat the S3 URL as a single prefix. If set, only a single S3 URL should be provided" | ||
| " and it must be explicitly given as a positional argument." | ||
| ), | ||
| ) |
There was a problem hiding this comment.
Please check whether this flag name makes sense @kirkrodrigues @junhaoliao
|
|
||
| class S3InputConfig(S3Config): | ||
| type: Literal[InputType.S3.value] = InputType.S3.value | ||
| keys: Optional[List[str]] = None |
There was a problem hiding this comment.
Please check whether passing keys in this way makes sense @kirkrodrigues @junhaoliao.
In eithere key ingestion or prefix ingestion, the key prefix will be required so S3Config base is untouched.
There was a problem hiding this comment.
(I haven't taken a very careful look at the implementation. sorry if i misunderstood anything.)
In what case do we expect the keys to be None and when not? I'm guessing that
1. Non-None and Non-empty: When the input config is created by an S3 scanner, or (to be implemented, ) when the user provides multiple s3 keys when submitting a compression job.
2. None: when the user only provides a single s3 prefix when submitting a compression job
Since this is a list where a length of 0 is allowed, do you think we can make it non-optional?
There was a problem hiding this comment.
You're right. But actually length of 0 is not allowed. With Pydantic, shall we make a validator?
Description
Before this PR, the CLP package compression only accepted a single S3 URL as the input. The S3 URL is decomposed into a bucket, a region code, and a key prefix. Everything under the key prefix will be compressed.
This PR adds support for providing multiple S3 object keys as input for compression. Now, users can specify a list of S3 URLs, where each URL is treated as a URL to an actual key in the bucket. Key-prefix-based ingestion is still supported, however, users must explicitly specify that the input is a prefix by using
--s3-single-prefixoption.The current implementation has the following requirements for the given input URLs:
To support multi-keys, we update
S3InputConfigto store an optionalkeysfield. If this field is not set, prefix-based ingestion will be used as before. Otherwise, we will traverse the bucket to collect object metadata of the given keys to create compression jobs.Checklist
breaking change.
Validation performed