
Add option to process only metadata of objects in S3 scan mode #5470

Merged
dlvenable merged 6 commits into opensearch-project:main from kkondaka:s3scan-metadata-only
Mar 14, 2025

Conversation

@kkondaka
Collaborator

Description

Add option to process only metadata of objects in S3 scan mode

Issues Resolved

Resolves #5433

Check List

  • [x] New functionality includes testing.
  • New functionality has a documentation issue. Please link to it in this PR.
    • New functionality has javadoc added
  • [x] Commits are signed with a real name per the DCO

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Kondaka <krishkdk@amazon.com>
Member

@dlvenable left a comment

This can be an interesting feature. We can make improvements to make this more generic and move the S3 source in a better direction for similar use-cases.

Also, we should make some changes to other configurations when we are not loading data. For example, we don't need a codec and shouldn't even set it (it would be misleading). And we can also disable compression. There might be a few others as well.

@JsonProperty("end_time")
private LocalDateTime endTime;

@JsonProperty("metadata_only")
Member

I think we should make this more generic by use of an enum. There may be different configurations that users want:

  1. Object data only (what we have now)
  2. Object data and object metadata
  3. Metadata only

Also, this option should exist on the high level configuration for s3. It is applicable for both SQS and S3 scan.
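The enum the reviewer describes could look roughly like the following sketch. The class name matches the `S3DataSelection` type that appears later in this thread, but the constant names and option strings here are assumptions, not necessarily the merged code:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of the suggested enum; option strings are illustrative.
public enum S3DataSelection {
    DATA_ONLY("data_only"),
    DATA_AND_METADATA("data_and_metadata"),
    METADATA_ONLY("metadata_only");

    private static final Map<String, S3DataSelection> S3_DATA_SELECTION_MAP =
            Arrays.stream(values()).collect(Collectors.toMap(v -> v.optionValue, v -> v));

    private final String optionValue;

    S3DataSelection(final String optionValue) {
        this.optionValue = optionValue;
    }

    // In the real config class this would carry Jackson's @JsonCreator so
    // YAML values deserialize into the enum.
    public static S3DataSelection fromOptionValue(final String name) {
        // Exact-string lookup only; unknown or differently cased values return null.
        return S3_DATA_SELECTION_MAP.get(name);
    }
}
```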

Collaborator Author

We do not have "object data only". Today we already send object data plus partial metadata (the bucket and key), but not the complete metadata. So I am not sure how we want to handle that case.

Collaborator Author

Also, "metadata only" doesn't really make sense for S3-SQS when we have an SQS source. I guess we could still support it.

*
* @throws IOException exception is thrown every time because this is not supported
*/
default void processS3ObjectMetadata(final S3ObjectReference s3ObjectReference,
Member

Rather than adding another method and having more conditionals in all the code, we can simplify this by implementing a new S3ObjectHandler which handles metadata-only.

Even better, I think we could have a different abstraction within the S3ObjectWorker to handle this differently. This would help us share metric reporting.
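A dedicated metadata-only worker might look something like this sketch. The class and method names here are illustrative assumptions, not the merged API; the point is that the metadata path never downloads the object body and the data-reading worker gains no new conditionals:

```java
import java.util.Map;

// Illustrative sketch of a metadata-only handler, kept separate from the
// data-reading path as the reviewer suggests.
public class MetadataOnlyObjectWorker {

    // Build the event payload from fields an S3 HeadObject call would return,
    // without ever fetching the object body.
    public Map<String, Object> buildMetadataEvent(final String bucket, final String key,
                                                  final long contentLength) {
        return Map.of(
                "bucket", bucket,
                "key", key,
                "length", contentLength);
    }
}
```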

acknowledgementSet.add(event);
}
AtomicLong lastCheckpointTime = new AtomicLong(System.currentTimeMillis());
final AtomicInteger saveStateCounter = new AtomicInteger();
Member

This is never read.

LOG.warn("Failed to get metadata for S3 object: s3ObjectReference={}.", s3ObjectReference);
s3ObjectPluginMetrics.getS3ObjectNoRecordsFound().increment();
}
s3ObjectPluginMetrics.getS3ObjectSizeSummary().record(s3ObjectSize);
Member

I'm not sure it makes sense to include size metrics for metadata. We aren't processing the actual data.

final SourceCoordinator<S3SourceProgressState> sourceCoordinator,
final String partitionKey) throws IOException {
final S3InputFile inputFile = new S3InputFile(s3Client, s3ObjectReference, bucketOwnerProvider, s3ObjectPluginMetrics);
final String BUCKET = "bucket";
Member

These should all be private static fields.

Also, you can clarify the names with _KEY. e.g. BUCKET_KEY.

Collaborator Author

I thought about it, but I felt KEY_KEY would not read well. So I decided not to use the _KEY suffix.

} catch (final Exception e) {
LOG.error("Failed writing S3 objects to buffer.", e);
}
if (acknowledgementSet != null && sourceCoordinator != null && partitionKey != null &&
Member

This seems like a very leaky abstraction. Why does this code need to care about all these details? I see we have this in the current code, and it is producing duplicate code and unclear responsibilities. We can refactor this with a simpler Consumer of some sort to finish the record, and let the S3 scan take care of this.
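The Consumer-based refactor suggested here could be sketched as follows. The class and parameter names are hypothetical; the idea is that the worker stops inspecting the acknowledgement set, source coordinator, and partition key, and instead lets the caller finish each record:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch only: the worker hands each finished record to a callback, so
// scan-specific checkpointing stays inside the S3 scan code.
public class RecordCompletionSketch {

    public void processObject(final String objectKey, final Consumer<String> onRecordComplete) {
        // ... read or head the object and emit events here ...
        onRecordComplete.accept(objectKey); // the caller decides how to checkpoint
    }
}
```

With this shape, the SQS path and the scan path can each pass their own completion logic without the worker knowing the difference.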

Signed-off-by: Kondaka <krishkdk@amazon.com>
Comment on lines 44 to 48
private boolean metadata_only = false;

@JsonProperty("buckets")
@Valid
private List<S3ScanBucketOptions> buckets;
Contributor

@Zhangxunmt Mar 6, 2025

This "metadata_only" option is at the S3 scan level, so it applies to all the buckets under this scan. It doesn't support listing several S3 buckets with different metadata_only options under the same scan, as discussed earlier: the first bucket scans only for metadata, but the second bucket scans for content.

So is it possible to define "metadata_only" at the bucket level?

buckets:   # scans two buckets
  - bucket:
      metadata_only: true
      name: "ml-input-bucket"
      filter:
        include_prefix:
          - sagemaker/sagemaker_djl_batch_input
  - bucket:
      name: "ml-output-bucket"
      filter:
        include_prefix:
          - sagemaker/output/sagemaker_djl_batch_input

Collaborator Author

@Zhangxunmt Good point. I will look into this.

Collaborator Author

@dlvenable The current approach will not work for a bucket-level metadata_only option. I think I have to go back to the previous implementation (but without the code duplication). Let me know what you think.

Member

We can probably move the new configuration into the bucket and keep it on the scan for the time being. But I'd still have the code operate in such a way that it could also work with SQS.

Signed-off-by: Kondaka <krishkdk@amazon.com>
when(pluginMetrics.counter(S3_OBJECTS_DELETE_FAILED_METRIC_NAME)).thenReturn(s3DeleteFailedCounter);
S3ObjectDeleteWorker s3ObjectDeleteWorker = new S3ObjectDeleteWorker(s3Client, pluginMetrics);

//when(s3ScanScanOptions.getBuckets()).thenReturn(List.of(s3ScanBucketOptions));
Member

Extra commented-out line; please remove it.

buffer = mock(Buffer.class);
recordsReceived = 0;

//s3ScanBucketOptions = mock(S3ScanBucketOptions.class);
Member

Extra commented-out line; please remove it.

@JsonProperty("end_time")
private LocalDateTime endTime;

@JsonProperty("metadata_only")
Member

Is this actually getting used anywhere in the code?

Signed-off-by: Kondaka <krishkdk@amazon.com>

classpath = sourceSets.integrationTest.runtimeClasspath
systemProperty 'tests.s3source.bucket', System.getProperty('tests.s3source.bucket')
systemProperty 'tests.s3source.bucket2', System.getProperty('tests.s3source.bucket2')
Member

Why do we need another bucket? Let's use paths within the bucket to avoid multiple resources. In order to use another bucket, we'd need to create new resources in AWS for the testing account.


@JsonCreator
public static S3DataSelection fromOptionValue(final String name) {
return S3_DATA_SELECTION_MAP.get(name.toLowerCase());
Member

Don't force lowercase, since that silently allows variable casing. Just accept the expected strings only, e.g. data_only.

return S3_DATA_SELECTION_MAP.get(name);

graytaylor0 previously approved these changes Mar 13, 2025
Signed-off-by: Kondaka <krishkdk@amazon.com>
graytaylor0 previously approved these changes Mar 14, 2025
Signed-off-by: Kondaka <krishkdk@amazon.com>
@dlvenable dlvenable merged commit f32d12e into opensearch-project:main Mar 14, 2025
44 of 47 checks passed
chenqi0805 pushed a commit to chenqi0805/data-prepper that referenced this pull request Apr 2, 2025
…earch-project#5470)

Add option to process only metadata of objects in S3 scan mode

Signed-off-by: Kondaka <krishkdk@amazon.com>
chenqi0805 pushed a commit to chenqi0805/data-prepper that referenced this pull request Apr 2, 2025
…earch-project#5470)

Add option to process only metadata of objects in S3 scan mode

Signed-off-by: Kondaka <krishkdk@amazon.com>
Signed-off-by: George Chen <qchea@amazon.com>
amdhing pushed a commit to amdhing/data-prepper that referenced this pull request Apr 16, 2025
…earch-project#5470)

Add option to process only metadata of objects in S3 scan mode

Signed-off-by: Kondaka <krishkdk@amazon.com>
Davidding4718 pushed a commit to Davidding4718/data-prepper that referenced this pull request Apr 25, 2025
…earch-project#5470)

Add option to process only metadata of objects in S3 scan mode

Signed-off-by: Kondaka <krishkdk@amazon.com>
Davidding4718 pushed a commit to Davidding4718/data-prepper that referenced this pull request Apr 25, 2025
…earch-project#5470)

Add option to process only metadata of objects in S3 scan mode

Signed-off-by: Kondaka <krishkdk@amazon.com>
Mamol27 pushed a commit to Mamol27/data-prepper that referenced this pull request May 6, 2025
…earch-project#5470)

Add option to process only metadata of objects in S3 scan mode

Signed-off-by: Kondaka <krishkdk@amazon.com>
Signed-off-by: mamol27 <mamol27@yandex.ru>

Successfully merging this pull request may close these issues.

Support reading S3 object meta data only

4 participants