Add Lambda Synchronous processor support #4700

dlvenable merged 7 commits into opensearch-project:main from
Conversation
| testImplementation project(':data-prepper-plugins:parse-json-processor')
| testImplementation 'org.powermock:powermock-module-junit4:2.0.9'
| testImplementation 'org.powermock:powermock-api-mockito2:2.0.9'
| testImplementation 'junit:junit:4.13.2'
You shouldn't need any of these four lines. They are provided by the root project.
| import software.amazon.awssdk.core.retry.RetryPolicy;
| import software.amazon.awssdk.services.lambda.LambdaClient;
|
| public final class LambdaClientFactory {
Did you explore the possibility of using one LambdaClientFactory class? I see one class with that name in the lambda sink directory.

Sure, I can merge the two and move it to common.
| if (mode != null && mode.equalsIgnoreCase(LambdaProcessorConfig.SYNCHRONOUS_MODE)) {
|     invocationType = SYNC_INVOCATION_TYPE;
| } else {
|     throw new RuntimeException("mode has to be synchronous or asynchronous");
Something like "Unsupported mode {}", mode is a better message here.
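A hedged sketch of what the suggested message could look like in context (the constant mirrors the diff above; the wrapper method and class name are illustrative, not the PR's actual code):

```java
// Illustrative sketch only: report the offending value instead of the fixed
// "mode has to be synchronous or asynchronous" message.
public final class InvocationModeSketch {
    public static final String SYNCHRONOUS_MODE = "RequestResponse";

    public static String resolveInvocationType(final String mode) {
        if (mode != null && mode.equalsIgnoreCase(SYNCHRONOUS_MODE)) {
            return SYNCHRONOUS_MODE;
        }
        // Including the actual value makes misconfigurations easy to spot in logs.
        throw new IllegalArgumentException("Unsupported mode " + mode);
    }
}
```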
@srikanthjg, the WhiteSource check is failing. I am ready to approve this.
| @JsonProperty("max_retries")
| private int maxConnectionRetries = DEFAULT_CONNECTION_RETRIES;
|
| @JsonProperty("mode")
The term "mode" is quite ambiguous. I think we can borrow the term "invocation_type" from AWS Lambda itself.
https://docs.aws.amazon.com/lambda/latest/api/API_Invoke.html#API_Invoke_RequestSyntax
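Borrowing the AWS term, a pipeline configuration might then read as follows (a hypothetical fragment; the function name is a placeholder, and the option name follows the later diffs in this PR):

```yaml
processor:
  - aws_lambda:
      function_name: my-function          # placeholder
      invocation_type: RequestResponse    # or Event, matching the Lambda Invoke API
```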
| throw new RuntimeException("Unsupported mode " + mode);
| }
|
| codec = new LambdaJsonCodec(batchKey);
This approach is very restrictive. It assumes that the body fits the format we ask. We should follow the same pattern used elsewhere in Data Prepper by allowing for a pluggable output codec.

LambdaJsonCodec is an implementation of OutputCodec (link). I needed this specifically for batch processing, when more than one event needs to be mapped to JSON. JsonOutputCodec handles only one event at a time. Maybe I can change the name to something more generic like BulkJsonOutputCodec or BatchJsonOutputCodec? It will be used for all dial-out processors. In the S3 sink we use BufferedCodec, but the only implementation is currently Parquet, and the way we want to implement this for Lambda is different.

All OutputCodecs are made for batches. Customers can use either JsonOutputCodec or NdjsonOutputCodec. The JSON codec already supports a configurable key name; this would replace the batch_key, which you don't need:

codec:
  json:
    key_name: myKey
So the customer can have two options:

- JSON:

  codec:
    json:
      key_name: myKey

  Yields:

  {
    "myKey" : [
      { ...event1... },
      { ...event2... },
      { ...event3... }
    ]
  }

- ND-JSON:

  codec:
    ndjson:

  Yields:

  { ...event1... }
  { ...event2... }
  { ...event3... }
Bulk will always need a key, as it will be considered one payload, so I guess ndjson cannot be used.

There is also a difference when it comes to handling a single event without a batch. In this case, I still want to convert the Data Prepper event to JSON, but I don't want a key; I want to pass the user's data as-is to Lambda as the payload. The current output codec forces me to have a key. To address that, I either need to add new behaviour to the JSON writeEvent method to convert the event directly to JSON, or write a new codec (which is what I did). The behaviour I want seems to be a combination of the two codecs: json behaviour for bulk and ndjson behaviour for a single event.
> Bulk will always need a key, as it will be considered one payload, so I guess ndjson cannot be used.

Yes, this makes sense. The payload needs to be JSON, and ND-JSON with multiple events becomes non-JSON.

> There is also a difference when it comes to handling a single event without a batch.

Actually, ndjson writing a single event gives you exactly what you want in this case. It is exactly the same output.

I also see that you are trying to support the concept of calling a Lambda for each event. Improving the configuration can help make this clearer. Right now there are multiple configurations which the user needs to set carefully to get the desired output. This is a simpler way to configure it:

- To have a single invocation per event, add a boolean flag. The user need not make any more decisions.

  aws_lambda:
    function_name: MyFunction
    invocation_per_event: true

- The default should be to batch, and this can keep the existing defaults. You can probably keep this configuration the same, though rename batch_key to key_name for consistency with the other APIs. Also, disallow setting an event size of 1, as this is not the goal of this approach.

  aws_lambda:
    function_name: MyFunction

Second, you can still use the existing codecs. When invocation_per_event is set to true, you can use the NdjsonOutputCodec internally. Otherwise, use the JsonOutputCodec and provide the batch key_name as the keyName in the codec configuration.
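To make the two payload shapes concrete, here is a minimal sketch (plain string assembly for illustration, not the real JsonOutputCodec/NdjsonOutputCodec implementations; the class and method names are assumptions) of what each codec produces for events that are already serialized as JSON objects:

```java
import java.util.List;

// Illustrative only: mimics the two output shapes discussed above.
public final class PayloadShapeSketch {
    // json codec shape: one JSON payload with all events under a key
    public static String jsonWithKey(final String keyName, final List<String> events) {
        return "{\"" + keyName + "\":[" + String.join(",", events) + "]}";
    }

    // ndjson codec shape: one JSON object per line; for a single event this
    // is exactly the raw event, which is what per-event invocation wants
    public static String ndjson(final List<String> events) {
        return String.join("\n", events);
    }
}
```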
Regarding the configuration, I already have invocation_type as a configuration; this allows setting per-event invocation or batch invocation (RequestResponse or Event).

If we implement this internally, I cannot use parse-json-processor as a plugin but will have to take a dependency on it. Is it OK for one processor to take a dependency on another? I wanted to avoid this, hence I went with implementing a custom codec. But I think this codec can also be used by other dial-out processors eventually; I can make it generic.
| public class LambdaProcessorConfig {
|
| public static final String SYNCHRONOUS_MODE = "RequestResponse";
We should stick with Data Prepper naming conventions: request_response.
| public class LambdaProcessorConfig {
|
| public static final String SYNCHRONOUS_MODE = "RequestResponse";
| public static final String ASYNCHRONOUS_MODE = "Event";
| public class LambdaProcessorConfig {
|
| public static final String REQUEST_RESPONSE = "RequestResponse";
Let's use request-response to match our existing naming conventions.
| public class LambdaProcessorConfig {
|
| public static final String REQUEST_RESPONSE = "RequestResponse";
| public static final String EVENT = "Event";
Let's use event to match our existing naming conventions.
| public class LambdaSinkConfig {
|
| public static final String REQUEST_RESPONSE = "RequestResponse";
Let's consolidate these constant values with the LambdaProcessorConfig so that they don't diverge.
| public static final String EVENT = "Event";
| public static final String BATCH_EVENT = "batch_event";
| public static final String SINGLE_EVENT = "single_event";
| public static final String REQUEST_RESPONSE = "request-response";
Perhaps make a CommonLambdaConfig class that has these constants. We should avoid duplicating these, or we may have future mismatches.
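A sketch of the suggested consolidation (a LambdaCommonConfig class is referenced in later diffs of this PR; the exact constant values shown here are assumptions based on the surrounding discussion):

```java
// Shared constants for the Lambda sink and processor so the two configs
// cannot drift apart. Values are illustrative, not confirmed by the PR.
public final class LambdaCommonConfig {
    public static final String REQUEST_RESPONSE = "request-response";
    public static final String EVENT = "event";
    public static final String BATCH_EVENT = "batch_event";
    public static final String SINGLE_EVENT = "single_event";

    private LambdaCommonConfig() {
        // constants only; no instances
    }
}
```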
dlvenable left a comment:

Thank you @srikanthjg for this contribution!
| maximum_size: 3mb
| ```
|
| `invocation_type` as RequestResponse will be used when the response from aws lambda comes back to dataprepper.
nit:

`invocation_type` as RequestResponse is used when Data Prepper needs to process the response from AWS Lambda.
`invocation_type` as Event is used when the response from AWS Lambda is sent to an S3 bucket.
| In batch options, an implicit batch threshold option is that if events size is 3mb, we flush it.
| `payload_model` this is used to define how the payload should be constructed from a dataprepper event.
| `payload_model` as batch_event is used when the output needs to be formed as a batch of multiple events,
Are there other values for payload_model?
| @ParameterizedTest
| @ValueSource(ints = {1,3})
| void verify_records_to_lambda_success(final int recordCount) throws Exception {
Consider adding a test for InvocationType Event.

Invocation type Event will be disabled for now; we will release the Event type with asynchronous support, which requires additional infra changes. I have disabled it in the verification for now and will fix the README.
| return LambdaClient.builder()
|     .region(lambdaSinkConfig.getAwsAuthenticationOptions().getAwsRegion())
|     .region(awsAuthenticationOptions.getAwsRegion())
Consider enabling SDK metrics to track the number of requests, timeouts, throttles, etc.
| codec = new NdjsonOutputCodec(ndjsonOutputCodecConfig);
| isBatchEnabled = false;
| } else {
|     throw new RuntimeException("invalid payload_model option");
Can this validation be part of LambdaProcessorConfig?
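One way to move the check, sketched under the assumption that the config class validates in its constructor (the real Data Prepper configs use Jackson binding and validation annotations, so this plain-Java version is only illustrative):

```java
import java.util.Set;

// Illustrative: reject an invalid payload_model when the config is built,
// so the processor constructor no longer needs the runtime check.
public class LambdaProcessorConfigSketch {
    private static final Set<String> VALID_PAYLOAD_MODELS = Set.of("batch_event", "single_event");

    private final String payloadModel;

    public LambdaProcessorConfigSketch(final String payloadModel) {
        if (!VALID_PAYLOAD_MODELS.contains(payloadModel)) {
            throw new IllegalArgumentException("invalid payload_model option: " + payloadModel);
        }
        this.payloadModel = payloadModel;
    }

    public String getPayloadModel() {
        return payloadModel;
    }
}
```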
| if (!lambdaProcessorConfig.getInvocationType().equals(LambdaCommonConfig.EVENT) &&
|     !lambdaProcessorConfig.getInvocationType().equals(LambdaCommonConfig.REQUEST_RESPONSE)) {
|     throw new RuntimeException("Unsupported invocation type " + lambdaProcessorConfig.getInvocationType());
|
| if (currentBuffer.getEventCount() == 0) {
|     codec.start(currentBuffer.getOutputStream(), event, codecContext);
| }
| codec.writeEvent(event, currentBuffer.getOutputStream());
Yes, this is running in the context of a processor.
| }
|
| void flushToLambdaIfNeeded(List<Record<Event>> resultRecords) throws InterruptedException, IOException {
|
| LOG.info("Flush to Lambda check: currentBuffer.size={}, currentBuffer.events={}, currentBuffer.duration={}", currentBuffer.getSize(), currentBuffer.getEventCount(), currentBuffer.getDuration());
Is there excessive logging in this method?

Sure, I will reduce them.
| }
| }
|
| LambdaResult retryFlushToLambda(Buffer currentBuffer, final AtomicReference<String> errorMsgObj) throws InterruptedException {
|
| return lambdaResult;
| }
|
| Event convertLambdaResponseToEvent(InvokeResponse lambdaResponse) {
|
| Map<String, String> invocationTypeMap = Map.of(
|     LambdaCommonConfig.EVENT, EVENT_LAMBDA
| );
|
| this.bufferFactory = new InMemoryBufferFactory();
| try {
|     currentBuffer = this.bufferFactory.getBuffer(lambdaClient, functionName, invocationType);
The term "buffer" is overloaded in Data Prepper. This looks like it is not just a buffer but is tightly coupled with Lambda. We should consider renaming this class and interface to be clearer.

I am handling it the same way we would in the sink; I can address the refactor in another PR.
| } catch (AwsServiceException | SdkClientException e) {
|     errorMsgObj.set(e.getMessage());
|     LOG.error("Exception occurred while uploading records to lambda. Retry countdown : {} | exception:", retryCount, e);
|     --retryCount;
Is this retry on top of the Lambda client's own retry? Any reason we need this?
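For context on the question: the loop above is an application-level retry wrapped around a client that may already retry internally, so the total attempt count multiplies. A generic sketch of that pattern (class and method names are illustrative, not the PR's actual code):

```java
import java.util.function.Supplier;

// Illustrative bounded-retry loop. If the underlying AWS SDK client is also
// configured with its own RetryPolicy, each iteration here can itself retry,
// multiplying the total number of requests made.
public final class RetrySketch {
    public static <T> T withRetries(final Supplier<T> call, int retryCount) {
        RuntimeException lastFailure = new IllegalStateException("retryCount must be positive");
        while (retryCount > 0) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                lastFailure = e;   // remember the failure, count down, try again
                --retryCount;
            }
        }
        throw lastFailure;
    }
}
```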
Signed-off-by: Srikanth Govindarajan <srigovs@amazon.com>
| private String invocationType = REQUEST_RESPONSE;
|
| @JsonProperty("payload_model")
| private String payloadModel = BATCH_EVENT;
We should make a Java enum for this. I'm OK with doing this in a follow-on PR.
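A hedged sketch of such an enum (the values mirror the constants in this PR; the fromValue lookup stands in for what would be a Jackson @JsonCreator in the real config):

```java
import java.util.Arrays;

// Illustrative enum for payload_model. In the actual config class, fromValue
// would carry Jackson's @JsonCreator so YAML strings bind to the enum.
public enum PayloadModel {
    BATCH_EVENT("batch_event"),
    SINGLE_EVENT("single_event");

    private final String value;

    PayloadModel(final String value) {
        this.value = value;
    }

    public String getValue() {
        return value;
    }

    public static PayloadModel fromValue(final String value) {
        return Arrays.stream(values())
                .filter(model -> model.value.equals(value))
                .findFirst()
                .orElseThrow(() -> new IllegalArgumentException("Unsupported payload_model: " + value));
    }
}
```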
| private int maxConnectionRetries = DEFAULT_CONNECTION_RETRIES;
|
| @JsonProperty("invocation_type")
| private String invocationType = REQUEST_RESPONSE;
We should make this an enum as well.
| private int maxConnectionRetries = DEFAULT_CONNECTION_RETRIES;
|
| @JsonProperty("invocation_type")
| private String invocationType = EVENT;
|
| @JsonProperty("payload_model")
| private String payloadModel = BATCH_EVENT;
Did you look into moving some of the null-check validations from the plugin into the config? Other than this, the changes look good to me.
dlvenable left a comment:

@srikanthjg, thank you for this contribution. I have a few other changes we should try to get in to improve it. But let's follow up in another PR.
| public static final String NUMBER_OF_RECORDS_FLUSHED_TO_LAMBDA_SUCCESS = "lambdaProcessorObjectsEventsSucceeded";
| public static final String NUMBER_OF_RECORDS_FLUSHED_TO_LAMBDA_FAILED = "lambdaProcessorObjectsEventsFailed";
| public static final String LAMBDA_LATENCY_METRIC = "lambdaProcessorLatency";
We can rename this to simply requestLatency. Or you could call it lambdaRequestLatency, but that seems unnecessary. As I read the code, this is the time taken to make the request to Lambda, regardless of whether it is request-response or event.
Add Lambda Processor Synchronous Mode support

Make LambdaClientFactory common to sink and processor

Signed-off-by: Srikanth Govindarajan <srigovs@amazon.com>
Description
Adds AWS Lambda as a remote processor for Data Prepper.
Further details mentioned in #4699
Issues Resolved
Resolves #4699
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.