Skip to content

(aws-logs): log retention failing cdk deployment with OperationAbortedException #17546

@horsmand

Description

@horsmand

What is the problem?

The LogRetention custom resource is causing CDK deployments to fail due to a race condition happening due to the log group the log retention is trying to create and the log group being created (and the retention period being set) for the log retention Lambda.

There was previously an issue made for this and a fix was completed and released, but many people are still reporting that the issue persists. The previous issue is here: #15709

Reproduction Steps

This is difficult to reproduce consistently, but the more a CDK app makes use of the LogRetention custom resource, the more likely it is to happen.

The RFDK integration tests (RFDK is a CDK construct library) deploy multiple stacks in parallel and have seen the failure on 7 of the last 10 runs.

What did you expect to happen?

We were expecting the CDK app to deploy successfully.

What actually happened?

The CDK app failed with the following error:

CREATE_FAILED        | Custom::LogRetention                        | RenderStructRQ2/RenderQueue/LogGroupWrapper (RenderStructRQ2RenderQueueLogGroupWrapper18972F11) Received response stat
us [FAILED] from custom resource. Message returned: A conflicting operation is currently in progress against this resource. Please try again. (RequestId: 5a7d9dda-94e2-4d20-ad9b-eb66d61746fa)
        new LogRetention (/local/home/horsmand/workspace/rfdk/integ/node_modules/@aws-cdk/aws-logs/lib/log-retention.ts:53:22)
        \_ Function.createOrFetch (/local/home/horsmand/workspace/rfdk/integ/packages/aws-rfdk/lib/core/lib/log-group-factory.ts:46:9)
        \_ new RenderQueue (/local/home/horsmand/workspace/rfdk/integ/packages/aws-rfdk/lib/deadline/lib/render-queue.ts:398:37)
        \_ new RenderStruct (/local/home/horsmand/workspace/rfdk/integ/integ/lib/render-struct.ts:103:24)
        \_ Object.<anonymous> (/local/home/horsmand/workspace/rfdk/integ/integ/components/deadline/deadline_02_renderQueue/bin/deadline_02_renderQueue.ts:52:3)
        \_ Module._compile (internal/modules/cjs/loader.js:1085:14)
        \_ Module.m._compile (/local/home/horsmand/workspace/rfdk/integ/node_modules/ts-node/src/index.ts:1365:23)
        \_ Module._extensions..js (internal/modules/cjs/loader.js:1114:10)
        \_ Object.require.extensions.<computed> [as .ts] (/local/home/horsmand/workspace/rfdk/integ/node_modules/ts-node/src/index.ts:1368:12)
        \_ Module.load (internal/modules/cjs/loader.js:950:32)
        \_ Function.Module._load (internal/modules/cjs/loader.js:790:14)
        \_ Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:76:12)
        \_ main (/local/home/horsmand/workspace/rfdk/integ/node_modules/ts-node/src/bin.ts:331:12)
        \_ Object.<anonymous> (/local/home/horsmand/workspace/rfdk/integ/node_modules/ts-node/src/bin.ts:482:3)
        \_ Module._compile (internal/modules/cjs/loader.js:1085:14)
        \_ Object.Module._extensions..js (internal/modules/cjs/loader.js:1114:10)
        \_ Module.load (internal/modules/cjs/loader.js:950:32)
        \_ Function.Module._load (internal/modules/cjs/loader.js:790:14)
        \_ Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:76:12)
        \_ /local/home/horsmand/.nvm/versions/node/v14.17.1/lib/node_modules/npm/node_modules/libnpx/index.js:268:14

CDK CLI Version

1.129.0

Framework Version

No response

Node.js Version

14.17.1

OS

Amazon Linux 2

Language

Typescript

Language Version

TypeScript ~4.4.4

Other information

I modified the LogRetention Lambda to include a few more log statements and was able to reproduce the error.
This is my modified createLogGroupSafe function:

var _a;

let retryCount = (options === null || options === void 0 ? void 0 : options.maxRetries) == undefined ? 10 : options.maxRetries;
console.log(`retryCount: ${retryCount}`); // DEBUGGING
const delay = ((_a = options === null || options === void 0 ? void 0 : options.retryOptions) === null || _a === void 0 ? void 0 : _a.base) == undefined ? 10 : options.retryOptions.base;
do {
    try {
        const cloudwatchlogs = new AWS.CloudWatchLogs({ apiVersion: '2014-03-28', region, ...options });
        await cloudwatchlogs.createLogGroup({ logGroupName }).promise();
        console.log(`LogGroup ${logGroupName} created successfully. Returning.`); // DEBUGGING
        return;
    }
    catch (error) {
        console.log('Caught error'); // DEBUGGING
        console.log(error); // DEBUGGING
        console.log(error.code); // DEBUGGING
        if (error.code === 'ResourceAlreadyExistsException') {
            // The log group is already created by the lambda execution
            console.log('ResourceAlreadyExistsException'); // DEBUGGING
            return;
        }
        if (error.code === 'OperationAbortedException') {
            console.log(`Caught OperationAbortedException. Retry: ${retryCount}`); // DEBUGGING
            if (retryCount > 0) {
                retryCount--;
                console.log(`Retrying with new retry count: ${retryCount}`); // DEBUGGING
                await new Promise(resolve => setTimeout(resolve, delay));
                continue;
            }
            else {
                console.log('Out of retry attempts'); // DEBUGGING
                // The log group is still being created by another execution but we are out of retries
                throw new Error('Out of attempts to create a logGroup');
            }
        }
        // Any other error
        console.error(error);
        throw error;
    }
} while (true); // exit happens on retry count check

This produced the following log:

START RequestId: d0d569a4-4171-468a-988f-b09e041e0644 Version: $LATEST
2021-11-17T02:55:41.181Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    
{
    "RequestType": "Create",
    "ServiceToken": "arn:aws:lambda:us-west-2:##########:function:RFDKInteg-RQ-ComponentTie-LogRetentionaae0aa3c5b4d-Wm8HvYy4jYOD",
    "ResponseURL": "https://cloudformation-custom-resource-response-uswest2.s3-us-west-2.amazonaws.com/arn%3Aaws%3Acloudformation%3Aus-west-2%3A##########%3Astack/RFDKInteg-RQ-ComponentTier1637117337359895769/98ae8520-4751-11ec-a61d-0284b18a205f%7CRenderStructRQ2RenderQueueLogGroupWrapper18972F11%7C5a7d9dda-94e2-4d20-ad9b-eb66d61746fa?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20211117T025540Z&X-Amz-SignedHeaders=host&X-Amz-Expires=7200&X-Amz-Credential=AKIA54RCMT6SDABHSHLV%2F20211117%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Signature=49cb653700d2f4d9d1a165e48f90cfade037da40c77ab2b356dfa6d73bb83ced",
    "StackId": "arn:aws:cloudformation:us-west-2:##########:stack/RFDKInteg-RQ-ComponentTier1637117337359895769/98ae8520-4751-11ec-a61d-0284b18a205f",
    "RequestId": "5a7d9dda-94e2-4d20-ad9b-eb66d61746fa",
    "LogicalResourceId": "RenderStructRQ2RenderQueueLogGroupWrapper18972F11",
    "ResourceType": "Custom::LogRetention",
    "ResourceProperties": {
        "ServiceToken": "arn:aws:lambda:us-west-2:##########:function:RFDKInteg-RQ-ComponentTie-LogRetentionaae0aa3c5b4d-Wm8HvYy4jYOD",
        "RetentionInDays": "3",
        "LogGroupName": "RFDKInteg-RQ-ComponentTier1637117337359895769-RenderStructRQ2RenderQueue"
    }
}

2021-11-17T02:55:41.188Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    retryCount: 10
2021-11-17T02:55:41.867Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    LogGroup RFDKInteg-RQ-ComponentTier1637117337359895769-RenderStructRQ2RenderQueue created successfully. Returning.
2021-11-17T02:55:42.006Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    retryCount: 10
2021-11-17T02:55:42.130Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    Caught error
2021-11-17T02:55:42.170Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    ResourceAlreadyExistsException: The specified log group already exists
    at Request.extractError (/var/runtime/node_modules/aws-sdk/lib/protocol/json.js:52:27)
    at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:106:20)
    at Request.emit (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:78:10)
    at Request.emit (/var/runtime/node_modules/aws-sdk/lib/request.js:688:14)
    at Request.transition (/var/runtime/node_modules/aws-sdk/lib/request.js:22:10)
    at AcceptorStateMachine.runTo (/var/runtime/node_modules/aws-sdk/lib/state_machine.js:14:12)
    at /var/runtime/node_modules/aws-sdk/lib/state_machine.js:26:10
    at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:38:9)
    at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:690:12)
    at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:116:18) {
  code: 'ResourceAlreadyExistsException',
  time: 2021-11-17T02:55:42.129Z,
  requestId: '43e6312e-25b3-4d54-a95d-01c393e66280',
  statusCode: 400,
  retryable: false,
  retryDelay: 63.22735256810377
}
2021-11-17T02:55:42.207Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    ResourceAlreadyExistsException
2021-11-17T02:55:42.207Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    ResourceAlreadyExistsException
2021-11-17T02:55:42.290Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    OperationAbortedException: A conflicting operation is currently in progress against this resource. Please try again.
    at Request.extractError (/var/runtime/node_modules/aws-sdk/lib/protocol/json.js:52:27)
    at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:106:20)
    at Request.emit (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:78:10)
    at Request.emit (/var/runtime/node_modules/aws-sdk/lib/request.js:688:14)
    at Request.transition (/var/runtime/node_modules/aws-sdk/lib/request.js:22:10)
    at AcceptorStateMachine.runTo (/var/runtime/node_modules/aws-sdk/lib/state_machine.js:14:12)
    at /var/runtime/node_modules/aws-sdk/lib/state_machine.js:26:10
    at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:38:9)
    at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:690:12)
    at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:116:18) {
  code: 'OperationAbortedException',
  time: 2021-11-17T02:55:42.290Z,
  requestId: '16d429d1-1e27-4575-921e-237d508d1318',
  statusCode: 400,
  retryable: false,
  retryDelay: 28.945744461931877
}
2021-11-17T02:55:42.307Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    Responding 
{
    "Status": "FAILED",
    "Reason": "A conflicting operation is currently in progress against this resource. Please try again.",
    "PhysicalResourceId": "RFDKInteg-RQ-ComponentTier1637117337359895769-RenderStructRQ2RenderQueue",
    "StackId": "arn:aws:cloudformation:us-west-2:##########:stack/RFDKInteg-RQ-ComponentTier1637117337359895769/98ae8520-4751-11ec-a61d-0284b18a205f",
    "RequestId": "5a7d9dda-94e2-4d20-ad9b-eb66d61746fa",
    "LogicalResourceId": "RenderStructRQ2RenderQueueLogGroupWrapper18972F11",
    "Data": {
        "LogGroupName": "RFDKInteg-RQ-ComponentTier1637117337359895769-RenderStructRQ2RenderQueue"
    }
}

END RequestId: d0d569a4-4171-468a-988f-b09e041e0644

From this we can see that:

  1. It created the initial log it was trying to create successfully.
  2. It hit a ResourceAlreadyExistsException, likely when trying to create the LogGroup for the Lambda itself, which is expected if the Lambda has previously ran.
  3. After that, it goes right to an OperationAbortedException without hitting the Caught error logging line I added or flowing through any of the retry logic.

I believe this logging would indicate that the OperationAbortedException might not be coming from createLogGroupSafe and might instead be coming from setRetentionPolicy that happens after.

Metadata

Metadata

Assignees

Labels

@aws-cdk/aws-logsRelated to Amazon CloudWatch LogsbugThis issue is a bug.needs-triageThis issue or PR still needs to be triaged.p1

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions