<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Warren Parad</title>
    <description>The latest articles on DEV Community by Warren Parad (@wparad).</description>
    <link>https://dev.to/wparad</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F86409%2Fad0e5c54-e76f-4fd9-864e-f04b266ab62f.jpg</url>
      <title>DEV Community: Warren Parad</title>
      <link>https://dev.to/wparad</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/wparad"/>
    <language>en</language>
    <item>
      <title>Making rate limiting in AWS less terrible</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Tue, 07 Apr 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/aws-builders/making-rate-limiting-in-aws-less-terrible-3n7a</link>
      <guid>https://dev.to/aws-builders/making-rate-limiting-in-aws-less-terrible-3n7a</guid>
      <description>&lt;p&gt;Full disclosure, it is still terrible. I don't promise it wouldn't be, just rather less terrible.&lt;/p&gt;

&lt;p&gt;There are lots of bad ways to do this. Unfortunately, I don't think there are any best practices; each approach comes with its own set of drawbacks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;This is an article for those utilizing AWS in some capacity,
and as such we can't avoid the elephant in the room, API Gateway.
For the rest of the article, when I say `API Gateway`,
I mean the `AWS API Gateway` product, abbreviated `APIGW`.
This is unfortunate naming, since there are architectural components
called `API Gateways`, and what APIGW provides
actually isn't really one of those. When I need to make that
distinction, I'll call it out explicitly.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why rate limiting matters
&lt;/h2&gt;

&lt;p&gt;Realistically, you have an API. You always have an API; if you didn't, you probably wouldn't be reading this article in the first place. Your API may be deployed behind APIGW using Lambda, or maybe it's an ALB handling TLS termination for your containerized compute. At some point, you are going to figure out that you need rate limiting. It's not usually an &lt;strong&gt;if&lt;/strong&gt;, but rather a &lt;strong&gt;when&lt;/strong&gt;. And when that time comes, what you'll need is not per-IP or per-authenticated-account rate limiting, but &lt;strong&gt;per-user&lt;/strong&gt; rate limiting.&lt;/p&gt;

&lt;p&gt;Too often the advice in AWS is &lt;strong&gt;"Throw a WAF at it"&lt;/strong&gt;. And that's not exactly wrong, but it's not the nuanced answer you're looking for either. What if you did rate limit per IP: would that really not work? What about somehow rate limiting on the JWT the user is already sending? I'll get to all of that and more.&lt;/p&gt;

&lt;p&gt;But first, why do you even care?&lt;/p&gt;

&lt;p&gt;Rate limiting solves a real business problem. And usually more than one. And the reason you need to be clear about which problem you're solving is that different motivations lead to different architectural choices, and most of the bad solutions out there come from not being specific about what you're protecting and most importantly &lt;strong&gt;why&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Protecting expensive downstream resources.&lt;/strong&gt; Your API calls a database, a third-party service, or triggers compute that costs real money per invocation. One user hammering an endpoint can run up your &lt;a href="https://chrisshort.net/the-aws-bill-heard-around-the-world/" rel="noopener noreferrer"&gt;bill in minutes&lt;/a&gt;. Without rate limiting, your cost model is "whatever the most aggressive user decides to spend on your behalf." And these days there is usually some threat actor out there who would love to use your &lt;a href="https://github.com/fr34kyn01535/discord-fs" rel="noopener noreferrer"&gt;solution as a database&lt;/a&gt; or as free compute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Maintaining your uptime SLA.&lt;/strong&gt; This is the one people underestimate. Rate limiting isn't just about cost or abuse, it's a real and vital &lt;a href="https://authress.io/knowledge-base/articles/2025/11/01/how-we-prevent-aws-downtime-impacts#helpful-rate-limiting" rel="noopener noreferrer"&gt;strategy for uptime&lt;/a&gt;. Blocking malicious traffic before it saturates your origin is what keeps your service viable for the users who actually matter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enforcing fair usage across tenants.&lt;/strong&gt; In a multi-tenant system, you'll have shared resources, and one tenant over-consuming its allocated capacity will degrade the experience for everyone else. Rate limiting is the mechanism that prevents that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Protecting yourself from your own customers' bugs.&lt;/strong&gt; Not every spike is malicious. A customer could easily ship a mobile app with an infinite retry loop, or a misconfigured webhook that fires on every event, and suddenly you're absorbing ten thousand requests per second from a single client. Your SLA doesn't care whether the outage was caused by an attacker or by one customer lambda-bombing themselves.&lt;/p&gt;

&lt;p&gt;No matter where you go, rate limiting is not just a feature. Fundamentally, you will get to the point where it's required infrastructure that protects your system from whatever the outside world can throw at it. And the hard part isn't deciding you need it; it's implementing it correctly without burying yourself under a mountain of cloud maintenance.&lt;/p&gt;

&lt;p&gt;I don't know if this has been written before. But enough people get rate limiting with API Gateway wrong that another post on the topic can't hurt.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS API Gateway Usage Plans
&lt;/h2&gt;

&lt;p&gt;There are so many things wrong with AWS API Gateway (&lt;strong&gt;APIGW&lt;/strong&gt;) that an entire article could be dedicated to just that. Instead I've taken a different focus, but to get there I still need to touch on at least the relevant parts.&lt;/p&gt;

&lt;p&gt;APIGW has two forms: &lt;code&gt;REST&lt;/code&gt; (V1, Legacy) and &lt;code&gt;HTTP&lt;/code&gt; (V2). V1 is called REST because it supports OpenAPI Spec v2.0 for model validations, has a notion of documentation, lets you automatically deploy a CloudFront distribution on top of your API, and does rate limiting using what they call &lt;a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-api-usage-plans.html" rel="noopener noreferrer"&gt;Usage Plans&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In reality, &lt;code&gt;REST&lt;/code&gt; is &lt;a href="https://aws.amazon.com/api-gateway/pricing/" rel="noopener noreferrer"&gt;3.5x the cost&lt;/a&gt; of &lt;code&gt;HTTP&lt;/code&gt; ($3.50/million requests vs $1.00/million). The world has moved on to v3.2 of the OpenAPI Spec. No one needs the built-in documentation when portals like &lt;a href="https://github.com/Authress-Engineering/openapi-explorer?tab=readme-ov-file#openapi-explorer" rel="noopener noreferrer"&gt;OpenAPI Explorer&lt;/a&gt; exist. And the APIGW CloudFront isn't a real CloudFront: you have no control over it, and thus get none of the benefits. Now I can finally get to the usage plan part.&lt;/p&gt;

&lt;p&gt;And perhaps the question is &lt;em&gt;What the heck are usage plans?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I'm so glad you asked. I've seen many people reach for APIGW explicitly for the usage plans even when they're not otherwise using APIGW, for example when they already have an ALB in place. The truth is, the only good use of APIGW is for Lambda Functions: custom domains, certificates, maybe mTLS, all in front of Lambda. If you aren't using Lambda, you don't need APIGW; there is nothing else it does well, and therefore nothing that would justify adding it to your architecture.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"I couldn't have done X before, but with APIGW I can!" — Someone out there on the internet&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And that's true, but you could likely also do X with CloudFront, directly in your compute, or maybe even with the ALB. Just please don't use APIGW otherwise.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What about usage plans?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Oh, right, I lost the point.&lt;/p&gt;

&lt;h3&gt;
  
  
  What usage plans are and how they work
&lt;/h3&gt;

&lt;p&gt;The mental model is straightforward. You create a &lt;a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-api-usage-plans.html" rel="noopener noreferrer"&gt;usage plan&lt;/a&gt;, which defines a throttle (rate + burst) and an optional quota.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Usage Plan "Standard Tier"
├── Throttle: 100 requests/second, burst 200
└── Quota: 50,000 requests/month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
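&lt;p&gt;That throttle line (a steady rate plus a burst) is classic token bucket behavior. Here's a minimal sketch of the semantics in JavaScript; it's purely illustrative of the model, since APIGW's actual internal implementation isn't public:&lt;/p&gt;

```javascript
// Token bucket matching the "100 requests/second, burst 200" throttle above.
// Illustrative only: APIGW's real internals are not public.
class TokenBucket {
  constructor(ratePerSecond, burst) {
    this.rate = ratePerSecond;    // tokens refilled per second
    this.capacity = burst;        // maximum bucket size (the burst)
    this.tokens = burst;          // start full
    this.lastRefill = Date.now();
  }

  // Returns true if the request is allowed, false if it should get a 429.
  tryRemoveToken(now = Date.now()) {
    // Refill proportionally to elapsed time, capped at capacity.
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.rate);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

&lt;p&gt;A full bucket absorbs a burst of 200 instantly, then refills at 100 tokens per second, which is exactly the shape of the plan above.&lt;/p&gt;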



&lt;p&gt;And then you create an "API key" and attach it to the plan. An API Key isn't actually an API key; it's just what APIGW, in its infinite wisdom, decided to call an &lt;strong&gt;instance&lt;/strong&gt; of a usage plan. It's the mapping of the usage plan to the user you want to rate limit. The problem is: how do you assign this mapping?&lt;/p&gt;

&lt;p&gt;APIGW usage plans work by letting you first create "API Keys" and assign each key to a usage plan; then, when a user interacts with your API, on every request you tell APIGW which API Key is being used.&lt;/p&gt;

&lt;p&gt;More specifically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Usage Plan "Standard Tier"
├── Throttle: 100 requests/second, burst 200
├── Quota: 50,000 requests/month
├── API Key: user_001  ← attached
├── API Key: user_002  ← attached
└── API Key: user_003  ← attached
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So for instance, when the user with JWT &lt;code&gt;sub&lt;/code&gt; user_001 shows up at your API, you can tell APIGW that it should find the usage plan attached to the API Key &lt;code&gt;user_001&lt;/code&gt;. You convey this critical information to APIGW via a custom lambda authorizer. You could also do this ridiculous thing of completely discarding any notion of security and asking the user to send you their API Key in a custom field and using that to key off. But I wouldn't recommend it. (It's worth noting this is probably what the original APIGW developers had in mind when they created it, but we know API keys are insecure by design, I've extensively covered that in how &lt;a href="https://authress.io/knowledge-base/academy/topics/how-does-machine-to-machine-auth-work" rel="noopener noreferrer"&gt;machine to machine authentication works&lt;/a&gt;.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;ApiGatewayClient&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;authorizer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;principalId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user_001&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;usageIdentifierKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user_001&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Example authorizer implementation&lt;/em&gt;&lt;/p&gt;
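&lt;p&gt;The &lt;code&gt;user_001&lt;/code&gt; above is hardcoded; in a real authorizer you'd pull it from the &lt;code&gt;sub&lt;/code&gt; claim of the already-verified JWT. A small sketch of just the claim extraction; note that decoding is &lt;em&gt;not&lt;/em&gt; verification, so only do this after the token's signature has been validated:&lt;/p&gt;

```javascript
// Read the `sub` claim out of a JWT payload.
// Decoding is NOT verification: only call this after the token's
// signature has been validated (e.g. against your issuer's JWKS).
function getSubFromJwt(token) {
  const payloadSegment = token.split('.')[1];
  const json = Buffer.from(payloadSegment, 'base64url').toString('utf8');
  return JSON.parse(json).sub;
}
```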

&lt;p&gt;This seems perfect for per-user rate limiting. Create an API Key with the same value as the user's &lt;code&gt;sub&lt;/code&gt; / user ID. Assign it to a plan. Done.&lt;/p&gt;

&lt;p&gt;It's not done.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hard limits on usage plan keys
&lt;/h3&gt;

&lt;p&gt;Usage plans have a hard cap on the number of API keys: &lt;a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-execution-service-limits-table.html" rel="noopener noreferrer"&gt;10,000 per account per region&lt;/a&gt;. This is not adjustable. You cannot request an increase. I'm sure there is some amount of money where that isn't true in practice. But, it's fun to think about hard limits as actually being unmovable, and you probably have better things to do than praying that some poor TAM will help you with your support case.&lt;/p&gt;

&lt;p&gt;Now, you may be thinking: "But I only have 1,000 users." However, you need to look at this from a business perspective, not a technical one. To be successful you might only need 1,000 paying users. But if your churn is around 50%, that's 500 churned keys per year. Sign-ups that don't convert can easily be another 500–1,000 per year depending on scale. Which means that within a few years (and it only takes one good ad campaign to accelerate this), your limit of 10,000 is completely insufficient.&lt;/p&gt;
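&lt;p&gt;To make that concrete, here's the arithmetic as a sketch, using the illustrative numbers from the paragraph above (they are assumptions, not benchmarks). Keys for churned users and dead sign-ups accumulate unless you aggressively delete them:&lt;/p&gt;

```javascript
// Rough projection of API Key consumption against the 10,000 hard cap.
// Assumes churned and never-converted keys accumulate (no cleanup).
function yearsUntilKeyCapExhausted({ payingUsers, churnRate, deadSignupsPerYear, cap = 10000 }) {
  const keysBurnedPerYear = payingUsers * churnRate + deadSignupsPerYear;
  let total = payingUsers;
  let years = 0;
  while (cap > total) {
    total += keysBurnedPerYear;
    years += 1;
  }
  return years;
}
```

&lt;p&gt;With 1,000 paying users, 50% churn, and 1,000 non-converting sign-ups per year, the cap is gone in about 6 years, and one good ad campaign compresses that dramatically.&lt;/p&gt;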

&lt;p&gt;There is potentially an exception here for business customers. For B2B apps, you likely wouldn't use the user ID as the key anyway; you'd use the business account ID. That means you'll have at least an order of magnitude fewer account IDs than consumer user IDs, probably even fewer, so this solution may actually be sufficient for those scenarios. All the other limitations unfortunately still apply.&lt;/p&gt;

&lt;p&gt;For consumer apps, the user ID as the plan key can't meet the scaling needs of any real user base. And the same goes for most business apps.&lt;/p&gt;

&lt;h3&gt;
  
  
  The bootstrap problem
&lt;/h3&gt;

&lt;p&gt;Forgetting about the hard limit doesn't alleviate all our issues, however. Another obvious one that will immediately come up is that there is no &lt;code&gt;Default&lt;/code&gt; rate limit. Once you enable a usage plan for an API, every request needs to be coupled back to a usage plan, which means an API Key must already exist for that user.&lt;/p&gt;

&lt;p&gt;Here lives a paradox. API calls require API Keys, but you won't know to create the API Key without there first being a call to your API. This leaves a couple of possible solutions:&lt;/p&gt;

&lt;h4&gt;
  
  
  Option 1: control plane API
&lt;/h4&gt;

&lt;p&gt;Utilize the APIGW control plane to check whether an API Key exists, from inside your custom Lambda authorizer. If it doesn't exist, you can use the control plane to associate the API Key with the right usage plan at that moment. The APIGW hard limit for &lt;a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html" rel="noopener noreferrer"&gt;CreateApiKey&lt;/a&gt; is &lt;a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html" rel="noopener noreferrer"&gt;5 RPS per AWS Account&lt;/a&gt;, so there is no way you can call it directly in your authorizer for every request.&lt;/p&gt;

&lt;p&gt;Thankfully we don't have to create a key on every request; we can check for an existing one first. But the GetApiKey API isn't even documented there, so we have no idea what its limit is. Even assuming it's some multiple of the CreateApiKey limit, that still leaves us in a situation where we will end up getting throttled in the Lambda authorizer when we call the APIGW control plane.&lt;/p&gt;

&lt;p&gt;You might be thinking that's acceptable, but remember why we built this in the first place: you will end up throttled by the APIGW control plane at the exact moment you are also getting spammed and need rate limiting to work. Not a great story.&lt;/p&gt;

&lt;p&gt;Now, you could attempt to turn on authorizer caching with a TTL of up to an hour, and hope this reduces the load on GetApiKey enough to provide breathing room. In reality that is going to provide only limited value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;ApiGatewayClient&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;apiGatewayClient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ApiGatewayClient&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;authorizer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user_001&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;apiKey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;apiGatewayClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getApiKeys&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;nameQuery&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;includeValues&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;apiKey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;apiGatewayClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createApiKey&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;apiGatewayClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createUsagePlanKey&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;keyId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;usagePlanId&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;principalId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;usageIdentifierKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Authorizer: just-in-time API Key provisioning&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It's doable like this, but it isn't great. In essence it doesn't really fix the problem; you've just moved it somewhere else. It feels like it works, but remember that this strategy lets anyone abuse your API just by creating new accounts. So in practice, generating API Keys and attaching them to usage plans on demand isn't a real rate-limiting strategy; we actually want to block new accounts from getting automatic API Keys.&lt;/p&gt;

&lt;p&gt;There are also two subtle bugs in the above code. First, what happens if the API Key is created but never attached to a usage plan? We'll have a critical failure for that user, and this will certainly happen to any user who attempts to sign up while you are having an incident. Second, on every call you are slowing down your authorizer by calling a control plane that was never designed for this. Want to slow an authorizer down by multiple seconds and definitely get rate limited yourself? Not very sustainable.&lt;/p&gt;
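&lt;p&gt;If you ship something like Option 1 anyway, the create-and-attach pair at least needs to survive a crash between the two calls. A hedged sketch of the repair logic follows; the control plane client is injected so it can run without AWS, and the assumption that a double-attach raises a &lt;code&gt;ConflictException&lt;/code&gt; is mine:&lt;/p&gt;

```javascript
// Repair the "key exists but was never attached to a plan" failure mode.
// `client` is any object exposing the APIGW control plane calls used in
// this article; it is injected here so the logic is testable without AWS.
async function ensureKeyAttached(client, userId, usagePlanId) {
  let apiKey = (await client.getApiKeys({ nameQuery: userId, includeValues: true, limit: 1 })).items[0];
  if (!apiKey) {
    apiKey = await client.createApiKey({ name: userId, value: userId });
  }
  try {
    // Attach every time; treating "already attached" as success means a
    // crash between create and attach heals on the next request.
    await client.createUsagePlanKey({ keyId: apiKey.id, usagePlanId });
  } catch (e) {
    if (e.name !== 'ConflictException') {
      throw e;
    }
  }
  return apiKey;
}
```

&lt;p&gt;This only patches the atomicity gap; all the throttling and abuse problems above still apply.&lt;/p&gt;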

&lt;p&gt;One more thing on authorizer caching. Authorizer caching is extremely dangerous in its own way, since you might be allowing expired tokens to still be used with your API; API Gateway authorizer caching doesn't know to automatically expire the cache entry when the token expires! However, focusing on our use case: the cache TTL does reduce calls, but the first request per user per cache window still hits the control plane.&lt;/p&gt;
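&lt;p&gt;One hedged mitigation for the expired-token hazard: have the authorizer reject any token whose &lt;code&gt;exp&lt;/code&gt; falls inside the cache window, so nothing the cache later serves can be expired. A sketch, assuming &lt;code&gt;cacheTtlSeconds&lt;/code&gt; mirrors your configured authorizer TTL:&lt;/p&gt;

```javascript
// Only authorize tokens that will outlive the authorizer cache window.
// `exp` is the standard JWT expiry claim, in seconds since the epoch.
function tokenOutlivesCache(decodedToken, cacheTtlSeconds, nowSeconds = Math.floor(Date.now() / 1000)) {
  return decodedToken.exp >= nowSeconds + cacheTtlSeconds;
}
```

&lt;p&gt;The trade-off is that clients have to refresh tokens slightly earlier than their actual expiry.&lt;/p&gt;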

&lt;p&gt;Some quick math: if you have 5,000 active users and a cache TTL of 5 minutes, then as entries expire you get roughly 1,000 control plane calls per minute (about 17 RPS), just for the "does this key exist?" check. Under traffic spikes, which are the exact scenario you're rate limiting for, cache misses increase. More unique users means more control plane calls, which means you're DDoSing the APIGW control plane while trying to prevent a DDoS on your API. At any sort of scale this isn't going to work, even theoretically.&lt;/p&gt;
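&lt;p&gt;That arithmetic, made explicit (the numbers are the illustrative ones from the text):&lt;/p&gt;

```javascript
// Steady-state control plane load from authorizer cache misses:
// each active user misses roughly once per cache window.
function cacheMissLoad(activeUsers, cacheTtlSeconds) {
  return {
    callsPerMinute: activeUsers / (cacheTtlSeconds / 60),
    rps: activeUsers / cacheTtlSeconds
  };
}
```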

&lt;p&gt;So let's move on.&lt;/p&gt;

&lt;h4&gt;
  
  
  Option 2: Pregeneration
&lt;/h4&gt;

&lt;p&gt;I personally hate this next strategy, but I know some people love it. To completely side-step the problem of generating the usage plan API Key and attaching it to the usage plan &lt;em&gt;in the authorizer&lt;/em&gt;, while the user is actively making a request, you can pregenerate some keys.&lt;/p&gt;

&lt;p&gt;But that's where this falls down a bit: how many should you generate, how do you know when to generate more, and what should you even do with those keys?&lt;/p&gt;

&lt;p&gt;I think the whole usage plan thing is a lost cause, but I'll try to provide some guidance for strategy, just in case it's the one you decide to end up going with.&lt;/p&gt;

&lt;p&gt;First of all, it's of course easy to generate some usage plan API Keys and store them in a DB somewhere. I don't know whether storing the keys in a DB is better than generating them at runtime in the authorizer. But if it is, you are swapping "generation" via the API Gateway API for "generation" via your own "API", which is probably just a query to a database. And you might end up with race conditions around which key gets selected and handed to whom.&lt;/p&gt;
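&lt;p&gt;The race is in the "pick an unclaimed key" step: two sign-ups reading the table at once can grab the same key. The usual fix is a conditional write; here's a store-agnostic sketch (with DynamoDB, the same check would be a &lt;code&gt;ConditionExpression&lt;/code&gt; like &lt;code&gt;attribute_not_exists(ownerId)&lt;/code&gt;):&lt;/p&gt;

```javascript
// Claim a pregenerated key for a user, idempotently.
// `store` is a Map of keyId to { ownerId }; in a real database the claim
// would be a conditional write so concurrent sign-ups cannot collide.
function claimKey(store, userId) {
  for (const [keyId, row] of store) {
    if (row.ownerId === userId) {
      return keyId; // the user already holds this key
    }
    if (row.ownerId === undefined) {
      row.ownerId = userId; // conditional write: claim only if unowned
      return keyId;
    }
  }
  return null; // pool exhausted: time to pregenerate more keys
}
```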

&lt;p&gt;Another thing to be cognizant of is who will own that key. See, keys aren't really owned by anyone. There is no way to pre-assign keys to individual users (how would that even work?), and certainly not before you even know who the user is. So at what point do you give the user a key, and what is the key's value?&lt;/p&gt;

&lt;p&gt;In the &lt;strong&gt;Option 1&lt;/strong&gt; strategy above and the &lt;strong&gt;Option 3&lt;/strong&gt; strategy below, we're assigning keys to users based on the assumption that the api key exactly matches the user ID. But if it doesn't, how does the user even get the key in the first place in order to call your API?&lt;/p&gt;

&lt;p&gt;The trivial answer is: &lt;em&gt;They call a dedicated endpoint, and that returns them the key&lt;/em&gt;. Well that doesn't really make sense because it completely duplicates the problem that this Option was supposed to solve in the first place. Maybe there is a smarter answer here, but I honestly don't know what that would be.&lt;/p&gt;

&lt;p&gt;The non-trivial answer is: &lt;em&gt;When users sign up, decide ahead of time what their user ID will be, so that you know also ahead of time what the API Key will be.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's also not really a solution, because it requires coupling your sign-up process to your API Key generation process, and that might not even be something you fully control.&lt;/p&gt;

&lt;p&gt;The last thing that comes to mind for this option is a long-term problem: users can "buy" keys. Since keys aren't coupled to users, they can sign up multiple times, get multiple keys, and rotate through them to call your APIs, because nothing validates that a user is using the right key.&lt;/p&gt;

&lt;p&gt;But this brings up another point. Where are users supposed to save their API Key? If we don't know who the user is before they call our API, then we can't create an API Key that can be derived from their user ID. That means the key value will differ from their user ID, and if it differs, it needs to be available somewhere for the client to store; someone has to be responsible for storing it. I suppose you could maintain a mapping of user IDs to API Keys in your database and create an endpoint called &lt;code&gt;GET /user-api-keys&lt;/code&gt; returning the list of API Keys available for the user, so they can use one for follow-up requests. But again, at that point you'll have an endpoint whose job is to return a key; you might as well side-step that completely, assume the key is the same as the user ID, and not bother storing keys in the first place.&lt;/p&gt;

&lt;h4&gt;
  
  
  Option 3: Account Creation
&lt;/h4&gt;

&lt;p&gt;One genius thing you probably already thought of to solve the above problems is — &lt;em&gt;what if we create the usage plan api key during account creation instead?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Genius!&lt;/p&gt;

&lt;p&gt;To do that, you could rate limit all your endpoints without issue, except for the Account Creation one. Right?&lt;/p&gt;

&lt;p&gt;Wrong!&lt;/p&gt;

&lt;p&gt;Users will create accounts through your UI. When they do, they will likely need to load a bunch of data from your service to understand exactly how to do that. So in practice you'll have more than just the &lt;code&gt;POST /accounts&lt;/code&gt; API that needs to be completely exposed. Of course you'll still have an authorizer there to validate incoming JWTs, but rate limiting would remain an unsolved issue, and rate limiting with the same API Key as all the other endpoints would be impossible. (Because the API Key wouldn't exist yet when those account creation endpoints are called.)&lt;/p&gt;

&lt;p&gt;Remember, you are probably still caching the authorizer result, so that you don't need to fetch the OAuth public key list (JWKS) on every request to validate tokens. But at the same time, this means you've baked the authorizer's result into the cache. And that cached result says "No API Key available".&lt;/p&gt;

&lt;p&gt;Wait, why is that again?&lt;/p&gt;

&lt;p&gt;How do you know to call &lt;code&gt;POST /accounts&lt;/code&gt; in the first place? Well of course you know because first you called &lt;code&gt;GET /accounts&lt;/code&gt;. Is &lt;code&gt;GET /accounts&lt;/code&gt; rate limited? Is it rate limited using the same authorizer as your "account creation endpoint" authorizer or your "all other endpoints" authorizer? Depending on which authorizer you used, you might have already determined which endpoints were acceptable to be called and with which rate.&lt;/p&gt;

&lt;p&gt;You also have a secondary problem here: is your account creation flow async? I know ours is. That means the API Key might only be created minutes from now, while the user sits in the UI with a little spinner waiting for the account to be created.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fytwqer0o34ucqws2twru.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fytwqer0o34ucqws2twru.png" alt="Account Creation in progress spinner" width="431" height="223"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One trick that sort of works: in your authorizer, turn on caching, but also look up the mapped user or B2B customer account in your database. If that account was created in the last 5 minutes, give them a shared, limited-use API Key; otherwise use the API Key dedicated to their account.&lt;/p&gt;

&lt;p&gt;This shifts the burden from your API resources (your backend origin compute) to a custom Lambda authorizer that interacts with a database in a cached way. Depending on your needs you can also get pretty fancy here, but I wouldn't recommend it. The whole point is a hack to get these new users a temporary key that works, but which long term isn't the key they'll be using for 100% of their requests.&lt;/p&gt;

&lt;p&gt;This avoids the unlimited fallback plan that would be a new security hole, and side-steps building a rate limiter with a bypass for the exact users you can't yet identify.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;DynamoDB&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dynamoDBClient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;DynamoDB&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;authorizer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user_001&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;account&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;dynamoDBClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;TableName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;account&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;accountWasRecentlyCreated&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;account&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;createdTime&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;now&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;minus&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;minutes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;principalId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;usageIdentifierKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;accountWasRecentlyCreated&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Temporary-Usage-Plan-Key&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Authorizer for recently created users&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Usage Plans in practice
&lt;/h3&gt;

&lt;p&gt;And that works ... sort of. The API surface for managing usage plans is painful. And not the normal kind of painful. It's the kind where you realize the API was designed for manual console clicks, not programmatic management.&lt;/p&gt;

&lt;p&gt;A key can be associated with up to 10 usage plans simultaneously. Yay, you might think, but actually only one usage plan applies per API stage. So if you need to change a user's tier, for instance to move the user or account from &lt;strong&gt;"Standard"&lt;/strong&gt; to &lt;strong&gt;"Premium"&lt;/strong&gt;, you have to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Call &lt;code&gt;DeleteUsagePlanKey&lt;/code&gt; to remove the key from the old plan&lt;/li&gt;
&lt;li&gt;Call &lt;code&gt;CreateUsagePlanKey&lt;/code&gt; to add it to the new plan&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Two separate API calls. No transaction. No atomicity. Between step 1 and step 2, the key is unassociated — meaning the user has no access at all. Under load, that window matters.&lt;/p&gt;

&lt;p&gt;With caching, the window matters less, but at a scale where rate limits are a responsible strategy, this feels like an irresponsible solution. You aren't going to be doing this every day, but the users most affected by it are precisely your highest-frequency users.&lt;/p&gt;

&lt;p&gt;Said differently: the users who need a higher rate plan, the very ones who want to pay you for it, require you to temporarily delete their usage plan association just so you can upgrade them. Good luck!&lt;/p&gt;
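&lt;p&gt;In code, the "upgrade" reduces to something like the sketch below. The client interface here is illustrative (a thin stand-in for the APIGW SDK's &lt;code&gt;DeleteUsagePlanKey&lt;/code&gt; and &lt;code&gt;CreateUsagePlanKey&lt;/code&gt; operations), but the shape of the problem is real: two calls, no transaction.&lt;/p&gt;

```javascript
// Sketch of the non-atomic tier change. `gatewayClient` is an illustrative
// stand-in for an API Gateway SDK client; only the two-call sequence matters.
async function upgradeTier(gatewayClient, apiKeyId, fromPlanId, toPlanId) {
  // Step 1: detach the key from its current plan.
  await gatewayClient.deleteUsagePlanKey({ usagePlanId: fromPlanId, keyId: apiKeyId });

  // Right here the key belongs to no plan at all: any request arriving
  // in this window is rejected outright.

  // Step 2: attach the key to the new plan.
  await gatewayClient.createUsagePlanKey({
    usagePlanId: toPlanId,
    keyId: apiKeyId,
    keyType: 'API_KEY'
  });
}
```

&lt;p&gt;And if step 2 fails, you now also need compensation logic to re-attach the old plan, or the user stays locked out entirely.&lt;/p&gt;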

&lt;h4&gt;
  
  
  Quotas
&lt;/h4&gt;

&lt;p&gt;Another area that is a straight pit of failure is the usage &lt;code&gt;quotas&lt;/code&gt; that come with the usage plans. The rate-limiting part is nice, but you're encouraged to also set a fixed quota on resources. And it sounds like a very convincing idea!&lt;/p&gt;

&lt;p&gt;However, you'll soon find out that these quotas only reset at the end of the day, week, or month. Which seems incredibly arbitrary. But it gets worse. Since you are encouraged to set them, what happens in practice is that some user uses up the entire quota. Success!&lt;/p&gt;

&lt;p&gt;When that happens, they are blocked, just like we wanted. But now comes the problem: is that really what you want? You probably have a much better business strategy in play.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkozdl6jghua1egq54706.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkozdl6jghua1egq54706.png" alt="Underpants gnomes talk about rate limiting strategies" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But &lt;strong&gt;the profit&lt;/strong&gt; in this case is supposed to be that the user pays more. There are two problems with this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You are likely blocking critical production access to your API, since the quota is consumed.&lt;/li&gt;
&lt;li&gt;Because the quota is exhausted, you are likely also blocking the very API access they would need in order to pay you and increase the quota.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What you have done here is introduce a technical solution to a business problem, when in reality these are completely separate concerns. But I'll get more into this later, in the &lt;em&gt;Do you really need rate limiting&lt;/em&gt; section.&lt;/p&gt;

&lt;h3&gt;
  
  
  Endpoint cardinality
&lt;/h3&gt;

&lt;p&gt;You'll be happy to know that you &lt;em&gt;can&lt;/em&gt; define per-method throttle overrides within a plan, scoped to specific resource + method combinations: &lt;code&gt;GET /items&lt;/code&gt; at 100/s, &lt;code&gt;POST /items&lt;/code&gt; at 10/s. Sometimes, if you are lucky, the API will even allow specifying 0.1/s or slower, but often it will complain.&lt;/p&gt;

&lt;p&gt;But the problem in practice is that every user on the same plan gets the same per-method limits. If you want user A to have different endpoint limits than user B, you need different plans. It's likely that the combinatorial explosion of users × endpoint tiers makes this unworkable for anything beyond a handful of static tiers. And each of your plans will end up looking like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::ApiGateway::UsagePlan&lt;/span&gt;
&lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;UsagePlanName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TIER1&lt;/span&gt;
  &lt;span class="na"&gt;Throttle&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;BurstLimit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
    &lt;span class="na"&gt;RateLimit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
  &lt;span class="na"&gt;ApiStages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;ApiId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;ApiGateway&lt;/span&gt;
      &lt;span class="na"&gt;Stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
      &lt;span class="na"&gt;Throttle&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;/v1/records/GET&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;BurstLimit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
          &lt;span class="na"&gt;RateLimit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
        &lt;span class="na"&gt;/v1/accounts/POST&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;BurstLimit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
          &lt;span class="na"&gt;RateLimit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.000001&lt;/span&gt;
  &lt;span class="na"&gt;Quota&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Since quotas only reset once a full day,&lt;/span&gt;
    &lt;span class="c1"&gt;#   it's going to be pain everywhere when a customer hits it.&lt;/span&gt;
    &lt;span class="c1"&gt;# Instead we'll just pick something really really high.&lt;/span&gt;
    &lt;span class="na"&gt;Limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;70000000&lt;/span&gt;
    &lt;span class="na"&gt;Period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DAY&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Usage Plan configuration&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;What you actually want instead is per-user-per-endpoint granularity, which of course would require one plan per user per endpoint configuration. And to be able to dynamically update this based on their pricing plan and expected usage. That's not rate limiting, that's a whole plan management system.&lt;/p&gt;

&lt;h3&gt;
  
  
  The usage plan verdict
&lt;/h3&gt;

&lt;p&gt;I don't really understand the world where the usage plans architecture makes sense, so it isn't a choice I've been able to justify. To use it for actual rate limiting is building on a foundation that fights you at every step: hard limits, non-atomic updates, a terrible API, and a bootstrap problem that creates the exact hole you're trying to close.&lt;/p&gt;

&lt;p&gt;It explains a lot when you understand that Usage Plans only exist for the legacy APIGW V1 (REST APIs) and don't exist in V2 (HTTP APIs). It's a good indicator to remember: if a feature doesn't exist on HTTP APIs, you should probably think twice before going to production with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rolling your own rate limiter
&lt;/h2&gt;

&lt;p&gt;So, Usage Plans are out. The next place most people land is: build the rate limiter yourself. Do they, do they really? Everyone has got to know that building it yourself comes with no shortage of challenges. But we can't exclude that there might be an actual good reason for the 0.1% of use cases. So let's review it as a potential solution, which reduces to:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Store a counter somewhere, increment it on every request, block if exceeded.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;How hard could it be?&lt;/p&gt;

&lt;p&gt;In a traditional server architecture, you could keep counters in memory. Nginx does this. Envoy does this. Rate limiting is a solved problem when you have a process that lives long enough to count. This requires legacy infrastructure coupled with fixed compute and a centralized reverse proxy layer to filter all requests through. We know fundamentally this isn't scalable.&lt;/p&gt;

&lt;p&gt;And if you're running Lambda, or another kind of compute or containers that are stateless, then this will likely immediately break down. In production, you might have hundreds or thousands of instances running concurrently, each with no knowledge of the others. There is no shared memory. There is no "the server." Every invocation is an island.&lt;/p&gt;

&lt;p&gt;With a small number of instances, you could just handle the limiting decentrally in each instance. Sure, that rate limit of 100 requests per second with three instances effectively becomes 300 requests per second. But that won't really impact your infrastructure that much.&lt;/p&gt;
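&lt;p&gt;A per-instance limiter really is that small. A minimal fixed-window sketch (the window and limit are illustrative), with the caveat baked in: each instance counts only its own traffic, so the effective global limit is the per-instance limit multiplied by the instance count.&lt;/p&gt;

```javascript
// Minimal per-instance fixed-window rate limiter. No shared state:
// with N instances, a limit of 100/s really allows up to N * 100/s.
function createLocalLimiter(limit, windowMs = 1000, now = Date.now) {
  let windowStart = now();
  let count = 0;
  return function allow() {
    const t = now();
    if (t - windowStart >= windowMs) {
      // A new window has started; reset the counter.
      windowStart = t;
      count = 0;
    }
    count += 1;
    return count <= limit;
  };
}
```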

&lt;p&gt;Getting back to it, we need a counter store. Something external that all your instances can read and write atomically. In AWS, there are two realistic options.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithItems.html#WorkingWithItems.AtomicCounters" rel="noopener noreferrer"&gt;DynamoDB&lt;/a&gt;&lt;/strong&gt; gives you atomic counters via &lt;code&gt;UpdateItem&lt;/code&gt; with &lt;code&gt;ADD&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://valkey.io/" rel="noopener noreferrer"&gt;ValKey&lt;/a&gt;&lt;/strong&gt; (the open-source fork of Redis) gives you &lt;a href="https://valkey.io/commands/incr/" rel="noopener noreferrer"&gt;&lt;code&gt;INCR&lt;/code&gt;&lt;/a&gt; with a TTL.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both are battle-tested primitives. Both do exactly what you need for this problem.&lt;/p&gt;

&lt;p&gt;Sort of.&lt;/p&gt;

&lt;p&gt;This approach genuinely has advantages. You're in full control. You can set whatever limits you want, per user, per endpoint, per whatever custom dimension makes sense for your business. No hard caps on API keys, no bootstrap paradox, no APIGW control plane throttling. And your architecture configuration is completely independent. It doesn't matter if you are running a single compute instance or hundreds of thousands of them. (Well, it sort of does, as you will need to also scale up your counter store).&lt;/p&gt;

&lt;p&gt;Then you deploy it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 1: Every request pays the tax
&lt;/h3&gt;

&lt;p&gt;Even if we throw out the complexity of managing this technology (where do we deploy it, what is the schema, how do we integrate with it, when do we upgrade, how do we gracefully fall back ...), the first thing you'll notice is latency. Every request, not just the ones you want to block, &lt;em&gt;every single one&lt;/em&gt;, now has a mandatory round-trip to your counter store before it does anything useful. One of the worst mistakes inexperienced architects make is creating a solution that solves for an edge case by degrading the most common use case. It's only the edge case that should be affected by complexity, but here everyone pays it.&lt;/p&gt;

&lt;p&gt;For DynamoDB in the same region, that's roughly 5–10ms. For ValKey, maybe 1–2ms. These aren't catastrophic numbers. But they're on every request. Your best customer making 10 requests per minute is paying the same latency tax as the abuser making 10,000. You're taxing 100% of your traffic to protect against the fraction that's problematic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 2: Cost
&lt;/h3&gt;

&lt;p&gt;Then there's the cost. Using these resources isn't free, and as you scale, you'll need to scale this solution too. While you might be able to get something else useful out of the same stack (e.g. use ValKey for a second purpose), you've essentially added a load-bearing, critical-path component to your service, product, and API.&lt;/p&gt;

&lt;p&gt;DynamoDB: every counter update is a write request unit. An attacker at 10,000 requests per second means 10,000 writes per second to your counter table — &lt;a href="https://aws.amazon.com/dynamodb/pricing/on-demand/" rel="noopener noreferrer"&gt;864 million per day&lt;/a&gt;. You're burning DynamoDB write capacity on the counter store to prevent the attacker from burning compute on your API. The rate limiter itself becomes a denial of wallet attack surface.&lt;/p&gt;

&lt;p&gt;ValKey: great performance, but now you have a stateful cluster in your "serverless" architecture. ElastiCache nodes to size, failover to configure, &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/configuration-database.html" rel="noopener noreferrer"&gt;connection pooling in Lambda&lt;/a&gt; to manage. You went serverless to not manage infrastructure, and now you're managing a Redis cluster because you needed a counter.&lt;/p&gt;

&lt;p&gt;If you have thousands of instances of your compute, you will need a cluster that can handle thousands of concurrent connections to your store of choice. Are you starting to love this solution?&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 3: Edge cases
&lt;/h3&gt;

&lt;p&gt;There are a lot of edge cases I'm going to overlook in this section, not because they are annoying to talk about, but because I don't think they are important. If you are pedantic, you can consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The complexities of race conditions with your atomic store?&lt;/li&gt;
&lt;li&gt;Does 10 RPS mean 10 RPS max, average, burst, or best effort?&lt;/li&gt;
&lt;li&gt;What about when your store goes down: do you fail open or closed? How do you even write that code?&lt;/li&gt;
&lt;li&gt;Does your storage of choice even support the scale you need? Before we were looking at compute, but now we've shifted the concern to your Rate Limiting implementation. How are you going to avoid making your rate limiter your service bottleneck?&lt;/li&gt;
&lt;/ul&gt;
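&lt;p&gt;Just the fail-open-or-closed question forces a real decision into code. A sketch of the fail-open variant (the store interface is illustrative; it stands in for any atomic counter such as ValKey &lt;code&gt;INCR&lt;/code&gt;):&lt;/p&gt;

```javascript
// Fail-open wrapper: if the counter store errors out, let the request
// through rather than taking the whole API down with the store.
async function checkRateLimit(store, userId, limit) {
  try {
    const count = await store.increment(userId); // atomic INCR-style call
    return count <= limit;
  } catch (err) {
    // Store unavailable. Failing closed would turn a store outage into
    // a full API outage, so here we fail open and rely on monitoring.
    return true;
  }
}
```

&lt;p&gt;Failing open means a store outage silently disables rate limiting; failing closed means a store outage becomes a full API outage. Neither answer is purely technical.&lt;/p&gt;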

&lt;p&gt;In a lot of cases you probably took a very nice and simple distributed compute system, centralized it, and created a huge single point of failure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 4: The devil
&lt;/h3&gt;

&lt;p&gt;And it's always in the details. I haven't even started talking about the implementation, so let's get to that. How do you even implement this in practice?&lt;/p&gt;

&lt;p&gt;Above I alluded to using &lt;code&gt;INCR&lt;/code&gt; or &lt;code&gt;ADD&lt;/code&gt; in a database request. For simplicity, let's assume you are using DynamoDB here. Hopefully the complexity of the implementation will immediately become clear.&lt;/p&gt;

&lt;p&gt;One person is going to jump up and down and say:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;rateLimitCheck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;dynamoDbClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;TableName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;RateLimits&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;PK&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;UserId&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;UpdateExpression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SET #counter = if_not_exists(#counter, :zero) + :one&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;ExpressionAttributeNames&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;#counter&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;counter&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;ExpressionAttributeValues&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;:zero&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;:one&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But that's just going to increase monotonically until it hits the maximum number DynamoDB supports, which I don't even know off the top of my head. Let's assume it's something like NodeJS's &lt;code&gt;MAX_SAFE_INTEGER&lt;/code&gt; or a &lt;code&gt;BigInt&lt;/code&gt;, but even then, that's probably wrong.&lt;/p&gt;

&lt;p&gt;And that's not even the worst part. How do we even check whether a user is consuming &lt;strong&gt;more than 10 per second&lt;/strong&gt;?&lt;/p&gt;

&lt;p&gt;For kicks, I gave this problem to Gemini to see what hot garbage it returns (and spoiler alert: not only was it wrong, it was also in Python):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update_item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PK&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;UpdateExpression&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SET req_count = if_not_exists(req_count, :zero) + :inc, ttl_attr = :ttl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ConditionExpression&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;if_not_exists(req_count, :zero) &amp;lt; :limit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ExpressionAttributeValues&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;:inc&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;:zero&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;:limit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;:ttl&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;current_second&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="c1"&gt;# Expire after 1 minute to save space
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;DynamoDB incorrect rate limiting&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This doesn't work for so many reasons. I think the biggest one is that it hinges on the data being wiped once the TTL passes, but DynamoDB doesn't work like that: TTL doesn't guarantee the data is gone after that point. So if a user reached the rate limit yesterday, there might still be a value of 10 sitting in the DB. We could add some code to deal with that case, but you are far better off asking for a &lt;code&gt;Token Bucket Algorithm&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;rateLimitCheck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;TableName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;RateLimits&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;PK&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;UpdateExpression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`
        SET lastRefillTime = :now, 
            tokens = (if_not_exists(tokens, :cap)
              + (:now - if_not_exists(lastRefillTime, :now)) * :rate),
            tokens = (if_not_exists(tokens, :cap) &amp;gt; :cap ? :cap : tokens) - :one`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;ExpressionAttributeValues&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;:now&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;:rate&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;:cap&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;:one&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;ConditionExpression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
          &lt;span class="s2"&gt;`(if_not_exists(tokens, :cap)
            + (:now - if_not_exists(lastRefillTime, :now)) * :rate) &amp;gt;= :one`&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Okay, Gemini failed again, except this time it wrote a completely invalid DynamoDB expression. Do you see it? There is no &lt;code&gt;?&lt;/code&gt; ternary operator in DynamoDB...&lt;/p&gt;

&lt;p&gt;You also can't assign the same attribute twice in one expression; that doesn't work. And another problem you might find is that the clauses within the &lt;code&gt;SET&lt;/code&gt; can be executed out of order, so you can't assume an order of operations.&lt;/p&gt;

&lt;p&gt;You can play around a lot with this, and I guarantee you there is a way to achieve this, but the expressions above are not it.&lt;/p&gt;
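&lt;p&gt;For reference, the token bucket itself is simple when you are allowed an ordinary read-modify-write in application code; the difficulty is squeezing it into a single atomic DynamoDB expression. A plain in-memory version of the refill logic:&lt;/p&gt;

```javascript
// Token bucket: tokens refill continuously at `rate` per second up to
// `capacity`; a request consumes one token and is rejected when none
// are left. This is the arithmetic the DynamoDB expressions above are
// trying (and failing) to express atomically.
function createTokenBucket(rate, capacity, now = () => Date.now() / 1000) {
  let tokens = capacity;
  let lastRefill = now();
  return function tryConsume() {
    const t = now();
    // Refill based on elapsed time, capped at the bucket capacity.
    tokens = Math.min(capacity, tokens + (t - lastRefill) * rate);
    lastRefill = t;
    if (tokens < 1) return false; // bucket empty: rate limited
    tokens -= 1;
    return true;
  };
}
```

&lt;p&gt;Tokens refill continuously at &lt;code&gt;rate&lt;/code&gt; per second up to &lt;code&gt;capacity&lt;/code&gt;, and a request is rejected only when the bucket is empty. Translating those three lines of arithmetic into DynamoDB's expression language is where the cleverness is required.&lt;/p&gt;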

&lt;p&gt;And while I can share with you how to do this, and there are some quite clever things about how we have implemented interesting DynamoDB logic to explicitly &lt;a href="https://dev.to/aws-builders/idempotency-in-dynamodb-4leh"&gt;handle metrics tracking&lt;/a&gt; in Authress, the truth is we already know that this solution does not scale, and it's very difficult to get this logic right in the first place.&lt;/p&gt;

&lt;p&gt;The core aspects of the solution require:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;storing multiple values for multiple timestamps and ranges&lt;/li&gt;
&lt;li&gt;handling missing rows from the DB&lt;/li&gt;
&lt;li&gt;read before write&lt;/li&gt;
&lt;li&gt;DynamoDB Lists / Arrays&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fundamentally, you'll want to increment a set of values for the user, return all the data, then aggregate and decide what to do in the authorizer.&lt;/p&gt;
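&lt;p&gt;A sketch of that aggregation, done in memory for clarity (in DynamoDB the timestamps would live in a list attribute, appended atomically and returned by the update so the caller can aggregate):&lt;/p&gt;

```javascript
// Sliding-window aggregation: record request timestamps per user, then
// decide by counting how many fall inside the window.
function createSlidingWindow(limit, windowMs, now = Date.now) {
  const hits = new Map(); // userId -> array of request timestamps
  return function allow(userId) {
    const t = now();
    const cutoff = t - windowMs;
    // Drop timestamps that have aged out of the window, then record this one.
    const recent = (hits.get(userId) || []).filter(ts => ts > cutoff);
    recent.push(t);
    hits.set(userId, recent);
    return recent.length <= limit;
  };
}
```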

&lt;p&gt;Authorizer? I never really talked about where you would even run this code. You need to see every request coming into your service, but your authorizer is likely caching, which means the authorizer can't do it: it will only see one request per TTL, not all of them. So the counting has to happen in your service. And if that's where it happens, then your APIGW and Lambda (or your container service) receive the full request and process at least part of it before blocking the rate-limited request. That consumes more resources, potentially defeating the purpose of rate limiting in the first place.&lt;/p&gt;

&lt;h2&gt;
  
  
  The custom gateway alternative
&lt;/h2&gt;

&lt;p&gt;At this point, you might be thinking: forget AWS-native solutions, I'll just throw a reverse proxy in front of my API. There's got to be something out there that handles this out of the box. And there are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.nginx.com/nginx/admin-guide/security-controls/controlling-access-proxied-http/" rel="noopener noreferrer"&gt;Nginx&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://doc.traefik.io/traefik/middlewares/http/ratelimit/" rel="noopener noreferrer"&gt;Traefik&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/rate_limit_filter" rel="noopener noreferrer"&gt;Envoy&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All have rate limiting built in. Battle-tested, widely deployed, and documented to varying degrees. This is a solved problem in the non-cloud world.&lt;/p&gt;

&lt;p&gt;If you have a fixed fleet of servers sitting behind a load balancer, cloud or not, a reverse proxy with rate limiting works beautifully. The proxy sees every request, keeps counters in memory, and blocks or passes in microseconds. No external counter store, no latency tax, no DynamoDB bill.&lt;/p&gt;
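&lt;p&gt;For contrast, here is roughly what that world looks like: a complete per-client limit in a few lines of Nginx configuration (zone name, rates, and upstream are illustrative):&lt;/p&gt;

```nginx
# 10 MB shared-memory zone keyed by client address, steady rate 10 req/s.
limit_req_zone $binary_remote_addr zone=per_client:10m rate=10r/s;

server {
    location /api/ {
        # Absorb bursts of up to 20 extra requests; reject the rest
        # immediately with a 429 instead of queueing them.
        limit_req zone=per_client burst=20 nodelay;
        limit_req_status 429;
        proxy_pass http://backend;
    }
}
```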

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Finhnkiivjvaz0187j7sl.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Finhnkiivjvaz0187j7sl.gif" alt="We're not on my planet" width="540" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And in fact, you're in AWS. So the question is: where do you deploy this thing?&lt;/p&gt;

&lt;p&gt;Do I really need to iterate through all the bad options? I can't think of any way to not ruin a good thing by adding this to the stack. I think the truth is, it isn't something that you can add in, you need to fundamentally replace APIGW to make it work. And if you are replacing APIGW, your compute still needs something to serve HTTP and terminate TLS.&lt;/p&gt;

&lt;p&gt;Your options then are using an ALB, opening your EC2 instances up to the entire world, or running in EKS (yuck, I can't believe I said that). And as a result, I'm going to claim there is no good solution. If you do have one, please let me know. In any case, I'm skipping over this section entirely and moving on.&lt;/p&gt;

&lt;p&gt;The truth is, there is no simple way to make this work, and no way to make it work without throwing away core benefits to running in the cloud in the first place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making progress: AWS WAF
&lt;/h2&gt;

&lt;p&gt;So if you can't do it in APIGW, can't do it in your own code, and can't do it with a custom proxy — what's left?&lt;/p&gt;

&lt;p&gt;Apparently: &lt;strong&gt;"Throw a WAF at it."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And it's not wrong — &lt;a href="https://docs.aws.amazon.com/waf/latest/developerguide/waf-chapter.html" rel="noopener noreferrer"&gt;AWS WAF&lt;/a&gt; can be &lt;a href="https://docs.aws.amazon.com/waf/latest/developerguide/how-aws-waf-works-resources.html" rel="noopener noreferrer"&gt;attached to almost everything&lt;/a&gt; that receives HTTP traffic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/apigateway-control-access-aws-waf.html" rel="noopener noreferrer"&gt;API Gateway REST APIs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-integrations.html" rel="noopener noreferrer"&gt;Application Load Balancers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/appsync/latest/devguide/WAF-Integration.html" rel="noopener noreferrer"&gt;AppSync GraphQL APIs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/waf/latest/developerguide/cloudfront-features.html" rel="noopener noreferrer"&gt;CloudFront distributions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's pretty broad. And WAF evaluates rules before the request reaches your origin — which means it can block traffic at the edge, before you pay for compute. It has &lt;a href="https://docs.aws.amazon.com/waf/latest/developerguide/waf-rule-statement-type-rate-based.html" rel="noopener noreferrer"&gt;rate-based rules&lt;/a&gt; which look promising and it has &lt;a href="https://docs.aws.amazon.com/waf/latest/developerguide/waf-rule-statement-type-rate-based-high-level-settings.html" rel="noopener noreferrer"&gt;aggregate keys&lt;/a&gt; that let you group requests by custom dimensions. It sounds like exactly the right tool for per-user rate limiting.&lt;/p&gt;

&lt;p&gt;It isn't.&lt;/p&gt;

&lt;p&gt;Or at least not entirely, so of course we need to review why not.&lt;/p&gt;

&lt;h3&gt;
  
  
  WAF Rate-based rules
&lt;/h3&gt;

&lt;p&gt;Rate-based rules let you aggregate on properties of incoming requests, and use those aggregates to block requests. For the most part, this is fire-and-forget: you don't think about users, you don't think about endpoints, and often you don't even think about APIs or services.&lt;/p&gt;

&lt;p&gt;And that's sort of the problem. It's great for blocking threat actors and malicious attacks, as &lt;a href="https://authress.io/knowledge-base/articles/2025/11/01/how-we-prevent-aws-downtime-impacts#helpful-rate-limiting" rel="noopener noreferrer"&gt;I have talked about at length&lt;/a&gt;. But it's just not great when you need granularity, for a few different reasons.&lt;/p&gt;

&lt;p&gt;The goal of rate limiting is to restrict access to our resources, our endpoints, and our services as much as possible without involving the compute origin backend. Another problem is that the WAF cannot even be attached to an AWS API Gateway HTTP API (v2). Unfortunate. But even if it could be, we will see some problems with that as soon as we get to the implementation.&lt;/p&gt;

&lt;p&gt;So let's try it out.&lt;/p&gt;

&lt;h3&gt;
  
  
  Examples
&lt;/h3&gt;

&lt;p&gt;Here's a starting example that helps make it clear how rules work. Let's rate limit by user IP address:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PerIPPerEndpointRateLimit&lt;/span&gt;
&lt;span class="na"&gt;Priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Block&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
&lt;span class="na"&gt;Statement&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;RateBasedStatement&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;600&lt;/span&gt;
    &lt;span class="na"&gt;EvaluationWindowSec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
    &lt;span class="na"&gt;AggregateKeyType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CUSTOM_KEYS&lt;/span&gt;
    &lt;span class="na"&gt;CustomKeys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;IP&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;UriPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;TextTransformations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
              &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NONE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So this is nice and likely handles almost all of the scenarios you might run into. It allows 600 requests over 60 seconds, which means burst handling is also included.&lt;/p&gt;

&lt;p&gt;But that's where the usefulness ends. You know how we might want different rates for different endpoints?&lt;/p&gt;

&lt;p&gt;Well, there are three problems that fall out here:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The WAF is decoupled from our application, so it doesn't understand that &lt;code&gt;GET /orders/order_001&lt;/code&gt; has the same route as &lt;code&gt;GET /orders/order_002&lt;/code&gt;, and they are both really &lt;code&gt;GET /orders/{order_id}&lt;/code&gt;. There is no way to bridge this gap. It may or may not be your desired state.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The second problem, which you might have guessed, is that WAFs have a rule capacity. Not surprising, as all of these rules need to run on every request. If you don't hit one of the other &lt;a href="https://docs.aws.amazon.com/waf/latest/developerguide/limits.html" rel="noopener noreferrer"&gt;rate-based rule quotas&lt;/a&gt;, then you surely will hit the rule limit of 5000 Web ACL Capacity Units (WCU). The next WAF rule below comes out to be &lt;code&gt;62 WCU&lt;/code&gt;, which means we would get about &lt;code&gt;80 rules&lt;/code&gt;, and that's before all the other things you might want to throw in there. In our production environment, we are using &lt;code&gt;~164 WCU&lt;/code&gt; for one of them, which would allow only about &lt;code&gt;30 rules&lt;/code&gt;. There is such a thing as &lt;a href="https://docs.aws.amazon.com/waf/latest/developerguide/waf-rule-scope-down-statements.html" rel="noopener noreferrer"&gt;WAF Rule Scope-Down statements&lt;/a&gt;, but I don't know how much that would help in practice here. It isn't the same endpoint transparency you would get with an application-level solution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Even if we somehow managed to get around the first two things, here's the kicker: we don't want an IP address. Hopefully it is obvious why, but if not, let me clue you in. Are you building a solution where you will have customers or users connecting from the same location, physical address, or business address? Businesses usually have a small fixed set of available IPs, demand to allowlist small IPv4 CIDR blocks, and heavily reuse them. Or maybe you have a client using a cloud provider or VPN where addresses are shared.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For this last one, maybe that's all a feature and not a bug. But if you are like me, this isn't really the right approach, and you want something more robust and, more importantly, &lt;strong&gt;more accurate&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For users, you probably want to at least switch over to the JWT that is being sent. And we can do that by making this change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PerUserPerEndpointRateLimit&lt;/span&gt;
&lt;span class="na"&gt;Priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Block&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
&lt;span class="na"&gt;Statement&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;RateBasedStatement&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;600&lt;/span&gt;
    &lt;span class="na"&gt;EvaluationWindowSec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
    &lt;span class="na"&gt;AggregateKeyType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CUSTOM_KEYS&lt;/span&gt;
    &lt;span class="na"&gt;CustomKeys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Header&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;Name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authorization&lt;/span&gt;
          &lt;span class="na"&gt;TextTransformations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
              &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NONE&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;UriPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;TextTransformations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
              &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NONE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that's great, and works for the most part. If you are happy with it, then I'm happy for you.&lt;/p&gt;

&lt;p&gt;But I'm not happy with it, because tokens expire, and furthermore users might have multiple tokens generated in different ways for any number of reasons.&lt;/p&gt;

&lt;p&gt;I want to caution that none of the WAF configuration I'm discussing here has anything to do with malicious threat actors; those require a completely different perspective. The goal up until this point is simply to prevent users from accidentally using too much of our service in ways we would expect, like clicking refresh too many times or sending too many emails, not attempting to bypass our rate limiting.&lt;/p&gt;

&lt;p&gt;So utilizing the authorization token header isn't great. And as a matter of fact, this is also a problem for us regarding threat actors, as they can fire off unlimited requests with different JWTs. See, nothing about the above rule actually checks that the authorization token is valid. So while a single token will be blocked after 600 requests in a minute, it won't stop users from using fabricated tokens or even sending garbage in the header.&lt;/p&gt;

&lt;p&gt;And even for legitimate users, the counting is just wrong. A user makes 100 requests with token A. Token expires, they refresh, now they have token B. They make 100 more requests. That's 200 requests from the same user, but WAF sees two separate aggregation buckets at 100 each, neither triggers the limit. The user is never rate limited, they just happen to rotate credentials frequently enough to stay under the per-token threshold.&lt;/p&gt;
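&lt;p&gt;The undercounting is easy to demonstrate. Here is a toy simulation (my own illustration, not WAF's actual implementation) counting by token versus by user:&lt;/p&gt;

```python
from collections import Counter

LIMIT = 600  # requests allowed per evaluation window

def blocked(counts, key):
    counts[key] += 1
    return counts[key] > LIMIT

per_token = Counter()  # what WAF sees: one bucket per Authorization header
per_user = Counter()   # what we want: one bucket per user ID

# One user makes 500 requests with token A, refreshes, then 500 with token B.
events = [("user_1", "token_A")] * 500 + [("user_1", "token_B")] * 500

token_blocks = sum(blocked(per_token, token) for _, token in events)
user_blocks = sum(blocked(per_user, user) for user, _ in events)

# Neither per-token bucket ever exceeds 600, so WAF blocks nothing;
# a per-user bucket would have blocked the last 400 requests.
```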

&lt;p&gt;&lt;strong&gt;Can we actually get to user ID based rate limiting using a WAF?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And before I answer that question, I want to revisit the premise of this whole article. (Of course, if you are impatient, feel free to skip the next section and scroll directly to Alternative 2: WAF + User IDs.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Alternative 1: Is rate limiting required?
&lt;/h2&gt;

&lt;p&gt;Usage plans, counter stores, WAF rules, custom proxies. Every solution so far adds infrastructure, and each one comes with its own scaling problems, its own cost, its own failure modes. And every one of them is a permanent operational commitment. Someone has to understand it, debug it, and evolve it. Forever. I believe everyone who has made it to this section seriously understands the maintenance burden of running "free software" in production.&lt;/p&gt;

&lt;p&gt;So before we get to the clever solution, take a step back. What does it actually cost you to NOT block these requests? Specifically, how much actual money does it cost to do nothing? How many meetings do you no longer need to have, epics on kanban boards that never need to be created, Objectives and Key Results that never need to be discussed? How much exactly will letting those requests in cost you? DB queries, compute wall time, requests logged?&lt;/p&gt;

&lt;p&gt;Does it cost hundreds, thousands, or millions of euros, francs, dollars? Understand your goal posts. This will also help you identify optimizations that should be in scope, versus ones that don't need to be. And we'll see these scenarios below:&lt;/p&gt;

&lt;h3&gt;
  
  
  Async: the architecture alternative
&lt;/h3&gt;

&lt;p&gt;If your concern is "users hammer an endpoint and cause expensive work", what if the work is all async?&lt;/p&gt;

&lt;p&gt;That is, what if it were possible to move all the expensive operations behind a queue? A common failure mode of inexperienced engineers is asking for the APIGW processing time and the Lambda execution limit to be increased beyond their respective 30-second and 15-minute maximums. Having the capability to do something that is costly, because &lt;em&gt;it is cheap to implement&lt;/em&gt;, is the crux of the most common human pit of failure.&lt;/p&gt;

&lt;p&gt;Easy things are more likely to be done, even when they are wrong. The right thing is never done when it is harder.&lt;/p&gt;

&lt;p&gt;And &lt;a href="https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/welcome.html" rel="noopener noreferrer"&gt;SQS&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-what-is.html" rel="noopener noreferrer"&gt;EventBridge&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/welcome.html" rel="noopener noreferrer"&gt;Step Functions&lt;/a&gt;, all support strategies to handle async processing of incoming requests. This makes doing the right thing easy.&lt;/p&gt;

&lt;p&gt;But you can't just throw the incoming requests into these async background queues and hope everything works out. Even so, taking just that first step will take an incredible load off of your critical-path compute. Imagine having non-critical infrastructure be overwhelmed instead of your critical production endpoints.&lt;/p&gt;

&lt;p&gt;Further, with even a little bit of thinking you might realize that async infrastructure components support deduplication by design. This means the same user submitting the same thing 100 times results in only 1 processed item. Use &lt;a href="https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/using-messagededuplicationid-property.html" rel="noopener noreferrer"&gt;SQS content-based deduplication&lt;/a&gt; or a simple idempotency key in DynamoDB, EventBridge, or Step Functions, and suddenly you don't care how many times they call the endpoint, because the work is never performed more than once.&lt;/p&gt;

&lt;p&gt;This fundamental strategy works out of the box in 99% of cases. And in the few where it doesn't, read up on &lt;a href="https://en.wikipedia.org/wiki/Idempotence" rel="noopener noreferrer"&gt;Idempotency&lt;/a&gt;. The processing is rarely duplicated, and even when it is, there is no harm to your system. This converts expensive malicious or negligent writes back into simple and inexpensive ones.&lt;/p&gt;
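&lt;p&gt;The idempotency-key pattern itself is tiny. In AWS you'd typically back it with a DynamoDB conditional write; here is an in-memory sketch of the logic, with names of my own choosing:&lt;/p&gt;

```python
processed = {}  # idempotency_key -> result; stands in for a DynamoDB table

def handle_submission(idempotency_key, payload, process):
    """Process each submission at most once, no matter how often it arrives.

    With DynamoDB this would be a PutItem guarded by
    attribute_not_exists(idempotency_key) instead of a dict lookup.
    """
    if idempotency_key in processed:
        return processed[idempotency_key]  # duplicate: return the cached result
    result = process(payload)
    processed[idempotency_key] = result
    return result

calls = []
def send_email(payload):
    calls.append(payload)
    return f"sent:{payload}"

# The same user hammers the endpoint 100 times with the same request.
for _ in range(100):
    handle_submission("user_1:welcome-email", "welcome", send_email)

# The expensive work ran exactly once; 99 requests were absorbed for free.
```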

&lt;p&gt;And while this doesn't work for reads, your writes, which are usually the expensive operations you're trying to protect, are solved.&lt;/p&gt;

&lt;h3&gt;
  
  
  Batching
&lt;/h3&gt;

&lt;p&gt;I know, I didn't want to say it. It's such a dirty word: &lt;code&gt;batch endpoints&lt;/code&gt;. For almost two decades, I've been a staunch opponent of batch. Fundamentally, that's because batches are an anti-pattern in REST APIs (the real kind, not the APIGW kind). They break resource-oriented design, complicate error handling, and make caching impossible. But if your users legitimately need to perform N operations, a batch endpoint lets them do it in 1 request instead of N. You've reduced the request volume at the source, not by blocking, but by making the efficient path the easy path.&lt;/p&gt;

&lt;p&gt;You might be asking yourself why I was so against batch operations, and what made me change my mind. The first part is simple; like most incorrect uses of technology, inexperienced engineers optimizing for made-up problems tend to switch to &lt;strong&gt;batch&lt;/strong&gt;, just like they switch to &lt;strong&gt;websockets&lt;/strong&gt;, &lt;strong&gt;GraphQL&lt;/strong&gt;, or &lt;strong&gt;K8s&lt;/strong&gt;. Sure, there are reasonable use cases, but most of the time it's led by misunderstanding rather than conscious thought. Batches are often a concept of the business domain, product management, or a UX decision that creeps into the API design. They get accepted because, at the same time, an engineer says "I don't want to make more than one API call from the UI on any user action".&lt;/p&gt;

&lt;p&gt;But as soon as you let go of the notion that &lt;em&gt;a UI can only make one request at a time&lt;/em&gt;, you realize that a batch endpoint may not be necessary at all. GraphQL is so bad that they even admit themselves how useless a technology it is for most of the web in their &lt;a href="https://graphql.org/blog/2026-04-01-a-new-era-for-graphql-observability/" rel="noopener noreferrer"&gt;April 1st post&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Discard your principles and stick to actual real business problems, not the imagined ones of 10 years from now.&lt;/p&gt;

&lt;p&gt;The second part, about why I changed my mind, is that there is a great way to handle batches in a RESTful-API-compliant way. You create a resource called &lt;code&gt;batches&lt;/code&gt; and let users &lt;code&gt;POST /batches&lt;/code&gt; or &lt;code&gt;POST /batch-processing&lt;/code&gt;, which takes all the necessary inputs, and you can validate it there. You let the notion of a batch be a resource itself. Once you do that, the API semantics work out of the box again. The only wrong thing to do is to create an endpoint called &lt;code&gt;/orders:batch&lt;/code&gt; which takes a batch of orders. There's a reason why in REST the route should be a plural noun, and &lt;code&gt;/orders:batch&lt;/code&gt; isn't it. And if you aren't sure why that is, please read &lt;a href="https://amzn.to/4cayHx3" rel="noopener noreferrer"&gt;Building Microservices by Sam Newman&lt;/a&gt;.&lt;/p&gt;
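&lt;p&gt;A minimal sketch of the batch-as-a-resource pattern, with hypothetical names: &lt;code&gt;POST /batches&lt;/code&gt; creates a batch resource that clients can then &lt;code&gt;GET&lt;/code&gt; by ID, so the normal REST semantics apply:&lt;/p&gt;

```python
import uuid

batches = {}  # batch_id -> batch resource; stands in for a real data store

def create_batch(order_items):
    """POST /batches: the batch itself is the resource being created."""
    batch_id = f"batch_{uuid.uuid4().hex[:8]}"
    batches[batch_id] = {
        "batchId": batch_id,
        "status": "PENDING",   # processed async, e.g. off an SQS queue
        "items": list(order_items),
    }
    return batches[batch_id]

def get_batch(batch_id):
    """GET /batches/{batch_id}: poll the resource for status and results."""
    return batches[batch_id]

batch = create_batch([{"sku": "A1"}, {"sku": "B2"}])
# One request carries N items; the client polls GET /batches/{id} for progress.
```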

&lt;p&gt;And with this simple shift in mindset and perspective, the rate limit conversation changes from "how do I block 1,000 requests/min" to "how do I handle 1 batch request/min containing 1000 items, how do we add throttling in our architecture, and how do we manage errors in a user transparent way?"&lt;/p&gt;

&lt;p&gt;Sure not all of these questions are easy to answer, but all of them are easier to answer than &lt;strong&gt;how do we rate limit our users so that different users get different rates to different endpoints, and everyone is happy about it?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your architecture can absorb the traffic through deduplication, batching, or idempotency, you've solved the problem at a layer that doesn't require edge infrastructure, WAF rules, or HMAC cookies. Not every API can do this, but check before you build.&lt;/p&gt;

&lt;h3&gt;
  
  
  The hidden cost
&lt;/h3&gt;

&lt;p&gt;The flip side of the whole cost calculation is that the rate-limiting infrastructure itself costs money. As identified earlier, most rate-limiting solutions aren't free; they aren't even cheap. And they scale weirdly. Attackers cost you lots of money, as do the users that pay you, and everyone takes a cut along the way. To enforce a rate limit of 10 RPS, in a world where everyone uses exactly that limit, you are paying for 10 RPS per user in DynamoDB writes with no benefit. That is, no one needs to be rate limited in that world, and yet you are running expensive infra that provides no value. &lt;strong&gt;ROI = Negative&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That &lt;strong&gt;10 RPS is about $32.40&lt;/strong&gt; per user per month (10 RPS over a 30-day month is ~25.9M writes, at ~$1.25 per million DynamoDB &lt;code&gt;Write Request Units&lt;/code&gt;). There is no way this works as a solution for businesses with mostly non-enterprise software customers, let alone for private consumers. This also tells you why so many B2B applications require a sales call before letting you onto the product in the first place. If you see &lt;code&gt;talk to sales&lt;/code&gt; as part of onboarding, you can be sure their technology stack is not built to stand up to users accidentally calling their API too much, let alone attacks from threat actors.&lt;/p&gt;
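&lt;p&gt;The arithmetic behind that number, if you want to plug in your own rates:&lt;/p&gt;

```python
# Cost of counting every request as a DynamoDB write, at on-demand pricing.
RPS = 10
SECONDS_PER_MONTH = 30 * 24 * 60 * 60          # 30-day month
USD_PER_MILLION_WRU = 1.25                     # on-demand write request units

writes_per_month = RPS * SECONDS_PER_MONTH     # 25,920,000 writes
cost_per_user = writes_per_month / 1_000_000 * USD_PER_MILLION_WRU
# cost_per_user -> 32.40 USD per user per month
```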

&lt;p&gt;&lt;em&gt;(I do have to admit, that number is a bit conservative, and it could be less depending on whether you are using provisioned or reserved capacity, but I don't think it is a real strategy to make business decisions based on optimizing calculations like this. Case in point: you don't actually know what the usage is going to be in the end, so you can't go out and pre-provision your rate-limiting infrastructure; you have to pay for it on-demand in the first place.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In any case, as I mentioned before, what needs to be evaluated is the cost of letting attackers call your endpoints, or users abusing your service resources, compared to the cost of maintaining a solution. Sure, you might only be charging &lt;strong&gt;$5 / month&lt;/strong&gt; for your service, but if the rate-limiting solution is going to cost $32.40 per user per month, you can't afford not to consider what your actual cost is. If it costs $0.10 / user / month at the 10 RPS, you can afford a roughly 300-fold increase in attack surface before even getting to the point of implementing rate limiting. And that's assuming every single user is malicious.&lt;/p&gt;

&lt;p&gt;Let's break it down.&lt;/p&gt;

&lt;p&gt;If you have 1000 users, each paying $5 / month, that's &lt;strong&gt;$5k / month&lt;/strong&gt; in revenue. As a baseline, your cost is only &lt;strong&gt;$100 / month&lt;/strong&gt;. Your net profit is &lt;strong&gt;$4.9k&lt;/strong&gt; per month.&lt;/p&gt;

&lt;p&gt;At an infrastructure cost of &lt;strong&gt;$0.10 / month&lt;/strong&gt; per &lt;strong&gt;10 RPS&lt;/strong&gt;, eating your entire profit would require &lt;strong&gt;490,000 RPS&lt;/strong&gt; of sustained malicious requests. There is no way that is going to happen at this scale. If you set a global rate limit of 100 RPS per user, you won't even need to think about this problem ever again. No API keys, no magic infrastructure; just ignore the problem.&lt;/p&gt;
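&lt;p&gt;That break-even point falls straight out of the numbers above:&lt;/p&gt;

```python
# How much malicious traffic it takes to eat the profit of a small SaaS.
users = 1000
revenue = users * 5.0                # $5/month per user
baseline_cost = 100.0
profit = revenue - baseline_cost     # 4900.0 per month

usd_per_10_rps = 0.10                # infrastructure cost per 10 RPS per month
break_even_rps = profit / usd_per_10_rps * 10
# break_even_rps -> 490,000 RPS of sustained abuse before profit hits zero
```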

&lt;h3&gt;
  
  
  The business solution
&lt;/h3&gt;

&lt;p&gt;Okay, I know you don't read my stuff for the business perspective; I write on technology after all. But hear me out: in this case, I want you to consider why you are even building the service API you are making. It of course exists to make money* (and I say money*, because maybe it exists for some social benefit, or to help a cause you really care about. That is value for you; it might not be currency, but it is value).&lt;/p&gt;

&lt;p&gt;So, you want to convince users to pay more money, as money solves everything. If they paid more, it means you could buy more capacity for your database or other constrained architecture. If you have more capacity, you don't need to limit incoming requests as much, which means that essentially "higher plans = higher costs", but at the same time "more money = more resources".&lt;/p&gt;

&lt;p&gt;At the end of the day, you want users to pay more, rather than pay less and rate limit. Rate limiting isn't a smart business strategy, it's a terrible one. But it is a technical means to an end that might make users purchase a premium plan. Do you even need to rate limit, or do you need to just tell your users that you will rate limit them?&lt;/p&gt;

&lt;p&gt;The real solution usually instead looks like:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Throw a CDN, such as CloudFront, on top of your API&lt;/li&gt;
&lt;li&gt;Use CDN logging to track how many requests, data, and usage customers are using&lt;/li&gt;
&lt;li&gt;Convert that into trackable metrics&lt;/li&gt;
&lt;li&gt;Send emails to users that are reaching or have reached the next premium plan usage volume&lt;/li&gt;
&lt;li&gt;Pray that they upgrade&lt;/li&gt;
&lt;li&gt;And if they don't upgrade you can rate limit them after the fact, or terminate their free account, I don't know, you do you.&lt;/li&gt;
&lt;li&gt;Or because rate limiting is expensive (see the previous section), just allow them to use your service below cost and hope that causes them to tell all their friends about it, or bring it to their companies, who will then pay you for it. And pay a lot.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When you think like the business, you focus on real problems, rather than just the technical ones, which might not be the right ones to focus on in the first place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Alternative 2: WAF + User IDs
&lt;/h2&gt;

&lt;p&gt;So we need the WAF to aggregate on the user ID.&lt;/p&gt;

&lt;p&gt;Can we parse it out of the Authorization Header? The header already contains a JWT, and JWTs contain user IDs. As a refresher the current rule looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PerUserPerEndpointRateLimit&lt;/span&gt;
&lt;span class="na"&gt;Priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Block&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
&lt;span class="na"&gt;Statement&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;RateBasedStatement&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;600&lt;/span&gt;
    &lt;span class="na"&gt;EvaluationWindowSec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
    &lt;span class="na"&gt;AggregateKeyType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CUSTOM_KEYS&lt;/span&gt;
    &lt;span class="na"&gt;CustomKeys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Header&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;Name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authorization&lt;/span&gt;
          &lt;span class="na"&gt;TextTransformations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
              &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NONE&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;UriPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;TextTransformations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
              &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NONE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that &lt;code&gt;TextTransformations&lt;/code&gt; object looks pretty appealing. Remember, JWTs are three base64url-encoded strings joined together using the separator &lt;code&gt;.&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;eyJhbGciOiJFZERTQSIsImtpZCI6InB2ZTQ3OGlHU3g4VzJnc3p6UVlta1QiLCJ0eXAiOiJhdCtqd3QifQ.eyJpc3MiOiJodHRwczovL2xvZ2luLmF1dGhyZXNzLmlvIiwic2NvcGUiOiJvcGVuaWQgcHJvZmlsZSBlbWFpbCIsInN1YiI6InV
zZXJfaWQiLCJpYXQiOjE2ODUwMjEzOTAsImV4cCI6MTY4NTEwNzc5MCwiYXVkIjpbImh0dHBzOi8vYXBpLmF1dGhyZXNzLmlvIl19.
ciKCNA8PzPfKGGEiGVbbOumGu64Ft55Sh0lOl8IBl9KEuYUaSCw
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And here are &lt;a href="https://docs.aws.amazon.com/waf/latest/developerguide/waf-rule-statement-transformation.html" rel="noopener noreferrer"&gt;all the valid transformations&lt;/a&gt; we can use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BASE64_DECODE
BASE64_DECODE_EXT
CMD_LINE
COMPRESS_WHITE_SPACE
CSS_DECODE
ESCAPE_SEQ_DECODE
HEX_DECODE
HTML_ENTITY_DECODE
JS_DECODE
LOWERCASE
MD5
NONE
NORMALIZE_PATH
NORMALIZE_PATH_WIN
REMOVE_NULLS
REPLACE_COMMENTS
REPLACE_NULLS
SQL_HEX_DECODE
URL_DECODE
URL_DECODE_UNI
UTF8_TO_UNICODE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can get really close with &lt;code&gt;BASE64_DECODE_EXT&lt;/code&gt;, which handles base64url, and you might think we can plug the JWT straight into it as &lt;code&gt;BASE64_DECODE_EXT(AuthorizationHeader)&lt;/code&gt;. Except for two things: first, we actually need those &lt;em&gt;invalid base64&lt;/em&gt; characters, because they are valid base64url, just not valid base64. And second, if you try to decode the above JWT without first splitting it on the separator, you'll end up with this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"alg"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"EdDSA"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"kid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"pve478iGSx8W2gszzQYmkT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"typ"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"at+jwt"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;\x&lt;/span&gt;&lt;span class="mi"&gt;07&lt;/span&gt;&lt;span class="err"&gt;�&amp;amp;�&lt;/span&gt;&lt;span class="mi"&gt;72&lt;/span&gt;&lt;span class="err"&gt;#�&amp;amp;�GG\x&lt;/span&gt;&lt;span class="mi"&gt;073&lt;/span&gt;&lt;span class="err"&gt;�����v���\x&lt;/span&gt;&lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="err"&gt;WF�&amp;amp;W&lt;/span&gt;&lt;span class="mi"&gt;72&lt;/span&gt;&lt;span class="err"&gt;��&lt;/span&gt;&lt;span class="s2"&gt;"�'66�&lt;/span&gt;&lt;span class="se"&gt;\x&lt;/span&gt;&lt;span class="s2"&gt;06R#�&amp;amp;�&lt;/span&gt;&lt;span class="se"&gt;\x&lt;/span&gt;&lt;span class="s2"&gt;06V�B&lt;/span&gt;&lt;span class="se"&gt;\x&lt;/span&gt;&lt;span class="s2"&gt;07&lt;/span&gt;&lt;span class="se"&gt;\x&lt;/span&gt;&lt;span class="s2"&gt;07&amp;amp;�f��R&lt;/span&gt;&lt;span class="se"&gt;\x&lt;/span&gt;&lt;span class="s2"&gt;06V�&lt;/span&gt;&lt;span class="se"&gt;\x&lt;/span&gt;&lt;span class="s2"&gt;16��"&lt;/span&gt;&lt;span class="err"&gt;�'&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="err"&gt;V&lt;/span&gt;&lt;span class="s2"&gt;"#�'
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything up until the separator works, for the most part. But everything after that, the part that includes the user ID (&lt;code&gt;sub&lt;/code&gt;), turns into garbage.&lt;/p&gt;

&lt;p&gt;And even if we could get that out, we'd be left with a string like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;`&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"iss"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"https://login.authress.io"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"scope"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"openid profile email"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"sub"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"iat"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1685021390&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"exp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1685107790&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"aud"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="s2"&gt;"https://api.authress.io"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;&lt;span class="err"&gt;`&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There'd have to be some other way to find a match inside a string.&lt;/p&gt;

&lt;p&gt;To recap, we are missing the following functions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;split on delimiter&lt;/li&gt;
&lt;li&gt;BASE64 Decode for URLs — not just plain base64 decode&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Find in string()&lt;/code&gt; OR &lt;code&gt;string to JSON()&lt;/code&gt; OR &lt;code&gt;property of JSON string()&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And we have none of them.&lt;/p&gt;
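&lt;p&gt;For contrast, everything on that list is trivial outside the WAF. A sketch in Node.js (claim extraction only, no signature verification, so never trust this for authorization):&lt;/p&gt;

```javascript
// What the WAF cannot do, in a few lines of Node.js: split the JWT on '.',
// base64url-decode the payload segment, and read the user ID (sub) claim.
function extractSub(jwtString) {
  const payloadSegment = jwtString.split('.')[1];            // split on delimiter
  const payloadJson = Buffer.from(payloadSegment, 'base64url').toString('utf8'); // base64url decode
  return JSON.parse(payloadJson).sub;                        // property of JSON string
}
```

&lt;p&gt;All three missing primitives show up, one per line, which is exactly why this has to happen in compute rather than in a WAF rule.&lt;/p&gt;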

&lt;h3&gt;
  
  
  Starting from scratch
&lt;/h3&gt;

&lt;p&gt;And with that we are back to the baseline of somehow using a WAF, but we aren't sure how to plug values into it. The only trick in the book left is first generating a custom property and then looking at that custom property in the WAF. If we could get the &lt;code&gt;user ID&lt;/code&gt; into a header that could be sent on every request, we could just use that.&lt;/p&gt;

&lt;p&gt;Any chance you can convince all your users, unauthenticated visitors, and malicious threat actors to kindly offer their identity in a nice header called &lt;code&gt;x-ratelimit-user-id&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;Well, I suppose not, so we'll need to do that for them. The AWS WAF supports rules that look at a custom header, so you can &lt;a href="https://docs.aws.amazon.com/waf/latest/developerguide/waf-rule-statement-type-rate-based-high-level-settings.html" rel="noopener noreferrer"&gt;aggregate on a custom header&lt;/a&gt;. We can call it &lt;code&gt;x-ratelimit-user-id&lt;/code&gt;, and at the very least we can take the first step of updating our WAF rule to depend on this new option.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PerUserRateLimit&lt;/span&gt;
&lt;span class="na"&gt;Priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Block&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
&lt;span class="na"&gt;Statement&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;RateBasedStatement&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;600&lt;/span&gt;
    &lt;span class="na"&gt;EvaluationWindowSec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
    &lt;span class="na"&gt;AggregateKeyType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CUSTOM_KEYS&lt;/span&gt;
    &lt;span class="na"&gt;CustomKeys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Header&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;Name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;x-ratelimit-user-id&lt;/span&gt;
          &lt;span class="na"&gt;TextTransformations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
              &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NONE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So now there are two goals: first, figure out how to calculate the &lt;code&gt;x-ratelimit-user-id&lt;/code&gt;, and second, figure out how to get it into every request.&lt;/p&gt;

&lt;p&gt;No matter how your architecture is set up or what sort of product or service you are offering, your traffic is going to fall into two categories:&lt;/p&gt;

&lt;h4&gt;
  
  
  Category 1: Unauthenticated Users
&lt;/h4&gt;

&lt;p&gt;Since WAFs in AWS always execute &lt;em&gt;before&lt;/em&gt; your origin compute is run, there must be some endpoint exposed that allows returning valid &lt;code&gt;x-ratelimit-user-id&lt;/code&gt;s while at the same time not blocking requests for lack of them.&lt;/p&gt;

&lt;p&gt;One way to achieve this is to offer an endpoint like &lt;code&gt;GET /ratelimit-user-id&lt;/code&gt;, which returns the value. The client can then hard-code that value into its requests.&lt;/p&gt;
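&lt;p&gt;As a sketch, with an illustrative route name and Express-style handler shape (not a prescribed API), such an endpoint could be as small as:&lt;/p&gt;

```javascript
// Hypothetical handler for GET /ratelimit-user-id: returns the value the
// client must echo back in the x-ratelimit-user-id header on later requests.
function rateLimitUserIdHandler(req, res) {
  // Authenticated callers get their user ID; unauthenticated callers fall
  // back to something request-derived, such as their IP address.
  const userId = req.user ? req.user.sub : `anonymous-${req.ip}`;
  res.json({ rateLimitUserId: userId });
}
```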

&lt;p&gt;Another strategy would be a shared algorithm where clients know they must stick the &lt;code&gt;x-ratelimit-user-id&lt;/code&gt; in all requests or be automatically blocked by your solution.&lt;/p&gt;

&lt;p&gt;If you have a dedicated UI, or even if you only have API-based user interactions, it's easy to bake this into the SDKs or UIs you create. The problem with this approach is that clients know how to bypass your setup. Actually, you might start to see the real problem here: clients will always be able to bypass your strategy by just passing in an arbitrary &lt;code&gt;x-ratelimit-user-id&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For malicious threat actors, this is no obstacle at all. But everyone else, whose goal isn't to DDoS your service, wants to get value out of your endpoints. Which means it is enough to block all invalid &lt;code&gt;x-ratelimit-user-id&lt;/code&gt;s in your origin.&lt;/p&gt;

&lt;p&gt;Since we have no scalable way of conveying to the WAF which &lt;code&gt;x-ratelimit-user-id&lt;/code&gt;s are valid (we could of course dynamically update WAF rules with a list of valid ones, which is neither scalable nor desirable), we must resort to validating &lt;code&gt;x-ratelimit-user-id&lt;/code&gt;s outside of the WAF.&lt;/p&gt;

&lt;p&gt;This leads us to the second category.&lt;/p&gt;

&lt;h4&gt;
  
  
  Category 2: Authenticated Users
&lt;/h4&gt;

&lt;p&gt;We need a way to validate the incoming &lt;code&gt;x-ratelimit-user-id&lt;/code&gt;s to ensure that users aren't just stuffing random values into the header and hoping it works. Even if these aren't real threat actors, users who want to use your service will abuse any mechanism that grants them value. Additionally, we know we can't just use the token or some value derived from it, because users would simply generate new tokens: anyone who figures out that logging out and back in resets their limit will do exactly that.&lt;/p&gt;

&lt;p&gt;Let's assume for the moment it would be sufficient to stuff the user ID (&lt;code&gt;sub&lt;/code&gt;) from the JWT into the &lt;code&gt;x-ratelimit-user-id&lt;/code&gt; header, and then validate in your APIGW Authorizer that the user ID from the JWT matches the header. We could do that.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;ApiGatewayClient&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;authorizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;parseJwt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;authorization&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;x-ratelimit-user-id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Unauthorized&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;principalId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The WAF runs on every request, the authorizer runs on every uncached authorization request, and now your backend origin only sees incoming requests that are not rate limited and are also valid requests.&lt;/p&gt;

&lt;p&gt;I'll take this as a success, and we don't even need a custom unauthenticated endpoint, since users will know what their user ID is.&lt;/p&gt;

&lt;p&gt;Mostly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Protecting users from attacks
&lt;/h3&gt;

&lt;p&gt;Three problems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 1&lt;/strong&gt; — With the user ID in the &lt;code&gt;x-ratelimit-user-id&lt;/code&gt; we get to avoid telling the user which value to stick in there, but we are going to see that with solutions to Problems 2 and 3, we'll have to add this endpoint back in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 2&lt;/strong&gt; — We don't have the capacity to rate limit individual endpoints per user. All endpoints share the same rate limit, so a burst of &lt;code&gt;GET&lt;/code&gt;s will likely block your &lt;code&gt;POST&lt;/code&gt;s. Not great.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 3&lt;/strong&gt; — Since &lt;code&gt;x-ratelimit-user-id&lt;/code&gt; values are predictable, malicious attackers can grief your users by taking leaked user IDs, which are probably not sensitive according to you and your platform, and injecting them into their requests. Said differently, if an attacker gains knowledge of a list of your platform's user IDs, they can cause a DoS for all of those users simply by calling your endpoints and passing in the corresponding &lt;code&gt;x-ratelimit-user-id&lt;/code&gt; values. And remember, they can do this because the authorizer runs after the rate limiter.&lt;/p&gt;

&lt;p&gt;Letting malicious users deprive paying users of using your solution is not a great look. And especially not, when we actually could be doing something about it.&lt;/p&gt;

&lt;p&gt;As soon as we are in the land of customized header values that need to be kept a secret, that means generating them on the backend origin side, and thus we are back to an endpoint that is essentially: &lt;code&gt;GET /ratelimit-user-id&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;But now we are in a good place because we can securely generate a value in a way that can't be abused by attackers, is usable by all users, provides the benefits we need, and also supports whatever complexity you want, such as rate limiting per endpoint. A common strategy is to use the HMAC hashing method.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;createHmac&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;crypto&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createHmac&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sha256&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;privateSecureKey&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;endpointType&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;accountId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;AnythingElseThatMakesSense&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hex&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since your secure key is private to you and your authorizer, no one can fake &lt;code&gt;x-ratelimit-user-id&lt;/code&gt;s and you can be sure your WAF is working correctly. One other accidental benefit here is that you can even include additional prefixes in this hash to help your WAF rules scope-down to using the right rule. If the wrong rule is selected because someone messes with the hash, then your authorizer will throw an error.&lt;/p&gt;

&lt;p&gt;Downside: the authorizer needs both values in its identity cache key, because we have to rerun the authorizer every time any of the critical values used for authorization changes.&lt;/p&gt;

&lt;p&gt;At this point we've solved a number of problems that have been stacking up.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🗹 The user ID is encoded into a header that exists before the WAF evaluates and is consistent after. So the token rotation problem is solved. The HMAC is derived from the user ID, not the JWT, so it's stable across new tokens.&lt;/li&gt;
&lt;li&gt;🗹 The bootstrap problem from usage plans is solved, because unauthenticated users fall back to an alternatively defined rule for the unauthenticated endpoint for fetching new &lt;code&gt;x-ratelimit-user-id&lt;/code&gt; header values potentially using some sort of IP-based limiting.&lt;/li&gt;
&lt;li&gt;🗹 This solution also prevents against malicious attackers using known user IDs to cause rate limiting on actual users.&lt;/li&gt;
&lt;/ul&gt;
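&lt;p&gt;The first point is easy to convince yourself of: the HMAC is a function of stable identifiers only, so two different JWTs for the same user always map to the same header value. A minimal sketch, with an illustrative key:&lt;/p&gt;

```javascript
import { createHmac } from 'crypto';

// The rate limit key depends on the user ID (plus whatever else you mix in),
// never on the JWT itself, so rotating tokens cannot reset the bucket.
function rateLimitKey(userId, secretKey) {
  return createHmac('sha256', secretKey).update(userId).digest('hex');
}
```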

&lt;p&gt;We can't completely prevent consumption of origin resources by requests with broken &lt;code&gt;x-ratelimit-user-id&lt;/code&gt;s, since your authorizer is still being called. But if that result is cached, your origin isn't going to be fully saturated.&lt;/p&gt;

&lt;p&gt;That caching, however, introduces a subtle problem. What happens if you receive two requests back to back?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Request 1:&lt;/span&gt;

Authorization: Bearer JWT-token-1
Endpoint: GET /orders
x-ratelimit-user-id: HASH-1

&lt;span class="c"&gt;# And then&lt;/span&gt;
&lt;span class="c"&gt;# Request 2:&lt;/span&gt;

Authorization: Bearer JWT-token-1
Endpoint: GET /invoices
x-ratelimit-user-id: HASH-2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You need to make sure the authorizer knows to recalculate the hash for the &lt;code&gt;/invoices&lt;/code&gt; request; otherwise the cached &lt;code&gt;x-ratelimit-user-id&lt;/code&gt; result from the &lt;code&gt;/orders&lt;/code&gt; request will be reused. In fact, any user could still send a request that changes the hash manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Request Malicious Request:&lt;/span&gt;

Authorization: Bearer JWT-token-1
Endpoint: GET /orders
x-ratelimit-user-id: MANUALLY-CRAFTED-HASH
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since the authorizer was cached it will ignore all future requests until the cache expires. You need to change your authorizer identity key to include the new header as part of the request as well. For exactly how and why to do this, I've gone into quite the detail in this &lt;a href="https://authress.io/knowledge-base/articles/2025/05/25/api-gateway-authorizers-vulnerable-by-design" rel="noopener noreferrer"&gt;API Gateway Security Review&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This is an easy problem to solve, but it reduces the usefulness of authorizer caching. In most cases I recommend scaling back the caching anyway, because heavy reliance on it often means something is being overlooked on the security front.&lt;/p&gt;

&lt;p&gt;So maybe not so bad.&lt;/p&gt;

&lt;p&gt;Another problem we have to deal with is the user changing the endpoint in the request after a successful authorizer validation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Request Malicious Request:&lt;/span&gt;

Authorization: Bearer JWT-token-1
Endpoint: GET /OTHER-ENDPOINT
x-ratelimit-user-id: HASH-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we take the valid Authorization token and valid &lt;code&gt;x-ratelimit-user-id&lt;/code&gt; and pass them into a completely separate request, the authorizer result is again still cached. This tells us we either have to eat the cost of revalidating the &lt;code&gt;x-ratelimit-user-id&lt;/code&gt; in our backend origin compute, or we need to further shrink the usefulness of the authorizer cache by adding to its identity key anything that would alter the semantics of the request:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HTTP Method&lt;/li&gt;
&lt;li&gt;Request URI (or templated URI)&lt;/li&gt;
&lt;li&gt;Authorization Header&lt;/li&gt;
&lt;li&gt;And anything else you are using in your authorizer&lt;/li&gt;
&lt;li&gt;And also the x-ratelimit-user-id&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Every property used in the authorizer must be part of the identity cache key.&lt;/strong&gt;&lt;/p&gt;
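&lt;p&gt;Concretely, the authorizer check can be sketched as recomputing the HMAC over the same request properties (the mixed-in fields here are illustrative) and comparing the result in constant time:&lt;/p&gt;

```javascript
import { createHmac, timingSafeEqual } from 'crypto';

// Recompute the expected x-ratelimit-user-id from the verified user ID plus the
// request method and path, so a hash minted for one endpoint fails on another.
function computeRateLimitHeader(userId, method, path, secretKey) {
  return createHmac('sha256', secretKey)
    .update(userId)
    .update(method)
    .update(path)
    .digest('hex');
}

function isValidRateLimitHeader(headerValue, userId, method, path, secretKey) {
  const expected = Buffer.from(computeRateLimitHeader(userId, method, path, secretKey), 'hex');
  const actual = Buffer.from(String(headerValue), 'hex');
  // timingSafeEqual throws on length mismatch, so guard first.
  if (expected.length !== actual.length) {
    return false;
  }
  return timingSafeEqual(expected, actual);
}
```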

&lt;p&gt;And with that we finally have a solution that actually works, works correctly, and works at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance and reliability
&lt;/h3&gt;

&lt;p&gt;My personal gripe with the above solution is that, in practice, we do a lot of clever things in our authorizers, and forcing them to re-run on basically every request to ensure that the &lt;code&gt;x-ratelimit-user-id&lt;/code&gt; (or whatever you called it) is the correct value is a huge burden. For instance, if you call your database in your authorizer today, then since authorizer caching becomes next to useless, that database call is no longer cached either.&lt;/p&gt;

&lt;p&gt;For instance, we love to do a lookup of DNS domains to customer accounts in the &lt;a href="https://authress.io/" rel="noopener noreferrer"&gt;Authress&lt;/a&gt; authorizers. This hits the database. With the cache, this happens once per token per hour. With the above strategy this happens N times per hour per token based on the actual usage of our service by our users.&lt;/p&gt;

&lt;p&gt;You can work around this by caching data yourself in your custom Lambda authorizers with a simple in-memory cache. That also works, so there are ways around these problems.&lt;/p&gt;
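&lt;p&gt;A minimal version of that in-memory trick (names and TTL are illustrative) looks like this; the cache lives at module scope, so it survives across warm invocations of the same Lambda container:&lt;/p&gt;

```javascript
// Module-scope cache: repeated lookups (for example domain -> customer account)
// skip the database for as long as this container stays warm and the entry is fresh.
const lookupCache = new Map();
const TTL_MS = 60 * 1000;

async function cachedLookup(key, fetchFn) {
  const entry = lookupCache.get(key);
  if (entry) {
    if (entry.expires > Date.now()) {
      return entry.value;
    }
  }
  const value = await fetchFn(key);
  lookupCache.set(key, { value, expires: Date.now() + TTL_MS });
  return value;
}
```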

&lt;p&gt;What would be really nice is if there were some way to offload the whole rate-limiting infrastructure to its own segregated area, have it run well before the authorizer executes, and forget about the complexity of validating an origin-calculated rate limit header.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing CloudFront + WAF
&lt;/h2&gt;

&lt;p&gt;You can take this to another level, however, by adding a CloudFront distribution in front of your APIGW. A CloudFront opens up a couple of different solutions here. Two important things to note: adding CloudFront lets you also add CloudFront Functions and a CloudFront Functions Key-Value Store (hopefully it is obvious where this is going), and CloudFront itself accepts a WAF as a protection mechanism.&lt;/p&gt;

&lt;p&gt;Having a CloudFront as part of your architecture is almost always a best practice in AWS. It gives you the opportunity to cache requests and responses, and to receive requests over the AWS backbone rather than having them traverse the public internet before reaching your origin's API.&lt;/p&gt;

&lt;p&gt;This converts the question into, what can we do with a function running at the edge?&lt;/p&gt;

&lt;p&gt;The good thing about CloudFront Functions is that they aren't Lambda functions; they are primitive, simple javascript functions, and compared with APIGW usage plans they don't have a cold-start problem. The bad thing about them is that they are primitive, simple javascript functions. But that might not be a complete failure.&lt;/p&gt;

&lt;p&gt;There are two solution paths here. &lt;strong&gt;WAF &lt;code&gt;before&lt;/code&gt; CloudFront&lt;/strong&gt; and &lt;strong&gt;WAF &lt;code&gt;after&lt;/code&gt; CloudFront&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  WAF after
&lt;/h3&gt;

&lt;p&gt;WAF after CloudFront is easier to reason about, but harder to deal with. It requires attaching the WAF to a piece of infrastructure that accepts a WAF. That means you significantly limit your potential architecture solutions based on this decision, and it might not actually get you anything. The good part about it however is that you can spin up your CloudFront, attach a CloudFront Function, and use that to generate the &lt;code&gt;x-ratelimit-user-id&lt;/code&gt; based on the incoming request. The client never needs to know about it.&lt;/p&gt;

&lt;p&gt;Then the WAF will see this value as a header and perform the rate limiting we have so desperately wanted ever since I introduced WAF rate limiting a few minutes ago.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;crypto&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;crypto&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;authorization&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;401&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;statusDescription&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Unauthorized&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
        &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;authorization&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Bearer &lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;jwt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;token&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;base64url&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;jwt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;crypto&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createHmac&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sha256&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hash&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;jwt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;jwt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;accountId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hex&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;x-ratelimit-user-id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;hash&lt;/span&gt;
        &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Honestly, I would much prefer to do JWT verification using EdDSA public keys, but the CF function can neither access the internet, nor perform JWT signature validation ... yet. Weirdly &lt;a href="https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/cwt-support-cloudfront-functions.html" rel="noopener noreferrer"&gt;it supports CWT signature verification&lt;/a&gt;, just not for JWTs. The limited cryptographic functions it has access to can be seen by reviewing the &lt;a href="https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/functions-javascript-runtime-20.html#writing-functions-javascript-features-builtin-modules-crypto-20" rel="noopener noreferrer"&gt;AWS custom javascript runtime&lt;/a&gt;. Maybe CWTs are coming to a SaaS Identity Provider near you, but I wouldn't know anything about that.&lt;/p&gt;

&lt;p&gt;With the CF Function approach, the flow becomes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Request arrives at CloudFront edge&lt;/li&gt;
&lt;li&gt;CF Function extracts the user identity from the request Authorization header JWT and computes &lt;code&gt;HMAC(userId, privateKey)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;CF Function sets the &lt;code&gt;x-ratelimit-user-id&lt;/code&gt; header on the request&lt;/li&gt;
&lt;li&gt;WAF evaluates the rate-based rule against that header&lt;/li&gt;
&lt;li&gt;If under the limit, the request passes through to the origin&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fly6mcxqf812oih2zxilm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fly6mcxqf812oih2zxilm.png" alt="WAF After CloudFront rate limit calculation flow" width="800" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this, the client never sees or touches the &lt;code&gt;x-ratelimit-user-id&lt;/code&gt;. It's computed at the edge and aggregated at the edge; and because it is generated at the edge, it never needs to be validated. Your origin just handles business logic.&lt;/p&gt;
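&lt;p&gt;For reference, the rate-based rule in step 4 aggregates on the custom header rather than on the client IP. A sketch of what that rule could look like in WAFv2's JSON shape follows; the rule name, limit, and evaluation window are illustrative, not prescriptive:&lt;/p&gt;

```javascript
// Sketch only: a WAFv2 rate-based rule that aggregates on the
// edge-generated header. Name, Limit, and window are illustrative.
const rateLimitRule = {
    Name: 'per-user-rate-limit',
    Priority: 0,
    Statement: {
        RateBasedStatement: {
            Limit: 1000,               // max requests per key per window
            EvaluationWindowSec: 300,  // 5 minute window
            AggregateKeyType: 'CUSTOM_KEYS',
            CustomKeys: [{
                Header: {
                    Name: 'x-ratelimit-user-id',
                    TextTransformations: [{ Priority: 0, Type: 'NONE' }]
                }
            }]
        }
    },
    Action: { Block: {} },
    VisibilityConfig: {
        SampledRequestsEnabled: true,
        CloudWatchMetricsEnabled: true,
        MetricName: 'per-user-rate-limit'
    }
};
```

&lt;p&gt;Requests whose &lt;code&gt;x-ratelimit-user-id&lt;/code&gt; exceeds the limit within the window get blocked at the edge; everything else passes through to the origin untouched.&lt;/p&gt;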

&lt;p&gt;This means you can block requests before they ever reach the APIGW in the first place. And we never need to pass the blocked request, or any rate limiting configuration, on to the APIGW. Your authorizer can focus on the Authorization header JWT without paying attention to this new header.&lt;/p&gt;

&lt;p&gt;And there are other benefits. Previously, an attacker fabricating &lt;code&gt;x-ratelimit-user-id&lt;/code&gt; values would still hit your origin and your APIGW authorizer, since the WAF can't tell that the IDs are invalid. Each fabricated ID starts a fresh rate limit at zero, so the attacker never triggers the limit, and your origin gets called anyway.&lt;/p&gt;

&lt;p&gt;Now, those &lt;code&gt;x-ratelimit-user-id&lt;/code&gt;s are completely hidden from both the client and, most importantly, the origin. They are generated at the edge and passed directly to the WAF. You don't even need to secure them; a simple hash is sufficient.&lt;/p&gt;

&lt;p&gt;So there are real, very tangible benefits.&lt;/p&gt;

&lt;p&gt;It's worth calling out that &lt;a href="https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/cloudfront-functions-event-structure.html" rel="noopener noreferrer"&gt;CF Functions have compute limits&lt;/a&gt; — the execution ceiling is tight. HMAC validation at the edge is feasible but constrained, and if your function exceeds the limit, CloudFront drops the request entirely.&lt;/p&gt;

&lt;h3&gt;
  
  
  WAF before
&lt;/h3&gt;

&lt;p&gt;If there were a way to put the WAF before CloudFront, we would have solved every problem in the book. By before, I mean the WAF executes before the CloudFront Function does. It would grant us the ability to rate limit with any origin configuration, not just ones that accept a WAF, and it would allow us to do it without the user needing to understand how rate limiting works.&lt;/p&gt;

&lt;p&gt;In practice, this can be done by attaching a WAF directly to a CloudFront Distribution. And while this architecture feels the best, it opens you up to two annoying edge cases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How do we get the WAF to evaluate a header that is only generated afterwards, by the CF Function?&lt;/li&gt;
&lt;li&gt;And likewise, how does the request arrive already carrying the header, so that it is available to the WAF in the first place?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Since the WAF evaluates headers on the request as it comes from the client, the only conclusion we can draw is that the client must know the new header in order to send it. That means we need to provide it.&lt;/p&gt;

&lt;p&gt;Making this happen is a real change to your architecture strategy. As discussed before, the header could be generated by a dedicated endpoint, such as &lt;code&gt;GET /ratelimit-user-id&lt;/code&gt;. This time, though, we can rate limit that endpoint using the WAF without any origin complexity, while also generating the header in the same CloudFront Function.&lt;/p&gt;

&lt;p&gt;So, like the previous scenario, we can set up our architecture without any knowledge of the header, the rate limiting, or the origin technology. However, since the hash is now exposed to clients, we need to secure its generation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;cf&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cloudfront&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;crypto&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;crypto&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;authorization&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;401&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;statusDescription&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Unauthorized&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
        &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;authorization&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Bearer &lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;jwt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;token&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;base64url&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;keyValueStore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;cf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;kvs&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;secret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;keyValueStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hmac-secret&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;crypto&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createHmac&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sha256&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;jwt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;jwt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;accountId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hex&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;uri&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/ratelimit-user-id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;statusDescription&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;OK&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;x-ratelimit-user-id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;hash&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;x-ratelimit-user-id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;429&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;statusDescription&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Too Many Requests&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
        &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The only remaining part is to ask the client to include the returned header in every subsequent request. If all your clients are UI-based, then you can directly set a cookie instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;    &lt;span class="c1"&gt;// replaces the early return in the /ratelimit-user-id branch above&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;uri&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/ratelimit-user-id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;statusDescription&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;OK&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;set-cookie&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; 
                    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`RateLimitUserId=&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;; Path=/; Secure; HttpOnly`&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And honestly, even if you don't only have UI clients, you can embed this logic into all your SDKs, and still provide a lower rate limit for requests that don't include the header. That is, from a business standpoint, you can offer a low default rate limit, block everything above it, and still give users a reasonable upgrade path, without needing to change your architecture at all.&lt;/p&gt;
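&lt;p&gt;As a sketch of what that SDK logic could look like (the endpoint path and header name follow the examples above; the function names and fetch wrapper are hypothetical):&lt;/p&gt;

```javascript
// Illustrative client-side SDK sketch, not a prescribed API: bootstrap
// the rate limit id once from the dedicated endpoint, then attach it to
// every subsequent request.
let cachedRateLimitId = null;

async function getRateLimitId(apiBaseUrl, accessToken, fetchImpl) {
    if (cachedRateLimitId) {
        return cachedRateLimitId;
    }
    // One extra round trip, paid only on the first authenticated request
    const response = await fetchImpl(apiBaseUrl + '/ratelimit-user-id', {
        headers: { authorization: 'Bearer ' + accessToken }
    });
    cachedRateLimitId = response.headers.get('x-ratelimit-user-id');
    return cachedRateLimitId;
}

async function apiRequest(apiBaseUrl, path, accessToken, fetchImpl) {
    const rateLimitId = await getRateLimitId(apiBaseUrl, accessToken, fetchImpl);
    return fetchImpl(apiBaseUrl + path, {
        headers: {
            authorization: 'Bearer ' + accessToken,
            'x-ratelimit-user-id': rateLimitId
        }
    });
}
```

&lt;p&gt;Only the very first call pays the bootstrap round trip; every call after that reuses the cached value.&lt;/p&gt;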

&lt;p&gt;However, there is actually a problem with this compared to the other strategy: we can't easily rate limit different endpoints differently.&lt;/p&gt;

&lt;p&gt;That's because we need a different hash per endpoint, and we don't know which endpoint they are calling when they first call the &lt;code&gt;GET /ratelimit-user-id&lt;/code&gt;. So depending on use case, this strategy might not work at all without a lot of extra complexity. That said, there are ways around this, and it's actually a simple matter for anyone who understands HMACs, but incredibly challenging for anyone who doesn't.&lt;/p&gt;

&lt;p&gt;The TL;DR of the HMAC trick is that you can HMAC an HMAC: use the user hash as the secret key for a second, per-endpoint HMAC, and verify that second HMAC on the CF Function side by re-deriving the same user hash:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;cf&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cloudfront&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;crypto&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;crypto&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;authorization&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;401&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;statusDescription&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Unauthorized&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
        &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;authorization&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Bearer &lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;jwt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;token&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;base64url&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;keyValueStore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;cf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;kvs&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;secret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;keyValueStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hmac-secret&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;crypto&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createHmac&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sha256&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;jwt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;jwt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;accountId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hex&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;uri&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/ratelimit-user-id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;statusDescription&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;OK&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;x-ratelimit-user-id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;hash&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Add the request.uri to the current HMAC&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;endpointHash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;crypto&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createHmac&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sha256&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hex&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;endpointHash&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;x-ratelimit-user-id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;429&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;statusDescription&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Too Many Requests&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
        &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  This one weird trick
&lt;/h3&gt;

&lt;p&gt;The only thing I'm going to say is that you can put a CloudFront in front of your CloudFront. That allows the WAF to be simultaneously after one CloudFront and before the other, letting a CF Function run before the WAF while still supporting any origin you want.&lt;/p&gt;

&lt;p&gt;Given the nuanced and quite unobvious pitfalls of such an approach, which are outside the scope of this article, I'm not going to go into it any further, other than to note that such an architecture can technically exist.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;The thread running through every approach in this article is the same problem: the thing you need to rate limit on, the authenticated user identity, doesn't exist where rate limiting happens. APIGW Usage Plans need the identity at request time, but the authorizer cache needs significantly more than just the identity in its cache key, and the 10k requests-per-second ceiling doesn't care how carefully you've designed around it. WAF needs the identity at the decision point, but a WAF running before your origin only has the unprocessed JWT. Rolling your own counter store relocates the problem to a smaller, more fragile target. And a custom proxy just adds an operational surface to the same architectural mismatch.&lt;/p&gt;

&lt;p&gt;The CF Functions approach gets closest to eliminating the gap. The rate limiting hash is computed at the edge: after the JWT arrives, before the WAF decision point, and without the origin being involved at all. And because the HMAC can include the endpoint, you finally get per-user, per-endpoint rate limiting that actually works: &lt;code&gt;GET /items&lt;/code&gt; and &lt;code&gt;POST /items&lt;/code&gt; are separate buckets for each user, without the WCU cost of rules that aggregate on broken hashes.&lt;/p&gt;

&lt;p&gt;Your authorizer goes back to doing one job. And your origin doesn't know rate limiting exists.&lt;/p&gt;

&lt;p&gt;But let's be honest about what's left.&lt;/p&gt;

&lt;p&gt;The first authenticated request per user still requires a round trip to establish the hash identifier from the JWT via the HMAC. You can cache it in a cookie and reduce subsequent overhead, but the bootstrap still happens. There's no architecture here that eliminates it entirely without severely restricting you in other ways.&lt;/p&gt;

&lt;p&gt;The correct framing for what you've built isn't &lt;em&gt;rate limiting&lt;/em&gt;, but rather a cost ceiling with a known failure mode. Once you've defined the blast radius, the failure modes, the actual costs, what needs to be explicitly protected (the database, the downstream services, etc.), and what can still be hit cheaply (the edge, the APIGW, the Lambda), &lt;strong&gt;only then can you build the right solution&lt;/strong&gt;. That's really the only defensible position. And it's less bad than everything else I've brought up in this article.&lt;/p&gt;

&lt;p&gt;Again, there are some clever solutions, but none of them are super great.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;For help understanding this article or how you can implement auth
and similar security architectures in your services, feel free to 
reach out to me via the community server.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://authress.io/community" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Join the community&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>aws</category>
      <category>security</category>
      <category>scaling</category>
      <category>apigateway</category>
    </item>
    <item>
      <title>Actually Fixing AWS S3</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/aws-builders/actually-fixing-aws-s3-10g3</link>
      <guid>https://dev.to/aws-builders/actually-fixing-aws-s3-10g3</guid>
      <description>&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;For help understanding this article or how you can implement auth
and similar security architectures in your services, feel free to 
reach out to me via the community server.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://authress.io/community" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Join the community&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;AWS just released a supposed fix for S3 bucket squatting by utilizing what they are calling &lt;a href="https://aws.amazon.com/blogs/aws/introducing-account-regional-namespaces-for-amazon-s3-general-purpose-buckets/" rel="noopener noreferrer"&gt;Account Regional Namespaces&lt;/a&gt;. I don't understand the hype, and now I'm going to explain why.&lt;/p&gt;

&lt;h2&gt;
  
  
  Broken: S3 Bucket Names are Global
&lt;/h2&gt;

&lt;p&gt;S3 bucket names are global. Not global to your account. Not global to your region. Global to the entire AWS partition — every account, every region, every customer who has ever existed on AWS.&lt;/p&gt;

&lt;p&gt;This was not a deliberate design philosophy. It was a default from 2006 that nobody corrected. S3 launched when AWS was essentially a startup with Amazon as its main customer. Global uniqueness was the path of least resistance. Nobody asked whether it would cause problems at scale, because at the time &lt;em&gt;"scale"&lt;/em&gt; meant hundreds or thousands of developers, not millions of accounts and decades of production workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But, that default is still in place today.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fma27qo59v4189h74wfbv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fma27qo59v4189h74wfbv.png" alt='A dog sitting calmly in a room that is on fire, captioned "This is Fine"'&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;small&gt;AWS's relationship with the S3 naming model, circa every year since 2008.&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;The sad truth is, nobody needs global bucket names. There is no use case that requires your bucket name to be universally unique across every AWS customer on the planet. The value of global uniqueness flows entirely in one direction: it must have simplified the original implementation. The cost of global uniqueness flows in the other direction: two decades of pain for every customer who has ever tried to name a bucket something sensible.&lt;/p&gt;

&lt;p&gt;The abomination lives on because someone probably said "Wouldn't it be cool if you could expose your S3 bucket publicly?" And for that the bucket name would have to be in the URL, and therefore globally unique (and also require that the bucket name be lowercase and &lt;a href="https://datatracker.ietf.org/doc/html/rfc7553" rel="noopener noreferrer"&gt;RFC 7553 compliant&lt;/a&gt;). This is true but also irrelevant. S3 doesn't even support TLS for custom domains. So there is no way to serve an asset such as &lt;code&gt;https://assets.mycompany.com&lt;/code&gt; directly from your S3 bucket. &lt;strong&gt;None, full stop.&lt;/strong&gt; Let's break that down: there are three parts to that URL — HTTPS, your domain, and something that maps to the S3 bucket. It has always been, and still is, only &lt;strong&gt;PICK 2&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Anyone who needs a public URL with a real domain and HTTPS is already using CloudFront as a reverse proxy. As a matter of fact, every SPA out there must be using CloudFront to achieve HTTPS, or it must not be using a custom domain. The only suitable URL is the CloudFront distribution's alias, not the S3 bucket name. The bucket name is internal plumbing that nothing outside your AWS account should ever reference directly. I'm here to tell you that not only are global bucket names a mistake, there is actually an easy way to fix it. One has to wonder why AWS hasn't.&lt;/p&gt;

&lt;p&gt;The people who think they need global bucket names are the people using &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html" rel="noopener noreferrer"&gt;S3 Virtual Hosting&lt;/a&gt; — &lt;code&gt;mybucketname.s3.amazonaws.com&lt;/code&gt; — which does have TLS, but on AWS's domain, not theirs. And of course, there is the sad case for supporting this pattern indefinitely: AWS is much nicer than some other cloud providers that constantly deprecate actually required features, such as DNS zone hosting. Although in recent times that reputation hasn't held up as much, which gives credence to AWS dropping the concept, as it would have direct security and reliability wins. Not to mention an outright improvement from reduced complexity. There is no case for making it the architectural foundation of an object storage service used by billions of production workloads. And as we will see shortly, exposing that endpoint directly comes with its own expensive problem that CloudFront eliminates entirely.&lt;/p&gt;

&lt;p&gt;The reality is that none of the following are tradeoffs you agreed to. They are the consequences of a default, set in 2006, that nobody changed. The cost has landed on you ever since. And it boils down to basically one core concept.&lt;/p&gt;

&lt;h3&gt;
  
  
  Name squatting
&lt;/h3&gt;

&lt;p&gt;The boring version: the bucket name you want — &lt;code&gt;mycompany-prod-logs&lt;/code&gt;, &lt;code&gt;myapp-assets&lt;/code&gt;, &lt;code&gt;opentofu-state&lt;/code&gt; — was registered years ago by someone who no longer works at the company that registered it. AWS has no mechanism for name reclamation. That name is gone until the current owner deletes the bucket, which may never happen. So, you might think, just choose a new name, like you would choose a new username or website domain. This isn't a new problem after all.&lt;/p&gt;

&lt;p&gt;But the reality is: bucket names are predictable, and predictable names are claimable before you need them, and it turns out some bucket names you actually very much need.&lt;/p&gt;

&lt;p&gt;The researchers at Aqua Security demonstrated this at Black Hat USA 2024, calling it &lt;a href="https://www.aquasec.com/blog/bucket-monopoly-breaching-aws-accounts-through-shadow-resources/" rel="noopener noreferrer"&gt;Bucket Monopoly&lt;/a&gt;. AWS services themselves automatically create S3 buckets using naming patterns derived from your account ID. Account IDs are not secret — they appear in IAM role ARNs, error messages, S3 URLs, and CloudTrail logs. And while good hygiene means keeping your AWS account ID obscured, the bucket names themselves must be completely public. &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html" rel="noopener noreferrer"&gt;S3 Virtual Hosting&lt;/a&gt; resolves every bucket as a DNS subdomain (&lt;code&gt;mybucket.s3.amazonaws.com&lt;/code&gt;), and &lt;a href="https://developer.mozilla.org/en-US/docs/Web/Security/Defenses/Certificate_Transparency" rel="noopener noreferrer"&gt;Certificate Transparency&lt;/a&gt; logs and &lt;a href="https://www.spamhaus.com/resource-center/what-is-passive-dns-a-beginners-guide/" rel="noopener noreferrer"&gt;passive DNS&lt;/a&gt; collectors observe and index those queries continuously. And while they might not have caught everything, any bucket that has ever received traffic via Virtual Hosting has a name that likely exists in a DNS database outside your control.&lt;/p&gt;

&lt;p&gt;Many naming patterns were vulnerable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Athena: &lt;code&gt;aws-athena-query-results-{account-id}-{region}&lt;/code&gt; — data query results&lt;/li&gt;
&lt;li&gt;Elastic Beanstalk: &lt;code&gt;elasticbeanstalk-{region}-{account-id}&lt;/code&gt; — application build artifacts&lt;/li&gt;
&lt;li&gt;AWS Config: &lt;code&gt;config-bucket-{account-id}&lt;/code&gt; — compliance and configuration records&lt;/li&gt;
&lt;li&gt;CloudFormation, Glue, EMR, SageMaker, ServiceCatalog, and CodeStar have all had similar patterns&lt;/li&gt;
&lt;/ul&gt;
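&lt;p&gt;To make that concrete, here is a minimal sketch of how little an attacker needs in order to enumerate every shadow-bucket candidate for an account. The patterns come from the published research; the &lt;code&gt;candidate_buckets&lt;/code&gt; helper itself is hypothetical, purely illustrative code.&lt;/p&gt;

```python
# Sketch: given only a (non-secret) account ID and a region list, an attacker
# can enumerate every candidate shadow-bucket name ahead of time.
# Patterns are from the Bucket Monopoly research; the helper is hypothetical.

PATTERNS = [
    "aws-athena-query-results-{account_id}-{region}",
    "elasticbeanstalk-{region}-{account_id}",
    "config-bucket-{account_id}",  # region-independent
]

def candidate_buckets(account_id, regions):
    """Expand each service naming pattern for every region of interest."""
    names = []
    for pattern in PATTERNS:
        for region in regions:
            name = pattern.format(account_id=account_id, region=region)
            if name not in names:  # region-free patterns expand identically
                names.append(name)
    return names

for name in candidate_buckets("123456789012", ["us-east-1", "eu-west-1"]):
    print(name)
```

&lt;p&gt;Any name in that list that does not yet exist can be claimed by anyone, in any account — which is the entire attack.&lt;/p&gt;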

&lt;p&gt;The complete impact ranged from data exfiltration to remote code execution to full-service takeover. AWS has patched many of these services after disclosure.&lt;/p&gt;

&lt;p&gt;The CDK case may be the worst. AWS's own infrastructure-as-code wrapper (because the CDK isn't actually the IaC tool itself) bootstraps a staging bucket with a name that was never random:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cdk-hnb659fds-assets-{account-id}-{region}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The qualifier &lt;code&gt;hnb659fds&lt;/code&gt; is a &lt;a href="https://docs.aws.amazon.com/cdk/v2/guide/bootstrapping-customizing.html" rel="noopener noreferrer"&gt;hardcoded constant in CDK's bootstrap template&lt;/a&gt;. It has never changed. Anyone who knows your account ID knows your CDK staging bucket name. If that bucket does not exist — because you deleted it, or because you have not bootstrapped yet, or because someone cleaned up an old environment — an attacker can claim it. CDK will then use that bucket to store and retrieve CloudFormation templates. The attacker injects a malicious template. CDK deploys it using an IAM role with broad permissions. Full account takeover.&lt;/p&gt;

&lt;p&gt;Aqua Security found over 38,000 accounts susceptible. The vulnerability was present for years before being fixed in CDK &lt;a href="https://github.com/aws/aws-cdk/releases/tag/v2.149.0" rel="noopener noreferrer"&gt;v2.149.0 in July 2024&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To be clear, an attacker who learns your AWS account ID can register those bucket names before you deploy the service. AWS will see that the bucket exists, trust it, and then route your data into the attacker's bucket. This can happen entirely without your knowledge. Have you actually checked that every bucket AWS is silently sending data to is owned by your own account? Probably not, you probably don't even know which buckets AWS is using.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security Through Obscurity
&lt;/h3&gt;

&lt;p&gt;I thought it would go without saying, but I'm sure someone will bring it up: "Keep your bucket name obscure" is not a defense, since you can figure out these buckets by just using AWS services. And worse, the bucket name shows up in website hosting CNAMEs, presigned URLs, and other places. It is publicly available.&lt;/p&gt;

&lt;p&gt;And of course the inverse is also a problem. S3 bucket names carry implicit trust. When your infrastructure reads configuration from &lt;code&gt;my-config-bucket&lt;/code&gt;, it assumes the content is authoritative because the name is correct. The global namespace means that assumption is structurally unsound — the name and the owner are not bound to each other in any durable way. An attacker who controls a bucket your infrastructure reads from doesn't need to exfiltrate anything. They inject. Your service pulls the configuration, trusts it, and acts on it.&lt;/p&gt;

&lt;p&gt;This is not abstract. Consider the pattern of storing IAM permission mappings in S3 and distributing them via OU StackSets across an AWS organization. &lt;a href="https://authress.io/knowledge-base/articles/2026/03/03/securing-aws-accounts-access" rel="noopener noreferrer"&gt;Something I actually just wrote about doing&lt;/a&gt;. An attacker who controls that bucket — whether by squatting the name, claiming it after a deletion, or exploiting a misconfigured access policy — can inject a permissions map that adds their own identity as a trusted principal. The StackSet propagates the poisoned configuration to every account in the org. Their CICD pipeline assumes the role via OIDC federation. Full organization-wide access, delivered through the normal configuration path, with no credentials created and no anomalous API calls.&lt;/p&gt;

&lt;p&gt;This is the same pattern that made Clownstrike's &lt;a href="https://www.cisa.gov/news-events/alerts/2024/07/19/widespread-it-outage-due-crowdstrike-update" rel="noopener noreferrer"&gt;botched configuration update&lt;/a&gt; in 2024 so severe. A trusted delivery mechanism pushed configuration that every endpoint pulled and acted on without independent verification. The delivery channel was correct. The content was not. Millions of machines followed instructions from a source they had no reason to distrust.&lt;/p&gt;

&lt;p&gt;The difference is that Clownstrike's delivery infrastructure was their own, and the configuration was negligent, not malicious. Whereas the S3 version of this attack does not require compromising the infrastructure owner at all, it only requires claiming a bucket name.&lt;/p&gt;

&lt;p&gt;The global namespace is what makes this entire attack class possible. In a correctly scoped namespace, your bucket names are yours, and an attacker in a different account cannot claim them. AWS built a shared global pool and then built their own services on top of it using predictable names, inheriting the vulnerability they created.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security misconfiguration
&lt;/h3&gt;

&lt;p&gt;The public access model exists because bucket names are global. Since any AWS account can reference your bucket by name, making a bucket readable without credentials makes it readable by everyone — which is occasionally intentional and routinely catastrophic.&lt;/p&gt;

&lt;p&gt;The deeper problem: S3's access control system has never cleanly separated "accessible by my AWS account" from "accessible by the public internet." That distinction is not a first-class concept in S3. It has to be constructed from a combination of overlapping controls, each added at a different point in S3's history, each with its own interaction rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bucket policies&lt;/strong&gt; — grant access to specific principals or to &lt;code&gt;*&lt;/code&gt; (everyone)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ACLs&lt;/strong&gt; — a separate, older system with its own grantees, including the confusingly named &lt;code&gt;AuthenticatedUsers&lt;/code&gt; property&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Block Public Access&lt;/strong&gt; — four separate boolean flags that apply restrictions over policies and ACLs, added only in 2018 as a retroactive guardrail&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Object Ownership&lt;/strong&gt; — controls whether ACLs are enforced at all, added later still&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IAM Policies&lt;/strong&gt; — scopes permissions to principals with IAM authority.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each layer was added to contain the blast radius of the previous one. None of them establish "private to my account" as the starting point. They establish "open to everything" as the starting point and ask you to correctly configure the restrictions. Miss one flag, misread one grantee, inherit one policy from a module you didn't write — and the bucket is likely public.&lt;/p&gt;

&lt;p&gt;I like this article from 6 years ago talking a &lt;a href="https://nodramadevops.com/2020/04/why-protecting-data-in-s3-is-hard-and-a-least-privilege-bucket-policy-to-help/" rel="noopener noreferrer"&gt;bit about that&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F708hs2r81ix0a2q0xp6i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F708hs2r81ix0a2q0xp6i.png" alt="AWS IAM access pattern"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;small&gt;IAM access summarized&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;But then you realize, this is just how IAM works, it isn't how S3 works at all. Sure, whether or not IAM grants access is part of the picture, but where's the rest of it? I was trying to find a document in the AWS Docs that does a good job of explaining. There isn't one. There are over &lt;strong&gt;One Hundred Pages&lt;/strong&gt; on access control in S3 alone. Don't believe me? &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-management.html" rel="noopener noreferrer"&gt;Count them&lt;/a&gt;. To be fair, we have more than one page on similar &lt;a href="https://authress.io/knowledge-base/docs/category/authorization" rel="noopener noreferrer"&gt;Authorization concepts&lt;/a&gt; in the Authress KB. However, arguably what we designed has to be significantly more complex, since it has to handle literally every possible authorization scenario.&lt;/p&gt;

&lt;p&gt;This is not a configuration problem. It is an architecture problem. It is a security problem. The controls are layered on top of a model that was never designed to be private.&lt;/p&gt;

&lt;p&gt;And while the likelihood of getting it wrong has gone down significantly, the trade-off has been increased burden on configuration and setup.&lt;/p&gt;




&lt;h2&gt;
  
  
  Historical Hacks
&lt;/h2&gt;

&lt;p&gt;Each problem identified by the community eventually attracted a patch from AWS. But no one said they were the right patches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Forced random suffixes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For buckets operated by AWS Services, you have no recourse, but for buckets you manage for your own platform, you have a small, but not very satisfying alternative. Because the global pool is full of names claimed by other accounts, you cannot have the names you want. &lt;code&gt;my-app-assets&lt;/code&gt; is taken. &lt;code&gt;opentofu-state&lt;/code&gt; is taken. &lt;code&gt;prod-logs&lt;/code&gt; is taken. The community's answer to the problem, years before AWS even started to take any approach, is to use the only reliable strategy available — append a random suffix and stop trying to name things sensibly: &lt;code&gt;my-app-assets-8f2a3c&lt;/code&gt;, &lt;code&gt;opentofu-state-a1b2c3&lt;/code&gt;, &lt;code&gt;prod-logs-9e4d71&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A list of your S3 buckets is now a list of opaque identifiers. Understanding which bucket belongs to which service requires either tagging discipline — which degrades over time — or reading OpenTofu state, which is stored in an S3 bucket with a random suffix. Not to mention this only gets around the creation problem, and doesn't remotely address the security angle.&lt;/p&gt;

&lt;p&gt;This is not a novel problem. Discord ran the same experiment with usernames. Their original system appended a four-digit discriminator to every display name: &lt;code&gt;warren#0088&lt;/code&gt;. Globally unique, unambiguous, machine-friendly. I don't remember anyone who could actually remember their discriminator. I can't imagine how many friend requests failed because users entered the wrong tag. And with only 10,000 discriminators available per name, popular names of course ran out.&lt;/p&gt;

&lt;p&gt;Discord's fix was not to make the discriminator longer. They separated the unique identifier — the username, used for backend lookups — from the display name, which is human-readable and non-unique. The part that needed global uniqueness was the lookup mechanism. The part humans see and share does not need to be globally unique at all.&lt;/p&gt;

&lt;p&gt;S3 never made this distinction. The bucket name is simultaneously the unique global identifier, the human-readable label, and the public URL component. When all three concerns are collapsed into one string that must be globally unique across every AWS customer, you get &lt;code&gt;my-app-assets-8f2a3c&lt;/code&gt;. That is your discriminator.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Forced predictable suffixes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We've taken a slightly different approach. That's because random suffixes cannot be dynamically derived at read time and are not idempotent, which usually means hard-coding the string in multiple places. Or worse, I've seen many implementations attempt to export the generated S3 name from the infrastructure process to somewhere else, effectively coupling disparate systems that had no business being coupled together.&lt;/p&gt;

&lt;p&gt;Our approach is to add the AWS Account ID, the Region, and an internal consistent identifier to every bucket we create. Now everyone will understand what that means. For example, you can imagine you choose something like &lt;code&gt;-${accountId}-${region}-un1que1d&lt;/code&gt;. Is that clever? Not really, but it is far better than having every bucket carry a random ID.&lt;/p&gt;
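&lt;p&gt;A minimal sketch of that convention, with placeholder names (including the &lt;code&gt;un1que1d&lt;/code&gt; qualifier): because the bucket name is a pure function of the base name, account, and region, it can be recomputed anywhere at read time instead of being exported from IaC state.&lt;/p&gt;

```python
# Sketch of the deterministic naming convention described above. The qualifier
# "un1que1d" is a placeholder for whatever constant your organization picks.

def bucket_name(base, account_id, region, qualifier="un1que1d"):
    """Idempotent: the same inputs always produce the same bucket name,
    so any service can recompute it instead of importing IaC outputs."""
    name = f"{base}-{account_id}-{region}-{qualifier}"
    if len(name) > 63:  # S3's DNS-derived bucket name limit
        raise ValueError(f"bucket name too long ({len(name)} > 63): {name}")
    return name

print(bucket_name("myapp-logs", "123456789012", "us-east-1"))
```

&lt;p&gt;Every reader of the name can immediately see which account and region owns the bucket, which is exactly what a random suffix throws away.&lt;/p&gt;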

&lt;p&gt;&lt;strong&gt;The &lt;code&gt;ExpectedBucketOwner&lt;/code&gt; property&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One hack AWS added was integrating a new parameter into the S3 bucket APIs, which could validate ownership on bucket-related actions such as Creation, PutObject, and GetObject. Released in &lt;a href="https://aws.amazon.com/blogs/aws/amazon-s3-update-three-new-security-access-control-features/" rel="noopener noreferrer"&gt;Oct 2020&lt;/a&gt;, every S3 API call could now include the expected AWS account ID of the bucket owner. If the bucket exists but belongs to a different account, the call fails. You add this header to your SDK calls, your bucket policies, your presigned URL logic. The problem of AWS-created buckets was so bad that AWS needed an internal security fix for it. And this helped a little bit for us users as well. It isn't a real solution though, just something hacked on top.&lt;/p&gt;

&lt;p&gt;The problem with this hack though, is that it is security you have to opt into, and if you are using some library or reusable module, good luck assuming that made it in.&lt;/p&gt;
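&lt;p&gt;&lt;code&gt;ExpectedBucketOwner&lt;/code&gt; is a real parameter on the S3 object APIs, but it only protects the call sites that remember it. One way to avoid opting in call-by-call is to inject it centrally. A sketch — the wrapper and the fake client here are mine, not an AWS API; the real client would be a boto3 S3 client:&lt;/p&gt;

```python
# Sketch: inject ExpectedBucketOwner into every S3 operation so that no call
# site can forget it. The wrapper and FakeClient are illustrative, not AWS APIs.

OWNER_AWARE_OPS = {"get_object", "put_object", "delete_object", "head_bucket"}

class OwnerPinnedS3:
    def __init__(self, client, account_id):
        self._client = client
        self._account_id = account_id

    def __getattr__(self, op_name):
        op = getattr(self._client, op_name)
        if op_name not in OWNER_AWARE_OPS:
            return op
        def pinned(**kwargs):
            # Force the expected owner; a squatted bucket in another account
            # now fails the call instead of silently receiving our data.
            kwargs.setdefault("ExpectedBucketOwner", self._account_id)
            return op(**kwargs)
        return pinned

# Demonstrate the injection with a fake client that just echoes its kwargs.
class FakeClient:
    def get_object(self, **kwargs):
        return kwargs

s3 = OwnerPinnedS3(FakeClient(), "123456789012")
print(s3.get_object(Bucket="myapp-logs", Key="a.txt"))
```

&lt;p&gt;The point of the sketch is the design choice: ownership validation belongs in one chokepoint, not scattered across every call site and third-party library.&lt;/p&gt;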

&lt;p&gt;&lt;strong&gt;CDK v2.149.0&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the July 2024 fix for the CDK bootstrap, AWS merged a change that adds a condition to the CDK bootstrap role, preventing the attacker-controlled-bucket scenario. However, the fix still required teams to re-run &lt;code&gt;cdk bootstrap&lt;/code&gt;. Any environment bootstrapped with CDK v2.148.1 or earlier and not yet re-bootstrapped remains vulnerable. The hack qualifier still remains &lt;code&gt;hnb659fds&lt;/code&gt;, but you can change it, &lt;a href="https://github.com/aws/aws-cdk/blob/d16dc7e433c4986f3473b2992ba36bee9fb64f1e/packages/aws-cdk-lib/core/lib/stack-synthesizers/bootstrapless-synthesizer.ts#L10-L18" rel="noopener noreferrer"&gt;if you want to&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Block Public Access&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By 2018, the pattern was clear: teams were misconfiguring bucket policies and ACLs in what seemed like an on-purpose fashion, as if they were on a mission to win an award. Objects were going public, breaches were making headlines, and the individual controls were too granular and too easy to get wrong. AWS's response was to add a meta-level override: Block Public Access — four boolean flags that sit above all bucket policies and ACLs and veto any access grant that would expose objects to the public internet. To be clear, these flags don't affect the bucket at all; they affect your ability to change those other insecure properties on the bucket.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;BlockPublicAcls&lt;/code&gt;, &lt;code&gt;IgnorePublicAcls&lt;/code&gt;, &lt;code&gt;BlockPublicPolicy&lt;/code&gt;, &lt;code&gt;RestrictPublicBuckets&lt;/code&gt;. Each flag is a different angle on the same problem.&lt;/p&gt;

&lt;p&gt;It is a kill switch. It works, for the most part. It was necessary because the model it was bolted onto had no safe default — the access system started too easy to open and required teams to correctly configure the restrictions, which teams reliably failed to do at scale. Block Public Access does not change that model. It adds a blunt override and calls it a fix. AWS enabled it by default for new accounts in 2022.&lt;/p&gt;
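&lt;p&gt;For reference, the four flags map directly onto the &lt;code&gt;PublicAccessBlockConfiguration&lt;/code&gt; structure in the S3 API. A minimal sketch of building and sanity-checking that configuration — the bucket name is a placeholder, and the boto3 call is shown only as a comment:&lt;/p&gt;

```python
# The four Block Public Access flags, in the shape the S3 API expects
# (the PublicAccessBlockConfiguration structure). With a real boto3 client,
# this dict would be passed to put_public_access_block; here we only build
# and verify it.

public_access_block = {
    "BlockPublicAcls": True,        # reject new public ACLs
    "IgnorePublicAcls": True,       # ignore any existing public ACLs
    "BlockPublicPolicy": True,      # reject bucket policies granting public access
    "RestrictPublicBuckets": True,  # restrict access under public policies
}

# e.g. s3.put_public_access_block(
#          Bucket="myapp-logs",
#          PublicAccessBlockConfiguration=public_access_block)

assert all(public_access_block.values()), "every flag must be on for a full veto"
print("all four flags enabled")
```

&lt;p&gt;Note that all four need to be on; each one alone only vetoes one of the overlapping access paths.&lt;/p&gt;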

&lt;p&gt;&lt;strong&gt;Paying for unauthorized access&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Did you know that until 2024, if someone attempted to access your AWS S3 bucket, even if it was never public, you would still incur a charge? This massive oversight was fixed under the radar, and you can read more about it in the release &lt;a href="https://aws.amazon.com/about-aws/whats-new/2024/05/amazon-s3-no-charge-http-error-codes/" rel="noopener noreferrer"&gt;Amazon S3 will no longer charge for several HTTP error codes&lt;/a&gt;. How that ever got off the ground in the first place is honestly shocking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OU: Block Public Access&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Finally, only last year, did AWS release the ability for AWS Organizations to turn off the incredibly insecure configuration by utilizing one of the &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-control-block-public-access.html" rel="noopener noreferrer"&gt;S3 Org level policies&lt;/a&gt;. Now you can actually be sure you don't accidentally get it wrong, or I guess also find out if you did much sooner than you would have.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Floeakx0e0kn11po6izjq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Floeakx0e0kn11po6izjq.png" alt="A doctor telling a patient &amp;quot;Well, don't do that then&amp;quot; in response to &amp;quot;Doctor, it hurts when I do this&amp;quot;"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;small&gt;The entire history of S3 naming advice, summarized.&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;Are these hacks? Yes, yes they are. That is because the secure ways of using S3 require more configuration than the insecure defaults. If you want your bucket to be public, you configure less than you do if you want it to stay private. If you want to make sure you are secure and writing to your own bucket, you need to add properties rather than remove them.&lt;/p&gt;




&lt;h2&gt;
  
  
  What AWS Just Shipped
&lt;/h2&gt;

&lt;p&gt;The biggest challenge with all of these hacks is that each new one required every service, product, application, and library to directly integrate the change, because every API, architecture decision, and code path had to account for it. These weren't just hacks AWS made to solve the problem; these were bad hacks that pushed the burden onto customers.&lt;/p&gt;

&lt;p&gt;And so, AWS has watched the community embed account IDs, regions, and random identifiers into bucket names for years. That must have meant we loved it, because then they shipped that exact pattern as a first-class feature: &lt;a href="https://aws.amazon.com/blogs/aws/introducing-account-regional-namespaces-for-amazon-s3-general-purpose-buckets/" rel="noopener noreferrer"&gt;Account Regional Namespaces&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbssiathy3u6s0yz90gqy.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbssiathy3u6s0yz90gqy.jpg" alt="Success self pat on the back"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The feature works like this: when you create a bucket named &lt;code&gt;myapp-logs&lt;/code&gt; and request it in your account-regional namespace, you get &lt;code&gt;myapp-logs-123456789012-us-east-1-an&lt;/code&gt;. The &lt;code&gt;-an&lt;/code&gt; suffix signals to the S3 service that this name is scoped to your account and region. Nobody else can register &lt;code&gt;anything-123456789012-us-east-1-an&lt;/code&gt; — the &lt;code&gt;123456789012-us-east-1&lt;/code&gt; segment is reserved for your account. How AWS managed to promise that buckets with an &lt;code&gt;-an&lt;/code&gt; suffix don't already exist, and that none of those buckets were in a cross-account scenario, is beyond me. Maybe they didn't. The likelihood that someone already had a bucket with a suffix of &lt;code&gt;-{accountId}-{region}-an&lt;/code&gt; is very small, but if they did, and they had a cross-account scenario, then that is now broken. Or maybe it isn't, maybe that special bucket according to the new rules was created in the correct account, but in reality someone else owns it.&lt;/p&gt;

&lt;p&gt;And so, we can see the same problematic pattern with this one as all the other hacks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It is opt-in.&lt;/strong&gt; You must set a special header or use a special property on &lt;code&gt;CreateBucket&lt;/code&gt;. Existing buckets are not migrated. Existing tooling does not generate these names. Every piece of infrastructure code that creates S3 buckets needs to be updated to use the new naming convention. And that means every service, SDK, API, library, product, etc... that you are using must also make this change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It adds 26+ characters to your bucket name.&lt;/strong&gt; S3 bucket names have a 63-character limit. You now have at most 37 characters to work with before you hit the wall. If you have a naming convention like &lt;code&gt;{environment}-{team}-{service}-{purpose}&lt;/code&gt;, you are already in trouble. Hopefully each team in your organization has their own AWS account, but I know some of us aren't that lucky. You might be asking yourself, why 63? This limit almost certainly exists because the bucket name has to be part of the URL as a subdomain, and DNS labels max out at 63 characters according to &lt;a href="https://datatracker.ietf.org/doc/html/rfc1123#section-2" rel="noopener noreferrer"&gt;RFC 1123&lt;/a&gt;.&lt;/p&gt;
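&lt;p&gt;The arithmetic is easy to check. The suffix is a fixed function of your account ID and region, so the remaining budget shrinks with longer region names. A quick sketch (the account ID and regions are example values):&lt;/p&gt;

```python
# Working out the character budget: the account-regional suffix consumes a
# fixed chunk of the 63-character bucket name limit, and the region length
# varies. us-east-1 is close to the best case.

S3_NAME_LIMIT = 63  # DNS labels max out at 63 characters (RFC 1123)

def remaining_budget(account_id="123456789012", region="us-east-1"):
    suffix = f"-{account_id}-{region}-an"
    return S3_NAME_LIMIT - len(suffix)

print(remaining_budget())                         # us-east-1
print(remaining_budget(region="ap-southeast-2"))  # longer regions lose more
```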

&lt;p&gt;&lt;strong&gt;It does not address the actual architectural problem.&lt;/strong&gt; Your bucket is still globally addressable via &lt;code&gt;s3.amazonaws.com&lt;/code&gt;. The access model is unchanged. The public bucket problem is unchanged.&lt;/p&gt;

&lt;p&gt;And then there is the SDK story.&lt;/p&gt;

&lt;p&gt;Clever engineers will immediately ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If my bucket name no longer explicitly includes my account ID and region, I cannot just pass around the bucket name. How do I write portable infrastructure?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;My answer: You don't.&lt;/p&gt;

&lt;p&gt;AWS's obvious answer: pass the account ID and region as a special token that the SDK resolves at runtime from the current execution environment. Instead of hardcoding &lt;code&gt;123456789012&lt;/code&gt;, you reference a variable that CloudFormation or the SDK resolves from the execution context.&lt;/p&gt;

&lt;p&gt;So it's a second hack layered on top of the first one. The question is as much philosophical as practical, and AWS's answer is purely technical. That's a weird take.&lt;/p&gt;

&lt;p&gt;You now have infrastructure code that creates bucket names by concatenating a prefix with a runtime-resolved account ID and region. Your IaC state needs to capture the resolved name, not the template. Your references to the bucket in other services need to either embed the same resolution logic or accept the full resolved name as an input. Your cross-account pipelines — CI/CD systems deploying into multiple accounts — need to be aware of this resolution mechanism.&lt;/p&gt;

&lt;p&gt;AWS did not fix the problem. They added an opt-in feature that partially addresses one symptom, then added tooling to work around the limitations of that feature. You'll notice in the same release post, they also include the changes they had to make to CloudFormation S3 Resource. The people celebrating are celebrating a band-aid on a fracture.&lt;/p&gt;




&lt;h2&gt;
  
  
  How S3 Is Actually Used
&lt;/h2&gt;

&lt;p&gt;But the real goal of this article is actually to talk about a solution. And to do that we need to review the fundamental use cases of S3. In practice it exists for four distinct use cases, which of course have almost nothing in common:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Private object storage&lt;/strong&gt; — build artifacts, backups, data lakes, Lambda packages, database snapshots, OpenTofu, Terraform, and other IaC state files, and SPAs served via CloudFront. No direct external access; internal AWS service-to-service or IAM-authenticated only. I'll go out on a limb and say this is 99% of S3 usage by volume and by bucket count.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Event-driven processing&lt;/strong&gt; — S3 event notifications triggering Lambda functions. An object is created or deleted; an event fires; a Lambda processes it. (One caveat: you MUST never wire S3 events directly to Lambda, because S3 event notifications are not durable. Ensure that all S3 events are sent to SQS first, and from there to Lambda.) The bucket name and ARN arrive in the event payload:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Records"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"eventSource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aws:s3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"awsRegion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"eventTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2024-03-01T12:00:00.000Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"eventName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ObjectCreated:Put"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"userIdentity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"principalId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AWS:AROAEXAMPLEID:session"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"responseElements"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"x-amz-request-id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"EXAMPLE123456789"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"s3"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"s3SchemaVersion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"configurationId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"upload-processor-trigger"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"bucket"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-app-uploads"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"ownerIdentity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"principalId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AEXAMPLEOWNERID"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"arn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:::my-app-uploads"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"uploads/photo.jpg"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"eTag"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"d41d8cd98f00b204e9800998ecf8427e"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"sequencer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0A1B2C3D4E5F678901"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice what is not in this payload: a public-facing URL. The &lt;code&gt;bucket.name&lt;/code&gt; and &lt;code&gt;bucket.arn&lt;/code&gt; reference the internal bucket name. S3 ARNs have never included an account ID or region — &lt;code&gt;arn:aws:s3:::my-app-uploads&lt;/code&gt;, not &lt;code&gt;arn:aws:s3:us-east-1:123456789012:my-app-uploads&lt;/code&gt;. The identifier in the event is already the private bucket identifier, not a public one. And it would be easy to add the region and account ID to this ARN and likely not break a single thing.&lt;/p&gt;

&lt;p&gt;And that's the tell. The event-driven use case has always operated on private identifiers. The Lambda function receiving this event doesn't care what the bucket is called publicly, or whether it has a public URL at all. It cares about the object key and the internal bucket reference — both of which are already account-scoped and private by nature. S3's internal event system was already operating on the right model. The global namespace was never part of this path.&lt;/p&gt;
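&lt;p&gt;A minimal handler sketch working purely from those private identifiers, assuming (per the caveat above) that the notification arrives wrapped in an SQS message; &lt;code&gt;extractObjects&lt;/code&gt; is an invented name:&lt;/p&gt;

```javascript
// Each SQS record's body is the raw S3 notification JSON. The handler
// only ever needs the internal bucket name and object key; no public
// URL is involved anywhere on this path.
function extractObjects(sqsEvent) {
  return sqsEvent.Records.flatMap((sqsRecord) => {
    const notification = JSON.parse(sqsRecord.body);
    return notification.Records.map((r) => ({
      bucket: r.s3.bucket.name,   // private identifier
      arn: r.s3.bucket.arn,       // e.g. arn:aws:s3:::my-app-uploads
      // S3 URL-encodes keys in events ('+' for spaces), so decode.
      key: decodeURIComponent(r.s3.object.key.replace(/\+/g, ' ')),
    }));
  });
}

// Sample shaped like the payload above, wrapped as SQS would deliver it.
const sample = {
  Records: [{
    body: JSON.stringify({
      Records: [{
        eventSource: 'aws:s3',
        s3: {
          bucket: { name: 'my-app-uploads', arn: 'arn:aws:s3:::my-app-uploads' },
          object: { key: 'uploads/photo.jpg' },
        },
      }],
    }),
  }],
};
console.log(extractObjects(sample));
```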

&lt;p&gt;&lt;strong&gt;3. Presigned URLs&lt;/strong&gt; — assets that could be served over CloudFront because they are cacheable, but that you don't want to be public, such as user-owned data. Instead, you create a strategy to serve that data directly from S3. The same goes in reverse: you allow users to upload data, but rather than routing it through your service API, you have the client integrate directly with S3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Direct public access&lt;/strong&gt; — open buckets, bucket website hosting, ACL-public objects, resolvable by a public DNS. This is the pattern that causes all the breaches, all the confusion, and almost all of the architectural complexity AWS has accumulated in S3 over the years.&lt;/p&gt;

&lt;p&gt;Category 4 is a tiny fraction of actual S3 usage by any metric you choose, yet it is responsible for a disproportionate fraction of the design surface area, the security incidents, and the policy complexity. And all the fixes so far make the usage of (1), (2), and (3) more challenging while increasing the safety of (4). This is not how you solve architectural problems. You want a strategy where the most frequent uses are optimized for security, where the threat model identifies the biggest risk and you subvert that, not one that reinforces a screen door or builds a fence in the middle of the desert.&lt;/p&gt;

&lt;p&gt;The data breaches you read about were almost always S3 misconfiguration involving category 4. A few illustrative examples from a single year — 2017 alone:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.upguard.com/breaches/verizon-cloud-leak" rel="noopener noreferrer"&gt;Verizon&lt;/a&gt;&lt;/strong&gt; — 14 million customer records including names, addresses, and account PINs, left in a publicly accessible bucket by a third-party vendor (NICE Systems). The bucket was open for weeks after Verizon was notified.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.upguard.com/breaches/cloud-leak-accenture" rel="noopener noreferrer"&gt;Accenture&lt;/a&gt;&lt;/strong&gt; — Four public buckets containing 137GB of internal data: credentials, decryption keys, the master AWS KMS access key for their cloud platform, and data from clients across the Fortune 500.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://mackeeper.com/blog/data-breach-reports-2017/" rel="noopener noreferrer"&gt;WWE&lt;/a&gt;&lt;/strong&gt; — 3 million fan records including home addresses, ages of children, ethnicity, and account details. Open to anyone with the URL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.engadget.com/2018-08-09-amazon-aws-error-exposes-31-000-godaddy-servers.html" rel="noopener noreferrer"&gt;GoDaddy&lt;/a&gt;&lt;/strong&gt; — Configuration data for 31,000 GoDaddy servers exposed in a public bucket. In a detail that should give everyone pause: the bucket was used and misconfigured by an AWS employee.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fix in every case should have been "make S3 harder to misconfigure." But the advice and resolution we've seen was instead: "fix your IAM policies", "enable Block Public Access", "audit your bucket ACLs." Patches. Tooling. Guardrails. Security Hub findings around a footgun that should not exist in the first place.&lt;/p&gt;

&lt;p&gt;The reason category 4 exists at all is historical. In 2006, if you wanted to serve a file publicly from the internet, you needed a publicly accessible server. S3 was that server. CloudFront did not launch until 2008. IAM did not launch until 2011. The access model AWS ships with S3 today is the access model from an era when the alternatives did not exist yet. (I'm of course speculating here, because I didn't use AWS until 2008, and couldn't find a great source for this.)&lt;/p&gt;

&lt;p&gt;Yet, some of the hacks to fix this problem have happened much later than 2011, and realistically, none of them even required IAM to make this happen.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Root Cause
&lt;/h2&gt;

&lt;p&gt;All of that complexity — ACLs, Object Ownership, Block Public Access, website hosting — and the hacks added to it attempt to fix second-order mistakes. They were piled on top of the one thing nobody touched: &lt;strong&gt;the naming model&lt;/strong&gt;. And fixing the naming model is the real feature everyone wants:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feature 1: The same logical bucket name across multiple AWS accounts.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Take OpenTofu (or any IaC for that matter) for instance. You need remote state storage. The canonical setup: one S3 bucket per account, typically named something like &lt;code&gt;{org}-opentofu-state&lt;/code&gt; or &lt;code&gt;{account-name}-tfstate&lt;/code&gt;. Simple, readable, deterministic.&lt;/p&gt;

&lt;p&gt;In practice, you have a &lt;code&gt;dev&lt;/code&gt; account, a &lt;code&gt;staging&lt;/code&gt; account, a &lt;code&gt;production&lt;/code&gt; account, a &lt;code&gt;security&lt;/code&gt; account, a &lt;code&gt;shared-services&lt;/code&gt; account. You want &lt;code&gt;myorg-opentofu-state&lt;/code&gt; in all of them. Under the current global namespace, you cannot have that. You have to name them &lt;code&gt;myorg-opentofu-state-dev&lt;/code&gt;, &lt;code&gt;myorg-opentofu-state-prod&lt;/code&gt;, and so on — encoding the account into the name because the namespace doesn't do it for you.&lt;/p&gt;

&lt;p&gt;With the new account-regional namespaces, you can now have &lt;code&gt;opentofu-state&lt;/code&gt; scoped to each account. In theory. But in practice, all that changed was the interface for creating buckets. The usage of the buckets and their names is still the same as without this latest feature, and worse, without changing anything about how the service actually works, now everyone needs to make changes. It is the worst of all fates:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;OpenTofu's and other IaC's S3 backend configuration needs to be updated to use the new naming scheme&lt;/li&gt;
&lt;li&gt;Any modules that reference this bucket by name need to be updated&lt;/li&gt;
&lt;li&gt;Any existing state files pointing to the old bucket names need to be migrated&lt;/li&gt;
&lt;li&gt;Your bootstrap process — the code that creates the state bucket before OpenTofu can run — needs to support the new &lt;code&gt;CreateBucket&lt;/code&gt; header&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;None of this is easily managed. And while you can opt out of things like (2) and (3), we all know there is some "security theater" going on at large enterprises that will claim a migration here "increases security". I'm sure the associated Security Hub finding is going to come out soon with a Critical severity. All of it is work that should not have been necessary if the architecture had been correct from the start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feature 2: The same logical bucket name across multiple regions.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Multi-region active-active deployments are increasingly common. You want &lt;code&gt;my-app-assets&lt;/code&gt; in &lt;code&gt;us-east-1&lt;/code&gt; and &lt;code&gt;eu-west-1&lt;/code&gt;. Under the account-regional namespace, these would be &lt;code&gt;my-app-assets-123456789012-us-east-1-an&lt;/code&gt; and &lt;code&gt;my-app-assets-123456789012-eu-west-1-an&lt;/code&gt; — different names for logically identical resources. Your infrastructure code must now either parameterize the region or generate the full resolved name in every place that references the bucket.&lt;/p&gt;

&lt;p&gt;This is the same problem that existed before the fix. The namespace is account-regional — it scopes names to an account &lt;em&gt;and&lt;/em&gt; a region. That is correct for preventing name collisions, but it means your logical bucket name is still not portable across regions. The same bucket in a different region is a different name. Your replication configuration, your CDN origin setup, your cross-region failover logic — all of it must carry the full resolved name around. You can use the same DynamoDB table name in every region, but you cannot do the same with S3.&lt;/p&gt;

&lt;p&gt;The underlying issue is that S3 conflated four separate concerns:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Identity&lt;/strong&gt; — what is this bucket called?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Location&lt;/strong&gt; — which account owns it, and which region holds the data?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Addressability&lt;/strong&gt; — how do external clients find it?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accessibility&lt;/strong&gt; — Who should have access to it?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;AWS's new feature embeds all four into the name string itself: &lt;code&gt;myapp-123456789012-us-east-1-an&lt;/code&gt;. The account ID is in the name. The region is in the name. The identity is whatever is left over after you subtract those 26+ characters. The &lt;code&gt;an&lt;/code&gt; limits access. This is not a namespace — it is a naming convention that happens to be enforced by the S3 service on creation only. The four concerns are still coupled; they are just coupled inside the string rather than explicitly as configuration.&lt;/p&gt;
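&lt;p&gt;To make the coupling visible, here is a sketch that pulls the four concerns back out of the string. The layout follows the examples above, and the regex is a rough approximation, not an official grammar:&lt;/p&gt;

```javascript
// The account-regional name embeds everything in one string:
//   {identity}-{accountId}-{region}-an
// Parsing it back out shows the concerns are still coupled, just
// inside the string instead of as explicit configuration.
function parseAccountRegionalName(name) {
  const m = name.match(/^(.+)-(\d{12})-([a-z]{2}(?:-[a-z]+)+-\d)-an$/);
  if (!m) throw new Error(`not an account-regional bucket name: ${name}`);
  const [, identity, accountId, region] = m;
  return {
    identity,          // what the bucket is called
    accountId, region, // who owns it and where the data lives
  };
}

console.log(parseAccountRegionalName('myapp-123456789012-us-east-1-an'));
```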




&lt;h2&gt;
  
  
  Intelligent Design
&lt;/h2&gt;

&lt;p&gt;I want to be clear: AWS S3 is a fantastic service. It is so good, in fact, that no small number of huge businesses have been built around duplicating the S3 API. &lt;a href="https://aws.amazon.com/blogs/aws/twenty-years-of-amazon-s3-and-building-whats-next/" rel="noopener noreferrer"&gt;There are 20 years of successes&lt;/a&gt; after all. And I don't want to gloss over that:&lt;/p&gt;

&lt;h3&gt;
  
  
  What S3 gets right
&lt;/h3&gt;

&lt;p&gt;Object storage is the correct primitive. An opaque key — a bucket name and an object path — maps to a sequence of bytes. Durable, versioned, regionally placed, with a consistent API surface across every SDK AWS ships. Lifecycle rules, replication, object tagging, multipart uploads, and locking (but only recently unfortunately). These are the right tools for managing data at scale, and they work.&lt;/p&gt;

&lt;p&gt;Additionally, Presigned URLs are the correct mechanism for temporary access delegation. Credential-scoped, time-limited, no IAM policy change required. The object stays private; the URL grants access for a window. That's also the right design.&lt;/p&gt;

&lt;p&gt;Do I need to mention the high durability of &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/DataDurability.html" rel="noopener noreferrer"&gt;99.999999999%&lt;/a&gt;, and the availability of &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/DataDurability.html" rel="noopener noreferrer"&gt;99.99%&lt;/a&gt; as well?&lt;/p&gt;

&lt;p&gt;None of this needs to change. The problem isn't storage. It's two things piled on top of storage: the naming model and the access model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Secure by default
&lt;/h3&gt;

&lt;p&gt;Every AWS primitive designed with security in mind starts from the same position: the unconfigured state is safe.&lt;/p&gt;

&lt;p&gt;IAM: default deny on everything. No permission exists until you create one explicitly. The account with no IAM policies grants access to nothing.&lt;/p&gt;

&lt;p&gt;VPC Security Groups: inbound traffic blocked by default. Every allow rule is explicit. The security group you just created, without touching it? It denies everything. (excluding the default VPC, which I'm not going to get into here)&lt;/p&gt;

&lt;p&gt;KMS customer-managed keys: a key with no resource policy grants decryption to nobody — except the account root, which is a recovery mechanism, not an access path. Grants are explicit.&lt;/p&gt;

&lt;p&gt;S3 is the exception.&lt;/p&gt;

&lt;p&gt;Secure by default doesn't mean &lt;em&gt;"safe unless you misconfigure it."&lt;/em&gt; It means safe by construction. The state you reach without doing anything must be the safe state. And for me that also excludes the presence of &lt;code&gt;pits of failure&lt;/code&gt;. If it is easy to do the wrong thing, then this is a dangerous state. Public access, for instance, must require deliberate, explicit, named work. Not the absence of a flag. Not the absence of a policy. Not a default you forgot to change.&lt;/p&gt;

&lt;p&gt;S3 had it backwards. And the fix isn't more flags. The fix is a model where a public bucket cannot exist — because public access isn't a property a bucket can have, it's a property of a feature called "promotion".&lt;/p&gt;

&lt;h3&gt;
  
  
  My Proposal: Private by Default, Public by Promotion
&lt;/h3&gt;

&lt;p&gt;Here is the core of the current model, the one AWS edged toward fixing but no one wanted to commit to replacing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bucket names are global (partly addressed by the new feature, but only for new buckets, only opt-in, only with a 26-character tax)&lt;/li&gt;
&lt;li&gt;Buckets are the unit of access control&lt;/li&gt;
&lt;li&gt;Public access is a property of the bucket&lt;/li&gt;
&lt;li&gt;Anyone with the bucket name and the right IAM permissions (or no permissions required, if it's public) can read objects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The right model: &lt;strong&gt;A Private Bucket Service&lt;/strong&gt;. If you tilt your head sideways and squint, you might see that such a thing has been here all along, and I'm sure there is even an already existing AWS primitive that encapsulates this concept internally.&lt;/p&gt;


&lt;p&gt;By &lt;strong&gt;Private&lt;/strong&gt;, I mean that the bucket is private to your account, not merely private in the sense that it isn't publicly accessible.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Allow the creation of &lt;strong&gt;S3 Private Buckets&lt;/strong&gt; the same way you would the current &lt;strong&gt;S3 Public Buckets&lt;/strong&gt;. Might as well rename the current API to be the &lt;code&gt;Public Buckets Service&lt;/code&gt; instead, although I guess PBS was already taken, not to mention Public and Private both start with &lt;code&gt;P&lt;/code&gt;, a bit of an oversight in the English language.&lt;/li&gt;
&lt;li&gt;Private Buckets only exist in that one region in that one account, and use AWS ARNs correctly, with the AWS account ID and region in the ARN.&lt;/li&gt;
&lt;li&gt;All interactions within the account will assume the private bucket, and never the public bucket. This covers your API calls through SDKs, event source mappings for SQS, and event notifications.&lt;/li&gt;
&lt;li&gt;Names follow the same strategy as they do today (although since they aren't public, please let us have uppercase characters).&lt;/li&gt;
&lt;li&gt;Objects are private. Not by default. Always. Without exception.&lt;/li&gt;
&lt;li&gt;Public access is not a property of the bucket. (Want to create a public bucket still? I'll get to that in moment.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I don't think this is a novel concept. DynamoDB works exactly this way.&lt;/p&gt;

&lt;p&gt;And under this model, &lt;code&gt;my-app-assets&lt;/code&gt; in &lt;code&gt;us-east-1&lt;/code&gt; and &lt;code&gt;my-app-assets&lt;/code&gt; in &lt;code&gt;eu-west-1&lt;/code&gt; are two separate buckets, each globally identifiable via its ARN and accessible via the region parameter in the SDK/CLI/API (which, by the way, is already required). Your infrastructure code references the bucket name as it always has.&lt;/p&gt;
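&lt;p&gt;A sketch of what addressing could look like under this hypothetical model, mirroring how other regional services build ARNs; none of this is a real S3 API:&lt;/p&gt;

```javascript
// Hypothetical: a private bucket is identified like any other regional
// resource -- the same name everywhere, disambiguated by its ARN.
function privateBucketArn({ region, accountId, name }) {
  return `arn:aws:s3:${region}:${accountId}:${name}`;
}

// The same logical name in two regions: two distinct ARNs,
// one piece of infrastructure code.
const arns = ['us-east-1', 'eu-west-1'].map((region) =>
  privateBucketArn({ region, accountId: '123456789012', name: 'my-app-assets' }));
console.log(arns);
```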

&lt;p&gt;What's missing, you might ask?&lt;/p&gt;

&lt;p&gt;No 26-character suffix. No runtime SDK token substitution. No encoding of internal topology into names that humans have to read and type. No weird public configuration, no ACLs, no URLs associated with the buckets, no pits of failure.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Cornerstone Example
&lt;/h3&gt;

&lt;p&gt;When you create a bucket today — call it &lt;code&gt;my-app-assets&lt;/code&gt; — think of it as &lt;code&gt;s3PublicClient.createPublicBucket()&lt;/code&gt;. It has a ridiculous number of limitations at creation, which I will get to later, as well as the underlying assumption that you will make some part of it public. It comes with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bucket Policy&lt;/li&gt;
&lt;li&gt;CORS Policy&lt;/li&gt;
&lt;li&gt;DNS Name&lt;/li&gt;
&lt;li&gt;Bucket Website&lt;/li&gt;
&lt;li&gt;Global ARN&lt;/li&gt;
&lt;li&gt;Public Access Block configuration&lt;/li&gt;
&lt;li&gt;63 character lowercase name restriction&lt;/li&gt;
&lt;li&gt;I'm sure there are 20 more things here that also no one needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That bucket is created with the ARN &lt;code&gt;arn:aws:s3:::my-app-assets&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This doesn't go away; you can still call that API if you really want to. But the truth is that almost no one would, because very few people need it. Instead you would call &lt;code&gt;s3PrivateClient.createPrivateBucket()&lt;/code&gt;, and you would get a bucket with the ARN &lt;code&gt;arn:aws:s3:REGION:AWS_ACCOUNT_ID:my-app-assets&lt;/code&gt;. That bucket operates with everything you would want in a private bucket:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Encryption&lt;/li&gt;
&lt;li&gt;Governance&lt;/li&gt;
&lt;li&gt;Presigned URL support&lt;/li&gt;
&lt;li&gt;Resource Policies&lt;/li&gt;
&lt;li&gt;etc...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But it has none of the public-bucket machinery. If you want those public things, you would need to call &lt;code&gt;s3PrivateClient.promoteBucket()&lt;/code&gt;. The parameters for that should be something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;s3PrivateClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;promoteBucket&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;my-app-assets&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;publicBucketName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;my-app-assets-public&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Doing so at that moment would validate whether that public bucket name is available in the global namespace. Everything continues to work the same from a public-bucket standpoint, but we are also afforded all the benefits of the private bucket without any of the risks.&lt;/p&gt;

&lt;p&gt;This also prevents any backwards-compatibility issues as far as infrastructure management and creation go, because the S3 public API still exists. The only difference is that there is now also the S3 private API, which can be used to create local buckets that, when desired, are promoted to also be public buckets. Additionally, you'll see later that migration on the AWS side is necessary to support this.&lt;/p&gt;

&lt;p&gt;If I were an S3 Architect, I might ensure that all public bucket names start with &lt;code&gt;public-&lt;/code&gt; or exist in the namespace &lt;code&gt;public/&lt;/code&gt; or &lt;code&gt;public:&lt;/code&gt;, so that someone could not accidentally write &lt;code&gt;arn:aws:s3:::my-app-assets&lt;/code&gt; and get a malicious attacker's promoted S3 private bucket.&lt;/p&gt;

&lt;p&gt;That is, if an attacker created &lt;code&gt;arn:aws:s3:us-east-1:666666666666:my-app-assets&lt;/code&gt; and promoted it to &lt;code&gt;arn:aws:s3:::my-app-assets&lt;/code&gt;, then you could create &lt;code&gt;arn:aws:s3:us-east-1:000000000000:my-app-assets&lt;/code&gt; and accidentally reference it as &lt;code&gt;arn:aws:s3:::my-app-assets&lt;/code&gt;. In doing so, you would again be using that attacker's bucket. Holistically, this is the same problem that has always existed up until this point, so this strategy isn't worse. It is just not perfect. But that's a mistake AWS might need to live with.&lt;/p&gt;

&lt;p&gt;It would be better if you had to explicitly add the &lt;code&gt;public&lt;/code&gt; prefix and write &lt;code&gt;arn:aws:s3:::aws-public-buckets/my-app-assets&lt;/code&gt; for all public buckets. But that's a breaking change, so it's likely off the table. However, as I mention below, there are great ways AWS can help protect against this.&lt;/p&gt;

&lt;h3&gt;
  
  
  Public Buckets: How promotion works
&lt;/h3&gt;

&lt;p&gt;A bucket, once created, is private. The bucket's access state never changes. What changes is what you attach to it.&lt;/p&gt;

&lt;p&gt;There are two core public scenarios, which I'll call promotion paths, that still need solutions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Presigned URLs: Temporary Promotion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You issue a time-limited, credential-signed URL for a specific object. The URL encodes the object path, an expiration, and a signature derived from your IAM credentials. Anyone with that URL can read that object — for the duration you specified. When it expires, access ends. The bucket policy didn't change. The object's access model didn't change. The credential signed the request; the routing table resolved the bucket; S3 validated the signature and served the object.&lt;/p&gt;

&lt;p&gt;A presigned URL today looks like &lt;code&gt;https://mybucket.s3.amazonaws.com/file.png?X-Amz-Credential=AKID123%2F20240101%2Fus-east-1%2Fs3%2Faws4_request&amp;amp;X-Amz-Signature=...&lt;/code&gt;. The &lt;code&gt;X-Amz-Credential&lt;/code&gt; field already contains the account identifier — derived from the access key ID, which maps to an account. S3 extracts that account, consults the routing table for &lt;code&gt;mybucket&lt;/code&gt; in that account, and routes to the right physical bucket. The global uniqueness constraint was never doing the routing work here. The credential was.&lt;/p&gt;
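&lt;p&gt;Pulling the credential scope apart is straightforward; the field layout (access key, date, region, service, terminator) is the standard SigV4 scope:&lt;/p&gt;

```javascript
// The X-Amz-Credential query parameter carries the SigV4 scope:
//   {accessKeyId}/{yyyymmdd}/{region}/{service}/aws4_request
// The access key maps to an account, which is enough to drive routing;
// the bucket name never needed to be globally unique for this to work.
function parseCredentialScope(presignedUrl) {
  const url = new URL(presignedUrl);
  // searchParams.get() percent-decodes the %2F separators for us.
  const credential = url.searchParams.get('X-Amz-Credential');
  const [accessKeyId, date, region, service] = credential.split('/');
  return { accessKeyId, date, region, service };
}

const scope = parseCredentialScope(
  'https://mybucket.s3.amazonaws.com/file.png' +
  '?X-Amz-Credential=AKID123%2F20240101%2Fus-east-1%2Fs3%2Faws4_request' +
  '&X-Amz-Signature=deadbeef');
console.log(scope);
```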

&lt;p&gt;I want to say that again: presigned URLs will still absolutely work out of the box, without any changes.&lt;/p&gt;

&lt;p&gt;This is because presigned URLs are not an S3 concept. They're an IAM concept that S3 validates. To explain, we need to dive into how AWS IAM actually works. AWS uses its custom SigV4 signature scheme for every request to AWS, and every request goes over the wire to an AWS-owned DNS name for the service, with all the necessary parameters.&lt;/p&gt;

&lt;p&gt;For instance, your SDK computes a SigV4 signature using your IAM credentials — the access key ID and its corresponding secret. No AWS API call is made. The URL is computed entirely locally. This is how it works for &lt;strong&gt;every AWS service API&lt;/strong&gt;. When you call DynamoDB this happens, and the same thing happens when you call S3.&lt;/p&gt;

&lt;p&gt;Presigned S3 is a trick. After constructing the full HTTP payload to send to the service, instead of actually sending it, you give it to someone else, and that person executes the payload. Normally it wouldn't matter who executes it, but what if some part of the payload were allowed to change between the generation of the HTTP payload and the executor executing it? Let's say, for instance: &lt;strong&gt;the binary body&lt;/strong&gt;. In this way, you can generate a request that encodes the bucket, the object path, the expiration, and the signature, and hand it to some other user. They present it to S3 with a custom binary body.&lt;/p&gt;

&lt;p&gt;When S3 receives the request, it extracts the access key ID from &lt;code&gt;X-Amz-Credential&lt;/code&gt;, looks up the corresponding IAM entity via STS, re-derives the expected signature, and checks that it matches. Then it checks the expiration. Then it checks that the IAM entity had &lt;code&gt;s3:GetObject&lt;/code&gt; permission at signing time. If all three pass, S3 serves the object (or persists it in the case of &lt;code&gt;s3:PutObject&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;That's all. S3 is just doing IAM validation, the same thing every other service is doing. It is not checking whether the bucket is public. It is not consulting the access model at all. A fully private bucket — no ACLs, no public access configuration, nothing — can serve objects via presigned URL because the authorization is credential-based, IAM-based, and AWS-API based; it is not a unique access model built into public S3 buckets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Public Buckets: Permanent Promotion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Since public access is not a PrivateBucket property, there has to be some way to still expose the PrivateBucket data publicly. And so the proposal would allow making a PrivateBucket public by requesting a bucket name from the global authoritative S3 Bucket Name list, the same process you already go through today when you create a new S3 bucket.&lt;/p&gt;

&lt;p&gt;In the new model, the public properties, the ACLs, the website configuration, aren't properties of the bucket. They're a separate resource: a public access configuration. That resource is, in effect, what is called S3 today, so you might be able to see why I'm suggesting a name change. When you create one and attach it to your private bucket, the S3 URL is created. When you remove it, the S3 URL stops existing. The bucket itself never changes state. The URL is a consequence of the configuration, not a property of the storage. And that URL is the thing that must be globally unique, and most importantly, it doesn't even need to match the original bucket name, and it won't.&lt;/p&gt;

&lt;p&gt;Website configuration lives there too. Index documents, error documents, redirect rules — these move from bucket settings into the public access configuration. The &lt;code&gt;s3-website&lt;/code&gt; endpoint exists because the configuration says it should, not because the bucket was created with a flag set.&lt;/p&gt;
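
&lt;p&gt;To make that lifecycle concrete, here is a toy in-memory model of the proposed split. Everything here (the class name, the attach/detach methods, the URL scheme) is hypothetical, invented purely to illustrate that the public URL exists only while the configuration is attached:&lt;/p&gt;

```javascript
// Toy model of the proposed split: the bucket is one resource, the public
// access configuration is another, and the public URL exists only while the
// configuration is attached. All names here are hypothetical.
class PrivateBucket {
  constructor(name, accountId, region) {
    Object.assign(this, { name, accountId, region });
    this.publicConfig = null;                       // no public surface by default
  }
  attachPublicAccessConfig(globallyUniqueName) {
    this.publicConfig = { globallyUniqueName };     // the only globally unique thing
  }
  detachPublicAccessConfig() {
    this.publicConfig = null;
  }
  get publicUrl() {
    // The URL is a consequence of the configuration, not a property of storage
    return this.publicConfig
      ? `https://${this.publicConfig.globallyUniqueName}.s3.amazonaws.com`
      : null;
  }
}

const demoBucket = new PrivateBucket('mybucket', '123456789012', 'us-east-1');
console.log(demoBucket.publicUrl);                  // null — private by default
demoBucket.attachPublicAccessConfig('acme-public-assets');
console.log(demoBucket.publicUrl);                  // https://acme-public-assets.s3.amazonaws.com
demoBucket.detachPublicAccessConfig();
console.log(demoBucket.publicUrl);                  // null again — the bucket itself never changed
```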

&lt;p&gt;And the user-defined string, the bucket name, is preserved through the Public Bucket configuration. What is no longer true is that this string must be globally unique for the private bucket. That constraint was never load-bearing. It was just there because of the expectation of public usage.&lt;/p&gt;

&lt;p&gt;Custom HTTP domains using S3 website hosting, with CNAMEs pointing to &lt;code&gt;mybucket.s3-website-us-east-1.amazonaws.com&lt;/code&gt;, continue to work. The website configuration moves into the public access configuration resource; the &lt;code&gt;s3-website&lt;/code&gt; endpoint continues to exist as long as that configuration exists. No customer change is required.&lt;/p&gt;

&lt;p&gt;Because this functionality is separate, AWS can disable (and hopefully dismantle) in one huge swath all of the public features of S3 that are insecure by default, and lead new AWS accounts down the path of CloudFront for public access. If you need custom domains, TLS termination on your own domain, caching, WAF, HTTP/2, geographic restrictions, or edge functions — that's not an S3 question. That's a CDN question. And the answer is CloudFront as the reverse proxy with a private S3 bucket origin granted access via the Origin Access Control configuration.&lt;/p&gt;

&lt;p&gt;The bucket stays private. CloudFront has authorized access to it. Your users get a production-grade delivery layer with every security consideration you need. S3's job is to hold the bytes and serve them to one authenticated caller — the distribution. CloudFront's job is to serve those bytes to the world under your domain, your TLS certificate, your cache rules.&lt;/p&gt;

&lt;p&gt;This is already how every serious production setup works. The new model doesn't change that. It just makes it the only coherent option, instead of one option among several confusing ones.&lt;/p&gt;

&lt;h3&gt;
  
  
  A New CloudFront Opportunity​
&lt;/h3&gt;

&lt;p&gt;Presigned URLs have a structural limitation today that nobody talks about: the SigV4 signature is computed over the canonical request, which includes the Host header. And so the URL is signed against &lt;code&gt;mybucket.s3.amazonaws.com&lt;/code&gt;. Change the hostname and the signature fails. Which actually is a huge problem for CloudFront Functions when rerouting requests to a different origin (sometimes it works). This means custom domains for presigned URLs are impossible today. Every download link, every document export, every profile photo URL your product generates contains &lt;code&gt;s3.amazonaws.com&lt;/code&gt;. Your customers see your infrastructure provider in every URL. There is no way around it with the current model.&lt;/p&gt;

&lt;p&gt;The right fix is for CloudFront to gain first-class presigned URL support: the ability to validate SigV4 signatures on behalf of S3. If CloudFront can validate the signature, the URL can be generated against your CloudFront custom domain — with your ACM certificate, on your domain — and CloudFront handles the validation and the downstream request to S3. The signing mechanism doesn't change. The client code doesn't change. The SDK &lt;code&gt;GeneratePresignedURL&lt;/code&gt; call works identically, just against a different hostname. Ironically, CloudFront offers some partial functionality for Signed Request URLs and Signed Cookies, but these actually have a security hole because they don't include the same level of control that IAM policies provide. CloudFront + IAM would be a real game changer for Presigned URLs.&lt;/p&gt;




&lt;h2&gt;
  
  
  The S3 team's outstanding task​
&lt;/h2&gt;

&lt;p&gt;Now on to the easy but annoying part. AWS cannot simply remove the public bucket creation path. It isn't the millions of buckets in production that are the problem, but rather all the code paths that create buckets and then make assumptions about them. Some of those code paths were written by teams that no longer exist.&lt;/p&gt;

&lt;p&gt;Any migration strategy that requires customers to take action will fail in the long run. The path forward has to be one where the default behavior improves without requiring every customer to update their infrastructure. Something the current history of hacks hasn't gotten correct at all. (Although their folly resulted only in decreased security rather than broken configurations.)&lt;/p&gt;

&lt;p&gt;AWS can either trudge along with this currently broken S3 architecture riddled with pits of failures. Or they can admit they made a mistake and default all new accounts' buckets to not contain a public access strategy. This is actually the right thing to do, and they can do this safely as they have deprecated even whole AWS services before.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1 — New accounts, new defaults​
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Public S3 Buckets completely disabled by default, no website hosting, no ACLs, no bucket policies. All of these are blocked from usage without a support ticket. We don't need the public configuration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This doesn't break existing buckets. And new infrastructure gets the right defaults. The blast radius is almost zero. There are some AWS organizations out there that are dynamically creating S3 buckets in automatically provisioned new AWS accounts with assumptions based on how buckets work. When creating a new account and then a bucket in that account, they will see a problem. This just needs to be communicated.&lt;/p&gt;

&lt;p&gt;You might be thinking: couldn't there just be a magic flag on bucket creation that specifies that the bucket is account/region bound; call that flag &lt;code&gt;private: true&lt;/code&gt;. The problem is that removing the restriction to private buckets MUST BE OPT-OUT. &lt;code&gt;private: true&lt;/code&gt; makes the legacy insecure current state the default, and keeps privacy opt-in rather than public access opt-out. And therefore it still allows all the &lt;a href="https://www.lastweekinaws.com/podcast/aws-morning-brief/a-hole-in-the-s3-buckets/" rel="noopener noreferrer"&gt;bucket negligence awards&lt;/a&gt; that &lt;a href="https://www.linkedin.com/in/coquinn/" rel="noopener noreferrer"&gt;Corey&lt;/a&gt; is so keen on giving out. A flag is not sufficient; instead there needs to be a mature approach to the migration. Which is why the recommendation here is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Rename S3 everywhere to "S3 Public Bucket Configuration"&lt;/li&gt;
&lt;li&gt;Reintroduce S3 as a Private Bucket concept&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Phase 2 — AWS internal service updates​
&lt;/h3&gt;

&lt;p&gt;AWS has some internal work to do. Luckily most of the mess that was caused is squarely cornered into the S3 Public Bucket Configuration and none of it actually affects our new private bucket creation or usage. That means, after the rename, AWS can go back through all of their services and retarget all interactions with S3 to use the new Private S3 SDKs/API. This is squarely in their control.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;S3 Bucket Events + Lambda Event Source Mapping&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One area where there is a bit of a crossover is events, like S3 Events over SQS =&amp;gt; Lambda. But as discussed earlier that's actually a no-op. Similarly, Lambda Event Source Mapping (Lambda ESM), used for automatically polling SQS, is a non-issue. But the reason why is worth understanding. An ESM configuration is account-scoped in the first place. When you set up a Lambda trigger, you're making an authenticated API call inside your account: "Lambda function X should fire on events from bucket Y." The ESM record lives in your account. The bucket lives in your account. AWS resolves the bucket reference using the account context of that API call — not the public namespace.&lt;/p&gt;

&lt;p&gt;The current ESM ARN looks like &lt;code&gt;arn:aws:s3:::mybucket&lt;/code&gt; — no account ID, no region, because those were implicit in the global uniqueness guarantee. In the new model, &lt;code&gt;mybucket&lt;/code&gt; is a private identifier scoped to your account. The ARN format doesn't change. The resolution just shifts from "global name lookup" to "private identifier lookup within account context" — which AWS handles internally. No customer touches their ESM configuration. No ARN format changes. No trigger reconfiguration. Future ARN formats for the ESM should take the account ID and the bucket region, but AWS needs to maintain the global mapping table they already have that allows the account-less, region-less ESM bucket ARN to resolve the bucket in the specific region, in the correct specific account. In other words, ESM resource should accept either the global bucket naming strategy or the region-account local one.&lt;/p&gt;
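
&lt;p&gt;A hypothetical resolver for that dual-format acceptance might look like the following. The localized ARN format and the mapping-table shape are my invention for illustration; only the legacy &lt;code&gt;arn:aws:s3:::mybucket&lt;/code&gt; format exists today:&lt;/p&gt;

```javascript
// Hypothetical resolver for the dual ARN formats discussed above: the legacy
// global form (no account, no region) falls back to a global mapping table,
// while a localized form would carry the account and region explicitly.
const globalBucketTable = {            // stand-in for AWS's existing global name registry
  mybucket: { accountId: '123456789012', region: 'us-east-1' },
};

function resolveBucketArn(arn) {
  // arn:aws:s3:::mybucket  vs  arn:aws:s3:us-east-1:123456789012:mybucket
  const [, , service, region, accountId, name] = arn.split(':');
  if (service !== 's3') throw new Error(`not an S3 ARN: ${arn}`);
  if (region && accountId) {
    return { name, accountId, region };             // localized form: self-describing
  }
  const entry = globalBucketTable[name];            // legacy form: global name lookup
  if (!entry) throw new Error(`unknown global bucket name: ${name}`);
  return { name, ...entry };
}

console.log(resolveBucketArn('arn:aws:s3:::mybucket'));
console.log(resolveBucketArn('arn:aws:s3:eu-west-1:999999999999:otherbucket'));
```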

&lt;p&gt;The message here is: "Update your Event Source Mappings for Buckets so that they have the account ID or region specified". This might be the first ever &lt;code&gt;[Action Required]&lt;/code&gt; email that actually has a required action. Or maybe they'll just update Security Hub to include a finding to fix this, and an AWS Config rule that validates it with an automatic remediation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CloudFront S3 origin compatibility&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CloudFront is not part of S3's access model — it's a CDN that sits in front of a private S3 bucket, authorized via OAC. That already works today and obviously must continue to work in the new model. The only S3-specific change AWS needs to make is ensuring that CloudFront's S3 origin configuration resolves bucket references using the private identifier rather than the global name. Again, that is an internal AWS concern. No customer CloudFront configuration changes. I'm sure there is someone out there that is going to request that CloudFront have access to S3 buckets in another account. AWS can easily support a solution similar to the ESM one above: CloudFront accepts either the global S3 ARN or the account-region localized one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 3 — Configuration Split​
&lt;/h3&gt;

&lt;p&gt;Every existing S3 bucket is already the private half of the new model. Customers haven't been creating "public buckets" — they've been creating private buckets and then attaching public configuration to them in the form of ACLs, Block Public Access exemptions, Bucket Policies, and website hosting settings. The private bucket has always existed. What hasn't existed is the explicit separation exposed to AWS Account users. That starts now.&lt;/p&gt;

&lt;p&gt;Since the buckets themselves and the public access configuration don't actually change here, the only thing AWS has to do is backpopulate a list of S3 private buckets whose names will be exactly the same as the current PublicBucket names. The goal is that all AWS S3 buckets should be referenceable by their account-region localized ARN, and that the relevant console UI exists to display that. That's a script even Kiro could write in an afternoon.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Presigned URL configuration handling&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As argued above, Presigned URLs will already work out of the box, since the exact same problem has already been solved for literally every other resource in AWS. The one caveat is that there will likely need to be a new method, &lt;code&gt;GeneratePresignedBucketUrlForPrivateBucket&lt;/code&gt;, that includes the account ID and the region explicitly, so that the public bucket configuration isn't necessary to continue using that option. That's because the current method takes only the bucket name, not the account ID or the region.&lt;/p&gt;
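
&lt;p&gt;To illustrate, a sketch of what that hypothetical method might produce. The method name comes from the suggestion above, and the account- and region-qualified hostname scheme is invented for illustration; AWS would pick its own:&lt;/p&gt;

```javascript
// Sketch of the hypothetical GeneratePresignedBucketUrlForPrivateBucket:
// identical to today's presigning flow except the account ID and region are
// explicit in the hostname, so no globally unique bucket name is needed.
// The hostname scheme here is invented purely for illustration.
function generatePresignedBucketUrlForPrivateBucket({ bucket, key, accountId, region, query }) {
  // A region- and account-qualified endpoint removes any dependence on the
  // global namespace: only this account's `bucket` name can match.
  const host = `${bucket}.${accountId}.s3.${region}.amazonaws.com`;
  return `https://${host}/${key}?${query}`;   // query = the usual SigV4 parameters
}

const url = generatePresignedBucketUrlForPrivateBucket({
  bucket: 'mybucket', key: 'file.png',
  accountId: '123456789012', region: 'us-east-1',
  query: 'X-Amz-Expires=3600',
});
console.log(url); // https://mybucket.123456789012.s3.us-east-1.amazonaws.com/file.png?X-Amz-Expires=3600
```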

&lt;p&gt;The one exception is cross-account presigned URLs — an IAM identity in Account B generating URLs for a bucket that lives in Account A. I personally don't even know if this is possible, but technically I don't see why not. In this case, if we use the &lt;code&gt;X-Amz-Credential&lt;/code&gt; to determine the account, AWS would incorrectly assume the account is B (where the identity is) and not Account A (where the bucket actually lives). But AWS S3 has very competent architects, so I'll leave that challenge for them to solve (I can imagine using the same new GeneratePresigned method I just suggested above).&lt;/p&gt;

&lt;p&gt;It's also worth noting that potentially the presigned URL configuration could be an explicit resource you create when you need it similar to the public access. And by default just create it for all existing buckets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 4 — Deprecation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The best part of this design is that regarding deprecations there &lt;strong&gt;are none&lt;/strong&gt;! All we are actually doing is changing the names of some SDKs to improve readability, and really just the text in the UI. The only real change necessary here is going through all the docs and updating the content with more appropriate and clear naming.&lt;/p&gt;

&lt;p&gt;Most importantly, over time, the "public bucket" moniker will disappear entirely from the documentation as a concept, from customer usages, and most importantly from the news. And what replaces it? A private bucket with an explicit access configuration attached when needed. Two resources, two concerns, neither coupled to the other by default. The access model that caused two decades of breaches stops being something new engineers get to learn about.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Objections​
&lt;/h2&gt;

&lt;p&gt;Proposing a fundamental redesign of S3's control plane will attract objections. Here are the ones I felt like addressing:&lt;/p&gt;

&lt;h3&gt;
  
  
  What About SPA Websites?​
&lt;/h3&gt;

&lt;p&gt;The most common objection: "But I host my react/vue/solidjs app on S3 with website hosting enabled, and it works fine."&lt;/p&gt;

&lt;p&gt;It works, but it isn't correct architecture. Let's be precise about what is actually happening.&lt;/p&gt;

&lt;p&gt;Your S3 bucket is serving HTTP at &lt;code&gt;http://my-app.s3-website-us-east-1.amazonaws.com&lt;/code&gt;. Your domain is resolved in one of three ways:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option A — CNAME directly to the S3 website endpoint.&lt;/strong&gt; — You have no TLS. S3 website hosting is HTTP only — it has no mechanism to serve HTTPS for a custom domain. Your users therefore must be on HTTP, so this is not a viable production setup. It actually doesn't work at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option B — CloudFront in front.&lt;/strong&gt; — CloudFront handles TLS (via ACM), your custom domain, HTTP→HTTPS redirects, the &lt;code&gt;404 → /index.html&lt;/code&gt; behavior for client-side routing, cache headers, compression, and geographic distribution. S3 is behind CloudFront, serving bytes when requested.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option C — The website domain is the S3 URL&lt;/strong&gt; — You are freely passing out your S3 bucket URL to clients and asking them to remember that custom URL. Something is surely going to break some day, but nothing stopped you from doing it.&lt;/p&gt;

&lt;p&gt;A site that only uses S3 website hosting without CloudFront, serving plain HTTP, is not a counterexample. It is a broken site getting more broken by the day. &lt;a href="https://blog.chromium.org/2023/08/towards-https-by-default.html" rel="noopener noreferrer"&gt;Chrome announced in 2023 that it is moving towards HTTPS by default&lt;/a&gt;, automatically upgrading HTTP navigations to HTTPS. An S3 website serving HTTP gets upgraded to HTTPS by the browser, and since S3 cannot serve HTTPS on a custom domain, the request fails. Firefox has had an &lt;a href="https://support.mozilla.org/en-US/kb/https-only-prefs" rel="noopener noreferrer"&gt;HTTPS-Only Mode&lt;/a&gt; available since 2020 that blocks HTTP sites entirely. These are not future concerns. They are not esoteric. They are not nuanced. They are the current state of the web. A site that only works over HTTP is not a production website in 2026. It is a broken website that has not been maintained.&lt;/p&gt;

&lt;p&gt;Which means in every functional production scenario, S3 website hosting is doing nothing useful. CloudFront is handling everything. S3 is holding bytes.&lt;/p&gt;

&lt;p&gt;Therefore, in Option B, which is every production SPA, S3 website hosting is contributing nothing. CloudFront is doing all the work that makes the setup viable. The bucket does not need to be public. Website hosting does not need to be enabled. The only reason engineers enable website hosting is that they are following a tutorial that predates CloudFront's ability to serve private S3 buckets, and nobody told them the tutorial was outdated. Or more likely, someone did, but they didn't listen.&lt;/p&gt;

&lt;p&gt;CloudFront likely has been able to serve our new Private S3 bucket concept since &lt;a href="https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/private-content-restricting-access-to-s3.html" rel="noopener noreferrer"&gt;Origin Access Control (OAC)&lt;/a&gt; replaced the older Origin Access Identity (OAI) approach. OAC supports server-side encrypted buckets, covers all S3 regions, and signs requests to private S3 using SigV4. Even before OAC, your bucket never needed to be public.&lt;/p&gt;

&lt;p&gt;There could be a concern that CloudFront doesn't know how to talk to anything other than a public S3 bucket or a public URL. But interestingly enough, &lt;a href="https://aws.amazon.com/blogs/aws/introducing-amazon-cloudfront-vpc-origins-enhanced-security-and-streamlined-operations-for-your-applications/" rel="noopener noreferrer"&gt;CloudFront now also supports private origins via ALB with VPC origins&lt;/a&gt;, which closes the last remaining scenario where direct public exposure might have been argued as necessary. You can run your origin entirely inside a VPC, with no public exposure, and serve it through CloudFront. The gap is gone.&lt;/p&gt;

&lt;p&gt;And the "CloudFront costs more" objection doesn't land either. CloudFront has a free tier: 1 TB of data transfer per month, 10 million HTTP requests, and 2 million CloudFront function invocations. A landing page or documentation site that fits in an S3 bucket almost certainly fits within that free tier, and even if it doesn't, at scale you are still getting the benefit of the cost reduction.&lt;/p&gt;

&lt;p&gt;A complexity argument would be more interesting. Setting up a CloudFront distribution requires more steps than enabling S3 website hosting. That is true. But the complexity exists either way; it is just hidden. You still need TLS. You still need the &lt;code&gt;index.html&lt;/code&gt; routing behavior for client-side routing (or a more expensive CloudFront function). You still end up at CloudFront. The engineers who skip it are the ones serving HTTP from a subdomain with no TLS, which screams for a denial-of-wallet attack.&lt;/p&gt;

&lt;p&gt;And for users who genuinely have not set up CloudFront, a la &lt;strong&gt;Option C&lt;/strong&gt;: the AWS S3 migration plan already answers this. The configuration split means existing public buckets keep their public access configuration intact, so those sites keep working. The owner does nothing. When they are ready to do it correctly, the options are available.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bucket Origin Responses​
&lt;/h3&gt;

&lt;p&gt;There is one thing I left out, and I didn't want to bring this up because it's annoying, but I'm sure someone will call me out on it.&lt;/p&gt;

&lt;p&gt;When you set up S3 as an origin for your CloudFront distribution, you might need to control the response headers. Historically, you were not able to configure anything in CloudFront, let alone do it dynamically. And so using S3 to set the CORS policies or other security policies was required. Now, however, CloudFront offers response headers policies, and while that isn't everything, even S3 isn't sufficient for specifying all the relevant headers. While I don't love it, for Authress, we have a CloudFront Function attached to every response. There is a performance hit and a cost hit to do this on literally every S3 related request. But arguably it is a small price to pay to have CloudFront do the thing it should have been doing all along, rather than saving this configuration in S3 where it doesn't belong. Maybe AWS could be nice and still offer this configuration in S3, or be nice and add this as an option to CloudFront, or be nice and make CloudFront Functions even cheaper; because why not, API Gateway Velocity templates are free after all!&lt;/p&gt;
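
&lt;p&gt;For reference, the pattern looks roughly like this: a CloudFront Function on the viewer-response event that stamps security headers onto every response. The specific headers and values are illustrative, not Authress's actual configuration:&lt;/p&gt;

```javascript
// Minimal CloudFront Function for the viewer-response event that sets
// security headers on every response, instead of storing that configuration
// on the S3 origin. Header choices here are illustrative examples only.
function handler(event) {
  var response = event.response;
  // CloudFront Functions use lowercase header names with a { value } wrapper
  response.headers['strict-transport-security'] = { value: 'max-age=63072000; includeSubDomains; preload' };
  response.headers['x-content-type-options'] = { value: 'nosniff' };
  response.headers['content-security-policy'] = { value: "default-src 'none'; img-src 'self'" };
  return response;
}

// Local smoke test with a mock event; at the edge, CloudFront supplies this shape
const result = handler({ response: { statusCode: 200, headers: {} } });
console.log(result.headers['x-content-type-options'].value); // nosniff
```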

&lt;h3&gt;
  
  
  You're asking AWS to blow up a working control plane​
&lt;/h3&gt;

&lt;p&gt;Yes. That is what a migration looks like. The alternative is two more decades of incremental patches, each one adding more surface area and more documentation burden without touching the underlying design, and worst of all, still enables a massive pit of failure.&lt;/p&gt;

&lt;p&gt;The control plane does not need to be blown up for customers. The translation layer proposal in the previous section means existing workloads continue working. What needs to change is the model exposed to new infrastructure — the primitives developers learn, the defaults they encounter, and the architecture that tutorials recommend.&lt;/p&gt;

&lt;p&gt;AWS has done this before. The IAM role model replaced key-based authentication for most AWS-to-AWS access patterns. AWS IAM Identity Center replaces IAM roles for organizations and SSO. CloudFront Origin Access Control replaced Origin Access Identity. None of these replacements was instantaneous, and none broke existing workloads. The old model continued working through a maintained compatibility layer while the new model became the default for anything new.&lt;/p&gt;

&lt;p&gt;The objection treats "existing behavior must never change" and "defaults must never improve" as the same thing. They are not.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Better Announcement​
&lt;/h2&gt;

&lt;p&gt;The Account Regional Namespaces announcement solves one real problem, the name collisions, using an opt-in mechanism with a 26-character tax on your bucket names and tooling that requires SDK and CloudFormation updates to remain portable. &lt;strong&gt;But it has zero impact on the access model that causes actual harm.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The right announcement would have looked like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The best feature ever, Private Buckets: account-regional namespaces are the default&lt;/strong&gt; — for all new bucket creation: no suffix, no opt-in, just the natural behavior that every engineer already wanted is now the expectation. Change nothing, get all the value.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The recommendation for public content: A managed CloudFront promotion layer&lt;/strong&gt; — as the only path to public content, surfaced as a first-class feature with its own console workflow, not a best practice buried in the CloudFront documentation. For some reason AWS loves improving their console, and it still surprises me for how many organizations ClickOps isn't just a migration strategy but a business-critical one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backwards compatibility is still and always will work&lt;/strong&gt; — Legacy ACLs and direct public bucket access still exist — but as of today they are deprecated and require a support ticket to activate. The on-ramp is gone. The escape hatch remains, for now.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead, we got a feature that requires you to append &lt;code&gt;-123456789012-us-east-1-an&lt;/code&gt; to your bucket names, a second feature that lets your SDK dynamically resolve that suffix from the execution environment, and a wave of blog posts explaining how to wire these two features together. And of course we still have to wait for &lt;strong&gt;your-favorite-tool™&lt;/strong&gt; to implement this functionality.&lt;/p&gt;

&lt;p&gt;This is not a fix. It is a patch on top of a patch, with new documentation for how to apply both patches correctly. AWS has a long history of excellent engineering, but I don't consider this new functionality to be part of it.&lt;/p&gt;

&lt;p&gt;The gap between "what was shipped" and "what would fix the problem" is not subtle. It is not a matter of resources or engineering difficulty. Name collisions, the problem I can only imagine customers have been filing tickets about for years, were partially addressed. But the access model that will still cause actual harm was not.&lt;/p&gt;

&lt;p&gt;Until the access model changes, the endless stream of conflicting advice will remain out there on the internet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;For help understanding this article or how you can implement auth
and similar security architectures in your services, feel free to 
reach out to us via the community server.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://authress.io/community" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Join the community&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>aws</category>
      <category>security</category>
      <category>cloud</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Securing CI/CD Access to AWS</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Tue, 03 Mar 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/aws-builders/securing-cicd-access-to-aws-1ib7</link>
      <guid>https://dev.to/aws-builders/securing-cicd-access-to-aws-1ib7</guid>
      <description>&lt;p&gt;I've seen a lot of complex tooling in my experience, but by far the worst case is designing just one more tool to do something. Especially in the age where software is free, we become burdened by &lt;em&gt;just one more tool&lt;/em&gt;. We know at Authress that &lt;a href="https://authress.io/knowledge-base/articles/2025/11/01/how-we-prevent-aws-downtime-impacts" rel="noopener noreferrer"&gt;increased complexity =&amp;gt; increased failure rate&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The solution is to utilize the tools we already have, just a little bit better. In this case — &lt;em&gt;"just a little bit better"&lt;/em&gt; — is adding a trivial amount to your existing AWS built-in technologies, and doing it in a way that you won't even need to add extra management overhead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;For help understanding this article or how you can implement auth
 and similar security architectures in your services, feel free to 
reach out to us via the community server.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://authress.io/community" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Join the community&lt;/a&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  ❌ The Wrong Way​
&lt;/h2&gt;

&lt;p&gt;There are lots of ways this could have gone wrong. In fact, if you ask any of the &lt;em&gt;"Reasoning LLMs"&lt;/em&gt;, and are unlucky enough not to be told &lt;strong&gt;IDK&lt;/strong&gt;, you will find out things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deploy a Lambda Function to every account - Don't do that.&lt;/li&gt;
&lt;li&gt;List all the accounts in a CFN template mapping - You will run out of template space, especially if you have more than a couple of AWS Accounts or GitHub/GitLab accounts. It often requires a complex, chunked &lt;code&gt;Fn::Or&lt;/code&gt; chain just to fit in the template in the first place, assuming you don't hit the 200-key mapping limit.&lt;/li&gt;
&lt;li&gt;Using a CloudFormation Parameter - You aren't going to know the AWS Account up front anyway; I don't even know how this was going to work, assuming you don't hit the 4096-character limit for parameter values.&lt;/li&gt;
&lt;li&gt;Creating a CloudFormation Macro - For a moment a Macro sounds like a good answer, until you realize that OU Stack Sets aren't allowed to use Transforms, which Macros require.&lt;/li&gt;
&lt;li&gt;Using a CFN Module - I'm actually surprised none of the LLMs came up with this solution, but the problem is that it will still deploy a lambda function into every account.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At least the lambda function in every account would work, but it isn't clean: you'll get a lambda in every account, and potentially also every region, each of which comes with at least one IAM role, a CloudWatch Logs Group, and who knows what else.&lt;/p&gt;

&lt;p&gt;Someone out there is probably saying &lt;em&gt;"Why aren't you using OpenTofu for that"&lt;/em&gt;, I'll leave that as a challenge for the reader to answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Complete Design​
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxm596o93iymrwyprzf3b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxm596o93iymrwyprzf3b.png" alt="Securing Access to AWS via GitLab OU StackSet Architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The design is quite straightforward.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Deploy a Lambda Function to the AWS Management Account which contains the list of permissions for each account.&lt;/li&gt;
&lt;li&gt;Deploy an OU StackSet which uses a Custom Resource to call the lambda function in the management account, to fetch the list.&lt;/li&gt;
&lt;li&gt;The list is persisted in a GitLab assumable IAM Role&lt;/li&gt;
&lt;li&gt;GitLab assumes the role at deployment&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  🔒 AWS Account Permissions Lambda Function​
&lt;/h2&gt;

&lt;p&gt;Let's do the easy part first. Of course we want to define the permissions somewhere. Since we are using GitLab, what we actually want to do is define, for each AWS account, which GitLab projects (and their branches) can be used to access that AWS account. At the top here, we'll define the permissions. And at the bottom, we'll receive the account ID from the caller and use it to pull the correct permissions out of the map.&lt;/p&gt;

&lt;p&gt;Permissioning Lambda Function&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;accountPermissionsMap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="mi"&gt;000000000000&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;project_path:authress/automation/*:ref_type:*:ref:*&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="mi"&gt;111111111111&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;project_path:side-projects/*:ref_type:*:ref:*&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sendResponse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;Status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;Reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;PhysicalResourceId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;logStreamName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;StackId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;StackId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;RequestId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;RequestId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;LogicalResourceId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;LogicalResourceId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ResponseURL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PUT&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Content-Length&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="nx"&gt;body&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="nx"&gt;exports&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;RequestType&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Delete&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sendResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SUCCESS&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;accountId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ResourceProperties&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;AccountId&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;permissions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;accountPermissionsMap&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;accountId&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sendResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SUCCESS&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;GitLabProjects&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;permissions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;,&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Event:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Error:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sendResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;FAILED&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
      &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  🟢 Deploying the Lambda Function
&lt;/h2&gt;

&lt;p&gt;Management Account: CloudFormation Template&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// First load the lambda function from the lambda function const handlerCode = await fs.readFile(path.join(__dirname, './fetchPermissionsLambdaFunction.js'), 'utf8');&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;AWSTemplateFormatVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;2010-09-09&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;Parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;OrganizationId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;String&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;The organization&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;Resources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;GlobalConfigLookupRole&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AWS::IAM::Role&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;RoleName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;OU-StackSet-GlobalConfigLookup&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;AssumeRolePolicyDocument&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;Version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;2012-10-17&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;Statement&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
              &lt;span class="na"&gt;Effect&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Allow&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="na"&gt;Principal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="na"&gt;Service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;lambda.amazonaws.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
              &lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sts:AssumeRole&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
          &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;ManagedPolicyArns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
          &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;arn:aws:iam::aws:policy/service-role/
           AWSLambdaBasicExecutionRole&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;GlobalConfigLookupLogGroup&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AWS::Logs::LogGroup&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;LogGroupName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/aws/lambda/OU-StackSet-GlobalConfigLookup&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;RetentionInDays&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;

    &lt;span class="na"&gt;GlobalConfigLookupFunction&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AWS::Lambda::Function&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;FunctionName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;OU-StackSet-GlobalConfigLookup&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;Runtime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;nodejs24.x&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;Handler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;index.handler&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;Role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Fn::GetAtt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
          &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;GlobalConfigLookupRole&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Arn&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;MemorySize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1769&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;Timeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;Code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;ZipFile&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;handlerCode&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;LoggingConfig&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;LogFormat&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Text&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;LogGroup&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;GlobalConfigLookupLogGroup&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;GlobalConfigLambdaPermission&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AWS::Lambda::Permission&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;FunctionName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;GlobalConfigLookupFunction&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;lambda:InvokeFunction&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;Principal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;*&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;PrincipalOrgID&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;OrganizationId&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;Outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;GlobalConfigLookupFunction&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Fn::GetAtt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;GlobalConfigLookupFunction&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Arn&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;Export&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;GlobalConfigLookupLambdaArn&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  ▶️ Utilize the Lambda Function
&lt;/h2&gt;

&lt;p&gt;Then we update the member stack to utilize this lambda function, and create the correct IAM Role.&lt;/p&gt;

&lt;p&gt;OU StackSet Member Account: CloudFormation Template&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Pull the values in the Lambda Function&lt;/span&gt;
  &lt;span class="nl"&gt;GlobalConfiguration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Custom::GlobalConfiguration&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;ServiceToken&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;globalConfigurationLambdaArn&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;AccountId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AWS::AccountId&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;

  &lt;span class="c1"&gt;// The IAM Role for GitHub to utilize&lt;/span&gt;
  &lt;span class="nx"&gt;GitLabRunnerRole&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nl"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AWS::IAM::Role&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;Properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nl"&gt;RoleName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Fn::Sub&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;GitLabRunnerRole&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="nx"&gt;MaxSessionDuration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="nx"&gt;AssumeRolePolicyDocument&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nl"&gt;Version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;2012-10-17&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nx"&gt;Statement&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
          &lt;span class="na"&gt;Effect&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Allow&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;Principal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;Federated&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Fn::Sub&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;arn:aws:iam::${AWS::AccountId}:oidc-provider/gitlab.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
          &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sts:AssumeRoleWithWebIdentity&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;Condition&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;StringEquals&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gitlab.com:aud&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://gitlab.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="na"&gt;StringLike&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
              &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gitlab.com:sub&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Fn::Split&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;,&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Fn::GetAtt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;GlobalConfiguration&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;GitLabProjects&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt;
              &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
          &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}]&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;

  &lt;span class="c1"&gt;// Then register the GitLab OIDC Provider to&lt;/span&gt;
  &lt;span class="c1"&gt;//   allow GitLab to actually assume the role&lt;/span&gt;
  &lt;span class="nx"&gt;GitLabOIDCProvider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nl"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AWS::IAM::OIDCProvider&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;Properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nl"&gt;ClientIdList&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://gitlab.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="nx"&gt;Url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://gitlab.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
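&lt;p&gt;Why the &lt;code&gt;join(',')&lt;/code&gt; in the lambda and the &lt;code&gt;Fn::Split&lt;/code&gt; here? The value fetched via &lt;code&gt;Fn::GetAtt&lt;/code&gt; from the custom resource arrives as one flat string, so the list is serialized with a comma on one side and deserialized on the other before it lands in the &lt;code&gt;StringLike&lt;/code&gt; condition. A minimal sketch of the round trip (the example values are placeholders):&lt;/p&gt;

```javascript
// The lambda joins the permission list into one attribute string...
const permissions = [
  'project_path:authress/automation/*:ref_type:*:ref:*',
  'project_path:side-projects/*:ref_type:*:ref:*'
];
const attribute = permissions.join(','); // what Fn::GetAtt hands back

// ...and Fn::Split in the member template restores the list for StringLike.
const restored = attribute.split(',');
console.log(restored.length); // 2
```

&lt;p&gt;This only works because the GitLab &lt;code&gt;sub&lt;/code&gt; patterns themselves never contain commas, so the comma is a safe delimiter here.&lt;/p&gt;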



&lt;h2&gt;
  🏁 Run the Deployment
&lt;/h2&gt;

&lt;p&gt;One piece that might not be so obvious is how we actually deploy that Member Account CloudFormation Template to all the AWS accounts in our AWS Organization. For that, we use an AWS Organizations OU StackSet. The StackSet automatically deploys the template to every AWS account in the OU, in every region we specify.&lt;/p&gt;
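&lt;p&gt;For reference, the inputs such a service-managed StackSet needs look roughly like this. This is a sketch only: the stack set name, OU id, and region list are placeholders, and these objects would be handed to CloudFormation's &lt;code&gt;CreateStackSet&lt;/code&gt; and &lt;code&gt;CreateStackInstances&lt;/code&gt; APIs rather than executed directly:&lt;/p&gt;

```javascript
// Sketch of the CreateStackSet / CreateStackInstances inputs for an OU StackSet.
const stackSetParams = {
  StackSetName: 'gitlab-runner-role',              // placeholder name
  TemplateBody: '...member account template...',   // the template from above
  PermissionModel: 'SERVICE_MANAGED',              // AWS manages the cross-account roles
  AutoDeployment: { Enabled: true, RetainStacksOnAccountRemoval: false },
  Capabilities: ['CAPABILITY_NAMED_IAM']           // the template creates named IAM roles
};

const stackInstancesParams = {
  StackSetName: stackSetParams.StackSetName,
  DeploymentTargets: { OrganizationalUnitIds: ['ou-xxxx-xxxxxxxx'] }, // placeholder OU
  Regions: ['us-east-1']                           // placeholder region list
};
console.log(stackInstancesParams.Regions.length); // 1
```

&lt;p&gt;With &lt;code&gt;AutoDeployment&lt;/code&gt; enabled, accounts later added to the OU pick up the stack automatically, which is the whole point of targeting the OU rather than individual accounts.&lt;/p&gt;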

&lt;p&gt;Deploy OU StackSet&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;OrganizationsClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;DescribeOrganizationCommand&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@aws-sdk/client-organizations&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;AwsArchitect&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-architect&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OrganizationsClient&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;us-east-1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Organization&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;DescribeOrganizationCommand&lt;/span&gt;&lt;span class="p"&gt;({}));&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;parameters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;organizationId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Organization&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Id&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;awsArchitect&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AwsArchitect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;packageMetadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;deploymentResult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt;
  &lt;span class="nx"&gt;awsArchitect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;deployTemplate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;globalConfigurationTemplate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;stackConfiguration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;GlobalConfigurationLambdaArn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="nx"&gt;deploymentResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;o&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
    &lt;span class="nx"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ExportName&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;GlobalConfigLookupLambdaArn&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;OutputValue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;memberParameters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  &lt;span class="nx"&gt;GlobalConfigurationLambdaArn&lt;/span&gt;  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;awsArchitect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;configureStackSetForAwsOrganization&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;memberAccountTemplate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;orgStackConfiguration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;memberParameters&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the best part of this is that the Lambda function is extensible, so you can include a full configuration in S3 or anything else that you might want to persist in the management account's git repository.&lt;br&gt;
&lt;/p&gt;
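&lt;p&gt;To illustrate the extension point, here's a minimal sketch of what the lookup could look like (the bucket name, key layout, and default values are purely illustrative assumptions, not our actual implementation):&lt;/p&gt;

```javascript
// Hypothetical sketch of an extensible global-configuration lookup.
// The bucket name, key layout, and defaults below are illustrative.

// Account-level overrides win over organization-wide defaults.
function mergeConfiguration(defaults, override) {
  return Object.assign({}, defaults, override);
}

// The S3 fetcher is injected so the lookup logic stays testable;
// in the Lambda it would wrap GetObjectCommand from @aws-sdk/client-s3.
async function lookupConfiguration(s3GetJson, accountId) {
  const defaults = { logRetentionDays: 90 };          // assumed default
  const override = await s3GetJson(
    'management-account-global-config',               // assumed bucket
    'accounts/' + accountId + '.json');               // assumed key layout
  return mergeConfiguration(defaults, override || {});
}
```

&lt;p&gt;Returning the merged object from the Lambda keeps member accounts decoupled from where the configuration actually lives.&lt;/p&gt;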

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;For help understanding this article or how you can implement auth
 and similar security architectures in your services, feel free to 
reach out to us via the community server.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://authress.io/community" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Join the community&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>aws</category>
      <category>gitlab</category>
      <category>github</category>
      <category>cicd</category>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Fri, 07 Nov 2025 20:41:43 +0000</pubDate>
      <link>https://dev.to/wparad/-5ach</link>
      <guid>https://dev.to/wparad/-5ach</guid>
      <description>&lt;p&gt;Boosted: &lt;a href="https://dev.to/aws-builders/how-when-aws-was-down-we-were-not-4nel"&gt;How when AWS was down, we were not&lt;/a&gt;, posted for AWS Community Builders.&lt;/p&gt;</description>
      <category>aws</category>
      <category>reliability</category>
      <category>architecture</category>
      <category>serverless</category>
    </item>
    <item>
      <title>How when AWS was down, we were not</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Fri, 07 Nov 2025 00:00:00 +0000</pubDate>
      <link>https://dev.to/aws-builders/how-when-aws-was-down-we-were-not-4nel</link>
      <guid>https://dev.to/aws-builders/how-when-aws-was-down-we-were-not-4nel</guid>
      <description>&lt;h2&gt;
  
  
  🚨 AWS us-east-1 is down!
&lt;/h2&gt;

&lt;p&gt;One of the most massive AWS incidents transpired on &lt;a href="https://aws.amazon.com/message/101925/" rel="noopener noreferrer"&gt;October 20th&lt;/a&gt;. Long story short, the DNS for DynamoDB was impacted in &lt;code&gt;us-east-1&lt;/code&gt;, which created a health event for the entire region. It's the worst incident we've seen in a decade. &lt;a href="https://aws.amazon.com/message/101925/" rel="noopener noreferrer"&gt;Disney+&lt;/a&gt;, &lt;a href="https://aws.amazon.com/message/101925/" rel="noopener noreferrer"&gt;Lyft&lt;/a&gt;, &lt;a href="https://aws.amazon.com/message/101925/" rel="noopener noreferrer"&gt;McDonald's&lt;/a&gt;, &lt;a href="https://aws.amazon.com/message/101925/" rel="noopener noreferrer"&gt;New York Times&lt;/a&gt;, &lt;a href="https://aws.amazon.com/message/101925/" rel="noopener noreferrer"&gt;Reddit&lt;/a&gt;, and the &lt;a href="https://www.cnbc.com/2025/10/20/amazon-web-services-outage-takes-down-major-websites.html" rel="noopener noreferrer"&gt;list goes on&lt;/a&gt; were lining up to claim their share of the spotlight too. And we've been watching, because our product is part of our customers' critical infrastructure. This one graph of the event says it all:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvggvlkoss7qldqlcj5is.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvggvlkoss7qldqlcj5is.png" alt="Route 53 Health Check result where us-east-1 is down"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The AWS &lt;a href="https://aws.amazon.com/message/101925/" rel="noopener noreferrer"&gt;post-incident report&lt;/a&gt; indicates that at 7:48 PM UTC DynamoDB had &lt;em&gt;"increased error rates"&lt;/em&gt;. But this article isn't about AWS; instead I want to share &lt;strong&gt;how exactly we were still up when AWS was down.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now you might be thinking: &lt;strong&gt;&lt;em&gt;why are you running infra in us-east-1?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And it's true, almost no one should be using us-east-1, unless, well, of course, you are us. And that's because we end up running our infrastructure where our customers are. In theory, practice and theory are the same, but in practice they differ. And if our (or your) customers chose &lt;code&gt;us-east-1&lt;/code&gt; in AWS, then realistically, we (or you) are choosing us-east-1 too 😅.&lt;/p&gt;

&lt;p&gt;During this time, us-east-1 was offline, and while we only run a limited amount of infrastructure in the region, we have to run it there because we have customers who want it there. And even without a direct dependency on &lt;code&gt;us-east-1&lt;/code&gt;, there are critical services in AWS — CloudFront, Certificate Manager, Lambda@Edge, and IAM — that all have their control planes in that region. Attempts to create distributions or roles at that time were also met with significant issues.&lt;/p&gt;

&lt;p&gt;Since there are plenty of articles in the wild talking about &lt;a href="https://newsletter.pragmaticengineer.com/p/what-caused-the-large-aws-outage" rel="noopener noreferrer"&gt;what actually happened&lt;/a&gt;, &lt;a href="https://www.crn.com/news/cloud/2025/aws-15-hour-outage-5-big-ai-dns-ec2-and-data-center-keys-to-know" rel="noopener noreferrer"&gt;why it happened&lt;/a&gt;, and &lt;a href="https://www.theregister.com/2025/10/20/aws_outage_amazon_brain_drain_corey_quinn/" rel="noopener noreferrer"&gt;why it will continue to happen&lt;/a&gt;, I don't need to go into it here. Instead, I'm going to share a deep dive into exactly what we've built to avoid these exact issues, and what you can do for your applications and platforms as well. In this article, I'll review how we maintain a high SLI to match our SLA &lt;strong&gt;reliability&lt;/strong&gt; commitment even when the infrastructure and services we use don't.&lt;/p&gt;

&lt;h2&gt;
  
  
  📖 What is reliability?
&lt;/h2&gt;

&lt;p&gt;Before I get to the part where I share how we built one of the most reliable &lt;a href="https://authress.io/knowledge-base/articles/auth-situation-report" rel="noopener noreferrer"&gt;auth solutions&lt;/a&gt; available, I want to define reliability. And for us, that's an SLA of five nines. I think that's so extraordinary that the question I want you to keep in mind through this article is: &lt;strong&gt;is that actually possible?&lt;/strong&gt; Is it really achievable to have a service with a five-nines SLA? When I say five nines, I mean that 99.999% of the time, our service is up and running as expected by our customers. And to put this into perspective, the red, in the sea of blue, represents just how much time we can be down.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fycqnaqlwj191gojou7co.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fycqnaqlwj191gojou7co.png" alt="What does 5 nines look like"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And if you can't see it, it's hiding inside this black dot. It amounts to just five minutes and 15 seconds per year. This pretty much means we have to be up all the time, providing responses and functionality exactly as our customers expect.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flakzl3rd6nagppzjqfnd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flakzl3rd6nagppzjqfnd.png" alt="5 nines on the timescale of a year"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🤔 But why?
&lt;/h2&gt;

&lt;p&gt;To put it into perspective, it's important to share for a moment the specific challenges that we face, why we built what we built, and of course why that's relevant. To do that, I need to include some details about what we're building — what &lt;a href="https://authress.io" rel="noopener noreferrer"&gt;Authress actually does&lt;/a&gt;. Authress provides login and access control for the software applications that you write — it generates JWTs for your applications. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User authentication and authorization&lt;/li&gt;
&lt;li&gt;User identities&lt;/li&gt;
&lt;li&gt;Granular role and resource-based authorization (ReBAC, ABAC, TBAC, RBAC, etc...)&lt;/li&gt;
&lt;li&gt;API keys for your technical customers to interact with your own APIs&lt;/li&gt;
&lt;li&gt;Machine-to-machine authentication for your services — if you have a microservice architecture.&lt;/li&gt;
&lt;li&gt;Audit trails to track the permission changes within your services or expose this to your customers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And there are of course many more components that help complete a full auth platform, but they aren't totally relevant to this article, so I'm going to skip over them.&lt;/p&gt;

&lt;p&gt;With that, you may already start to be able to see why uptime is so critical for us. &lt;strong&gt;We're on the critical path for our customers&lt;/strong&gt;. It's not inherently true for every single platform, but it is for us. So if our solution is down, then our customer applications are down as well.&lt;/p&gt;

&lt;p&gt;If we put the reliability part in the back corner for one second and just think about the features, we can theorize about a potential initial architecture. That is, an architecture that just focuses on the features: how might you build this out as simply as possible? I want to do this so I can explain all the issues that we would face with the simple solution.&lt;/p&gt;

&lt;p&gt;Maybe you've got a single region, and in that region you have some sort of HTTP router that handles requests and forwards them to some compute: serverless, a container, a virtual machine, or — and I'm very sorry for this scenario — bare metal. Lastly, you're interacting with some database (NoSQL, SQL, or something else), file storage, and maybe some async components.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1bl7923pf44h0g8xpg8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1bl7923pf44h0g8xpg8.png" alt="The simplest auth architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you take a look at this, it's probably obvious to you (and everyone else) that there is no way it is going to meet our reliability needs. But we have to ask, just exactly how often will there actually be a problem with this architecture? Just building out complexity doesn't directly increase reliability; we need to focus on why this architecture would fail. For us, we use AWS, so I look to the Amazon CTO for guidance, and he's famously quoted as saying, &lt;em&gt;&lt;strong&gt;Everything fails all the time&lt;/strong&gt;&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;And AWS's own services are no exception to this. Over the last decade, we've seen numerous incidents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2014 - Ireland (Partial) - Hardware - Transformer failed - EC2, EBS, and RDS&lt;/li&gt;
&lt;li&gt;2016 - Sydney (Partial) - Severe Weather - Power Loss - All Services&lt;/li&gt;
&lt;li&gt;2017 - All Regions - Human error - S3 critical servers deleted - S3&lt;/li&gt;
&lt;li&gt;2018 - Seoul Region - Human error - DNS resolvers impacted - EC2&lt;/li&gt;
&lt;li&gt;2021 - Virginia - Traffic Scaling - Network Control Plane outage - All Services&lt;/li&gt;
&lt;li&gt;2021 - California - Traffic Scaling - Network Control Plane outage - All Services&lt;/li&gt;
&lt;li&gt;2021 - Frankfurt (Partial) - Fire - Fire Suppression System issues - All Services&lt;/li&gt;
&lt;li&gt;2023 - Virginia - Kinesis issues - Scheduling Lambda Invocations impact - Lambda&lt;/li&gt;
&lt;li&gt;2023 - Virginia - Networking issues - Operational issue - Lambda, Fargate, API Gateway…&lt;/li&gt;
&lt;li&gt;2023 - Oregon (Partial) - Error rates - DynamoDB + 48 services&lt;/li&gt;
&lt;li&gt;2024 - Singapore (Partial) - EC2 Autoscaling - EC2&lt;/li&gt;
&lt;li&gt;2024 - Virginia (Partial) - Describe API Failures ECS - ECS + 4 services&lt;/li&gt;
&lt;li&gt;2024 - Brazil - ISP issues - CloudFront connectivity - CloudFront&lt;/li&gt;
&lt;li&gt;2024 - Global - Network connectivity - STS Service&lt;/li&gt;
&lt;li&gt;2024 - Virginia - Message size overflow - Kinesis down - Lambda, S3, ECS, CloudWatch, Redshift&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;2025 - Virginia - DynamoDB DNS - DynamoDB down - All Services&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And any one of these would have caused major problems for us and therefore our customers. And the frequency of incidents is actually increasing over time. This shouldn't be a surprise, right? Cloud adoption is increasing over time. The number of services AWS is offering is also increasing. But how impactful are these events? Would any single one of them have been a problem for us to actually reach our SLA promise? What would happen if we just trusted AWS and used that to pass through our commitments? Would it be sufficient to achieve a 99.999% SLA uptime? Well, let's take a look.&lt;/p&gt;

&lt;h2&gt;
  
  
  🕰️ AWS SLA Commitments
&lt;/h2&gt;

&lt;h4&gt;
  
  
  The AWS Lambda SLA is below 5 nines
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6m3ldwp23hcajpy14nus.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6m3ldwp23hcajpy14nus.png" alt="Lambda SLA"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  The API Gateway SLA is below 5 nines
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2dpoacaud5b5uk4ejwqj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2dpoacaud5b5uk4ejwqj.png" alt="API Gateway SLA"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  The AWS SQS SLA is below 5 nines
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsprsphzuy3m0dkxykhvu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsprsphzuy3m0dkxykhvu.png" alt="SQS SLA"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Okay, so when it comes to trusting AWS SLAs, it isn't sufficient. At. All.&lt;/p&gt;

&lt;p&gt;We can't just use the components that are offered by AWS, and go from there. We fundamentally need to do something more than that. So the question becomes, what exactly must a dependency's reliability be in order for us to utilize it? To answer that question, it's time for a math lesson. Or more specifically, everyone's favorite topic, &lt;strong&gt;probabilities&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let's quickly get through this &lt;del&gt;torture&lt;/del&gt; exercise. Fundamentally, you have endpoints in your service, and you get in an HTTP request, and it interacts with some third-party component or API, and then you write the result to a database. For us, this could be an integration such as &lt;strong&gt;logging in with Google&lt;/strong&gt; or with &lt;strong&gt;Okta&lt;/strong&gt; for our customers' enterprise customers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdykoe5fmi96q1y2163eu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdykoe5fmi96q1y2163eu.png" alt="Third-party Failure Rate"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  💻 Calculating the allowed failure rate
&lt;/h2&gt;

&lt;p&gt;So if we want to meet a 5-nines reliability promise, how unreliable could this third-party component actually be? What happens if this component out of the box is only 90% reliable? We'll design a strategy for getting around that.&lt;/p&gt;

&lt;p&gt;Uptime is a product of all of the individual probabilities:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs1bifce31ue0p8vomzy3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs1bifce31ue0p8vomzy3.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the sake of this example, we'll just assume that every other component in our architecture is 100% reliable — that's every line of code we've written, no bugs ever in our library dependencies, or transitive library dependencies, or the dependencies' dependencies' dependencies, and everything always works exactly as we expect.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F43665iq70ilmu6jiu0ve.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F43665iq70ilmu6jiu0ve.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So we can actually rewrite our uptime promise as a result of the failure rate of that third-party component.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftquucsdsbldbs2ko15nw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftquucsdsbldbs2ko15nw.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And the only way that we can actually increase the uptime in the face of these failures is to retry. So we can multiply out the third-party failure rate and retry multiple times.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpnvwy187nxeuclf8e51p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpnvwy187nxeuclf8e51p.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Logically that makes a lot of sense. When a component fails, if you retry again, and again, the likelihood it will be down every single time approaches zero. And we can generate a really nasty equation from this to determine exactly how many times we need to retry.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqy530layozz10o8q0f4z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqy530layozz10o8q0f4z.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So how many retries exactly? Rather than guessing whether we should retry four times or five times, or putting it in a &lt;code&gt;while(true)&lt;/code&gt; loop, we can figure it out exactly. We take this equation and extend it out a little bit, plugging in our 90% reliable third-party component:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdeb6ooh25hwn5rdcdo6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdeb6ooh25hwn5rdcdo6.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We find that our retry count actually must be greater than or equal to five. We can verify that this meets our uptime expectation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgez8ufisktxqhosi1xv3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgez8ufisktxqhosi1xv3.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Is this the end of the story? Just retry a bunch of times and you're good? Well, not exactly. Remember this equation?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspvf3hz1j0x0exfkyvgl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspvf3hz1j0x0exfkyvgl.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We really do need to consider every single component that we utilize. And specifically, when it comes to the third-party component, we had to execute it through a retry handler. So we need to add the retry handler into our equation. Going back to the initial architecture: instead of what we had before, when there's a failure in that third-party component, we now automatically execute some sort of asynchronous or in-process retries. And every time that third-party component fails, we execute the retry handler and retry again.&lt;/p&gt;
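&lt;p&gt;An in-process retry handler like the one described here can be as small as the following sketch. The backoff strategy and delays are illustrative placeholders, not what we actually run:&lt;/p&gt;

```javascript
// A minimal in-process retry handler: retry a failing async call up to
// maxRetries times, with exponential backoff between attempts.
async function withRetries(call, maxRetries, baseDelayMs = 100) {
  let lastError;
  for (let attempt = 0; attempt <= maxRetries; attempt += 1) {
    try {
      return await call(); // success: no further retries needed
    } catch (error) {
      lastError = error;
      if (attempt < maxRetries) {
        // exponential backoff before the next attempt
        const delayMs = baseDelayMs * 2 ** attempt;
        await new Promise(resolve => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError; // every attempt failed: surface the last error
}
```

&lt;p&gt;Calling it looks like &lt;code&gt;await withRetries(() =&gt; thirdPartyClient.call(), 5)&lt;/code&gt;. And crucially, this wrapper is itself a component with its own reliability, which is exactly the point of the next step.&lt;/p&gt;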

&lt;p&gt;This means we need to consider the reliability of that retry handler.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffb59vhrdjl6bd7f3rwij.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffb59vhrdjl6bd7f3rwij.png" alt="Retry handler failure rate consideration"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's assume we have a really reliable retry handler and that it's even more reliable than our service. I think that's reasonable, and actually required. A retry handler that is less reliable than our stated SLA by default is just as faulty as the third-party component.&lt;/p&gt;

&lt;p&gt;Let's consider one with five and a half nines — that's half a nine more reliable than our own SLA.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewmcrbsecwdxs3lww9f5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewmcrbsecwdxs3lww9f5.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But how reliable does it really need to be? Well, we can pull in our original equation and realize that our total uptime is the reliability of the retried third-party component multiplied by the reliability of our retry handler.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkmyycn3dw1hz80f8lnn1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkmyycn3dw1hz80f8lnn1.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From here, we add in the retries to figure out what the result should be:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdnmckfa9k6czz0su9756.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdnmckfa9k6czz0su9756.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We have a reliable retry handler, but it's not perfect. And with a retry handler that has a reliability of five and a half nines, we can retry &lt;strong&gt;a maximum of two times&lt;/strong&gt;. Because remember, it has to be reliable every single time we utilize it, as it is a component which can also fail. Which means we are left with this equation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj72bn313d0kw856wi65h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj72bn313d0kw856wi65h.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I don't think it comes as a surprise to anyone that five is, in fact, greater than two. What is the implication here?&lt;/p&gt;

&lt;p&gt;The number of retries required for that unreliable third-party component to be utilized by us exceeds the number of retries actually allowed by our retry handler.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8oz0zgte2pi4vgwhkuv2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8oz0zgte2pi4vgwhkuv2.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's a failure: the retry handler can only retry twice before it itself violates our SLA, but we need to retry five times in order to raise the third-party component's reliability high enough. We can actually figure out the minimum reliability that a third-party component is allowed to have when using our retry handler:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzjbk4n9dgl5f9pc8lw5y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzjbk4n9dgl5f9pc8lw5y.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Which in turn confirms that it's actually impossible for us to utilize that component. &lt;code&gt;99.7%&lt;/code&gt;. &lt;code&gt;99.7%&lt;/code&gt; is the minimum allowed reliability for any third-party component in order for us to meet our required 5-nines SLA. This third-party component is so unreliable (&lt;code&gt;~90%&lt;/code&gt;) that even using a highly reliable retry handler, we still can't make it reliable enough without the retry handler itself compromising our SLA. We fundamentally need to consider this constraint when we're building out our architecture.&lt;/p&gt;
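&lt;p&gt;Both numbers fall out of a few lines of arithmetic. This sketch reproduces them under my reading of the model above, where a budget of two retries means the component's failure rate gets squared:&lt;/p&gt;

```javascript
const SLA = 0.99999;      // our 5-nines availability target
const HANDLER = 0.999995; // retry handler reliability: five and a half nines

// Largest number of handler uses before the handler alone breaks the SLA:
// handlerReliability ** uses must stay at or above the SLA.
function maxHandlerUses(handlerReliability, sla) {
  let uses = 0;
  while (handlerReliability ** (uses + 1) >= sla) {
    uses += 1;
  }
  return uses;
}

console.log(maxHandlerUses(HANDLER, SLA)); // 2

// With only two attempts available, the component failure rate f must
// satisfy f * f <= 1 - SLA, i.e. reliability of at least 1 - sqrt(1 - SLA).
const minComponentReliability = 1 - Math.sqrt(1 - SLA);
console.log(minComponentReliability.toFixed(4)); // 0.9968, i.e. ~99.7%
```

&lt;p&gt;Which matches the &lt;code&gt;99.7%&lt;/code&gt; floor: anything below it cannot be rescued by this retry handler.&lt;/p&gt;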

&lt;p&gt;That means we drop this third-party component. Done.&lt;/p&gt;

&lt;p&gt;And then, let's assume we get rid of every flaky component, everything that doesn't have a high enough reliability for us. At this point, it's good to ask: is this sufficient to achieve our 5-nines SLA? Well, it isn't just third-party components we have to be concerned about. We also have to worry about AWS infrastructure failures.&lt;/p&gt;

&lt;h2&gt;
  
  
  🌩️ Infrastructure Failures​
&lt;/h2&gt;

&lt;p&gt;So let's flashback to our initial architecture again:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1bl7923pf44h0g8xpg8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1bl7923pf44h0g8xpg8.png" alt="The simplest auth architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can have issues at the database layer, right? There could be any number of problems here. Maybe it's returning 500s, there are some slow queries, maybe things are timing out. Or there could be a problem with our compute. Maybe it's not scaling up fast enough, and we're not getting new infrastructure resources. Sometimes even AWS is out of bare-metal machines when you don't reserve them and instead request them on demand. And the list goes on.&lt;/p&gt;

&lt;p&gt;Additionally, there could also be some sort of network issue, where requests aren't making it through to us, or requests from our users even fail with DNS resolution errors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F997xjnqi3hlpgxmovfnc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F997xjnqi3hlpgxmovfnc.png" alt="AWS Infrastructure Failure locations"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In many of these cases, I think the answer is obvious. We just have to declare the whole region as down. And you are probably thinking, well, this is where we fail over to somewhere else. No surprise, yeah, this is exactly what we do:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy0qml4ivl0v6r1q6fp4l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy0qml4ivl0v6r1q6fp4l.png" alt="Region failover strategy in AWS"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, this means we have to have all the data and all the infrastructure components duplicated to another region in order to do this. And since &lt;a href="https://authress.io" rel="noopener noreferrer"&gt;Authress&lt;/a&gt; has &lt;strong&gt;six primary regions&lt;/strong&gt; around the world, that also means we need multiple backup regions to be able to support the strategy. But this comes with significant wasted resources and wasted compute that we're not even getting to use. Costly! But I'll get to that later.&lt;/p&gt;

&lt;p&gt;Knowing a redundant architecture is required is a great first step, but that leaves us having to solve for: &lt;strong&gt;how do we actually make the failover happen in practice?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🚧 The Failover Routing Strategy​
&lt;/h2&gt;

&lt;p&gt;Simply put — our strategy is to utilize DNS dynamic routing. This means requests come into our DNS and it automatically selects between one of two target regions, the primary region that we're utilizing or the failover region in case there's an issue. The critical component of the infrastructure is to switch regions during an incident:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftdhg1xaro8vguni8ymqn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftdhg1xaro8vguni8ymqn.png" alt="Utilizing Route 53 health checks"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In our case, when using AWS, this means using the Route 53 health checks and the &lt;a href="https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/routing-policy-failover.html" rel="noopener noreferrer"&gt;Route 53 failover routing policy&lt;/a&gt;.&lt;/p&gt;
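&lt;p&gt;In rough terms, the failover policy is a pair of DNS records with the same name, one &lt;code&gt;PRIMARY&lt;/code&gt; and one &lt;code&gt;SECONDARY&lt;/code&gt;, each tied to a health check. Here is a sketch of the parameters for Route 53's &lt;code&gt;ChangeResourceRecordSets&lt;/code&gt; call; every domain name, address, and identifier below is a placeholder:&lt;/p&gt;

```javascript
// Build one leg of a Route 53 failover routing policy. role is either
// 'PRIMARY' or 'SECONDARY'; healthCheckId references a Route 53 health check.
function failoverRecordSet({ name, role, target, healthCheckId }) {
  return {
    Action: 'UPSERT',
    ResourceRecordSet: {
      Name: name,
      Type: 'A',
      SetIdentifier: role.toLowerCase(),
      Failover: role,
      TTL: 60, // keep the TTL low so a failover propagates quickly
      ResourceRecords: [{ Value: target }],
      HealthCheckId: healthCheckId
    }
  };
}

const changeBatch = {
  Changes: [
    failoverRecordSet({ name: 'api.example.com', role: 'PRIMARY',
      target: '198.51.100.10', healthCheckId: 'primary-hc-id' }),
    failoverRecordSet({ name: 'api.example.com', role: 'SECONDARY',
      target: '203.0.113.10', healthCheckId: 'secondary-hc-id' })
  ]
};

// Applying it would use the AWS SDK v3 (not executed in this sketch):
//   new Route53Client({}).send(new ChangeResourceRecordSetsCommand({
//     HostedZoneId: 'YOUR_ZONE_ID', ChangeBatch: changeBatch }));
```

&lt;p&gt;Route 53 then serves the primary record while its health check passes, and flips to the secondary record when it fails.&lt;/p&gt;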

&lt;p&gt;We know how we're gonna do it, but the long pole in the tent is actually knowing that there is even a problem in the first place. A partial answer is to say &lt;strong&gt;have a health check&lt;/strong&gt;, so of course there is a health check here. But the full answer is: have a health check that validates both of the regions, checking whether each region is up or having an incident, and reports the results to the DNS router.&lt;/p&gt;

&lt;p&gt;We could utilize the default provided handler from AWS Route 53, or a third-party component which pings our website, but that's not accurate enough to know correctly and for certain that our services are in fact down.&lt;/p&gt;

&lt;p&gt;It would be devastating for us to fail over when the secondary region is having worse problems than our primary region. Or what if there's an issue with network traffic? We wouldn't know if that's an issue of communication between AWS's infrastructure services, or an issue with the default Route 53 health check endpoint, or some entangled problem with how those specifically interact with the code that we're actually utilizing. So it became a requirement to build something ourselves, custom, to execute exactly the checks we need.&lt;/p&gt;

&lt;p&gt;Here is a representation of what we're doing. It's not exactly what we are doing, but it's close enough to be useful. Health check requests come in from the Route 53 health check. They call into our APIGW or Load Balancer as a router. The requests are passed to our compute, which can interact with and validate logic, code, access, and data in the database:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0m8k9awu82gjzggrp9dl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0m8k9awu82gjzggrp9dl.png" alt="The health check endpoint architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The health check executes this code on request, which allows us to validate whether the region is in fact healthy:&lt;/p&gt;

&lt;p&gt;Region HealthCheck validation&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;Authorizer&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./authorizer.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;ModelValidator&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./modelValidator.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;healthCheck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;profiler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dynamoDbCheck&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;accountDatabase&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getDefaultAccount&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;indexerCheck&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;indexer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;authorizationCheck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;HealthCheck&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sqsValidation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;sqsClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;LiveCheck&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;authorizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Authorizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;modelValidation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ModelValidator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nx"&gt;dynamoDbCheck&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;indexerCheck&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sqsValidation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="nx"&gt;authorizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;modelValidation&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;HealthCheck Failed&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;503&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;profiler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;ol&gt;
&lt;li&gt;We start a profiler to know how long our requests are taking.&lt;/li&gt;
&lt;li&gt;Then we interact with our databases, as well as validate some secondary components, such as SQS. While issues with secondary components may not always be a reason to failover, they can cause impacts to response time, and those indicators can be used to predict incoming incidents.&lt;/li&gt;
&lt;li&gt;From there, we check whether or not the most critical business logic is working correctly. In our case, that's interactions with DynamoDB as well as core authorizer logic. Compared to a simple unit test, this accounts for corruption in a deployment package, as well as instances where some subtle difference between regions interacts with our code base. We can catch those sorts of problems here, know that the primary region we're utilizing, one of the six, is having a problem, and automatically update the DNS based on this.&lt;/li&gt;
&lt;li&gt;When we're done, we return success or failure so the health check can track changes.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  🌿 Improving the Failover Strategy​
&lt;/h2&gt;

&lt;p&gt;We don't stop here with our infrastructure failover, however. The current strategy is good, and in some cases even sufficient, but it isn't that great. For starters, we have to completely fail over. If there's just one component that's problematic, we can't easily swap out just that one; it's all or nothing with the Route 53 health check. So when possible, we push for an edge-optimized architecture. In AWS, this means utilizing &lt;a href="https://aws.amazon.com/cloudfront/" rel="noopener noreferrer"&gt;AWS CloudFront&lt;/a&gt; with AWS Lambda@Edge for compute. This not only helps reduce latency for our customers and their end users, depending on where they are around the world; as a secondary benefit, it is fundamentally an improved failover strategy.&lt;/p&gt;

&lt;p&gt;And that looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fef8jpguja4meqn08b61a.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fef8jpguja4meqn08b61a.gif" alt="CloudFront Edge Failover"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using CloudFront gives us a &lt;a href="https://aws.amazon.com/blogs/networking-and-content-delivery/charting-the-life-of-an-amazon-cloudfront-request/" rel="noopener noreferrer"&gt;highly reliable CDN&lt;/a&gt;, which routes requests to the locally available compute region. From there, we can interact with the local database. When our database in that region experiences a health incident, we automatically fail over and check the database in a second, adjacent region. And when there's a problem there as well, we do it again to a third region. We can do that because, when utilizing DynamoDB, we have &lt;a href="https://aws.amazon.com/dynamodb/global-tables/" rel="noopener noreferrer"&gt;Global Tables&lt;/a&gt; configured for the authorization configuration. In places where we don't need the data duplicated, we just interact with the table in a different region without replication.&lt;/p&gt;

&lt;p&gt;After a third region with an issue, &lt;strong&gt;we stop.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And maybe you're asking why three and not four or five or six? Aren't you glad we did the probabilities exercise earlier? Now you can actually figure out why it's three here. But, I'll leave that math as an exercise for you.&lt;/p&gt;
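&lt;p&gt;The bounded fallback itself is a short loop. This is a sketch, with hypothetical per-region database clients standing in for the real ones:&lt;/p&gt;

```javascript
// Try the local region first, then up to two adjacent regions, then stop.
// regionClients is an ordered list of hypothetical per-region data clients.
async function readWithRegionFallback(regionClients, key) {
  const maxRegions = 3; // the cap from the probability exercise above
  const errors = [];
  for (const client of regionClients.slice(0, maxRegions)) {
    try {
      return await client.get(key);
    } catch (error) {
      // region looks unhealthy; fall through to the next adjacent region
      errors.push(error);
    }
  }
  throw new Error(`All ${errors.length} attempted regions failed`);
}
```

&lt;p&gt;Note that a fourth healthy region would never even be tried; past three, the marginal reliability gain no longer justifies the extra latency and complexity.&lt;/p&gt;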

&lt;p&gt;As a quick recap, this handles the problems at the infrastructure level and with third-party components. And if we solve those, is that sufficient for us to achieve our goal of a 5-nines SLA?&lt;/p&gt;

&lt;p&gt;For us the answer is &lt;strong&gt;No&lt;/strong&gt;, and you might have guessed, if you peeked at the scrollbar or table of contents, that there are still quite a few additional components integrated into our solution. One of them comes from knowing that at some point, there's going to be a bug in our code, unfortunately.&lt;/p&gt;

&lt;h2&gt;
  
  
  💻 Application level failures​
&lt;/h2&gt;

&lt;p&gt;And that bug will get committed to production, which means we're going to end up with an application failure. It should be obvious that it isn't achievable to write completely bug-free code. Maybe there is someone out there that thinks that, and maybe even that's you, and I believe you that you believe that. However, I know it's not me, and realistically, I don't want to sit around and pray that it's also my fellow team members. The risk is too high, because in the case something does get into production, that means it can impact some of our customers. So instead, let's assume that will happen and design a strategy around it.&lt;/p&gt;

&lt;p&gt;So when it does happen, we of course have to trigger our incident response. For us, we send out an email, we post a message on our community and internal communication workspaces, and start an on-call alert. The technology here isn't so relevant, but tools like AWS SES, SQS, SNS, Discord, and emails are involved.&lt;/p&gt;

&lt;p&gt;Incidents wake an engineer up, so someone can start to take a look at the incident, and most likely the code.&lt;/p&gt;

&lt;p&gt;But by the time they even respond to the alert, let alone actually investigate and fix the cause of the incident, we would have long since violated our SLA. So an alert is not sufficient for us. We also need to implement automation to automatically remediate these problems. Now, I'm sure you're thinking, &lt;em&gt;yeah, okay, test automation&lt;/em&gt;. You might even be thinking about an LLM agent that can automatically create PRs. (Side note: LLM code generation doesn't actually work for us, and I'll get to that a little further down.) Instead, we have to rely on having sufficient testing in place. And yes, of course we do. We test before deployment. There is no better time to test.&lt;/p&gt;

&lt;p&gt;This seems like a simple and obvious answer, and I hope that for anyone reading this article it is. Untested code never goes to production. Every line of code is completely tested before it is merged, even if it is gated behind some flag. Untested code is never released; it is far too dangerous, and abusing feature flags to sneak it into production could not be a worse decision for us. That's because we need to be as confident as possible before those changes actually get out in front of our customers. The result is that we don't focus on test coverage percentage, but rather &lt;strong&gt;test value&lt;/strong&gt;. That is, which areas provide the most value, are the most risky, and are the ones we care about being the most reliable for our customers. Those are the ones we focus on testing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Root Cause Analysis (RCA)​
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Every incident could have been prevented if we just had one more test.&lt;/strong&gt; The trick though is actually having that right test, before the incident.&lt;/p&gt;

&lt;p&gt;And in reality, that's not actually possible. Having every right test for a service that is constantly changing, while new features are being added, is just unmaintainable. Every additional test we write increases the maintenance burden of our service. Attempting to achieve 100% complete test coverage would require an infinite amount of time. This is known as the &lt;a href="https://en.wikipedia.org/wiki/Pareto_principle" rel="noopener noreferrer"&gt;Pareto Principle&lt;/a&gt;, more commonly the 80-20 rule. If it takes 20% of the time to deliver 80% of the tests, it takes an infinite amount of time to achieve all the tests, and that assumes that the source code isn't changing.&lt;/p&gt;

&lt;p&gt;The result is we'll never be able to catch everything. &lt;strong&gt;So we can't just optimize for prevention. We also need to optimize for recovery.&lt;/strong&gt; This conclusion for us means also implementing tests against our deployed production code. One example of this are validation tests.&lt;/p&gt;

&lt;h2&gt;
  
  
  📋 Validation Tests​
&lt;/h2&gt;

&lt;p&gt;A validation test is where you have some data in one format and data in another format, and you use those two different formats to ensure referential consistency. (Side note: There are many different kinds of tests, and I do a deep dive into &lt;a href="https://authress.io/knowledge-base/academy/topics/user-impersonation-risks#solution-b-dom-recording" rel="noopener noreferrer"&gt;the different types of tests&lt;/a&gt; and how they're relevant in building secure and reliable systems.) One concrete example: a request comes in, you log the request data and the response, and then you can compare that logged data to what's actually saved in your database.&lt;/p&gt;

&lt;p&gt;In our scenario, which focuses on the authorization and permissions enforcement checks, we have multiple databases with similar data. In one case, there's the storage of permissions as well as the storage of the expected checks and the audit trail tracking the creation of those permissions. So we actually have multiple opportunities to compare the data between our databases asynchronously outside of customer critical path usage.&lt;/p&gt;
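&lt;p&gt;As an illustration, with hypothetical record shapes, a validation pass can diff the permissions store against the audit trail and report what doesn't line up:&lt;/p&gt;

```javascript
// A simplified validation test: every stored permission should have an
// audit record explaining its creation, and every audit record should
// point at a permission that still exists. Returns the discrepancies
// instead of throwing, so the caller can raise an incident with details.
function validateConsistency(permissionRecords, auditRecords) {
  const permissionIds = new Set(permissionRecords.map(record => record.id));
  const auditedIds = new Set(auditRecords.map(record => record.permissionId));
  return {
    missingAudit: [...permissionIds].filter(id => !auditedIds.has(id)),
    missingPermission: [...auditedIds].filter(id => !permissionIds.has(id))
  };
}
```

&lt;p&gt;A scheduled job can run this outside the customer critical path and fire an incident whenever either list is non-empty.&lt;/p&gt;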

&lt;h3&gt;
  
  
  Running the Validation​
&lt;/h3&gt;

&lt;p&gt;On a schedule, via an AWS CloudWatch scheduled rule, we load the data from our different databases and compare them against each other to make sure everything is consistent. If there is a problem, this fires off an incident before any of our customers notice, so that we can actually go in and check what's going on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpgx7zftaky0c0kg1k1tk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpgx7zftaky0c0kg1k1tk.png" alt="The architecture flow to trigger the validation tests"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the surface, it sounds bad that this could ever happen. But the reality of the situation is that a discrepancy can show up as a result of any number of mechanisms. For instance, the infrastructure from AWS could have corrupted one of the database shards, and what is written to the databases is inconsistent. We know that this can happen, as there is no 100% guarantee on database durability, even from AWS. &lt;strong&gt;AWS does not guarantee database durability.&lt;/strong&gt; Are you assuming they do? Because we don't! So actually reading the data back and verifying its internal consistency is something that we must do.&lt;/p&gt;

&lt;p&gt;While it might not seem that this could reduce the probability of there being an incident, consider that a requested user permission check whose result doesn't match our customer's expectation is an incident. It might not always be one that anyone identifies or even becomes aware of, but it is nonetheless a problem. Just like a publicly exposed S3 bucket is technically an issue even if no one has exfiltrated the data yet; the absence of exfiltration doesn't mean the bucket is sufficiently secured.&lt;/p&gt;

&lt;h2&gt;
  
  
  🎯 Incident Impact​
&lt;/h2&gt;

&lt;p&gt;There are two parts to the actual risk of an incident: the probability and the impact. Everything I've discussed in this article until now is about reducing the probability of an incident, that is, the likelihood of it happening. But since we know that we can't avoid ever having an incident, we also have to reduce the impact when it happens.&lt;/p&gt;

&lt;p&gt;One way we do that is by utilizing an &lt;strong&gt;incremental rollout&lt;/strong&gt;. Hopefully everyone knows what an incremental rollout is, so I'll instead jump straight into how we accomplish it in AWS. And for that we focus again on our solution integrating with CloudFront and our edge architecture.&lt;/p&gt;

&lt;p&gt;The solution for us is what I call &lt;strong&gt;Customer Deployment Buckets&lt;/strong&gt;. We group individual customers into separate buckets and then deploy to each of the buckets sequentially. If the deployment to the first bucket rolls out without a problem and it's all green, that is, everything works correctly, then we go on to the second bucket and deploy our code there, and then the third bucket, and so on and so forth until every single customer has the new version.&lt;/p&gt;
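
&lt;p&gt;The control flow of that rollout can be sketched in a few lines. Here &lt;code&gt;deploy&lt;/code&gt; and &lt;code&gt;healthy&lt;/code&gt; are hypothetical stand-ins for the real deployment step and the green/not-green health verification:&lt;/p&gt;

```python
# Sketch of sequential deployment to customer buckets: deploy to one bucket,
# verify it is healthy, and only then continue to the next. The deploy and
# healthy callables are injected so the rollout logic stays testable.

def incremental_rollout(buckets, deploy, healthy):
    """Deploy to each bucket in order; stop at the first unhealthy bucket.

    Returns the list of buckets that received the new version."""
    deployed = []
    for bucket in buckets:
        deploy(bucket)
        deployed.append(bucket)
        if not healthy(bucket):
            # Halt the rollout so the issue can't spread to later buckets.
            break
    return deployed
```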

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7g093ezmn0efir72exnj.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7g093ezmn0efir72exnj.gif" alt="Rolling out to customer buckets one at a time"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If there is an issue, we stop the rollout and we go and investigate what's actually going on. While we can't prevent the issue from happening to the earlier buckets, we are able to stop that issue from propagating to more customers, having an impact on everyone, and thus reduce the impact of the incident.&lt;/p&gt;

&lt;p&gt;As I mentioned before, the biggest recurring issue isn't executing an operations process during an incident, it's identifying that there is a real incident in the first place. So, &lt;strong&gt;how do we actually know that there's an issue?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If it was an easy problem to solve, you would have written a unit test or &lt;a href="https://authress.io/knowledge-base/academy/topics/user-impersonation-risks#solution-b-dom-recording" rel="noopener noreferrer"&gt;integration test or service level test&lt;/a&gt; and thus already discovered it, right? So adding tests can't, by design, help us here. Maybe there's an issue with the deployment itself or during infrastructure creation, but likely that's not what's happening.&lt;/p&gt;

&lt;p&gt;Now, I know you're thinking, &lt;em&gt;&lt;strong&gt;When is he going to get to AI?&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Whether or not we'll ever truly have AI is a separate &lt;code&gt;&amp;lt;rant /&amp;gt;&lt;/code&gt; that I won't get into here, so this is the only section on it, I promise. What we actually do is better called &lt;strong&gt;anomaly detection&lt;/strong&gt;. Historically, anomaly detection was what AI always meant: true AI, rather than an LLM or an agent of any kind.&lt;/p&gt;

&lt;h2&gt;
  
  
  🔎 AI: Anomaly Detection​
&lt;/h2&gt;

&lt;p&gt;This is a graph of our detection analysis:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fweyw1uefylwhpc2prxq6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fweyw1uefylwhpc2prxq6.png" alt="namely detection graph"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You might notice that it's not tracking 400s or 500s, which are in reality relatively easy to detect, but which don't actually tell us meaningfully what's wrong with our service or whether or not there really is a problem. Impact is measured by business value, not protocol-level technical analytics, so we need to have a business-focused metric.&lt;/p&gt;

&lt;p&gt;And for us, at Authress, the business-focused metric we use to identify meaningful incidents is what we call &lt;strong&gt;The Authorization Ratio&lt;/strong&gt;. That is the ratio of successful logins and authorizations to ones that are blocked, rejected, time out, or are never completed for some reason.&lt;/p&gt;
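
&lt;p&gt;In pseudocode terms, the metric and its alert condition look something like the following sketch. The threshold value here is purely illustrative, not Authress's real alert setting:&lt;/p&gt;

```python
# Sketch of the business-focused metric: the ratio of successful logins and
# authorizations to total attempts (successes plus ones that were blocked,
# rejected, timed out, or never completed). Threshold is illustrative.

def authorization_ratio(successes: int, failures: int) -> float:
    total = successes + failures
    if total == 0:
        return 1.0  # no traffic is treated as healthy, not as an outage
    return successes / total

def breaches_alert(successes: int, failures: int, minimum: float = 0.99) -> bool:
    """True when the ratio drops below the allowed band and should alert."""
    return authorization_ratio(successes, failures) < minimum
```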

&lt;p&gt;The above CloudWatch metric display contains this exact ratio, and the timeframe shown represents an instance not too long ago where we got really close to firing off our alert.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzoklmvhrrqxccuwumau8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzoklmvhrrqxccuwumau8.png" alt="Anomaly Detection allowance bands"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, there was a slight elevation of errors soon after a deployment. The measured ratio was outside of our allowance span for a short period of time, however not long enough to trigger an incident. We still investigated, but it wasn't something that required immediate remediation. And it's a good reminder that identifying problems in any production software isn't so straightforward. To achieve high reliability, we've needed AI, or in this case anomaly detection, to identify additional problems. And realistically, even with this level of sophistication in place, we still can never know with 100% certainty that there is actually an incident at any moment. That's because "what is an incident" is actually a philosophical question...&lt;/p&gt;

&lt;h2&gt;
  
  
  🌹 Does it smell like an incident?​
&lt;/h2&gt;

&lt;p&gt;Our anomaly detection said "almost an incident", and we determined the result: no incident. But does that mean there wasn't an incident? What makes an incident? How do I define one? And is that exact definition ubiquitous for every system, every engineer, every customer?&lt;/p&gt;

&lt;p&gt;Obviously not, and one look at the &lt;a href="https://health.console.aws.amazon.com/health/home" rel="noopener noreferrer"&gt;AWS Health Status Dashboard&lt;/a&gt; is all you need to determine that the identification of incidents is based on subjective perspective, rather than objective criteria. What's actually more important is the synthesis of our perspective on the situation and what our customers believe. To see what I mean, let's do a comparison:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F52a9g5937u786bvzpbnm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F52a9g5937u786bvzpbnm.png" alt="incident perspective comparison"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'm going to use Authress as an example. So I've got the product services perspective on one side and our customer's perspective on the other.&lt;/p&gt;

&lt;h3&gt;
  
  
  Incident Alignment​
&lt;/h3&gt;

&lt;p&gt;In the top left corner we have alignment. If we believe that our system is up and working and our customers do, too, then success, all good. Everything's working as expected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdi3qzdxcq5rbzcjlm2iy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdi3qzdxcq5rbzcjlm2iy.png" alt="incident perspective comparison alignment"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Inversely, in the opposite corner, maybe there is a problem. We believe that one of our services is having an issue, and we're successfully able to identify it. Most importantly, our customers say: yes, there is an issue for us.&lt;/p&gt;

&lt;p&gt;It's not great that there's an incident, but as I've said, incidents will absolutely happen, and the fact that we and our customers have independently aligned on the problem's existence allows us to deploy automation to automatically remediate the issue. That's a success! If it's a new problem that we haven't seen before, we can even design new automation to fix it. Correctly identifying incidents is challenging, so doing that step correctly lends itself very well to automated remediation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Perspective Mismatch​
&lt;/h3&gt;

&lt;p&gt;One interesting corner is when our customers believe that there's nothing wrong, there have been no incidents reported, but all our alerts are saying &lt;em&gt;RED ALERT&lt;/em&gt;, someone has to go look at this!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Famydmdwz08zwrv3amc91.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Famydmdwz08zwrv3amc91.png" alt="incident perspective mismatch"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this case, our alerts have identified a problem that no one cares about. This often happens when a customer's users are in one region rather than global, Switzerland for example: a health care, manufacturing, or e-commerce app with local users who are likely asleep at 2:00 AM. That means what is an incident at that moment could be an issue affecting some customers, but if they aren't around to experience it, is it actually happening?&lt;/p&gt;

&lt;p&gt;You are probably wincing at that idea. There's a bug, it must be fixed! And sure that's a problem, it's happening and we should take note of what's going on. But we don't need to respond in real time. That's a waste of our resources where we could be investing in other things. Why wake up our engineers based on functionality that no one is using?&lt;/p&gt;

&lt;p&gt;I think one of the most interesting categories is in the top right-hand corner where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;our customers say, &lt;em&gt;"hey, your service is down"&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;But we say, &lt;em&gt;"Wait, really, is it?"&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is known as a &lt;strong&gt;gray failure&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gray Failures​
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F01m60p4zzkauenyib9a8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F01m60p4zzkauenyib9a8.png" alt="Gray failures identified"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And it can happen for any number of reasons. Maybe there is something in our knowledge base that tells our customers to do something one way and it's confusing and they've interpreted it in a different way. So there's a different expectation here. That expectation can get codified into customer processes and product services.&lt;/p&gt;

&lt;p&gt;Or maybe our customer is running different tests from us, ones that are of course, valuable for their business, but not ones that we consider. Or more likely they are just using a less resilient cloud provider.&lt;/p&gt;

&lt;p&gt;Most fundamentally, there could really be an incident, something that we haven't detected yet, but they have. And if we don't respond to that, it could grow, and left unchecked, escalate, and eventually impact all our customers. This means we need to give our customers an easy way to report incidents to us, which we can immediately follow up with.&lt;/p&gt;

&lt;p&gt;For us, every single incident, every single customer support ticket that comes into our platform, we immediately and directly send it to our engineering team. Now, I often get pushback on this from other leaders. I'm sure even you might be thinking something like: &lt;em&gt;I don't want to be on call for customer support incidents.&lt;/em&gt; But if you throw additional tiers in your organization between your engineering teams and your customers, that means you're increasing the time to actually start investigating and resolving those problems. If you have two tiers before your engineering team and each tier has its own SLA of 10 minutes to triage the issue, that means you've already gone through 20 minutes before an engineer even knows about it and can go and look at it. That violates our SLA by fourfold before investigation and remediation can even begin.&lt;/p&gt;

&lt;p&gt;Instead, in those scenarios, what I actually recommend thinking about is how might you reduce the number of support tickets you receive in aggregate? This is the much more appropriate way to look at the problem. If you are getting support tickets that don't make sense, then you've got to investigate, &lt;em&gt;why did we get this ticket?&lt;/em&gt; Do the root cause analysis on the ticket, not just the issue mentioned in it — why the ticket was even created in the first place.&lt;/p&gt;

&lt;p&gt;A ticket means: Something is broken. From there, we can figure out, OK, maybe we need to improve our documentation. Or we need to change what we're doing on one of our endpoints. Or we need to change the response error message we're sending. But you can always go deeper.&lt;/p&gt;

&lt;h3&gt;
  
  
  The customer support advantage​
&lt;/h3&gt;

&lt;p&gt;And going deeper, means customer support is critical for us. We consider customer support to be the lifeline of our service level agreement (SLA). If we didn't have that advantage, then we might not have been able to deliver our commitment at all. So much so that we report some of our own CloudWatch custom metrics to our customers so they can have an aggregate view of both what they know internally and what we believe. We do this through our own internal dashboard in our application management UIs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxmgmwgjm0hbw1m3ev6je.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxmgmwgjm0hbw1m3ev6je.png" alt="Authress metric dashboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Helping our users identify incidents benefits us, because we can't catch everything. It's just not possible.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  💀 Negligence and Malice​
&lt;/h2&gt;

&lt;p&gt;To this point, we've done the math on reliability of third-party components. We've implemented an automatic region failover and added incremental rollout. And we have a core customer support focus. Is that sufficient to achieve 5-nines of reliability?&lt;/p&gt;

&lt;p&gt;If you think yes, then you'd expect the meme pictures now. And, I wish I could say it was enough, but it's not. That's because we also have to deal with negligence and malice.&lt;/p&gt;

&lt;p&gt;We're in a privileged position to have numerous security researchers out there on the internet constantly trying to find vulnerabilities within our service. For transparency, I have some of those reports I want to share:&lt;/p&gt;

&lt;h3&gt;
  
  
  “Real” Vulnerability Reports​
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxy2dgulk36qid68k18bc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxy2dgulk36qid68k18bc.png" alt="fake vulnerability disclosure"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I am a web security researcher enthusiast. Do you give a monetary reward?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Okay, this isn't starting out that great. What else have we received?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgct2u339n5cak8rf4qox.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgct2u339n5cak8rf4qox.png" alt="appeal to ethical hacking rewards"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I found some vulnerabilities in your website. Do you offer rewards for ethical hackers?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Well, maybe, but I think you would actually need to tell us what the problem is. You also might notice this went to our spam; it didn't even get to our inbox. So a lot of help they might be providing. Actually, we ignore any &lt;em&gt;“security”&lt;/em&gt; email sent from a non-custom domain.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fclh929kh86ucu0a7m36b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fclh929kh86ucu0a7m36b.png" alt="Phishing attempt using our own credentials"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This one was really interesting. We had someone attempting to phish our engineering team by creating a support ticket and putting in some configuration trying to get us to provide them our own credentials to one of our third-party dependencies. Interestingly enough, our teams don't even have access to those credentials directly.&lt;/p&gt;

&lt;p&gt;And we know this was malicious because the credentials they referenced in the support request are from our honeypot, placed in our UI explicitly to catch these sorts of things. The only way to get these credentials is to hack around our UI application and pull them out of the HTML; they aren't readily available any other way. So it was very easy for us to detect that this “report” was actually a social engineering attack.&lt;/p&gt;

&lt;p&gt;And this is one of my favorites, and I can't make this up:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpyuecblge6tw7wo0l90a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpyuecblge6tw7wo0l90a.png" alt="Bugbounty vulnerability reporting"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I have found many security loophole. How much will you pay if you want to working with me like project?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's the exact quote, I don't even know what that means. Unfortunately, LLMs will start to make all of these "vulnerability reports" sound more appealing to read in the future, for better or worse. However, at the end of the day, the truth is that these are harmless. And we actually do have a &lt;a href="https://authress.io/app/#/disclosure" rel="noopener noreferrer"&gt;security disclosure program&lt;/a&gt; that anyone can use to submit problems. My message to white-hat hackers is: please use that process; the legitimate reports usually do go through it. Do not send us emails, those are going to go into the abyss. You can also follow our &lt;a href="https://authress.io/.well-known/security.txt" rel="noopener noreferrer"&gt;security.txt&lt;/a&gt; public page or go to the disclosure form, but with email, the wrong people are going to get it and we can't triage effectively.&lt;/p&gt;

&lt;p&gt;Vulnerabilities in our services can result in production incidents for our customers. That means security is part of our SLA. Don't believe me? I'll show you how:&lt;/p&gt;

&lt;h3&gt;
  
  
  Multitenant considerations​
&lt;/h3&gt;

&lt;p&gt;It's relevant here that Authress is a multitenant solution, so some of the resources within our service are in fact shared between customers.&lt;/p&gt;

&lt;p&gt;Additionally, customers could have multiple services in a microservice architecture, or multiple components, and one of those services could theoretically consume all of the resources that we've allocated for that customer. That would cause an incident for that customer, so we need to protect against resource exhaustion &lt;strong&gt;intra-tenant&lt;/strong&gt;. Likewise, we have multiple customers, and one of those customers could consume more resources than we've allocated to their entire tenant. That would cause an &lt;strong&gt;inter-tenant&lt;/strong&gt; incident across our platform and impact other customers.&lt;/p&gt;

&lt;p&gt;Lastly, we have to be worried about our customers, our customers' customers, and our customers' customers' customers, because any one of those could be malicious and consume their resources and so on and so forth, thus causing a cascading failure. &lt;strong&gt;A failure due to lack of resources is an incident&lt;/strong&gt;. The only solution that makes sense for this is, surprise, rate limiting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Helpful Rate Limiting​
&lt;/h3&gt;

&lt;p&gt;So we need to rate-limit requests differently for different kinds of clients and different kinds of users, and we do that at different fundamental levels within our infrastructure.&lt;/p&gt;
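
&lt;p&gt;A minimal sketch of the two limiting layers described above, where each client has a budget within its tenant and each tenant has a budget within the platform. Fixed-window counters keep the example short (production systems usually prefer token buckets), and all limit values are illustrative:&lt;/p&gt;

```python
# Sketch of two-layer rate limiting: intra-tenant (one client can't starve a
# customer's other services) and inter-tenant (one customer can't starve the
# platform). Counters are per-window; resetting the window is omitted.

class TenantLimiter:
    def __init__(self, per_client_limit: int, per_tenant_limit: int):
        self.per_client_limit = per_client_limit
        self.per_tenant_limit = per_tenant_limit
        self.client_counts = {}   # (tenant, client) -> requests this window
        self.tenant_counts = {}   # tenant -> requests this window

    def allow(self, tenant: str, client: str) -> bool:
        if self.tenant_counts.get(tenant, 0) >= self.per_tenant_limit:
            return False  # inter-tenant: protect other customers
        if self.client_counts.get((tenant, client), 0) >= self.per_client_limit:
            return False  # intra-tenant: protect the customer's other services
        self.tenant_counts[tenant] = self.tenant_counts.get(tenant, 0) + 1
        key = (tenant, client)
        self.client_counts[key] = self.client_counts.get(key, 0) + 1
        return True
```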

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc07ujv2eln2wegxckcoc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc07ujv2eln2wegxckcoc.png" alt="CloudFront and Region based rate limiting locations"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Primarily there are protections at our compute level, as well as at the region level, and we also place protections at a global level. In AWS, this of course means using a &lt;a href="https://aws.amazon.com/waf/" rel="noopener noreferrer"&gt;web application firewall or WAF&lt;/a&gt;. I think our WAF configuration is interesting and in some ways novel.&lt;/p&gt;

&lt;p&gt;Fundamentally, one of the things that we love to use is the &lt;a href="https://docs.aws.amazon.com/waf/latest/developerguide/aws-managed-rule-groups-ip-rep.html#aws-managed-rule-groups-ip-rep-amazon" rel="noopener noreferrer"&gt;AWS managed IP reputation list&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The reputation list is a list of IP addresses that have been associated with malicious activity detected across other AWS customers and other providers out in the world. That means before those attacks even get to our service or to our customers' instances of Authress, we already know to block them, and the WAF does that. This is great, and most importantly, it has a very low false positive rate.&lt;/p&gt;

&lt;p&gt;However, the false positive rate is an important metric when evaluating countermeasures against malicious attacks or negligent accidental abuse of resources, and it's what prevents us from using any other managed rules from AWS or external providers. There are two fundamental problems with managed rules:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The first is the false positive rate. If it is even a little bit above zero, it isn't sustainable, and would result in us blocking legitimate requests coming from a customer. That's a problem, and it's an incident for them if some of their users can't utilize their software because of something we did. False positives are customer incidents.&lt;/li&gt;
&lt;li&gt;The second is that managed rules are gratuitously expensive. Lots of companies build these just to charge you lots of money, and the ROI just doesn't seem to be there. We don't see useful blocks from them.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But the truth is, we need to do something more than just the reputation list rule.&lt;/p&gt;

&lt;h3&gt;
  
  
  Handling Requests at Scale​
&lt;/h3&gt;

&lt;p&gt;And the thing that we've decided to do is add blocking for sufficiently high request rates. By default, any Authress account's service client that goes above 2,000 requests per second (RPS) we just immediately terminate. Now, this isn't every customer, as there are some that do require such a high load or even higher (2k isn't that high). But for the majority of them, if a client gets to this number and the customer hasn't talked to us about their volume, then it is probably malicious in some way. You don't magically go from zero to 2,000 one day, unless it is an import job.&lt;/p&gt;

&lt;p&gt;Likewise, we can actually learn about a problem long before it gets to that scale. We have milestones, and we start reporting loads from clients at 100, 200, 500, 1,000, et cetera. If we see clients hitting these load milestones, we can already start to respond and create an incident for us to investigate before they reach a point where they're consuming all of the resources in our services for that customer. And we do this by adding alerts on the COUNT of requests for WAF metrics.&lt;/p&gt;
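
&lt;p&gt;The milestone logic itself is simple enough to sketch: given a client's previous and current request rate, report which thresholds were newly crossed so an investigation can start well before the hard cutoff. The milestone values mirror the ones mentioned above:&lt;/p&gt;

```python
# Sketch of load-milestone reporting: detect which alerting thresholds a
# client's request rate has newly crossed since the last observation.

MILESTONES = (100, 200, 500, 1000, 2000)  # requests per second

def crossed_milestones(previous_rps: int, current_rps: int):
    """Return the milestones passed between the two observations."""
    return [m for m in MILESTONES if previous_rps < m <= current_rps]
```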

&lt;p&gt;However, we also get attacks at a smaller scale. Just because we aren't being DDoSed doesn't mean there isn't an attack. Those requests will still get through because they don't meet our blocking limits. They could be malicious in nature, but only identifiable in aggregate. While a single request might seem fine, if you see the same request 10 times a second, or 100 times a second, something is probably wrong. Or if you have request URLs that end in &lt;code&gt;.php?admin&lt;/code&gt; when no one has run WordPress in decades, you also know that there's a problem. We catch these by logging all of the blocked requests.&lt;/p&gt;
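
&lt;p&gt;The aggregate analysis can be sketched like this. The log shape, path markers, and repeat threshold are hypothetical illustrations, not our actual rules:&lt;/p&gt;

```python
# Sketch of aggregate analysis over request logs: individually innocuous
# requests become suspicious in bulk, or when their URLs match known-bad
# probes for software we don't even run.

SUSPICIOUS_PATHS = (".php", "/wp-admin", "/.env")  # illustrative markers

def suspicious_clients(log_entries, repeat_threshold=100):
    """log_entries: iterable of (client_ip, path). Returns flagged IPs."""
    counts = {}
    flagged = set()
    for ip, path in log_entries:
        counts[ip] = counts.get(ip, 0) + 1
        if counts[ip] >= repeat_threshold:
            flagged.add(ip)  # same client hammering us, only visible in aggregate
        if any(marker in path for marker in SUSPICIOUS_PATHS):
            flagged.add(ip)  # probing for software that isn't there
    return flagged
```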

&lt;p&gt;We have automation in place to query those results and update our rules, but a picture is worth a thousand words:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ro8l9l1rn6gpkq4alz2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ro8l9l1rn6gpkq4alz2.png" alt="WAF COUNT metrics display"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here you can see a query over the client IP addresses being utilized, sorted by frequency. When we get these requests that look non-malicious individually, we execute a query such as this one and check whether the results match a pattern. You can use IP address matching or, more intelligently, something called the JA3 or JA4 fingerprints of those requests. There are actually lots of options available; I'm not going to get into exactly what they are, as there are some &lt;a href="https://ramimac.me/waf-ddos" rel="noopener noreferrer"&gt;great articles on the topic&lt;/a&gt;. There are more mechanisms to track these used throughout the security industry, and utilizing them lets you instantly identify: &lt;em&gt;Hey, you know what? This request violates one of our patterns, maybe we should block all the requests from that client.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And so, rather than waiting for an attacker to get to the point where they are consuming 2,000 requests per second worth of resources, you can stop them right away. In the cases where we can't make a conclusive decision, this technology gives us another tool we can utilize to improve our patterns for the future. Maybe it goes without saying, but because we're running our technology in many regions around the world, we have to deploy this infrastructure in all those places and push it out to the edge where possible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F41hryyea3htsxu8m6frp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F41hryyea3htsxu8m6frp.png" alt="Authress AWS Regional and Global locations"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🎁 The Conclusion​
&lt;/h2&gt;

&lt;p&gt;I said a lot of things, so I want to quickly summarize the architecture that we have in place:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Third-party component reliability reviews&lt;/strong&gt;. I can't stress this enough. Don't just assume that you can utilize something. Sometimes, in order to achieve 5-nines, you actually have to remove components from your infrastructure. Some things just can't be utilized no matter what. Maybe you can put one in some sort of async background process, but it can't be on the critical path for your endpoints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DNS failover and health checks.&lt;/strong&gt; For places where you have an individual region or availability zone or cluster, having a full backup with a way to conclusively determine what's up and automatically failover is critical.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edge compute where possible&lt;/strong&gt;. There's a whole network out there of services that are running on top of the cloud providers, which help guarantee your capability to run as close as possible to where your users are and reduce latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental rollout&lt;/strong&gt; for when you want to reduce the impact as much as possible.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Web Application Firewall&lt;/strong&gt; for handling those malicious requests.&lt;/li&gt;
&lt;li&gt;Having a &lt;strong&gt;Customer Support Focus&lt;/strong&gt; to enable escalating issues that fall outside your area of detection.&lt;/li&gt;
&lt;/ol&gt;
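&lt;p&gt;As a sketch of the failover idea in point 2: the decision logic amounts to preferring the highest-priority endpoint whose health checks pass. The endpoint names and the shape of the health-check results below are hypothetical; with Route 53, this logic lives in failover routing policies and health checks rather than in your own code.&lt;/p&gt;

```javascript
// Hypothetical endpoint list: the primary first, then backups, in priority order.
const endpoints = [
  { url: 'https://eu-west-1.api.example.com', priority: 1 },
  { url: 'https://eu-central-1.api.example.com', priority: 2 },
  { url: 'https://us-east-1.api.example.com', priority: 3 }
];

// Pick the highest-priority endpoint whose latest health check passed.
// healthByUrl maps an endpoint URL to its most recent health-check result.
function selectEndpoint(endpointList, healthByUrl) {
  const byPriority = endpointList.slice().sort(function (a, b) {
    return a.priority - b.priority;
  });
  for (const endpoint of byPriority) {
    if (healthByUrl[endpoint.url]) {
      return endpoint.url;
    }
  }
  // Everything looks unhealthy: fail back to the primary rather than serving nothing.
  return byPriority[0].url;
}
```

&lt;p&gt;In Route 53 terms, the equivalent is declarative: a PRIMARY failover record with an attached health check, and a SECONDARY record that serves traffic only while the primary's check is failing.&lt;/p&gt;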

&lt;p&gt;And over the seven years or so that we've been doing this and building up this architecture, there are a couple of things that we've learned:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fychzurqaomgucefd9aiu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fychzurqaomgucefd9aiu.png" alt="Unsolvable Problems at scale"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Murphy's Law
&lt;/h3&gt;

&lt;p&gt;Everything fails all the time. There absolutely will be failures everywhere. Every line of code, every component you pull in, every library: there's guaranteed to be a problem in each and every one of those. And you will, at some point, have to deal with it. So being prepared to handle that situation is something you have to think through in your design.&lt;/p&gt;

&lt;h3&gt;
  
  
  DNS
&lt;/h3&gt;

&lt;p&gt;DNS. Yeah, AWS will say it, everyone out there will say it, and now we get to say it too. The global DNS architecture is pretty good and reliable for a lot of scenarios, but I worry that it's still a single point of failure in a lot of ways.&lt;/p&gt;

&lt;h3&gt;
  
  
  Infrastructure as Code (IaC)
&lt;/h3&gt;

&lt;p&gt;The last thing is infrastructure as code challenges. We deploy primary regions, but then there are also the backup regions, which are slightly different from the primary regions, and then there's edge compute, which is, again, slightly different still. And then sometimes we do this ridiculous thing where we deploy infrastructure dedicated to a single customer. In each of these cases, we're running some sort of IaC to deploy those resources.&lt;/p&gt;

&lt;p&gt;It is almost exactly the same architecture. Almost! Because it isn't exactly the same, there are plenty of opportunities for problems to sneak in. That's a challenge even with OpenTofu or CloudFormation, and often these tools make it more difficult, not less. And good luck to you if you're still using something else that hasn't been modernized; with those, it's even easier to run into problems and not get it exactly right.&lt;/p&gt;

&lt;p&gt;The last thing I want to leave you with is this: &lt;strong&gt;with all of these in place, is that actually sufficient to achieve five nines?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No. Our commitment is 5-nines, and what we do is in defense of that; just because you do all these things doesn't automatically mean your promise of 5-nines is guaranteed. And you know what, you too can promise a 5-nines SLA without doing anything. You'll likely break your promise, but for us our promise is important, and so this is our defense.&lt;/p&gt;




&lt;p&gt;For help understanding this article or how you can implement a solution like this one in your services, feel free to reach out to me and join my community:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rhosys.ch/community" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Join the community&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>aws</category>
      <category>reliability</category>
      <category>architecture</category>
      <category>serverless</category>
    </item>
    <item>
      <title>AWS Auth Caching Strategies</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Tue, 17 Jun 2025 13:10:24 +0000</pubDate>
      <link>https://dev.to/aws-builders/aws-auth-caching-strategies-4121</link>
      <guid>https://dev.to/aws-builders/aws-auth-caching-strategies-4121</guid>
<description>&lt;p&gt;Caching is difficult to get right and often means you need to pull additional frameworks into your code. Fine-tuning the balance between performance and data freshness takes time and experience. In the case of User-Agent integrations (for example, an application UI running in your user’s browser), it is even more crucial, as the User-Agent is rarely under your control and yet demands fast response times. This is why I often opt to provide cache recommendations on the service side. One such example is the product I work heavily with, &lt;a href="https://authress.io" rel="noopener noreferrer"&gt;Authress&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;That doesn’t mean you can’t cache returned values for longer.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I'm going to use Authress as an example for caching, so a quick summary might make sense. Authress provides login and access control for the applications you write. This means permission checks. (And yes, because we are a Swiss company, focusing on the EU market is critical.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So, in the case that you’re making a lot of the same low-variability permission checks, you may want to build a cache on top of Authress to limit your costs. It is not strictly necessary, though. I'm going to walk through how AWS can be utilized to provide different caching opportunities when interacting with third-party services.&lt;/p&gt;

&lt;h2&gt;
  
  
  General caching strategies
&lt;/h2&gt;

&lt;p&gt;In the context of Authorization, frequently the goal is to cache &lt;a href="https://authress.io/knowledge-base/docs/category/authorization" rel="noopener noreferrer"&gt;Authorization Requests&lt;/a&gt; as much as is useful. The following strategies review the available possibilities. Let's assume that recommended cache times are always returned in the Cache-Control header of the response to a user-permission authorization request against the API.&lt;/p&gt;
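&lt;p&gt;To act on those recommendations, the service side has to read the max-age out of the Cache-Control response header. A minimal sketch of that parsing (the header values are illustrative):&lt;/p&gt;

```javascript
// Extract the max-age directive (in seconds) from a Cache-Control header value.
// Returns 0 when the response should not be cached at all.
function getRecommendedCacheSeconds(cacheControlHeader) {
  if (!cacheControlHeader) {
    return 0;
  }
  const directives = cacheControlHeader.toLowerCase().split(',').map(function (d) {
    return d.trim();
  });
  if (directives.includes('no-store') || directives.includes('no-cache')) {
    return 0;
  }
  for (const directive of directives) {
    const match = directive.match(/^max-age=(\d+)$/);
    if (match) {
      return Number(match[1]);
    }
  }
  return 0;
}
```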

&lt;h2&gt;
  
  
  A. API Gateway
&lt;/h2&gt;

&lt;p&gt;If you run an API Gateway, there is a built-in caching strategy to support caching data for a short period of time. If data can be cached on a per-request basis, then adding the user's permissions and authorization details to the cache is an option. This is known as "Caching Authorization checks in API Gateway".&lt;/p&gt;

&lt;p&gt;Depending on your API Gateway, this can work better for serverless solutions than for others. API Gateway caching uses the Access Token as the default cache key, which means you must add the &lt;code&gt;Resource URI Path&lt;/code&gt; and the &lt;code&gt;Request HTTP Method&lt;/code&gt; to the cache key to ensure a path-specific authorization is cached.&lt;/p&gt;

&lt;p&gt;The most common and effective examples to cache would include &lt;code&gt;A list of all the tenants&lt;/code&gt; or &lt;code&gt;customer accounts a user has access to&lt;/code&gt;. Since these lists rarely change, storing this information in the AWS API Gateway cache works well.&lt;/p&gt;

&lt;p&gt;Getting the list of tenants a user has access to in the API Gateway authorizer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;AuthressClient&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@authress/sdk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;authressClient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AuthressClient&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;authressApiUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://auth.yourdomain.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;userResources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt;  
  &lt;span class="nx"&gt;authressClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;userPermissions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getUserResources&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;`tenants/*`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;CollectionConfiguration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;TOP_LEVEL_ONLY&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Stringify is because API does not support arrays.&lt;/span&gt;
    &lt;span class="na"&gt;userResources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;userResources&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resources&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;,&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Danger!
&lt;/h3&gt;

&lt;p&gt;I'm going to repeat this: &lt;strong&gt;You must ensure that the cache key associated with the API includes the HTTP Method and the full resource URI.&lt;/strong&gt; If you are not sure what this means, please consult your API Gateway documentation. In API Gateway, update the &lt;code&gt;Identity Source&lt;/code&gt; to include both the HTTP Method and the Path, which are both sourced from the request context.&lt;/p&gt;

&lt;p&gt;See &lt;a href="https://dev.to/aws-builders/api-gateway-vulnerabilities-by-design-be-careful-2094"&gt;API Gateway configuration vulnerabilities&lt;/a&gt; for more information.&lt;/p&gt;

&lt;h2&gt;
  
  
  B. Content Delivery Networks and Edge-based caching
&lt;/h2&gt;

&lt;p&gt;A CDN can often work as a proxy for all requests to a target provider. Instead of integrating directly with your API target of choice, you can proxy the requests through another solution that sits in front of your Auth provider. Some CDNs work well for this; others might not.&lt;/p&gt;

&lt;p&gt;In the case of AWS, the canonical solution would be &lt;strong&gt;AWS CloudFront&lt;/strong&gt;. From the experience of my development team, AWS CloudFront can be a bit finicky when placed in front of services that you don't own. Some of our users say it has worked; others have run into limitations from CloudFront, especially regarding cache times and configuration. Usually in these cases, you might need a Lambda@Edge function attached to your CloudFront distribution to interact with the third party.&lt;/p&gt;

&lt;p&gt;Due to this, there might be limited value in the caching that CloudFront could provide. A common corner case I've found is that the motivation is often cost reduction: the costs incurred by calling that third-party API. Costs are of course relevant at scale, but at that same scale I tend to think about volume discounts instead, rather than forcing the use of, and therefore additionally paying for, a CDN on top of the third party.&lt;/p&gt;

&lt;p&gt;Take for example &lt;a href="https://authress.io" rel="noopener noreferrer"&gt;Authress&lt;/a&gt;: as a company, we would much prefer to offer a discount than force you to build complexity. You get the benefit directly from Authress Billing without having to write or maintain anything yourself or pay for a second technology on top (in price or total cost of ownership). If you are investigating a caching solution primarily to handle costs at scale, please contact your provider. If your provider won't offer alternatives to make your integration seamless, then that might not be a provider worth continuing with. Rather than trying to wrap a bad solution, find a better one!&lt;/p&gt;

&lt;p&gt;Once a request is passed to Lambda@Edge, you have full capabilities for storing and retrieving data through different data stores, such as DynamoDB. The implementation details, however, would be up to you.&lt;/p&gt;

&lt;h3&gt;
  
  
  Troubleshoot AWS CloudFront
&lt;/h3&gt;

&lt;p&gt;I do want to share a quick callout though. One possible error you might see is related to a &lt;a href="https://stackoverflow.com/questions/62811208/daisy-chained-cloudfront-with-host-header-forwarding" rel="noopener noreferrer"&gt;CloudFront stacking issue&lt;/a&gt;. Since Authress itself uses CloudFront, depending on your setup you might run into a stacking problem. At the moment, if you are seeing this issue, there isn't a way for CloudFront to be used in your scenario, so we recommend switching to Lambda@Edge with CloudFront and interacting with Authress from there. This is explored further in the next sections.&lt;/p&gt;

&lt;h2&gt;
  
  
  C. Self-hosted internal proxy
&lt;/h2&gt;

&lt;p&gt;When you are at the point of wanting a proxy to cache authorization requests, a small microservice can be created to proxy all the requests to your provider. This could run as a standalone service. The proxy would pass requests along to Authress after interacting with your cache datastore.&lt;/p&gt;

&lt;p&gt;Hopefully the third party's SDKs support a configurable target endpoint. Instead of setting it to your &lt;a href="https://authress.io/knowledge-base/docs/introduction/getting-started-with-authress#custom-domains" rel="noopener noreferrer"&gt;Custom Domain&lt;/a&gt; such as &lt;a href="https://auth.yourdomain.com" rel="noopener noreferrer"&gt;https://auth.yourdomain.com&lt;/a&gt;, you would set the target endpoint to be your own microservice's URL.&lt;/p&gt;

&lt;p&gt;Proxy service for caching permissions requests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;AuthressClient&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@authress/sdk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Switch this to be your cache's URL:&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;authressClient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AuthressClient&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;authressApiUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://cache.yourdomain.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;User&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;resourceUri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`resources/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;resourceId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;permission&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;READ&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;authressClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;userPermissions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;authorizeUser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resourceUri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;permission&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;code&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;UnauthorizedError&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;403&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
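&lt;p&gt;On the proxy side itself, the core logic is just: check the datastore, otherwise forward. A rough sketch with the datastore and the upstream call injected (both names are hypothetical, not part of any SDK):&lt;/p&gt;

```javascript
// Sketch of the proxy's core: check the cache, otherwise forward to Authress.
// `cacheStore` is any async get/set store (e.g. backed by DynamoDB or Valkey);
// `forwardToAuthress` performs the real upstream request. Both are injected
// here so the logic stays testable.
async function handleProxiedRequest(request, cacheStore, forwardToAuthress) {
  const cacheKey = request.method + ':' + request.path + ':' + request.authorizationHeader;
  const cached = await cacheStore.get(cacheKey);
  if (cached) {
    return cached;
  }
  const response = await forwardToAuthress(request);
  // Only cache successful authorization responses.
  if (response.statusCode === 200) {
    await cacheStore.set(cacheKey, response);
  }
  return response;
}
```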



&lt;p&gt;For assistance with creating a proxy, I recommend reaching out to the provider with questions. Many products have secret fields and configurations in their SDKs (in the case of our own SDKs, we ship additional security configuration there); attempting to side-step the SDK to build a custom caching layer will cause you to lose those optimizations.&lt;/p&gt;

&lt;h2&gt;
  
  
  D. SDK configured caching
&lt;/h2&gt;

&lt;p&gt;Recently I've been investing further resources into improving the built-in caching in our own SDKs, but in general, the SDKs for each language from each provider have varying levels of caching support.&lt;/p&gt;

&lt;p&gt;Caching in the SDK works well for longer-lived containers. For sustained requests to your API, even with a serverless solution, your function will keep this data cached for the lifetime of the container. This works great for balanced, predictable usage; it is less valuable for bursts. For non-serverless solutions, when caching is provided by the SDK in your language, it can work out of the box.&lt;/p&gt;

&lt;p&gt;Some SDKs support caching and caching configuration, and others do not. Whether they do depends on the tools available in the language, as well as on libraries supporting &lt;a href="https://en.wikipedia.org/wiki/Memoization" rel="noopener noreferrer"&gt;memoization&lt;/a&gt;.&lt;/p&gt;

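&lt;p&gt;Where an SDK doesn't offer caching, the underlying idea is a TTL-bounded memoizer wrapped around the SDK call. A minimal sketch (the function names are illustrative, not part of any SDK):&lt;/p&gt;

```javascript
// Memoize an async function, keeping each result for `ttlMs` milliseconds.
function memoizeWithTtl(fn, ttlMs) {
  const entries = new Map();
  return async function (...args) {
    const key = JSON.stringify(args);
    const entry = entries.get(key);
    // Reuse the cached value while it is still fresh.
    if (entry) {
      const expired = Date.now() - entry.storedAt >= ttlMs;
      if (!expired) {
        return entry.value;
      }
    }
    const value = await fn(...args);
    entries.set(key, { value: value, storedAt: Date.now() });
    return value;
  };
}
```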
&lt;h2&gt;
  
  
  In-memory caching
&lt;/h2&gt;

&lt;p&gt;Depending on the sort of caching you are looking for, or how your requests look, in-memory caching can often provide the best impact. It gives you full control over how caching is done. There are a bunch of options available, and which levers you pull will be based on your core needs.&lt;/p&gt;

&lt;p&gt;Long term, if the SDK you are using doesn't support the caching configuration you need and you have a solution you've been using effectively, please let us (or your provider) know, and hopefully they'll convert your in-memory caching configuration into a first-class option in the SDK for that language. (Note: a company value of Customer Obsession may be required for this last part to work.)&lt;/p&gt;

&lt;p&gt;Here is an example of how such a cache could work:&lt;/p&gt;

&lt;p&gt;In-memory cache wrapper for javascript:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;AuthressClient&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@authress/sdk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;authressClient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AuthressClient&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;authressApiUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://auth.yourdomain.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// create a cache that stores the results for 10 seconds&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Cache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;User&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;resourceUri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`resources/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;resourceId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;permission&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;READ&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;hasAccess&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resourceUri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;permission&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// No value is cached&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;hasAccess&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;authressClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;userPermissions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;authorizeUser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resourceUri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;permission&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;storeValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resourceUri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;permission&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;hasAccess&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;code&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;UnauthorizedError&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;storeValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resourceUri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;permission&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;hasAccess&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;hasAccess&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;403&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
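&lt;p&gt;The &lt;code&gt;Cache&lt;/code&gt; wrapper in the example above is left undefined. A minimal in-memory implementation matching that interface could look like the sketch below; note that it never evicts entries, so a production version would also need to cap its size:&lt;/p&gt;

```javascript
// Minimal TTL cache matching the getValue/storeValue interface used above.
class Cache {
  constructor(ttlMs) {
    this.ttlMs = ttlMs;
    this.entries = new Map();
  }

  // Returns the stored boolean, or null when the entry is missing or expired.
  async getValue(userId, resourceUri, permission) {
    const key = [userId, resourceUri, permission].join('|');
    const entry = this.entries.get(key);
    if (!entry) {
      return null;
    }
    if (Date.now() - entry.storedAt >= this.ttlMs) {
      this.entries.delete(key);
      return null;
    }
    return entry.value;
  }

  async storeValue(userId, resourceUri, permission, value) {
    const key = [userId, resourceUri, permission].join('|');
    this.entries.set(key, { value: value, storedAt: Date.now() });
  }
}
```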



&lt;h2&gt;
  
  
  Shared internal cache
&lt;/h2&gt;

&lt;p&gt;One strategy that works well with multiple services, when not using serverless or even sometimes when using serverless, is a server optimized for fast cache lookups. That is, if you have multiple services that all need to interact with the same third party in the same way, and access to that third party isn't necessarily well-secured, or all your services use similar credentials for accessing that third party, you might benefit from a shared cache.&lt;/p&gt;

&lt;p&gt;Back to the authorization example: after an SDK returns a success for an authorization request, you could store the result in a cache-optimized solution. A recommendation for this strategy would be to use Valkey. Most cloud providers either offer a managed Valkey solution or support deploying the open source container to your infrastructure, and AWS is no exception:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/elasticache/what-is-valkey/" rel="noopener noreferrer"&gt;AWS ValKey&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/elasticache/redis/" rel="noopener noreferrer"&gt;AWS ElastiCache&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Further Caching Support
&lt;/h2&gt;

&lt;p&gt;Have some ideas that aren't listed here and think I should extend this list? Please let me know so I can add to the recommended caching strategies in this article.&lt;/p&gt;

&lt;p&gt;For help understanding this article or how you can implement a solution like this one in your services, feel free to reach out to me and join my community:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rhosys.ch/community" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Join the community&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>security</category>
      <category>aws</category>
      <category>cloud</category>
      <category>architecture</category>
    </item>
    <item>
      <title>API Gateway Authorizers: Vulnerable By Design (be careful!)</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Fri, 23 May 2025 08:51:49 +0000</pubDate>
      <link>https://dev.to/aws-builders/api-gateway-vulnerabilities-by-design-be-careful-2094</link>
      <guid>https://dev.to/aws-builders/api-gateway-vulnerabilities-by-design-be-careful-2094</guid>
<description>&lt;p&gt;I had the benefit of joining the &lt;a href="https://www.awsug.ch/" rel="noopener noreferrer"&gt;AWS Community Day in Zürich&lt;/a&gt; this week. Most of it went as expected, but then an interesting question came up... &lt;code&gt;Does caching in API Gateway create vulnerabilities for products using Authorizer Caching?&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Authorization
&lt;/h2&gt;

&lt;p&gt;When your users call your API, you have an obvious need to verify these requests should actually be allowed. I've talked extensively about this in my academy article on &lt;a href="https://authress.io/knowledge-base/academy/topics/implementating-user-login" rel="noopener noreferrer"&gt;what the @#!? is Auth&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Even if you haven't read that article, if you are well versed in the need for users to authenticate and authorize to your specific service API and endpoints, then you get the gist.&lt;/p&gt;

&lt;p&gt;So you have a need to verify the access tokens sent by users on every request. When using AWS, this means using API Gateway, and when using API Gateway, that likely means you'll be using an API Gateway Authorizer.&lt;/p&gt;

&lt;p&gt;Authorizers in API Gateway exist so that you can more easily verify user access tokens. As a reminder, an authorization token looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"identityProviderId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://authress.io"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"userId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"TechInternals|test-user-001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"expires"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1761483600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"signatureKeyId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"example-key"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"signature"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SflKxwRJSMeKKF2Qt4fwpMe"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the process to verify the token looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;authressClient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AuthressClient&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;authressApiUrl&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;userIdentity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;authressClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;verifyToken&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userToken&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Of course, swap in your favorite open source JWT verifier. &lt;a href="https://authress.io/knowledge-base/docs/authentication/validating-jwts" rel="noopener noreferrer"&gt;More extensive details, depending on your identity provider, are available&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now I know what you are thinking&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I'm going to get a lot of requests from the same user to my same API, for different resources. That means they are all going to have the same JWT. Wouldn't it be great to cache those results so that I don't need to verify the same JWT over and over again, every time this same user makes a similar request for similar data with the same JWT?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And you would be right!&lt;/p&gt;

&lt;h2&gt;
  
  
  Caching
&lt;/h2&gt;

&lt;p&gt;However, if you wrote the above code and you cache its result, you might start to see a problem with it...&lt;/p&gt;

&lt;p&gt;Caching in API Gateway is keyed, by default, on the authorization token only and nothing else. This means that the result from one request will interfere with the next one.&lt;/p&gt;

&lt;p&gt;Let's take for example the policy result from an AWS API Gateway Authorizer. It might look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;policy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;principalId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;userIdentity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;policyDocument&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;Version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;2012-10-17&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;Statement&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
            &lt;span class="na"&gt;Effect&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Allow&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;execute-api:Invoke&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;Resource&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;methodArn&lt;/span&gt;
        &lt;span class="p"&gt;}]&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;principalId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;userIdentity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sub&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is actually a problem with this, however. The cache key by default is only the JWT, but the result of this policy says that the user is only allowed access to one particular &lt;code&gt;event.methodArn&lt;/code&gt;. As a reminder, a method ARN identifies a single route, such as &lt;code&gt;GET /orders/order_id_123&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That means on a follow-up request with the same JWT to a different endpoint, &lt;code&gt;GET /orders/order_id_456&lt;/code&gt;, even if the user should have access to that resource and their JWT is still valid, API Gateway will deny that request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Well, that is simple: the result is cached based only on the JWT, and the cached result specifies that only the one route &lt;code&gt;GET /orders/order_id_123&lt;/code&gt; has been authorized.&lt;/p&gt;

&lt;p&gt;In the worst case scenario, you have a short cache time, and the only consequence is a brief but confusing user experience that quickly resolves to the correct behavior.&lt;/p&gt;

&lt;p&gt;But you are smart, and you realize there is a fix: instead of passing the &lt;code&gt;event.methodArn&lt;/code&gt; in the result policy, you specify &lt;code&gt;['arn:aws:execute-api:*:*:*']&lt;/code&gt; as the resource result.&lt;/p&gt;

&lt;p&gt;Now subsequent requests, as long as the JWT is still valid and irrespective of the endpoint, will let the user through!&lt;/p&gt;
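
&lt;p&gt;For reference, a sketch of the fixed authorizer result, with only the &lt;code&gt;Resource&lt;/code&gt; changed from the earlier policy:&lt;/p&gt;

```javascript
// Sketch of the wildcard fix: only the Resource changes, so a cached policy
// applies to every endpoint instead of pinning the user to one methodArn.
function buildPolicy(userIdentity) {
  return {
    principalId: userIdentity.sub,
    policyDocument: {
      Version: '2012-10-17',
      Statement: [{
        Effect: 'Allow',
        Action: 'execute-api:Invoke',
        // Wildcard instead of event.methodArn
        Resource: ['arn:aws:execute-api:*:*:*']
      }]
    },
    context: { principalId: userIdentity.sub }
  };
}
```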

&lt;p&gt;&lt;strong&gt;🎉🎉🎉&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And this works great.&lt;/p&gt;

&lt;p&gt;But you are thinking: why stop there? Can we go further?&lt;/p&gt;

&lt;p&gt;And the answer is also yes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Authorization of Granular Resource-Based Access Control
&lt;/h2&gt;

&lt;p&gt;You might be using solutions such as &lt;a href="https://aws.amazon.com/verified-permissions/" rel="noopener noreferrer"&gt;AWS Verified Permissions&lt;/a&gt; hoping to connect it together with Cognito and API Gateway.&lt;/p&gt;

&lt;p&gt;Now I know what you are thinking, why is Warren investigating verified permissions when &lt;a href="https://authress.io" rel="noopener noreferrer"&gt;Authress&lt;/a&gt; already solves all these problems? Well sometimes even I have to write an article about how the integration of default resources in AWS can cause security misconfigurations.&lt;/p&gt;

&lt;p&gt;Here is the decision: not just cache the validity of the JWT, but also cache whether or not the user actually has access to call the endpoint in question. If you take that additional step of verifying the user's authorization and you cache it too, you will have just created a major security vulnerability in your application.&lt;/p&gt;

&lt;p&gt;Do you already see what the problem might be?&lt;/p&gt;

&lt;p&gt;In your authorizer you are likely to write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;hasAccess&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;authress&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;userPermissions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;authorizeUser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;resource&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;resource:read&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you check the user's access inside the authorizer and the result is cached, then subsequent requests to the same API will use the cached result.&lt;/p&gt;

&lt;p&gt;If the user has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Access to &lt;code&gt;orders_123&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;No Access to &lt;code&gt;orders_456&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And then calls:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;GET &lt;code&gt;orders_123&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;GET &lt;code&gt;orders_456&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;They will incorrectly be allowed to access that second order.&lt;/p&gt;

&lt;p&gt;That's because the authorizer will have cached &lt;code&gt;ALLOW&lt;/code&gt; for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;hasAccess&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;authress&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;userPermissions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;authorizeUser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;orders_123&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;orders:read&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;ALLOW&lt;/code&gt; is set as the cache result for the user's JWT:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;JWT_001 =&amp;gt; ALLOW
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cache doesn't contain the orderId. Or said differently, the cache is &lt;strong&gt;NOT&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[JWT_001, GET, orders_123] =&amp;gt; ALLOW
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That means when the second request comes in, we go to the cache table, see that an entry already exists for &lt;code&gt;JWT_001&lt;/code&gt;, return &lt;code&gt;ALLOW&lt;/code&gt;, and never actually check the authorization for &lt;code&gt;orders_456&lt;/code&gt;.&lt;/p&gt;
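
&lt;p&gt;The whole failure mode fits in a few lines. A toy simulation, with illustrative permission data, of a cache keyed only on the JWT:&lt;/p&gt;

```javascript
// Toy simulation of the vulnerability: the cache is keyed on the JWT alone,
// so the first decision is replayed for every later request with that token.
const permissions = { orders_123: true, orders_456: false }; // illustrative data
const cache = new Map();

function authorize(jwt, orderId) {
  if (cache.has(jwt)) return cache.get(jwt); // keyed on the JWT only!
  const allowed = permissions[orderId] === true; // the real permission check
  cache.set(jwt, allowed);
  return allowed;
}

authorize('JWT_001', 'orders_123'); // true: ALLOW, and now cached for JWT_001
authorize('JWT_001', 'orders_456'); // also true: the cached ALLOW leaks
```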

&lt;h2&gt;
  
  
  Removing the security vulnerability
&lt;/h2&gt;

&lt;p&gt;It would be nice if API Gateway were secure by default and required the &lt;code&gt;identity source&lt;/code&gt; cache key to include the resource path and method. But it isn't, so it doesn't. This risk is similar to the ones engineers run into all day long with caching in CloudFront. And given how frequent CloudFront caching issues are, even though those carry no security vulnerability, we can see that when AWS created the Verified Permissions service and related functionality, it opened up a hugely dangerous potential configuration in API Gateway.&lt;/p&gt;

&lt;p&gt;This isn't an explicit vulnerability in the service though, since the vulnerability only exists through improper configuration; the trouble is that the improper configuration is the default. Show me a company using API Gateway and AWS Verified Permissions, and I bet I can show you a security bounty waiting to be collected.&lt;/p&gt;

&lt;p&gt;The resolution here is to force the API Gateway Authorizer to also include the &lt;code&gt;httpMethod (Context)&lt;/code&gt; and &lt;code&gt;path (Context)&lt;/code&gt; in the cache key.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj6phmsn7yjnu5657fxa2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj6phmsn7yjnu5657fxa2.png" alt="API Gateway Authorizer expected configuration" width="800" height="1265"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once that is done, API Gateway closes this security hole, because the cache key will match the authorization check performed by your authorization provider.&lt;/p&gt;
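
&lt;p&gt;In cache-key terms, the corrected configuration behaves like this toy sketch: the key is the JWT plus the method and path, so each endpoint gets its own cached decision:&lt;/p&gt;

```javascript
// Toy sketch of the corrected cache key: JWT + method + path. Each
// (token, endpoint) pair now caches its own authorization decision.
function authorizeWithFixedKey(cache, permissions, jwt, method, orderId) {
  const key = `${jwt}|${method}|${orderId}`;
  if (cache.has(key)) return cache.get(key);
  const allowed = permissions[orderId] === true; // the real permission check
  cache.set(key, allowed);
  return allowed;
}
```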

&lt;h2&gt;
  
  
  Recommendations
&lt;/h2&gt;

&lt;p&gt;On your side there is little you can do to remove the pit of failure. Review documentation, and invest in a deep understanding of the tools you use, especially when security is involved. I guess also keep reading my posts, as I often try to focus on security-related topics.&lt;/p&gt;

&lt;p&gt;On the AWS side, there is absolutely a strategy that would have fixed this by design. The authorizer should not have access to the Path and Method properties of the HTTP request unless the identity source cache key includes them. This would require breaking existing configurations, but it would be in the name of security by default.&lt;/p&gt;

&lt;h2&gt;
  
  
  Going further
&lt;/h2&gt;

&lt;p&gt;There are actually lots of different ways to cache permission results in AWS, even without using Verified Permissions. For an extensive list of the options and my personal recommendations, check out this &lt;a href="https://authress.io/knowledge-base/docs/advanced/caching" rel="noopener noreferrer"&gt;Auth Academy article&lt;/a&gt; on the topic.&lt;/p&gt;




&lt;p&gt;Come join my &lt;a href="https://authress.io/community/" rel="noopener noreferrer"&gt;Community&lt;/a&gt; and discuss this and other security related topics!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>security</category>
      <category>api</category>
      <category>authentication</category>
    </item>
    <item>
      <title>The Blog Post Release Automation</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Mon, 19 May 2025 13:46:12 +0000</pubDate>
      <link>https://dev.to/aws-builders/the-blog-post-release-automation-3kbd</link>
      <guid>https://dev.to/aws-builders/the-blog-post-release-automation-3kbd</guid>
      <description>&lt;h2&gt;
  
  
  The Blog Post Release Automation
&lt;/h2&gt;

&lt;p&gt;I made the mistake this week of believing I wanted to automate, using an LLM of course, some parts of the painful podcast release cycle.&lt;/p&gt;

&lt;p&gt;Weekly I record episodes of the podcast &lt;a href="https://adventuresindevops.com" rel="noopener noreferrer"&gt;Adventures in DevOps&lt;/a&gt; with my awesome co-host. Of course all the episodes are available on our podcast website, as well as on other streaming platforms.&lt;/p&gt;

&lt;p&gt;But! Since we're a technical podcast, we decided to make our infrastructure open source (On &lt;a href="https://github.com/AdventuresInDevops/Website" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; unfortunately), but to go further it also uses &lt;a href="https://github.com/AdventuresInDevops/Website/blob/main/.github/workflows/build.yml" rel="noopener noreferrer"&gt;GitHub Actions&lt;/a&gt; to publish the &lt;a href="https://github.com/AdventuresInDevops/Website/blob/main/.github/workflows/build.yml" rel="noopener noreferrer"&gt;episodes to our website&lt;/a&gt;. There is of course the nasty bit of actually recording the episodes, editing the episodes, and then downloading and formatting them to make them nice.&lt;/p&gt;

&lt;p&gt;After that is all done though, it is time to create the episode page and, most importantly, the cornerstone of every podcast: &lt;strong&gt;an awesome episode image&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So let's get down to it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Execution
&lt;/h2&gt;

&lt;p&gt;Interestingly enough, the Nova Lite model failed completely when asked to build the command I needed to invoke the model itself. Not very self-aware, you might say.&lt;/p&gt;

&lt;p&gt;However using other models I was able to coax out the following recommendation:&lt;/p&gt;

&lt;p&gt;With the episode saved in the transcript.txt file, and the instructions we want to run in the instructions.txt file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="cp"&gt;#!/usr/bin/env node
&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;BedrockRuntimeClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;InvokeModelCommand&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@aws-sdk/client-bedrock-runtime&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;fs/promises&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;path&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;fileURLToPath&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;url&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Resolve file paths&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;__dirname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dirname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;fileURLToPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;import&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;instructionsPath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;__dirname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;instructions.txt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;transcriptPath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;__dirname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;transcript.txt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Set up Bedrock client&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;BedrockRuntimeClient&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;eu-west-1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Read both input files&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;instructions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
      &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;instructionsPath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;utf-8&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;transcriptPath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;utf-8&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;]);&lt;/span&gt;

    &lt;span class="c1"&gt;// Build prompt&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`Instructions:\n&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;instructions&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;\n\nTranscript:\n---\n&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;\n---`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
        &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nx"&gt;content&lt;/span&gt;
      &lt;span class="p"&gt;}],&lt;/span&gt;
      &lt;span class="c1"&gt;// Max Token Count and other parameters: https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-titan-text.html&lt;/span&gt;
      &lt;span class="na"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;top_p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4096&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;

    &lt;span class="c1"&gt;// Invoke the model&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;InvokeModelCommand&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;modelId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;amazon.nova-lite-v1:0&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;contentType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;accept&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;command&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Decode and print response&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;responseBody&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transformToString&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;✅ Model response:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;responseBody&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;❌ Failed to invoke model:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;})();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that's it: we can take the output, create a pull request, and then release the episode.&lt;/p&gt;

&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;p&gt;Of course nothing works the first time, and for us the first issue is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Failed to invoke model: ValidationException: Invocation of model ID amazon.nova-lite-v1:0 with on-demand throughput isn’t supported. Retry your request with the ID or ARN of an inference profile that contains this mode.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Okay, it turns out there is some magic required to run the Nova model in other regions, so instead of trying to get that to work, we'll switch to the &lt;code&gt;us-east-1&lt;/code&gt; region.&lt;br&gt;
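
&lt;p&gt;(For what it's worth, the "magic" appears to be invoking the model through a cross-region inference profile ID rather than the bare model ID. I haven't verified this in eu-west-1, so treat the parameter below as an assumption; switching regions was the faster workaround.)&lt;/p&gt;

```javascript
// Assumption: a region-prefixed inference profile ID avoids the on-demand
// throughput error without changing regions. Only the modelId changes.
const invokeParams = {
  modelId: 'eu.amazon.nova-lite-v1:0', // inference profile, not amazon.nova-lite-v1:0
  contentType: 'application/json',
  accept: 'application/json'
};
```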
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Malformed input request: #: required key [messages] not found, please reformat your input and try again.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hmmm, weird. It turns out there have been some serious changes to the API, and the documentation is not really up to date, so figuring out the correct parameters is actually a bit of a problem.&lt;/p&gt;

&lt;p&gt;But setting the payload as just:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
        &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
          &lt;span class="c1"&gt;// type: "text",&lt;/span&gt;
          &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;inputText&lt;/span&gt;
        &lt;span class="p"&gt;}]&lt;/span&gt;
      &lt;span class="p"&gt;}]&lt;/span&gt;
      &lt;span class="c1"&gt;// Max Token Count and other parameters: https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-titan-text.html&lt;/span&gt;
      &lt;span class="c1"&gt;// temperature: 0.7&lt;/span&gt;
      &lt;span class="c1"&gt;// top_p: 0.9,&lt;/span&gt;
      &lt;span class="c1"&gt;// max_tokens: 4096&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Solves most of this problem.&lt;/p&gt;
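
&lt;p&gt;One last gotcha worth noting: the response body comes back as a JSON string, and in my runs the generated text lived at &lt;code&gt;output.message.content[0].text&lt;/code&gt;. A small helper, assuming that shape:&lt;/p&gt;

```javascript
// Extracts the generated text from a Nova InvokeModel response body,
// assuming the output.message.content[0].text shape.
function extractNovaText(rawBody) {
  const parsed = JSON.parse(rawBody);
  return parsed.output.message.content[0].text;
}
```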

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzz9awqlw574tpln2ag2y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzz9awqlw574tpln2ag2y.png" alt="Nova Blocks itself" width="800" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Although one problem we keep running into is that the Nova models' "Content" filter keeps blocking itself. Even sending the very innocuous "hey" to the model to generate three images fails after the first one.&lt;/p&gt;

&lt;p&gt;Success!?&lt;/p&gt;

&lt;h2&gt;
  
  
  The podcast image
&lt;/h2&gt;

&lt;p&gt;The next step is to run the generator a second time, but this time use the output from the first step as the input to generate an image relevant to the podcast.&lt;/p&gt;

&lt;p&gt;There are a couple of changes that have to be made.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We don't need the transcript anymore since we already have a summary.&lt;/li&gt;
&lt;li&gt;We need to pass an input image; we don't want some random picture, we want something that is brand-aware.&lt;/li&gt;
&lt;li&gt;The output will be an image as well.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So instead we'll use the Nova Canvas model: &lt;code&gt;amazon.nova-canvas-v1:0&lt;/code&gt; with the parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;referenceImage1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;referenceImagePath1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;referenceImage2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;referenceImagePath2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;inputText&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
              &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;jpeg&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                  &lt;span class="na"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;referenceImage1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;base64&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
              &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
              &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;png&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                  &lt;span class="na"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;referenceImage2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;base64&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
              &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
          &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And we can write out the results using the response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;imageData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;responseBody&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;__dirname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;`image.png`&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;imageData&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;base64&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
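&lt;p&gt;For context, the &lt;code&gt;responseBody&lt;/code&gt; above comes from invoking the model through the Bedrock runtime. A minimal sketch of assembling that request input (the helper name &lt;code&gt;buildInvokeInput&lt;/code&gt; is ours; the field names follow the Bedrock &lt;code&gt;InvokeModel&lt;/code&gt; request shape):&lt;/p&gt;

```javascript
// Sketch: build the input object for a Bedrock InvokeModel request.
// The helper name is ours; modelId, body, contentType, and accept are
// the standard InvokeModel request fields.
function buildInvokeInput(modelId, payload) {
  return {
    modelId,
    contentType: 'application/json',
    accept: 'application/json',
    body: JSON.stringify(payload)
  };
}

const input = buildInvokeInput('amazon.nova-canvas-v1:0', {
  messages: [{ role: 'user', content: [{ text: 'Generate a podcast cover image' }] }]
});
```

&lt;p&gt;The resulting object would then be passed to an &lt;code&gt;InvokeModelCommand&lt;/code&gt; and sent with the &lt;code&gt;BedrockRuntimeClient&lt;/code&gt; from &lt;code&gt;@aws-sdk/client-bedrock-runtime&lt;/code&gt;, yielding the &lt;code&gt;responseBody&lt;/code&gt; parsed above.&lt;/p&gt;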



&lt;h2&gt;
  
  
  The result
&lt;/h2&gt;

&lt;p&gt;Well, I think pictures are far more expressive than words, so check out the latest episode here on &lt;a href="https://adventuresindevops.com/episodes" rel="noopener noreferrer"&gt;Adventures in DevOps&lt;/a&gt; to see exactly how well we did!&lt;/p&gt;

&lt;h2&gt;
  
  
  Our Verdict
&lt;/h2&gt;

&lt;p&gt;Nova is not ready for prime time. For now, we are going to try out some of the other models offered through Bedrock and focus on getting more high-quality content. Quality and reliability are crucial here, as we aim to cut down the time it takes to create the episode releases.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>ai</category>
      <category>aws</category>
      <category>bedrock</category>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Fri, 24 Jan 2025 17:59:44 +0000</pubDate>
      <link>https://dev.to/wparad/-44dn</link>
      <guid>https://dev.to/wparad/-44dn</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/authress" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__org__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F2625%2F18c2bb45-3a91-4fc8-86a0-3006f2b6b93a.png" alt="Authress Engineering Blog" width="512" height="512"&gt;
      &lt;div class="ltag__link__user__pic"&gt;
        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F86409%2Fad0e5c54-e76f-4fd9-864e-f04b266ab62f.jpg" alt="" width="800" height="800"&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/authress/the-risks-of-user-impersonation-58nf" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;The Risks of User Impersonation&lt;/h2&gt;
      &lt;h3&gt;Warren Parad for Authress Engineering Blog ・ Jan 24&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#authentication&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#authorization&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#identity&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#security&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>webdev</category>
      <category>security</category>
      <category>api</category>
    </item>
    <item>
      <title>The Risks of User Impersonation</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Fri, 24 Jan 2025 17:58:49 +0000</pubDate>
      <link>https://dev.to/authress/the-risks-of-user-impersonation-58nf</link>
      <guid>https://dev.to/authress/the-risks-of-user-impersonation-58nf</guid>
      <description>&lt;h2&gt;
  
  
  What is user impersonation?
&lt;/h2&gt;

&lt;p&gt;User impersonation is anything that allows your systems to believe the current logged in user is someone else. With regards to JWTs and access tokens, this means that one user obtains a JWT that contains another user's &lt;code&gt;User ID&lt;/code&gt;. User impersonation or logging in as a customer can be used as a tool to help identify many issues from user authentication and onboarding to corrupted data in complex multi-service business logic flows.&lt;/p&gt;

&lt;p&gt;However, at first glance it should be obvious that there are major security implications with such an approach. Even if it isn't, this article will extensively review user impersonation and its security implications, as well as offer alternative suggestions to achieve a similar outcome in a software system without compromising security.&lt;/p&gt;

&lt;h2&gt;
  
  
  The impersonation use cases
&lt;/h2&gt;

&lt;p&gt;No solution is relevant in a vacuum, so let's consider the concrete issues that you might actually have, and the reason you've arrived at this &lt;a href="https://authress.io/knowledge-base/academy/topics" rel="noopener noreferrer"&gt;Authress Academy&lt;/a&gt; article. If we were to jump straight into a solution, then we'll definitely end up sacrificing security or, worse, our users' sensitive data in favor of suboptimal solutions.&lt;/p&gt;

&lt;p&gt;Possible use case user stories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One of your users reports that they are experiencing an issue with a screen in your application portal not showing the correct information. As a support engineer, you want to review the exact display in the application UI that your user sees, so that you can verify the UI is indeed broken and something is actually going wrong.&lt;/li&gt;
&lt;li&gt;Similar to the above, you want to know whether the display issue is caused by the UI itself or by the data that the application UI is fetching, hence a service API issue.&lt;/li&gt;
&lt;li&gt;Sometimes it is a problem with a complex API server flow. A click in your application portal was expected to perform a data change, transformation, or API request to your backend services, but it may not have been sent with the appropriate data. As a product engineer, you would like to know that the correct request data is being sent in the request to your service API.&lt;/li&gt;
&lt;li&gt;As a system admin, multiple third party systems are interacting with each other and something™ isn't working, and because you are a great collaborator, even though it isn't your responsibility, you want to help out your customers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, this list isn't exhaustive, but already you can start to see that while user impersonation might seem useful for these concrete problems, none of them actually requires it to debug. The root causes often fall into at least one of these categories:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;This is a UI component display issue.&lt;/li&gt;
&lt;li&gt;An unexpected request is being sent or isn't sent to your service API from your application portal.&lt;/li&gt;
&lt;li&gt;The wrong data is being sent in the request from your application UI to your API.&lt;/li&gt;
&lt;li&gt;It is a &lt;code&gt;READ&lt;/code&gt; permissions data issue for the user.&lt;/li&gt;
&lt;li&gt;It is a &lt;code&gt;WRITE&lt;/code&gt; permissions data issue for the user.&lt;/li&gt;
&lt;li&gt;It is a multi-system problem and not an access issue, and a duplicated environment that exactly matches current production is what you need to continue debugging.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Note: none of these root causes comes close to needing user impersonation; each has straightforward alternatives that are both secure and frequently simpler to implement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Supported libraries
&lt;/h2&gt;

&lt;p&gt;Fundamentally, &lt;strong&gt;user impersonation&lt;/strong&gt; is insecure by design; we'll see why in a moment. There are much better ways to provide insight into your specific scenario that actually take security into account. But let's assume that we do implement user impersonation. Is there help available for us in our favorite overengineered framework?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/ankane/pretender" rel="noopener noreferrer"&gt;Ruby - Rails pretender&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/django-hijack/django-hijack" rel="noopener noreferrer"&gt;Python - Django hijack&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.npmjs.com/package/express-user-impersonation" rel="noopener noreferrer"&gt;Nodejs - Express/Passport impersonate&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Insert your favorite monolithic HTTP Framework here&lt;/strong&gt; ➤ Deprecated Solution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What's interesting is that in doing the research to actually find existing implementations, 86% of the repos and links I found:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No longer exist, and haven't existed for quite some time&lt;/li&gt;
&lt;li&gt;Were archived over 5 years ago&lt;/li&gt;
&lt;li&gt;Have less than 10 stars on GitHub&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even if people are trying to make this happen, the tools don't exist to ensure that we are doing it correctly and safely. The results of this search tell us something. Even more surprising is that most of the Auth SaaS solutions don't offer this either. As it turns out, either no one really cares that much, or it is next to impossible to get right, such that no solution can exist. Well, that can't be right.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dangers of user impersonation
&lt;/h2&gt;

&lt;p&gt;Let's assume for a moment that the collective wisdom is correct, and no solutions exist because it is dangerous. What exactly are those dangers? To help convey these issues, say that we managed to get one of these legacy packages above actually working with our system, the first problem that we'll run into is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Who actually has access to perform this User Impersonation in the first place? Who are our admins?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  1. Defining the admins
&lt;/h3&gt;

&lt;p&gt;Of course, allowing everyone to impersonate one another basically means our authentication provides no value. We might as well let users enter whatever username they like on every post they make. Realistically, we want to restrict this list to those for whom it actually makes sense to have the ultimate &lt;code&gt;su&lt;/code&gt; privilege.&lt;/p&gt;

&lt;p&gt;Figuring out who the admins should be and maintaining access to that closely guarded endpoint that grants user impersonation is a common problem that eludes even the most sophisticated companies. The most notorious examples of getting this wrong were the &lt;a href="https://en.wikipedia.org/wiki/2020_Twitter_account_hijacking" rel="noopener noreferrer"&gt;Twitter 2020 admin tools hack&lt;/a&gt; and the &lt;a href="https://msrc.microsoft.com/blog/2023/07/microsoft-mitigates-china-based-threat-actor-storm-0558-targeting-of-customer-email/" rel="noopener noreferrer"&gt;Microsoft Storm-0558&lt;/a&gt; breaches. Attackers were able to compromise admin-level account tools, and use them to steal and impersonate actual users. Historically, one of these companies had paid significant attention to its own internal security, was, if not the first, among the first to introduce the notion of public social logins, and was no stranger to the issues at hand; the other was Microsoft.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenge 1: Maintaining both the admin list, and correctly securing the endpoint to allow impersonation in the first place.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The implementation
&lt;/h3&gt;

&lt;p&gt;The next issue regarding impersonation becomes apparent when we start to question how it can even work in practice. &lt;em&gt;In theory, practice is the same as theory; in practice, it is not.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Once an admin is authorized to impersonate a user, what exactly is happening in our platform? Let's flash back to &lt;a href="https://authress.io/knowledge-base/academy/topics/implementating-user-login" rel="noopener noreferrer"&gt;Authentication&lt;/a&gt;. In order to secure your system, to ensure the right users have access to the right data at the right time, your users must send a session cookie or session token on every request, from which your API can verify that the user is logged in. This could be a completely opaque GUID that represents some data in your database (a reference token) or a more secure, stateless JWT. In any case, your system identifies users via your &lt;a href="https://authress.io/knowledge-base/academy/topics/implementating-user-login" rel="noopener noreferrer"&gt;Authentication Strategy&lt;/a&gt;, and at the end of the day identification comes down to a single property in a single object somewhere. An example could be the JWT &lt;code&gt;subject claim&lt;/code&gt; property:&lt;/p&gt;

&lt;p&gt;User user_001 JWT access token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;iss&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://login.authress.io&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

        &lt;span class="c1"&gt;// highlight&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sub&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user_001&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;// highlight&lt;/span&gt;

        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;iat&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1685021390&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;exp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1685107790&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;scope&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openid profile email&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In OAuth/OpenID, the &lt;code&gt;sub&lt;/code&gt; claim in a JWT represents the &lt;strong&gt;User ID&lt;/strong&gt;. Thus this particular token represents a verified user with the identity &lt;code&gt;user_001&lt;/code&gt;. Anyone that holds this token now has the ability to impersonate this user. Hopefully, you have some logging in place to identify when a user is being impersonated and who actually started the impersonation process. But how do we actually impersonate this user?&lt;/p&gt;
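&lt;p&gt;To make that concrete, here is a minimal sketch of how a service reads the &lt;code&gt;sub&lt;/code&gt; claim: decode the payload segment of the JWT. This is decoding only; a real service must also verify the token's signature against the issuer's public keys before trusting any claim:&lt;/p&gt;

```javascript
// Sketch: extract the payload claims from a JWT.
// Decoding only: a real service MUST also verify the signature
// before trusting any claim in the token.
function decodeJwtClaims(token) {
  const payloadSegment = token.split('.')[1];
  const json = Buffer.from(payloadSegment, 'base64url').toString('utf8');
  return JSON.parse(json);
}

// Build an unsigned, purely illustrative token for the example user.
const claims = { iss: 'https://login.authress.io', sub: 'user_001' };
const header = Buffer.from(JSON.stringify({ alg: 'none' })).toString('base64url');
const payload = Buffer.from(JSON.stringify(claims)).toString('base64url');
const exampleToken = header + '.' + payload + '.';

// The service identifies the caller entirely by the sub claim.
const userId = decodeJwtClaims(exampleToken).sub;
```

&lt;p&gt;Whoever can produce a token whose &lt;code&gt;sub&lt;/code&gt; decodes to &lt;code&gt;user_001&lt;/code&gt; (and passes signature verification) &lt;em&gt;is&lt;/em&gt; that user as far as every downstream service is concerned.&lt;/p&gt;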

&lt;p&gt;Well of course, I need to convert a token that represents my admin user into a token that represents the user I want to impersonate. This would be an example of the token that I have right now.&lt;/p&gt;

&lt;p&gt;My admin user token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;iss&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://login.authress.io&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

        &lt;span class="c1"&gt;// highlight&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sub&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;me_admin&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;// highlight&lt;/span&gt;

        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;iat&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1685021390&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;exp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1685107790&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;scope&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openid profile email&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since our system in this scenario uses the &lt;code&gt;sub&lt;/code&gt; property to determine which user is accessing the system, I of course need a token that replaces the current value of the &lt;code&gt;sub&lt;/code&gt;, which is &lt;code&gt;me_admin&lt;/code&gt; for me, with one that contains the &lt;code&gt;sub&lt;/code&gt; of &lt;code&gt;user_001&lt;/code&gt;. So when I impersonate the user, the result &lt;strong&gt;must be a token&lt;/strong&gt; that looks exactly like the user token:&lt;/p&gt;

&lt;p&gt;User token generated by the admin:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;iss&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://login.authress.io&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

        &lt;span class="c1"&gt;// highlight&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sub&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user_001&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;// highlight&lt;/span&gt;

        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;iat&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1685021390&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;exp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1685107790&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;scope&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openid profile email&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Some of the http/auth frameworks have thought a whole two seconds longer than the rest and might have decided to add an additional property to indicate that the token was created through the process of impersonation by an admin instead of directly by the user:&lt;/p&gt;

&lt;p&gt;User token generated by the admin with magic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;iss&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://login.authress.io&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sub&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user_001&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

        &lt;span class="c1"&gt;// highlight&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;generated_by&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;me_admin&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;// highlight&lt;/span&gt;

        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;iat&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1685021390&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;exp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1685107790&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;scope&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openid profile email&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And this might even seem like a good idea; however, in practice it creates a &lt;a href="https://authress.io/knowledge-base/articles/2025/01/03/bliss-security-framework" rel="noopener noreferrer"&gt;Pit of Failure&lt;/a&gt;. Enabling admins to create new tokens that identify another user causes two distinct problems.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The first issue is that one admin user can impersonate another admin user. And that second admin user might be one that potentially has more access and is authorized for more sensitive information. This means that it isn't so straightforward to just add in impersonation and assume that everything will just work out. Our &lt;strong&gt;List of Admins&lt;/strong&gt; can no longer just be a list of admins; it must also contain some hierarchical order of who can impersonate whom. If you've been following along, this looks a lot like what &lt;a href="https://authress.io/knowledge-base/docs/category/authorization" rel="noopener noreferrer"&gt;Authress Authorization&lt;/a&gt; provides. Of course you don't absolutely have to have that, but if you don't, then you've sacrificed some security.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The second issue is that not every application you have might be interested in allowing users to be impersonated. Any mature system, and even most early software ventures, has some data that you are even less interested in exposing than the rest. Data that is sensitive by nature or regulated fits this picture. This could be Personally Identifiable Information (PII), credit cards (PCI-DSS), or really anything that has been regulated in your locality by governing bodies. You might breach this through user impersonation if, for instance, your support engineer is in a different &lt;a href="https://authress.io/knowledge-base/docs/authentication/user-authentication/selecting-data-residencies" rel="noopener noreferrer"&gt;Data Residency&lt;/a&gt; than the user. For example, when attempting to debug issues in a UI, almost never is the &lt;strong&gt;Date Of Birth (DOB)&lt;/strong&gt; of the user absolutely necessary to be shown on the screen. Sure, it is relevant in some user use cases, but in most debugging scenarios it is not.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;If your authentication depends on the property &lt;code&gt;sub&lt;/code&gt; in the JWT, then an application cannot opt out of user impersonation. Since you are changing the &lt;code&gt;sub&lt;/code&gt; to be the impersonated user, every application will see the new &lt;code&gt;sub&lt;/code&gt; value, even if they do not want to support user impersonation. &lt;strong&gt;Strike 1.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;All applications are forcibly opted in. If an application wants to opt out, then the second claim &lt;code&gt;generated_by&lt;/code&gt;, or its respective implementation, is required. But still, every application starts opted in. That means when you design a new application, you have to remember to opt admins out of accessing user data in that application: "data is insecure by default, unless explicitly designed otherwise". This is the pit of failure; a pit of success would be opt-in: data is secured by default, unless otherwise excluded. &lt;strong&gt;Strike 2.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
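To make the opt-out burden concrete, here is a minimal sketch of the check every service would need in order to refuse impersonated tokens. The `generated_by` claim name follows the discussion above; it is not a standard JWT claim, and this is illustrative rather than a recommended design:

```javascript
// Every service that wants to opt out must remember to add this check;
// any service that omits it is silently opted in to impersonation.
function assertNotImpersonated(jwtClaims) {
  if (jwtClaims.generated_by && jwtClaims.generated_by !== jwtClaims.sub) {
    throw new Error('This service does not accept impersonated tokens');
  }
  return jwtClaims.sub;
}
```

The danger is exactly that this check is something you must remember to write; forgetting it fails open, not closed.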

&lt;blockquote&gt;
&lt;p&gt;A quick call-out is worthwhile on how to secure data like a user's DOB. UIs don't need this information in most cases. The screens and activities where the DOB is valuable actually care whether the user &lt;code&gt;isBornInJanuary&lt;/code&gt; or &lt;code&gt;isOlderThan18&lt;/code&gt;, not the actual date of birth of the user. Unless of course this is the user's DOB entry screen, in which case this component rarely needs to be validated by a support engineer; and if you believe that user impersonation is necessary to help validate the user's DOB entry screen, this article isn't going to be of any help to you.&lt;/p&gt;
&lt;/blockquote&gt;
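The call-out above can be sketched as a server-side function that derives exactly the booleans a screen needs and never returns the raw DOB; the function name and claim names are illustrative:

```javascript
// The raw date of birth stays server-side; callers only ever receive
// the derived booleans that the UI actually needs.
function deriveDobClaims(dateOfBirth, now = new Date()) {
  const dob = new Date(dateOfBirth);
  const eighteenthBirthday = new Date(dob);
  eighteenthBirthday.setFullYear(dob.getFullYear() + 18);
  return {
    isBornInJanuary: dob.getUTCMonth() === 0,
    isOlderThan18: eighteenthBirthday <= now,
  };
}
```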

&lt;h3&gt;
  
  
  3. Secondary system data leakage
&lt;/h3&gt;

&lt;p&gt;Not only do we need to worry about vulnerabilities in our primary user applications, as well as leaking the data associated with them; now we also need to worry about protecting the secondary systems used to impersonate users AND leaking the data associated with them as well. Internal systems, by their very design, usually end up having worse security measures in place because fewer people use them. Fewer users and lower volume mean more quick hacks and less attention given to such an app. In practice, these applications are rarely changed, but frequently break, and most importantly have low priority when it comes to innovation and implementing necessary improvements. They don't end up in your OKR Objectives for this quarter and no one is getting promoted over them.&lt;/p&gt;

&lt;p&gt;We are so concerned that someone will abuse these tools that we ourselves leak user access tokens and data to logging systems. We log so zealously to ensure we have captured the usage of these tools that we end up logging that which we should not. And once we log it, we've probably also exported these logs to some third-party reporting tools. It is a Catch-22: we know we need to log and report on actions taken as an admin impersonating a user, yet doing so means logging data that we would not normally be logging. The goal of preventing security issues creates a new attack surface.&lt;/p&gt;

&lt;p&gt;The result is that these systems will likely end up logging usage of user tokens. That introduces a new attack surface, and because fixing these systems is such a low priority, they are actually &lt;strong&gt;twice as likely to leak user data&lt;/strong&gt; compared to our primary user applications.&lt;/p&gt;
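One mitigation for this Catch-22 is to redact credentials and token-shaped values before any log entry leaves the process. A minimal sketch, with illustrative key names:

```javascript
// Strip credential headers and JWT-shaped strings before entries reach
// the log pipeline (and, transitively, any third-party reporting tools).
const SENSITIVE_KEYS = ['authorization', 'cookie', 'x-api-key'];

function redactForLogging(entry) {
  return Object.fromEntries(
    Object.entries(entry).map(([key, value]) => {
      if (SENSITIVE_KEYS.includes(key.toLowerCase())) {
        return [key, '[REDACTED]'];
      }
      // Redact anything shaped like a JWT: three dot-separated base64url parts.
      if (typeof value === 'string' && /^[\w-]+\.[\w-]+\.[\w-]+$/.test(value)) {
        return [key, '[REDACTED-TOKEN]'];
      }
      return [key, value];
    })
  );
}
```

This does not solve the prioritization problem, but it removes the most dangerous class of leak from the secondary system's logs.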

&lt;h3&gt;
  
  
  4. Corrupted audit trails
&lt;/h3&gt;

&lt;p&gt;Frequently we can conclude a priori that user impersonation is actually wrong. In the debugging scenarios, the last thing you want is the ability to modify the user's data. If you actually needed to modify a user's private data, or one of your customer's account information, you definitely want a dedicated system to handle that. This means you actually don't want to be the user, you don't want to impersonate the user; you just want to see as the user with the explicit caveat of &lt;strong&gt;read-only permissions&lt;/strong&gt;. You only want to see what they see, not be able to modify their data. Modifying user data is guaranteed to happen accidentally if the only way to verify a user-facing UX problem is to completely impersonate a user and get full write access to their account.&lt;/p&gt;

&lt;p&gt;Without thinking about it, the following issues arise when impersonating the user in this context:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Audit trails incorrectly say the user changed data when they did not. ➤ An admin impersonating the user did it.&lt;/li&gt;
&lt;li&gt;The user's sessions may start to include the one generated by the admin. ➤ As a user, it would be an understatement to say they would be concerned if they saw a session in a sensitive account modifying data from a location they are not in.&lt;/li&gt;
&lt;li&gt;Logging data in the applications is incorrectly recorded, or may not be recorded at all. ➤ You may be tempted to hide these admin interactions.&lt;/li&gt;
&lt;li&gt;And lastly, in every case, now we need to alter our systems to be not only aware of how to process the data due to impersonation, but how to log it.  ➤ Impersonation is a virus that starts to infect all of our systems.&lt;/li&gt;
&lt;/ul&gt;
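The audit-trail problems above all stem from recording only one identity per action. A sketch of an audit entry that records the acting admin separately from the affected user; all field names here are hypothetical:

```javascript
// Record both identities: whose data the action affected,
// and who actually performed it.
function buildAuditEntry({ subjectUserId, actorUserId, action, supportTicketId }) {
  return {
    userId: subjectUserId,                    // whose data changed
    action,
    timestamp: new Date().toISOString(),
    // Present only when the actor differs from the subject (impersonation).
    ...(actorUserId !== subjectUserId && { runBy: actorUserId, supportTicketId }),
  };
}
```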

&lt;h2&gt;
  
  
  The practical-ish solutions
&lt;/h2&gt;

&lt;p&gt;If generating a new token that contains the impersonated &lt;strong&gt;User ID&lt;/strong&gt; is so bad, there must be better solutions out there.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution A: Additional token claim property
&lt;/h3&gt;

&lt;p&gt;What if we don't change the subject &lt;code&gt;sub&lt;/code&gt; claim, but instead add a new claim? That way, only those services that understand this claim, and actually want to use it, would choose to use it. Services that don't know about it keep using the unmodified &lt;code&gt;sub&lt;/code&gt; claim. Admins would still look like admins. Only services that care about a new &lt;code&gt;adminIsImpersonatingUserId&lt;/code&gt; claim property would know to use it and how to handle it. This would give you security by default, and only expose the services that have already explicitly designed support for it. You would have to opt in. Success, finally!&lt;/p&gt;

&lt;p&gt;Theoretically this is great, and while it is a bit more secure than altering the subject, in practice, we start to write code that looks like this:&lt;/p&gt;

&lt;p&gt;Resolve User Identity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;resolveUserIdentity&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jwtToken&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;adminIsImpersonatingUserId&lt;/span&gt;
          &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="nx"&gt;jwtToken&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sub&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then that code ends up in a shared library which all our services adopt. So while our intentions were good, reinforcing system loops cause this to be no better than the alternatives. The reason is that we often find the need to optimize our usage across even a small number of services, since many believe code duplication is a bad thing. So the &lt;code&gt;resolveUserIdentity&lt;/code&gt; method leads us to the following pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We change our Auth solution to add the new claim to the JWT during impersonation.&lt;/li&gt;
&lt;li&gt;Only those services that need to care about this add support for it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;At this point we are still 100% secure. But then:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We update some shared libraries that support JWT verification and add the method &lt;code&gt;resolveUserIdentity&lt;/code&gt; to it.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;resolveUserIdentity&lt;/code&gt; method replaces all the existing identity checks and consumes the new claim.&lt;/li&gt;
&lt;li&gt;All existing services get updated to use this shared library, and are exposed to the dangers of impersonation.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A new claim won't help us. This means that now we are back to the same problem, and arguably the &lt;strong&gt;situation is worse&lt;/strong&gt;. Instead of all the services in the platform trusting the standardized &lt;code&gt;sub&lt;/code&gt;, we now maintain a bespoke solution just for our system. This is especially important: the &lt;code&gt;sub&lt;/code&gt; claim is an &lt;code&gt;OAuth&lt;/code&gt; and &lt;code&gt;OpenID&lt;/code&gt; industry standard (&lt;a href="https://datatracker.ietf.org/doc/html/rfc9068" rel="noopener noreferrer"&gt;RFC 9068&lt;/a&gt;), and everyone in the industry is familiar with it. However, just for your system, there is now a new claim which ends up being treated as the canonical &lt;code&gt;sub&lt;/code&gt;, but it is not standard, not self-documenting, unexpected, and unique. Complexity reduces security. &lt;strong&gt;Strike 3.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For more about the systemic issues with a JWT or session token based permission system, permission attenuation is discussed in depth in the &lt;a href="https://authress.io/knowledge-base/academy/topics/offline-attenuation" rel="noopener noreferrer"&gt;token scoping academy topic&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution B: DOM Recording
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;See earlier impersonation use cases.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If we flash back to the original user stories that drove us to implement user impersonation in the first place, we might start to see a pattern emerge. Most of the time the issue is that something is wrong with the User Experience. The user is stuck in some way, the data isn't being displayed correctly, some component is broken.&lt;/p&gt;

&lt;p&gt;All of these are user-facing issues, and issues facing the user purely in the UI. The source of the data, and the security therein, has near-zero value to us in validating the user experience. Attempting to use &lt;strong&gt;expensive&lt;/strong&gt; full user impersonation instead of simple &lt;strong&gt;UI component&lt;/strong&gt; tests is the exact same problem we see when tests are implemented at the wrong level.&lt;/p&gt;

&lt;p&gt;Let's use the Testing Pyramid as an analogy. The canonical testing pyramid is this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ljalpcn93qnjyxhi4xc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ljalpcn93qnjyxhi4xc.png" alt="The Testing Pyramid" width="584" height="493"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At the bottom is our &lt;strong&gt;unit tests&lt;/strong&gt;, those tests are cheap and easy to write, find the most issues, and ensure our system is working without much effort.&lt;/li&gt;
&lt;li&gt;Then come the &lt;strong&gt;service level tests&lt;/strong&gt;. Or in the case of UIs, these are our screen tests. Multiple pieces of functionality and components are combined together in these tests. We don't want many of them; perhaps 10% max of all our tests test full screens or services. Most of the functionality of the service or screen is already validated in the unit tests, i.e. we know that our core functions, as well as buttons, sliders, pickers, etc., all work correctly.&lt;/li&gt;
&lt;li&gt;Now come the 1% &lt;strong&gt;integration or end-to-end tests&lt;/strong&gt;. You almost never want these; only the most critical flows of your application should be validated. When they report a failure, you have no idea what might have caused that particular failure, you just know there is a problem. In the case of an application like a social media platform, the integration test you want is making a new post. (Obviously there is no reason to test the login flow, since your auth provider already has you covered there!)&lt;/li&gt;
&lt;li&gt;At the top of the pyramid is &lt;strong&gt;manual exploratory testing&lt;/strong&gt;. That which cannot be automated, and most importantly needs the intelligence and creativity of a human to identify potential problems in your software application. This is the most expensive and you rarely have an interest in squandering this effort.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The only difference between this and a support case is the context — &lt;strong&gt;the why&lt;/strong&gt;. The services, applications, business logic, and tools that we have at our disposal are all the same. We need to trust that our tests exist to validate the problems we could have. It is always a mistake to invest effort in the top of the pyramid when we lack the assets at the bottom. Likewise, our support pyramid is this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0y7k3oxbzl126nwv8wjx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0y7k3oxbzl126nwv8wjx.png" alt="The Support Pyramid" width="584" height="493"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At the bottom is &lt;strong&gt;application logs&lt;/strong&gt;. There is no sense in attempting to tackle any of the higher layers until you have sufficient application logs that exactly report incoming requests, outgoing responses, unexpected data scenarios, edge cases that aren't completely implemented, and systemic issues.&lt;/li&gt;
&lt;li&gt;Just above that is &lt;strong&gt;documentation&lt;/strong&gt;. This includes expected common flows, uncommon flows, and demos of the more complex-to-use aspects of our application. The biggest benefit of this documentation is that it helps us help our users. I want to repeat that it is more for us than it is for our users. The pyramid exists to inform us what we should do, not how our users should operate.&lt;/li&gt;
&lt;li&gt;The next rung up is &lt;strong&gt;User recordings&lt;/strong&gt;. For users that are having issues, we have concrete recorded data for their flow. The flows would include anything relevant to the application: how they used it, what actions they took. All so we can actually see what happened in context when there is a problem. No one wants to spend any time looking at recordings if they don't have to. It is also very difficult to identify the root cause of problems by reviewing a recording, but having them is indispensable to your support engineers when they need them, when a user has reported an issue. Solutions include &lt;a href="https://posthog.com/" rel="noopener noreferrer"&gt;PostHog&lt;/a&gt;, &lt;a href="https://www.fullstory.com/" rel="noopener noreferrer"&gt;FullStory&lt;/a&gt;, and &lt;a href="https://sentry.io/welcome/" rel="noopener noreferrer"&gt;Sentry&lt;/a&gt;. If you don't have these recordings, then the next best alternative (which is very far behind) is getting a live screencast from the user. These are less useful and more expensive to obtain. Worst of all, they can be and &lt;a href="https://blog.1password.com/okta-incident/" rel="noopener noreferrer"&gt;have been&lt;/a&gt; used to breach sensitive systems.&lt;/li&gt;
&lt;li&gt;At the very top, is of course the thing you never want to have to do, and the topic of this article: &lt;strong&gt;Full user impersonation&lt;/strong&gt;. If everything else fails then at least we have user impersonation left in our toolkit. But this must only be used after we have significantly invested in all the other strategies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Assuming we have tackled the bottom two rungs of the pyramid, the missing next component is the &lt;strong&gt;User recordings&lt;/strong&gt;. If you have those, with the ability to sanitize the data coming from users, then you've got the solution to 99% of all support cases. Having people jump in and impersonate users is just not necessary. And most importantly, if we look at who most often needs to impersonate users, it isn't even the people who should have access to do so.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1sf805fd1q1cce8mql4d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1sf805fd1q1cce8mql4d.png" alt="Danger of impersonation" width="800" height="207"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Revisiting user impersonation
&lt;/h2&gt;

&lt;p&gt;Do you want to see the data or do you want to see what the user sees? In almost every case it is the former, and seeing the data can be done through an admin app. In the rare case that it is the latter, we would need the exact permissions the user has, or some safer strict subset of them. So what's the right way to handle user impersonation in the case that we just can't live without it?&lt;/p&gt;

&lt;p&gt;The most important principle here is &lt;strong&gt;Secure by Default&lt;/strong&gt;. So far a blanket implementation is wrong, and there are too many &lt;a href="https://authress.io/knowledge-base/articles/2025/01/03/bliss-security-framework" rel="noopener noreferrer"&gt;pits of failure&lt;/a&gt; with the JWT, auth session, or reference token based approach.&lt;/p&gt;

&lt;p&gt;Looking at the support engineer use case, our needs would be satisfied if we were to explicitly hand out to the support staff just the &lt;code&gt;read:logs&lt;/code&gt; permission to handle that specific support case. It is quite something else to generate whole valid tokens that contain a subject different from the user requesting them, and give those out to specific people. So as long as we have a system that allows us to provide our team members with explicit permissions to only the exact resources they need, then we have the capability to ensure a secure system that also solves all our use cases.&lt;/p&gt;
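In code, the difference is that identity still comes from the unmodified `sub`, while access is decided by explicit, temporary grants. A minimal sketch with a hypothetical in-memory grant store:

```javascript
// Identity never changes; access is decided by explicit, temporary grants.
function canAccess(grants, userId, permission, resource, now = Date.now()) {
  return grants.some(grant =>
    grant.userId === userId
    && grant.permission === permission
    && grant.resource === resource
    && grant.expiresAt > now          // grants are temporary by construction
  );
}

// A support engineer is granted only `read:logs` for one specific case,
// expiring after an hour.
const grants = [
  { userId: 'support-engineer-1', permission: 'read:logs',
    resource: '/accounts/acc-42/logs', expiresAt: Date.now() + 60 * 60 * 1000 },
];
```

The engineer keeps their own identity in every audit trail, and the grant expires on its own when the support case is over.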

&lt;h2&gt;
  
  
  How Authress supports user impersonation
&lt;/h2&gt;

&lt;p&gt;I want to end this article with a discussion of how Authress solves the top-of-the-pyramid user impersonation story. The caveat here is that it is sometimes a trade-off some companies really want: they absolutely want to sacrifice security and increase vulnerabilities as well as their attack surface by introducing full user impersonation functionality. However, from experience, very few of our customers have anything implemented in this space at all, and those that do have hooked their process into &lt;strong&gt;easy to grant permissions&lt;/strong&gt; through Authress, rather than &lt;strong&gt;full user identity impersonation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The real solution is to actually consider your support team persona when designing features. And this is what Authress optimizes for.&lt;/p&gt;

&lt;p&gt;The flow that we consider the most secure is to explicitly and &lt;strong&gt;temporarily grant your support user persona exactly one small additional set of permissions&lt;/strong&gt; relevant to the support case. When we do this we don't change how we determine identity; we only change the way we determine access. Authress supports this by allowing quick cloning of &lt;a href="https://authress.io/knowledge-base/docs/authorization/access-records" rel="noopener noreferrer"&gt;User Based Access Records&lt;/a&gt;, which represent the permissions a user has. Since cloning is dynamic, a temporary access record can be created that only contains the &lt;code&gt;READ&lt;/code&gt; equivalent roles that the user has. In most cases, you can just directly assign your support engineers to an &lt;a href="https://authress.io/app/#/settings?focus=groups" rel="noopener noreferrer"&gt;Authress Permission Group&lt;/a&gt; with &lt;code&gt;READ ✶&lt;/code&gt; access, and never need to touch permissions again.&lt;/p&gt;

&lt;p&gt;Here is an example cloned access record, where the support engineer received just the &lt;strong&gt;Viewer&lt;/strong&gt; Role to all organizations so that documents and users could be &lt;code&gt;Read&lt;/code&gt; not &lt;code&gt;Updated&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv7no9dlsxhjp6ymi8uh0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv7no9dlsxhjp6ymi8uh0.png" alt="Access record example" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;
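In object form, such a cloned record might look roughly like the sketch below. It is loosely modeled on Authress access records, but the field and role names here are illustrative rather than the authoritative API:

```javascript
// Illustrative shape of a temporary, read-only clone of a user's access,
// assigned to the support engineer rather than impersonating the user.
const clonedAccessRecord = {
  name: 'Support case T-123: read-only clone of user-1 access',
  users: [{ userId: 'support-engineer-1' }],
  statements: [{
    roles: ['Viewer'],                          // READ-equivalent role only
    resources: [{ resourceUri: '/organizations/*' }],
  }],
};
```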

&lt;h2&gt;
  
  
  The firehose recommendations
&lt;/h2&gt;

&lt;p&gt;In case you want to ignore the advice of this academy article, and introduce user impersonation instead of using Authress permissions to drive access control as recommended, I do want to include recommendations that will help reduce the impact of security and compliance issues related to user impersonation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do not hide user impersonation; it will be tempting to obscure its usage from your customers. Instead make sure it is visible and clear for everyone, especially your customers. I know you don't want them to know, but they should know, and they may even need to know, especially if something goes wrong.&lt;/li&gt;
&lt;li&gt;Make sure all actions are recorded in an audit trail, both those by the admin who impersonated the user and those by the application user. &lt;strong&gt;Especially the admin&lt;/strong&gt;. There will definitely be questions related to the "last person that touched this" and of course "it was working before your team looked at it". You will need a way to be confident in your response to your customers when it wasn't an admin that touched it last.&lt;/li&gt;
&lt;li&gt;If you're operating in any high-security environment (FedRAMP, ITAR, or the like), always require a customer user action before the support engineer gets access to the account data. Some prominent cloud providers believe that an email with the user agreeing is sufficient for this. I'm here to say it is not sufficient. Often the people who can create support cases do not and should not have admin access to the customer account to view all the data. Someone without the customer admin role should not be able to grant your support engineering staff access to sensitive data in the account. &lt;strong&gt;You need an admin to click a button.&lt;/strong&gt; This is usually done through a &lt;a href="https://authress.io/knowledge-base/docs/advanced/step-up-authorization#3-make-the-authorization-request" rel="noopener noreferrer"&gt;Step-Up Authorization Request&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Impersonation can be valuable in some environments, but it is often completely useless in others. Especially in spaces with regulatory requirements, it's much better to diagnose issues from outside the impacted account, either through data replication or a permissions-based approach.&lt;/li&gt;
&lt;li&gt;Ensure your impersonation logic is completely tested. There should be no better tested piece of functionality in your software system.&lt;/li&gt;
&lt;li&gt;Audit trails should always keep a "This was run-by User X" annotation on audit records, not just the user ID, but any additional information from the admin. Our recommendation is both the &lt;code&gt;Admin User ID&lt;/code&gt; and the &lt;code&gt;Support Ticket ID&lt;/code&gt;, on every log statement.&lt;/li&gt;
&lt;li&gt;Start with your customer expectations. What sort of transparency do they explicitly expect? Do not guess. Err on the side of overcommunicating, rather than under.&lt;/li&gt;
&lt;li&gt;Please revisit doing this in the first place if you don't have the capacity to have a dedicated team accountable for this functionality. Often this will involve your legal team when it doesn't go right.&lt;/li&gt;
&lt;li&gt;When (not if) credentials leak, who leaked those credentials? Was it your customer, or was it through your admin application, or one of your support engineers? Always be able to tell where those credentials came from, so that you can respond to the compromise as effectively as possible.&lt;/li&gt;
&lt;li&gt;If you want to start anywhere, go back and invest in your admin/support tools so that they can expose the data that you need, rather than focusing on user impersonation. If those tools are insufficient check back at the Support Engineer Pyramid again.&lt;/li&gt;
&lt;/ul&gt;
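The run-by annotation recommendation above can be sketched as a logger wrapper that stamps every statement with the impersonation context; all names are illustrative:

```javascript
// Wrap a base logger so every statement during an impersonation session
// carries the Admin User ID and Support Ticket ID automatically.
function createSupportLogger(baseLogger, { adminUserId, supportTicketId }) {
  return {
    log(message, fields = {}) {
      baseLogger.log({ message, adminUserId, supportTicketId, ...fields });
    },
  };
}
```

Creating the wrapper once per support session means no individual log statement can forget the annotation.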

&lt;p&gt;For help understanding this article or how you can implement a solution like this one in your services, feel free to reach out to the &lt;a href="https://authress.io/app/#/support" rel="noopener noreferrer"&gt;Authress development team&lt;/a&gt; or follow along in the &lt;a href="https://authress.io/knowledge-base" rel="noopener noreferrer"&gt;Authress documentation&lt;/a&gt; and join our community:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://authress.io/community" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;Join the community&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>authentication</category>
      <category>authorization</category>
      <category>identity</category>
      <category>security</category>
    </item>
    <item>
      <title>Migrating CloudFormation to TF</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Tue, 21 Jan 2025 13:56:46 +0000</pubDate>
      <link>https://dev.to/aws-builders/migrating-cloudformation-to-tf-bo9</link>
      <guid>https://dev.to/aws-builders/migrating-cloudformation-to-tf-bo9</guid>
      <description>&lt;p&gt;One day you might find yourself in the unfortunate position of wanting to migrate away from &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html" rel="noopener noreferrer"&gt;CloudFormation (CFN)&lt;/a&gt;. While some may say that CFN is bad and should never be used. I can confirm that it is still better than:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CloudFormation CDK&lt;/li&gt;
&lt;li&gt;AWS SAM&lt;/li&gt;
&lt;li&gt;Serverless - Not "serverless", but the company that is abusing this name.&lt;/li&gt;
&lt;li&gt;SST&lt;/li&gt;
&lt;li&gt;And many others&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The truth is: CloudFormation isn't bad. However, like most things, it becomes bad when you find out your current solution doesn't support the thing that you want it to support.&lt;/p&gt;

&lt;p&gt;So back to the problem... You want to migrate from CloudFormation to OpenTofu (since no one uses Terraform anymore after their legal scandal), and that involves an actual migration.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Migration
&lt;/h2&gt;

&lt;p&gt;Migrations are &lt;strong&gt;technically&lt;/strong&gt; easy. Monolith to microservices, event buses to REST, MSSQL to NoSQL DynamoDB. The hard part is always the non-technical part. The part where you figure out what you want, now that's the problem. Unless of course you have a monolith, because you should just give up now. No one successfully converts from a Monolith to microservices. They write some code, complain a lot, then apply for a new job at a new company telling their would-be manager "Look how I helped this company migrate to microservices. I'm Great!"&lt;/p&gt;

&lt;p&gt;But this isn't a story about how monoliths are bad, it is about how to migrate your &lt;strong&gt;Infrastructure as Code&lt;/strong&gt; (IaC) solution.&lt;/p&gt;

&lt;p&gt;Realistically, you have to painstakingly generate the new IaC HCL files for OpenTofu. You have existing CloudFormation as well as the real live version of your infrastructure currently supporting a massive business. And if you are like us at &lt;a href="https://authress.io" rel="noopener noreferrer"&gt;Authress&lt;/a&gt;, you might also have a &lt;a href="https://authress.io/knowledge-base/articles/2024/09/04/aws-ensuring-reliability-of-authress" rel="noopener noreferrer"&gt;99.999% uptime SLA&lt;/a&gt; you need to account for.&lt;/p&gt;

&lt;p&gt;If you have 100+ CFN stacks, you probably don't want to import these resources into OpenTofu by hand. Instead, you'll want some sort of tool to do this, and there are a bunch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://former2.com/" rel="noopener noreferrer"&gt;Former2&lt;/a&gt; - Export from AWS to HCL.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.firefly.ai/blog/cloudformation-to-terraform-migration" rel="noopener noreferrer"&gt;Firefly.ai&lt;/a&gt; - AI in the company name, yuck&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/DontShaveTheYak/cf2tf" rel="noopener noreferrer"&gt;CF2TF&lt;/a&gt; - Open source converter&lt;/li&gt;
&lt;li&gt;Doing it by hand to verify you have everything you need.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And there are still more... You could even try one of the LLMs out there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Generating the configuration
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://developer.hashicorp.com/terraform/language/import/generating-configuration" rel="noopener noreferrer"&gt;Terraform&lt;/a&gt; and &lt;a href="https://opentofu.org/docs/language/import/generating-configuration/" rel="noopener noreferrer"&gt;OpenTofu&lt;/a&gt; actually support configuration generation out of the gate as well, so we will use their strategy here, and if you want to use one of the less great ones from above, you do you!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Add the &lt;code&gt;import&lt;/code&gt; block:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_instance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;example&lt;/span&gt;
  &lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"foo"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Run the configuration generation command:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tf plan -generate-config-out=generated.tf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_instance"&lt;/span&gt; &lt;span class="s2"&gt;"example"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:ec2:eu-west-1:1234567890:instance/i-00deadc0de"&lt;/span&gt;
  &lt;span class="nx"&gt;ami&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ami-000a4d9c6067d5d0d"&lt;/span&gt;
  &lt;span class="nx"&gt;instance_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"t3.micro"&lt;/span&gt;
  &lt;span class="p"&gt;...&lt;/span&gt;   
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Commit the new configuration:&lt;/strong&gt;&lt;br&gt;
Add the configuration to your files, and &lt;code&gt;git commit&lt;/code&gt; to your IaC repository.&lt;/p&gt;
&lt;h2&gt;
  
  
  Running the migration
&lt;/h2&gt;

&lt;p&gt;Once we have all of those generated we just need to run &lt;code&gt;tf plan&lt;/code&gt;, &lt;code&gt;tf apply&lt;/code&gt;, and then delete the &lt;code&gt;import&lt;/code&gt; statements.&lt;/p&gt;

&lt;p&gt;And you are done!&lt;/p&gt;
&lt;h2&gt;
  
  
  Cleanup
&lt;/h2&gt;

&lt;p&gt;The one thing that no one tells you at this point is that you aren't done. Importing the resources and having the committed IaC HCL does not mean you are done. If you are like me, then you care that you still have 100s of CFN stacks deployed in your AWS accounts. Maybe these stacks all have &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/detect-drift-stack.html" rel="noopener noreferrer"&gt;CFN Drift&lt;/a&gt; and don't even represent the current state of the world anymore.&lt;/p&gt;

&lt;p&gt;However, even if they do represent the current state, you probably don't want someone going into your account and accidentally updating or deleting those. Or your desire to have a pristine account compels you to delete these stacks. You probably wouldn't be someone working on this problem in the first place if you didn't care that these old stacks are still here.&lt;/p&gt;

&lt;p&gt;The problem is that there is no way to delete a stack without also deleting the resources in that stack. And of course, you want to keep the resources in those stacks, so that's a conundrum. Thankfully, I've figured out a hack to get around this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqy41ryjsvieb4b6tdax.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqy41ryjsvieb4b6tdax.gif" alt="Warren disappears due to the creation of a hack" width="276" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This involves utilizing three features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the &lt;code&gt;delete_failed&lt;/code&gt; status&lt;/li&gt;
&lt;li&gt;the &lt;code&gt;FORCE_DELETE_STACK&lt;/code&gt; deletion mode&lt;/li&gt;
&lt;li&gt;the CloudFormation execution Role ARN&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;delete_failed&lt;/code&gt; status occurs whenever CFN tries to delete a resource that it believes is no longer necessary, but the resource is either in use &lt;strong&gt;OR CFN doesn't have access to delete the resource.&lt;/strong&gt; Take note of this second one.&lt;/p&gt;

&lt;p&gt;Second, when a stack is in the &lt;code&gt;delete_failed&lt;/code&gt; status, you are allowed to force delete the stack and retain explicit resources that you might still be using.&lt;/p&gt;

&lt;p&gt;So all we need to do is get the stack into the &lt;code&gt;delete_failed&lt;/code&gt; state, and then ask CFN to retain all the resources.&lt;/p&gt;

&lt;p&gt;CloudFormation allows you, for "security reasons", to specify a role ARN to execute CFN with. When you do, the stack changes will only be executed with that role. So we'll define a new role that does &lt;strong&gt;not have access&lt;/strong&gt; to anything, and abuse the Role ARN property to force CFN to fail to delete any resources, and thus fail to delete the stack.&lt;/p&gt;
&lt;h2&gt;
  
  
  Cleanup Execution
&lt;/h2&gt;

&lt;p&gt;Create the Role:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;CfnDeleteStackRole&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::IAM::Role&lt;/span&gt;
     &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
       &lt;span class="na"&gt;RoleName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cfn-delete-stack-role&lt;/span&gt;
       &lt;span class="na"&gt;AssumeRolePolicyDocument&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
         &lt;span class="na"&gt;Version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2012-10-17'&lt;/span&gt;
         &lt;span class="na"&gt;Statement&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
           &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Allow&lt;/span&gt;
             &lt;span class="na"&gt;Principal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
               &lt;span class="na"&gt;Service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cloudformation.amazonaws.com&lt;/span&gt;
             &lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sts:AssumeRole&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With that role, we'll call the Delete Stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws cloudformation delete-stack \
    --stack-name my-stack \
    --role-arn arn:aws:iam::account:role/cfn-delete-stack-role
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This execution call &lt;strong&gt;will fail&lt;/strong&gt;, but we knew that was going to happen. The failure puts the stack into the &lt;code&gt;delete_failed&lt;/code&gt; status.&lt;/p&gt;

&lt;p&gt;Finally, we can execute the delete again, utilizing the force deletion parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws cloudformation delete-stack \
    --stack-name my-stack \
    --role-arn arn:aws:iam::account:role/cfn-delete-stack-role \
    --deletion-mode FORCE_DELETE_STACK
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Depending on the resources in your stack, or if you want extra security to prevent deleting your precious resources, you can add the &lt;code&gt;--retain-resources&lt;/code&gt; flag to the CLI command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws cloudformation delete-stack \
    --stack-name my-stack \
    --role-arn arn:aws:iam::account:role/cfn-delete-stack-role \
    --deletion-mode FORCE_DELETE_STACK \
    --retain-resources $LOGICAL_RESOURCES_LIST
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;$LOGICAL_RESOURCES_LIST&lt;/code&gt; set to the space-separated list of logical resource IDs from the CFN template, which can be generated like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cfnTemplateFile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./cfn-template.json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cfnTemplate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cfnTemplateFile&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;resourceKeys&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cfnTemplate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Resources&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;,&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;resourceKeys&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Use resourceKeys as $LOGICAL_RESOURCES_LIST&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Repeat this for every CFN stack in every region in every AWS account in your org, and everything will be cleaned up, just the way you wanted.&lt;/p&gt;




&lt;p&gt;Curious about this and want to discuss it more? Join my community and chat with me:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rhosys.ch/community" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;Join the community&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>aws</category>
      <category>infrastructureascode</category>
      <category>devops</category>
    </item>
    <item>
      <title>Are millions of accounts vulnerable due to Google's OAuth Flaw?</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Wed, 15 Jan 2025 17:01:30 +0000</pubDate>
      <link>https://dev.to/authress/are-millions-of-accounts-vulnerable-due-to-googles-oauth-flaw-33f</link>
      <guid>https://dev.to/authress/are-millions-of-accounts-vulnerable-due-to-googles-oauth-flaw-33f</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This article is a rebuttal to &lt;a href="https://trufflesecurity.com/blog/millions-at-risk-due-to-google-s-oauth-flaw" rel="noopener noreferrer"&gt;Truffle Security's&lt;/a&gt; post on &lt;a href="https://trufflesecurity.com/blog/millions-at-risk-due-to-google-s-oauth-flaw" rel="noopener noreferrer"&gt;Millions of Accounts Vulnerable due to Google's OAuth Flaw&lt;/a&gt;. (&lt;em&gt;&lt;a href="https://authress.io/knowledge-base/assets/files/truffle-security-google-oauth-vulnerability-19b387e9c84f8ccfe621c0301c2a19d8.pdf" rel="noopener noreferrer"&gt;Alt link&lt;/a&gt;&lt;/em&gt;) Even more ridiculous might be that their post got picked up by no small number of news outlets that all should be ashamed of themselves, far too many to actually link in this post.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Are millions of accounts vulnerable due to Google's OAuth Flaw?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In a true &lt;a href="https://en.wikipedia.org/wiki/Betteridge%27s_law_of_headlines" rel="noopener noreferrer"&gt;Betteridge's law of headlines&lt;/a&gt; fashion, the answer is a resounding &lt;strong&gt;No&lt;/strong&gt;. Which explains why Google ignored this vulnerability in the first place:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8shcza2985fajtll95vh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8shcza2985fajtll95vh.png" alt="Google Workspace response" width="800" height="261"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The TL;DR of the source article claims that due to the nature of how Google OAuth works, &lt;strong&gt;"Millions of Americans' data and accounts remain vulnerable"&lt;/strong&gt;. It relies on the nature of Domain Ownership.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Claim
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Google’s OAuth login doesn’t protect against someone purchasing a failed startup’s domain and using it to re-create email accounts for former employees.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Domains are the root of trust* for many businesses. At Authress we rely on &lt;code&gt;authress.io&lt;/code&gt; to establish trust with our customers, just as at your business you rely on your domains for your customers. This is "Root of Trust" with an asterisk because in reality the root of trust lies with the domain authority, the domain registrar, and the issuer of your TLS certificates for HTTPS encryption. But that is outside of the scope of this article.&lt;/p&gt;

&lt;p&gt;The claim in the original article is that it is OAuth and specifically Google's OAuth that is at fault and nothing else. And that somehow domain ownership is linked to the exposure of customer data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;Gaining access to your trusted domain is one way in which attackers attempt to circumvent your security strategy and compromise your users. If malicious attackers can utilize your domain to trick your users, then they can impersonate your business and steal personal information, bank accounts, and credit card numbers. This is why phishing is so popular today: actually compromising a domain is incredibly hard, usually requiring something like a &lt;a href="https://www.cloudflare.com/learning/dns/dns-cache-poisoning/" rel="noopener noreferrer"&gt;DNS Poisoning attack&lt;/a&gt;, so the next best thing is to purchase alternative domains that look and feel like the valid one (&lt;a href="https://www.zscaler.com/blogs/security-research/phishing-typosquatting-and-brand-impersonation-trends-and-tactics" rel="noopener noreferrer"&gt;Typosquatting&lt;/a&gt;). These facsimiles exist for exactly that reason.&lt;/p&gt;

&lt;p&gt;Besides using separate domains, attackers will often also attempt &lt;a href="https://developer.mozilla.org/en-US/docs/Web/Security/Subdomain_takeovers" rel="noopener noreferrer"&gt;Subdomain takeovers&lt;/a&gt;, which are a blend of domain compromise and using an alternative domain.&lt;/p&gt;

&lt;p&gt;However, in this case, attackers cleverly will attempt to use your existing corporate domain after you believe you are done with it. The expected flow involving Google Workspace's OAuth looks something like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You buy a domain for your company, let's call it &lt;code&gt;yourcompany.com&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;You sign up for an Employee Identity Solution (IdP) that provides OAuth; there are actually many solutions here: Google Workspace, &lt;a href="https://okta.com/" rel="noopener noreferrer"&gt;Okta&lt;/a&gt;, &lt;a href="https://www.microsoft.com/en-us/security/business/identity-access/microsoft-entra-id" rel="noopener noreferrer"&gt;Microsoft Entra ID&lt;/a&gt;, &lt;a href="https://www.pingidentity.com/en/resources/blog/post/okta-vs-ping-best-iam-digital-security.html" rel="noopener noreferrer"&gt;Ping Identity&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Then your employees use that identity solution to sign in to a third-party product such as Stripe, AWS, PostHog, etc...&lt;/li&gt;
&lt;li&gt;Lastly, you give critical, business-sensitive data to that product, like your pets' birthdays.&lt;/li&gt;
&lt;li&gt;That third-party application saves the data, because they like data very much.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzrev26ltr376vgehegjt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzrev26ltr376vgehegjt.png" alt="Corporate Login Flow" width="800" height="612"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Identity
&lt;/h2&gt;

&lt;p&gt;When you log into your favorite third party application, there needs to be an identifier sent from the Employee Identity Solution to that third party. The Third Party trusts your chosen identity solution as well as that identifier. Here is an example token generated by Google Workspace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"iss"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://accounts.google.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"sub"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"210169484474386"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"iat"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1736946817"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"exp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1736996817"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;

      &lt;/span&gt;&lt;span class="nl"&gt;"email"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"warren@yourcompany.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"hd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"yourcompany.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Warren Parad"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"given_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Warren"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"family_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Parad"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"locale"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"en"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The identifier in the token is the &lt;code&gt;sub&lt;/code&gt; claim with the value &lt;code&gt;210169484474386&lt;/code&gt;. This is my User ID. (Note: this is not actually my user ID; I made it up for the purposes of this post, so feel free to do with it as you wish.)&lt;/p&gt;

&lt;p&gt;Your third party application uses this &lt;code&gt;sub&lt;/code&gt; property to uniquely identify you, and then authorize you to your company's sensitive cat photos.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Vulnerability
&lt;/h2&gt;

&lt;p&gt;Now, imagine that you close your Google Workspace account because your company goes bankrupt (this frequently happens, because as much as we want to believe companies are successful through hard work, the &lt;a href="https://www.youtube.com/watch?v=3LopI4YeC4I" rel="noopener noreferrer"&gt;truth is that it is actually luck&lt;/a&gt;). Along with your Google Workspace account, your domain &lt;code&gt;yourcompany.com&lt;/code&gt; will likely expire, unless you harbor some secret prayer that one day you will be able to sell it instead of letting it expire worthless. Let's assume the yourcompany.com domain is now available for anyone to purchase. By purchasing that domain, an attacker can create a new Google Workspace account, in hopes of gaining access to those exact same third parties you had used for your business.&lt;/p&gt;

&lt;p&gt;This actually isn't even the first time something like this has been attempted, and frequently it works due to hard-coded assumptions in many applications. In a cruel twist of fate, here is a great example of being able to compromise the attackers themselves, because they had used an application which relied on &lt;a href="https://labs.watchtowr.com/more-governments-backdoors-in-your-backdoors/" rel="noopener noreferrer"&gt;expired trusted malicious domains&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This actually doesn't happen with Google OAuth. When you close the Google Workspace account, the &lt;code&gt;User ID&lt;/code&gt; with the value &lt;code&gt;210169484474386&lt;/code&gt; ceases to exist. This is what Google is confirming by closing the original bug report. An attacker recreating the Google Workspace account is unable to generate the same &lt;code&gt;sub&lt;/code&gt; again. So even if an attacker created a new Google Workspace from the expired and unclaimed domain &lt;code&gt;yourcompany.com&lt;/code&gt;, the &lt;code&gt;sub&lt;/code&gt; would be different and your third-party application would reject access.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's the problem?
&lt;/h3&gt;

&lt;p&gt;The issue is that some third-party applications decided not to use the &lt;code&gt;sub&lt;/code&gt; claim. The author of the Truffle Security post suggests that this is due to some bug in the Google OAuth implementation, but the reality is OAuth has nothing to do with this problem. The failure to use the &lt;code&gt;sub&lt;/code&gt; claim stems from that shiny property in the identity token called &lt;code&gt;email&lt;/code&gt;. In the original token above you can see the user's email, &lt;code&gt;warren@yourcompany.com&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A third party that utilizes this email address to uniquely identify users is allowing malicious attackers who compromise employee identity providers through expired domains to take over your account. There are lots of reasons they do this, but primarily it is because they like the way the &lt;code&gt;@&lt;/code&gt; looks in their database.&lt;/p&gt;

&lt;p&gt;That means this is actually &lt;strong&gt;a vulnerability on the third-party application side&lt;/strong&gt;. Any third-party application that allows users to log in with just an email is inherently creating a vulnerability in its own platform and setting itself up to expose its (ex-)users' data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vulnerability review
&lt;/h2&gt;

&lt;p&gt;So, actually, this has nothing to do with Google Workspace at all, and an attacker can use any email provider to perpetrate this attack:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Buy an expired domain and register your domain in a new email provider&lt;/li&gt;
&lt;li&gt;...&lt;/li&gt;
&lt;li&gt;Profit&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Although in this case the &lt;code&gt;...&lt;/code&gt; is simply: &lt;strong&gt;Attempt a password reset or magic-link authentication for that third party application.&lt;/strong&gt; &lt;em&gt;In a similar attack a vulnerability was utilized by attackers through an &lt;a href="https://www.rescana.com/post/critical-zendesk-email-spoofing-vulnerability-cve-2024-49193-risks-and-mitigation-strategies" rel="noopener noreferrer"&gt;email support system&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The real vulnerability
&lt;/h3&gt;

&lt;p&gt;This shows us that OAuth and Google Workspace aren't actually the source of the issue here; it's the third-party application. I've frequently condemned &lt;a href="https://authress.io/knowledge-base/articles/magic-links-passwordless-login" rel="noopener noreferrer"&gt;Magic-Link based Authentication&lt;/a&gt;, and while there are some areas where it unfortunately still provides value, it isn't worth it if you care about security. The fact that the email is provided by Google is just unfortunate. Emails are helpful for identifying where to send messages to users who want emails, but they should never be used anywhere related to security.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Dismantling the solution
&lt;/h3&gt;

&lt;p&gt;The original article suggests that adding two more claims/properties to the User Identity Token will solve the problem. One claim isn't good enough, let's have three!&lt;/p&gt;

&lt;p&gt;Given that the problem is that third-party applications are ignoring the already existing &lt;code&gt;sub&lt;/code&gt; claim, I find this to be quite a naïve suggestion. No amount of additional claims will prevent third parties from incorrectly substituting their own beliefs where actual security is necessary. This is just an unfortunate truth. We see this every day, and it is one of the reasons we built &lt;a href="https://authress.io" rel="noopener noreferrer"&gt;Authress&lt;/a&gt; in the first place. The defaults that exist in SDKs, frameworks, protocols, and standards are just not enough for people to do the right thing; explicit investment has to be made in preventing the wrong thing.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Third Party Application responsibility
&lt;/h3&gt;

&lt;p&gt;The last part of the problem is that the author of the original article claims:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What can Downstream Providers do to mitigate this? At the time of writing, there is no fix&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Which just isn't true. Third-party applications that allow email-based authentication must delete user data after account deactivation. Once you stop paying for a third-party application, that data must be deleted and never exposed again, unless you resume access and the third party verifies your identity. I prefer taking guidance from &lt;a href="https://pages.nist.gov/800-63-3-Implementation-Resources/63A/verification/" rel="noopener noreferrer"&gt;NIST 800-63A&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As a user, you too can do something. If you have sensitive data, you could decide not to use any third-party applications, unless of course you actually pay for them and ensure that you delete your account before your company stops using the application. If you give someone your data, they have it; assume the worst. We can and should put more responsibility onto these third-party application services that are utilizing unsafe email addresses, and often SMS numbers, for authentication. As long as you treat email auth as a valid solution, everyone will forever be just as culpable as the third parties who rely on it. Use &lt;a href="https://authress.io/knowledge-base/docs/authentication/user-authentication" rel="noopener noreferrer"&gt;OAuth and SAML&lt;/a&gt; for your &lt;a href="https://authress.io/knowledge-base/academy/topics/implementating-user-login" rel="noopener noreferrer"&gt;business authentication&lt;/a&gt; and make sure to provide sufficiently &lt;a href="https://authress.io/knowledge-base/docs/authentication/user-authentication" rel="noopener noreferrer"&gt;secure options&lt;/a&gt; to the users of the products and services you build.&lt;/p&gt;

&lt;h2&gt;
  
  
  Consumer exposure
&lt;/h2&gt;

&lt;p&gt;The original article also seems to conflate risks to consumers directly. There is nothing about this vulnerability that directly affects consumers. Sure, there are impacts to consumers regarding data privacy, but the vulnerability discussed in this article doesn't include them.&lt;/p&gt;

&lt;p&gt;That's because as a consumer, when you use an application, that application stores data in its primary databases. When the company that manages that application fails, both its databases and its bank accounts are empty. You don't have to worry about that data. But you do have to worry about who they gave your data to. You have to worry about that irrespective of the company, or its state. Many companies have come under investigation for just that. This is the whole premise of &lt;a href="https://en.wikipedia.org/wiki/Facebook%E2%80%93Cambridge_Analytica_data_scandal" rel="noopener noreferrer"&gt;Facebook's Cambridge Analytica scandal&lt;/a&gt;. Facebook gave user personal data to Cambridge Analytica, who should not have had access to it. Facebook didn't even need to be bankrupt for there to be a problem.&lt;/p&gt;

&lt;p&gt;The core of the issue isn't the data you have given to the company; the problem is the data they have shared with others. But no amount of praying or technological solutions is going to fix that. The problems proposed in the original article regarding the domain vulnerability in question relate to the data given to third-party applications secured by the company's corporate domain. The data that is most vulnerable in these circumstances is the business-to-business relationships: billing information, strategic partnerships, invoices, and business strategies are at risk.&lt;/p&gt;

&lt;p&gt;For example, at Authress we use Stripe, sometimes. In Stripe we have customer account information, including customer emails for sending invoices. If you are using Stripe or another payment provider, then chances are you too are storing some sort of customer data in Stripe. If your company goes bankrupt and an attacker uses the domain vulnerability to do a password reset on your Stripe account, they will then have access to your old company's customer invoice and email data. You probably don't care, but you should.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;So I think we can say definitively: &lt;strong&gt;no, there aren't millions of people at risk from this vulnerability&lt;/strong&gt;. Sure, your data is at risk; it always has been at risk and it always will be, but Google's OAuth implementation, while problematic, honestly doesn't change anything at all. You can continue to file your data deletion requests with your third-party application providers when you don't think they are doing too well. But if they aren't doing that well, I sincerely doubt they are deleting your data, let alone deleting your data from their third-party providers. I don't know what will become of the originally published articles or Google's response, but I felt strongly about first educating regarding the problem rather than lambasting Google Workspace over their responses. The claim by the original author that &lt;strong&gt;millions of accounts are vulnerable due to Google's OAuth Flaw&lt;/strong&gt; is just irresponsible.&lt;/p&gt;

&lt;p&gt;Curious about this and want to discuss it more?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rhosys.ch/community" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;Join the community&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>security</category>
      <category>startup</category>
      <category>oauth</category>
    </item>
  </channel>
</rss>
