
Variables in initial_metadata do not get evaluated in rate_limit_service when used with apply_on_stream_done #40892

@brightbyte

Description


Title: Variables in initial_metadata do not get evaluated in rate_limit_service when used with apply_on_stream_done.

Description:
I am setting headers for the rate-limiter side request using the following config (note that initial_metadata is a repeated field, hence the list entry):

    rate_limit_service:
      grpc_service:
        envoy_grpc:
          cluster_name: ratelimit
        initial_metadata:
          - key: "x-cluster-hash-key"
            value: "%DYNAMIC_METADATA(envoy.filters.http.ratelimit.backend_req:cluster_hash_key)%"

Here cluster_hash_key is set by a Lua filter. This works as expected with a regular RateLimit filter, but it does not work if the filter has apply_on_stream_done set: in that context, the variables (aka command operators) are left unevaluated. I also tried using a header with %REQ, same problem.
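For reference, the dynamic metadata entry read above is produced by a Lua filter roughly like the following (a hypothetical sketch: the filter body and the hard-coded key value are assumptions for illustration; the real filter is in the linked config):

```yaml
# Hypothetical sketch of the Lua filter that writes the hash key into
# dynamic metadata under the namespace the ratelimit config reads from.
- name: envoy.filters.http.lua
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
    default_source_code:
      inline_string: |
        function envoy_on_request(request_handle)
          -- Hard-coded here for illustration; the real filter derives
          -- the key from the request (e.g. a user id).
          request_handle:streamInfo():dynamicMetadata():set(
            "envoy.filters.http.ratelimit.backend_req", "cluster_hash_key", "2345555")
        end
```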

This is a showstopper for my use case. I need cost-based limiting, so I have to send a second request to the rate-limiting service during the response flow to record the actual cost of the request. That second request needs to carry the same value for the x-cluster-hash-key header, to ensure it reaches the rate-limiter instance that maintains the counter for that user. If I can't do this, I have to go back to using a single shared store for all counters, which I would very much like to avoid.
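A minimal sketch of such a response-phase limit, assuming a route-level rate_limits entry with the apply_on_stream_done flag and an illustrative request_headers action (the real descriptors are in the linked config):

```yaml
# Hedged sketch: field names from the route-level RateLimit config;
# the descriptor action is illustrative only.
rate_limits:
  - apply_on_stream_done: true   # evaluated when the stream completes,
                                 # so the actual cost can be reported
    actions:
      - request_headers:         # illustrative action; the real config
          header_name: x-user    # builds user/group descriptors
          descriptor_key: user
```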

@wbpcode explained that this difference in behavior is due to lifecycle issues. But curiously, I can access the relevant header just fine in the part of the RateLimit config where I construct the limit descriptor to be sent to the rate limiter. Only when I try to put it into a header (initial metadata) does it fail.

Repro steps:
Happens on every request that has rate limiting applied.

Config:
Envoy config: https://phabricator.wikimedia.org/P82004
Helm chart: https://gitlab.wikimedia.org/daniel/rlstools/-/tree/f44dd3c5500734345ced1017eaac7804388d050d/environments/envoy-helm

Logs:
Logs of requests going into the rate limit service:

[pod/rls-3/rate-limit-service] 2025/08/29 11:46:33 Received: domain:"apis" descriptors:{entries:{key:"user" value:"2345555"} entries:{key:"group" value:"cookie-user"} hits_addend:{value:5}}; Meta: metadata.MD{":authority":[]string{"ratelimit"}, "content-type":[]string{"application/grpc"}, "x-cluster-hash-key":[]string{"2345555"}, "x-envoy-expected-rq-timeout-ms":[]string{"20"}, "x-envoy-internal":[]string{"true"}, "x-forwarded-for":[]string{"10.244.1.186"}}
[pod/rls-1/rate-limit-service] 2025/08/29 11:46:33 Received: domain:"apis" descriptors:{entries:{key:"user" value:"2345555"} entries:{key:"group" value:"cookie-user"} hits_addend:{value:3}}; Meta: metadata.MD{":authority":[]string{"ratelimit"}, "content-type":[]string{"application/grpc"}, "x-cluster-hash-key":[]string{"%DYNAMIC_METADATA(envoy.filters.http.ratelimit.backend_req:cluster_hash_key)%"}, "x-envoy-expected-rq-timeout-ms":[]string{"20"}, "x-envoy-internal":[]string{"true"}, "x-forwarded-for":[]string{"10.244.1.186"}}

This illustrates the problem (the first request gets "x-cluster-hash-key":[]string{"2345555"}, while the second gets the unevaluated "x-cluster-hash-key":[]string{"%DYNAMIC_METADATA(envoy.filters.http.ratelimit.backend_req:cluster_hash_key)%"}) and also its consequence: the first request goes to pod/rls-3 and the second to pod/rls-1, because the RingHash load balancer hashes the value of the x-cluster-hash-key header to pick a rate-limiter instance.
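For context, the routing behavior described above follows from a ratelimit cluster configured roughly like this (a sketch with assumed endpoint details; the real cluster is in the linked config):

```yaml
# Hedged sketch of the ratelimit cluster; addresses are assumptions.
clusters:
  - name: ratelimit
    type: STRICT_DNS
    lb_policy: RING_HASH        # consistent hashing: the same hash key
                                # always maps to the same instance
    load_assignment:
      cluster_name: ratelimit
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address: { address: rls, port_value: 8081 }  # assumed
```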
