feat(server): Org rate limit per metric bucket #2758
Conversation
Dav1dde
left a comment
Just some quick nits, I'll have to look into the outcome logging more carefully later.
I would like it if we didn't have to duplicate this and could somehow share as much code as possible with the ManagedEnvelope logic (where transaction and profile outcomes are created when we drop the envelope). But I'll need more time to think about this.
```rust
let _bucket_qty = buckets.len();
let bucket_partitions = partition_buckets(scoping.project_key, buckets, partitions);

#[cfg(feature = "processing")]
```
I think it makes sense to check the cached rate limits even on non-processing relays
Well, at the moment they just get updated from Redis, which isn't available on non-processing relays, so I'm not sure that would make sense.
I am not sure exactly how propagation of quotas/limits to PoPs works, but I don't see a reason why we can't check them now; once we actually have working propagation, everything will just work.
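The flow discussed in this thread can be sketched as follows. This is a hypothetical Python illustration, not Relay's actual Rust types: every relay consults a local cache of rate limits before forwarding buckets, and only processing relays refresh that cache from Redis (PoPs would eventually be fed by propagated limits from upstream):

```python
class CachedRateLimits:
    """Sketch of a local rate-limit cache shared by all relay flavors.

    Hypothetical names for illustration; in Relay the cache lives in the
    Rust codebase and is refreshed from Redis on processing relays only.
    """

    def __init__(self):
        self.limited_orgs = set()

    def is_limited(self, org_id):
        # Cheap local check that works on any relay, even if the cache
        # is never populated (then it simply admits everything).
        return org_id in self.limited_orgs

    def merge(self, limited_org_ids):
        # On processing relays this is fed from Redis; on non-processing
        # relays it could be fed by limits propagated from upstream.
        self.limited_orgs.update(limited_org_ids)
```

With this shape, checking the cache unconditionally is harmless on non-processing relays today and becomes effective as soon as propagation exists.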
iker-barriocanal
left a comment
Does this PR rate-limit buckets on a binary basis? Do we want to consider scenarios in which some buckets are rate-limited but others aren't?
relay-server/src/actors/processor.rs
Outdated
```rust
self.drop_buckets_with_outcomes(
    reason_code,
    total_batches,
```
What's the difference between total_batches and bucket_qty (the function param)? Is it the maximum size of the bucket? Can we use just one of the two? It seems odd to me that we sometimes use one and sometimes the other (see when emitting outcomes and in log messages).
bucket_qty is the total number of buckets in the flush; total_batches is the number of batches we are writing upstream (i.e., the number of requests to the upstream, or the number of messages in the Kafka topic).
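The relationship between the two quantities can be illustrated with a small sketch. The helper name and the fixed buckets-per-batch assumption are hypothetical (Relay's real batching is driven by payload size limits, not a fixed count):

```python
import math

def batch_counts(buckets, buckets_per_batch):
    """Illustrate bucket_qty vs. total_batches (hypothetical helper).

    bucket_qty    -- total number of buckets in the flush
    total_batches -- number of upstream requests / Kafka messages,
                     assuming a fixed number of buckets per batch
    """
    bucket_qty = len(buckets)
    total_batches = math.ceil(bucket_qty / buckets_per_batch)
    return bucket_qty, total_batches
```

So a flush of 10 buckets split into batches of 4 yields bucket_qty = 10 but total_batches = 3, which is why the two numbers diverge in outcomes and log messages.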
relay-server/src/actors/processor.rs
Outdated
```rust
event_id: None,
remote_addr: None,
category: DataCategory::Metrics,
quantity: total_batches as u32,
```
Should we exclude buckets with the usage metric from rate-limited outcomes? These metrics are data we (Sentry) generate but the user doesn't care about, and I assume these outcomes will be presented to them.
This is on the roadmap but not planned for this PR; the idea is to rate limit based on namespace (there is some info in the epic: #2716).
jan-auer
left a comment
This looks good to me. Could you add an integration test?
```c
/**
 * Metric bucket.
 */
RELAY_DATA_CATEGORY_METRIC_BUCKET = 15,
```
Note that this will require a release of the Python library and a version bump in Sentry and Snuba.
CHANGELOG.md
Outdated
- Add Mixed JS/Android Profiles events processing. ([#2706](https://github.com/getsentry/relay/pull/2706))
- Allow to ingest measurements on a span. ([#2792](https://github.com/getsentry/relay/pull/2792))
- Add size limits on metric related envelope items. ([#2800](https://github.com/getsentry/relay/pull/2800))
- Include the size offending item in the size limit error message. ([#2801](https://github.com/getsentry/relay/pull/2801))
relay-server/src/actors/processor.rs
Outdated
```rust
    source_quantities += source_quantities_from_buckets(&BucketsView::new(buckets), mode);
}

let timestamp = UnixTimestamp::now().as_datetime().unwrap_or_else(Utc::now);
```
Isn't this just Utc::now()? Why the round trip through UnixTimestamp?
```python
]

def generate_ticks():
    # Generate a new timestamp for every bucket, so they do not get merged by the aggregator
```
We could also use tags for this; that should be easier than having timestamps tied to bucket_interval.
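Both approaches from this thread can be sketched side by side. Names and constants here are hypothetical, modeled loosely on the test snippet above; the point is that the aggregator only merges buckets that share both timestamp and tags:

```python
import itertools

def generate_ticks(start=1_600_000_000, bucket_interval=10):
    """Yield a fresh timestamp per bucket so the aggregator keeps them separate."""
    for i in itertools.count():
        yield start + i * bucket_interval

def buckets_with_unique_tags(n, timestamp=1_600_000_000):
    """Alternative suggested in the review: keep one timestamp but vary a tag.

    Buckets with different tag sets never merge, so no coupling to
    bucket_interval is needed.
    """
    return [
        {
            "name": "c:transactions/count@none",  # hypothetical metric name
            "timestamp": timestamp,
            "tags": {"tick": str(i)},  # unique tag prevents merging
            "value": 1.0,
            "type": "c",
            "width": 10,
        }
        for i in range(n)
    ]
```

The tag-based variant is simpler in tests because all buckets can share one timestamp regardless of the configured aggregation window.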
Reverts #2758. This was deployed before Sentry updated librelay, so Sentry crashed with an unknown data category.
We had an incident when deploying #2758 because we unexpectedly sent outcomes of type ItemBucket without having updated Sentry. This PR only adds the new data category so we can properly synchronize on both ends before adding the business logic. #2821 (comment)
Part of: #2716. In order to protect our Kafka metric consumers, we want a way of rate limiting based on the number of buckets, since that is what determines the load placed on our Kafka topics. We are starting out with just the org throughput limits, but this will be expanded upon further as outlined in the linked epic. This is a re-revert of #2758 (reverted in #2821).

Co-authored-by: Joris Bayer <joris.bayer@sentry.io>
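The org-level bucket throughput limit described here can be illustrated with a minimal sketch. All names are hypothetical (Relay's real limiter is Redis-backed, shared across instances, and keyed by scoping), but it shows the binary per-flush decision discussed earlier in the review:

```python
import time
from collections import defaultdict

class OrgBucketRateLimiter:
    """Minimal sketch of a per-org metric-bucket throughput limit.

    Hypothetical, in-process illustration only: the real limiter is
    Redis-backed so the count is shared across processing relays.
    """

    def __init__(self, limit_per_window, window_secs=60):
        self.limit = limit_per_window
        self.window = window_secs
        # (org_id, window_index) -> number of buckets admitted so far
        self.counts = defaultdict(int)

    def check(self, org_id, bucket_qty, now=None):
        """Return True if the flush's buckets are admitted, False if rate limited."""
        now = time.time() if now is None else now
        key = (org_id, int(now // self.window))
        if self.counts[key] + bucket_qty > self.limit:
            # Binary decision: the whole flush is dropped and outcomes emitted.
            return False
        self.counts[key] += bucket_qty
        return True
```

Counting buckets (rather than bytes or requests) matches the stated goal, since the number of buckets is what drives load on the Kafka metric consumers.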