The performance of metrics queries depends on the number of time series they have to aggregate. To put an upper bound on the number of time series - and thus the number of metric buckets with a unique combination of tag values - the metrics ingestion pipeline has a cardinality limiter.
It resides in the metrics indexer consumer, where it currently accounts for more than half of the CPU usage and introduces additional latency due to a high number of Redis calls. There are two reasons to move the cardinality limiter out of the indexer consumers:
- The limiter currently does not apply to Relay.
- The indexer consumers are a bottleneck and rely on low latency and high throughput.
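To make the mechanism concrete, the limiting scheme can be sketched as an in-memory model: one set of accepted bucket hashes per organization and time window, where a bucket is accepted if its series is already known or the org still has budget for a new unique series. This is purely illustrative (names and structure are hypothetical); the real limiter keeps these sets in Redis with a per-window expiry.

```rust
use std::collections::{HashMap, HashSet};

/// Illustrative in-memory cardinality limiter: one set of accepted
/// bucket hashes per organization. The production limiter keeps these
/// sets in Redis, scoped and expired per time window.
struct CardinalityLimiter {
    limit: usize,
    accepted: HashMap<u64, HashSet<u64>>, // org_id -> bucket hashes seen this window
}

impl CardinalityLimiter {
    fn new(limit: usize) -> Self {
        Self { limit, accepted: HashMap::new() }
    }

    /// Accepts a bucket if its series is already known, or if the org
    /// still has budget for a new unique series.
    fn accept(&mut self, org_id: u64, bucket_hash: u64) -> bool {
        let set = self.accepted.entry(org_id).or_default();
        if set.contains(&bucket_hash) {
            return true; // an existing series never counts against the budget again
        }
        if set.len() < self.limit {
            set.insert(bucket_hash); // new series, within budget
            return true;
        }
        false // new series over the limit: reject the bucket
    }
}

fn main() {
    let mut limiter = CardinalityLimiter::new(2);
    assert!(limiter.accept(1, 100)); // first unique series
    assert!(limiter.accept(1, 200)); // second unique series
    assert!(limiter.accept(1, 100)); // duplicate of a known series
    assert!(!limiter.accept(1, 300)); // third unique series: rejected
}
```

Note that duplicates are always accepted: the limit bounds the number of *unique* time series, not the number of buckets.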
### Stage 1 / Re-Implementation
- [ ] https://github.com/getsentry/relay/issues/2719
- [x] Organize a Memstore/Redis Instance (We decided to use the existing Relay Redis for now)
- [x] Pass configuration via global config down from Sentry
- [ ] https://github.com/getsentry/sentry/pull/61028
- [ ] https://github.com/getsentry/sentry/pull/61024
- [ ] https://github.com/getsentry/getsentry/pull/12241
- [ ] https://github.com/getsentry/relay/pull/2745
- [x] ~~`SMISMEMBER` instead of loading the entire set~~ (Needs Redis 6.2)
- [ ] https://github.com/getsentry/relay/pull/2849
- [x] Hook up the cardinality limiter to the existing processing pipeline
- [x] Investigate alternative `HashSet` hashers, e.g. AHash, for the Redis implementation
- [ ] https://github.com/getsentry/snuba/pull/5349/
- [ ] https://github.com/getsentry/sentry-options-automator/pull/574
- [ ] https://github.com/getsentry/relay/pull/2947
- [ ] https://github.com/getsentry/sentry/pull/63562
- [ ] https://github.com/getsentry/relay/pull/2972
- [ ] https://github.com/getsentry/relay/pull/2990
- [ ] https://github.com/getsentry/relay/pull/2978
- [ ] https://github.com/getsentry/relay/pull/3008
- [ ] https://github.com/getsentry/relay/pull/3020
- [ ] https://github.com/getsentry/relay/pull/3018
- [ ] https://github.com/getsentry/relay/issues/3010
- [ ] https://github.com/getsentry/relay/pull/3029
- [ ] https://github.com/getsentry/relay/pull/3053
- [ ] https://github.com/getsentry/relay/pull/3071
- [ ] https://github.com/getsentry/relay/pull/3076
- [ ] https://github.com/getsentry/ops/pull/9381
- [ ] https://github.com/getsentry/relay/pull/3199
- [ ] https://github.com/getsentry/sentry/pull/66299
- [x] ~~Investigate using the hash as a hashmap hash instead of hashing the hash, e.g. https://crates.io/crates/hash_hasher~~ Performs very badly with certain hashes, minimal gain
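For context on the `SMISMEMBER` item above: per batch, the check boils down to a membership test of all incoming hashes against the org's Redis set, plus adding new hashes while budget remains. With Redis >= 6.2 the membership test could be a single `SMISMEMBER`; without it, the whole set is loaded and checked locally. A hedged sketch of that flow, with a local `HashSet` standing in for the Redis set (the function name and signature are illustrative, not Relay's actual API):

```rust
use std::collections::HashSet;

/// Hypothetical per-batch check. The local `HashSet` stands in for the
/// org's Redis set; `contains` models the membership test and `insert`
/// models an `SADD` of a newly accepted hash.
fn check_batch(org_set: &mut HashSet<u64>, hashes: &[u64], limit: usize) -> Vec<bool> {
    // Budget left for series the org has not reported before.
    let mut budget = limit.saturating_sub(org_set.len());
    hashes
        .iter()
        .map(|hash| {
            if org_set.contains(hash) {
                true // known series: always accepted
            } else if budget > 0 {
                budget -= 1;
                org_set.insert(*hash); // would be an `SADD` against Redis
                true
            } else {
                false // over the limit: reject
            }
        })
        .collect()
}

fn main() {
    // Org already has one known series, and a limit of two.
    let mut org_set: HashSet<u64> = [1].into_iter().collect();
    let decisions = check_batch(&mut org_set, &[1, 2, 3], 2);
    assert_eq!(decisions, vec![true, true, false]);
}
```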
### Validation
- Metrics
- Same metrics as in the indexer cardinality limiter
- Observability via Sentry errors and metrics when org limits are hit
### Rollout
- New memstore
- Load Test (cardinality), adapt existing indexer load tests
- Rollout in S4S
- Start with a very low sample rate, then slowly ramp it up
- Verify the indexer limiter is no longer hit
- Turn off the limiter in the indexer
- Rollout in Production
- Same as above, just more informed decisions
### Stage 2
Replace the Redis set with an in-memory HLL and Bloom filter that is repeatedly merged back into Redis to synchronize all nodes.
While using a probabilistic data structure in either Redis or Relay remains an option, a Bloom filter would cut the number of bits stored per hash by roughly 50% (from 32 to ~15), while at the same time decreasing lookup and write performance. Our current bottleneck is more likely CPU than storage, which makes this option less interesting and valuable.
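As a rough illustration of that trade-off, a minimal Bloom filter sketch using only std hashing (sizes and probe counts here are illustrative, not a proposed configuration): fewer bits are stored per element, but each lookup and write touches multiple bit positions, which is the extra CPU cost.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Minimal Bloom filter sketch. More probes per element buy a lower
/// false-positive rate at a given bit budget, at the cost of extra
/// hashing work on every lookup and write.
struct BloomFilter {
    bits: Vec<u64>,
    num_bits: usize,
    num_hashes: u64,
}

impl BloomFilter {
    fn new(num_bits: usize, num_hashes: u64) -> Self {
        Self { bits: vec![0; (num_bits + 63) / 64], num_bits, num_hashes }
    }

    /// Derive k bit positions from two base hashes (Kirsch-Mitzenmacher).
    fn positions(&self, item: u64) -> Vec<usize> {
        let mut hasher = DefaultHasher::new();
        item.hash(&mut hasher);
        let h1 = hasher.finish();
        let h2 = h1.rotate_left(32) | 1; // force odd stride
        (0..self.num_hashes)
            .map(|i| (h1.wrapping_add(i.wrapping_mul(h2)) as usize) % self.num_bits)
            .collect()
    }

    fn insert(&mut self, item: u64) {
        for pos in self.positions(item) {
            self.bits[pos / 64] |= 1u64 << (pos % 64);
        }
    }

    /// May return false positives, never false negatives.
    fn contains(&self, item: u64) -> bool {
        self.positions(item)
            .into_iter()
            .all(|pos| self.bits[pos / 64] & (1u64 << (pos % 64)) != 0)
    }
}

fn main() {
    let mut filter = BloomFilter::new(1 << 14, 10);
    for hash in [100u64, 200, 300] {
        filter.insert(hash);
    }
    // Inserted items are always found; absent items are only *probably* absent.
    assert!(filter.contains(100) && filter.contains(200) && filter.contains(300));
}
```

The false-positive behavior is also why a plain Bloom filter cannot replace the set one-to-one for limiting: a false positive here means a genuinely new series is mistaken for a known one and admitted without consuming budget, which slightly loosens rather than tightens the limit.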