[EPIC] Move the metrics cardinality limiter to Relay #2717

@jan-auer

Description

The performance of metrics queries depends on the number of time series they have to aggregate. To put an upper bound on the number of time series - and thus the number of metric buckets with a unique combination of tag values - the metrics ingestion pipeline has a cardinality limiter.

It resides in the metrics indexer consumer, where it currently accounts for more than half of the CPU usage and adds latency due to a high number of Redis calls. There are two reasons to move the cardinality limiter out of the indexer consumers:

  1. The limiter currently does not apply to Relay.
  2. The indexer consumers are a bottleneck and rely on low latency and high throughput.
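To illustrate the limiter's core check, here is a minimal in-memory sketch (hypothetical names; the production limiter keeps these sets in Redis with a time window rather than in process memory): a bucket passes if its time-series hash is already known, or if the org is still under its cardinality budget.

```rust
use std::collections::HashSet;

/// Minimal sketch of a cardinality limiter. The real implementation
/// tracks hashes per org in Redis sets scoped to a time window.
struct CardinalityLimiter {
    limit: usize,
    seen: HashSet<u64>, // hashes of tag combinations seen in the current window
}

impl CardinalityLimiter {
    fn new(limit: usize) -> Self {
        Self { limit, seen: HashSet::new() }
    }

    /// Returns true if the bucket with this hash may pass.
    fn accept(&mut self, bucket_hash: u64) -> bool {
        if self.seen.contains(&bucket_hash) {
            return true; // existing time series, adds no new cardinality
        }
        if self.seen.len() < self.limit {
            self.seen.insert(bucket_hash);
            return true; // new time series within budget
        }
        false // would exceed the cardinality limit, reject
    }
}

fn main() {
    let mut limiter = CardinalityLimiter::new(2);
    assert!(limiter.accept(1));
    assert!(limiter.accept(2));
    assert!(limiter.accept(1)); // already known, still accepted
    assert!(!limiter.accept(3)); // third distinct series is over the limit
    println!("ok");
}
```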

Stage 1 / Re-Implementation
- [ ] https://github.com/getsentry/relay/issues/2719
- [x] Organize a Memstore/Redis Instance (We decided to use the existing Relay Redis for now)
- [x] Pass configuration via global config down from Sentry
- [ ] https://github.com/getsentry/sentry/pull/61028
- [ ] https://github.com/getsentry/sentry/pull/61024
- [ ] https://github.com/getsentry/getsentry/pull/12241
- [ ] https://github.com/getsentry/relay/pull/2745
- [x] ~~`SMISMEMBER` instead of loading the entire set~~ (Needs Redis 6.2)
- [ ] https://github.com/getsentry/relay/pull/2849
- [x] Hookup Cardinality Limiter into existing Processing Pipeline
- [x] Investigate alternative HashSet hasher, e.g. AHash for Redis implementation
- [ ] https://github.com/getsentry/snuba/pull/5349/
- [ ] https://github.com/getsentry/sentry-options-automator/pull/574
- [ ] https://github.com/getsentry/relay/pull/2947
- [ ] https://github.com/getsentry/sentry/pull/63562
- [ ] https://github.com/getsentry/relay/pull/2972
- [ ] https://github.com/getsentry/relay/pull/2990
- [ ] https://github.com/getsentry/relay/pull/2978
- [ ] https://github.com/getsentry/relay/pull/3008
- [ ] https://github.com/getsentry/relay/pull/3020
- [ ] https://github.com/getsentry/relay/pull/3018
- [ ] https://github.com/getsentry/relay/issues/3010
- [ ] https://github.com/getsentry/relay/pull/3029
- [ ] https://github.com/getsentry/relay/pull/3053
- [ ] https://github.com/getsentry/relay/pull/3071
- [ ] https://github.com/getsentry/relay/pull/3076
- [ ] https://github.com/getsentry/ops/pull/9381
- [ ] https://github.com/getsentry/relay/pull/3199
- [ ] https://github.com/getsentry/sentry/pull/66299
- [x] ~~Investigate using the hash as a hashmap hash instead of hashing the hash, e.g. https://crates.io/crates/hash_hasher~~  Performs very badly with certain hashes, minimal gain
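Several of the items above concern how bucket hashes are computed (the AHash and hash_hasher investigations). As a hypothetical sketch of deriving a stable per-time-series hash from a metric name and its tags (Relay's actual hashing scheme may differ), tags are sorted first so that tag order does not change the resulting hash:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash a metric name plus its tag set into a single u64 identifying the
/// time series. Tags are sorted so that tag order does not affect the hash.
/// (Illustrative sketch only; Relay may use a different hasher, e.g. the
/// AHash variant evaluated above.)
fn bucket_hash(name: &str, tags: &[(&str, &str)]) -> u64 {
    let mut sorted: Vec<_> = tags.to_vec();
    sorted.sort();

    let mut hasher = DefaultHasher::new();
    name.hash(&mut hasher);
    for (key, value) in &sorted {
        key.hash(&mut hasher);
        value.hash(&mut hasher);
    }
    hasher.finish()
}

fn main() {
    let a = bucket_hash("transaction.duration", &[("env", "prod"), ("release", "1.0")]);
    let b = bucket_hash("transaction.duration", &[("release", "1.0"), ("env", "prod")]);
    assert_eq!(a, b); // tag order does not matter
    let c = bucket_hash("transaction.duration", &[("env", "staging")]);
    assert_ne!(a, c); // different tag values -> different time series
    println!("ok");
}
```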

Validation

  • Metrics
    • Same metrics as in the indexer cardinality limiter
    • Observability when org limits are hit, via Sentry errors and metrics

Rollout

  • New memstore
  • Load Test (cardinality), adapt existing indexer load tests
  • Rollout in S4S
    • Use a really low sample rate, slowly ramp up the rate
    • Verify the indexer limiter is no longer hit
    • Turn off limiter in Indexer
  • Rollout in Production
    • Same as above, just more informed decisions

Stage 2

Replace the Redis set with an in-memory HLL or Bloom filter that is periodically merged back into Redis to synchronize all nodes.

While using a probabilistic data structure in either Redis or Relay remains an option, a Bloom filter would reduce the number of bits we store per time series by about 50% (from 32 to ~15), while at the same time decreasing lookup and write performance. Since our current bottleneck is more likely CPU than storage, this option is less interesting and valuable.
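The ~15-bit figure is consistent with the standard Bloom filter sizing formula, bits per element = -ln(p) / (ln 2)². The issue does not state the assumed false-positive rate; 0.1% is an illustrative choice here that reproduces roughly that number:

```rust
/// Standard Bloom filter sizing: bits per element = -ln(p) / (ln 2)^2,
/// where p is the target false-positive rate.
fn bloom_bits_per_element(false_positive_rate: f64) -> f64 {
    -false_positive_rate.ln() / (2f64.ln() * 2f64.ln())
}

fn main() {
    // At a ~0.1% false-positive rate (an assumed value, not from the issue),
    // a Bloom filter needs ~14.4 bits per element, roughly half of the
    // 32 bits per stored hash cited above.
    let bits = bloom_bits_per_element(0.001);
    assert!(bits > 14.0 && bits < 15.0);
    println!("{bits:.1}"); // ~14.4
}
```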
