feat(metrics): Unicode escapes for tag values in statsd#3358
feat(metrics): Unicode escapes for tag values in statsd#3358
Conversation
* master: feat(metric-stats): Report cardinality to metric stats (#3360) release: 0.8.56 fix(perfscore): Adds span op tag to perf score totals (#3326) ref(profiles): Return retention_days as part of the Kafka message (#3362) ref(filter): Add GTmetrix to the list of web crawlers (#3363) fix: Fix kafka topic default (#3350) ref(normalization): Remove duplicated normalization (#3355) feat(feedback): Emit outcomes for user feedback events (#3026) feat(cardinality): Implement cardinality reporting (#3342)
| /// because data structures or their serialization have overheads. | ||
| pub fn tags_cost(tags: &BTreeMap<String, String>) -> usize { | ||
| tags.iter().map(|(k, v)| k.capacity() + v.capacity()).sum() | ||
| tags.iter().map(|(k, v)| k.len() + v.len()).sum() |
There was a problem hiding this comment.
This is an unrelated bugfix that showed up in one of the tests: By using capacity, we were overestimating the size of the buckets. Due to the introduction of unescaper::unescape, the allocated tag strings were larger than their contents, which changed test behavior.
There was a problem hiding this comment.
Mh I wonder if this wasn't deliberate, because capacity is the cost of the metric in memory (just doesn't work for serialization).
There was a problem hiding this comment.
It was deliberate, I originally used capacity() since we used this to control memory consumption. More than that, we now rely on this for serialization now and Joris has since changed the corresponding check on metric names to use .len(). So it should be fine to use len.
If we want robust memory measurements, I'm afraid we'll have to explore different approaches like arena allocators.
There was a problem hiding this comment.
Joris has since changed the corresponding check on metric names to use .len()
@jan-auer are you referring to this line? The reason why it uses .len() is because metric names are now represented by Arc<str>, not String (see #3279).
relay/relay-metrics/src/aggregator.rs
Line 84 in e614e91
No heavy objections against this change, but the cleanest solution would be to have two different estimation functions for memory footprint and serialization cost.
| /// Namespaces and units must consist of ASCII characters and match the regular expression | ||
| /// `/\w+/`. The name component of MRIs consist of unicode characters and must match the | ||
| /// regular expression `/\w[\w\d_-.]+/`. Note that the name must begin with a letter. | ||
| /// regular expression `/\w[\w\-.]*/`. Note that the name must begin with a letter. |
There was a problem hiding this comment.
The validation/normalization implementation in Relay differs from the rules for legal metric names:
- We do not yet enforce the first letter. This will come in as a separate refactor.
- Dashes are intentionally replaced with underscores. Right now, the product wants just underscores, but we still permit SDKs to send dashes for future proofing.
There was a problem hiding this comment.
Dashes are intentionally replaced with underscores
Should we add this information to the doc comment here?
There was a problem hiding this comment.
Actually, I just found that the first point is wrong. We do validate that the first character is a letter.
I've described this more clearly in the docs of try_normalize_metric_name now. To readers of the public Rust docs (i.e. client implementors) this information ideally shouldn't matter.
| /// because data structures or their serialization have overheads. | ||
| pub fn tags_cost(tags: &BTreeMap<String, String>) -> usize { | ||
| tags.iter().map(|(k, v)| k.capacity() + v.capacity()).sum() | ||
| tags.iter().map(|(k, v)| k.len() + v.len()).sum() |
There was a problem hiding this comment.
Joris has since changed the corresponding check on metric names to use .len()
@jan-auer are you referring to this line? The reason why it uses .len() is because metric names are now represented by Arc<str>, not String (see #3279).
relay/relay-metrics/src/aggregator.rs
Line 84 in e614e91
No heavy objections against this change, but the cleanest solution would be to have two different estimation functions for memory footprint and serialization cost.
| /// Namespaces and units must consist of ASCII characters and match the regular expression | ||
| /// `/\w+/`. The name component of MRIs consist of unicode characters and must match the | ||
| /// regular expression `/\w[\w\d_-.]+/`. Note that the name must begin with a letter. | ||
| /// regular expression `/\w[\w\-.]*/`. Note that the name must begin with a letter. |
There was a problem hiding this comment.
Dashes are intentionally replaced with underscores
Should we add this information to the doc comment here?
| } | ||
|
|
||
| let normalize_re = NORMALIZE_RE.get_or_init(|| Regex::new("[^a-zA-Z0-9_/.]+").unwrap()); | ||
| let normalize_re = NORMALIZE_RE.get_or_init(|| Regex::new("[^a-zA-Z0-9_.]+").unwrap()); |
There was a problem hiding this comment.
Why is / not replaced anymore?
Also, should the . be escaped? Or is it not interpreted as a wildcard when it occurs within [...]?
There was a problem hiding this comment.
Why is / not replaced anymore?
👀
Or is it not interpreted as a wildcard when it occurs within [...]?
It's a literal . in a match group.
There was a problem hiding this comment.
/ was exempt from replacing and is now replaced. The original spec allowed slashes, but for a long while SDK guidelines have excluded it. We're now aligning with the latest character set described in the epic and on develop docs.
| relay-log = { workspace = true } | ||
| relay-statsd = { workspace = true } | ||
| tokio = { workspace = true, features = ["rt", "signal"] } | ||
| tokio = { workspace = true, features = ["rt", "signal", "macros"] } |
There was a problem hiding this comment.
It fixes a compilation issue when the relay-system crate is compiled alone (e.g. to run tests). The crate actually depends on macros, we just didn't notice before as another workspace member enables this feature usually.
| } | ||
|
|
||
| let normalize_re = NORMALIZE_RE.get_or_init(|| Regex::new("[^a-zA-Z0-9_/.]+").unwrap()); | ||
| let normalize_re = NORMALIZE_RE.get_or_init(|| Regex::new("[^a-zA-Z0-9_.]+").unwrap()); |
There was a problem hiding this comment.
Why is / not replaced anymore?
👀
Or is it not interpreted as a wildcard when it occurs within [...]?
It's a literal . in a match group.
|
Thanks for the reviews. I've addressed most comments. Going to merge once we reach consensus on #3358 (comment) |
Disclaimer: See the epic for more context.
Adds support for escape sequences based on best practices recommended in RFC 5137.
Tag values allow all printable unicode characters. Control characters have to be stripped from output. There is a list of restricted characters that have to be escaped according to the following mapping:
\t.\r.\n.\\.\u{2c}and\u{7c}, respectively.Note: Documentation previously allowed all
\sin tag values, whichincluded newlines. These are not legal and would lead to invalid statsd
payloads. Newlines and all whitespace characters other than a plain space now
require an escape sequence and will be removed as control characters.
Epic: https://github.com/getsentry/team-ingest/issues/304