feat(metrics): Unicode escapes for tag values in statsd by jan-auer · Pull Request #3358 · getsentry/relay

jan-auer · 2024-03-28T14:35:02Z

Disclaimer: See the epic for more context.

Adds support for escape sequences based on best practices recommended in RFC 5137.

Tag values allow all printable unicode characters. Control characters have to be stripped from output. There is a list of restricted characters that have to be escaped according to the following mapping:

Tab is escaped as \t.
Carriage return is escaped as \r.
Line feed is escaped as \n.
Backslash is escaped as \\.
Commas and pipes are given unicode escapes in the form \u{2c} and \u{7c}, respectively.

Note: Documentation previously allowed all \s in tag values, which
included newlines. These are not legal and would lead to invalid statsd
payloads. Tabs, carriage returns, and line feeds now require an escape
sequence; all other whitespace characters except a plain space are removed
as control characters.
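
The mapping above can be sketched as a small escaping function. This is a minimal illustration, not the code from this PR: the name `escape_tag_value` is hypothetical, and the sketch assumes that "control characters" means whatever `char::is_control` reports.

```rust
/// Escapes a tag value for a statsd payload.
/// Hypothetical helper for illustration; not Relay's actual API.
fn escape_tag_value(raw: &str) -> String {
    let mut escaped = String::with_capacity(raw.len());
    for c in raw.chars() {
        match c {
            // Restricted characters with short escapes. These arms come
            // first because tab, CR, and LF are also control characters.
            '\t' => escaped.push_str("\\t"),
            '\r' => escaped.push_str("\\r"),
            '\n' => escaped.push_str("\\n"),
            '\\' => escaped.push_str("\\\\"),
            // Statsd delimiters get Unicode escapes.
            ',' => escaped.push_str("\\u{2c}"),
            '|' => escaped.push_str("\\u{7c}"),
            // Assumption: every remaining control character is dropped.
            c if c.is_control() => {}
            // All other characters pass through unchanged.
            c => escaped.push(c),
        }
    }
    escaped
}
```

Under these assumptions, the value `a,b|c` would be emitted as `a\u{2c}b\u{7c}c`, and a bell character (U+0007) would be dropped from the output.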

Epic: https://github.com/getsentry/team-ingest/issues/304

* master:
  feat(metric-stats): Report cardinality to metric stats (#3360)
  release: 0.8.56
  fix(perfscore): Adds span op tag to perf score totals (#3326)
  ref(profiles): Return retention_days as part of the Kafka message (#3362)
  ref(filter): Add GTmetrix to the list of web crawlers (#3363)
  fix: Fix kafka topic default (#3350)
  ref(normalization): Remove duplicated normalization (#3355)
  feat(feedback): Emit outcomes for user feedback events (#3026)
  feat(cardinality): Implement cardinality reporting (#3342)

jan-auer · 2024-04-04T14:11:29Z

relay-metrics/src/aggregator.rs

 /// because data structures or their serialization have overheads.
 pub fn tags_cost(tags: &BTreeMap<String, String>) -> usize {
-    tags.iter().map(|(k, v)| k.capacity() + v.capacity()).sum()
+    tags.iter().map(|(k, v)| k.len() + v.len()).sum()


This is an unrelated bugfix that showed up in one of the tests: By using capacity, we were overestimating the size of the buckets. Due to the introduction of unescaper::unescape, the allocated tag strings were larger than their contents, which changed test behavior.

Mh I wonder if this wasn't deliberate, because capacity is the cost of the metric in memory (just doesn't work for serialization).

It was deliberate: I originally used capacity() because we used this to control memory consumption. Beyond that, we now also rely on this for serialization, and Joris has since changed the corresponding check on metric names to use .len(). So it should be fine to use len.

If we want robust memory measurements, I'm afraid we'll have to explore different approaches like arena allocators.

Joris has since changed the corresponding check on metric names to use .len()

@jan-auer are you referring to this line? The reason why it uses .len() is because metric names are now represented by Arc<str>, not String (see #3279).

relay/relay-metrics/src/aggregator.rs

Line 84 in e614e91

mem::size_of::<Self>() + self.metric_name.len() + tags_cost(&self.tags)

No heavy objections against this change, but the cleanest solution would be to have two different estimation functions for memory footprint and serialization cost.
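
The suggestion to split the estimate in two could look roughly like the sketch below. Both function names are hypothetical and neither exists in Relay; the bodies simply reuse the two expressions from the diff above. The memory variant counts only the string buffers and ignores the overhead of the map itself.

```rust
use std::collections::BTreeMap;

/// Approximate serialized size: counts only the bytes of string content.
/// Hypothetical name, for illustration only.
fn tags_serialized_cost(tags: &BTreeMap<String, String>) -> usize {
    tags.iter().map(|(k, v)| k.len() + v.len()).sum()
}

/// Approximate heap footprint: counts the allocated string buffers,
/// including capacity that is reserved but unused.
/// Hypothetical name, for illustration only.
fn tags_memory_cost(tags: &BTreeMap<String, String>) -> usize {
    tags.iter().map(|(k, v)| k.capacity() + v.capacity()).sum()
}
```

A tag value built with `String::with_capacity(64)` that holds only `prod` would contribute 4 bytes to the first estimate but at least 64 to the second, which is the kind of gap described in this thread.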

jan-auer · 2024-04-04T14:24:38Z

relay-metrics/src/bucket.rs

    /// Namespaces and units must consist of ASCII characters and match the regular expression
    /// `/\w+/`. The name component of MRIs consist of unicode characters and must match the
-    /// regular expression `/\w[\w\d_-.]+/`. Note that the name must begin with a letter.
+    /// regular expression `/\w[\w\-.]*/`. Note that the name must begin with a letter.


The validation/normalization implementation in Relay differs from the rules for legal metric names:

We do not yet enforce the first letter. This will come in as a separate refactor.

Dashes are intentionally replaced with underscores. Right now, the product wants just underscores, but we still permit SDKs to send dashes for future proofing.

Dashes are intentionally replaced with underscores

Should we add this information to the doc comment here?

Actually, I just found that the first point is wrong. We do validate that the first character is a letter.

I've described this more clearly in the docs of try_normalize_metric_name now. To readers of the public Rust docs (i.e. client implementors) this information ideally shouldn't matter.

jjbayer · 2024-04-05T06:37:00Z

relay-metrics/src/aggregator.rs

jjbayer · 2024-04-05T06:40:31Z

relay-metrics/src/bucket.rs

jjbayer · 2024-04-05T06:54:26Z

relay-base-schema/src/metrics/mod.rs

    }

-    let normalize_re = NORMALIZE_RE.get_or_init(|| Regex::new("[^a-zA-Z0-9_/.]+").unwrap());
+    let normalize_re = NORMALIZE_RE.get_or_init(|| Regex::new("[^a-zA-Z0-9_.]+").unwrap());


Why is / not replaced anymore?

Also, should the . be escaped? Or is it not interpreted as a wildcard when it occurs within [...]?

Why is / not replaced anymore?

👀

Or is it not interpreted as a wildcard when it occurs within [...]?

It's a literal . inside a character class ([...]), where the dot has no special meaning.

/ was exempt from replacing and is now replaced. The original spec allowed slashes, but for a long while SDK guidelines have excluded it. We're now aligning with the latest character set described in the epic and on develop docs.
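
As a rough illustration of the new character set, the sketch below mimics replacing every match of `[^a-zA-Z0-9_.]+` without depending on the `regex` crate. It is not Relay's implementation: the function name is hypothetical, and two details are assumptions, namely that the replacement character is `_` (the thread only states this for dashes) and that "letter" means an ASCII letter.

```rust
/// Normalizes a metric name (hypothetical sketch, not Relay's code).
///
/// Returns `None` if the name does not start with an ASCII letter.
/// Otherwise, each run of characters outside `[a-zA-Z0-9_.]` collapses
/// into a single `_`, like replacing matches of `[^a-zA-Z0-9_.]+`.
fn normalize_metric_name(name: &str) -> Option<String> {
    // Assumption: the first character must be an ASCII letter.
    if !name.chars().next()?.is_ascii_alphabetic() {
        return None;
    }

    let mut normalized = String::with_capacity(name.len());
    let mut in_replaced_run = false;
    for c in name.chars() {
        if c.is_ascii_alphanumeric() || c == '_' || c == '.' {
            // Allowed characters are kept; `.` is a literal dot here.
            normalized.push(c);
            in_replaced_run = false;
        } else if !in_replaced_run {
            // First disallowed character of a run emits one underscore.
            normalized.push('_');
            in_replaced_run = true;
        }
    }
    Some(normalized)
}
```

With this sketch, `foo/bar` becomes `foo_bar` now that `/` is no longer in the allowed set, and `page-load.time` becomes `page_load.time`.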

jjbayer · 2024-04-05T06:55:12Z

relay-system/Cargo.toml

 relay-log = { workspace = true }
 relay-statsd = { workspace = true }
-tokio = { workspace = true, features = ["rt", "signal"] }
+tokio = { workspace = true, features = ["rt", "signal", "macros"] }


What is this for?

It fixes a compilation issue when the relay-system crate is compiled alone (e.g. to run tests). The crate actually depends on macros, we just didn't notice before as another workspace member enables this feature usually.

Dav1dde · 2024-04-05T07:43:51Z

relay-base-schema/src/metrics/mod.rs

jan-auer · 2024-04-05T10:37:11Z

Thanks for the reviews. I've addressed most comments. Going to merge once we reach consensus on #3358 (comment)

feat(metrics): Unicode escapes for tag values in statsd

546ae89

jan-auer requested a review from a team as a code owner March 28, 2024 14:35

fix: Lint

41349a4

jan-auer marked this pull request as draft March 28, 2024 15:04

jan-auer added 5 commits April 3, 2024 15:17

ref: Update to escape restricted

42b8469

fix: Update rules

ad36fb0

fix: Missing feature flag in relay-system

b8cb7b6

fix: Invalid use of string capacity

a07c78d

jan-auer commented Apr 4, 2024

test(metrics): Add test cases for escaping

2cdaa08

jan-auer self-assigned this Apr 4, 2024

Merge branch 'master' into feat/metrics-escape-tag-values

d052aae

jan-auer marked this pull request as ready for review April 4, 2024 14:21

jan-auer commented Apr 4, 2024

meta: Changelog

fcca3db

cleptric mentioned this pull request Apr 4, 2024

Project: Metrics Normalization getsentry/team-sdks#80

Closed

jjbayer approved these changes Apr 5, 2024

Dav1dde reviewed Apr 5, 2024

jan-auer added 2 commits April 5, 2024 12:18

doc: Describe divergence from the public docs

289b878

fix: Unicode control characters and tests

629efaa

jan-auer merged commit 9c756ac into master Apr 5, 2024

jan-auer deleted the feat/metrics-escape-tag-values branch April 5, 2024 11:53
