Skip to content

feat(metrics): Unicode escapes for tag values in statsd#3358

Merged
jan-auer merged 12 commits intomasterfrom
feat/metrics-escape-tag-values
Apr 5, 2024
Merged

feat(metrics): Unicode escapes for tag values in statsd#3358
jan-auer merged 12 commits intomasterfrom
feat/metrics-escape-tag-values

Conversation

@jan-auer
Copy link
Copy Markdown
Member

@jan-auer jan-auer commented Mar 28, 2024

Disclaimer: See the epic for more context.

Adds support for escape sequences based on best practices recommended in RFC 5137.

Tag values allow all printable unicode characters. Control characters have to be stripped from output. There is a list of restricted characters that have to be escaped according to the following mapping:

  • Tab is escaped as \t.
  • Carriage return is escaped as \r.
  • Line feed is escaped as \n.
  • Backslash is escaped as \\.
  • Commas and pipes are given unicode escapes in the form \u{2c} and \u{7c}, respectively.

Note: Documentation previously allowed all \s in tag values, which
included newlines. These are not legal and would lead to invalid statsd
payloads. Newlines and all whitespace characters other than a plain space now
require an escape sequence and will be removed as control characters.

Epic: https://github.com/getsentry/team-ingest/issues/304

@jan-auer jan-auer requested a review from a team as a code owner March 28, 2024 14:35
@jan-auer jan-auer marked this pull request as draft March 28, 2024 15:04
jan-auer added 5 commits April 3, 2024 15:17
* master:
  feat(metric-stats): Report cardinality to metric stats (#3360)
  release: 0.8.56
  fix(perfscore): Adds span op tag to perf score totals (#3326)
  ref(profiles): Return retention_days as part of the Kafka message (#3362)
  ref(filter): Add GTmetrix to the list of web crawlers (#3363)
  fix: Fix kafka topic default (#3350)
  ref(normalization): Remove duplicated normalization (#3355)
  feat(feedback): Emit outcomes for user feedback events (#3026)
  feat(cardinality): Implement cardinality reporting (#3342)
/// because data structures or their serialization have overheads.
pub fn tags_cost(tags: &BTreeMap<String, String>) -> usize {
tags.iter().map(|(k, v)| k.capacity() + v.capacity()).sum()
tags.iter().map(|(k, v)| k.len() + v.len()).sum()
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an unrelated bugfix that showed up in one of the tests: By using capacity, we were overestimating the size of the buckets. Due to the introduction of unescaper::unescape, the allocated tag strings were larger than their contents, which changed test behavior.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mh I wonder if this wasn't deliberate, because capacity is the cost of the metric in memory (just doesn't work for serialization).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was deliberate, I originally used capacity() since we used this to control memory consumption. More than that, we now rely on this for serialization now and Joris has since changed the corresponding check on metric names to use .len(). So it should be fine to use len.

If we want robust memory measurements, I'm afraid we'll have to explore different approaches like arena allocators.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Joris has since changed the corresponding check on metric names to use .len()

@jan-auer are you referring to this line? The reason why it uses .len() is because metric names are now represented by Arc<str>, not String (see #3279).

mem::size_of::<Self>() + self.metric_name.len() + tags_cost(&self.tags)

No heavy objections against this change, but the cleanest solution would be to have two different estimation functions for memory footprint and serialization cost.

@jan-auer jan-auer self-assigned this Apr 4, 2024
@jan-auer jan-auer marked this pull request as ready for review April 4, 2024 14:21
/// Namespaces and units must consist of ASCII characters and match the regular expression
/// `/\w+/`. The name component of MRIs consist of unicode characters and must match the
/// regular expression `/\w[\w\d_-.]+/`. Note that the name must begin with a letter.
/// regular expression `/\w[\w\-.]*/`. Note that the name must begin with a letter.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The validation/normalization implementation in Relay differs from the rules for legal metric names:

  • We do not yet enforce the first letter. This will come in as a separate refactor.
  • Dashes are intentionally replaced with underscores. Right now, the product wants just underscores, but we still permit SDKs to send dashes for future proofing.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dashes are intentionally replaced with underscores

Should we add this information to the doc comment here?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I just found that the first point is wrong. We do validate that the first character is a letter.

I've described this more clearly in the docs of try_normalize_metric_name now. To readers of the public Rust docs (i.e. client implementors) this information ideally shouldn't matter.

/// because data structures or their serialization have overheads.
pub fn tags_cost(tags: &BTreeMap<String, String>) -> usize {
tags.iter().map(|(k, v)| k.capacity() + v.capacity()).sum()
tags.iter().map(|(k, v)| k.len() + v.len()).sum()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Joris has since changed the corresponding check on metric names to use .len()

@jan-auer are you referring to this line? The reason why it uses .len() is because metric names are now represented by Arc<str>, not String (see #3279).

mem::size_of::<Self>() + self.metric_name.len() + tags_cost(&self.tags)

No heavy objections against this change, but the cleanest solution would be to have two different estimation functions for memory footprint and serialization cost.

/// Namespaces and units must consist of ASCII characters and match the regular expression
/// `/\w+/`. The name component of MRIs consist of unicode characters and must match the
/// regular expression `/\w[\w\d_-.]+/`. Note that the name must begin with a letter.
/// regular expression `/\w[\w\-.]*/`. Note that the name must begin with a letter.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dashes are intentionally replaced with underscores

Should we add this information to the doc comment here?

}

let normalize_re = NORMALIZE_RE.get_or_init(|| Regex::new("[^a-zA-Z0-9_/.]+").unwrap());
let normalize_re = NORMALIZE_RE.get_or_init(|| Regex::new("[^a-zA-Z0-9_.]+").unwrap());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is / not replaced anymore?

Also, should the . be escaped? Or is it not interpreted as a wildcard when it occurs within [...]?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is / not replaced anymore?

👀

Or is it not interpreted as a wildcard when it occurs within [...]?

It's a literal . in a match group.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/ was exempt from replacing and is now replaced. The original spec allowed slashes, but for a long while SDK guidelines have excluded it. We're now aligning with the latest character set described in the epic and on develop docs.

relay-log = { workspace = true }
relay-statsd = { workspace = true }
tokio = { workspace = true, features = ["rt", "signal"] }
tokio = { workspace = true, features = ["rt", "signal", "macros"] }
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this for?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It fixes a compilation issue when the relay-system crate is compiled alone (e.g. to run tests). The crate actually depends on macros, we just didn't notice before as another workspace member enables this feature usually.

}

let normalize_re = NORMALIZE_RE.get_or_init(|| Regex::new("[^a-zA-Z0-9_/.]+").unwrap());
let normalize_re = NORMALIZE_RE.get_or_init(|| Regex::new("[^a-zA-Z0-9_.]+").unwrap());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is / not replaced anymore?

👀

Or is it not interpreted as a wildcard when it occurs within [...]?

It's a literal . in a match group.

@jan-auer
Copy link
Copy Markdown
Member Author

jan-auer commented Apr 5, 2024

Thanks for the reviews. I've addressed most comments. Going to merge once we reach consensus on #3358 (comment)

@jan-auer jan-auer merged commit 9c756ac into master Apr 5, 2024
@jan-auer jan-auer deleted the feat/metrics-escape-tag-values branch April 5, 2024 11:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants