model/normalizer: Enforce valid utf-8 during normalization. #431
Conversation
Force-pushed 874f465 to b3ccc1b
gbbr left a comment:
Thanks. Overall it looks good to me.
Based on our Slack discussion about the size of the dependency, I'll wait for you to go back to the standard library unicode/utf8 package before a second round of review.
Also, it would be super helpful to reason about this PR if you were to post the results of the benchmarks using golang.org/x/perf/cmd/benchstat in the PR description.
model/normalizer.go

```go
	return s.normalize(true)
}

func (s *Span) normalize(utf8Enforce bool) error {
```
Since there is no call to this method using false other than in tests, let's remove the argument and always do this. Are there any reasons not obvious to me why we would want to do this?
model/normalizer.go

```go
	return string(res), true
}

func forceUtf8(s string) (string, error) {
```
I'd say it might be (insignificantly) more efficient to have only one decoder per function call. Maybe it would be best to inline this, but I don't feel strongly about it.
If you decide to keep it here, can we just call it toUTF8 instead? I feel it would be much clearer. Force seems like a word that we would use with an operation that has the potential to fail but we are ignoring the error. I don't think this is the case.
model/normalizer_test.go

```go
}

func TestNormalizeInvalidUTF8(t *testing.T) {
	invalidUtf8 := "test\x99\x8f"
```
I like how you called it UTF8 in the test function title, can we do the same everywhere else. s/Utf8/UTF8?
Force-pushed e22cff3 to c9ec90c
Done, didn't know about this benchstat thing, useful :)
gbbr left a comment:
Looking good, just some nits.
model/normalizer.go

```go
for {
	r, _, err := in.ReadRune()
```
Would you terribly mind removing all the empty lines? As far as I see, the general practice is to have a statement and then its error handling immediately after, so I'd love it if it would be possible to change this to:

```go
r, _, err := in.ReadRune()
if err == io.EOF {
	return res.String(), nil
}
if err != nil {
	return "", err
}
```

Also, I've recently discovered strings.Builder which I think should be more efficient.
Oh great tip regarding the strings.Builder (also replaced the input with a strings.Reader). The new benchstat (compare it with the one in the PR description, using bytes.Buffer):
```
$ benchstat old.txt new.txt
name             old time/op    new time/op    delta
Normalization-4  549ns ± 8%     1097ns ± 3%    +99.67%  (p=0.008 n=5+5)

name             old alloc/op   new alloc/op   delta
Normalization-4  792B ± 0%      848B ± 0%      +7.07%   (p=0.008 n=5+5)

name             old allocs/op  new allocs/op  delta
Normalization-4  9.00 ± 0%      15.00 ± 0%     +66.67%  (p=0.008 n=5+5)
```
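Putting the requested pieces together, the rune-copy loop with strings.Reader and strings.Builder can be sketched as a complete helper. This is an illustrative reconstruction, not the PR's final code; the name toUTF8 is taken from the naming discussion earlier in the review.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// toUTF8 copies s rune by rune. Invalid byte sequences decode as
// utf8.RuneError, so writing each rune back replaces them with the
// Unicode replacement character U+FFFD.
func toUTF8(s string) (string, error) {
	in := strings.NewReader(s)
	var res strings.Builder
	for {
		r, _, err := in.ReadRune()
		if err == io.EOF {
			return res.String(), nil
		}
		if err != nil {
			return "", err
		}
		res.WriteRune(r)
	}
}

func main() {
	out, _ := toUTF8("test\x99\x8f")
	fmt.Printf("%q\n", out)
}
```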
model/normalizer.go

```go
}
s.Type = utf8Type

// Meta
```
Can we please drop this comment?
Done
Oh

It's definitely not a problem to update CI. AFAIK we are aiming to have the same version in CI as we do in Agent 6. @derekwbrown or @masci can you guys please confirm that it's ok for us to have a requirement for go1.10? (tldr; to use …)
gbbr left a comment:
I am approving this PR as a style and consistency review, not technical.
For the technical part, I'd love to hear more views. I wonder if this indicates a problem somewhere else in the pipeline. These strings are and should be UTF-8. How does this get broken? Is it msgpack? Is it the tracers?
Ping @olivielpeau for review
My 2 cents: We can't really control what gets sent to us, and given that traces are encoded via protobuf and golang/protobuf#499 was merged, any non-UTF-8 traces would be immediately dropped at the agent during serialization when/if we upgrade to the latest versions. Whether that's acceptable behaviour is more of a product decision. I'll just comment that, from a user's standpoint, it seems much more friendly to let the traces go through but clearly mark strings containing invalid UTF-8.

Another thing we could do is have this UTF-8 normalization happen later in the pipeline, after sampling. That could reduce the performance impact, but it would mean we'd have traces flying through our pipeline with invalid UTF-8. This might not be a problem currently, but if we introduce more parts later on it's something to keep in mind.

PS: I've seen some invalid UTF-8 get included in some Redis traces. I can try to track it down to a language, but it's something that happens very easily if one converts a byte array to a string.
model/normalizer.go

```go
}
utf8Resource, err := toUTF8(s.Resource)
if err != nil {
	return fmt.Errorf("span.normalize: error converting resoutce to utf-8: %s", err.Error())
```
typo at resoutce.
```go
	}

	s.Meta[k] = utf8V
}
```
AFAIR, when Span.Normalize returns an error, we drop the full span/trace (no stats). That means a small unicode issue on a meta attribute can break the product significantly. And it might happen more frequently than we'd hope (say someone puts an uncontrolled variable in a meta value, and it ends up containing binary content).
Could we instead replace non-UTF-8 strings with a placeholder (like we do for failing SQL quantization)? It would apply to resource, type and meta too.
This is what this PR does (check the tests). The error handling here is just to be extra safe: looking at the code for ReadRune() and WriteRune(), no errors are returned except for EOF, which is handled properly.
Added a comment to clarify toUTF8's behaviour.
Hmm, sneaky. And how could we be 100% sure that in.ReadRune() won't return a non-io.EOF error?
If it happens, we are back to the situation where Normalize returns an error -> trace dropped.
Reading the code, it isn't very obvious. If we think we can have a "safe" toUTF8, then we should provide an implementation that doesn't return an error. And if errors can still happen, Normalize should handle them in a safe manner (replace rather than drop).
What do you think?
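An error-free variant is indeed possible: strings.Reader.ReadRune reads from memory and can only ever fail with io.EOF, so the error return can be dropped entirely and Normalize never has a reason to drop the trace. A sketch, using the hypothetical name toUTF8Safe:

```go
package main

import (
	"fmt"
	"strings"
)

// toUTF8Safe rebuilds s with invalid byte sequences replaced by the
// replacement character U+FFFD. strings.Reader.ReadRune only ever
// returns io.EOF, so no error can escape to the caller.
func toUTF8Safe(s string) string {
	in := strings.NewReader(s)
	var res strings.Builder
	for {
		r, _, err := in.ReadRune()
		if err != nil { // can only be io.EOF
			return res.String()
		}
		res.WriteRune(r)
	}
}

func main() {
	fmt.Printf("%q\n", toUTF8Safe("test\x99\x8f"))
}
```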
@gbbr Please don't rely on go 1.10 features yet. We're working on using go 1.10 for the 6.3 release, but if we run into instabilities during QA we may have to revert to go 1.9.
model/normalizer.go

```go
if err != nil {
	return fmt.Errorf("span.normalize: error converting resource to utf-8: %s", err.Error())
}
s.Resource = utf8Resource
```
We might want to test s.Resource == "" after the UTF-8 replacement? (I don't know if there actually are cases where it'd be false before the replacement and true after, but I guess it's more logical to test for emptiness at the end anyway.)
model/normalizer.go

```go
if s.Resource == "" {
	return errors.New("span.normalize: empty `Resource`")
}
utf8Resource, err := toUTF8(s.Resource)
```
With this implementation we are effectively allocating twice each string processed by the agent. With the assumption that these UTF-8 errors should be very rare, wouldn't it make sense to first detect if a string is invalid UTF-8 before doing the substitution?
Yeah, good point. I was worried that checking UTF-8 validity would itself be almost as expensive, but I just ran some checks and it only adds about 22.36% to the run time:
```
$ benchstat old.txt new-valid.txt
name             old time/op    new time/op    delta
Normalization-4  549ns ± 8%     672ns ± 1%     +22.36%  (p=0.008 n=5+5)

name             old alloc/op   new alloc/op   delta
Normalization-4  792B ± 0%      792B ± 0%      ~        (all equal)

name             old allocs/op  new allocs/op  delta
Normalization-4  9.00 ± 0%      9.00 ± 0%      ~        (all equal)
```
So I might just end up doing that 👍
gbbr left a comment:
All good, just two nits.
model/normalizer.go

```go
	return s
}

in := bytes.NewBufferString(s)
```
We don't really need a buffer here, strings.NewReader should be good enough.
model/normalizer.go

```go
}

in := bytes.NewBufferString(s)
out := bytes.Buffer{}
```
Would you mind changing this to:

```go
var out bytes.Buffer
```

I honestly find your version more uniform given the immediate declaration above, but I'd like to stay consistent.
All good. Please make sure to "Squash & Merge" when ready.
Force-pushed 85713f0 to d7fa6e3
An attempt at enforcing valid utf8 strings for span contents after normalization.
It works but has a non-negligible impact, which I couldn't avoid by rewriting the same code in several variations (my own loop, etc.). Most of the time and allocations seem to be inside the standard library's rune-handling code.
Impact on my MBP:
Impact with no invalid utf-8 strings:
Impact with invalid utf-8 strings in Resource, Type and Meta: