Skip to content
This repository was archived by the owner on Aug 30, 2019. It is now read-only.

Conversation

@AlexJF
Copy link

@AlexJF AlexJF commented Jun 5, 2018

An attempt at enforcing valid utf8 strings for span contents after normalization.

It works but has a non-negligible impact which I couldn't avoid by writing the same code with several variations (my own loop, etc...). Most of the time and allocations are actually inside the Rune handling code of the stdlib it seems.

Impact on my MBP:

Impact with no invalid utf-8 strings:

$ benchstat old.txt new-valid.txt
name             old time/op    new time/op    delta
Normalization-4     549ns ± 8%     671ns ± 0%  +22.13%  (p=0.016 n=5+4)

name             old alloc/op   new alloc/op   delta
Normalization-4      792B ± 0%      792B ± 0%     ~     (all equal)

name             old allocs/op  new allocs/op  delta
Normalization-4      9.00 ± 0%      9.00 ± 0%     ~     (all equal)

Impact with invalid utf-8 strings in Resource, Type and Meta:

$ benchstat old.txt new-invalid.txt
name             old time/op    new time/op    delta
Normalization-4     549ns ± 8%    1344ns ± 0%  +144.68%  (p=0.008 n=5+5)

name             old alloc/op   new alloc/op   delta
Normalization-4      792B ± 0%     1296B ± 0%   +63.64%  (p=0.008 n=5+5)

name             old allocs/op  new allocs/op  delta
Normalization-4      9.00 ± 0%     17.00 ± 0%   +88.89%  (p=0.008 n=5+5)

@AlexJF AlexJF requested review from gbbr and palazzem June 5, 2018 12:34
@AlexJF AlexJF force-pushed the alexjf/utf8-enforcement branch from 874f465 to b3ccc1b Compare June 5, 2018 13:00
Copy link
Contributor

@gbbr gbbr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Overall it looks good to me.

Based on our Slack discussion about the size of the dependency, I'll wait for you to go back to the standard library unicode/utf8 package before a second round of review.

Also, it would be super helpful to reason about this PR if you were to post the results of the benchmarks using golang.org/x/perf/cmd/benchstat in the PR description.

return s.normalize(true)
}

func (s *Span) normalize(utf8Enforce bool) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there is no call to this method using false other than in tests, let's remove the argument and always do this. Are there any reasons not obvious to me why we would want to do this?

return string(res), true
}

func forceUtf8(s string) (string, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say it might be (insignificantly) more efficient to have only one decoder per function call. Maybe it would be best to inline this, but I don't feel strongly about it.

If you decide to keep it here, can we just call it toUTF8 instead? I feel it would be much clearer. Force seems like a word that we would use with an operation that has the potential to fail but we are ignoring the error. I don't think this is the case.

}

func TestNormalizeInvalidUTF8(t *testing.T) {
invalidUtf8 := "test\x99\x8f"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how you called it UTF8 in the test function title, can we do the same everywhere else. s/Utf8/UTF8?

@AlexJF AlexJF force-pushed the alexjf/utf8-enforcement branch 4 times, most recently from e22cff3 to c9ec90c Compare June 5, 2018 13:34
@AlexJF
Copy link
Author

AlexJF commented Jun 5, 2018

Done, didn't know about this benchstat thing, useful :)

@AlexJF AlexJF changed the title Do UTF8 validation on string span fields that have no validation. model/normalizer: Enforce valid utf-8 during normalization. Jun 5, 2018
Copy link
Contributor

@gbbr gbbr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, just some nits.


for {
r, _, err := in.ReadRune()

Copy link
Contributor

@gbbr gbbr Jun 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you terribly mind removing all the empty lines? As far as I see, in general the practice is to have a statement, and then immediately after its error handling, so I'd love it if it would be possible to change this as:

r, _, err := in.ReadRune()
if err == io.EOF {
	return res.String(), nil
}
if err != nil {
	return "", err
}

Also, I've recently discovered strings.Builder which I think should be more efficient.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh great tip regarding the strings.Builder (also replaced the input with a strings.Reader). The new benchstat (compare it with the one in the PR description, using bytes.Buffer):

$ benchstat old.txt new.txt
name             old time/op    new time/op    delta
Normalization-4     549ns ± 8%    1097ns ± 3%  +99.67%  (p=0.008 n=5+5)

name             old alloc/op   new alloc/op   delta
Normalization-4      792B ± 0%      848B ± 0%   +7.07%  (p=0.008 n=5+5)

name             old allocs/op  new allocs/op  delta
Normalization-4      9.00 ± 0%     15.00 ± 0%  +66.67%  (p=0.008 n=5+5)

}
s.Type = utf8Type

// Meta
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please drop this comment?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@AlexJF
Copy link
Author

AlexJF commented Jun 6, 2018

Oh strings.Builder apparently requires go 1.10. Do we need compatibility with 1.9 or can we update the CI?

@gbbr
Copy link
Contributor

gbbr commented Jun 6, 2018

It's definitely not a problem to update CI. AFAIK we are aiming to have the same version in CI as we do in Agent 6.

@derekwbrown or @masci can you guys please confirm that it's ok for us to have a requirement for go1.10? (tldr; to use strings.Builder)

gbbr
gbbr previously approved these changes Jun 6, 2018
Copy link
Contributor

@gbbr gbbr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am approving this PR as a style and consistency review, not technical.

For the technical part, I'd love to hear more views. I wonder if this indicates a problem somewhere else in the pipeline. These strings are and should be UTF-8. How does this get broken? Is it msgpack? Is it the tracers?

@masci
Copy link

masci commented Jun 6, 2018

Ping @olivielpeau for review

@AlexJF
Copy link
Author

AlexJF commented Jun 6, 2018

My 2 cents:

We can't really control what gets sent to us and given that traces are encoded via protobuf and this was merged (golang/protobuf#499), then any non-utf8 traces would be immediately dropped at the agent during serialization time when/if we upgrade to latest versions.

This might be acceptable behaviour or not, that's more of a product decision. I'll just comment that, from a user's standpoint, it seems much more friendly to let the traces go through but clearly mark strings with invalid utf8 with . But as this PR shows, this comes with a certain cost.

Another thing we could do is have this utf8 normalization happen later on in the pipeline, after sampling. That could reduce the performance impact but would mean we'll have traces flying through our pipeline with invalid utf8. This might not be a problem currently but if we introduce more parts later on it's something to keep in mind.

PS: I've seen some invalid UTF-8 get included in some Redis traces. I can try to track it down to language but it's something that's very easy to happen if one strings a byte array.

}
utf8Resource, err := toUTF8(s.Resource)
if err != nil {
return fmt.Errorf("span.normalize: error converting resoutce to utf-8: %s", err.Error())

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo at resoutce.

}

s.Meta[k] = utf8V
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Afair when Span.Normalize returns an error, we drop the full span / trace (no stats). It means it can break significantly the product just for a small unicode issue on an meta attribute. And that might be more frequent that we'd hope (let's say one put an uncontroller variable in a meta which might end up with binary content).

Could we instead replace non-unicode strings with a placeholder (like we do for failing SQL quantization)? It would apply to resource, type and meta too.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what this PR does (check the tests). The error handling here is just to be extra sure but looking at the code for ReadRune() and WriteRune() no errors are returned except forEOF which is handled properly.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment to clarify toUTF8's behaviour.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, sneaky. And how could we be 100% sure that in.ReadRune() won't return a non io.EOF error?
As if it happens, we are back to the situation where Normalize return an error -> trace dropped.

When reading it isn't much obvious. If we think we can have a "safe" toUTF8 then we should provide an implementation not returning error. And if errors can still happen, Normalize should handle them in a safe manner (non-dropping replacement).

What do you think?

@olivielpeau
Copy link
Member

@gbbr Please don't rely on go 1.10 features yet, we're working on using go 1.10 for the 6.3 release but if we run into instabilities during QA we may have to revert to go 1.9

if err != nil {
return fmt.Errorf("span.normalize: error converting resource to utf-8: %s", err.Error())
}
s.Resource = utf8Resource

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to test s.Resource == "" after the utf8 replacement? (I don't know if they actually are cases were it'd be false before the replacement / true after ; but I guess it is more logic to test for emptiness at the end anyway).

if s.Resource == "" {
return errors.New("span.normalize: empty `Resource`")
}
utf8Resource, err := toUTF8(s.Resource)
Copy link

@bmermet bmermet Jun 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this implementation we are effectively allocating twice each string processed by the agent. With the assumption that these UTF-8 errors should be very rare, wouldn't it make sense to first detect if a string is invalid UTF-8 before doing the substitution?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good point. I was worried that checking UTF-8 validation would itself be almost as expensive but just ran some checks and it seems to only bump 22.36% of time:

$ benchstat old.txt new-valid.txt
name             old time/op    new time/op    delta
Normalization-4     549ns ± 8%     672ns ± 1%  +22.36%  (p=0.008 n=5+5)

name             old alloc/op   new alloc/op   delta
Normalization-4      792B ± 0%      792B ± 0%     ~     (all equal)

name             old allocs/op  new allocs/op  delta
Normalization-4      9.00 ± 0%      9.00 ± 0%     ~     (all equal)

So I might just end up doing that 👍

Copy link
Contributor

@gbbr gbbr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good, just two nits.

return s
}

in := bytes.NewBufferString(s)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't really need a buffer here, strings.NewReader should be good enough.

}

in := bytes.NewBufferString(s)
out := bytes.Buffer{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind changing this to:

var out bytes.Buffer

I honestly find your version more uniform given the immediate declaration above, but I'd like to stay consistent.

@gbbr
Copy link
Contributor

gbbr commented Jun 12, 2018

All good. Please make sure to "Squash & Merge" when ready.

@AlexJF AlexJF force-pushed the alexjf/utf8-enforcement branch from 85713f0 to d7fa6e3 Compare June 13, 2018 09:32
@AlexJF AlexJF merged commit 9fff78c into master Jun 13, 2018
@AlexJF AlexJF deleted the alexjf/utf8-enforcement branch June 13, 2018 13:56
@gbbr gbbr modified the milestones: 6.3.1, next Jun 26, 2018
@gbbr gbbr removed this from the next milestone Sep 11, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants