model/normalizer: Enforce valid utf-8 during normalization. #431
Conversation
Force-pushed 874f465 to b3ccc1b
gbbr left a comment:
Thanks. Overall it looks good to me.
Based on our Slack discussion about the size of the dependency, I'll wait for you to go back to the standard library unicode/utf8 package before a second round of review.
Also, it would be super helpful to reason about this PR if you were to post the results of the benchmarks using golang.org/x/perf/cmd/benchstat in the PR description.
model/normalizer.go

```go
	return s.normalize(true)
}

func (s *Span) normalize(utf8Enforce bool) error {
```
Since there is no call to this method using false other than in tests, let's remove the argument and always do this. Are there any reasons not obvious to me why we would want to do this?
model/normalizer.go

```go
	return string(res), true
}

func forceUtf8(s string) (string, error) {
```
I'd say it might be (insignificantly) more efficient to have only one decoder per function call. Maybe it would be best to inline this, but I don't feel strongly about it.
If you decide to keep it here, can we just call it toUTF8 instead? I feel it would be much clearer. Force seems like a word that we would use with an operation that has the potential to fail but we are ignoring the error. I don't think this is the case.
model/normalizer_test.go

```go
}

func TestNormalizeInvalidUTF8(t *testing.T) {
	invalidUtf8 := "test\x99\x8f"
```
I like how you called it UTF8 in the test function title, can we do the same everywhere else. s/Utf8/UTF8?
Force-pushed e22cff3 to c9ec90c
Done, didn't know about this benchstat thing, useful :)
gbbr left a comment:
Looking good, just some nits.
model/normalizer.go

```go
for {
	r, _, err := in.ReadRune()
```
Would you terribly mind removing all the empty lines? As far as I see, the general practice is to have a statement and then its error handling immediately after, so I'd love it if it would be possible to change this to:

```go
r, _, err := in.ReadRune()
if err == io.EOF {
	return res.String(), nil
}
if err != nil {
	return "", err
}
```

Also, I've recently discovered strings.Builder which I think should be more efficient.
Oh great tip regarding the strings.Builder (also replaced the input with a strings.Reader). The new benchstat (compare it with the one in the PR description, using bytes.Buffer):
```
$ benchstat old.txt new.txt
name             old time/op    new time/op    delta
Normalization-4  549ns ± 8%     1097ns ± 3%    +99.67%  (p=0.008 n=5+5)

name             old alloc/op   new alloc/op   delta
Normalization-4  792B ± 0%      848B ± 0%      +7.07%   (p=0.008 n=5+5)

name             old allocs/op  new allocs/op  delta
Normalization-4  9.00 ± 0%      15.00 ± 0%     +66.67%  (p=0.008 n=5+5)
```
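Putting the requested pieces together, the rune-copy loop with strings.Reader and strings.Builder can be sketched as a complete helper. This is an illustrative reconstruction, not the PR's final code; the name toUTF8 is taken from the naming discussion earlier in the review.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// toUTF8 copies s rune by rune. Invalid byte sequences decode as
// utf8.RuneError, so writing each rune back replaces them with the
// Unicode replacement character U+FFFD.
func toUTF8(s string) (string, error) {
	in := strings.NewReader(s)
	var res strings.Builder
	for {
		r, _, err := in.ReadRune()
		if err == io.EOF {
			return res.String(), nil
		}
		if err != nil {
			return "", err
		}
		res.WriteRune(r)
	}
}

func main() {
	out, _ := toUTF8("test\x99\x8f")
	fmt.Printf("%q\n", out)
}
```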
model/normalizer.go

```go
}
s.Type = utf8Type

// Meta
```
Can we please drop this comment?
Done
Oh

It's definitely not a problem to update CI. AFAIK we are aiming to have the same version in CI as we do in Agent 6. @derekwbrown or @masci can you guys please confirm that it's ok for us to have a requirement for go1.10? (tldr; to use …)
gbbr left a comment:
I am approving this PR as a style and consistency review, not technical.
For the technical part, I'd love to hear more views. I wonder if this indicates a problem somewhere else in the pipeline. These strings are and should be UTF-8. How does this get broken? Is it msgpack? Is it the tracers?
Ping @olivielpeau for review
My 2 cents: We can't really control what gets sent to us, and given that traces are encoded via protobuf and golang/protobuf#499 was merged, any non-UTF-8 traces would be immediately dropped at the agent during serialization when/if we upgrade to the latest versions. Whether that's acceptable behaviour is more of a product decision. I'll just comment that, from a user's standpoint, it seems much more friendly to let the traces go through but clearly mark strings containing invalid UTF-8.

Another thing we could do is have this UTF-8 normalization happen later in the pipeline, after sampling. That could reduce the performance impact, but it would mean we'd have traces flying through our pipeline with invalid UTF-8. This might not be a problem currently, but if we introduce more parts later on it's something to keep in mind.

PS: I've seen some invalid UTF-8 get included in some Redis traces. I can try to track it down to a language, but it's something that happens very easily if one converts a byte array to a string.
model/normalizer.go

```go
}
utf8Resource, err := toUTF8(s.Resource)
if err != nil {
	return fmt.Errorf("span.normalize: error converting resoutce to utf-8: %s", err.Error())
```
typo at resoutce.
```go
	}

	s.Meta[k] = utf8V
}
```
AFAIR, when Span.Normalize returns an error, we drop the full span/trace (no stats). That means a small unicode issue on a meta attribute can break the product significantly. And it might happen more frequently than we'd hope (say someone puts an uncontrolled variable in a meta value, and it ends up containing binary content).
Could we instead replace non-UTF-8 strings with a placeholder (like we do for failing SQL quantization)? It would apply to resource, type and meta too.
This is what this PR does (check the tests). The error handling here is just to be extra safe: looking at the code for ReadRune() and WriteRune(), no errors are returned except for EOF, which is handled properly.
Added a comment to clarify toUTF8's behaviour.
Hmm, sneaky. And how could we be 100% sure that in.ReadRune() won't return a non-io.EOF error?
If it happens, we are back to the situation where Normalize returns an error -> trace dropped.
Reading the code, it isn't very obvious. If we think we can have a "safe" toUTF8, then we should provide an implementation that doesn't return an error. And if errors can still happen, Normalize should handle them in a safe manner (replace rather than drop).
What do you think?
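An error-free variant is indeed possible: strings.Reader.ReadRune reads from memory and can only ever fail with io.EOF, so the error return can be dropped entirely and Normalize never has a reason to drop the trace. A sketch, using the hypothetical name toUTF8Safe:

```go
package main

import (
	"fmt"
	"strings"
)

// toUTF8Safe rebuilds s with invalid byte sequences replaced by the
// replacement character U+FFFD. strings.Reader.ReadRune only ever
// returns io.EOF, so no error can escape to the caller.
func toUTF8Safe(s string) string {
	in := strings.NewReader(s)
	var res strings.Builder
	for {
		r, _, err := in.ReadRune()
		if err != nil { // can only be io.EOF
			return res.String()
		}
		res.WriteRune(r)
	}
}

func main() {
	fmt.Printf("%q\n", toUTF8Safe("test\x99\x8f"))
}
```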
@gbbr Please don't rely on go 1.10 features yet. We're working on using go 1.10 for the 6.3 release, but if we run into instabilities during QA we may have to revert to go 1.9.
model/normalizer.go

```go
if err != nil {
	return fmt.Errorf("span.normalize: error converting resource to utf-8: %s", err.Error())
}
s.Resource = utf8Resource
```
We might want to test s.Resource == "" after the UTF-8 replacement? (I don't know if there actually are cases where it'd be false before the replacement and true after, but I guess it's more logical to test for emptiness at the end anyway.)
model/normalizer.go

```go
if s.Resource == "" {
	return errors.New("span.normalize: empty `Resource`")
}
utf8Resource, err := toUTF8(s.Resource)
```
With this implementation we are effectively allocating twice each string processed by the agent. With the assumption that these UTF-8 errors should be very rare, wouldn't it make sense to first detect if a string is invalid UTF-8 before doing the substitution?
Yeah, good point. I was worried that checking UTF-8 validity would itself be almost as expensive, but I just ran some checks and it only adds about 22.36% to the run time:
```
$ benchstat old.txt new-valid.txt
name             old time/op    new time/op    delta
Normalization-4  549ns ± 8%     672ns ± 1%     +22.36%  (p=0.008 n=5+5)

name             old alloc/op   new alloc/op   delta
Normalization-4  792B ± 0%      792B ± 0%      ~        (all equal)

name             old allocs/op  new allocs/op  delta
Normalization-4  9.00 ± 0%      9.00 ± 0%      ~        (all equal)
```
So I might just end up doing that 👍
gbbr left a comment:
All good, just two nits.
model/normalizer.go

```go
	return s
}

in := bytes.NewBufferString(s)
```
We don't really need a buffer here, strings.NewReader should be good enough.
model/normalizer.go

```go
}

in := bytes.NewBufferString(s)
out := bytes.Buffer{}
```
Would you mind changing this to:

```go
var out bytes.Buffer
```

I honestly find your version more uniform given the immediate declaration above, but I'd like to stay consistent.
All good. Please make sure to "Squash & Merge" when ready.
Force-pushed 85713f0 to d7fa6e3
An attempt at enforcing valid utf8 strings for span contents after normalization.
It works but has a non-negligible impact, which I couldn't avoid by rewriting the same code in several variations (my own loop, etc.). Most of the time and allocations seem to be inside the standard library's rune-handling code.
Impact on my MBP:
Impact with no invalid utf-8 strings:
Impact with invalid utf-8 strings in Resource, Type and Meta: