ref(sourcebundle): Check UTF-8 validity memory efficiently#890
ref(sourcebundle): Check UTF-8 validity memory efficiently#890szokeasaurusrex merged 6 commits intomasterfrom
Conversation
4942f9e to
d7e4852
Compare
Swatinem
left a comment
There was a problem hiding this comment.
there must be some kind of pre-existing crate that does this. I’m not sure we should implement this ourselves TBH
|
@Swatinem I tried to find a crate that does this before implementing, but I could not find one. I agree, using a crate would be the ideal solution if one exists |
loewenheim
left a comment
There was a problem hiding this comment.
Impressive work, I just have some nits. I agree that it's a shame to have to implement this ourselves.
Have you verified that if the UTF8Reader errors, nothing is written to the writer by io::copy? Also, this type would be ripe for fuzzing/property-based testing (generate random strings and see if they get read correctly).
|
The code doesn't look like it even has to guarantee UTF-8 Validity, isn't it fine if the file contents are just copied as bytes? A reader must always ensure correctness anyways. |
@Dav1dde, @loewenheim added this in #816. The context is not entirely clear to me from that PR, but if I remember correctly, I think the reason it needed to be added was to have client-side verification that users upload valid source bundles to Sentry. Valid source bundles can only contain UTF-8 encoded data. |
d7e4852 to
7dd2f8f
Compare
|
I think I have addressed all feedback. Please let me know if I missed something |
loewenheim
left a comment
There was a problem hiding this comment.
One small typo, otherwise LGTM.
7dd2f8f to
6115047
Compare
|
I have verified with the memory profiler that when using a version of |
|
Hey @loewenheim, just saw this:
Regarding the first question: no, I have not tested this, but I would expect that we would write everything to As for the fuzzing/property based testing, I am not sure how to do this, but it sounds like a good idea! |
|
My worry is this: if we write into the writer up to the first read error, how does that interact with As for proptesting, I've used it a fair bit and would be happy to give you an introduction |
6115047 to
84304d0
Compare
|
@loewenheim, I have addressed your feedback |
84304d0 to
094cd96
Compare
The current check to ensure a sourcebundle is valid UTF-8 reads the entire sourcebundle file into memory. This is inefficient for large files. This PR introduces a UTF8Reader which wraps any reader. The UTF8Reader ensures that the stream is valid UTF8 as it is being read, while only requiring a small amount of memory (currently 8 KiB) to be allocated as a buffer.
Co-authored-by: Sebastian Zivota <loewenheim@users.noreply.github.com>
89a44ca to
c81365a
Compare
Symbolic version `12.13.3` includes [a change](getsentry/symbolic#890), which will reduce, in some cases significantly, the memory usage of sourcemap uploads. ref #2344
The current check to ensure a sourcebundle is valid UTF-8 reads the entire sourcebundle file into memory. This is inefficient for large files.
This PR introduces a UTF8Reader which wraps any reader. The UTF8Reader ensures that the stream is valid UTF8 as it is being read, while only requiring a small amount of memory (currently 8 KiB) to be allocated as a buffer.