Fix the max token size threshold to correctly compute to 125MB for Base64 bytes. (#40792)
Conversation
scalablecory left a comment:
looks good other than a flaky test.
```csharp
}
catch (OutOfMemoryException)
{
    return;
```
We have this pattern elsewhere in the code base, but you bring up a good point. We can exclude this test on Linux. It should be deterministic for other OSes.
```csharp
@@ -63,7 +63,7 @@ internal static class JsonConstants
public const int MaxEscapedTokenSize = 1_000_000_000; // Max size for already escaped value.
```
Where does this value come from?
We compute the maximum required space to write a property name and value string, and then ask for a buffer of that size from the `IBufferWriter<byte>` (the call to `GetSpan(sizeHint)`).
What's the largest key/value pair that can fit after making a single call to `GetSpan(...)`?
Say you want to write `"really large property name": " really large value"`. You do this as follows:
```csharp
// This needs to call IBW.GetSpan with the required size to write this data into it
jsonWriter.WriteString("really large property name", " really large value");
```
The maximum feasible is roughly `int.MaxValue - 56` (for an array-backed buffer).
Then we have up to 1,000 levels of depth, so accounting for indentation (and quotes, spaces, the colon, etc.), let's round the maximum feasible down to 2 billion.
We have 1 billion for the property name string, and 1 billion for the value string. That's where this limit comes from.
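The derivation above can be checked with simple arithmetic (sketched in Python for brevity; the actual constant is C#'s `JsonConstants.MaxEscapedTokenSize`):

```python
INT_MAX = 2**31 - 1             # int.MaxValue in C#
MAX_ARRAY_BYTES = INT_MAX - 56  # rough ceiling for an array-backed buffer

# Round down to 2 billion to leave headroom for up to 1000 levels of
# indentation plus quotes, the colon, spaces, etc., then split evenly
# between the property name and the value.
FEASIBLE = 2_000_000_000
MAX_ESCAPED_TOKEN_SIZE = FEASIBLE // 2

assert FEASIBLE <= MAX_ARRAY_BYTES
print(MAX_ESCAPED_TOKEN_SIZE)  # 1000000000
```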
Sure, but why not just ask the IBufferWriter for what's needed, and let it fail if it can't provide the space?
Does our read implementation have some sanity limits that we're trying to keep our write implementation under?
No, the reader doesn't have such limitations. The limitation in the writer was mainly to avoid making multiple calls to the interface while writing for perf (for the common path).
```diff
  public const int MaxEscapedTokenSize = 1_000_000_000; // Max size for already escaped value.
  public const int MaxUnescapedTokenSize = MaxEscapedTokenSize / MaxExpansionFactorWhileEscaping; // 166_666_666 bytes
- public const int MaxBase46ValueTokenSize = (MaxEscapedTokenSize >> 2 * 3) / MaxExpansionFactorWhileEscaping; // 125_000_000 bytes
+ public const int MaxBase64ValueTokenSize = (MaxEscapedTokenSize >> 2) * 3 / MaxExpansionFactorWhileEscaping; // 125_000_000 bytes
```
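The renamed constant aside, the root cause is operator precedence: in C# (as in Python), `*` binds tighter than `>>`, so `MaxEscapedTokenSize >> 2 * 3` shifts by 6 instead of shifting by 2 and then multiplying by 3 (Base64 stores 3 raw bytes in every 4 encoded characters). A quick sketch in Python (the real constants are C#; `6` is `MaxExpansionFactorWhileEscaping`, per the `MaxUnescapedTokenSize` comment):

```python
MAX_ESCAPED = 1_000_000_000  # MaxEscapedTokenSize
EXPANSION = 6                # MaxExpansionFactorWhileEscaping

# Buggy: ">> 2 * 3" parses as ">> (2 * 3)", i.e. a shift by 6 (divide by 64).
buggy = (MAX_ESCAPED >> 2 * 3) // EXPANSION

# Fixed: shift by 2 (divide by 4), then multiply by 3 for the
# 3-bytes-per-4-characters Base64 ratio.
fixed = (MAX_ESCAPED >> 2) * 3 // EXPANSION

print(buggy)  # 2604166 -- far below the intended 125MB threshold
print(fixed)  # 125000000
```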
What forces us to have a hardcoded maximum? Why not just let the implementation OOM if the required size would be too big?
It may not OOM, it may cause integer overflow depending on the size when we calculate the maximum required size for the given input. And then, we may pass in an unexpected value to IBW.GetSpan(...). This could be a user-defined IBW which could have custom logic for how the memory is allocated.
corefx/src/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.cs, lines 1047 to 1055 in a6e7ffd:
For example, if this calculation overflows (which currently is guaranteed not to), then we skip the call to Grow and either overwrite existing data, write only partial data, or throw ArgumentException when trying to copy to a buffer that's too small.
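The overwrite scenario can be simulated (a Python sketch of unchecked 32-bit arithmetic; the lengths and expansion factor here are illustrative, not taken from the writer's actual code):

```python
def to_int32(x: int) -> int:
    """Simulate C# unchecked 32-bit signed arithmetic (wrap-around)."""
    x &= 0xFFFFFFFF
    return x - 2**32 if x >= 2**31 else x

# Hypothetical required-size calculation: length * expansion + overhead.
input_length = 400_000_000
required = to_int32(input_length * 6 + 2)  # wraps past int.MaxValue

print(required)  # -1894967294: wrapped to a negative number

# A capacity check such as "required <= free_space" now passes spuriously,
# so Grow is skipped and the subsequent copy corrupts or truncates data.
free_space = 1024
print(required <= free_space)  # True -- the check that should have failed
```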
> it may cause integer overflow

And you can't use `checked`?
Hmm, let me see if that is feasible, and follow up on this as a separate PR.
This PR is intended to fix the bug in the current code that was blocking user scenario (considering for 3.0), so I'd like to keep the change isolated to just that.
> And you can't use `checked`?
Aren't there performance implications of doing so? Or are you suggesting those would be negligible here?
> Aren't there performance implications of doing so? Or are you suggesting those would be negligible here?
It would add a few instructions, but you'd also get to remove calls like `JsonWriterHelper.ValidateBytes(bytes)`, simplify the code, and avoid these strange hard-coded, semi-accurate limits that appear to come out of nowhere.
Good point. I am trying to assess the side effects of removing these checks (and constants). Currently, the error is detected up front with a deterministic error message. Removing the checks and letting integer overflow/OOM/etc. capture the failure condition, depending on the input and method being called, may not be worth the trade-off. It is easy to reason about these limits today.
For example, let's say we remove the escaping/transcoding based limits. Today, we have the following up-front check on the public API:
corefx/src/System.Text.Json/src/System/Text/Json/Writer/JsonWriterHelper.cs, lines 117 to 121 in 06007cd:
After that, all subsequent method calls can rely on this invariant.
Now, if we remove that, the "error" state (i.e. data too large) propagates to all the places where the previous assertions break:
- GetMaxEscapedLength would need to be checked.
- Calculating the required max size would need to be checked for both indented and minimized.
And now the user gets an OverflowException rather than an ArgumentException.
That said, I think it's worth doing the exercise to see which code paths would need to be fixed and whether reasoning about the kinds of errors that might get surfaced to the caller is easy enough to justify the change.
The main question is: do we want to surface OOMs or OverflowException to the caller if they pass data that's too large (depending on which code path got executed), or do we want to provide a more meaningful and deterministic error message (with ArgumentException) that conveys what went wrong?
Edit: Filed an issue - https://github.com/dotnet/corefx/issues/40795
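The trade-off can be sketched concretely (Python stand-in; the real validation is C#'s `JsonWriterHelper.ValidateBytes`, and the function name below is illustrative):

```python
MAX_UNESCAPED_TOKEN_SIZE = 166_666_666  # MaxEscapedTokenSize / MaxExpansionFactorWhileEscaping

def validate_value_length(length: int) -> None:
    """Up-front check at the public API boundary: fails deterministically,
    with a message that says what went wrong, before any work is done."""
    if length > MAX_UNESCAPED_TOKEN_SIZE:
        raise ValueError(f"Value too large: {length} bytes "
                         f"(maximum is {MAX_UNESCAPED_TOKEN_SIZE})")

# Every subsequent internal computation can rely on this invariant.
# Removing it means each downstream size calculation (GetMaxEscapedLength,
# the indented and minimized required-size math) needs checked arithmetic,
# and the caller sees an overflow error instead of an argument error.
validate_value_length(1_000)  # fine; anything above the limit would raise
```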
`checked` would require catching the overflow exception. It's a forbidden exception type that should not bubble out to callers.
He was referring to the overflow exception, not the out of memory exception. (And it's fine for an overflow exception to escape to callers in some APIs, you just wouldn't expect it here.)
> This PR is intended to fix the bug in the current code that was blocking user scenario (considering for 3.0), so I'd like to keep the change isolated to just that.
Yes, we need to keep this as simple as possible for 3.0 to avoid any potential regression. For 5.0 we can then improve this area by removing the constants and using `checked` (if that turns out to be the best design).
Any other feedback for this PR (specific to fixing the issue blocking writing large byte[] as Base64), or is it good to merge? I'll re-evaluate the feedback around removing the thresholds altogether.

Unrelated test failure for the Windows Build UWP_CoreCLR_x64_Debug leg: https://github.com/dotnet/corefx/issues/31608
…se64 bytes. (dotnet#40792)

* Fix the max token size threshold to correctly compute to 125MB for Base64 bytes.
* Rename constant to fix transpose error: Base46 -> Base64
* Enable the outerloop tests for windows and osx only and update to use platform specific new line.
…se64 bytes. (dotnet/corefx#40792)

Commit migrated from dotnet/corefx@1511f72
Fixes https://github.com/dotnet/corefx/issues/40755 in master
Also:
Rename constant to fix transpose error: Base46 -> Base64
cc @lauxjpn, @steveharter, @scalablecory