This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Fix the max token size threshold to correctly compute to 125MB for Base64 bytes. #40792

Merged
ahsonkhan merged 3 commits into dotnet:master from ahsonkhan:FixWritingBase64Threshold
Sep 4, 2019

Conversation

@ahsonkhan

Fixes https://github.com/dotnet/corefx/issues/40755 in master

Also:
Rename constant to fix transpose error: Base46 -> Base64

cc @lauxjpn, @steveharter, @scalablecory

@ahsonkhan ahsonkhan added this to the 5.0 milestone Sep 4, 2019

@scalablecory left a comment


Looks good other than a flaky test.

}
catch (OutOfMemoryException)
{
return;


This is flaky, no?


@ahsonkhan Sep 4, 2019


We have this pattern elsewhere in the code base, but you bring up a good point. We can exclude this test on Linux. It should be deterministic on other OSes.

@@ -63,7 +63,7 @@ internal static class JsonConstants

public const int MaxEscapedTokenSize = 1_000_000_000; // Max size for already escaped value.

Where does this value come from?


We compute the maximum required space to write a property name and value string, and then ask the IBufferWriter&lt;byte&gt; for a buffer of that size (the call to GetSpan(sizeHint)).

What's the largest key/value pair that can fit after making a single call to GetSpan(...)?
Say, you want to write: "really large property name": " really large value". You do this as follows:

// This needs to call IBW.GetSpan with the required size to write this data into it
jsonWriter.WriteString("really large property name", " really large value");

The maximum feasible is roughly: int.MaxValue - 56 (for array-backed buffer)
Then we allow up to 1,000 levels of depth, so after accounting for indentation (plus quotes, spaces, colon, etc.) we round the maximum feasible size down to 2 billion.
We have 1 billion for the property name string, and 1 billion for the value string. That's where this limit comes from.
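The budget arithmetic described above can be sanity-checked with a quick sketch (Python for brevity; the variable names are illustrative, not from the corefx source):

```python
# Reproduce the budget behind the 1_000_000_000 per-string limit (illustrative names).
INT32_MAX = 2**31 - 1

max_array_backed = INT32_MAX - 56   # roughly the largest array-backed buffer GetSpan can return
budget = 2_000_000_000              # rounded down for depth/indentation, quotes, colon, etc.
assert budget < max_array_backed    # the rounded budget stays feasible

per_string_limit = budget // 2      # half for the property name, half for the value
print(per_string_limit)             # 1000000000
```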


Sure, but why not just ask the IBufferWriter for what's needed, and let it fail if it can't provide the space?


Does our read implementation have some sanity limits that we're trying to keep our write implementation under?


@ahsonkhan Sep 4, 2019


No, the reader doesn't have such limitations. The limitation in the writer was mainly to avoid making multiple calls to the interface while writing, for perf on the common path.

public const int MaxEscapedTokenSize = 1_000_000_000; // Max size for already escaped value.
public const int MaxUnescapedTokenSize = MaxEscapedTokenSize / MaxExpansionFactorWhileEscaping; // 166_666_666 bytes
public const int MaxBase46ValueTokenSize = (MaxEscapedTokenSize >> 2 * 3) / MaxExpansionFactorWhileEscaping; // 125_000_000 bytes
public const int MaxBase64ValueTokenSize = (MaxEscapedTokenSize >> 2) * 3 / MaxExpansionFactorWhileEscaping; // 125_000_000 bytes
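The one-character diff above hides an operator-precedence bug: in C#, as in most C-family languages, >> binds looser than *, so the old expression shifted by 6 instead of shifting by 2 and then multiplying by 3 (the * 3/4 base64 expansion). A quick sketch (Python, whose precedence agrees here; the escaping factor of 6 is an assumption consistent with the 166_666_666 figure above) shows the difference:

```python
MAX_ESCAPED_TOKEN_SIZE = 1_000_000_000
MAX_EXPANSION_WHILE_ESCAPING = 6  # assumed value, consistent with 166_666_666 above

# Old (buggy): '>>' binds looser than '*', so this shifts by 6, not by 2
buggy = (MAX_ESCAPED_TOKEN_SIZE >> 2 * 3) // MAX_EXPANSION_WHILE_ESCAPING
# New (fixed): shift by 2, then multiply by 3, i.e. * 3/4 for base64 expansion
fixed = (MAX_ESCAPED_TOKEN_SIZE >> 2) * 3 // MAX_EXPANSION_WHILE_ESCAPING

print(buggy)  # 2604166  -- roughly 2.6MB instead of the intended limit
print(fixed)  # 125000000 -- the intended 125MB
```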

What forces us to have a hardcoded maximum? Why not just let the implementation OOM if the required size would be too big?


It may not OOM; it may instead cause integer overflow, depending on the size, when we calculate the maximum required size for the given input. We would then pass an unexpected value to IBW.GetSpan(...). This could be a user-defined IBW with custom logic for how the memory is allocated.

private void WriteBase64Minimized(ReadOnlySpan&lt;byte&gt; bytes)
{
    int encodingLength = Base64.GetMaxEncodedToUtf8Length(bytes.Length);
    Debug.Assert(encodingLength < int.MaxValue - 3);

    // 2 quotes to surround the base-64 encoded string value.
    // Optionally, 1 list separator
    int maxRequired = encodingLength + 3;

    if (_memory.Length - BytesPending < maxRequired)
    {
        Grow(maxRequired);
    }

else
{
    Debug.Assert(_output != null);
    _output.Advance(BytesPending);
    BytesCommitted += BytesPending;
    BytesPending = 0;
    _memory = _output.GetMemory(sizeHint);

For example, if this calculation overflowed (which currently is guaranteed not to happen), we would skip the call to Grow and then overwrite existing data, write only partial data, or throw an ArgumentException when trying to copy to a buffer that's too small:

int maxRequired = indent + encodingLength + 3 + s_newLineLength;
if (_memory.Length - BytesPending < maxRequired)
{
    Grow(maxRequired);
}
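To see why unchecked overflow would be dangerous here, the failure mode can be simulated with 32-bit wraparound (a hypothetical sketch; the real code's asserts and constants guarantee this cannot happen):

```python
def to_int32(x):
    """Wrap a Python int to signed 32-bit, like unchecked C# int arithmetic."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x

INT32_MAX = 2**31 - 1

# Pathologically large encoded length (ruled out in practice by the Debug.Asserts)
encoding_length = INT32_MAX - 1
max_required = to_int32(encoding_length + 3)   # wraps to a negative number

memory_length, bytes_pending = 4096, 0
needs_grow = memory_length - bytes_pending < max_required
print(max_required < 0, needs_grow)  # True False -> Grow() is silently skipped
```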


it may cause integer overflow

And you can't use checked?


Hmm, let me see if that is feasible, and follow up on this in a separate PR.

This PR is intended to fix the bug in the current code that was blocking a user scenario (being considered for 3.0), so I'd like to keep the change isolated to just that.


@ahsonkhan Sep 4, 2019


And you can't use checked?

Aren't there performance implications of doing so? Or are you suggesting those would be negligible here?



Aren't there performance implications of doing so? Or are you suggesting those would be negligible here?

It would add a few instructions, but you'd also get to remove calls like JsonWriterHelper.ValidateBytes(bytes), simplify the code, and avoid these strange hard-coded, semi-accurate limits that appear to come out of nowhere.


@ahsonkhan Sep 4, 2019


Good point. I am trying to assess the side effects of removing these checks (and constants). Currently, the error is detected up front with a deterministic error message. Removing the checks and letting integer overflow/OOM/etc. surface the failure condition, depending on the input and the method being called, may not be worth the trade-off. It is easy to reason about these limits today.

For example, let's say we remove the escaping/transcoding based limits. Today, we have the following up-front check on the public API:

public void WriteString(ReadOnlySpan&lt;char&gt; propertyName, ReadOnlySpan&lt;char&gt; value)
{
    JsonWriterHelper.ValidatePropertyAndValue(propertyName, value);

public static void ValidatePropertyAndValue(ReadOnlySpan&lt;char&gt; propertyName, ReadOnlySpan&lt;char&gt; value)
{
    if (propertyName.Length > JsonConstants.MaxCharacterTokenSize || value.Length > JsonConstants.MaxCharacterTokenSize)
        ThrowHelper.ThrowArgumentException(propertyName, value);
}

After that, all subsequent method calls can rely on this invariant.

Now, we remove that, and then the "error" state (i.e. data too large) propagates to all the places where the previous assertions break:

  • GetMaxEscapedLength would need to be checked.
    int length = JsonWriterHelper.GetMaxEscapedLength(value.Length, firstEscapeIndexVal);
  • Calculating the required max size would need to be checked for both indented and minimized.
    Debug.Assert(escapedValue.Length <= JsonConstants.MaxEscapedTokenSize);
    Debug.Assert(escapedPropertyName.Length < ((int.MaxValue - 7 - indent - s_newLineLength) / JsonConstants.MaxExpansionFactorWhileTranscoding) - escapedValue.Length);
    // All ASCII, 2 quotes for property name, 2 quotes for value, 1 colon, and 1 space => escapedPropertyName.Length + escapedValue.Length + 6
    // Optionally, 1 list separator, 1-2 bytes for new line, and up to 3x growth when transcoding
    int maxRequired = indent + ((escapedPropertyName.Length + escapedValue.Length) * JsonConstants.MaxExpansionFactorWhileTranscoding) + 7 + s_newLineLength;

    private void WriteStringMinimized(ReadOnlySpan&lt;char&gt; escapedPropertyName, ReadOnlySpan&lt;char&gt; escapedValue)
    {
        Debug.Assert(escapedValue.Length <= JsonConstants.MaxUnescapedTokenSize);
        Debug.Assert(escapedPropertyName.Length < ((int.MaxValue - 6) / JsonConstants.MaxExpansionFactorWhileTranscoding) - escapedValue.Length);
        // All ASCII, 2 quotes for property name, 2 quotes for value, and 1 colon => escapedPropertyName.Length + escapedValue.Length + 5
        // Optionally, 1 list separator, and up to 3x growth when transcoding
        int maxRequired = ((escapedPropertyName.Length + escapedValue.Length) * JsonConstants.MaxExpansionFactorWhileTranscoding) + 6;
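With the up-front per-string cap in place, the worst-case maxRequired in the minimized path provably fits in an int, which is what the Debug.Asserts encode. A quick check (assuming MaxCharacterTokenSize = 166_666_666 and a 3x transcoding factor, consistent with the figures quoted earlier):

```python
INT32_MAX = 2**31 - 1
MAX_CHARACTER_TOKEN_SIZE = 166_666_666   # assumed per-string cap from ValidatePropertyAndValue
MAX_EXPANSION_WHILE_TRANSCODING = 3      # assumed UTF-16 -> UTF-8 worst-case growth

# Worst case: both property name and value at the cap, 3x transcoding growth,
# plus 4 quotes, 1 colon, and 1 optional list separator (the "+ 6")
max_required = (2 * MAX_CHARACTER_TOKEN_SIZE) * MAX_EXPANSION_WHILE_TRANSCODING + 6
print(max_required, max_required <= INT32_MAX)  # 1000000002 True
```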

And now the user gets an OverflowException rather than an ArgumentException.

That said, I think it's worth doing the exercise to see which code paths would need to be fixed, and whether reasoning about the kinds of errors that might get surfaced to the caller is easy enough to justify the change.

The main question is: do we want to surface an OOM or OverflowException to the caller when they pass data that's too large (depending on which code path executed), or do we want to provide a more meaningful, deterministic error message (via ArgumentException) that conveys what went wrong?

Edit: Filed an issue - https://github.com/dotnet/corefx/issues/40795


checked would require catching the overflow exception. It's a forbidden exception type that should not bubble out to callers.


@stephentoub Sep 4, 2019


He was referring to the overflow exception, not the out of memory exception. (And it's fine for an overflow exception to escape to callers in some APIs, you just wouldn't expect it here.)


This PR is intended to fix the bug in the current code that was blocking a user scenario (being considered for 3.0), so I'd like to keep the change isolated to just that.

Yes, we need to keep this as simple as possible for 3.0 to avoid any potential regression. For 5.0 we can then improve this area by removing the constants and using checked (if that turns out to be the best design).

@ahsonkhan

Any other feedback on this PR (specific to fixing the issue blocking writing a large byte[] as Base64), or is it good to merge?

I'll re-evaluate the feedback around removing the thresholds altogether.

@ahsonkhan

ahsonkhan commented Sep 4, 2019

Unrelated test failure for the Windows Build UWP_CoreCLR_x64_Debug leg: https://github.com/dotnet/corefx/issues/31608

@ahsonkhan ahsonkhan merged commit 1511f72 into dotnet:master Sep 4, 2019
@ahsonkhan ahsonkhan deleted the FixWritingBase64Threshold branch September 4, 2019 05:24
ahsonkhan added a commit to ahsonkhan/corefx that referenced this pull request Sep 4, 2019
…se64 bytes. (dotnet#40792)

* Fix the max token size threshold to correctly compute to 125MB for
Base64 bytes.

* Rename constant to fix transpose error: Base46 -> Base64

* Enable the outerloop tests for windows and osx only and update to use
platform specific new line.
ahsonkhan added a commit that referenced this pull request Sep 5, 2019
…se64 bytes. (#40792) (#40796)

* Fix the max token size threshold to correctly compute to 125MB for
Base64 bytes.

* Rename constant to fix transpose error: Base46 -> Base64

* Enable the outerloop tests for windows and osx only and update to use
platform specific new line.
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
…se64 bytes. (dotnet/corefx#40792)

* Fix the max token size threshold to correctly compute to 125MB for
Base64 bytes.

* Rename constant to fix transpose error: Base46 -> Base64

* Enable the outerloop tests for windows and osx only and update to use
platform specific new line.


Commit migrated from dotnet/corefx@1511f72

Successfully merging this pull request may close these issues.

The JSON value of length n is too large and not supported.
