Add JsonReadFeature.ALLOW_HEXADECIMAL_NUMBERS for JSON5-style hex literals (#707)#1613
Conversation
92f36fa to
0e27cba
Compare
…erXML#707) Address review feedback on FasterXML#1613: parse the buffered hex digits as a BigInteger via a new NumberInput.parseBigIntegerWithRadix(char[], ...) overload that hands the slice straight to FastDoubleParser, eliminating the intermediate String allocation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…erXML#707) Address review feedback on FasterXML#1613: parse the buffered hex digits as a BigInteger via a new NumberInput.parseBigIntegerWithRadix(char[], ...) overload that hands the slice straight to FastDoubleParser, eliminating the intermediate String allocation.
ba2c3ea to
45e02b3
Compare
|
@pjfanning Addressed your review feedback in 45e02b3 (moved BigInteger parsing into |
|
Hmmmh. This is HUGE. Not a big fan based on size alone... :-( Will give some more thought. |
|
Quick note on the "fail when feature is not enabled" request — that scenario was actually already covered from the initial commit by Also pushed 4e8e060 to add invalid-hex digit coverage (e.g. |
|
@cowtowncoder Totally hear you on the size — it's something I've been chewing on too. Let me break down what's in the +940 to make it easier to evaluate, and offer a couple of options. Where the lines actually go:
Why parser plumbing is 51%: the feature has to land in each concrete parser independently because the input sources (Reader, byte stream, DataInput, async buffer) don't share a unified read primitive. I tried to keep as much as possible in The one outlier is What I can trim (good faith, ~15–20 lines): Suggested review path if it helps to chunk it:
Alternative if size is still the blocker: I could split into a prelim PR that lands just the utils ( Whichever direction you'd like to take, let me know and I'll make it happen. |
…asterXML#707) Addresses two review comments on PR FasterXML#1613: - ParserBase.java:183 (cowtowncoder): hex literal handling is JSON-only, not shared by binary/text format backends that also extend ParserBase. - NonBlockingUtf8JsonParserBase.java:1810 (cowtowncoder): the per-parser _isHexDigit helper should live in a shared base for reuse. Promote JSON-only hex state to JsonParserBase --------------------------------------------- - _numberIsHex field and resetIntHex(...) move from ParserBase to JsonParserBase. ParserBase no longer references hex at all. - ParserBase.resetInt loses `final` so JsonParserBase can override it to clear _numberIsHex. resetFloat/resetAsNaN no longer touch the flag: _numberIsHex is only read in the VALUE_NUMBER_INT branch of _parseNumericValue/_parseIntValue, both reached only via resetInt or resetIntHex (both of which now correctly initialize the flag), so leaving it stale after a float/NaN reset is benign. Hoist duplicated per-parser helpers to JsonParserBase ----------------------------------------------------- - _isHexDigit(int) was duplicated four times across ReaderBasedJsonParser, UTF8StreamJsonParser, UTF8DataInputJsonParser, and NonBlockingUtf8JsonParserBase. Replaced with a single protected static on JsonParserBase. Char-stream parser's existing `_isHexDigit(char)` call site keeps working via implicit char->int widening. - "Hexadecimal number prefix '0X' must be followed by at least one hex digit (...)" error message was duplicated at the same four call sites. Extracted as protected static _hexPrefixNotFollowedMessage(char) on JsonParserBase. Net effect: -76 / +79 lines, identical externally observable behavior. Verified by full test suite (1702 pass, 2 pre-existing skips), with particular attention to NonStandardHexNumbers707Test and AsyncHexNumbers707Test.
) Addresses cowtowncoder's follow-up suggestion on PR FasterXML#1613: UTF8DataInputJsonParser.java:1218 (cowtowncoder): "Or better yet, check out what CharTypes.java already has; maybe charToHex() would be useful (gives -1 for non-hex digit)" CharTypes.charToHex(int) is a single masked array lookup that returns the digit value (0..15) for valid hex characters and -1 otherwise. That single primitive subsumes both helpers introduced for this feature: - _isHexDigit(int) -> CharTypes.charToHex(c) >= 0 - _hexDigit(char) -> CharTypes.charToHex(c) Both are removed; the four concrete parser scan loops and JsonParserBase._parseHexInt now go through CharTypes.charToHex directly, matching the pattern already used elsewhere in the JSON parsers for \\uXXXX escape decoding. Verified by full test suite (1702 pass, 2 pre-existing skips).
|
@cowtowncoder Pushed
Net diff is now +923 / -4 across 16 files (was +940 / -3). Modest line reduction but materially cleaner structure: zero hex references in cross-format Full test suite: 1702 pass / 2 pre-existing skips, including |
| // valid in hex regardless of ALLOW_LEADING_ZEROS_FOR_NUMBERS. | ||
| if (_inputPtr < _inputEnd || _loadMore()) { | ||
| char peek = _inputBuffer[_inputPtr]; | ||
| if ((peek == 'x' || peek == 'X') |
There was a problem hiding this comment.
Idea: add a helper method ("_checkHexNumbersAllowed()" etc) in JsonParserBase called after "0x" or "0X" seen, passing next character -- if feature NOT enabled, can give specific error message including feature to enable to allow hexadecimal numbers: either returns (indicating feature enabled) or throws exception (is not).
Reduces duplication slightly but more importantly gives better exception in case feature disabled.
There was a problem hiding this comment.
Done in 4e5945f. Added JsonParserBase._checkHexNumbersAllowed(int) that throws StreamReadException naming the feature when JsonReadFeature.ALLOW_HEXADECIMAL_NUMBERS is disabled. All eight call sites across the four JSON parsers (Reader, UTF8 stream, UTF8 DataInput, non-blocking UTF8) now go through it, eliminating duplicated isEnabled(...) checks. Existing feature-disabled tests tightened to assert the error message includes the feature name.
|
@seonwooj0810 wrt: yes, please, that'd be worth fixing separately, can merge quickly. |
|
@seonwooj0810 Ok, aside from review (with which are making good progress), one process thing: unless we have asked for it before (and got it), we'd need CLA. It's from: https://github.com/FasterXML/jackson/blob/main/CLA-jackson-2026.pdf and is usually done by printing, filling & signing, scan/photo, email to Only needs to be sent once and is good for all future contributions. |
Two small review-feedback nits from FasterXML#1613: - ParserBase.resetInt was made non-final earlier in this branch so that JsonParserBase can override it to clear the hex flag. Add a comment noting the prior `final` modifier and why it was relaxed. - BigIntegerParser.parseWithFastParser(char[],int,int,int) is new in 3.2; add the `@since 3.2` Javadoc tag to match the existing convention.
…error Per @cowtowncoder review on FasterXML#1613: when a number literal starts with `0x` / `0X` but `JsonReadFeature.ALLOW_HEXADECIMAL_NUMBERS` is not enabled, the user previously got a generic "unexpected character" error that did not mention the feature to enable. Centralize the feature check in a single `JsonParserBase._checkHexNumbersAllowed(int)` helper that throws a `StreamReadException` naming the feature, so the diagnostic is actionable: Unexpected character ('x') (code 120): hexadecimal number literals require enabling `JsonReadFeature.ALLOW_HEXADECIMAL_NUMBERS` All eight call sites across the four JSON parsers (Reader, UTF8 stream, UTF8 DataInput, non-blocking UTF8) now go through the helper, eliminating the duplicated `isEnabled(ALLOW_HEXADECIMAL_NUMBERS)` check at every parser-specific entry point. Existing feature-disabled tests are tightened to assert that the new error message includes the feature name in addition to the original "Unexpected character ('x'" prefix.
Per @cowtowncoder review on FasterXML#1613: previously the maximum-integer-length constraint was applied only when `resetIntHex` was called -- i.e. after the full hex literal had been buffered. A pathological input with millions of digits would therefore allocate millions of chars before being rejected. Move the validation into the hex digit accumulation loop so it fires at every `_textBuffer` segment boundary (default first segment is 500 chars). With the default `maxNumberLength=1000`, a 10-million-digit hex literal is now caught at ~4 KiB rather than after fully buffering. Add `hexNumberLengthConstraint` to NonStandardHexNumbers707Test, exercising an 8000-digit literal with `maxNumberLength=100` against all sync modes; each backend must throw `StreamConstraintsException` with the standard "exceeds the maximum" message. The async path is exercised by the same validation primitive in `_finishHexDigits`.
|
@cowtowncoder Pushed commits
One item I want to flag rather than implement here: your suggestion to move hex decoding into Net effect after this round: +1003 / -22 across 18 files (was +923 / -4 across 16 files). The bump is roughly evenly split between the new helper + boundary test (+~80) and the same churn spread across the four parsers — but the cross-parser duplication is now down (one feature check, one boundary check primitive each). Full test suite: 1703 pass / 2 pre-existing skips (was 1702 — added Process note: CLA will follow by email shortly; same signature covers both this PR and #1614. |
…erals (FasterXML#707) Implements JSON5 hexadecimal integer literals (0x/0X prefix, optional sign, [0-9a-fA-F]+) across all four JSON parsers: * ReaderBasedJsonParser - hex branch in _parseNumber2 (before _verifyNoLeadingZeroes, so ALLOW_LEADING_ZEROS_FOR_NUMBERS is not consulted on the hex path). * UTF8StreamJsonParser - same pattern, byte-oriented. * UTF8DataInputJsonParser - intercept in the existing 'x'/'X' branch inside _parseUnsignedNumber and _parseSignedNumber. * NonBlockingUtf8JsonParserBase - new minor states MINOR_NUMBER_HEX_PREFIX and MINOR_NUMBER_HEX_DIGITS so suspension works at every byte boundary; helpers _startHexNumber(WithSign) and _finishHexDigits drive the state machine. Shared decoding lives in JsonParserBase._parseHexInt: int (<=7 hex digits), long (<=16 with top-nibble check), or BigInteger (eagerly decoded; the lazy _numberString path expects base-10 and would mis-read hex). A new boolean _numberIsHex field plus resetIntHex helper on ParserBase records that the buffered literal is hex; the textual representation preserved in _textBuffer is the original literal (sign + 0x/0X + digits), matching the precedent set by FasterXML#784 for the '+' sign. Per pjfanning's review on FasterXML#707: * JSON5 grammar is followed verbatim; leading zeros within hex digits are always allowed and do not consult ALLOW_LEADING_ZEROS_FOR_NUMBERS (see ALLOW_HEXADECIMAL_NUMBERS javadoc and the unsignedHexWithLeadingZeros test). * The '+' variant additionally requires ALLOW_LEADING_PLUS_SIGN_FOR_NUMBERS, matching how '+' is handled for decimal numbers; this is exercised by plusSignHexRequiresLeadingPlusFeature. Drive-by fix: while testing the async path I uncovered a pre-existing inconsistency where the explicit '+' sign was lost on the +0... resumption path because MINOR_NUMBER_ZERO was reused for both "plain 0" and "+0" suspension. Added MINOR_NUMBER_PLUSZERO so the sign survives byte-by-byte feeds; the sibling test AsyncNonStandardNumberParsingTest.leadingPlusSignInDecimalEnabled2 was asserting the buggy behavior (its peer leadingPlusSignInDecimalEnabled already asserted the correct, sign-preserving result) and has been updated to match the blocking parsers. Happy to split this into a separate PR if preferred. Tests: * NonStandardHexNumbers707Test - 12 cases covering unsigned/uppercase/ negative/+sign/long range/BigInteger range/in-array/feature-off rejection/empty-prefix rejection across ALL_MODES (Reader, UTF8 stream, UTF8 throttled, DataInput). * AsyncHexNumbers707Test - 10 cases driving the non-blocking parser with 1-byte and 3-byte feeds to exercise suspension/resumption at every position. Full mvn verify: 1697 tests, 0 failures, 0 regressions.
…erXML#707) Address review feedback on FasterXML#1613: parse the buffered hex digits as a BigInteger via a new NumberInput.parseBigIntegerWithRadix(char[], ...) overload that hands the slice straight to FastDoubleParser, eliminating the intermediate String allocation.
Cover offset=0 with both full-array and truncated-length cases, plus negative-sign inputs for the new parseBigIntegerWithRadix(char[], ...) overload.
Verify both useFastParser branches throw NumberFormatException on non-hex characters in the new parseBigIntegerWithRadix(char[], ...) overload.
…asterXML#707) Addresses two review comments on PR FasterXML#1613: - ParserBase.java:183 (cowtowncoder): hex literal handling is JSON-only, not shared by binary/text format backends that also extend ParserBase. - NonBlockingUtf8JsonParserBase.java:1810 (cowtowncoder): the per-parser _isHexDigit helper should live in a shared base for reuse. Promote JSON-only hex state to JsonParserBase --------------------------------------------- - _numberIsHex field and resetIntHex(...) move from ParserBase to JsonParserBase. ParserBase no longer references hex at all. - ParserBase.resetInt loses `final` so JsonParserBase can override it to clear _numberIsHex. resetFloat/resetAsNaN no longer touch the flag: _numberIsHex is only read in the VALUE_NUMBER_INT branch of _parseNumericValue/_parseIntValue, both reached only via resetInt or resetIntHex (both of which now correctly initialize the flag), so leaving it stale after a float/NaN reset is benign. Hoist duplicated per-parser helpers to JsonParserBase ----------------------------------------------------- - _isHexDigit(int) was duplicated four times across ReaderBasedJsonParser, UTF8StreamJsonParser, UTF8DataInputJsonParser, and NonBlockingUtf8JsonParserBase. Replaced with a single protected static on JsonParserBase. Char-stream parser's existing `_isHexDigit(char)` call site keeps working via implicit char->int widening. - "Hexadecimal number prefix '0X' must be followed by at least one hex digit (...)" error message was duplicated at the same four call sites. Extracted as protected static _hexPrefixNotFollowedMessage(char) on JsonParserBase. Net effect: -76 / +79 lines, identical externally observable behavior. Verified by full test suite (1702 pass, 2 pre-existing skips), with particular attention to NonStandardHexNumbers707Test and AsyncHexNumbers707Test.
) Addresses cowtowncoder's follow-up suggestion on PR FasterXML#1613: UTF8DataInputJsonParser.java:1218 (cowtowncoder): "Or better yet, check out what CharTypes.java already has; maybe charToHex() would be useful (gives -1 for non-hex digit)" CharTypes.charToHex(int) is a single masked array lookup that returns the digit value (0..15) for valid hex characters and -1 otherwise. That single primitive subsumes both helpers introduced for this feature: - _isHexDigit(int) -> CharTypes.charToHex(c) >= 0 - _hexDigit(char) -> CharTypes.charToHex(c) Both are removed; the four concrete parser scan loops and JsonParserBase._parseHexInt now go through CharTypes.charToHex directly, matching the pattern already used elsewhere in the JSON parsers for \\uXXXX escape decoding. Verified by full test suite (1702 pass, 2 pre-existing skips).
Two small review-feedback nits from FasterXML#1613: - ParserBase.resetInt was made non-final earlier in this branch so that JsonParserBase can override it to clear the hex flag. Add a comment noting the prior `final` modifier and why it was relaxed. - BigIntegerParser.parseWithFastParser(char[],int,int,int) is new in 3.2; add the `@since 3.2` Javadoc tag to match the existing convention.
…error Per @cowtowncoder review on FasterXML#1613: when a number literal starts with `0x` / `0X` but `JsonReadFeature.ALLOW_HEXADECIMAL_NUMBERS` is not enabled, the user previously got a generic "unexpected character" error that did not mention the feature to enable. Centralize the feature check in a single `JsonParserBase._checkHexNumbersAllowed(int)` helper that throws a `StreamReadException` naming the feature, so the diagnostic is actionable: Unexpected character ('x') (code 120): hexadecimal number literals require enabling `JsonReadFeature.ALLOW_HEXADECIMAL_NUMBERS` All eight call sites across the four JSON parsers (Reader, UTF8 stream, UTF8 DataInput, non-blocking UTF8) now go through the helper, eliminating the duplicated `isEnabled(ALLOW_HEXADECIMAL_NUMBERS)` check at every parser-specific entry point. Existing feature-disabled tests are tightened to assert that the new error message includes the feature name in addition to the original "Unexpected character ('x'" prefix.
Per @cowtowncoder review on FasterXML#1613: previously the maximum-integer-length constraint was applied only when `resetIntHex` was called -- i.e. after the full hex literal had been buffered. A pathological input with millions of digits would therefore allocate millions of chars before being rejected. Move the validation into the hex digit accumulation loop so it fires at every `_textBuffer` segment boundary (default first segment is 500 chars). With the default `maxNumberLength=1000`, a 10-million-digit hex literal is now caught at ~4 KiB rather than after fully buffering. Add `hexNumberLengthConstraint` to NonStandardHexNumbers707Test, exercising an 8000-digit literal with `maxNumberLength=100` against all sync modes; each backend must throw `StreamConstraintsException` with the standard "exceeds the maximum" message. The async path is exercised by the same validation primitive in `_finishHexDigits`.
f306baa to
d1bf5c2
Compare
|
Quick update batch on this PR:
Ready for another look when you have time. |
|
CLA received. |
cowtowncoder
left a comment
There was a problem hiding this comment.
LGTM: planning to merge later tonight.
|
Thanks @cowtowncoder and @pjfanning for the thorough review. Moving the hex bookkeeping into |
|
@seonwooj0810 Great collaboration, will get this merged tonight. |
Summary
Implements JSON5 hexadecimal integer literals (
0x/0Xprefix, optional sign,[0-9a-fA-F]+) across all four JSON parsers, gated behind a new opt-inJsonReadFeature.ALLOW_HEXADECIMAL_NUMBERS. Closes #707 (hex part; whitespace remains).Per @pjfanning's review:
0x|0X+[0-9a-fA-F]+, optional leading sign, no fraction / no exponent / no_separators.ALLOW_LEADING_ZEROS_FOR_NUMBERS— leading zeros within hex digits (e.g.0x007F) are always permitted when this feature is enabled, regardless of the leading-zero feature. The hex check fires before_verifyNoLeadingZeroes()in every parser, and the leading-zero feature is never consulted on the hex path. Covered byunsignedHexWithLeadingZeros.The
+variant additionally requiresALLOW_LEADING_PLUS_SIGN_FOR_NUMBERS(matching how+is treated for decimals, per #774). Covered byplusSignHexRequiresLeadingPlusFeatureandplusSignHexWithPlusFeature.Implementation per parser:
ReaderBasedJsonParser_parseNumber2, before_verifyNoLeadingZeroes.UTF8StreamJsonParser_verifyNoLeadingZeroesin both_parseUnsignedNumberand_parseSignedNumber.UTF8DataInputJsonParser'x'/'X'arm of_handleLeadingZeroes(no peek needed — DataInput already consumed it).NonBlockingUtf8JsonParserBaseMINOR_NUMBER_HEX_PREFIXandMINOR_NUMBER_HEX_DIGITSso suspension works at every byte boundary; helpers_startHexNumber/_startHexNumberWithSign/_finishHexDigitsdrive the state machine.Shared decoding lives in
JsonParserBase._parseHexInt:intfor ≤ 7 hex digits (always fits in signed 32-bit),longfor ≤ 16 hex digits with a top-nibble check at 16,BigIntegerotherwise — eagerly decoded because the lazy_numberStringpath expects base-10.getString()returns the original literal (sign +0x/0X+ digits), matching the precedent set for+by #784.getIntValue()/getLongValue()/getBigIntegerValue()return the decoded value.The bookkeeping additions on
JsonParserBaseare limited:_numberIsHexfield (cleared byresetInt/resetFloat/resetAsNaN),resetIntHex(neg, hexDigitLen)helper alongsideresetInt.I put these on
JsonParserBaseonly (since only JSON consumes it today).Test plan
mvn clean verify— 1697 tests, 0 failures, 0 regressions.NonStandardHexNumbers707Test(12 cases) — unsigned, uppercaseX, leading zeros, negative,+with/without plus-feature, long range (0x80000000), BigInteger range (0x1ffffffffffffffff), in-array, feature-off rejection, empty-prefix rejection. Runs acrossALL_MODES(Reader, UTF8 stream, throttled, DataInput).AsyncHexNumbers707Test(10 cases) — drives the non-blocking parser with 1-byte and 3-byte feeds to exercise suspension/resumption at every position; covers the same shapes plus aplainZeroStillWorkssmoke test to verify the bare-0path didn't regress.