Keep None as a real null in Json() columns instead of the string "null" #8231
Merged
lhoestq merged 1 commit on Jun 5, 2026
A missing value (None) in a Json() column was encoded as the JSON string "null" rather than a real Arrow null. As a result the column had a null_count of 0 even when values were missing, and a genuine missing value became indistinguishable from a column that holds the literal JSON value null. The leaf branch of json_encode_field (used by from_dict and the on_mixed_types="use_json" path) ran ujson_loads(None), which raised, then fell back to ujson_dumps(None) and produced "null". Json.encode_example had the same issue, Json.decode_example crashed on None, and cast_storage re-serialized None entries to "null". Now all of these keep None as a real Arrow null, so null_count, is_null, filtering and Parquet roundtrips are correct. Decoded access is unchanged. Added regression tests covering mixed and all-None Json() columns.
What this fixes
A missing value (`None`) in a `Json()` column is currently stored as the JSON string `"null"` instead of a real Arrow null. This has two visible consequences:

- `null_count == 0` even when values are missing, and `is_null()` returns all `False`. Anything that relies on Arrow null semantics (filtering on nulls, counting nulls, null-aware Parquet writes) sees no nulls at all.
- A genuine missing value is indistinguishable from the literal JSON value `null`. Both end up as the same `"null"` string in storage.

Reproducer on current `main`:

Root cause
The `from_dict` path and the `on_mixed_types="use_json"` path encode each value through `json_encode_field` in `src/datasets/utils/json.py`. Its leaf branch ran `ujson_loads(example)` on the value. For `None` that raises, so it fell into the `except` and returned `ujson_dumps(None)`, which is the string `"null"`. The same gap existed in `Json.encode_example`. On the read side, `Json.decode_example(None)` crashed because it passed `None` straight to `ujson_loads`, and `Json.cast_storage` re-serialized `None` entries to `"null"` when it had to rebuild the array.

The fix
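As a rough illustration of the leaf-branch change, here is a minimal sketch using only the standard library. It is not the code from this PR: `json` stands in for `ujson`, the names `encode_leaf_before` and `encode_leaf_after` are hypothetical, and the behaviour of the success branch (returning an already-valid JSON string unchanged) is an assumption made so the example is complete.

```python
import json


def encode_leaf_before(value):
    """Hypothetical model of the old leaf branch (json stands in for ujson)."""
    try:
        # Raises TypeError for None and for any other non-string value.
        json.loads(value)
        # Assumption: an already-valid JSON string is passed through unchanged.
        return value
    except (TypeError, ValueError):
        # For None this produces the four-character string "null".
        return json.dumps(value)


def encode_leaf_after(value):
    """Hypothetical model of the fixed leaf branch: None is returned untouched."""
    if value is None:
        return None
    return encode_leaf_before(value)


print(repr(encode_leaf_before(None)))     # 'null'
print(repr(encode_leaf_after(None)))      # None
print(repr(encode_leaf_after({"a": 1})))  # '{"a": 1}'
```

The early return is the whole point of the sketch: a missing value never reaches the serializer, so it can be stored as a null rather than as text.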
`None` is now preserved as a real Arrow null across all of these entry points:

- `json_encode_field` and `json_decode_field` return `None` unchanged at the leaf.
- `Json.encode_example(None)` returns `None`.
- `Json.decode_example(None)` returns `None` instead of crashing.
- `Json.cast_storage` keeps `None` entries as nulls when it rebuilds the storage array, and the validation sampling skips `None` so a sampled null does not force a needless re-encode.

Decoded access through `ds[:]`, `to_list` and `to_dict` is unchanged, so this does not change observed values for existing users. It only fixes the underlying storage so nulls behave like nulls.

Verification
Added two regression tests in `tests/test_arrow_dataset.py` (`test_json_feature_keeps_none_as_null` and `test_json_feature_all_none`). Both fail before the change and pass after.

Ran on CPU:
Also confirmed by hand that the mixed-None column now reports `null_count == 1`, that an all-None column is all nulls, that the `on_mixed_types="use_json"` path keeps nulls, and that a Parquet roundtrip preserves the null.

`ruff check` and `ruff format --check` are clean on the changed files.
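The effect on storage, and the reason decoded values do not change, can be approximated without Arrow. The sketch below is illustrative only: plain Python lists stand in for the Arrow storage array, `json` stands in for `ujson`, and `decode_leaf` and `null_count` are hypothetical helpers rather than functions from `datasets`.

```python
import json


def decode_leaf(stored):
    """Hypothetical decode step: a missing entry decodes to None without parsing."""
    if stored is None:
        return None
    return json.loads(stored)


def null_count(storage):
    """Count real nulls, the way a null-aware consumer would see the column."""
    return sum(entry is None for entry in storage)


# The same logical column [{"a": 1}, None, {"a": 2}] under both storage schemes.
old_storage = ['{"a": 1}', "null", '{"a": 2}']  # None stored as the string "null"
new_storage = ['{"a": 1}', None, '{"a": 2}']    # None kept as a real null

print(null_count(old_storage))  # 0
print(null_count(new_storage))  # 1

# Decoded values agree, because the string "null" also parses to None.
old_decoded = [decode_leaf(s) for s in old_storage]
new_decoded = [decode_leaf(s) for s in new_storage]
print(old_decoded == new_decoded)  # True
```

Only null-aware operations on the storage see a difference, which is consistent with the statement that decoded access is unchanged for existing users.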