Keep None as a real null in Json() columns instead of the string "null" #8231

Merged: lhoestq merged 1 commit into huggingface:main from adityasingh2400:json-feature-null-handling on Jun 5, 2026

@adityasingh2400

Contributor

What this fixes

A missing value (None) in a Json() column is currently stored as the JSON string "null" instead of a real Arrow null. This has two visible consequences:

  1. The column reports null_count == 0 even when values are missing, and is_null() returns all False. Anything that relies on Arrow null semantics (filtering on nulls, counting nulls, null-aware Parquet writes) sees no nulls at all.
  2. A genuine missing value becomes indistinguishable from a column that holds the literal JSON value null. Both end up as the same "null" string in storage.
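
Both consequences can be illustrated without `datasets` or Arrow at all. The sketch below uses a plain Python list as a hypothetical stand-in for the column's storage and the standard-library `json` module in place of the library's own serializer, so the helper name `null_count` is an illustration and not the Arrow property itself:

```python
import json

# Hypothetical stand-in for the column's storage: one entry per row.
# Every value, including None, is serialized, which mirrors the old behavior.
rows = [{"a": 1}, None, {"b": 2}]
old_storage = [json.dumps(value, separators=(",", ":")) for value in rows]


def null_count(storage):
    """Count real nulls (None entries), the way null-aware code would."""
    return sum(value is None for value in storage)


print(old_storage)              # ['{"a":1}', 'null', '{"b":2}']
print(null_count(old_storage))  # 0, although one row is missing

# A missing row and a row holding the literal JSON value null
# serialize to the same string, so storage cannot tell them apart.
missing_row = json.dumps(None)
literal_null_row = "null"
print(missing_row == literal_null_row)  # True
```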

Reproducer on current main:

from datasets import Dataset, Features, Json

ds = Dataset.from_dict({"col": [{"a": 1}, None, {"b": 2}]}, features=Features({"col": Json()}))
arr = ds.data["col"].combine_chunks()
print(arr.null_count)        # 0, expected 1
print(arr.to_pylist())       # ['{"a":1}', 'null', '{"b":2}'], expected the middle to be a real null

Root cause

The from_dict path and the on_mixed_types="use_json" path encode each value through json_encode_field in src/datasets/utils/json.py. Its leaf branch ran ujson_loads(example) on the value. For None that raises, so it fell into the except and returned ujson_dumps(None), which is the string "null". The same gap existed in Json.encode_example. On the read side Json.decode_example(None) crashed because it passed None straight to ujson_loads, and Json.cast_storage re-serialized None entries to "null" when it had to rebuild the array.
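
The failure mode described above can be sketched in a few lines. This is a hypothetical reconstruction, not the actual body of json_encode_field: it uses the standard-library `json` module instead of the ujson wrappers, the function name `encode_leaf_before_fix` is made up for illustration, and what the real code did when the load succeeded is an assumption.

```python
import json


def encode_leaf_before_fix(example):
    """Hypothetical reconstruction of the old leaf branch."""
    try:
        # For None this raises TypeError, because loads() expects text.
        json.loads(example)
        # Assumption: a value that already parses as JSON text is kept as-is.
        return example
    except (TypeError, ValueError):
        # The fallback serializes whatever it was given, including None.
        return json.dumps(example)


encoded = encode_leaf_before_fix(None)
print(repr(encoded))  # 'null' (a string, not a real null)
```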

The fix

None is now preserved as a real Arrow null across all of these entry points:

  • json_encode_field and json_decode_field return None unchanged at the leaf.
  • Json.encode_example(None) returns None.
  • Json.decode_example(None) returns None instead of crashing.
  • Json.cast_storage keeps None entries as nulls when it rebuilds the storage array, and the validation sampling skips None so a sampled null does not force a needless re-encode.
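
The shape of the fix listed above can be sketched as an early return for None at each entry point. The code below is a minimal illustration under stated assumptions, not the merged implementation: it uses the standard-library `json` module, a plain list in place of an Arrow array, and the hypothetical names `encode_leaf`, `decode_leaf`, `rebuild_storage` and `sample_needs_reencode`. The exact validation logic in Json.cast_storage is not shown in this PR text, so the sampling check is a guess at its intent.

```python
import json


def encode_leaf(example):
    """Encode one value; None passes through as a real null."""
    if example is None:
        return None
    try:
        json.loads(example)  # assumption: existing JSON text is kept as-is
        return example
    except (TypeError, ValueError):
        return json.dumps(example)


def decode_leaf(encoded):
    """Decode one stored value; None is returned instead of raising."""
    if encoded is None:
        return None
    return json.loads(encoded)


def rebuild_storage(values):
    """Re-encode a column while keeping None entries as nulls."""
    return [encode_leaf(value) for value in values]


def sample_needs_reencode(sample):
    """Return True if a sampled non-null entry is not valid JSON text."""
    for value in sample:
        if value is None:
            continue  # a sampled null must not force a re-encode
        if not isinstance(value, str):
            return True
        try:
            json.loads(value)
        except ValueError:
            return True
    return False


rows = [{"a": 1}, None, {"b": 2}]
storage = rebuild_storage(rows)
decoded = [decode_leaf(value) for value in storage]

print(storage)                                 # ['{"a": 1}', None, '{"b": 2}']
print(sum(value is None for value in storage)) # 1
print(decoded == rows)                         # True
print(sample_needs_reencode(storage))          # False
```

In this sketch decoding the old "null" string and decoding a real null both yield None, which is consistent with the statement that decoded values do not change for existing users.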

Decoded access through ds[:], to_list and to_dict is unchanged, so this does not change observed values for existing users. It only fixes the underlying storage so nulls behave like nulls.

Verification

Added two regression tests in tests/test_arrow_dataset.py (test_json_feature_keeps_none_as_null and test_json_feature_all_none). Both fail before the change and pass after.

Ran on CPU:

python -m pytest tests/test_arrow_dataset.py -k "json or Json or mixed" -q   # 25 passed
python -m pytest tests/features/test_features.py -k "json or round_trip or yaml" -q   # 58 passed
python -m pytest tests/io/test_json.py tests/packaged_modules/test_json.py -q   # 79 passed
python -m pytest tests/io/test_parquet.py -q   # passed
python -m pytest tests/test_arrow_writer.py -q   # 86 passed

Also confirmed by hand that the mixed-None column now reports null_count == 1, that an all-None column is all nulls, that the on_mixed_types="use_json" path keeps nulls, and that a Parquet roundtrip preserves the null. ruff check and ruff format --check are clean on the changed files.

A missing value (None) in a Json() column was encoded as the JSON string
"null" rather than a real Arrow null. As a result the column had a
null_count of 0 even when values were missing, and a genuine missing
value became indistinguishable from a column that holds the literal JSON
value null.

The leaf branch of json_encode_field (used by from_dict and the
on_mixed_types="use_json" path) ran ujson_loads(None), which raised, then
fell back to ujson_dumps(None) and produced "null". Json.encode_example
had the same issue, Json.decode_example crashed on None, and cast_storage
re-serialized None entries to "null".

Now all of these keep None as a real Arrow null, so null_count, is_null,
filtering and Parquet roundtrips are correct. Decoded access is unchanged.

Added regression tests covering mixed and all-None Json() columns.

@lhoestq left a comment

Member

lgtm !

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lhoestq merged commit fd67320 into huggingface:main on Jun 5, 2026
12 of 14 checks passed

3 participants