Implement hex decoding of JSON strings to binary arrays #8737
Merged
alamb merged 9 commits into apache:main on Nov 6, 2025
The `writer::encoder::BinaryEncoder` encodes binary arrays as hex-encoded JSON strings. This commit adds support for decoding these strings again.
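As a rough illustration of the encoding this PR reverses, the sketch below hex-encodes a byte slice into a string and decodes it again using only the standard library. The names `encode_hex` and `decode_hex` are hypothetical, the lowercase two-digit output format is an assumption about what the writer emits, the odd-length check is an added assumption, and a plain `String` error stands in for `ArrowError`. It is not the code from this PR.

```rust
use std::fmt::Write;

/// Hex-encode bytes as two lowercase digits per byte.
/// (Hypothetical helper; the lowercase format is an assumption.)
fn encode_hex(bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len() * 2);
    for byte in bytes {
        // Writing into a String cannot fail, so unwrapping is safe here.
        write!(out, "{byte:02x}").unwrap();
    }
    out
}

/// Decode pairs of hex digits back into bytes.
/// (Hypothetical helper; a String error stands in for ArrowError.)
fn decode_hex(hex: &str) -> Result<Vec<u8>, String> {
    // Assumption: an odd number of digits is treated as an error.
    if hex.len() % 2 != 0 {
        return Err(format!("odd-length hex string: {hex:?}"));
    }
    let mut out = Vec::with_capacity(hex.len() / 2);
    for pair in hex.as_bytes().chunks(2) {
        // Fails if a multi-byte UTF-8 character was split across chunks.
        let digits = std::str::from_utf8(pair).map_err(|e| e.to_string())?;
        let byte = u8::from_str_radix(digits, 16).map_err(|e| e.to_string())?;
        out.push(byte);
    }
    Ok(out)
}

fn main() {
    let original = vec![0x00, 0xff, 0x1a];
    let encoded = encode_hex(&original);
    let decoded = decode_hex(&encoded).expect("valid hex");
    // Encoding followed by decoding yields the original bytes.
    assert_eq!(decoded, original);
}
```

In JSON, the encoded value is an ordinary string, which is why the reader needs a schema to know that it should be decoded back into binary.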
alamb approved these changes on Nov 1, 2025
fn decode_hex_string(hex_string: &str) -> Result<Vec<u8>, ArrowError> {
    let mut decoded = Vec::with_capacity(hex_string.len() / 2);
    for substr in hex_string.as_bytes().chunks(2) {
        let str = std::str::from_utf8(substr).map_err(|e| {
Contributor
I am pretty sure this code could be made much faster with a custom lookup table rather than using `u8::from_str_radix` etc.
That being said, that would be a nice thing to improve in a future PR.
Contributor
Author
Probably, also because we can make stronger assumptions than the requirements of `from_str_radix`.
I don't have time to look into this right now though, so maybe we can leave this for a future PR.
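The lookup-table idea discussed in this thread could look roughly like the sketch below. It is not part of this PR: `HEX_TABLE`, `INVALID`, and `decode_hex_lut` are hypothetical names, a `String` error stands in for `ArrowError`, and rejecting odd-length input is an assumption. No performance claim is made here; any speedup would need to be benchmarked.

```rust
/// Sentinel marking bytes that are not hex digits (hypothetical name).
const INVALID: u8 = 0xff;

/// 256-entry table mapping an ASCII byte to its 4-bit value,
/// built at compile time. Non-hex bytes map to INVALID.
const HEX_TABLE: [u8; 256] = {
    let mut table = [INVALID; 256];
    let mut i = 0;
    while i < 10 {
        table[b'0' as usize + i] = i as u8;
        i += 1;
    }
    let mut i = 0;
    while i < 6 {
        // Accept both lowercase and uppercase digits.
        table[b'a' as usize + i] = 10 + i as u8;
        table[b'A' as usize + i] = 10 + i as u8;
        i += 1;
    }
    table
};

/// Decode a hex string with two table lookups per output byte.
/// (A String error stands in for ArrowError in this sketch.)
fn decode_hex_lut(hex: &str) -> Result<Vec<u8>, String> {
    let bytes = hex.as_bytes();
    // Assumption: an odd number of digits is treated as an error.
    if bytes.len() % 2 != 0 {
        return Err(format!("hex string has odd length {}", bytes.len()));
    }
    let mut decoded = Vec::with_capacity(bytes.len() / 2);
    for pair in bytes.chunks_exact(2) {
        let hi = HEX_TABLE[pair[0] as usize];
        let lo = HEX_TABLE[pair[1] as usize];
        if hi == INVALID || lo == INVALID {
            return Err(format!("invalid hex digit in {hex:?}"));
        }
        decoded.push((hi << 4) | lo);
    }
    Ok(decoded)
}

fn main() {
    let decoded = decode_hex_lut("deadBEEF").expect("valid hex");
    assert_eq!(decoded, vec![0xde, 0xad, 0xbe, 0xef]);
}
```

Working on raw bytes avoids the per-chunk UTF-8 validation, and the table accepts only the 22 hex digit characters. `u8::from_str_radix` also tolerates a leading `+`, so an input such as `"+f"` parses successfully there but is rejected by the table, which is one example of the stricter assumptions a custom decoder can make.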
Contributor
I merged up from main and fixed a small clippy lint to get this PR to pass CI checks.
Contributor
Thanks again @phil-opp
Which issue does this PR close?
`arrow-json` supports encoding binary arrays, but not decoding #8736
Rationale for this change
See linked issue.
What changes are included in this PR?
Add JSON decoders for binary array variants that act as counterparts to #5622. This way, it becomes possible to do a full round-trip encoding/decoding of binary arrays.
Are these changes tested?
I added a roundtrip test based on the `test_writer_binary` test. It verifies that encoding and then decoding leads to the original input again. It covers `Binary`, `LargeBinary`, `FixedSizeBinary`, and `BinaryView` arrays, all with and without explicit nulls.
Are there any user-facing changes?
Yes, encoding and decoding binary arrays to/from JSON is now fully supported, given the right schema.
One limitation is that schema inference is not able to detect binary arrays, as they look like normal JSON strings after encoding. However, this is already true when encoding other Arrow types; for example, it's not possible to differentiate integer bit widths.
I updated the docs accordingly.
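To illustrate the schema inference limitation mentioned above, the following sketch shows why a hex-encoded value cannot be told apart from ordinary text by inspecting the string alone. The helper `could_be_hex_encoded` is hypothetical and is not part of `arrow-json`.

```rust
/// Returns true if `s` consists of an even number of ASCII hex digits,
/// meaning it could be the hex encoding of some byte sequence.
/// (Hypothetical helper for illustration only.)
fn could_be_hex_encoded(s: &str) -> bool {
    s.len() % 2 == 0 && s.bytes().all(|b| b.is_ascii_hexdigit())
}

fn main() {
    // A string that was presumably produced by hex-encoding four bytes.
    assert!(could_be_hex_encoded("deadbeef"));
    // Ordinary text values that pass exactly the same check.
    assert!(could_be_hex_encoded("cafe"));
    assert!(could_be_hex_encoded("facade"));
    assert!(could_be_hex_encoded("2025"));
    // Strings containing non-hex characters can be ruled out.
    assert!(!could_be_hex_encoded("hello"));
}
```

Because strings such as `"cafe"` are both plausible text and valid hex, only an explicit schema can say which interpretation is intended.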