feat: encode FixedSizeBinary in JSON as hex string #5622
Conversation
Adds encoding support to the JSON writer for the FixedSizeBinary DataType. A test was added as well.
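The rule the PR adds can be sketched standalone (this is an illustrative sketch, not the actual arrow-json code): each byte of a FixedSizeBinary value becomes two lowercase hex digits, and the whole value is wrapped in double quotes so it serializes as a JSON string.

```rust
use std::fmt::Write;

// Illustrative sketch of the encoding rule: every byte is written as
// two lowercase hex digits, surrounded by double quotes so the result
// is a valid JSON string token.
fn encode_fixed_size_binary(value: &[u8], out: &mut String) {
    out.push('"');
    for byte in value {
        write!(out, "{byte:02x}").unwrap();
    }
    out.push('"');
}

fn main() {
    let mut out = String::new();
    encode_fixed_size_binary(&[0xde, 0xad, 0xbe, 0xef], &mut out);
    assert_eq!(out, "\"deadbeef\"");
    println!("{out}");
}
```

The `{byte:02x}` format specifier zero-pads single-digit bytes, so `0x0f` encodes as `"0f"` rather than `"f"`.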
arrow-json/Cargo.toml
Outdated
bytes = "1.4"
criterion = { version = "0.5", default-features = false }
rand = { version = "0.8", default-features = false, features = ["std", "std_rng"] }
hex = "0.4.3"
No, it is not required. I added it so that, in the test I added, it was apparent that the bytes in the column resulted in the correct hex string. I will see if I can write the test in a way that gets by without it.
I updated the test so that it no longer relies on the hex crate, and removed the dep. I think it would be nice to have a test that fully round-trips data, first by decoding from hex to the FixedSizeBinary, then encoding back. But I am not sure if we want to support this behaviour on the JSON reader, and this issue is only about the write side.
I realize now that the test I wrote is completely bogus (not valid JSON), so I am reworking it.
The fixed size binary values were not being encoded with surrounding double quotes. This fixes that, and updates the added test to actually parse the written JSON with serde_json and make assertions against the result.
arrow-json/src/writer/encoder.rs
Outdated
impl Encoder for FixedSizeBinaryEncoder {
    fn encode(&mut self, idx: usize, out: &mut Vec<u8>) {
        let v = self.0.value(idx);
I think this needs to check is_valid before serializing (I think you'll see the need when testing for null).
Have the FixedSizeBinaryEncoder for the JSON writer handle explicit null values, based on the Writer's EncoderOptions.
I think this is good to go -- any concerns @tustvold?
Changed the FixedSizeBinaryEncoder for the JSON writer to use a borrow of the FixedSizeBinaryArray being encoded, to follow other Encoder implementations, and to remove the use of clone.
@alamb - I changed the
BooleanEncoder and StringEncoder were changed to use borrows of their respective Array types, to avoid cloning.
I also noticed that the
arrow-json/src/writer/encoder.rs
Outdated
impl<'a> Encoder for FixedSizeBinaryEncoder<'a> {
    fn encode(&mut self, idx: usize, out: &mut Vec<u8>) {
        if self.array.is_valid(idx) {
IIRC nullability is handled by the parent, so this shouldn't be necessary for a non-nested type.
I double checked by making the following change locally and the tests all still pass. Thus I agree the valid check is not needed
diff --git a/arrow-json/src/writer/encoder.rs b/arrow-json/src/writer/encoder.rs
index 27fa5747288..470d2fadd06 100644
--- a/arrow-json/src/writer/encoder.rs
+++ b/arrow-json/src/writer/encoder.rs
@@ -101,7 +101,7 @@ fn make_encoder_impl<'a>(
DataType::FixedSizeBinary(_) => {
let array = array.as_any().downcast_ref::<FixedSizeBinaryArray>().unwrap();
- (Box::new(FixedSizeBinaryEncoder::new(array, options)) as _, array.nulls().cloned())
+ (Box::new(FixedSizeBinaryEncoder::new(array)) as _, array.nulls().cloned())
}
DataType::Struct(fields) => {
@@ -451,14 +451,12 @@ impl<'a> Encoder for MapEncoder<'a> {
struct FixedSizeBinaryEncoder<'a> {
array: &'a FixedSizeBinaryArray,
- explicit_nulls: bool,
}
impl<'a> FixedSizeBinaryEncoder<'a> {
- fn new(array: &'a FixedSizeBinaryArray, options: &EncoderOptions) -> Self {
+ fn new(array: &'a FixedSizeBinaryArray) -> Self {
Self {
array,
- explicit_nulls: options.explicit_nulls,
}
}
}
@@ -473,8 +471,6 @@ impl<'a> Encoder for FixedSizeBinaryEncoder<'a> {
write!(out, "{byte:02x}").unwrap();
}
out.push(b'"');
- } else if self.explicit_nulls {
- out.extend_from_slice(b"null");
}
}
}
Indeed, as per #5622 (comment), when I first added the test, it was passing without adding the explicit_nulls option to the FixedSizeBinaryEncoder, but I added it anyway. I am totally fine with taking it out.
A couple of reference counts are irrelevant compared to encoding overheads, but I suppose it can't hurt to avoid them.
alamb
left a comment
Thanks @hiltontj -- I think this PR could be merged as is.
@tustvold has a good suggestion for simplification https://github.com/apache/arrow-rs/pull/5622/files#r1562191540 but I also think we could do that as a follow-on PR if you prefer.
Just let me know what you would prefer
@alamb - I will update as suggested on this PR and re-request review, no problem!
The FixedSizeBinaryEncoder does not need to handle nulls, as that will be handled by a parent encoder, i.e., list/map.
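The design settled on above can be illustrated with a minimal sketch (not the actual arrow-json internals): the code driving the encoders consults the column's validity and only calls the per-type encoder for valid slots, emitting `null` itself when explicit nulls are requested, so the FixedSizeBinaryEncoder never needs its own validity check.

```rust
// Sketch of parent-driven null handling. `validity` stands in for
// arrow's NullBuffer; `encode_value` stands in for a leaf encoder
// such as FixedSizeBinaryEncoder, which only ever sees valid indices.
fn encode_rows(
    validity: &[bool],
    explicit_nulls: bool,
    mut encode_value: impl FnMut(usize, &mut String),
    out: &mut String,
) {
    let mut parts = Vec::new();
    for (idx, valid) in validity.iter().enumerate() {
        if *valid {
            let mut s = String::new();
            encode_value(idx, &mut s);
            parts.push(s);
        } else if explicit_nulls {
            // The parent, not the leaf encoder, writes the null.
            parts.push("null".to_string());
        }
    }
    out.push('[');
    out.push_str(&parts.join(","));
    out.push(']');
}

fn main() {
    let values = ["\"00ff\"", "\"abcd\"", "\"1234\""];
    let mut out = String::new();
    encode_rows(&[true, false, true], true, |i, s| s.push_str(values[i]), &mut out);
    assert_eq!(out, r#"["00ff",null,"1234"]"#);
    println!("{out}");
}
```

Keeping null handling in one place means every leaf encoder stays a pure value-to-bytes function, which is why the `is_valid` branch and the `explicit_nulls` field could both be dropped in the review diff above.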
Which issue does this PR close?
Closes #8736.
Rationale for this change
See linked issue.
What changes are included in this PR?
Add JSON decoders for binary array variants that act as counterparts to #5622. This way, it becomes possible to do a full round-trip encoding/decoding of binary arrays.
Are these changes tested?
I added a roundtrip test based on `test_writer_binary`. It verifies that encoding and then decoding leads to the original input again. It covers `Binary`, `LargeBinary`, `FixedSizeBinary`, and `BinaryView` arrays, all with and without explicit nulls.
Are there any user-facing changes?
Yes, encoding and decoding binary arrays to/from JSON is now fully supported, given the right schema. One limitation is that schema inference is not able to detect binary arrays, as they look like normal JSON strings after encoding. However, this is already true when encoding other Arrow types; for example, it is not possible to differentiate integer bit widths. I updated the docs accordingly.
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
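The decode direction described above can be sketched in isolation (a hypothetical helper, not the arrow-json API): a hex string is turned back into raw bytes, validating against the byte width that `DataType::FixedSizeBinary` carries.

```rust
// Hypothetical sketch of the read side: decode a JSON hex string back
// into the raw bytes of a FixedSizeBinary value. `byte_width` mirrors
// the width stored in DataType::FixedSizeBinary.
fn decode_hex_fixed(s: &str, byte_width: usize) -> Result<Vec<u8>, String> {
    // Reject non-ASCII input up front so the 2-char slices below are safe,
    // and enforce exactly two hex digits per byte.
    if !s.is_ascii() || s.len() != 2 * byte_width {
        return Err(format!("expected {} hex characters", 2 * byte_width));
    }
    (0..s.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&s[i..i + 2], 16).map_err(|e| e.to_string()))
        .collect()
}

fn main() {
    assert_eq!(decode_hex_fixed("deadbeef", 4).unwrap(), vec![0xde, 0xad, 0xbe, 0xef]);
    assert!(decode_hex_fixed("dead", 4).is_err()); // wrong width
    assert!(decode_hex_fixed("zzzzzzzz", 4).is_err()); // not hex
    println!("ok");
}
```

The width check is what makes FixedSizeBinary stricter than the variable-width `Binary` variants: a value of the wrong length is a schema violation, not just odd data.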
Which issue does this PR close?
Closes #5620.
Rationale for this change
See #5620.
What changes are included in this PR?
Adds encoding support to the JSON writer for the FixedSizeBinary data type. A simple test was added as well. I can add more extensive testing if needed.
Are there any user-facing changes?
Potentially some documentation to state that FixedSizeBinary is encoded as hex strings.
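As a quick illustration of the hex-string encoding this PR documents, and the round-trip the follow-up decoder work enables, here is a self-contained sketch (hypothetical helpers, not the arrow-json API):

```rust
// Encode raw bytes as the lowercase hex string form used for
// FixedSizeBinary values in JSON output.
fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}

// Decode a well-formed hex string back into bytes (panics on bad
// input; a real decoder would return a Result instead).
fn from_hex(s: &str) -> Vec<u8> {
    (0..s.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&s[i..i + 2], 16).unwrap())
        .collect()
}

fn main() {
    let original: Vec<u8> = vec![1, 2, 254, 255];
    let hex = to_hex(&original);
    assert_eq!(hex, "0102feff");
    assert_eq!(from_hex(&hex), original); // lossless round trip
    println!("{hex}");
}
```

Since the hex form is indistinguishable from an ordinary JSON string, reading it back as binary requires supplying a schema; inference alone cannot recover the FixedSizeBinary type.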