Skip to content

ARROW-10987: [Rust] Interpret BinaryArray as JSON#8971

Closed
kflansburg wants to merge 3 commits intoapache:masterfrom
kflansburg:binary-to-json
Closed

ARROW-10987: [Rust] Interpret BinaryArray as JSON#8971
kflansburg wants to merge 3 commits intoapache:masterfrom
kflansburg:binary-to-json

Conversation

@kflansburg
Copy link
Copy Markdown

Create lightweight wrapper, JSONArray to interpret BinaryArray values as serialized JSON. Leverage recent work for inferring JSON schema to support conversion to StructArray.

Example:

let json_array: JSONArray = binary_array.into();
let struct_array: StructArray = json_array.try_into().unwrap();

cc @nevi-me

@github-actions
Copy link
Copy Markdown

@codecov-io
Copy link
Copy Markdown

codecov-io commented Dec 19, 2020

Codecov Report

Merging #8971 (93aa609) into master (5eb6ce1) will increase coverage by 0.02%.
The diff coverage is 97.43%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8971      +/-   ##
==========================================
+ Coverage   83.19%   83.21%   +0.02%     
==========================================
  Files         199      200       +1     
  Lines       48661    48739      +78     
==========================================
+ Hits        40482    40558      +76     
- Misses       8179     8181       +2     
Impacted Files Coverage Δ
rust/arrow/src/array/array_json.rs 97.43% <97.43%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5eb6ce1...93aa609. Read the comment docs.

@kflansburg kflansburg marked this pull request as ready for review December 20, 2020 02:18
@jorgecarleitao
Copy link
Copy Markdown
Member

Hi @kflansburg .

Every array type in the arrow crate has an associated DataType that is part of the arrow specification. In particular, they all implement the trait Array, which has a method data_type, which allow interoperability with other implementations via the c data interface, flight protocol, parquet format, etc.

I can't find a type "JSON" in the arrow specification. Could you clarify how this new struct would fit?

@kflansburg
Copy link
Copy Markdown
Author

Hey @jorgecarleitao ,

Thanks for the info. It sounds, then, like this would be better suited as a method on BinaryArray. Would you agree?

FWIW, this is a light wrapper around BinaryArray, and implements Deref. The underlying memory would be DataType::Binary.

@jorgecarleitao
Copy link
Copy Markdown
Member

Maybe @nevi-me can provide a better insight here, but IMO the conversion [Large]BinaryArray to StructArray is a kernel with a signature

to_struct<T>(array: GenericBinaryArray<T>, schema: &Schema, error_on_invalid: bool) -> Result<StructArray>

or something similar.

@kflansburg
Copy link
Copy Markdown
Author

I can see how this would be a kernel, however it should probably be named more specifically since there will an identical type signature for different formats (JSON, proto, etc.).

@nevi-me
Copy link
Copy Markdown
Contributor

nevi-me commented Dec 22, 2020

Hi @kflansburg @jorgecarleitao , I offered the suggestion, so I should have been more explicit on how we could do this in a spec-compliant fashion. I apologise for that.

The spec doesn't have a JSONArray, but it has an ExtensionArray, which would be one of the logical types that Arrow supports, but with extra metadata that defines that it's a specific type. JSON, UUID are good examples, where the datatype could be StringArray and FixedSizeBinaryArray respectively.

In order to take advantage of the JSON conversion, we could:

  • Implement https://issues.apache.org/jira/browse/ARROW-10258, which is the PR for ExtensionArray (I still intend on getitng to this by this release, but help is welcome).
  • Add a function that converts a StringArray to ExtensionArray with the metadata indicating that it's a JSON array.

We'd have to look at the other language implementations, and also plan out the API so it can cover its intended use-cases adequately.

I personally see the need to be able to convert a StringArray that has JSON data, into a StructArray. One concrete example is that Postgres and other databases have a JSON type, which is externally represented as a string with valid JSON.
Given that adding this functionality wouldn't introduce any dependencies, I think it can/should live in the arrow crate.

@kflansburg
Copy link
Copy Markdown
Author

@nevi-me, Looking at PyArrow's implementation of extension types, they appear to be creating a new type (UuidType) which wraps the underlying array, much like I'm doing here. Based on this, I'm thinking the following would match the spirit of the Extension specification:

  1. Add a default-empty metadata map to Field.
  2. Possibly define ExtensionType: Array trait which at the very least outputs the required metadata.
  3. Move JSONArray to a new module called extension and rename to JSONType.
  4. Ensure that Fields for JSONArrays have ARROW:extension:name=json and ARROW:extension:metadata= (empty).
  5. Update various Arrow reading code to capture Field metadata and produce JSONType when appropriate.

The remaining concern I have here is that there appears to be no other libraries implementing this (JSON), so interoperability seems unlikely.

Finally, you mention using StringArray as the underlying type, however there are a number of use-cases where a BinaryArray will be the input, and since serde_json supports parsing &[u8], it would be nice to be able to skip the extra utf-8 validation step. Thoughts?

@alamb
Copy link
Copy Markdown
Contributor

alamb commented Feb 13, 2021

What is the status of this PR?

As part of trying to clean up the backlog of Rust PRs in this repo, I am going through seemingly stale PRs and pinging the authors to see if there are any plans to continue the work or conversation.

@kflansburg
Copy link
Copy Markdown
Author

No actionable feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants