Adding Beam Schemas capability to parse json-schemas. This is the de-…#24271
Adding Beam Schemas capability to parse json-schemas. This is the de-…#24271pabloem merged 14 commits intoapache:masterfrom
Conversation
|
Assigning reviewers. If you would like to opt out of this review, comment R: @apilloud for label java. Available commands:
The PR bot will only process comments in the main thread (not review comments). |
Codecov Report
@@ Coverage Diff @@
## master #24271 +/- ##
==========================================
- Coverage 73.43% 73.37% -0.07%
==========================================
Files 716 717 +1
Lines 96571 96939 +368
==========================================
+ Hits 70920 71128 +208
- Misses 24328 24488 +160
Partials 1323 1323
Flags with carried forward coverage won't be shown. Click here to find out more.
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
lukecwik
left a comment
There was a problem hiding this comment.
What do we do with properties that don't translate to beam schemas like array uniqueness of items?
We should describe to users what we reject and be explicit about what we support and what we ignore for each JSON type from https://json-schema.org/understanding-json-schema/reference/index.html
We'll also want to be careful when supporting nulls and test for the null case explicitly.
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/JsonUtils.java
Outdated
Show resolved
Hide resolved
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/JsonUtils.java
Outdated
Show resolved
Hide resolved
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/JsonUtils.java
Outdated
Show resolved
Hide resolved
sdks/java/core/src/test/resources/schemas/json/nested_arrays_objects_json_schema.json
Outdated
Show resolved
Hide resolved
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/JsonUtils.java
Outdated
Show resolved
Hide resolved
buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy
Outdated
Show resolved
Hide resolved
sdks/java/core/build.gradle
Outdated
There was a problem hiding this comment.
Should we make this feature require the user provide the library?
Need to analyze how large the dependency tree is and how stable it is.
There was a problem hiding this comment.
I've marked is as a provided dependency.
Before moving forward - another possibility is to create a new extension module to hold these dependencies and this functionality. The module would be used by some of our schema transform implementations (kafka and pubsub), so it would still come back as a depenency for other Beam modules, but not for core - wdyt?
sdks/java/core/src/test/resources/schemas/json/nested_arrays_objects_json_schema.json
Outdated
Show resolved
Hide resolved
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/JsonUtils.java
Outdated
Show resolved
Hide resolved
|
Should we add a SchemaProvider as well? |
There was a problem hiding this comment.
Is it possible to add the following support methods? I advocate that removing the embedded checks in this method allows these to be independently tested. First, the following shows how beamSchemaFromJsonSchema might look with the proposed overloaded support methods. Then, some of the support methods follow. To avoid the instanceof checks, I tried to find in the libraries Javadoc, whether the base Schema had a getType() getter but it seems not. Perhaps we could have a Map<Class<T extends org.everit.json.schema.Schema>, Function<T extends org.everit.json.schema.Schema, Schema.Field>> that maps a particular json Schema class to the Java function that converts to the Schema.Field.
Schema.Builder beamSchemaBuilder = Schema.builder()
for (String propertyName : jsonSchema.getPropertySchemas().keySet()) {
org.everit.json.schema.Schema propertySchema = jsonSchema.getPropertySchemas().get(propertyName);
if (propertySchema == null) {
throw new IllegalArgumentException("Unable to parse schema " + jsonSchema.toString());
}
Schema.Field field = beamFieldFromJsonField(propertySchema);
builder = builder.addField(field);
}
return builder.build();
static Schema.Field beamFieldFromJsonField(org.everit.json.schema.Schema propertySchema) {
if (propertySchema instanceof org.everit.json.schema.ObjectSchema) {
return beamFieldFromJsonField((org.everit.json.schema.ObjectSchema) propertySchema);
}
if (propertySchema instanceof org.everit.json.schema.ArraySchema) {
return beamFieldFromJsonField((org.everit.json.schema.ObjectSchema) propertySchema);
}
// etc etc to all the various types.
}
static Schema.Field beamFieldFromJsonField(org.everit.json.schema.ObjectSchema objectSchema) {
// do some magic
}
static Schema.Field beamFieldFromJsonField(org.everit.json.schema.ArraySchema arraySchema) {
// do some more magic
}
static Schema.Field beamFieldFromJsonField(org.everit.json.schema.BooleanField booleanSchema) {
// do even some more magical magic
}There was a problem hiding this comment.
The conversions for type to schema are short I think it would be simpler to not add this indirection
…facto standard way to define JSON schemas
35131d3 to
a14c8c4
Compare
953e918 to
342edca
Compare
Can you elaborate @reuvenlax ? Any pointers? : ) Happy to implement whatever... |
|
Run Java PreCommit |
3 similar comments
|
Run Java PreCommit |
|
Run Java PreCommit |
|
Run Java PreCommit |
I've tried to document in some detail what we do/don't support, as well as the required provided dependency. LMK what you think @lukecwik I've also added support for nullable/non-nullable fields (based on the |
|
Run Spotless PreCommit |
|
Run Java_GCP_IO_Direct PreCommit |
1 similar comment
|
Run Java_GCP_IO_Direct PreCommit |
|
Run Java_Debezium_IO_Direct PreCommit |
|
Run Java_PVR_Flink_Docker PreCommit |
|
@lukecwik please let me know what you think of the javadoc for it, and other changes. |
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/JsonUtils.java
Outdated
Show resolved
Hide resolved
| * <dependency> | ||
| * <groupId>com.github.erosb < /groupId> | ||
| * <artifactId>everit-json-schema < /artifactId> | ||
| * <version>1.14.1 < /version> |
There was a problem hiding this comment.
We should point the person to use the same version that Beam was tested with and not hard-code the version in the documentation.
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/JsonUtils.java
Outdated
Show resolved
Hide resolved
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/JsonUtils.java
Outdated
Show resolved
Hide resolved
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/JsonUtils.java
Outdated
Show resolved
Hide resolved
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/JsonUtils.java
Outdated
Show resolved
Hide resolved
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/JsonUtils.java
Show resolved
Hide resolved
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/JsonUtils.java
Outdated
Show resolved
Hide resolved
This only applies if you have a type that you want to convert to a schema and doesn't apply to schemas stored in an arbitrary location (e.g. string, file, URL, ...) |
Nice improvement over the original and good tests. |
…/JsonUtils.java Co-authored-by: Lukasz Cwik <lcwik@google.com>
…/JsonUtils.java Co-authored-by: Lukasz Cwik <lcwik@google.com>
…/JsonUtils.java Co-authored-by: Lukasz Cwik <lcwik@google.com>
…/JsonUtils.java Co-authored-by: Lukasz Cwik <lcwik@google.com>
…/JsonUtils.java Co-authored-by: Lukasz Cwik <lcwik@google.com>
45cfdd1 to
aff5ea4
Compare
|
Run Java_Examples_Dataflow PreCommit |
3 similar comments
|
Run Java_Examples_Dataflow PreCommit |
|
Run Java_Examples_Dataflow PreCommit |
|
Run Java_Examples_Dataflow PreCommit |
|
Run Java_GCP_IO_Direct PreCommit |
|
thanks all! |
apache#24271) * Adding Beam Schemas capability to parse json-schemas. This is the de-facto standard way to define JSON schemas * json sample schema files for tests * addressing comments * fixup * fixup * documenting and fixing nullable cases * fixup * Update sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/JsonUtils.java Co-authored-by: Lukasz Cwik <lcwik@google.com> * Update sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/JsonUtils.java Co-authored-by: Lukasz Cwik <lcwik@google.com> * Update sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/JsonUtils.java Co-authored-by: Lukasz Cwik <lcwik@google.com> * Update sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/JsonUtils.java Co-authored-by: Lukasz Cwik <lcwik@google.com> * Update sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/JsonUtils.java Co-authored-by: Lukasz Cwik <lcwik@google.com> * improve docs * fixup Co-authored-by: Lukasz Cwik <lcwik@google.com>
…facto standard way to define JSON schemas
r: @damondouglas
fyi: @reuvenlax @lukecwik - I am adding a new dependency to Java Core SDK. It is not exposed in the API. It is used to parse JSON schemas.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username).addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, commentfixes #<ISSUE NUMBER>instead.CHANGES.mdwith noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.