Add support for tuple ClickHouse#29715
Conversation
|
Assigning reviewers. If you would like to opt out of this review, comment R: @robertwb for label java. Available commands:
The PR bot will only process comments in the main thread (not review comments). |
|
Cc @Abacn since you've reviewed the last 2 PRs adding ClickHouse data types |
Abacn
left a comment
There was a problem hiding this comment.
Thanks, did a quick pass and let some comments. However I am going to travel for the next weeks. Would be great if could find another reviewer or until late December
CHANGES.md
Outdated
| * Adding support for LowCardinality DataType in ClickHouse (Java) ([#29533](https://github.com/apache/beam/pull/29533)). | ||
| * Added support for handling bad records to KafkaIO (Java) ([#29546](https://github.com/apache/beam/pull/29546)) | ||
| * Add support for generating text embeddings in MLTransform for Vertex AI and Hugging Face Hub models.([#29564](https://github.com/apache/beam/pull/29564)) | ||
| * Adding support for Tuples DataType in ClickHouse (Java) ([Tuple Support](https://github.com/apache/beam/pull/29715)). |
There was a problem hiding this comment.
We can combine this with LowCardinality DataType above, e.g.
Adding support for LowCardinality and Tuples DataType in ClickHouse (Java) ([#29533](https://github.com/apache/beam/pull/29533), [#29715](https://github.com/apache/beam/pull/29715)).
There was a problem hiding this comment.
Release cut is today so this may not get into 2.53.0. Could leave it as is and fix when the PR is finalized
| * | ||
| * Nullable row columns are supported through Nullable type in ClickHouse. Low cardinality hint is | ||
| * supported through LowCardinality DataType in ClickHouse. | ||
| * supported through LowCardinality DataType in ClickHouse supported through Tuple DataType in |
There was a problem hiding this comment.
Consider put the documentation Tuple to the table above. Broken sentence here currently.
| .map(s -> s.trim().replaceAll(" +", "':;")) | ||
| .collect(Collectors.toList()); | ||
| String content = | ||
| String.join(",", l).trim().replaceAll("Tuple\\(", "Tuple('").replaceAll(",", ",'"); |
There was a problem hiding this comment.
The caller has a condition if (type.toLowerCase().trim().startsWith("tuple(")) { below, so here payload always (case insensitive) starts with "tuple(", so here the replace of "Tuple\(" won't work
There was a problem hiding this comment.
The condition is on with lower case, but the actual parsing is on the original string
There was a problem hiding this comment.
I mean below when " if (type.toLowerCase().trim().startsWith("tuple(")) {" the this preprocessing will be executed.
So Tuple( (with backslash) is considered here but not the condition in the caller there. Coild this cause issue?
There was a problem hiding this comment.
You're right. Would you mind adding some comments about the proprocessing rule and example, for example, sth like // Tuple(a String, b Integer) -> ...
| | < BOOL : "BOOL" > | ||
| | < LOWCARDINALITY : "LOWCARDINALITY" > | ||
| | < TUPLE : "TUPLE" > | ||
| | < COLON : ":" > |
There was a problem hiding this comment.
Is COLON and SEMI_COLON for tuple or another feature?
There was a problem hiding this comment.
Using that since i have some ambiguity in parsing (looking for other options), but for now decided to leave it that way
There was a problem hiding this comment.
If it's get added to SDK then there is expectation of backward compatibility. So I would prefer to keep consistent with ClickHouse's SQL syntax
There was a problem hiding this comment.
It's a temporary solution since there is some ambiguity when using javacc having difficulty to detecting between tokens and strings
There was a problem hiding this comment.
Removed COLON and SEMI_ COLON
| Map<String, ColumnType> m2 = new HashMap<>(); | ||
| m2.put("a1", ColumnType.STRING); | ||
| m2.put("b", ColumnType.BOOL); | ||
| ColumnType columnType02 = ColumnType.parse("Tuple('a1':;String,'b':;Bool)"); |
There was a problem hiding this comment.
Is :; a standard syntax for ClickHouse? https://clickhouse.com/docs/en/sql-reference/data-types/tuple seems not mentioning this pattern.
|
Reminder, please take a look at this pr: @robertwb @chamikaramj |
|
Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment R: @kennknowles for label java. Available commands:
|
Abacn
left a comment
There was a problem hiding this comment.
Thanks, the syntax now looks good and consistent with clickhouse documentation. Replied to previous comment
CHANGES.md
Outdated
| * Added support for handling bad records to KafkaIO (Java) ([#29546](https://github.com/apache/beam/pull/29546)) | ||
| * Add support for generating text embeddings in MLTransform for Vertex AI and Hugging Face Hub models.([#29564](https://github.com/apache/beam/pull/29564)) | ||
| * NATS IO connector added (Go) ([#29000](https://github.com/apache/beam/issues/29000)). | ||
| * Adding support for LowCardinality and Tuples DataType in ClickHouse (Java) ([#29533](https://github.com/apache/beam/pull/29533), [#29715](https://github.com/apache/beam/pull/29715)). |
There was a problem hiding this comment.
LowCardinality will go to 2.53.0 but Tuples does not, as 2.53.0 is already cut
| * | ||
| * Nullable row columns are supported through Nullable type in ClickHouse. Low cardinality hint is | ||
| * supported through LowCardinality DataType in ClickHouse. | ||
| * supported through LowCardinality DataType in ClickHouse supported through Tuple DataType in |
| .map(s -> s.trim().replaceAll(" +", "':;")) | ||
| .collect(Collectors.toList()); | ||
| String content = | ||
| String.join(",", l).trim().replaceAll("Tuple\\(", "Tuple('").replaceAll(",", ",'"); |
There was a problem hiding this comment.
I mean below when " if (type.toLowerCase().trim().startsWith("tuple(")) {" the this preprocessing will be executed.
So Tuple( (with backslash) is considered here but not the condition in the caller there. Coild this cause issue?
|
waiting on author |
|
@Abacn |
You're right, sorry for confusion |
| case TUPLE: | ||
| List<Schema.Field> fields = | ||
| columnType.tupleTypes().entrySet().stream() | ||
| .map(x -> Schema.Field.of(x.getKey(), Schema.FieldType.DATETIME)) |
There was a problem hiding this comment.
Hi @mzitnik,
Shouldn't each key inside a tuple be mapped to their respective data type ? Any reason for being DATETIME?
There was a problem hiding this comment.
Hi,
This quick adaptation seems to work:
case TUPLE:
List<TableSchema.Column> columns = columnType.tupleTypes().entrySet().stream().map(
x -> TableSchema.Column.of(x.getKey(), x.getValue())
).collect(Collectors.toList());
TableSchema tupleSchema = TableSchema.of(columns.toArray(new TableSchema.Column[0]));
Schema tupleAsRowSchema = getEquivalentSchema(tupleSchema);
return Schema.FieldType.row(tupleAsRowSchema);
There was a problem hiding this comment.
Hi,
Also, it seems that Array(Tuple(*)) is not supported when calling ClickHouseIO.getTableSchema.
This seems to fix the issue :
if (type.toLowerCase().trim().startsWith("tuple(")) { String content = tuplePreprocessing(type); columnType = TableSchema.ColumnType.parse(content); } else if (type.toLowerCase().trim().startsWith("array(tuple(")) { String content = tuplePreprocessing(type); columnType = TableSchema.ColumnType.parse(content); }
Add support for tuple ClickHouse
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, commentfixes #<ISSUE NUMBER>instead.CHANGES.mdwith noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.