-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Support Binary/LargeBinary --> Utf8/LargeUtf8 in ilike and string functions
#7840
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Binary/LargeBinary --> Utf8/LargeUtf8 coercionBinary/LargeBinary --> Utf8/LargeUtf8 in ilike
dd8f41e to
b932e1e
Compare
Binary/LargeBinary --> Utf8/LargeUtf8 in ilike Binary/LargeBinary --> Utf8/LargeUtf8 in ilike and string functions
…ring functions fixup
b932e1e to
56514be
Compare
| DataType::LargeUtf8 => $largeUtf8Type, | ||
| // LargeBinary inputs are automatically coerced to Utf8 | ||
| DataType::LargeBinary => $largeUtf8Type, | ||
| DataType::Utf8 => $utf8Type, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the fix for string functions taking binary arguments
The new cases for Binary/LargeBinary are all that is new here -- the rest is just comments
| /// Coercion rules for binary (Binary/LargeBinary) to string (Utf8/LargeUtf8): | ||
| /// If one argument is binary and the other is a string then coerce to string | ||
| /// (e.g. for `like`) | ||
| fn binary_to_string_coercion( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the new rule, only used for ilke
I originally tried modifying string_coercion to also include Binary --> String coercion, however, that has the (bad) side effect of making a binary = string comparison into cast(binary)= string which is bad because:
- The comparison is valid even if the binary column contains non UTF8 data (but the cast will fail in this case)
- The cast from binary --> utf8 is not free (it has to check each value for utf8 correctness)
| /// This is a union of string coercion rules and dictionary coercion rules | ||
| pub fn like_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> { | ||
| string_coercion(lhs_type, rhs_type) | ||
| .or_else(|| binary_to_string_coercion(lhs_type, rhs_type)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the fix for like
| let (lhs, rhs) = get_input_types(&$A_TYPE, &$OP, &$B_TYPE)?; | ||
| assert_eq!(lhs, $C_TYPE); | ||
| assert_eq!(rhs, $C_TYPE); | ||
| ($LHS_TYPE:expr, $RHS_TYPE:expr, $OP:expr, $RESULT_TYPE:expr) => {{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the same logic, but I renamed the arguments for clarity
|
|
||
| query error DataFusion error: type_coercion | ||
| SELECT largebinary FROM t where largebinary LIKE '%F'; | ||
| query ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They work 🎉
|
@jonahgao I wonder if you have some time to review this PR cc @Weijun-H and @JayjeetAtGithub and @parkma99 |
jackwener
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I review this PR carefully, a great job!
Thanks @alamb .
|
@alamb LGTM 👍. I also tested some other string functions on this branch, and they worked as expected. |
Weijun-H
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks @alamb 👍
|
Thanks everyone for the reviews! |
| // `utf8_to_str_type`: returns either a Utf8 or LargeUtf8 based on the input type size. | ||
| make_utf8_to_return_type!(utf8_to_str_type, DataType::LargeUtf8, DataType::Utf8); | ||
|
|
||
| // `utf8_to_str_type`: returns either a Int32 or Int64 based on the input type size. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
utf8_to_str_type -> utf8_to_int_type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice find -- filed #7861
|
|
||
| /// Coercion rules for Strings: the type that both lhs and rhs can be | ||
| /// casted to for the purpose of a string computation | ||
| /// Coercion rules for string types (Utf8/LargeUtf8): If at least on argument is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at least on argument -> at least one argument
| } | ||
| } | ||
|
|
||
| /// Coercion rules for binary types (Binary/LargeBinary): If at least on argument is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at least on argument -> at least one argument
| /// | ||
| /// For example, if a function `func` accepts arguments of `(int64, int64)`, | ||
| /// but was called with `(int32, int64)`, this function could match the | ||
| /// valid_types by by coercing the first argument to `int64`, and would return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
valid_types by by coercing -> valid_types by coercing
viirya
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found a few typos.
|
Thank you @viirya -- I will fix the typos |
PR to fix: #7873 |
Which issue does this PR close?
Closes #7342
Closes #7345
Closes #7344
It also closes several draft / PRs:
Closes #7349
Closes #7350
Closes #7365
Rationale for this change
Some of the clickbench queries have this pattern
What changes are included in this PR?
BinaryandLargeBinaryto the appropriateUtf8(string) variants forLIKEandILIKEAre these changes tested?
Yes, new tests
Are there any user-facing changes?
Yes now some queries will not error