-
Notifications
You must be signed in to change notification settings - Fork 1.9k
[draft] Add LogicalType, try to support user-defined types
#8143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Current PR has some unresolved issues requiring collaboration for discussion. |
|
I've organized the logic for the mutual conversion between DFSchema to SchemaNo need to changeDefaultPhysicalPlanner
To be changed
Schema to DFSchema (To be changed)
|
|
Thanks @yukkit -- I plan to give this a look, but probably will not have time until tomorrow |
| } | ||
| } | ||
|
|
||
| async fn create_type(&self, cmd: CreateType) -> Result<DataFrame> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the core processing logic for creating SQL UDT.
| self.register_data_type( | ||
| type_signature.to_owned_type_signature(), | ||
| extension_type, | ||
| )?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Register the extension types created through SQL UDT.
| } | ||
| } | ||
|
|
||
| impl TypeManager for SessionState { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SessionState implements TymeManager for registering and retrieving extended data types.
| pub trait ExprSchemable { | ||
| /// given a schema, return the type of the expr | ||
| fn get_type<S: ExprSchema>(&self, schema: &S) -> Result<DataType>; | ||
| fn get_type<S: ExprSchema>(&self, schema: &S) -> Result<LogicalType>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replace all arrow DataType used in the logical planning stage with LogicalType(literal ScalarValue TBD).
| // TODO not convert to DataType | ||
| let data_types = data_types | ||
| .into_iter() | ||
| .map(|e| e.physical_type()) | ||
| .collect::<Vec<_>>(); | ||
|
|
||
| Ok((fun.return_type)(&data_types)?.as_ref().clone().into()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: Functions signatures for udf/udaf use TypeSignature instead of the arrow DataType
| .ok_or_else(|| { | ||
| plan_datafusion_err!( | ||
| // FIXME support logical data types, not convert to DataType | ||
| let data_type = comparison_coercion( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modify the function comparison_coercion to accept LogicalType as parameters.
| /// data types here come from function 'equal_rows', if more data types are supported | ||
| /// in equal_rows(hash join), add those data types here to generate join logical plan. | ||
| pub fn can_hash(data_type: &DataType) -> bool { | ||
| pub fn can_hash(data_type: &LogicalType) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function can_hash might be abstracted as a property of logical data types.
| // FIXME like_coercion use LogicalType | ||
| let left_type = expr.get_type(&self.schema)?.physical_type(); | ||
| let right_type = pattern.get_type(&self.schema)?.physical_type(); | ||
| let coerced_type = like_coercion(&left_type, &right_type).ok_or_else(|| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: Modify the function like_coercion to accept LogicalType as parameters.
| } | ||
|
|
||
| pub(crate) fn convert_data_type(&self, sql_type: &SQLDataType) -> Result<DataType> { | ||
| pub(crate) fn convert_data_type( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Convert data types in the AST to logical data types.
| SQLDataType::Custom(name, params) => { | ||
| let name = object_name_to_string(name); | ||
| let params = params.iter().map(Into::into).collect(); | ||
| let type_signature = TypeSignature::new_with_params(name, params); | ||
| if let Some(data_type) = self.context_provider.get_data_type(&type_signature) { | ||
| return Ok(data_type); | ||
| } | ||
|
|
||
| plan_err!("User-Defined SQL type {sql_type:?} not found") | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Convert custom data types.
LogicalType, try to support user-defined types
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an epic PR @yukkit -- I looked through it and I really liked the structure and the overall code. Really nice 👏
I think the next steps for this PR are to circulate it more widely -- perhaps via a note to the arrow mailing list, and discord / slack channels (I can help with this) and potentially have people gauge how much disruption this will cause with downstream crates (I can test with IOx perhaps, and see what that experience is like)
Again, thanks for pushing this forward
| _ => dt1 == dt2, | ||
| } | ||
| fn datatype_is_logically_equal(dt1: &LogicalType, dt2: &LogicalType) -> bool { | ||
| dt1 == dt2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that is certainly nicer
| data_type: LogicalType, | ||
| nullable: bool, | ||
| /// A map of key-value pairs containing additional custom meta data. | ||
| metadata: HashMap<String, String>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could store in a BTreeMap to avoid sorting them each time 🤔
| use arrow_schema::{DataType, Field, IntervalUnit, TimeUnit}; | ||
|
|
||
| #[derive(Clone, Debug)] | ||
| pub enum LogicalType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eventually we should add doc comments here, but it also makes sense to avoid over doing it on RFC / draft.
| .collect::<Vec<_>>(); | ||
| DataType::Struct(fields.into()) | ||
| } | ||
| other => panic!("not implemented {other:?}"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the idea that DataType::Dictionary and DataType::RunEndEncoded would also be included here? If so it makes sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the heads up. This should include DataType::Dictionary and DataType::RunEndEncoded here.
| pub format: Arc<dyn FileFormat>, | ||
| /// The expected partition column names in the folder structure. | ||
| /// See [Self::with_table_partition_cols] for details | ||
| /// TODO this maybe LogicalType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think one usecase is to use dictionary encoding for the partition columns to minimize the overhead of creating such columns. As long as they can be represented / specified with LogicalType I think it is a good idea to try changing this too.
|
What's the status of this pr? This should be a very useful feature. |
|
I think this PR is stalled and I don't have any update |
|
Please accept my apologies for the delay. Due to personal circumstances, I have been unable to attend to any work. I will now proceed to resume work on this PR. |
|
No worries at all -- I hope all is well and we look forward to this work |
|
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
|
Hello, sorry if this is a redundant question. What is the status of this PR? |
I think it is stale and on track to be closed from what I can see |
|
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
|
FYI #11160 tracks a new proposal for this feature. It seems to be gaining traction |
Which issue does this PR close?
Closes #7923 .
Current Pull Request is an Experimental Demo for Validating the Feasibility of Logical Types
Rationale for this change
What changes are included in this PR?
Features
UDTas a function signature inudf/udaf.register_data_typefunction in theSessionContext.New Additions
LogicalTypestruct.ExtensionTypetrait. Abstraction for extension types.TypeSignaturestruct. Uniquely identifies a data type.Major Changes
get_data_type(&self, _name: &TypeSignature) -> Option<LogicalType>function to theContextProvidertrait.DFSchema,DFFieldnow usesLogicalType, removing arrowFieldand retaining onlydata_type,nullable,metadatasincedict_id,dict_is_orderedare not necessary at the logical stage.ExprSchemableandExprSchemanow useLogicalType.astto logical plan conversion now usesLogicalType.To Be Implemented
TypeCoercionRewriterin the analyze stage uses logical types. For example, functions likecomparison_coercion,get_input_types,get_valid_types, etc.udf/udafuseTypeSignatureinstead of the existingDataTypefor ease of use inudf/udaf.To Be Determined
ScalarValueuseLogicalTypeor arrowDataType?LogicalType.DataTypeTableSourcereturnDFSchemaor arrowSchema?Schema.DFSchemaDFSchematoSchema; logical plans useDFSchema, physical plans useSchema).SchemaandDFSchemaSchematoDFSchema?TableScannode, obtain arrowSchemathroughTableSource/TableProviderand then convert it toDFSchema.TableSource/TableProviderreturnsDFSchemainstead ofSchema.DFSchematoSchema?SchemafromTableSourcein physical planner, no need for conversion.DFSchemareturned byTableSourcetoSchemain the physical planner stage.Some Thoughts
dyn ArraytoLineStringArrayorMultiPointArraywas raised. In my perspective, assuming there is a function specifically designed for handlingLineStringdata, the function signature can be defined asLineString, ensuring that the input data must be of a type acceptable byLineStringArray.Are these changes tested?
Are there any user-facing changes?