Enable creating external tables on cloud object stores #433
Conversation
src/config/schema.rs
Outdated
            .ok_or(DataFusionError::Execution(
                "'access_key_id' not found in provided options".to_string(),
            ))?
            .clone(),
        secret_access_key: map
            .get("secret_access_key")
            .ok_or(DataFusionError::Execution(
                "'secret_access_key' not found in provided options".to_string(),
            ))?
DataFusionError feels a bit foreign inside the config module.
Yeah, we could define something ourselves or reuse a serde parsing error, given this is part of the config deserialization logic?
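For illustration, a bespoke error along these lines could live in the config module instead of DataFusionError; the ConfigError name and its variant are hypothetical, not something this PR adds:

use std::fmt;

// Hypothetical error type for the config module (names are assumptions).
#[derive(Debug)]
pub enum ConfigError {
    // A required key (e.g. "access_key_id") was missing from the options map.
    MissingOption(String),
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigError::MissingOption(key) => {
                write!(f, "'{key}' not found in provided options")
            }
        }
    }
}

impl std::error::Error for ConfigError {}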
Right; tbh I'm still torn between doing this here (or somewhere else in our code), or going with the datafusion-cli helper functions (which would make this redundant).
Got it. It seems like datafusion-cli's dependencies are similar to ours anyway (e.g. clap, tokio, datafusion itself; mimalloc isn't though). Would it even be a meaningful size increase if we wanted to pull in datafusion-cli?
Looks like it's 61MB regardless of whether we include datafusion-cli or not. I was also wary of using the df-cli functions, as they initially source values from the env, and delta-rs used to dump option values to env vars when creating a new external table, but it seems this is no longer the case.
Given this, I will try this approach next.
Turns out delta-rs still dumps the option values to env vars when creating an external delta table on S3, which would interfere with creating a new object store in get_s3_object_store_builder.
So I'll leave the implementation as is for now (only changing the error above and adding some tests).
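As a sketch of the kind of test that could cover the error above (the exact signature of S3::from_bucket_and_options and the option types are assumptions, and the S3 config struct is assumed to be in scope, e.g. in a tests module of src/config/schema.rs):

use std::collections::HashMap;

#[test]
fn test_missing_secret_access_key_errors() {
    // Assumed signature: from_bucket_and_options(bucket: String, options: &HashMap<String, String>).
    let mut options = HashMap::new();
    options.insert("access_key_id".to_string(), "some-key-id".to_string());

    let err = S3::from_bucket_and_options("some-bucket".to_string(), &options)
        .expect_err("expected an error for the missing option");
    assert!(err
        .to_string()
        .contains("'secret_access_key' not found in provided options"));
}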
src/context.rs
Outdated
        let s3 = if cmd.options.is_empty() && let schema::ObjectStore::S3(s3) = self.internal_object_store.config.clone() {
            S3{ bucket, ..s3 }
        } else {
            S3::from_bucket_and_options(bucket, &cmd.options)?
A potential alternative here is to use get_s3_object_store_builder (and get_s3_object_store_builder below) though that would mean adding a new dependency (datafusion-cli).
Seems to do some different things according to the source and doesn't handle GCS though?
Sorry I meant for the other link to point to get_gcs_object_store_builder.
In the end both end up doing effectively what we want—create an object store dynamically and register it with the object store registry.
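For context, the pattern those helpers boil down to looks roughly like this; a minimal sketch assuming the object_store crate's AmazonS3Builder and a DataFusion version where RuntimeEnv::register_object_store takes a URL (the helper name and parameters here are made up):

use std::sync::Arc;

use datafusion::execution::runtime_env::RuntimeEnv;
use object_store::aws::AmazonS3Builder;
use url::Url;

// Hypothetical helper: build an S3 object store from the provided options and
// register it with DataFusion's object store registry under the s3:// scheme.
fn build_and_register_s3(
    runtime_env: &RuntimeEnv,
    bucket: &str,
    access_key_id: &str,
    secret_access_key: &str,
    region: &str,
) -> Result<(), Box<dyn std::error::Error>> {
    let store = AmazonS3Builder::new()
        .with_bucket_name(bucket)
        .with_access_key_id(access_key_id)
        .with_secret_access_key(secret_access_key)
        .with_region(region)
        .build()?;

    // Any subsequent scan of s3://{bucket}/... will resolve to this store.
    let url = Url::parse(&format!("s3://{bucket}"))?;
    runtime_env.register_object_store(&url, Arc::new(store));

    Ok(())
}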
            .clone(),
        endpoint: map.get("endpoint").cloned(),
        bucket,
        cache_properties: Some(ObjectCacheProperties::default()),
Maybe we should be setting this to None in this case.
Is there a difference between None and Some(ObjectCacheProperties::default()) (if someone doesn't want their object store contents to be cached, can they represent that in the config)? If not, we could replace the type with ObjectCacheProperties instead of the optional.
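If they were equivalent, the suggested replacement could look something like this minimal sketch (field names are taken from the diff above; it assumes ObjectCacheProperties derives Default and Deserialize):

use serde::Deserialize;

// Hypothetical alternative: drop the Option and fall back to the default cache
// properties whenever the [object_store.cache_properties] section is omitted.
#[derive(Debug, Clone, Deserialize)]
pub struct S3 {
    pub access_key_id: String,
    pub secret_access_key: String,
    pub endpoint: Option<String>,
    pub bucket: String,
    // Assumes ObjectCacheProperties: Default + Deserialize.
    #[serde(default)]
    pub cache_properties: ObjectCacheProperties,
}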
Regarding "if someone doesn't want their object store contents to be cached, can they represent that in the config": this is currently represented by not specifying the [object_store.cache_properties] section in the config, and I think it's wise to leave it like that (i.e. keep it optional).
I'm just wondering whether it's worth it to set up the temp file caching machinery for an external table, and I'm leaning towards no for starters.
It could be useful. We originally set up caching for the HTTP object store because DataFusion would hit our object store with very small byte-range requests very frequently, so we'd get slowed down by the latency (this may have been fixed since, but it could still be a problem for S3/GCS).
Good point; I'll leave it to use the default caching props then (as is now).
This would allow one to query CSV/Parquet files stored in S3 and GCS.
For example:
- query a different bucket with the same credentials as provided in the config
- query with new credential options (or just in another region)
- query a GCS bucket (tested on a GCP VM, so the credentials were provided automatically)
Closes #256