Description
Is your feature request related to a problem or challenge?
The DuckDB blog shows off a really cool new feature (access remote datasets from Hugging Face):
https://duckdb.org/2024/05/29/access-150k-plus-datasets-from-hugging-face-with-duckdb
I think doing this with DataFusion would be quite cool and fairly simple to implement. Being able to add such support quickly would be a good example of how DataFusion's extensibility enables rapid feature development, in addition to being a cool project in its own right.
Describe the solution you'd like
I would like to support this type of query from datafusion-cli:
SELECT *
FROM 'hf://datasets/datasets-examples/doc-formats-csv-1/data.csv';

Describe alternatives you've considered
I think we can just follow the same model as the existing object store integration in datafusion-cli:
datafusion/datafusion-cli/src/object_storage.rs
Lines 419 to 496 in 088ad01
pub(crate) fn register_options(ctx: &SessionContext, scheme: &str) {
    // Match the provided scheme against supported cloud storage schemes:
    match scheme {
        // For Amazon S3 or Alibaba Cloud OSS
        "s3" | "oss" | "cos" => {
            // Register AWS specific table options in the session context:
            ctx.register_table_options_extension(AwsOptions::default())
        }
        // For Google Cloud Storage
        "gs" | "gcs" => {
            // Register GCP specific table options in the session context:
            ctx.register_table_options_extension(GcpOptions::default())
        }
        // For unsupported schemes, do nothing:
        _ => {}
    }
}

pub(crate) async fn get_object_store(
    state: &SessionState,
    scheme: &str,
    url: &Url,
    table_options: &TableOptions,
) -> Result<Arc<dyn ObjectStore>, DataFusionError> {
    let store: Arc<dyn ObjectStore> = match scheme {
        "s3" => {
            let Some(options) = table_options.extensions.get::<AwsOptions>() else {
                return exec_err!(
                    "Given table options incompatible with the 's3' scheme"
                );
            };
            let builder = get_s3_object_store_builder(url, options).await?;
            Arc::new(builder.build()?)
        }
        "oss" => {
            let Some(options) = table_options.extensions.get::<AwsOptions>() else {
                return exec_err!(
                    "Given table options incompatible with the 'oss' scheme"
                );
            };
            let builder = get_oss_object_store_builder(url, options)?;
            Arc::new(builder.build()?)
        }
        "cos" => {
            let Some(options) = table_options.extensions.get::<AwsOptions>() else {
                return exec_err!(
                    "Given table options incompatible with the 'cos' scheme"
                );
            };
            let builder = get_cos_object_store_builder(url, options)?;
            Arc::new(builder.build()?)
        }
        "gs" | "gcs" => {
            let Some(options) = table_options.extensions.get::<GcpOptions>() else {
                return exec_err!(
                    "Given table options incompatible with the 'gs'/'gcs' scheme"
                );
            };
            let builder = get_gcs_object_store_builder(url, options)?;
            Arc::new(builder.build()?)
        }
        "http" | "https" => Arc::new(
            HttpBuilder::new()
                .with_url(url.origin().ascii_serialization())
                .build()?,
        ),
        _ => {
            // For other types, try to get from `object_store_registry`:
            state
                .runtime_env()
                .object_store_registry
                .get_store(url)
                .map_err(|_| {
                    exec_datafusion_err!("Unsupported object store scheme: {}", scheme)
                })?
        }
    };
    Ok(store)
}
And register the hf URL with a specially created HTTP object store instance, as sketched below.
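For illustration, here is a minimal sketch of what that registration could look like. Everything in it is an assumption rather than a settled design: the helper name register_hf_object_store, the choice to key the registry on hf://datasets, and the fact that hf:// paths would still need to be rewritten to the Hub's HTTP layout (e.g. the .../resolve/main/... endpoints described in the DuckDB post).

use std::sync::Arc;

use datafusion::error::Result;
use datafusion::prelude::SessionContext;
use object_store::http::HttpBuilder;
use url::Url;

/// Hypothetical helper: back the `hf://` scheme with an HTTP object store
/// pointed at the Hugging Face Hub. A real implementation would also need to
/// map `hf://datasets/{repo}/{file}` paths onto the Hub's HTTP endpoints.
fn register_hf_object_store(ctx: &SessionContext) -> Result<()> {
    // Plain HTTP object store rooted at huggingface.co
    let http_store = HttpBuilder::new()
        .with_url("https://huggingface.co")
        .build()?;

    // Register it so URLs like
    // 'hf://datasets/datasets-examples/doc-formats-csv-1/data.csv'
    // resolve to this store via the object store registry.
    let hf_url = Url::parse("hf://datasets").expect("valid url");
    ctx.register_object_store(&hf_url, Arc::new(http_store));
    Ok(())
}

With something along those lines in place, the remaining work would mostly be in datafusion-cli's scheme matching (register_options / get_object_store above) plus whatever path translation the hf:// layout requires.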
Additional context
No response