Implement hf:// / "Hugging Face" integration in datafusion-cli #10720


Description

@alamb

Is your feature request related to a problem or challenge?

The DuckDB blog shows off a really cool new feature (accessing remote datasets from Hugging Face):

https://duckdb.org/2024/05/29/access-150k-plus-datasets-from-hugging-face-with-duckdb

I think doing this with DataFusion would be quite cool and quite simple to implement. Adding such support quickly would be a good demonstration of how DataFusion's extensibility enables rapid feature development, as well as being a cool project in its own right.

Describe the solution you'd like

I would like to support this type of query from datafusion-cli:

SELECT *
FROM 'hf://datasets/datasets-examples/doc-formats-csv-1/data.csv';
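For reference, Hugging Face serves the same file over plain HTTPS, so the query can already be reproduced today by registering an HTTP object store by hand. A minimal sketch (the resolve/main URL mapping and this standalone program are illustrative assumptions, not code from the repository):

use std::sync::Arc;

use datafusion::error::Result;
use datafusion::prelude::{CsvReadOptions, SessionContext};
use object_store::http::HttpBuilder;
use url::Url;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // Register an HTTP object store rooted at huggingface.co, the same
    // mechanism the "http"/"https" arm in the listing below uses.
    let base = Url::parse("https://huggingface.co").unwrap();
    let store = HttpBuilder::new()
        .with_url("https://huggingface.co")
        .build()?;
    ctx.register_object_store(&base, Arc::new(store));

    // Assumed mapping: hf://datasets/{org}/{repo}/{path} resolves to
    // https://huggingface.co/datasets/{org}/{repo}/resolve/main/{path}
    let df = ctx
        .read_csv(
            "https://huggingface.co/datasets/datasets-examples/doc-formats-csv-1/resolve/main/data.csv",
            CsvReadOptions::new(),
        )
        .await?;
    df.show().await?;
    Ok(())
}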

Describe alternatives you've considered

I think we can just follow the same model as the existing object store integration in datafusion-cli:

use std::sync::Arc;

use datafusion::common::config::TableOptions;
use datafusion::common::{exec_datafusion_err, exec_err};
use datafusion::error::DataFusionError;
use datafusion::execution::context::{SessionContext, SessionState};
use object_store::http::HttpBuilder;
use object_store::ObjectStore;
use url::Url;

// AwsOptions, GcpOptions, and the get_*_object_store_builder helpers are
// defined alongside this code in datafusion-cli's object_storage module.

pub(crate) fn register_options(ctx: &SessionContext, scheme: &str) {
    // Match the provided scheme against supported cloud storage schemes:
    match scheme {
        // For Amazon S3, Alibaba Cloud OSS, or Tencent Cloud COS
        "s3" | "oss" | "cos" => {
            // Register AWS-specific table options in the session context:
            ctx.register_table_options_extension(AwsOptions::default())
        }
        // For Google Cloud Storage
        "gs" | "gcs" => {
            // Register GCP-specific table options in the session context:
            ctx.register_table_options_extension(GcpOptions::default())
        }
        // For unsupported schemes, do nothing:
        _ => {}
    }
}
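// Note (sketch, not from the issue): hf:// could initially fall through to
// the no-op arm above, since public Hugging Face datasets are readable
// without credentials; a token-bearing options extension for private
// repositories could be registered under an "hf" arm later.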
pub(crate) async fn get_object_store(
    state: &SessionState,
    scheme: &str,
    url: &Url,
    table_options: &TableOptions,
) -> Result<Arc<dyn ObjectStore>, DataFusionError> {
    let store: Arc<dyn ObjectStore> = match scheme {
        "s3" => {
            let Some(options) = table_options.extensions.get::<AwsOptions>() else {
                return exec_err!(
                    "Given table options incompatible with the 's3' scheme"
                );
            };
            let builder = get_s3_object_store_builder(url, options).await?;
            Arc::new(builder.build()?)
        }
        "oss" => {
            let Some(options) = table_options.extensions.get::<AwsOptions>() else {
                return exec_err!(
                    "Given table options incompatible with the 'oss' scheme"
                );
            };
            let builder = get_oss_object_store_builder(url, options)?;
            Arc::new(builder.build()?)
        }
        "cos" => {
            let Some(options) = table_options.extensions.get::<AwsOptions>() else {
                return exec_err!(
                    "Given table options incompatible with the 'cos' scheme"
                );
            };
            let builder = get_cos_object_store_builder(url, options)?;
            Arc::new(builder.build()?)
        }
        "gs" | "gcs" => {
            let Some(options) = table_options.extensions.get::<GcpOptions>() else {
                return exec_err!(
                    "Given table options incompatible with the 'gs'/'gcs' scheme"
                );
            };
            let builder = get_gcs_object_store_builder(url, options)?;
            Arc::new(builder.build()?)
        }
        "http" | "https" => Arc::new(
            HttpBuilder::new()
                .with_url(url.origin().ascii_serialization())
                .build()?,
        ),
        _ => {
            // For other schemes, fall back to the `object_store_registry`:
            state
                .runtime_env()
                .object_store_registry
                .get_store(url)
                .map_err(|_| {
                    exec_datafusion_err!("Unsupported object store scheme: {}", scheme)
                })?
        }
    };
    Ok(store)
}

We could then register the hf:// scheme with a specially created HTTP object store instance.
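A rough sketch of what that could look like (hedged: get_hf_object_store and the resolve/main URL mapping are illustrative assumptions, not existing datafusion-cli API; imports as in the listing above):

// Hypothetical helper: build an HTTP-backed store for hf:// URLs,
// rooted at the Hugging Face hub.
pub(crate) fn get_hf_object_store(url: &Url) -> Result<Arc<dyn ObjectStore>, DataFusionError> {
    // hf://datasets/{org}/{repo}/{path} is assumed to resolve to
    // https://huggingface.co/datasets/{org}/{repo}/resolve/main/{path}.
    // Splicing in the "resolve/{revision}" segment would still need a thin
    // ObjectStore wrapper (or URL rewriting before registration), which is
    // elided in this sketch.
    let _ = url;
    let store = HttpBuilder::new()
        .with_url("https://huggingface.co")
        .build()?;
    Ok(Arc::new(store))
}

// ...wired into get_object_store with a corresponding arm:
//     "hf" => get_hf_object_store(url)?,

Private datasets would additionally need an authorization header on the underlying HTTP client, which could be fed through the same options-extension mechanism that register_options uses for AWS and GCP.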

Additional context

No response

Metadata

Assignees: no one assigned

Labels: enhancement (New feature or request)
