Skip to content

Conversation

@gruuya
Copy link
Contributor

@gruuya gruuya commented Jun 6, 2023

This would allow one to query CSV/Parquet files stored in S3 and GCS.

For example:

  • query a different bucket with the same credentials as provided by in the config:

    $ curl -H "Content-Type: application/json" http://127.0.0.1:8080/q -d@- <<EOF
    {"query": "create external table test_aws stored as parquet location 's3://seafowl-public/tutorial/trase-supply-chains.parquet'; select * from staging.test_aws limit 1"}
    EOF
    {"commodity":"CORN","country_of_import":"CANADA","country_of_import_trase_id":"CA","country_of_production":"ARGENTINA","economic_bloc":"CANADA","exporter":"RONALB S R L","exporter_group":"RONALB S R L","exporter_group_id":63685,"exporter_id":39201,"exporter_trase_id":"AR-TRADER-3064104720","flow_id":401692173,"fob":24200.0,"importer":"RONALB S R L","is_domestic":"0","port":"ROSARIO","product_type":"CORN GRAINS","region_production_1_type":"COUNTRY","region_production_2":"SANTA FE","region_production_2_level":2,"region_production_2_trase_id":"AR-82","region_production_2_type":"PROVINCE","row_number":1,"scale":"SUBNATIONAL","version":"0.2.2","volume":50.0,"year":2018.0}
  • query with new credential options (or just in another region):

    $ curl -H "Content-Type: application/json" http://127.0.0.1:8080/q -d@- <<EOF
    {"query": "create external table test_aws_options stored as parquet options ('access_key_id' '*******', 'secret_access_key' '*************', 'region' 'eu-west-3') location 's3://splitgraph-athena-test/supply-chains/supply-chains.parquet'; select * from staging.test_aws_options limit 1"}
    EOF
    {"commodity":"CORN","country_of_import":"CANADA","country_of_import_trase_id":"CA","country_of_production":"ARGENTINA","economic_bloc":"CANADA","exporter":"RONALB S R L","exporter_group":"RONALB S R L","exporter_group_id":63685,"exporter_id":39201,"exporter_trase_id":"AR-TRADER-3064104720","flow_id":401692173,"fob":24200.0,"importer":"RONALB S R L","is_domestic":"0","port":"ROSARIO","product_type":"CORN GRAINS","region_production_1_type":"COUNTRY","region_production_2":"SANTA FE","region_production_2_level":2,"region_production_2_trase_id":"AR-82","region_production_2_type":"PROVINCE","row_number":1,"scale":"SUBNATIONAL","version":"0.2.2","volume":50.0,"year":2018.0}
  • GCS example (tested on a GCP VM, so the creds where automatically provided)

    $ curl -v -H "Content-Type: application/json" http://127.0.0.1:8080/q -d@- <<EOF
    {"query": "create external table test_gcs stored as parquet location 'gs://splitgraph-staging/tweets.parquet'; select * from staging.test_gcs limit 1"}
    EOF
    {"id":877940643162578944,"link":"https://www.twitter.com/AdamKinzinger/statuses/877940643162578944","screen_name":"AdamKinzinger","source":"Twitter for iPad","text":"Before we blame anyone else for the tone of today's politics, we must look to ourselves. I hope you'll join me.  https://t.co/O6WWDYloKl","time":"2017-06-22T13:25:35","user_id":18004222}

Closes #256

Comment on lines 145 to 153
.ok_or(DataFusionError::Execution(
"'access_key_id' not found in provided options".to_string(),
))?
.clone(),
secret_access_key: map
.get("secret_access_key")
.ok_or(DataFusionError::Execution(
"'secret_access_key' not found in provided options".to_string(),
))?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataFusionError feels a bit foreign inside the config module.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we could define something ourselves or reuse a serde parsing error, given this is part of the config deserialization logic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right; tbh I'm still torn torn between doing this here (or somewhere else in our code), or going with the datafusion-cli helper functions (which would make this redundant).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. It seems like datafusion-cli's dependencies are similar to ours anyway (e.g. clap, tokio, datafusion itself; mimalloc isn't though). Would it even be a meaningful size increase if we wanted to pull in datafusion-cli?

Copy link
Contributor Author

@gruuya gruuya Jun 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it's 61MB regardless we include datafusion-cli or not. I was also wary of using the df-cli functions as they initially source values from the env, and delta-rs used to dump options values to env vars when creating a new external table, but it seems this is no longer the case.

Given this, I will try this approach next.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out delta-rs still dumps the env vars when creating an external delta table on S3, which would interfere with the creation of a new object store in get_s3_object_store_builder.

So I'll leave the implementation as is for now (only changing the error above and adding some tests).

src/context.rs Outdated
let s3 = if cmd.options.is_empty() && let schema::ObjectStore::S3(s3) = self.internal_object_store.config.clone() {
S3{ bucket, ..s3 }
} else {
S3::from_bucket_and_options(bucket, &cmd.options)?
Copy link
Contributor Author

@gruuya gruuya Jun 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A potential alternative here is to use get_s3_object_store_builder (and get_s3_object_store_builder below) though that would mean adding a new dependency (datafusion-cli).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to do some different things according to the source and doesn't handle GCS though?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I meant for the other link to point to get_gcs_object_store_builder.

In the end both end up doing effectively what we want—create an object store dynamically and register it with the object store registry.

@gruuya gruuya requested a review from mildbyte June 6, 2023 11:34
.clone(),
endpoint: map.get("endpoint").cloned(),
bucket,
cache_properties: Some(ObjectCacheProperties::default()),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should be setting this to None in this case)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a difference between None and Some(ObjectCacheProperties::default()) (if someone doesn't want their object store contents to be cached, can they represent that in the config)? If not, we could replace the type with ObjectCacheProperties instead of the optional.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if someone doesn't want their object store contents to be cached, can they represent that in the config

this is currently represented by not specifying the [object_store.cache_properties] section in the config, and I think it's wise to leave it like that (i.e. keep it optional).

I'm just wondering whether it's worth it to setup the temp file caching machinery for a external table, and I'm leaning towards no for starters.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be useful. The reason is that we originally set up caching for the HTTP object store because DataFusion would call our object store requesting very small byte ranges very frequently, so we'd get slowed down by the latency (this could now have been fixed, but could still be a problem for S3/GCS).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point; I'll leave it to use the default caching props then (as is now).

Comment on lines 145 to 153
.ok_or(DataFusionError::Execution(
"'access_key_id' not found in provided options".to_string(),
))?
.clone(),
secret_access_key: map
.get("secret_access_key")
.ok_or(DataFusionError::Execution(
"'secret_access_key' not found in provided options".to_string(),
))?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we could define something ourselves or reuse a serde parsing error, given this is part of the config deserialization logic?

.clone(),
endpoint: map.get("endpoint").cloned(),
bucket,
cache_properties: Some(ObjectCacheProperties::default()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a difference between None and Some(ObjectCacheProperties::default()) (if someone doesn't want their object store contents to be cached, can they represent that in the config)? If not, we could replace the type with ObjectCacheProperties instead of the optional.

@gruuya gruuya merged commit 8f60358 into main Jun 7, 2023
@gruuya gruuya deleted the s3-gcs-external-tables branch June 7, 2023 11:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feature: external table for private S3 bucket and ObjectStoreProvider

3 participants