-
Notifications
You must be signed in to change notification settings - Fork 1.9k
datafusion-cli: Use correct S3 region if it is not specified #16502
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
datafusion-cli: Use correct S3 region if it is not specified #16502
Conversation
fa796a8 to
1800fc4
Compare
1800fc4 to
126b9e1
Compare
|
Hi @alamb, this PR is ready for review! One concern here is whether it is acceptable to test against a bucket outside of datafusion, but I couldn’t find a good way to test region retry against local minio. |
|
Thanks @liamzwbao -- I'll put this one on my review queue for tomorrow |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @liamzwbao -- I tried this out and it works great! Also very nice tests.
DataFusion CLI v48.0.0
> CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet';
0 row(s) fetched.
Elapsed 1.737 seconds.
...I think the code to execute statements is getting a bit complicated however. I wonder if there is any way we can encapsulate it more (I left a comment)
Maybe we can do the refactoring as a folllow on PR
datafusion-cli/src/exec.rs
Outdated
| AdjustedPrintOptions::new(print_options.clone()).with_statement(&statement); | ||
|
|
||
| let plan = create_plan(ctx, statement).await?; | ||
| // Only clone the statement if it's a CreateExternalTable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic is pretty specific to S3 and create_table -- inlining it here obscures the rest of the control flow that is happening.
Is there any way we can encapsulate these special cases somewhere?
Perhaps something like
struct StatementExecutor {
statement: DFStatement
...
// should we automatically look up the region of a failed s3 request?
retry_region: bool,
}
impl StatementExecutor {
async fn execute(&mut self) {}
}Then this loop could look something like
StatementExecutor::new(ctx, print_options).await?;
}
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think then we would have a place to put retry-region
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I also feel like putting the logic here hurts the readability.
I searched in the code and looks like it's probably ok to always clone the statement here, so I simplify the logic in the exec_and_print to make it a bit clearer.
I can revert to the previous version if you think that one is better.
Also created an issue #16559 to encapsulate the logic in a followup PR
| use url::Url; | ||
|
|
||
| #[cfg(not(test))] | ||
| use object_store::aws::resolve_bucket_region; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
34b37f4 to
4c29697
Compare
|
Thank you @liamzwbao |
…16502) * datafusion-cli: Use correct S3 region if it is not specified * Simplify the retry logic
Which issue does this PR close?
datafusion-cli: Use correct S3 region if it is not specified #16306.Rationale for this change
Improve UX by resolving s3 region automatically.
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?
Yes, the CLI can automatically resolve the region for bucket now