Skip to content

Conversation

@liamzwbao
Copy link
Contributor

@liamzwbao liamzwbao commented Jun 22, 2025

Which issue does this PR close?

Rationale for this change

Improve UX by resolving s3 region automatically.

What changes are included in this PR?

  • Resolve bucket region if not specified
  • Retry on wrong region setting

Are these changes tested?

  • unit tests for resolving region
  • integration tests against a public bucket

Are there any user-facing changes?

Yes, the CLI can automatically resolve the region for bucket now

@liamzwbao liamzwbao force-pushed the issue-13456-resolve-region-on-error branch 4 times, most recently from fa796a8 to 1800fc4 Compare June 24, 2025 23:34
@liamzwbao liamzwbao force-pushed the issue-13456-resolve-region-on-error branch from 1800fc4 to 126b9e1 Compare June 25, 2025 00:31
@liamzwbao liamzwbao marked this pull request as ready for review June 25, 2025 00:44
@liamzwbao
Copy link
Contributor Author

Hi @alamb, this PR is ready for review!

One concern here is whether it is acceptable to test against a bucket outside of datafusion, but I couldn’t find a good way to test region retry against local minio.

@alamb
Copy link
Contributor

alamb commented Jun 25, 2025

Thanks @liamzwbao -- I'll put this one on my review queue for tomorrow

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @liamzwbao -- I tried this out and it works great! Also very nice tests.

DataFusion CLI v48.0.0
> CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet';
0 row(s) fetched.
Elapsed 1.737 seconds.
...

I think the code to execute statements is getting a bit complicated however. I wonder if there is any way we can encapsulate it more (I left a comment)

Maybe we can do the refactoring as a folllow on PR

AdjustedPrintOptions::new(print_options.clone()).with_statement(&statement);

let plan = create_plan(ctx, statement).await?;
// Only clone the statement if it's a CreateExternalTable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic is pretty specific to S3 and create_table -- inlining it here obscures the rest of the control flow that is happening.

Is there any way we can encapsulate these special cases somewhere?

Perhaps something like

struct StatementExecutor {
  statement: DFStatement 
  ...
  // should we automatically look up the region of a failed s3 request?
   retry_region: bool,
}
impl StatementExecutor {
  async fn execute(&mut self) {}
}

Then this loop could look something like

  StatementExecutor::new(ctx, print_options).await?;
}

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think then we would have a place to put retry-region

Copy link
Contributor Author

@liamzwbao liamzwbao Jun 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I also feel like putting the logic here hurts the readability.

I searched in the code and looks like it's probably ok to always clone the statement here, so I simplify the logic in the exec_and_print to make it a bit clearer.

I can revert to the previous version if you think that one is better.

Also created an issue #16559 to encapsulate the logic in a followup PR

use url::Url;

#[cfg(not(test))]
use object_store::aws::resolve_bucket_region;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@liamzwbao liamzwbao force-pushed the issue-13456-resolve-region-on-error branch from 34b37f4 to 4c29697 Compare June 26, 2025 01:45
@alamb alamb merged commit 3649dc8 into apache:main Jun 26, 2025
27 checks passed
@alamb
Copy link
Contributor

alamb commented Jun 26, 2025

Thank you @liamzwbao

@liamzwbao liamzwbao deleted the issue-13456-resolve-region-on-error branch June 26, 2025 19:03
adriangb pushed a commit to pydantic/datafusion that referenced this pull request Jun 27, 2025
…16502)

* datafusion-cli: Use correct S3 region if it is not specified

* Simplify the retry logic
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

datafusion-cli: Use correct S3 region if it is not specified

2 participants