enable import from GCS emulator without PublicHost#248

Merged
goccy merged 1 commit into
goccy:mainfrom
totem3:feature/import-from-gcs-emulator-without-public-host
Apr 7, 2024
@totem3 totem3 commented Nov 23, 2023

Collaborator

fixes #209

Summary of problem:

There is an issue with the job that imports files from GCS, specifically when using the GCS Emulator. As detailed in issue #209, attempts to import data from the GCS Emulator sometimes fail.

This happens when publicHost is not set in the GCS Emulator, or when the emulator is accessed through a host other than publicHost.

We have spent quite some time investigating this issue, and considering there's already an issue created with comments on it, we believe there is value in making it work without needing to set a publicHost.

Cause:

The problem arises due to two different URL formats used for accessing objects in the GCS Emulator:

  • /storage/v1/b/{bucketName}/o/{objectName}
  • /{bucketName}/{objectName}

The second URL pattern is only valid for requests addressed to publicHost in the GCS Emulator. When the Go GCS SDK downloads a file from GCS (using client.Bucket(...).Object(...).NewReader()), it uses the second URL format, which requires a valid publicHost and results in errors if it is not set.
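To make the difference concrete, here is a minimal sketch of the two path shapes listed above. The helper names, the emulator address, and the bucket and object names are hypothetical and used only for illustration. The escaping is simplified and covers plain names only, so it may differ from what the SDK does for object names containing "/".

```go
package main

import (
	"fmt"
	"net/url"
)

// jsonAPIPath builds the JSON API style path:
// /storage/v1/b/{bucketName}/o/{objectName}
// Hypothetical helper, not part of any SDK.
func jsonAPIPath(bucket, object string) string {
	return "/storage/v1/b/" + url.PathEscape(bucket) + "/o/" + url.PathEscape(object)
}

// directPath builds the prefix-less path:
// /{bucketName}/{objectName}
// Hypothetical helper; simplified escaping for plain names only.
func directPath(bucket, object string) string {
	return "/" + url.PathEscape(bucket) + "/" + url.PathEscape(object)
}

func main() {
	// Assumed emulator address, for illustration only.
	host := "http://localhost:4443"

	// Path shape that includes the storage/v1 API prefix.
	fmt.Println(host + jsonAPIPath("my-bucket", "data.csv"))

	// Path shape that, per the description above, is only valid on publicHost.
	fmt.Println(host + directPath("my-bucket", "data.csv"))
}
```

The only difference between the two is the storage/v1 API prefix and the b/ and o/ segments, which is why the host alone has to tell the emulator how to interpret the second form.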

The issue can be pinpointed in the code here:
When building the URL for data reading, the method at google-cloud-go#L788-L793 is used. This method does not take the API prefix (storage/v1) into account, considering only the host, bucket name, and object path. It is internally used in the NewReader method at bigquery-emulator#L1087.

However, this problem does not occur with the JSON API, because it uses the first URL format even when reading data (google-api-go-client#L12441).

This issue seems to be specific to the Emulator and not a problem with standard GCS usage, likely because storage.googleapis.com allows objects to be accessed directly through URLs without an API prefix.

Changes made in this PR:

I have enabled the option to use the JSON API, ensuring that imports work even when a publicHost is not set for the emulator. Since the JSON download API was introduced in v1.30.0, I have upgraded the cloud.google.com/go/storage version.
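As a rough sketch of what enabling JSON reads looks like on a storage client: this assumes the option in question is storage.WithJSONReads from cloud.google.com/go/storage (v1.30.0 or later), and that the emulator address is supplied through the STORAGE_EMULATOR_HOST environment variable. The bucket and object names are hypothetical, and this is not the exact code changed in this PR.

```go
package main

import (
	"context"
	"io"
	"log"
	"os"

	"cloud.google.com/go/storage"
)

func main() {
	ctx := context.Background()

	// Assumption: STORAGE_EMULATOR_HOST points at the GCS Emulator,
	// e.g. STORAGE_EMULATOR_HOST=localhost:4443.
	// WithJSONReads asks the client to use the JSON API for object reads.
	client, err := storage.NewClient(ctx, storage.WithJSONReads())
	if err != nil {
		log.Fatalf("creating storage client: %v", err)
	}
	defer client.Close()

	// Hypothetical bucket and object names.
	r, err := client.Bucket("my-bucket").Object("data.csv").NewReader(ctx)
	if err != nil {
		log.Fatalf("opening object reader: %v", err)
	}
	defer r.Close()

	// Stream the object contents to stdout.
	if _, err := io.Copy(os.Stdout, r); err != nil {
		log.Fatalf("reading object: %v", err)
	}
}
```

With this option the download goes through the storage/v1 path, so it should not depend on publicHost being configured.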

This might be more of a problem with the Go GCS SDK than with the BigQuery Emulator. So, if this fix isn't right, please let me know. If that's the case, I'm thinking of making another PR to add guidelines in the README about setting a publicHost for the GCS Emulator.

Thank you for maintaining such a great product.

@goccy goccy left a comment

Owner

Thank you for your contribution !! LGTM 👍
Please resolve conflict 🙏

@goccy goccy added the reviewed label Apr 6, 2024
@totem3 totem3 force-pushed the feature/import-from-gcs-emulator-without-public-host branch from b4ea961 to 8bf5a73 Compare April 7, 2024 06:03
totem3 commented Apr 7, 2024

Collaborator Author

@goccy
Thank you for the review! I've rebased onto main.
I dropped the commit that was causing conflicts because the dependency modules in the latest main were already updated enough, making that commit unnecessary. Other than that, I haven't made any changes.

goccy commented Apr 7, 2024

Owner

Thank you for your quick response !!

@goccy goccy merged commit 5ad569f into goccy:main Apr 7, 2024
iolalla pushed a commit to iolalla/bigquery-emulator that referenced this pull request Aug 16, 2025
Development

Successfully merging this pull request may close these issues.

Loading a CSV from emulated GCS fails
