[Bug]: BigQuery size estimation does not allow impersonation #26622
Closed
Description
What happened?
Environment
SDK: Python
Runner: Dataflow
Connector: BigQuery
Apache Beam version: 2.45.0, but most likely also present in previous versions
Python version: 3.8.16, but does not seem to matter in this case
Preconditions
- Google Cloud Platform project P,
- BigQuery table T in any project,
- Google Cloud Platform account A which can read from table T and can create Dataflow and BigQuery jobs in project P,
- Google Cloud Platform account B which can impersonate account A, but does not have access to table T nor can create Dataflow and BigQuery jobs in project P.
Reproduction steps
1. Implement a Pipeline that:
   - Reads from table T (apache_beam.io.ReadFromBigQuery),
   - Is executed in project P (project),
   - Runs as user B (service_account),
   - Impersonates user A (impersonate_service_account).
2. Run the pipeline as user B,
3. An error occurs, because the originally configured PipelineOptions do not take part in size estimation:
   Failed to insert job <JobReference jobId: 'beam_bq_job_QUERY_XXX' projectId: 'P'>: HttpError accessing <https://bigquery.googleapis.com/bigquery/v2/projects/P/jobs?alt=json>: response: <{'vary': 'Origin, X-Origin, Referer', 'content-type': 'application/json; charset=UTF-8', 'date': 'Mon, 08 May 2023 07:14:00 GMT', 'server': 'ESF', 'cache-control': 'private', 'x-xss-protection': '0', 'x-frame-options': 'SAMEORIGIN', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'status': '403', 'content-length': '486', '-content-encoding': 'gzip'}>, content <{ "error": { "code": 403, "message": "Access Denied: Project P: User does not have bigquery.jobs.create permission in project P.", "errors": [ { "message": "Access Denied: Project P: User does not have bigquery.jobs.create permission in project P.", "domain": "global", "reason": "accessDenied" } ], "status": "PERMISSION_DENIED" } }
   This step does not fail the pipeline, as size estimation is best effort and can just return None.
4. Another error occurs, because of Google Cloud Platform token caching. Even though PipelineOptions are configured properly, the first request for a Google Cloud Platform token was made with empty options during BigQuery table size estimation in step 3. Subsequent calls utilize the cache and do not even try to get a token for the impersonated user.
   apitools.base.py.exceptions.HttpForbiddenError: HttpError accessing <https://dataflow.googleapis.com/v1b3/projects/P/locations/europe-west1/jobs?alt=json>: response: <{'vary': 'Origin, X-Origin, Referer', 'content-type': 'application/json; charset=UTF-8', 'date': 'Mon, 08 May 2023 07:14:21 GMT', 'server': 'ESF', 'cache-control': 'private', 'x-xss-protection': '0', 'x-frame-options': 'SAMEORIGIN', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'status': '403', 'content-length': '314', '-content-encoding': 'gzip'}>, content <{ "error": { "code": 403, "message": "(4598d41a53139eed): Could not create workflow; user does not have write access to project: P Causes: (4598d41a53139582): Permission 'dataflow.jobs.create' denied on project: 'P'", "status": "PERMISSION_DENIED" } }
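The caching effect described in the last reproduction step can be illustrated with a small simulation. This is only a sketch and not Beam's actual code: CredentialCache, estimate_size and submit_job are hypothetical stand-ins, and the assumption is that the cache resolves an identity once, on the first request, and returns it to every later caller regardless of the options they pass.

```python
from typing import Optional


class CredentialCache:
    """Hypothetical stand-in for a process-wide credential cache."""

    _cached: Optional[str] = None
    _initialized = False

    @classmethod
    def get(cls, impersonate: Optional[str], caller: str) -> str:
        # Only the first call resolves an identity. Later calls reuse it,
        # even if they pass a different impersonation target.
        if not cls._initialized:
            cls._cached = impersonate or caller
            cls._initialized = True
        return cls._cached

    @classmethod
    def reset(cls) -> None:
        cls._cached = None
        cls._initialized = False


def estimate_size(caller: str) -> str:
    # Size estimation requests credentials without any pipeline options.
    return CredentialCache.get(impersonate=None, caller=caller)


def submit_job(caller: str, impersonate: str) -> str:
    # Job submission passes the configured impersonation target.
    return CredentialCache.get(impersonate=impersonate, caller=caller)


CredentialCache.reset()
first = estimate_size(caller="B")
second = submit_job(caller="B", impersonate="A")
print(first, second)  # B B
```

In this model the submission still runs as B, because the option-less request from size estimation populated the cache first. That matches the second 403 in the report.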
Source code
- apache_beam/io/gcp/bigquery.py/_CustomBigQuerySource.estimate_size,
- apache_beam/io/gcp/bigquery_tools.py/BigQueryWrapper.__init__,
- apache_beam/internal/gcp/auth.py/_Credentials.get_service_credentials.
Expected behaviour
The pipeline estimates BigQuery table size and submits the pipeline to Dataflow as user A.
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components
- Component: Python SDK
- Component: Java SDK
- Component: Go SDK
- Component: Typescript SDK
- Component: IO connector
- Component: Beam examples
- Component: Beam playground
- Component: Beam katas
- Component: Website
- Component: Spark Runner
- Component: Flink Runner
- Component: Samza Runner
- Component: Twister2 Runner
- Component: Hazelcast Jet Runner
- Component: Google Cloud Dataflow Runner