[Bug]: BigQuery size estimation does not allow impersonation #26622

@dopieralad

What happened?

Environment

SDK: Python
Runner: Dataflow
Connector: BigQuery
Apache Beam version: 2.45.0, but most likely also present in previous versions
Python version: 3.8.16, but does not seem to matter in this case

Preconditions

  • Google Cloud Platform project P,
  • BigQuery table T in any project,
  • Google Cloud Platform account A which can read from table T and can create Dataflow and BigQuery jobs in project P,
  • Google Cloud Platform account B which can impersonate account A, but can neither access table T nor create Dataflow and BigQuery jobs in project P.

Reproduction steps

  1. Implement a Pipeline that:
    • Reads from table T (apache_beam.io.ReadFromBigQuery),
    • Is executed in project P (project),
    • Runs as user B (service_account),
    • Impersonates user A (impersonate_service_account).
  2. Run the pipeline as user B,
  3. An error occurs because the originally configured PipelineOptions are not passed to size estimation:
    Failed to insert job <JobReference    
    jobId: 'beam_bq_job_QUERY_XXX'    
    projectId: 'P'>: HttpError accessing <https://bigquery.googleapis.com/bigquery/v2/projects/P/jobs?alt=json>: response: <{'vary': 'Origin, X-Origin, Referer', 'content-type': 'application/json; charset=UTF-8', 'date': 'Mon, 08 May 2023 07:14:00 GMT', 'server': 'ESF', 'cache-control': 'private', 'x-xss-protection': '0', 'x-frame-options': 'SAMEORIGIN', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'status': '403', 'content-length': '486', '-content-encoding': 'gzip'}>, content <{    
     "error": {    
       "code": 403,    
       "message": "Access Denied: Project P: User does not have bigquery.jobs.create permission in project P.",    
         "errors": [    
           {    
             "message": "Access Denied: Project P: User does not have bigquery.jobs.create permission in project P.",    
             "domain": "global",    
             "reason": "accessDenied"    
           }    
         ],    
         "status": "PERMISSION_DENIED"    
       }    
     }    
    
    This step does not fail the pipeline, as size estimation is best effort and can just return None.
  4. Another error occurs because of Google Cloud Platform token caching. Even though PipelineOptions are configured properly, the first request for a Google Cloud Platform token was made with empty options during BigQuery table size estimation in step 3. Subsequent calls use the cached credentials and never try to get a token for the impersonated user.
    apitools.base.py.exceptions.HttpForbiddenError: HttpError accessing <https://dataflow.googleapis.com/v1b3/projects/P/locations/europe-west1/jobs?alt=json>: response: <{'vary': 'Origin, X-Origin, Referer', 'content-type': 'application/json; charset=UTF-8', 'date': 'Mon, 08 May 2023 07:14:21 GMT', 'server': 'ESF', 'cache-control': 'private', 'x-xss-protection': '0', 'x-frame-options': 'SAMEORIGIN', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'status': '403', 'content-length': '314', '-content-encoding': 'gzip'}>, content <{
      "error": {
        "code": 403,
        "message": "(4598d41a53139eed): Could not create workflow; user does not have write access to project: P Causes: (4598d41a53139582): Permission 'dataflow.jobs.create' denied on project: 'P'",
        "status": "PERMISSION_DENIED"
      }
    }
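
The interaction between steps 3 and 4 can be illustrated with a small stand-alone model. This is only a sketch of the caching behaviour described above, not Beam's actual code: the class `CachedCredentials`, the functions `estimate_size` and `submit_job`, and the plain-dict options are all hypothetical stand-ins, and it assumes the cache is process-wide and filled by whichever call comes first.

```python
import threading
from typing import Dict, Optional


class CachedCredentials:
    """Toy process-wide credentials cache (hypothetical, not Beam's code)."""

    _lock = threading.Lock()
    _initialized = False
    _identity: Optional[str] = None

    @classmethod
    def get(cls, options: Dict[str, str]) -> str:
        with cls._lock:
            if cls._initialized:
                # Cache hit: the options passed by this caller are ignored.
                return cls._identity
            # The first call decides the identity for the whole process.
            # Without an impersonation target we fall back to the caller, B.
            cls._identity = options.get("impersonate_service_account", "B")
            cls._initialized = True
            return cls._identity


def estimate_size() -> str:
    # Models step 3: size estimation asks for credentials with empty options.
    return CachedCredentials.get({})


def submit_job(options: Dict[str, str]) -> str:
    # Models step 4: job submission passes the properly configured options.
    return CachedCredentials.get(options)


user_options = {"impersonate_service_account": "A"}
estimation_identity = estimate_size()
submission_identity = submit_job(user_options)
print(estimation_identity, submission_identity)  # B B
```

In this model both calls act as B, even though the options passed at submission time name A, which matches the two permission errors shown above.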
    

Source code

  1. apache_beam/io/gcp/bigquery.py/_CustomBigQuerySource.estimate_size,
  2. apache_beam/io/gcp/bigquery_tools.py/BigQueryWrapper.__init__,
  3. apache_beam/internal/gcp/auth.py/_Credentials.get_service_credentials.

Expected behaviour

The pipeline estimates BigQuery table size and submits the pipeline to Dataflow as user A.
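
As a rough illustration of the expected behaviour, the sketch below forwards the configured options to the size estimation step, so the first credentials request already names the impersonated account. It is a toy model under stated assumptions, not a proposed patch: `get_identity`, `estimate_size`, `submit_job` and the dict-based options are hypothetical and do not correspond to Beam APIs.

```python
from typing import Dict, Optional

# Hypothetical process-wide cache: the first caller fixes the identity.
_cached_identity: Optional[str] = None


def get_identity(options: Dict[str, str]) -> str:
    """Return the cached identity, resolving it from options on first use."""
    global _cached_identity
    if _cached_identity is None:
        # Fall back to the calling account B when no target is configured.
        _cached_identity = options.get("impersonate_service_account", "B")
    return _cached_identity


def estimate_size(options: Dict[str, str]) -> str:
    # Unlike the failing path, the user's options are forwarded here.
    return get_identity(options)


def submit_job(options: Dict[str, str]) -> str:
    return get_identity(options)


options = {"impersonate_service_account": "A"}
estimation_identity = estimate_size(options)
submission_identity = submit_job(options)
print(estimation_identity, submission_identity)  # A A
```

With the options available from the first call onward, the cache is populated for A and both the size estimation and the job submission run as the impersonated account.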

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
