
GCSToBQ operator does not respect project_id in deferrable mode with impersonation chain. #32093

@nathadfield

Description


Apache Airflow version

2.6.2

What happened

The GCSToBigQueryOperator fails in deferrable mode when it is used with an impersonation_chain service account whose default project_id differs from the project_id specified in the operator arguments.

[2023-06-23, 11:38:37 UTC] {taskinstance.py:1824} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/airflow/providers/google/cloud/transfers/gcs_to_bigquery.py", line 447, in execute_complete
    raise AirflowException(event["message"])
airflow.exceptions.AirflowException: 404, message='Not Found: {\n  "error": {\n    "code": 404,\n    "message": "Not found: Job king-cdmr-etl-sandbox:airflow_apptweak_king_itunes_connect_channels_load_active_devices_to_bq_2023_06_22T07_00_00_00_00_4842808969d21632ecbb76ffca48aabd",\n    "errors": [\n      {\n        "message": "Not found: Job king-cdmr-etl-sandbox:airflow_apptweak_king_itunes_connect_channels_load_active_devices_to_bq_2023_06_22T07_00_00_00_00_4842808969d21632ecbb76ffca48aabd",\n        "domain": "global",\n        "reason": "notFound"\n      }\n    ],\n    "status": "NOT_FOUND"\n  }\n}\n', url=URL('https://www.googleapis.com/bigquery/v2/projects/king-cdmr-etl-sandbox/jobs/airflow_apptweak_king_itunes_connect_channels_load_active_devices_to_bq_2023_06_22T07_00_00_00_00_4842808969d21632ecbb76ffca48aabd')

I believe this happens because, although the BigQuery job that inserts the data is submitted against self.project_id in _submit_job, in deferrable mode the operator then tries to find the job in the project given by self.hook.project_id.

The default project_id assigned to the impersonation chain service account can be different from the project_id passed to the operator.

In the error above, the job_id airflow_apptweak_king_itunes_connect_channels_load_active_devices_to_bq_2023_06_22T07_00_00_00_00_4842808969d21632ecbb76ffca48aabd cannot be found in the project king-cdmr-etl-sandbox.

In fact, this job_id was created successfully in the project king-coredatasets-sandbox.
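To make the suspected mismatch concrete, here is a minimal, self-contained sketch that does not use Airflow at all. FakeHook and FakeBigQuery are hypothetical stand-ins I made up for illustration, and "job_1" is a placeholder job_id; only the two project names come from the error above.

```python
from dataclasses import dataclass, field


@dataclass
class FakeHook:
    # Hypothetical stand-in for the BigQuery hook. project_id models the
    # default project of the impersonated service account.
    project_id: str


@dataclass
class FakeBigQuery:
    # Hypothetical in-memory job store keyed by (project_id, job_id).
    jobs: dict = field(default_factory=dict)

    def insert_job(self, project_id: str, job_id: str) -> None:
        self.jobs[(project_id, job_id)] = "DONE"

    def get_job_status(self, project_id: str, job_id: str) -> str:
        if (project_id, job_id) not in self.jobs:
            raise LookupError(f"Not found: Job {project_id}:{job_id}")
        return self.jobs[(project_id, job_id)]


operator_project_id = "king-coredatasets-sandbox"  # project_id passed to the operator
hook = FakeHook(project_id="king-cdmr-etl-sandbox")  # service account default
bq = FakeBigQuery()

# The job is submitted against the operator's project_id ...
bq.insert_job(operator_project_id, "job_1")

# ... but the deferred lookup uses the hook's default project instead.
try:
    bq.get_job_status(hook.project_id, "job_1")
    lookup_error = None
except LookupError as exc:
    lookup_error = str(exc)

print(lookup_error)
```

Looking the job up with operator_project_id instead of hook.project_id succeeds in this sketch, which matches what I see in the BigQuery console.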


What you think should happen instead

I think we should modify the call to self.defer so that it passes self.project_id rather than self.hook.project_id.
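A rough sketch of the idea, written as plain Python so it runs without Airflow. build_trigger_kwargs is a hypothetical helper, not a function in the provider, and the keyword names conn_id, job_id and project_id are my assumption about what the trigger accepts. Falling back to the hook's project when the operator has no project_id is also an assumption on my part, not something I have confirmed is wanted.

```python
from typing import Optional


def build_trigger_kwargs(
    operator_project_id: Optional[str],
    hook_project_id: str,
    job_id: str,
    conn_id: str,
) -> dict:
    # Hypothetical helper: choose the project the deferred lookup should use.
    # Prefer the project_id given to the operator; only fall back to the
    # hook's default project when the operator has none (an assumption).
    return {
        "conn_id": conn_id,
        "job_id": job_id,
        "project_id": operator_project_id or hook_project_id,
    }


kwargs = build_trigger_kwargs(
    operator_project_id="king-coredatasets-sandbox",
    hook_project_id="king-cdmr-etl-sandbox",
    job_id="job_1",  # placeholder job_id
    conn_id="google_cloud_default",
)
print(kwargs["project_id"])
```

With this choice, the deferred lookup targets the same project that the job was submitted to, while operators that never set project_id keep their current behaviour.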

How to reproduce

I haven't quite got the exact steps to reproduce but I will submit a PR for review soon.

Operating System

Debian GNU/Linux 11 (bullseye)

Versions of Apache Airflow Providers

apache-airflow-providers-google==10.0.0

Deployment

Astronomer

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
