@vchiapaikeo
Contributor

@vchiapaikeo vchiapaikeo commented Dec 18, 2022

GCSToBigQueryOperator allows multiple ways to specify schema of the BigQuery table:

  1. Setting autodetect == True
  2. Setting schema_fields directly with autodetect == False
  3. Setting a schema_object and optionally a schema_object_bucket with autodetect == False
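
As a rough illustration of the three options, here is a sketch of the keyword arguments each one might use. The bucket names, object paths, table name, and field definitions are hypothetical placeholders; the dicts are meant to be unpacked into GCSToBigQueryOperator(task_id=..., **kwargs) and are not tied to any real project.

```python
# Arguments shared by all three variants (every value here is a made-up placeholder).
common = {
    "bucket": "example-data-bucket",
    "source_objects": ["exports/users/*.csv"],
    "destination_project_dataset_table": "example_dataset.users",
}

# 1. Let BigQuery infer the schema from the source objects.
autodetect_kwargs = {**common, "autodetect": True}

# 2. Pass the schema inline as a list of field definitions.
inline_schema_kwargs = {
    **common,
    "autodetect": False,
    "schema_fields": [
        {"name": "id", "type": "INTEGER", "mode": "REQUIRED"},
        {"name": "email", "type": "STRING", "mode": "NULLABLE"},
    ],
}

# 3. Point at a JSON schema file stored in GCS, optionally in a separate bucket.
schema_object_kwargs = {
    **common,
    "autodetect": False,
    "schema_object": "schemas/users.json",
    "schema_object_bucket": "example-schema-bucket",
}
```

The third variant is the one affected by the error described below.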

This third method seems to be broken in the latest provider version (8.6.0) and will always result in this error:

[2022-12-16, 21:06:18 UTC] {taskinstance.py:1772} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/providers/google/cloud/transfers/gcs_to_bigquery.py", line 395, in execute
    self.configuration = self._check_schema_fields(self.configuration)
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/providers/google/cloud/transfers/gcs_to_bigquery.py", line 524, in _check_schema_fields
    raise RuntimeError(
RuntimeError: Table schema was not found. Set autodetect=True to automatically set schema fields from source objects or pass schema_fields explicitly

This happens because the block guarded by if self.schema_object and self.source_format != "DATASTORE_BACKUP": never sets self.schema_fields; it only assigns the local variable schema_fields. When self._check_schema_fields is subsequently called, we enter its first block because autodetect is False and self.schema_fields is not set, which raises the error above.

This PR sets the instance variable self.schema_fields when the user passes in a schema_object. Additionally, it uses self.schema_object_bucket instead of the erroneous self.bucket.
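
To make the local-versus-instance distinction concrete, here is a minimal, self-contained sketch. It is a toy stand-in and not the operator's actual code: the class name SchemaResolutionSketch, the fixed flag, and the _load_schema_object helper are all hypothetical, and the GCS download is replaced by a hard-coded schema.

```python
class SchemaResolutionSketch:
    """Toy model of the schema handling described above (not the real operator)."""

    def __init__(self, schema_object=None, schema_fields=None, autodetect=False, fixed=True):
        self.schema_object = schema_object
        self.schema_fields = schema_fields
        self.autodetect = autodetect
        self.fixed = fixed  # hypothetical switch between old and new behaviour

    def _load_schema_object(self):
        # Stand-in for downloading and parsing the JSON schema file from GCS.
        return [{"name": "id", "type": "INTEGER", "mode": "REQUIRED"}]

    def _check_schema_fields(self):
        # Mirrors the failing check: no autodetect and no schema on the instance.
        if not self.autodetect and not self.schema_fields:
            raise RuntimeError("Table schema was not found.")

    def execute(self):
        if self.schema_object:
            if self.fixed:
                # Behaviour after the fix: store the schema on the instance.
                self.schema_fields = self._load_schema_object()
            else:
                # Behaviour before the fix: the result only lives in a local
                # variable, so the later check never sees it.
                schema_fields = self._load_schema_object()  # noqa: F841
        self._check_schema_fields()
        return self.schema_fields
```

With fixed=False and a schema_object set, execute() raises the RuntimeError; with fixed=True it returns the loaded schema.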

cc: @eladkal
Fixes: #28441

@boring-cyborg boring-cyborg bot added area:providers provider:google Google (including GCP) related issues labels Dec 18, 2022
@boring-cyborg

boring-cyborg bot commented Dec 18, 2022

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (flake8, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@vchiapaikeo vchiapaikeo force-pushed the vchiapaikeo/fix-gcs-to-bigquery-v1 branch from 735546e to d324b97 Compare December 19, 2022 13:25
@vchiapaikeo
Contributor Author

Whoops @eladkal - fixed the isort failure

@vchiapaikeo
Contributor Author

cc @turbaszek, I believe you are codeowner. Can you approve this as well?

@eladkal eladkal merged commit 9eacf60 into apache:main Dec 20, 2022
@eladkal
Contributor

eladkal commented Dec 20, 2022

Thanks @vchiapaikeo !
If you have time I'd appreciate also helping to resolve #12329

@boring-cyborg

boring-cyborg bot commented Dec 20, 2022

Awesome work, congrats on your first merged pull request!

@vchiapaikeo vchiapaikeo deleted the vchiapaikeo/fix-gcs-to-bigquery-v1 branch December 20, 2022 18:32
@vchiapaikeo
Contributor Author

My pleasure @eladkal! Sure I can try to take a look at it this weekend.

@axelborja

Any idea which release this will ship in?

@potiuk
Member

potiuk commented Jan 17, 2023

Screenshot 2023-01-17 at 12 35 50

Seen the above? ^^ Looks like this already shipped in the previous provider release, and you can update the provider whenever you want.

@grimwm

grimwm commented Jan 26, 2023

Hi, when will this be released? It is not in apache-airflow[gcp]==2.5.1.

@potiuk
Member

potiuk commented Jan 26, 2023

> Hi, when will this be released? It is not in apache-airflow[gcp]==2.5.1.

Why do you think so?

@grimwm

grimwm commented Feb 4, 2023

> > Hi, when will this be released? It is not in apache-airflow[gcp]==2.5.1.
>
> Why do you think so?

Hi, that was my fault. I didn't realize our production Airflow was on 2.5.0. I finally put it together and got it sorted and forgot to update here. Sorry, and thanks!
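
For anyone else unsure which versions are actually deployed, one way to check is to query the installed distributions directly. This is a minimal sketch using only the standard library; the helper name installed_version is hypothetical, and the printed versions still need to be compared against the provider changelog by hand.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Airflow core and the Google provider are separate distributions with
# separate version numbers, so both are worth checking.
for dist in ("apache-airflow", "apache-airflow-providers-google"):
    print(dist, installed_version(dist) or "not installed")
```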

Development

Successfully merging this pull request may close these issues.

GCSToBigQueryOperator fails when schema_object is specified without schema_fields