Description
We would like to be able to load data into existing BigQuery tables without having to specify schema_fields/schema_object in GCSToBigQueryOperator, since the table already exists.
Use case / motivation
BigQuery load job usage details and problem explanation
We create BigQuery tables/datasets through a CI process (Terraform managed), and with the help of Airflow we load data into those tables.
To update the tables with data we can use:
- the Airflow 2.0 operator GCSToBigQueryOperator
- the Airflow 1.* operator (deprecated) GoogleCloudStorageToBigQueryOperator
However, these operators require one of three things to be specified:
- schema_fields - fields that define table schema
- schema_object - a GCS object path pointing to a .json file that contains the schema for the table
- or autodetect=True
Otherwise, the operator raises:
raise ValueError('At least one of `schema_fields`, `schema_object`, '
                 'or `autodetect**=True**` must be passed.')
Note: the exception message does not actually say that autodetect must be True, but according to the code it must be set to True unless a schema is provided.
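For reference, a minimal sketch of what an invocation currently looks like when a schema object is supplied (the bucket, object paths, and table name below are made-up placeholders):

```python
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

# Current situation: one of schema_fields / schema_object / autodetect=True is mandatory,
# even though the destination table already exists and already has a schema.
load_csv = GCSToBigQueryOperator(
    task_id="gcs_to_bq_example",
    bucket="my-bucket",                            # placeholder
    source_objects=["data/events/*.csv"],          # placeholder
    destination_project_dataset_table="my_project.my_dataset.events",  # placeholder
    schema_object="schemas/events.json",           # duplicates the schema Terraform already manages
    source_format="CSV",
    write_disposition="WRITE_APPEND",
)
```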
But the table is already created, and we can update it using the
bq load command (which the Airflow operators mentioned above use internally).
When using bq load you also have the option to specify a schema. The schema can be a local JSON file, or it can be typed inline as part of the command. You can also use the --autodetect flag instead of supplying a schema definition.
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv#bq
When you specify --autodetect as True, BigQuery will try to give generic names to your columns, e.g. 'string_field_0', 'int_field_1', and if you are trying to load into an existing table, bq load will fail with the error:
'Cannot add fields (field: string_field_0)'
Airflow operators such as GCSToBigQueryOperator fail in the same way.
However, there is also the option NOT to specify --autodetect (or to specify --autodetect=False), and in this case bq load loads data from Cloud Storage into the existing BigQuery table without problems.
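As an illustration, here is a minimal sketch of that behaviour using the BigQuery Python client instead of the bq CLI (the URI and table name are placeholders): with neither a schema nor autodetect set, the load job simply relies on the existing table's schema.

```python
from google.cloud import bigquery

client = bigquery.Client()

# No schema and no autodetect: the load job uses the schema of the
# already-existing destination table, as `bq load` does by default.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    skip_leading_rows=1,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/data/events/*.csv",   # placeholder URI
    "my_project.my_dataset.events",       # placeholder: table created by Terraform
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
```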
Proposal/TL;DR:
Add the option to omit autodetect (or pass autodetect=False) when write_disposition='WRITE_APPEND' is used in GCSToBigQueryOperator. This would allow the operator to update an existing BigQuery table without having to specify a schema within the operator itself (it would just append data to the existing table).
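A hypothetical sketch of what the call could then look like (the autodetect=False-without-schema combination is exactly what is being requested here, it is not accepted by the operator today; names are placeholders):

```python
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

# Hypothetical usage if the proposal is accepted: no schema_fields, no schema_object,
# and autodetect explicitly False because the destination table already exists.
append_to_existing = GCSToBigQueryOperator(
    task_id="append_to_existing_table",
    bucket="my-bucket",                            # placeholder
    source_objects=["data/events/*.csv"],          # placeholder
    destination_project_dataset_table="my_project.my_dataset.events",  # placeholder
    source_format="CSV",
    write_disposition="WRITE_APPEND",
    autodetect=False,  # proposed: rely on the existing table's schema
)
```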