
GCSToBigQueryOperator - allow upload to existing table without specifying schema_fields/schema_object #12329

@aKumpan

Description


We would like to be able to load data into existing BigQuery tables without having to specify schema_fields/schema_object in GCSToBigQueryOperator, since the table (and therefore its schema) already exists.

Use case / motivation

BigQuery load job usage details and problem explanation

We create BigQuery tables/datasets through a CI process (Terraform-managed), and with the help of Airflow we load data into those tables.
To load the data we can use:
the Airflow 2.0 operator: GCSToBigQueryOperator
the Airflow 1.* operator (deprecated): GoogleCloudStorageToBigQueryOperator
However, those operators require you to specify one of three things:

  1. schema_fields - fields that define the table schema
  2. schema_object - a GCS object path pointing to a .json file that contains the schema for the table
  3. or autodetect=True

Otherwise the operator will:
 raise ValueError('At least one of `schema_fields`, `schema_object`, '
                  'or `autodetect=True` must be passed.')

Note: the exception message does not actually say that autodetect must be True, but according to the code it must be set to True; otherwise a schema must be supplied.
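To make the failure mode concrete, here is a minimal, simplified sketch of the schema check described above (the real check lives inside the operator's execute path in airflow.providers.google; this standalone function only mirrors its logic):

```python
# Simplified sketch of the schema validation GCSToBigQueryOperator performs
# today: with no schema source and autodetect left False, it refuses to run.
def validate_schema_args(schema_fields=None, schema_object=None, autodetect=False):
    """Raise unless a schema source or autodetect=True is supplied."""
    if not schema_fields and not schema_object and not autodetect:
        raise ValueError(
            'At least one of `schema_fields`, `schema_object`, '
            'or `autodetect=True` must be passed.'
        )


# Loading into an existing table with none of the three currently fails,
# even though BigQuery itself already knows the table's schema:
try:
    validate_schema_args()
except ValueError as exc:
    print(exc)
```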

But the table is already created, and we can update it using the
bq load command (which the Airflow operators mentioned above use internally).

When using bq load, you also have the option to specify a schema. The schema can be a local JSON file, or it can be typed inline as part of the command. You can also use the --autodetect flag instead of supplying a schema definition.
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv#bq

When you specify --autodetect as True, BigQuery generates generic names for your columns, e.g. 'string_field_0', 'int_field_1' - and if you are trying to load into an existing table, bq load fails with the error:
'Cannot add fields (field: string_field_0)'
Airflow operators like GCSToBigQueryOperator fail in the same way.

However, there is also the option NOT to specify --autodetect (or to specify --autodetect=False), in which case bq load loads from Cloud Storage into the existing BigQuery table without problems.

Proposal/TL;DR:
Add an option to omit --autodetect (or specify --autodetect=False) when write_disposition='WRITE_APPEND' is used in GCSToBigQueryOperator. This would allow the operator to update an existing BigQuery table without having to specify a schema in the operator itself (it would simply append data to the existing table).
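A possible shape for the relaxed check, sketched as a standalone function (the function name, signature, and write_disposition handling here are illustrative only, not the actual Airflow API change):

```python
# Hypothetical sketch of the proposed relaxation: when appending to an
# existing table, skip the schema requirement and let BigQuery validate the
# load against the table's existing schema.
def validate_schema_args_relaxed(schema_fields=None, schema_object=None,
                                 autodetect=False,
                                 write_disposition="WRITE_EMPTY"):
    if write_disposition == "WRITE_APPEND" and not autodetect:
        # The existing table already defines the schema; nothing to check.
        return
    if not schema_fields and not schema_object and not autodetect:
        raise ValueError(
            'At least one of `schema_fields`, `schema_object`, '
            'or `autodetect=True` must be passed.'
        )


# Appending to an existing table no longer requires a schema:
validate_schema_args_relaxed(write_disposition="WRITE_APPEND")
```

For other write dispositions the current behavior would be unchanged, so existing DAGs that rely on the error would not be affected.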
