BigQuery API -- list_rows with 'start_index' and a large 'max_results' returns wrong output. #4

@adussarps

Description

Environment details

google-cloud-bigquery client, version 1.23.1
Python 3.7 (on Linux and macOS)

Steps to reproduce

  1. Calling client.list_rows with both max_results and start_index returns wrong data when
    the client needs more than one page to fetch the result.
    After the first request, the client issues follow-up calls that send both the page token
    ('pageToken' in the trace below) and 'startIndex', which seem to be incompatible.

Code example

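import logging
from typing import Iterator

import pandas as pd

BATCH_SIZE_ROWS = 100000  # inferred from maxResults=100000 in the trace below


def table_to_df_iterator(project_id, dataset_id, table_id) -> Iterator[pd.DataFrame]:
    table_full_id = project_id + "." + dataset_id + "." + table_id
    client = get_client()  # reporter's helper (not shown); presumably returns a bigquery.Client
    index = 0
    while True:
        # Read the table in fixed-size batches by advancing start_index.
        offset = BATCH_SIZE_ROWS * index
        df = client.list_rows(table_full_id, max_results=BATCH_SIZE_ROWS,
                              start_index=offset).to_dataframe()
        if df.empty:
            break
        logging.info(f"Offset is at {offset} got a dataframe of size {len(df.index)}")
        yield df
        index += 1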

Trace

DEBUG:google.cloud.bigquery.table:Started reading table 'samsung-global-dashboard.1_Raw.Facebook_SEUK_VIDEO_20190101' with tabledata.list.
DEBUG:urllib3.connectionpool:https://bigquery.googleapis.com:443 "GET /bigquery/v2/projects/samsung-global-dashboard/datasets/1_Raw/tables/Facebook_SEUK_VIDEO_20190101/data?maxResults=100000&startIndex=100000 HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:https://bigquery.googleapis.com:443 "GET /bigquery/v2/projects/samsung-global-dashboard/datasets/1_Raw/tables/Facebook_SEUK_VIDEO_20190101/data?pageToken=BEP6ZNORN4AQAAASAUIIBAEAAUNAQCEG6ADBBIENAYQP777777777777P4VAA%3D%3D%3D&maxResults=87354&startIndex=100000 HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:https://bigquery.googleapis.com:443 "GET /bigquery/v2/projects/samsung-global-dashboard/datasets/1_Raw/tables/Facebook_SEUK_VIDEO_20190101/data?pageToken=BEP6ZNORN4AQAAASAUIIBAEAAUNAQCEG6ADBBOVKAUQP777777777777P4VAA%3D%3D%3D&maxResults=74708&startIndex=100000 HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:https://bigquery.googleapis.com:443 "GET /bigquery/v2/projects/samsung-global-dashboard/datasets/1_Raw/tables/Facebook_SEUK_VIDEO_20190101/data?pageToken=BEP6ZNORN4AQAAASAUIIBAEAAUNAQCEG6ADBBVGHAQQP777777777777P4VAA%3D%3D%3D&maxResults=62062&startIndex=100000 HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:https://bigquery.googleapis.com:443 "GET /bigquery/v2/projects/samsung-global-dashboard/datasets/1_Raw/tables/Facebook_SEUK_VIDEO_20190101/data?pageToken=BEP6ZNORN4AQAAASAUIIBAEAAUNAQCEG6ADBB3XEAMQP777777777777P4VAA%3D%3D%3D&maxResults=49416&startIndex=100000 HTTP/1.1" 200 None

Idea to fix

Make the second call use an updated startIndex instead of 'nextPageToken'
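
To make the idea concrete, here is a minimal sketch of the query parameters each follow-up request could carry if the offset were advanced instead of sending a page token. It is only an illustration, not the library's implementation: `page_params` is a hypothetical helper, and the page size of 12646 rows is an assumption read off the trace above, where maxResults drops by 12646 on every call. A real client would not know the page size in advance and would instead count the rows actually returned.

```python
from typing import Dict, Iterator


def page_params(start_index: int, max_results: int, page_size: int) -> Iterator[Dict[str, int]]:
    """Yield the query parameters for each request (hypothetical helper).

    Each request advances startIndex by the number of rows already served
    and never sends a page token alongside it.
    """
    offset = start_index
    remaining = max_results
    while remaining > 0:
        yield {"startIndex": offset, "maxResults": remaining}
        # Assumption: the server returns at most page_size rows per response.
        served = min(page_size, remaining)
        offset += served
        remaining -= served


# Values taken from the trace: startIndex=100000, maxResults=100000,
# and an observed page size of 12646 rows.
for params in page_params(100000, 100000, 12646):
    print(params)
```

With these inputs the second request would ask for startIndex=112646 and maxResults=87354, rather than repeating startIndex=100000 next to a page token.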

Thanks!

Labels

api: bigquery (Issues related to the googleapis/python-bigquery API.)
priority: p2 (Moderately-important priority. Fix may not be included in next release.)
type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)
