BigQuery: Fixed pandas DataFrames being returned with incorrect index. by eriknil · Pull Request #7953 · googleapis/google-cloud-python

eriknil · 2019-05-13T00:15:18Z

When loading large datasets from BigQuery as pandas DataFrames, the index sometimes contains duplicates. This happens when the results are collected as multiple DataFrames and then concatenated without resetting the index.

As an example, we get 0, 1 repeated in the index:

In [1]: import pandas as pd
In [2]: x = pd.DataFrame({"a": [1, 2]})
   ...: pd.concat([x, x])
Out[2]:
   a
0  1
1  2
0  1
1  2

which has some unintended consequences when we try to subset:

In [3]: pd.concat([x, x]).loc[0]
Out[3]:
   a
0  1
0  1

Instead, by setting ignore_index=True we get:

In [4]: pd.concat([x, x], ignore_index=True)
Out[4]:
   a
0  1
1  2
2  1
3  2

From the pandas documentation https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html:

ignore_index : boolean, default False
If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, …, n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join.

In our case the indexes really don't have any meaningful information.
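
To make the idea concrete, here is a minimal sketch of combining per-page frames with a fresh index. The helper name `concat_pages` and the simulated `pages` list are made up for illustration and are not part of google-cloud-bigquery; the actual change in this PR may be structured differently.

```python
import pandas as pd


def concat_pages(pages):
    """Combine per-page DataFrames into one frame labeled 0..n-1.

    Hypothetical helper for illustration only; not the library's API.
    """
    frames = list(pages)
    if not frames:
        # Nothing to concatenate; pd.concat raises on an empty list.
        return pd.DataFrame()
    # ignore_index=True discards each page's own 0-based index.
    return pd.concat(frames, ignore_index=True)


# Three simulated result pages, each carrying its own default index.
pages = [
    pd.DataFrame({"a": [1, 2]}),
    pd.DataFrame({"a": [3]}),
    pd.DataFrame({"a": [4, 5]}),
]
combined = concat_pages(pages)
print(combined.index.is_unique)  # True
print(list(combined.index))      # [0, 1, 2, 3, 4]
```

The row order is unchanged; only the labels are regenerated, so `.loc[0]` again refers to a single row.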

Reproducible example

We can see this in action by running the following query on the crypto-dash dataset:

from google.cloud import bigquery
bq_client = bigquery.Client()

query = """
SELECT
  block_timestamp_month
FROM
  `bigquery-public-data.crypto_dash.transactions`
LIMIT
  1000000
"""
data = bq_client.query(query).result().to_dataframe()

and then checking if the indexes are unique:

print(data.index.is_unique)
> False
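
For anyone who has already loaded a frame with a repeated index, a possible workaround is to inspect the duplicated labels and reset the index afterwards. The sketch below uses a small local stand-in instead of the real query result (the column values are invented), so it runs without BigQuery credentials.

```python
import pandas as pd

# Stand-in for a result assembled from two pages without resetting the
# index; the values are invented for illustration.
page = pd.DataFrame({"block_timestamp_month": ["2019-01-01", "2019-02-01"]})
data = pd.concat([page, page])

# Labels that occur more than once in the index.
duplicated_labels = data.index[data.index.duplicated()].unique()
print(list(duplicated_labels))  # [0, 1]

# drop=True discards the old labels instead of keeping them as a column.
fixed = data.reset_index(drop=True)
print(fixed.index.is_unique)    # True
```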

I have replicated the problem with two different unit tests, and fixed it in this PR.

googlebot · 2019-05-13T00:15:21Z

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.

What to do if you already signed the CLA

Individual signers

It's possible we don't have your GitHub username or you're using a different email address on your commit. Check your existing CLA data and verify that your email is set on your git commits.

Corporate signers

Your company has a Point of Contact who decides which employees are authorized to participate. Ask your POC to be added to the group of authorized contributors. If you don't know who your Point of Contact is, direct the Google project maintainer to go/cla#troubleshoot (Public version).
The email used to register you as an authorized contributor must be the email used for the Git commit. Check your existing CLA data and verify that your email is set on your git commits.
The email used to register you as an authorized contributor must also be attached to your GitHub account.


eriknil · 2019-05-13T00:20:01Z

I signed it!

googlebot · 2019-05-13T00:20:04Z

CLAs look good, thanks!


tswast

Thanks for the contribution, explanation, and the unit tests!

eriknil added 2 commits May 12, 2019 19:34

added tests confirming the bug

5c7779e

reset index when concating data frames

54d7c43

eriknil requested a review from a team May 13, 2019 00:15

googlebot added the cla: no This human has *not* signed the Contributor License Agreement. label May 13, 2019

googlebot added cla: yes This human has signed the Contributor License Agreement. and removed cla: no This human has *not* signed the Contributor License Agreement. labels May 13, 2019

tswast approved these changes May 13, 2019

tswast merged commit 53e492c into googleapis:master May 13, 2019

tswast merged 2 commits into googleapis:master from eriknil:fix-bq-pandas-incorrect-index