Skip to content

Feat: Onboard The General Index Dataset#342

Merged
nlarge-google merged 22 commits into
GoogleCloudPlatform:mainfrom
gkodukula:general_index
Jun 8, 2022
Merged

Feat: Onboard The General Index Dataset#342
nlarge-google merged 22 commits into
GoogleCloudPlatform:mainfrom
gkodukula:general_index

Conversation

@gkodukula

@gkodukula gkodukula commented Apr 13, 2022

Copy link
Copy Markdown
Contributor

Checklist

Note: If an item applies to you, all of its sub-items must be fulfilled

  • (Required) This pull request is appropriately labeled
  • Please merge this pull request after it's approved
  • I'm adding or editing a dataset
    • The Google Cloud Datasets team is aware of the proposed dataset
    • I put all my code inside datasets/the_general_index> and nothing outside of that directory

".csv", "-" + str(chunk_number) + ".csv"
)
df = pd.DataFrame()
df = pd.concat([df, chunk])

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you are concatenating an empty dataframe to chunk each time here. I think the result of lines 209 and 210 is the same as just df = chunk

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@happyhuman Gowtham and myself are working on this together. It turns out that data in some of the source files exceeds pandas csv row string length maximum and pandas does not provide a method to be able to extend that maximum, so for this implementation we need to rewrite the file reading process to use "import csv" module instead. As a result we are not yet ready to push to production.

) -> pd.DataFrame:
for column in null_string_list:
logging.info(f"Removing null strings from column {column}")
df[column] = df[column].str.replace("\\N", "", regex=False)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering if it should be "\n"?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@happyhuman The \N is actual text in the data "\N" is put in place of null when the source data was dumped; so we are not actually looking for newlines. Therefore, the "\N" is correct here.



def convert_dt_format(dt_str: str) -> str:
if not dt_str or str(dt_str).lower() == "nan" or str(dt_str).lower() == "nat":

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If dt_str can be None, or nan or nat, a good way to check them all is to use pd.notnull(dt_str).

Also, if dt_str is expected to be of type str, then using str(dt_str) seems redundant.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@happyhuman The nan/nat issue has been identified before in previous PR's. Use of notnull has not been successful before. However, we will try both issues again to see if we can get your suggestion/s to work. Thanks for the reminder.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@happyhuman pd.notnull will not work for NaN or NaT. Please see https://pandas.pydata.org/docs/reference/api/pandas.notnull.html

the only way to access lower is through the "str" object so the only other alternative would be str.lower(dt_str) which is not much different so I think this is moot. Let me know if you still want me to change to str.lower(dt_str) or not.

@nlarge-google nlarge-google changed the title Feat: Onboard The General index Dataset Feat: Onboard The General Index Dataset Apr 15, 2022
@nlarge-google

Copy link
Copy Markdown
Contributor

@gkodukula this is failing in dev. please review and fix. thanks!

@nlarge-google nlarge-google merged commit 67d7216 into GoogleCloudPlatform:main Jun 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants