Skip to content
This repository was archived by the owner on Nov 1, 2024. It is now read-only.

Fix(DataFrameCpu): support append value from list#421

Closed
ice-tong wants to merge 1 commit intopytorch:mainfrom
ice-tong:yancong-_append_value-from-list
Closed

Fix(DataFrameCpu): support append value from list#421
ice-tong wants to merge 1 commit intopytorch:mainfrom
ice-tong:yancong-_append_value-from-list

Conversation

@ice-tong
Copy link

@ice-tong ice-tong commented Jul 4, 2022

DataFrameCpu can not append values from list, is this behavior a bug?

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 4, 2022
return new_data.append(it)

elif isinstance(value, tuple):
elif isinstance(value, (tuple, list)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ice-tong for looking into this!

I am wondering what's the use case of this? -- in general TorchArrow prefers to use tuple/named tuple to represent DataFrame/struct column, while use list to represent List column.

Is this for Pandas compatibility? Thanks! ^_^

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @wenleix , I'm try to use DataPipe + TorchArrow in a simple case: download iris dataset from http and parse it into TorchArrow DataFrame.

I found that the CSVParser DataPipe parse data into list, but DataFrameMaker DataPipe does not accept list and raise an unfriendly error.

I think whether it can accept list or provide a more friendly error prompt. Thans! ^_^

Copy link
Author

@ice-tong ice-tong Jul 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are my code, I got "AttributeError: 'NoneType' object has no attribute '_data'"

from torchdata.datapipes.iter import IterableWrapper, HttpReader
import torcharrow.dtypes as dt


FEATURE_NAMES = ["sepal length", "sepal width", "petal length", "petal width"]


def _filter_fn(x):
    return len(x) != 0

def preprocess(df):
    for feature_name in FEATURE_NAMES:
        df[feature_name] = (df[feature_name] - df[feature_name].mean()) / df[feature_name].std()
    return df


iris_data_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'

url_dp = IterableWrapper([iris_data_url])

http_reader_dp = HttpReader(url_dp)

csv_dp = http_reader_dp.parse_csv().filter(_filter_fn)

DTYPE = dt.Struct([
    dt.Field("sepal length", dt.float32), 
    dt.Field("sepal width", dt.float32),
    dt.Field("petal length", dt.float32), 
    dt.Field("petal width", dt.float32),
    dt.Field("label", dt.int32)
])

df_dp = csv_dp.dataframe(dtype=DTYPE, dataframe_size=20).map(preprocess)


print(next(iter(df_dp)))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ice-tong Thanks for the feedback! Yeah I do think we should improve the error message and make it easier to use!

I guess the following should be able to unblock you :

csv_dp = http_reader_dp.parse_csv().map(lambda row: tuple(row)).filter(_filter_fn)

@ejguan , @NivekT : I am wondering if we should add a flag in parse_csv that allows the data into tuple format? something like parse_csv(as_tuple=True)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I added a tuple map to solve this. I will close this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ice-tong . Curious: do you intend to do the normalization (i.e. df[feature_name] = (df[feature_name] - df[feature_name].mean()) / df[feature_name].std()) for each batch, or you mean to do the normalization over the whole dataset? ^_^

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I see. I just want to show the dataframe_size use case, but not a serious code in practice. ^_^

If you're interested, here's an article I wrote about DataPipe + TorchArrow in Chinese: https://zhuanlan.zhihu.com/p/537868554 (受限于经验与水平,如有错误还请赐教)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we we use list by default simply because list is mutable but tuple is not. With list, we can run in-place operations over list using DataPipe.
We should add as_tuple to parse_csv. @ice-tong Feel free to open an PR in TorchData.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ejguan , thanks for the reply. I'm willing to open a PR to add this feature. ^_^

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pls ping me when you have the PR. Thx

@ice-tong ice-tong closed this Jul 7, 2022
facebook-github-bot pushed a commit to meta-pytorch/data that referenced this pull request Jul 15, 2022
Summary:
### Motivation
see pytorch/torcharrow#421

### Changes

- Add `as_tuple` argument for CSVParserIterDataPipe
- Add a functional test for `as_tuple` in tests/test_local_io.py

Pull Request resolved: #646

Reviewed By: wenleix, NivekT

Differential Revision: D37787684

Pulled By: ejguan

fbshipit-source-id: de674e507f717d9008b9eed2cf97c81c69ab563b
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants