Fix(DataFrameCpu): support append value from list by ice-tong · Pull Request #421 · pytorch/torcharrow

ice-tong · 2022-07-04T14:27:16Z

DataFrameCpu can not append values from list, is this behavior a bug?

wenleix · 2022-07-05T15:01:23Z

torcharrow/velox_rt/dataframe_cpu.py

                return new_data.append(it)

-            elif isinstance(value, tuple):
+            elif isinstance(value, (tuple, list)):


Thanks @ice-tong for looking into this!

I am wondering what's the use case of this? -- in general TorchArrow prefers to use tuple/named tuple to represent DataFrame/struct column, while use list to represent List column.

Is this for Pandas compatibility? Thanks! ^_^

Hi @wenleix , I'm try to use DataPipe + TorchArrow in a simple case: download iris dataset from http and parse it into TorchArrow DataFrame.

I found that the CSVParser DataPipe parse data into list, but DataFrameMaker DataPipe does not accept list and raise an unfriendly error.

I think whether it can accept list or provide a more friendly error prompt. Thans! ^_^

Here are my code, I got "AttributeError: 'NoneType' object has no attribute '_data'"

from torchdata.datapipes.iter import IterableWrapper, HttpReader import torcharrow.dtypes as dt FEATURE_NAMES = ["sepal length", "sepal width", "petal length", "petal width"] def _filter_fn(x): return len(x) != 0 def preprocess(df): for feature_name in FEATURE_NAMES: df[feature_name] = (df[feature_name] - df[feature_name].mean()) / df[feature_name].std() return df iris_data_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data' url_dp = IterableWrapper([iris_data_url]) http_reader_dp = HttpReader(url_dp) csv_dp = http_reader_dp.parse_csv().filter(_filter_fn) DTYPE = dt.Struct([ dt.Field("sepal length", dt.float32), dt.Field("sepal width", dt.float32), dt.Field("petal length", dt.float32), dt.Field("petal width", dt.float32), dt.Field("label", dt.int32) ]) df_dp = csv_dp.dataframe(dtype=DTYPE, dataframe_size=20).map(preprocess) print(next(iter(df_dp)))

@ice-tong Thanks for the feedback! Yeah I do think we should improve the error message and make it easier to use!

I guess the following should be able to unblock you :

csv_dp = http_reader_dp.parse_csv().map(lambda row: tuple(row)).filter(_filter_fn)

@ejguan , @NivekT : I am wondering if we should add a flag in parse_csv that allows the data into tuple format? something like parse_csv(as_tuple=True)

Yeah, I added a tuple map to solve this. I will close this PR.

Thanks @ice-tong . Curious: do you intend to do the normalization (i.e. df[feature_name] = (df[feature_name] - df[feature_name].mean()) / df[feature_name].std()) for each batch, or you mean to do the normalization over the whole dataset? ^_^

Yeah, I see. I just want to show the dataframe_size use case, but not a serious code in practice. ^_^

If you're interested, here's an article I wrote about DataPipe + TorchArrow in Chinese： https://zhuanlan.zhihu.com/p/537868554 (受限于经验与水平，如有错误还请赐教)

I think we we use list by default simply because list is mutable but tuple is not. With list, we can run in-place operations over list using DataPipe.
We should add as_tuple to parse_csv. @ice-tong Feel free to open an PR in TorchData.

Hi @ejguan , thanks for the reply. I'm willing to open a PR to add this feature. ^_^

Pls ping me when you have the PR. Thx

Summary: ### Motivation see pytorch/torcharrow#421 ### Changes - Add `as_tuple` argument for CSVParserIterDataPipe - Add a functional test for `as_tuple` in tests/test_local_io.py Pull Request resolved: #646 Reviewed By: wenleix, NivekT Differential Revision: D37787684 Pulled By: ejguan fbshipit-source-id: de674e507f717d9008b9eed2cf97c81c69ab563b

fix(DataFrameCpu): support append value from list

82b4735

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 4, 2022

wenleix reviewed Jul 5, 2022

View reviewed changes

ice-tong closed this Jul 7, 2022

ice-tong mentioned this pull request Jul 12, 2022

[DataPipe] Add as_tuple argument for CSVParserIterDataPipe meta-pytorch/data#646

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix(DataFrameCpu): support append value from list#421

Fix(DataFrameCpu): support append value from list#421
ice-tong wants to merge 1 commit intopytorch:mainfrom
ice-tong:yancong-_append_value-from-list

ice-tong commented Jul 4, 2022

Uh oh!

wenleix Jul 5, 2022

Uh oh!

ice-tong Jul 6, 2022

Uh oh!

ice-tong Jul 6, 2022 •

edited

Loading

Uh oh!

wenleix Jul 7, 2022

Uh oh!

ice-tong Jul 7, 2022

Uh oh!

wenleix Jul 7, 2022

Uh oh!

ice-tong Jul 7, 2022

Uh oh!

ejguan Jul 8, 2022

Uh oh!

ice-tong Jul 9, 2022

Uh oh!

ejguan Jul 11, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

ice-tong commented Jul 4, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ice-tong Jul 6, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ice-tong Jul 6, 2022 •

edited

Loading