Fix(DataFrameCpu): support append value from list#421
ice-tong wants to merge 1 commit into pytorch:main
  return new_data.append(it)

- elif isinstance(value, tuple):
+ elif isinstance(value, (tuple, list)):
Thanks @ice-tong for looking into this!
I am wondering what the use case is here. In general, TorchArrow prefers a tuple/named tuple to represent a DataFrame/struct column, and a list to represent a List column.
Is this for Pandas compatibility? Thanks! ^_^
Hi @wenleix, I'm trying to use DataPipe + TorchArrow in a simple case: download the iris dataset over HTTP and parse it into a TorchArrow DataFrame.
I found that the CSVParser DataPipe parses each row into a list, but the DataFrameMaker DataPipe does not accept lists and raises an unfriendly error.
I wonder whether it could accept lists, or at least provide a friendlier error message. Thanks! ^_^
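To make the "friendlier error" idea concrete, here is a minimal pure-Python sketch. `normalize_row` and its `num_fields` parameter are hypothetical names used only for illustration; this is not TorchArrow's actual code path, just one way a row could be coerced to a tuple or rejected with a readable message.

```python
from typing import Any


def normalize_row(value: Any, num_fields: int) -> tuple:
    """Hypothetical helper: coerce one row to a tuple or fail with a clear message."""
    if isinstance(value, (tuple, list)):
        if len(value) != num_fields:
            raise ValueError(
                f"expected {num_fields} fields per row, got {len(value)}: {value!r}"
            )
        # Lists are accepted but converted, so downstream code only sees tuples.
        return tuple(value)
    raise TypeError(
        f"expected each row to be a tuple or list, got {type(value).__name__}: {value!r}"
    )


# A row shaped like the output of a CSV parser (all fields are strings).
row = normalize_row(["5.1", "3.5", "1.4", "0.2", "Iris-setosa"], num_fields=5)
```

With a check like this, a wrongly typed row would fail at the point of entry rather than later with an unrelated AttributeError.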
Here is my code. I got "AttributeError: 'NoneType' object has no attribute '_data'":
from torchdata.datapipes.iter import IterableWrapper, HttpReader
import torcharrow.dtypes as dt

FEATURE_NAMES = ["sepal length", "sepal width", "petal length", "petal width"]

def _filter_fn(x):
    return len(x) != 0

def preprocess(df):
    for feature_name in FEATURE_NAMES:
        df[feature_name] = (df[feature_name] - df[feature_name].mean()) / df[feature_name].std()
    return df

iris_data_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
url_dp = IterableWrapper([iris_data_url])
http_reader_dp = HttpReader(url_dp)
csv_dp = http_reader_dp.parse_csv().filter(_filter_fn)

DTYPE = dt.Struct([
    dt.Field("sepal length", dt.float32),
    dt.Field("sepal width", dt.float32),
    dt.Field("petal length", dt.float32),
    dt.Field("petal width", dt.float32),
    dt.Field("label", dt.int32)
])
df_dp = csv_dp.dataframe(dtype=DTYPE, dataframe_size=20).map(preprocess)
print(next(iter(df_dp)))
@ice-tong Thanks for the feedback! Yeah I do think we should improve the error message and make it easier to use!
I guess the following should unblock you:
csv_dp = http_reader_dp.parse_csv().map(lambda row: tuple(row)).filter(_filter_fn)
@ejguan, @NivekT: I am wondering if we should add a flag to parse_csv that emits each row as a tuple, something like parse_csv(as_tuple=True)?
Yeah, I added a tuple map to solve this. I will close this PR.
Thanks @ice-tong. Curious: do you intend to do the normalization (i.e. df[feature_name] = (df[feature_name] - df[feature_name].mean()) / df[feature_name].std()) for each batch, or did you mean to normalize over the whole dataset? ^_^
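The distinction raised here can be illustrated with a small pure-Python sketch. The values and batch size are made up, and `statistics.stdev` (sample standard deviation) is an assumption that may not match what TorchArrow's `std()` computes; the point is only that per-batch statistics and whole-dataset statistics give different results.

```python
from statistics import mean, stdev

# Made-up feature values; the second half is on a much larger scale.
values = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0]
batch_size = 4


def standardize(xs, mu, sigma):
    """Subtract the mean and divide by the standard deviation."""
    return [(x - mu) / sigma for x in xs]


batches = [values[i:i + batch_size] for i in range(0, len(values), batch_size)]

# Per-batch: each batch uses its own mean and std, so scale differences
# between batches are erased.
per_batch = [standardize(b, mean(b), stdev(b)) for b in batches]

# Whole dataset: one mean and std shared by every batch, so the relative
# scale between batches is preserved.
global_mu, global_sigma = mean(values), stdev(values)
whole = [standardize(b, global_mu, global_sigma) for b in batches]
```

In this toy data the two batches become indistinguishable after per-batch normalization, while whole-dataset normalization keeps the second batch visibly larger than the first.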
Yeah, I see. I just wanted to show the dataframe_size use case; it is not meant as serious code for practical use. ^_^
If you're interested, here's an article I wrote about DataPipe + TorchArrow in Chinese: https://zhuanlan.zhihu.com/p/537868554 (my experience and ability are limited, so corrections are welcome if you spot any mistakes)
I think we use list by default simply because a list is mutable but a tuple is not. With a list, we can run in-place operations on each row using DataPipe.
We should add as_tuple to parse_csv. @ice-tong Feel free to open a PR in TorchData.
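The mutability argument can be seen in plain Python, independent of any DataPipe. The helper name `to_float_in_place` below is hypothetical and the row values are illustrative; it is a sketch of the kind of in-place row operation a list permits and a tuple rejects.

```python
def to_float_in_place(row):
    """Convert every field except the last (the label) to float, in place."""
    for i in range(len(row) - 1):
        row[i] = float(row[i])
    return row


row_list = ["5.1", "3.5", "Iris-setosa"]
row_tuple = tuple(row_list)  # same content, captured before any mutation

# Works: lists support item assignment.
to_float_in_place(row_list)

# Fails: tuples are immutable, so item assignment raises TypeError.
try:
    to_float_in_place(row_tuple)
    tuple_error = None
except TypeError as exc:
    tuple_error = exc
```

This is why a list default is convenient for in-place transforms, while a tuple output is the better fit for consumers that expect struct-like rows.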
Hi @ejguan , thanks for the reply. I'm willing to open a PR to add this feature. ^_^
Pls ping me when you have the PR. Thx
Summary:
### Motivation
see pytorch/torcharrow#421
### Changes
- Add `as_tuple` argument for CSVParserIterDataPipe
- Add a functional test for `as_tuple` in tests/test_local_io.py

Pull Request resolved: #646
Reviewed By: wenleix, NivekT
Differential Revision: D37787684
Pulled By: ejguan
fbshipit-source-id: de674e507f717d9008b9eed2cf97c81c69ab563b
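As a rough illustration of what an `as_tuple` switch does, here is a standard-library sketch. `parse_csv_rows` is a hypothetical stand-in written for this example, not the CSVParserIterDataPipe implementation; only the flag name `as_tuple` comes from the summary above.

```python
import csv
import io


def parse_csv_rows(stream, as_tuple=False):
    """Hypothetical stand-in: yield CSV rows as lists, or as tuples if as_tuple is set."""
    for row in csv.reader(stream):
        # csv.reader always yields lists; convert only when requested.
        yield tuple(row) if as_tuple else row


text = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"

rows_as_lists = list(parse_csv_rows(io.StringIO(text)))
rows_as_tuples = list(parse_csv_rows(io.StringIO(text), as_tuple=True))
```

Keeping the default as a list preserves the existing behavior, and the flag lets tuple-oriented consumers skip an extra map step.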
DataFrameCpu can not append values from list, is this behavior a bug?