Conversation
ee71bfc to
13a0798
Compare
|
Failed tests should be fixed the the PR in PyTorch Core landed. |
|
@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
|
Failed tests should be fixed by #905 |
wenleix
left a comment
There was a problem hiding this comment.
LGTM % some noob/minor comments
|
|
||
|
|
||
| @functional_datapipe("round_robin_demux") | ||
| class RoundRobinDemultiplexerIterDataPipe(IterDataPipe): |
There was a problem hiding this comment.
curious: why demux is PyTorch core (https://github.com/pytorch/pytorch/blob/7a2930b357a4e62bb0bab53bb0d23c607b6ede38/torch/utils/data/datapipes/iter/combining.py#L305-L306) but round_robin_demux is in TorchData?
There was a problem hiding this comment.
Historical reason. We put lots of DataPipe in core but we plan to move them to torchdata eventually. To reduce the amount of work, I think it's better to put round_robin_demux to torchdata directly.
|
|
||
| datapipe = datapipe.enumerate() | ||
| container = _RoundRobinDemultiplexerIterDataPipe(datapipe, num_instances, buffer_size=buffer_size) | ||
| return [_ChildDataPipe(container, i).map(_drop_index) for i in range(num_instances)] |
There was a problem hiding this comment.
As a Python noob, TIL I learnt X(...) in Python may not return a instance of X
https://www.geeksforgeeks.org/__new__-in-python/
... because other class constructors can be called by new method or it can simply return other objects as an instance of this class.
Will _ChildDataPipe.__init__ be called for each instance in the list? ~
| return datapipe | ||
|
|
||
| datapipe = datapipe.enumerate() | ||
| container = _RoundRobinDemultiplexerIterDataPipe(datapipe, num_instances, buffer_size=buffer_size) |
There was a problem hiding this comment.
curious: can we just do
_DemultiplexerIterDataPipe(
datapipe,
num_instances,
partial(num_instances, _round_robin_fn),
False,
buffer_size
)where _round_robin_fn is:
def _round_robin_fn(num_instances: int, idx_data) -> int:
idx, _ = idx_data
return idx % num_instancesIs that because you want to override _DemultiplexerIterDataPipe.get_length_by_instance ?
There was a problem hiding this comment.
Yeah. demux doesn't provide a valid length but this DataPipe should have a deterministic result. Let me know if you have better way to do so.
There was a problem hiding this comment.
noob question : what's "deterministic result" here mean?
Especially I was somehow under the impression that in general try to avoid query len(datapipe) for large datapipe , since it probably needs to read all data and it's expensive?
There was a problem hiding this comment.
If the prior DataPipe provides a length, we should be able to get the length ahead of time.
For demux which uses a classify_fn, there is no way for us to know the length and the length can vary across epoches.
in general try to avoid query
len(datapipe)for large datapipe , since it probably needs to read all data and it's expensive?
It's true. But, we do provide DataPipe that can manually assign length to the DataPipe https://github.com/pytorch/data/blob/67afe0c36f5f93d34fde7026e7e95835c1160bb3/torchdata/datapipes/iter/util/header.py#L62
|
Unrelated question: Why demux and unzip data pipes are in |
Aha. We categorized DataPipe into a few sectors. See: https://pytorch.org/data/beta/index.html We put them together as they are the counter party of When #293 is done, we probably need to make them into |
1b169ed to
9c8fa69
Compare
|
@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
9c8fa69 to
2ea1de0
Compare
|
@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
The only reason I added this DataPipe without directly using
enumereate().demux().drop_index()is thisRoundRobinDemuxshould provide a valid length. So, this PR needs pytorch/pytorch#89216 landedAnd, this DataPipe will be used for
Proto2RS.Side change: Move
Unziptocombining.pyas well.