[DataPipe] Add .set_length method to IterDataPipe#83750
[DataPipe] Add .set_length method to IterDataPipe#83750NivekT wants to merge 4 commits intogh/nivekt/54/basefrom
.set_length method to IterDataPipe#83750Conversation
[ghstack-poisoned]
🔗 Helpful links
❌ 7 New Failures, 1 Base FailuresAs of commit d4a7d39 (more details on the Dr. CI page): Expand to see more
🕵️ 7 new failures recognized by patternsThe following CI failures do not appear to be due to upstream breakages
|
|
|
||
| namespace['reset'] = conditional_reset | ||
|
|
||
| if '__len__' in namespace: |
There was a problem hiding this comment.
We need to add this hook no matter __len__ is in the namespace or not. We can raise NotImplementedError if _length is None.
There was a problem hiding this comment.
Done. Note that isinstance(dp, collections.abc.Sized) will always be True now, because __len__ always exists as a method through the hook. Is that fine?
If that is not fine, we can consider somehow intercepting the len(dp) call but it may be complicated (if feasible at all).
There was a problem hiding this comment.
Note that this causes so test failures, so we will need to adjust some other logic, such as the following in "combining.py": if all(isinstance(dp, Sized) for dp in self.datapipes)
There was a problem hiding this comment.
If that is not fine, we can consider somehow intercepting the
len(dp)call but it may be complicated (if feasible at all).
Based on my understanding, not really feasible as __len__ is implemented on the class rather than instsances.
Note that this causes so test failures, so we will need to adjust some other logic, such as the following in "combining.py":
if all(isinstance(dp, Sized) for dp in self.datapipes)
Emmm, this is a good question. I personally don't know if there is any drawback of making all IterDataPipe Sized (raising TypeError if needed). Let's discuss with Vitaly about the decision? This seems like a BC-breaking change.
There was a problem hiding this comment.
Yea, let's have @VitalyFedyunin add his thoughts.
The benefit on implementing .set_length as a DataPipe instead is that it doesn't require changes to __len__ or Sized. In the absence of that, I do think a hook makes more sense.
This allows users to manually set length of an `IterDataPipe` with no other side effect. This is useful for DataPipes whose final length cannot be known in advance (e.g. `filter`). If you know the final length with certainty, you can manually set it for usages by DataLoader or other DataPipes. Previously, users theoretically could use `.header(length)` to manually set length. However, the UX is suboptimal in that 1) the name is not obvious, and 2) warnings are raised unless the `HeaderDataPipe` has been fully traversed once to confirm the length. [ghstack-poisoned]
This allows users to manually set length of an `IterDataPipe` with no other side effect. This is useful for DataPipes whose final length cannot be known in advance (e.g. `filter`). If you know the final length with certainty, you can manually set it for usages by DataLoader or other DataPipes. Previously, users theoretically could use `.header(length)` to manually set length. However, the UX is suboptimal in that 1) the name is not obvious, and 2) warnings are raised unless the `HeaderDataPipe` has been fully traversed once to confirm the length. [ghstack-poisoned]
This allows users to manually set length of an `IterDataPipe` with no other side effect. This is useful for DataPipes whose final length cannot be known in advance (e.g. `filter`). If you know the final length with certainty, you can manually set it for usages by DataLoader or other DataPipes. Previously, users theoretically could use `.header(length)` to manually set length. However, the UX is suboptimal in that 1) the name is not obvious, and 2) warnings are raised unless the `HeaderDataPipe` has been fully traversed once to confirm the length. [ghstack-poisoned]
|
|
||
| namespace['reset'] = conditional_reset | ||
|
|
||
| if '__len__' in namespace: |
There was a problem hiding this comment.
Done. Note that isinstance(dp, collections.abc.Sized) will always be True now, because __len__ always exists as a method through the hook. Is that fine?
If that is not fine, we can consider somehow intercepting the len(dp) call but it may be complicated (if feasible at all).
| if len_func is not None: | ||
| return len_func(*args, **kwargs) | ||
| else: | ||
| raise TypeError(f"object of type '{datapipe}' has no len()") |
There was a problem hiding this comment.
The original exception is a TypeError but I can switch it to NotImplementedError if that is preferred.
There was a problem hiding this comment.
TypeError is needed for running list(dp).
|
@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
.set_length method to IterDataPipe
|
We will be landing meta-pytorch/data#747 instead of this PR. |
Stack from ghstack:
.set_lengthmethod to IterDataPipe #83750Fixes meta-pytorch/data#549
This allows users to manually set length of an
IterDataPipewith no other side effect. This is useful for DataPipes whose final length cannot be known in advance (e.g.filter). If you know the final length with certainty, you can manually set it for usages by DataLoader or other DataPipes.Previously, users theoretically could use
.header(length)to manually set length. However, the UX is suboptimal in that 1) the name is not obvious, and 2) warnings are raised unless theHeaderDataPipehas been fully traversed once to confirm the length.Differential Revision: D38879334