Skip to content

[DataPipe] Add .set_length method to IterDataPipe#83750

Closed
NivekT wants to merge 4 commits intogh/nivekt/54/basefrom
gh/nivekt/54/head
Closed

[DataPipe] Add .set_length method to IterDataPipe#83750
NivekT wants to merge 4 commits intogh/nivekt/54/basefrom
gh/nivekt/54/head

Conversation

@NivekT
Copy link
Contributor

@NivekT NivekT commented Aug 19, 2022

Stack from ghstack:

Fixes meta-pytorch/data#549

This allows users to manually set length of an IterDataPipe with no other side effect. This is useful for DataPipes whose final length cannot be known in advance (e.g. filter). If you know the final length with certainty, you can manually set it for usages by DataLoader or other DataPipes.

Previously, users theoretically could use .header(length) to manually set length. However, the UX is suboptimal in that 1) the name is not obvious, and 2) warnings are raised unless the HeaderDataPipe has been fully traversed once to confirm the length.

Differential Revision: D38879334

@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Aug 19, 2022

🔗 Helpful links

❌ 7 New Failures, 1 Base Failures

As of commit d4a7d39 (more details on the Dr. CI page):

Expand to see more
  • 7/8 failures introduced in this PR
  • 1/8 broken upstream at merge base 3074219 on Aug 19 from 9:59am to 7:35pm

🕵️ 7 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build pull / linux-focal-py3.7-clang7-asan / test (default, 4, 5, linux.2xlarge) (1/7)

Step: "Test" (full log | diagnosis details)

2022-08-19T22:04:15.3431192Z FAIL [0.002s]: tes...iterdatapipe (__main__.TestFunctionalIterDataPipe)
2022-08-19T22:04:15.3428940Z 
2022-08-19T22:04:15.3429025Z ======================================================================
2022-08-19T22:04:15.3429298Z FAIL [0.002s]: test_sampler_iterdatapipe (__main__.TestFunctionalIterDataPipe)
2022-08-19T22:04:15.3429670Z ----------------------------------------------------------------------
2022-08-19T22:04:15.3429928Z Traceback (most recent call last):
2022-08-19T22:04:15.3430165Z   File "test_datapipe.py", line 1560, in test_sampler_iterdatapipe
2022-08-19T22:04:15.3430432Z     sampled_dp = dp.iter.Sampler(input_dp_nolen)
2022-08-19T22:04:15.3430677Z AssertionError: AssertionError not raised
2022-08-19T22:04:15.3430818Z 
2022-08-19T22:04:15.3430898Z ======================================================================
2022-08-19T22:04:15.3431192Z FAIL [0.002s]: test_zip_iterdatapipe (__main__.TestFunctionalIterDataPipe)
2022-08-19T22:04:15.3431598Z ----------------------------------------------------------------------
2022-08-19T22:04:15.3432203Z TypeError: object of type 'IDP_NoLen' has no len(). This is likely becausethe length of the DataPipe (or its source DataPipes) cannot be known in advance.Consider manually setting the length with `dp.set_length(length)`.
2022-08-19T22:04:15.3432538Z 
2022-08-19T22:04:15.3432675Z During handling of the above exception, another exception occurred:
2022-08-19T22:04:15.3432840Z 
2022-08-19T22:04:15.3432932Z Traceback (most recent call last):
2022-08-19T22:04:15.3433177Z   File "test_datapipe.py", line 1653, in test_zip_iterdatapipe
2022-08-19T22:04:15.3433383Z     len(zipped_dp)
2022-08-19T22:04:15.3434005Z AssertionError: "instance doesn't have valid length$" does not match "object of type 'IDP_NoLen' has no len(). This is likely becausethe length of the DataPipe (or its source DataPipes) cannot be known in advance.Consider manually setting the length with `dp.set_length(length)`."
2022-08-19T22:04:15.3434359Z 

See GitHub Actions build pull / linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge) (2/7)

Step: "Test" (full log | diagnosis details)

2022-08-19T22:02:17.4574387Z FAIL [0.002s]: tes...iterdatapipe (__main__.TestFunctionalIterDataPipe)
2022-08-19T22:02:17.4572039Z 
2022-08-19T22:02:17.4572132Z ======================================================================
2022-08-19T22:02:17.4572484Z FAIL [0.002s]: test_sampler_iterdatapipe (__main__.TestFunctionalIterDataPipe)
2022-08-19T22:02:17.4572850Z ----------------------------------------------------------------------
2022-08-19T22:02:17.4573105Z Traceback (most recent call last):
2022-08-19T22:02:17.4573397Z   File "test_datapipe.py", line 1560, in test_sampler_iterdatapipe
2022-08-19T22:02:17.4573652Z     sampled_dp = dp.iter.Sampler(input_dp_nolen)
2022-08-19T22:02:17.4573898Z AssertionError: AssertionError not raised
2022-08-19T22:02:17.4574036Z 
2022-08-19T22:02:17.4574131Z ======================================================================
2022-08-19T22:02:17.4574387Z FAIL [0.002s]: test_zip_iterdatapipe (__main__.TestFunctionalIterDataPipe)
2022-08-19T22:02:17.4574753Z ----------------------------------------------------------------------
2022-08-19T22:02:17.4575342Z TypeError: object of type 'IDP_NoLen' has no len(). This is likely becausethe length of the DataPipe (or its source DataPipes) cannot be known in advance.Consider manually setting the length with `dp.set_length(length)`.
2022-08-19T22:02:17.4575647Z 
2022-08-19T22:02:17.4575787Z During handling of the above exception, another exception occurred:
2022-08-19T22:02:17.4575952Z 
2022-08-19T22:02:17.4576031Z Traceback (most recent call last):
2022-08-19T22:02:17.4576283Z   File "test_datapipe.py", line 1653, in test_zip_iterdatapipe
2022-08-19T22:02:17.4576504Z     len(zipped_dp)
2022-08-19T22:02:17.4577126Z AssertionError: "instance doesn't have valid length$" does not match "object of type 'IDP_NoLen' has no len(). This is likely becausethe length of the DataPipe (or its source DataPipes) cannot be known in advance.Consider manually setting the length with `dp.set_length(length)`."
2022-08-19T22:02:17.4577488Z 

See GitHub Actions build pull / win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge) (3/7)

Step: "Test" (full log | diagnosis details)

2022-08-19T23:29:18.4689037Z FAIL [0.000s]: tes...iterdatapipe (__main__.TestFunctionalIterDataPipe)
2022-08-19T23:29:18.4686437Z 
2022-08-19T23:29:18.4686542Z ======================================================================
2022-08-19T23:29:18.4686835Z FAIL [0.000s]: test_sampler_iterdatapipe (__main__.TestFunctionalIterDataPipe)
2022-08-19T23:29:18.4687176Z ----------------------------------------------------------------------
2022-08-19T23:29:18.4687430Z Traceback (most recent call last):
2022-08-19T23:29:18.4687819Z   File "C:\actions-runner\_work\pytorch\pytorch\test\test_datapipe.py", line 1560, in test_sampler_iterdatapipe
2022-08-19T23:29:18.4688181Z     sampled_dp = dp.iter.Sampler(input_dp_nolen)
2022-08-19T23:29:18.4688428Z AssertionError: AssertionError not raised
2022-08-19T23:29:18.4688653Z 
2022-08-19T23:29:18.4688752Z ======================================================================
2022-08-19T23:29:18.4689037Z FAIL [0.000s]: test_zip_iterdatapipe (__main__.TestFunctionalIterDataPipe)
2022-08-19T23:29:18.4689372Z ----------------------------------------------------------------------
2022-08-19T23:29:18.4689905Z TypeError: object of type 'IDP_NoLen' has no len(). This is likely becausethe length of the DataPipe (or its source DataPipes) cannot be known in advance.Consider manually setting the length with `dp.set_length(length)`.
2022-08-19T23:29:18.4690239Z 
2022-08-19T23:29:18.4690391Z During handling of the above exception, another exception occurred:
2022-08-19T23:29:18.4690569Z 
2022-08-19T23:29:18.4690669Z Traceback (most recent call last):
2022-08-19T23:29:18.4691050Z   File "C:\actions-runner\_work\pytorch\pytorch\test\test_datapipe.py", line 1653, in test_zip_iterdatapipe
2022-08-19T23:29:18.4691342Z     len(zipped_dp)
2022-08-19T23:29:18.4691838Z AssertionError: "instance doesn't have valid length$" does not match "object of type 'IDP_NoLen' has no len(). This is likely becausethe length of the DataPipe (or its source DataPipes) cannot be known in advance.Consider manually setting the length with `dp.set_length(length)`."
2022-08-19T23:29:18.4692226Z 

See GitHub Actions build pull / linux-bionic-py3.7-clang9 / test (dynamo, 1, 2, linux.2xlarge) (4/7)

Step: "Test" (full log | diagnosis details)

2022-08-19T22:02:15.8823040Z FAIL [0.002s]: tes...iterdatapipe (__main__.TestFunctionalIterDataPipe)
2022-08-19T22:02:15.8820761Z 
2022-08-19T22:02:15.8820855Z ======================================================================
2022-08-19T22:02:15.8821139Z FAIL [0.002s]: test_sampler_iterdatapipe (__main__.TestFunctionalIterDataPipe)
2022-08-19T22:02:15.8821500Z ----------------------------------------------------------------------
2022-08-19T22:02:15.8821759Z Traceback (most recent call last):
2022-08-19T22:02:15.8822012Z   File "test_datapipe.py", line 1560, in test_sampler_iterdatapipe
2022-08-19T22:02:15.8822266Z     sampled_dp = dp.iter.Sampler(input_dp_nolen)
2022-08-19T22:02:15.8822514Z AssertionError: AssertionError not raised
2022-08-19T22:02:15.8822658Z 
2022-08-19T22:02:15.8822752Z ======================================================================
2022-08-19T22:02:15.8823040Z FAIL [0.002s]: test_zip_iterdatapipe (__main__.TestFunctionalIterDataPipe)
2022-08-19T22:02:15.8823396Z ----------------------------------------------------------------------
2022-08-19T22:02:15.8823997Z TypeError: object of type 'IDP_NoLen' has no len(). This is likely becausethe length of the DataPipe (or its source DataPipes) cannot be known in advance.Consider manually setting the length with `dp.set_length(length)`.
2022-08-19T22:02:15.8824303Z 
2022-08-19T22:02:15.8824439Z During handling of the above exception, another exception occurred:
2022-08-19T22:02:15.8824637Z 
2022-08-19T22:02:15.8824730Z Traceback (most recent call last):
2022-08-19T22:02:15.8824970Z   File "test_datapipe.py", line 1643, in test_zip_iterdatapipe
2022-08-19T22:02:15.8825215Z     def test_zip_iterdatapipe(self):
2022-08-19T22:02:15.8825468Z   File "test_datapipe.py", line 1653, in test_zip_iterdatapipe
2022-08-19T22:02:15.8825675Z     len(zipped_dp)

See GitHub Actions build pull / linux-bionic-cuda11.6-py3.10-gcc7 / test (default, 3, 4, linux.4xlarge.nvidia.gpu) (5/7)

Step: "Test" (full log | diagnosis details)

2022-08-19T22:27:11.9341751Z FAIL [0.002s]: tes...iterdatapipe (__main__.TestFunctionalIterDataPipe)
2022-08-19T22:27:11.9338533Z 
2022-08-19T22:27:11.9338653Z ======================================================================
2022-08-19T22:27:11.9339030Z FAIL [0.002s]: test_sampler_iterdatapipe (__main__.TestFunctionalIterDataPipe)
2022-08-19T22:27:11.9339547Z ----------------------------------------------------------------------
2022-08-19T22:27:11.9339900Z Traceback (most recent call last):
2022-08-19T22:27:11.9340281Z   File "/var/lib/jenkins/workspace/test/test_datapipe.py", line 1559, in test_sampler_iterdatapipe
2022-08-19T22:27:11.9340685Z     with self.assertRaises(AssertionError):
2022-08-19T22:27:11.9341069Z AssertionError: AssertionError not raised
2022-08-19T22:27:11.9341270Z 
2022-08-19T22:27:11.9341388Z ======================================================================
2022-08-19T22:27:11.9341751Z FAIL [0.002s]: test_zip_iterdatapipe (__main__.TestFunctionalIterDataPipe)
2022-08-19T22:27:11.9342259Z ----------------------------------------------------------------------
2022-08-19T22:27:11.9343133Z TypeError: object of type 'IDP_NoLen' has no len(). This is likely becausethe length of the DataPipe (or its source DataPipes) cannot be known in advance.Consider manually setting the length with `dp.set_length(length)`.
2022-08-19T22:27:11.9343558Z 
2022-08-19T22:27:11.9343741Z During handling of the above exception, another exception occurred:
2022-08-19T22:27:11.9343978Z 
2022-08-19T22:27:11.9344116Z Traceback (most recent call last):
2022-08-19T22:27:11.9344504Z   File "/var/lib/jenkins/workspace/test/test_datapipe.py", line 1652, in test_zip_iterdatapipe
2022-08-19T22:27:11.9345054Z     with self.assertRaisesRegex(TypeError, r"instance doesn't have valid length$"):
2022-08-19T22:27:11.9345958Z AssertionError: "instance doesn't have valid length$" does not match "object of type 'IDP_NoLen' has no len(). This is likely becausethe length of the DataPipe (or its source DataPipes) cannot be known in advance.Consider manually setting the length with `dp.set_length(length)`."
2022-08-19T22:27:11.9346458Z 

See GitHub Actions build pull / linux-bionic-py3.7-clang9 / test (crossref, 2, 2, linux.2xlarge) (6/7)

Step: "Test" (full log | diagnosis details)

2022-08-19T22:02:04.1602330Z FAIL [0.002s]: tes...iterdatapipe (__main__.TestFunctionalIterDataPipe)
2022-08-19T22:02:04.1600103Z 
2022-08-19T22:02:04.1600202Z ======================================================================
2022-08-19T22:02:04.1600465Z FAIL [0.002s]: test_sampler_iterdatapipe (__main__.TestFunctionalIterDataPipe)
2022-08-19T22:02:04.1600839Z ----------------------------------------------------------------------
2022-08-19T22:02:04.1601093Z Traceback (most recent call last):
2022-08-19T22:02:04.1601333Z   File "test_datapipe.py", line 1560, in test_sampler_iterdatapipe
2022-08-19T22:02:04.1601597Z     sampled_dp = dp.iter.Sampler(input_dp_nolen)
2022-08-19T22:02:04.1601842Z AssertionError: AssertionError not raised
2022-08-19T22:02:04.1601981Z 
2022-08-19T22:02:04.1602061Z ======================================================================
2022-08-19T22:02:04.1602330Z FAIL [0.002s]: test_zip_iterdatapipe (__main__.TestFunctionalIterDataPipe)
2022-08-19T22:02:04.1602694Z ----------------------------------------------------------------------
2022-08-19T22:02:04.1603299Z TypeError: object of type 'IDP_NoLen' has no len(). This is likely becausethe length of the DataPipe (or its source DataPipes) cannot be known in advance.Consider manually setting the length with `dp.set_length(length)`.
2022-08-19T22:02:04.1603602Z 
2022-08-19T22:02:04.1603727Z During handling of the above exception, another exception occurred:
2022-08-19T22:02:04.1603893Z 
2022-08-19T22:02:04.1603984Z Traceback (most recent call last):
2022-08-19T22:02:04.1604230Z   File "test_datapipe.py", line 1653, in test_zip_iterdatapipe
2022-08-19T22:02:04.1604449Z     len(zipped_dp)
2022-08-19T22:02:04.1605056Z AssertionError: "instance doesn't have valid length$" does not match "object of type 'IDP_NoLen' has no len(). This is likely becausethe length of the DataPipe (or its source DataPipes) cannot be known in advance.Consider manually setting the length with `dp.set_length(length)`."
2022-08-19T22:02:04.1605413Z 

See GitHub Actions build pull / linux-focal-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge) (7/7)

Step: "Test" (full log | diagnosis details)

2022-08-19T22:01:04.3388589Z FAIL [0.002s]: tes...iterdatapipe (__main__.TestFunctionalIterDataPipe)
2022-08-19T22:01:04.3386213Z 
2022-08-19T22:01:04.3386309Z ======================================================================
2022-08-19T22:01:04.3386585Z FAIL [0.002s]: test_sampler_iterdatapipe (__main__.TestFunctionalIterDataPipe)
2022-08-19T22:01:04.3387036Z ----------------------------------------------------------------------
2022-08-19T22:01:04.3387280Z Traceback (most recent call last):
2022-08-19T22:01:04.3387569Z   File "test_datapipe.py", line 1560, in test_sampler_iterdatapipe
2022-08-19T22:01:04.3387841Z     sampled_dp = dp.iter.Sampler(input_dp_nolen)
2022-08-19T22:01:04.3388076Z AssertionError: AssertionError not raised
2022-08-19T22:01:04.3388221Z 
2022-08-19T22:01:04.3388315Z ======================================================================
2022-08-19T22:01:04.3388589Z FAIL [0.002s]: test_zip_iterdatapipe (__main__.TestFunctionalIterDataPipe)
2022-08-19T22:01:04.3388953Z ----------------------------------------------------------------------
2022-08-19T22:01:04.3389556Z TypeError: object of type 'IDP_NoLen' has no len(). This is likely becausethe length of the DataPipe (or its source DataPipes) cannot be known in advance.Consider manually setting the length with `dp.set_length(length)`.
2022-08-19T22:01:04.3389865Z 
2022-08-19T22:01:04.3390004Z During handling of the above exception, another exception occurred:
2022-08-19T22:01:04.3390171Z 
2022-08-19T22:01:04.3390267Z Traceback (most recent call last):
2022-08-19T22:01:04.3390503Z   File "test_datapipe.py", line 1653, in test_zip_iterdatapipe
2022-08-19T22:01:04.3390725Z     len(zipped_dp)
2022-08-19T22:01:04.3391349Z AssertionError: "instance doesn't have valid length$" does not match "object of type 'IDP_NoLen' has no len(). This is likely becausethe length of the DataPipe (or its source DataPipes) cannot be known in advance.Consider manually setting the length with `dp.set_length(length)`."
2022-08-19T22:01:04.3391707Z 

🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch (expand for instructions)

If your commit is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

NivekT added a commit that referenced this pull request Aug 19, 2022
ghstack-source-id: 8c5d8e7
Pull Request resolved: #83750
@NivekT NivekT added module: data torch.utils.data release notes: dataloader release notes category topic: improvements topic category labels Aug 19, 2022

namespace['reset'] = conditional_reset

if '__len__' in namespace:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to add this hook no matter __len__ is in the namespace or not. We can raise NotImplementedError if _length is None.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Note that isinstance(dp, collections.abc.Sized) will always be True now, because __len__ always exists as a method through the hook. Is that fine?

If that is not fine, we can consider somehow intercepting the len(dp) call but it may be complicated (if feasible at all).

Copy link
Contributor Author

@NivekT NivekT Aug 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this causes so test failures, so we will need to adjust some other logic, such as the following in "combining.py": if all(isinstance(dp, Sized) for dp in self.datapipes)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that is not fine, we can consider somehow intercepting the len(dp) call but it may be complicated (if feasible at all).

Based on my understanding, not really feasible as __len__ is implemented on the class rather than instsances.

Note that this causes so test failures, so we will need to adjust some other logic, such as the following in "combining.py": if all(isinstance(dp, Sized) for dp in self.datapipes)

Emmm, this is a good question. I personally don't know if there is any drawback of making all IterDataPipe Sized (raising TypeError if needed). Let's discuss with Vitaly about the decision? This seems like a BC-breaking change.

Copy link
Contributor Author

@NivekT NivekT Aug 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, let's have @VitalyFedyunin add his thoughts.

The benefit on implementing .set_length as a DataPipe instead is that it doesn't require changes to __len__ or Sized. In the absence of that, I do think a hook makes more sense.

This allows users to manually set length of an `IterDataPipe` with no other side effect. This is useful for DataPipes whose final length cannot be known in advance (e.g. `filter`). If you know the final length with certainty, you can manually set it for usages by DataLoader or other DataPipes.

Previously, users theoretically could use `.header(length)` to manually set length. However, the UX is suboptimal in that 1) the name is not obvious, and 2) warnings are raised unless the `HeaderDataPipe` has been fully traversed once to confirm the length.


[ghstack-poisoned]
This allows users to manually set length of an `IterDataPipe` with no other side effect. This is useful for DataPipes whose final length cannot be known in advance (e.g. `filter`). If you know the final length with certainty, you can manually set it for usages by DataLoader or other DataPipes.

Previously, users theoretically could use `.header(length)` to manually set length. However, the UX is suboptimal in that 1) the name is not obvious, and 2) warnings are raised unless the `HeaderDataPipe` has been fully traversed once to confirm the length.


[ghstack-poisoned]
This allows users to manually set length of an `IterDataPipe` with no other side effect. This is useful for DataPipes whose final length cannot be known in advance (e.g. `filter`). If you know the final length with certainty, you can manually set it for usages by DataLoader or other DataPipes.

Previously, users theoretically could use `.header(length)` to manually set length. However, the UX is suboptimal in that 1) the name is not obvious, and 2) warnings are raised unless the `HeaderDataPipe` has been fully traversed once to confirm the length.


[ghstack-poisoned]
NivekT added a commit that referenced this pull request Aug 19, 2022
ghstack-source-id: 8500f2b
Pull Request resolved: #83750

namespace['reset'] = conditional_reset

if '__len__' in namespace:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Note that isinstance(dp, collections.abc.Sized) will always be True now, because __len__ always exists as a method through the hook. Is that fine?

If that is not fine, we can consider somehow intercepting the len(dp) call but it may be complicated (if feasible at all).

if len_func is not None:
return len_func(*args, **kwargs)
else:
raise TypeError(f"object of type '{datapipe}' has no len()")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original exception is a TypeError but I can switch it to NotImplementedError if that is preferred.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TypeError is needed for running list(dp).

@NivekT
Copy link
Contributor Author

NivekT commented Aug 19, 2022

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@NivekT NivekT changed the title [DataPipe] Add .set_length method to IterDataPipe [DataPipe] Add .set_length method to IterDataPipe Aug 19, 2022
@NivekT NivekT requested a review from ejguan August 19, 2022 22:21
@NivekT
Copy link
Contributor Author

NivekT commented Aug 24, 2022

We will be landing meta-pytorch/data#747 instead of this PR.

@NivekT NivekT closed this Aug 24, 2022
@facebook-github-bot facebook-github-bot deleted the gh/nivekt/54/head branch September 24, 2022 14:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla signed module: data torch.utils.data release notes: dataloader release notes category topic: improvements topic category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants