add lazily filled dict for prototype datasets #5219
pmeier wants to merge 3 commits into pytorch:main from
Conversation
💊 CI failures summary and remediations: as of commit ff79a88, there are no failures yet. (This comment was automatically generated by Dr. CI.)
```python
image_files_map = dict(
    (image_id, rel_posix_path.rsplit("/", maxsplit=1)[1]) for image_id, rel_posix_path in image_files_dp
)
image_files_dp = Mapper(image_files_dp, self._2011_image_key, input_col=1)
```
What we can do here is an in_memory_cache over image_files_dp. Then, we could add an API to convert an IterDataPipe into a lazily loaded MapDataPipe to represent the LazyDict.
If we use LazyDict here, my concern is that image_files_map would be missing from the DataLoader graph.
cc: @VitalyFedyunin
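For illustration, the lazily loaded mapping discussed above could look like the sketch below. The name `LazyDict` comes from this thread, but the class body and the `load_fn` callback are assumptions for this sketch, not the torchvision/torchdata API:

```python
from collections.abc import Mapping


class LazyDict(Mapping):
    """Dict-like view that materializes its contents on first access.

    `load_fn` is a zero-argument callable returning an iterable of
    (key, value) pairs, e.g. iterating an IterDataPipe. Hypothetical
    sketch, not the actual torchvision implementation.
    """

    def __init__(self, load_fn):
        self._load_fn = load_fn
        self._data = None  # nothing is loaded at construction time

    def _materialize(self):
        if self._data is None:
            self._data = dict(self._load_fn())
        return self._data

    def __getitem__(self, key):
        return self._materialize()[key]

    def __iter__(self):
        return iter(self._materialize())

    def __len__(self):
        return len(self._materialize())


# Usage: the backing data is only read on the first lookup.
image_files_map = LazyDict(lambda: [("0001", "img_0001.jpg")])
print(image_files_map["0001"])  # -> img_0001.jpg
```

Because the data lives in a plain dict after the first access, this keeps lookups cheap, but, as noted above, such an object would sit outside the datapipe graph that the DataLoader can see.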
Besides, I think the DataLoader would complain about this datapipe graph in the second epoch: since image_files_dp is never used after the first epoch, Demux would also become non-serializable, same as the comment I made in the other PR.
So a fix in Demux is unavoidable.
This seems like a good thing to test in general. What should a test look like? Is something like

```python
for _ in dataset.cycle(2):
    pass
```

enough? If yes, my proposal passes this test.
I mean that if we put the dataset (datapipes) into a DataLoader, the second epoch of the DataLoader would break.
So something like

```python
data_loader = DataLoader2(dataset)
for epoch in range(2):
    for sample in data_loader:
        pass
```

?
Yeah. I have asked Kevin to fix this issue in demux.
My proposal still works. I've pushed the test I'm running against. There are multiple failures for other datasets, but cub200 is not one of them.
```diff
@@ -173,9 +177,8 @@ def _make_datapipe(
)
```
A second thought: could we simply filter image_files_dp from archive_dp here and create the image_files_map dictionary? Then, we can demux over archive_dp again and drop the data in image_files_dp.
So basically splitting image_files_dp off from the graph?
Yeah. Then, we can materialize the data from it like a meta-datapipe.
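As a sketch of that idea in plain Python (not the actual torchdata `demux` API; the predicate and pair-extraction callbacks are assumptions): partition the archive stream once, eagerly materialize the mapping branch into a dict, and keep only the remaining branch in the graph.

```python
def split_and_materialize(stream, is_image_file_entry, to_pair):
    """Partition `stream` once.

    Entries matching `is_image_file_entry` are materialized into a dict
    via `to_pair`; the remaining entries are returned as a list that
    would stay in the datapipe graph. Illustrative only: the real
    implementation would demux over archive_dp instead.
    """
    mapping, rest = {}, []
    for entry in stream:
        if is_image_file_entry(entry):
            key, value = to_pair(entry)
            mapping[key] = value
        else:
            rest.append(entry)
    return mapping, rest


# Usage with a toy archive stream of tagged entries.
entries = [
    ("img", "0001", "img_0001.jpg"),
    ("ann", "labels.txt"),
    ("img", "0002", "img_0002.jpg"),
]
image_files_map, remaining = split_and_materialize(
    entries,
    lambda e: e[0] == "img",
    lambda e: (e[1], e[2]),
)
```

The mapping branch is fully consumed up front, so nothing dangles unread after the first epoch, which is exactly the property that sidesteps the Demux serialization concern above.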
The agreed-upon idiom is that we can iterate before the full pipeline is built, but we have to exhaust it completely. See #6065 for fixes to
This is my proposed solution to #5187 (comment). Since we only need the mapping during iteration, we can also delay its instantiation until then. Thoughts?
cc @pmeier @bjuncek
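The "delay instantiation until iteration" idea from the description can be sketched independently of torchdata as an iterable that only builds the mapping when an epoch actually starts. Class and parameter names here are hypothetical illustrations, not the proposed implementation:

```python
class AnnotatedSamples:
    """Iterable joining samples with a mapping that is built lazily.

    `samples` is an iterable of (image_id, data) pairs; `map_source` is
    a zero-argument callable yielding (image_id, filename) pairs.
    Hypothetical sketch of delaying dict instantiation until iteration.
    """

    def __init__(self, samples, map_source):
        self._samples = samples
        self._map_source = map_source

    def __iter__(self):
        # The mapping is instantiated here, when iteration begins,
        # not when the pipeline is constructed.
        image_files_map = dict(self._map_source())
        for image_id, data in self._samples:
            yield image_files_map[image_id], data


# Usage: the map source is only consumed once iteration starts.
pipe = AnnotatedSamples(
    [("0001", "label_a"), ("0002", "label_b")],
    lambda: [("0001", "img_0001.jpg"), ("0002", "img_0002.jpg")],
)
```

Rebuilding the dict inside `__iter__` also means each epoch gets a fresh mapping, which matches the multi-epoch DataLoader scenario discussed above.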