Add support for PCAM dataset #5203
def __len__(self) -> int:
    images_file = self._FILES[self._split]["images"][0]
    with self.h5py.File(self._base_folder / images_file) as images_data:
Note for here and below: opening a File does not load its data into memory, so the operation is cheap and fast.
Similarly, accessing a single row below does not load the entire file, only the section that holds that row.
I guess we could open the files in __init__ and keep the handles, but I'm not sure it would be any faster, and we might never be able to close the handles properly.
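To illustrate the point about cheap opens and partial reads, here is a minimal sketch, not the code from this PR. It assumes `h5py` and NumPy are installed; the file name, the dataset key `"x"`, and the helpers `num_samples` and `load_sample` are made up for the example.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file and dataset key, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "example_x.h5")
with h5py.File(path, "w") as f:
    data = np.arange(4 * 2 * 2 * 3, dtype=np.uint8).reshape(4, 2, 2, 3)
    f.create_dataset("x", data=data)


def num_samples(path: str) -> int:
    # Opening the file and reading .shape only touches metadata,
    # not the stored array contents.
    with h5py.File(path, "r") as f:
        return f["x"].shape[0]


def load_sample(path: str, idx: int) -> np.ndarray:
    # Indexing the dataset reads just the requested slice from disk
    # and returns it as a NumPy array.
    with h5py.File(path, "r") as f:
        return f["x"][idx]
```

Opening the file inside each call, as above, keeps every handle scoped to a `with` block, which sidesteps the question of when a long-lived handle would be closed.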
_FILES = {
    "train": {
        "images": (
            "camelyonpatch_level_2_split_train_x.h5",  # Data file name
            "1Ka0XfEMiwgCYPdTI-vv6eUElOBnKFKQ2",  # Google Drive ID
            "1571f514728f59376b705fc836ff4b63",  # md5 hash
        ),
I'm not ecstatic about this big dict, but I needed everything in the same place to support per-split download logic (i.e. only download the test data if we need neither train nor val).
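A rough sketch of how a single dict can drive per-split downloads, under stated assumptions: the `FILES` registry below only mirrors the shape of the dict under review, all of its values are placeholders, the `"targets"` entries are assumed rather than taken from the diff, and `files_for_split` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

FileInfo = Tuple[str, str, str]  # (file name, Google Drive ID, md5 hash)

# Placeholder registry: split -> kind -> file info. Not the real values.
FILES: Dict[str, Dict[str, FileInfo]] = {
    "train": {
        "images": ("train_x.h5", "drive-id-train-x", "md5-train-x"),
        "targets": ("train_y.h5", "drive-id-train-y", "md5-train-y"),
    },
    "val": {
        "images": ("val_x.h5", "drive-id-val-x", "md5-val-x"),
        "targets": ("val_y.h5", "drive-id-val-y", "md5-val-y"),
    },
    "test": {
        "images": ("test_x.h5", "drive-id-test-x", "md5-test-x"),
        "targets": ("test_y.h5", "drive-id-test-y", "md5-test-y"),
    },
}


def files_for_split(split: str) -> List[FileInfo]:
    # Return only the files the requested split needs, so that asking
    # for "test" never triggers a download of the train or val files.
    if split not in FILES:
        raise ValueError(f"Unknown split {split!r}, expected one of {sorted(FILES)}")
    return list(FILES[split].values())
```

A download routine could then loop over `files_for_split(split)` and fetch and verify each entry, leaving the other splits untouched.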
Thanks @NicolasHug! I have a few minor nits inline. Otherwise LGTM when CI is green.
download (bool, optional): If True, downloads the dataset from the internet and puts it into
    ``root/oxford-iiit-pet``. If dataset is already downloaded, it is not downloaded again.
Co-authored-by: Philip Meier <github.pmeier@posteo.de>
Summary:
* Add support for PCAM dataset
* mypy
* Apply suggestions from code review
* Remove classes and class_to_idx attributes
* Use _decompress

Reviewed By: datumbox, NicolasHug
Differential Revision: D33655258
fbshipit-source-id: a38e55340ab3c364969160f3c186d1a130bdc371
Co-authored-by: Philip Meier <github.pmeier@posteo.de>
Towards #5108
This PR adds support for the PCAM dataset.
cc @pmeier