🐛 Describe the bug
The S3 test is broken because the datasets in the public bucket, which we have no control over, have been updated.
Test case:
https://github.com/pytorch/data/blob/807db8f8c7282b2f48b48b1e07439c119a2ba12f/test/test_remote_io.py#L256-L291
Previously, we simply fixed the test by updating the expected number of files per bucket whenever the dataset changed. That is not a sustainable way to maintain CI. To fix it properly, we could choose one of the following solutions:
- Only validate that certain known files exist in the output, rather than checking the total file count per bucket
- Use mock to simulate the result
- Add our own stable bucket for testing
I prefer the first solution for two reasons:
- We want to test the functionality provided by `_torchdata.so`. Even though mocking the result of this extension would keep the test green, it wouldn't actually exercise the extension.
- The third option might work, but it would also expose our own bucket on GitHub, which is not ideal IMHO.
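As a rough sketch of the first option, the assertion could check for a few known-stable keys in the listing instead of an exact count. The file names below are placeholders for illustration, not the real keys in the public bucket:

```python
def contains_expected_keys(listed_urls, expected_suffixes):
    """Return True if every expected key suffix matches at least one listed URL.

    The total count is deliberately ignored, so files being added to or
    removed from the public bucket cannot break the assertion as long as
    the chosen stable keys remain present.
    """
    return all(
        any(url.endswith(suffix) for url in listed_urls)
        for suffix in expected_suffixes
    )


# Placeholder listing (not the actual bucket contents):
listing = [
    "s3://some-bucket/prefix/a.csv",
    "s3://some-bucket/prefix/b.csv",
    "s3://some-bucket/prefix/newly_added_file.csv",  # new uploads no longer fail the test
]
assert contains_expected_keys(listing, ["a.csv", "b.csv"])
```

The test would only need a small curated list of suffixes that are unlikely to be removed from the bucket, and the listing itself would still come from `_torchdata.so`, so the extension stays under test.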
Versions
main