[release-test] Disable drop_last flag to fix division by zero in torch dataloader baselines #56395
Conversation
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Code Review
This pull request effectively addresses the `ZeroDivisionError` by setting `drop_last=False` for the PyTorch `DataLoader`. This prevents the validation dataloader from becoming empty when the dataset size isn't a multiple of the batch size distributed across workers. The addition of an assert in `runner.py` is also a good defensive measure to ensure the validation set is not empty.

My main feedback concerns a subtle issue that this change exposes: the calculation of `num_rows` for the validation loss now becomes inaccurate for the last batch. I've left a specific comment on how to address this. Overall, this is a good fix for the reported bug.
```diff
  self._metrics["validation/rows_processed"].add(
      self.benchmark_config.dataloader_config.validation_batch_size
  )
+ assert num_rows > 0, "Validation dataset yielded no batches."
```
This assertion is a good safeguard against the `ZeroDivisionError`. However, the change to `drop_last=False` in `torch_dataloader_factory.py` surfaces a subtle bug in the calculation of `num_rows`. On line 200, `num_rows` is incremented by the fixed `validation_batch_size`. With `drop_last=False`, the last batch may be partial, causing `num_rows` to be overestimated and the calculated `validation/loss` to be incorrect. To fix this, you should use the actual size of the batch. For example, you could change line 200 to `num_rows += batch[0].size(0)` (assuming `batch` is a tuple where the first element is the input tensor). A similar issue, already noted with a TODO, exists in `_train_epoch`.
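A minimal, runnable sketch of the suggested counting, assuming the collated batch is an `(inputs, labels)` tuple of tensors; the dataset, row count, and batch size here are hypothetical stand-ins, not the benchmark's actual values:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for the validation dataloader (20 rows, batch size 8).
val_dataloader = DataLoader(TensorDataset(torch.zeros(20, 3)), batch_size=8)

num_rows = 0
for batch in val_dataloader:
    # Count the rows actually present: with drop_last=False the final
    # batch here has 4 rows, not the configured batch size of 8.
    num_rows += batch[0].size(0)

print(num_rows)  # 20, whereas counting 8 per batch would give 24
```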
Yeah, but I have it as a TODO already in the code. It's a bit hard to get the real size of the batch (it's usually `batch_size` except for the last one) because the collated format is specific to the task right now (image vs. torchrec training).
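To illustrate the point, a hypothetical helper showing the per-task branching this would require; the batch formats below are assumptions based on the tasks mentioned, not the benchmark's actual collate output:

```python
import torch

def infer_num_rows(batch) -> int:
    """Best-effort row count for a collated batch; format varies by task."""
    # Image training: assumed to collate into an (inputs, labels) tuple.
    if isinstance(batch, (tuple, list)):
        return batch[0].size(0)
    # Torchrec-style training: assumed to collate into a dict of tensors,
    # where picking the "row" dimension is less obvious.
    if isinstance(batch, dict):
        return next(iter(batch.values())).size(0)
    raise TypeError(f"Unknown collated batch format: {type(batch)}")

print(infer_num_rows((torch.zeros(17, 3), torch.zeros(17))))  # 17
```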
```diff
  prefetch_factor=prefetch_factor,
  timeout=timeout,
- drop_last=True,
+ drop_last=False,
```
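For context, a minimal sketch of the kind of `DataLoader` construction being changed; the dataset and keyword values are illustrative, not the benchmark's actual configuration:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

val_dataset = TensorDataset(torch.zeros(100, 3))  # placeholder validation shard
val_loader = DataLoader(
    val_dataset,
    batch_size=32,
    num_workers=2,
    prefetch_factor=2,
    timeout=60,
    drop_last=False,  # keep the final partial batch instead of discarding it
)
```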
Do we need to change this elsewhere in train_tests/benchmark?
Should just be here as a torch dataloader-specific change.
Summary
#56343 refactored some code for torch dataloader creation but introduced a bug when it came to the validation dataset throughput calculation.
This happened because `drop_last=True` became the default setting, which would cause the validation dataset to be empty, since it's small enough and spread across enough workers that we couldn't form a single full batch. This PR fixes the issue by setting `drop_last=False`.
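A minimal, runnable illustration of the failure mode; the shard and batch sizes are hypothetical, chosen so one worker's validation split is smaller than a full batch:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# One worker's validation shard: 17 rows, batch size 32.
shard = TensorDataset(torch.zeros(17, 3))

loader = DataLoader(shard, batch_size=32, drop_last=True)
print(len(loader))  # 0 -> num_rows stays 0 -> loss / num_rows divides by zero

loader = DataLoader(shard, batch_size=32, drop_last=False)
print(len(loader))  # 1 -> the partial batch of 17 rows is kept
```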