Fixed datatype-related issues in DataCollatorForLanguageModeling #36457
Conversation
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the
tests/trainer/test_data_collator.py::TFDataCollatorIntegrationTest::test_all_mask_replacement and DataCollatorForLanguageModeling
e51086e to 00366e0
|
Hey @Rocketknight1! I saw the force push on the branch, but I'm not really sure what to make of it. Hope the PR works as expected, and that there aren't any issues.
Hi @capemox, the force push is caused by me rebasing with the "Update branch" tool in the GitHub interface, don't worry about it! I often do this with PRs before merging to fix any CI issues. You can
…CollatorIntegrationTest::test_all_mask_replacement`:

1. I got the error `RuntimeError: "bernoulli_tensor_cpu_p_" not implemented for 'Long'`. This is because `mask_replacement_prob=1` is an integer, and `torch.bernoulli` doesn't accept the resulting `torch.long` dtype. I fixed this by manually casting the probability arguments in the `__post_init__` function of `DataCollatorForLanguageModeling`.
2. I also got the error `tensorflow.python.framework.errors_impl.InvalidArgumentError: cannot compute Equal as input #1(zero-based) was expected to be a int64 tensor but is a int32 tensor [Op:Equal]` from the line `tf.reduce_all((batch["input_ids"] == inputs) | (batch["input_ids"] == tokenizer.mask_token_id))` in `test_data_collator.py`. This occurs because the type of the `inputs` variable is `tf.int32`. Solved this by manually casting it to `tf.int64` in the test, as the expected return type of `batch["input_ids"]` is `tf.int64`.
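The first fix can be illustrated with a small standalone sketch. This is not the actual collator code; the tensor shape and variable names are invented for the demo, and only the dtype behavior matters:

```python
import torch

# An integer probability such as mask_replacement_prob=1 yields a
# torch.long tensor, which torch.bernoulli rejects.
prob = 1
long_probs = torch.full((4,), prob)  # dtype: torch.int64
try:
    torch.bernoulli(long_probs)
except RuntimeError as err:
    print("fails:", err)

# Casting the probability to float first (the PR does this once, in
# __post_init__) gives bernoulli a supported floating-point dtype.
float_probs = torch.full((4,), float(prob))  # dtype: torch.float32
draws = torch.bernoulli(float_probs)
print(draws)  # every draw is 1.0, since p=1
```

Doing the cast once at construction time means every later call that builds probability tensors from these arguments sees a float, rather than scattering casts through the masking logic.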
00366e0 to 6d15651
Rocketknight1
left a comment
Sorry again for the confusion with the rebase, but now that the CI is passing this looks good, thank you!
Thanks a lot for the clarification @Rocketknight1! It feels great to be able to contribute! Not to be too selfish, but could you also take a look at this PR:
What does this PR do?
Fixes two issues regarding the test `tests/trainer/test_data_collator.py::TFDataCollatorIntegrationTest::test_all_mask_replacement`:

1. `RuntimeError: "bernoulli_tensor_cpu_p_" not implemented for 'Long'`. This is because `mask_replacement_prob=1` is an integer, and `torch.bernoulli` doesn't accept the resulting `torch.long` dtype. I fixed this by manually casting the probability arguments in the `__post_init__` function of `DataCollatorForLanguageModeling`.
2. `tensorflow.python.framework.errors_impl.InvalidArgumentError: cannot compute Equal as input #1(zero-based) was expected to be a int64 tensor but is a int32 tensor [Op:Equal]`, caused by the line `tf.reduce_all((batch["input_ids"] == inputs) | (batch["input_ids"] == tokenizer.mask_token_id))` in `test_data_collator.py`. This occurs because the type of the `inputs` variable is `tf.int32`. Solved this by manually casting it to `tf.int64` in the test, as the expected return type of `batch["input_ids"]` is `tf.int64`.

These changes were made on Python 3.12.8. The dependencies were installed with `pip install -e ".[dev]"` along with:

Motivation: I wanted to make some contributions to `DataCollatorForLanguageModeling`, but the tests were failing on the existing code itself, so I thought I'd fix these bugs before moving forward with that.

Fixes # (issue)
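The second issue amounts to comparing tensors of different integer widths. A minimal sketch of the mismatch and of the cast applied in the test (the token values here are made up; only the dtypes matter):

```python
import tensorflow as tf

# batch["input_ids"] comes back from the collator as tf.int64, while
# the hand-built inputs tensor in the test was tf.int32.
input_ids = tf.constant([101, 2003, 102], dtype=tf.int64)
inputs = tf.constant([101, 2003, 102], dtype=tf.int32)

try:
    # Mismatched integer dtypes make the == (tf.equal) op fail.
    tf.reduce_all(input_ids == inputs)
except tf.errors.InvalidArgumentError as err:
    print("fails:", err)

# The fix: cast inputs up to tf.int64 before comparing.
inputs = tf.cast(inputs, tf.int64)
print(tf.reduce_all(input_ids == inputs))
```

Unlike NumPy, TensorFlow does not implicitly promote between integer dtypes, so an explicit `tf.cast` on one side of the comparison is required.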
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@Rocketknight1 should be able to review this!