Conversation
Can you update the libtpu version too? (Line 62 in 29a8bf2)

Yep, we'll update that.
@wonjoolee95 Can you rebase this PR?
This reverts commit 68a61d1.
….cc after rebasing from master
PJRT_DEVICE=GPU python test/test_train_mp_imagenet_fsdp.py --fake_data --auto_wrap_policy type_based --use_small_fake_sample --num_epochs=1
XLA_DISABLE_FUNCTIONALIZATION=1 PJRT_DEVICE=GPU python test/test_train_mp_imagenet_fsdp.py --fake_data --use_nested_fsdp --use_small_fake_sample --num_epochs=1
# This test fails on GPU with 03/30 TF-pin update (https://github.com/pytorch/xla/pull/4840)
# XLA_DISABLE_FUNCTIONALIZATION=1 PJRT_DEVICE=GPU python test/test_train_mp_imagenet_fsdp.py --fake_data --use_nested_fsdp --use_small_fake_sample --num_epochs=1
TODO: reword the comment to say this is only happening on CI; can't repro on a GPU VM. (Do that in a separate PR.)
Also, do we really need this test? It seems like the only difference is disabling functionalization. @alanwaketan
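For context, the two test invocations above differ only in whether `XLA_DISABLE_FUNCTIONALIZATION=1` is set. A minimal sketch of how an environment flag like this gates a code path per invocation (the `check` helper and the one-liner are illustrative, not torch_xla code):

```shell
# Illustrative only: read the flag the way a library might, and report which
# code path would be taken. The real effect in torch_xla is internal.
check() {
  python3 -c 'import os; print("functionalization disabled" if os.environ.get("XLA_DISABLE_FUNCTIONALIZATION") == "1" else "functionalization enabled")'
}

check                                   # default path
XLA_DISABLE_FUNCTIONALIZATION=1 check   # flag set for this invocation only
```

Setting the variable inline, as in the CI command, scopes it to a single invocation, so the two test runs exercise both paths without touching the ambient environment.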
  "This may indicate a bug on the caller side. (b/274683676)";
}
- return rc_keep_alive;
+ return std::make_optional<tsl::core::RefCountPtr<Rendezvous>>(std::move(rc_keep_alive));
This should be fixed in TF, right? Then we can remove this patch in the next release.
Yes, and hopefully we can also remove the Triton patch in the next TF update.
};
- return Match(sort->operand(1), match_iota(data->shape().dimensions())) ||
-        Match(sort->operand(1), m::Broadcast(match_iota(sort_dims)));
+ return Match(sort->operand(1), match_iota(absl::Span<const int64_t>(data->shape().dimensions().begin(), data->shape().dimensions().size()))) ||
Ditto, should be fixed in TF.
// TODO(yeounoh) currently only support single-slice execution
execute_options.multi_slice_config = nullptr;
// Required as of cl/518733871
Do not include the CL number in an open PR; use the PR number on public GitHub instead. (Fix in the next PR.)
execute_options.untuple_result = options.explode_tuple;
execute_options.strict_shape_checking = false;
// Required as of cl/518733871
JackCaoG left a comment:
Mostly LGTM; you can fix the comments in a separate PR.
Thanks for the comments, @JackCaoG. Adding the log for the failing GPU tests (
Tests on GPU
Thanks for the review and comments, Jack. I'll also work with the next TF-pin rotation to make sure we try to remove some of these patches.
Update tf pin to 03/2023