
Open XLA pin update #5675

Merged

qihqi merged 7 commits into master from hanq/pin_update on Oct 11, 2023
Conversation

@qihqi (Collaborator) commented Oct 4, 2023

No description provided.

@@ -1,19 +0,0 @@
upstream CI will fail without this
Collaborator:

Do you know why we were able to remove this patch? Is it because we updated the compiler in the CI?

Collaborator:

I think we need to kick off an upstream CI build targeting this branch and see whether CI passes.

Collaborator Author:

Yeah, it turns out I do still need those patches; otherwise the training job hangs.

@@ -1,14 +0,0 @@
diff --git a/xla/service/gpu/gpu_executable.cc b/xla/service/gpu/gpu_executable.cc
Collaborator:

Same question as above

Comment thread: WORKSPACE (Outdated)

        "//openxla_patches:gpu_build_file.diff",
    ],
-   strip_prefix = "xla-97a5f819faf9ff793b7ba68ff1f31f74f9459c18",
+   strip_prefix = "xla-7a19856d74569fd1f765cd03bdee84e3b1fdc579",
Collaborator:

Can you also update the libtpu dependency in setup.py to the same date as this commit?

Collaborator Author:

done

@qihqi (Collaborator, Author) commented Oct 5, 2023

Tested on a v4-8 with the command:

LD_LIBRARY_PATH=/home/hanq/miniconda3/envs/py310/lib python3 test/test_train_mp_imagenet.py \
    --model=resnet50 --fake_data --num_epochs=10 --log_steps=300 \
    --profile --use_optimized_kwargs=tpuv4 --drop_last

Result:

Old:
| Training Device=xla:0/3 Epoch=1 Step=1800 Loss=0.00135 Rate=1833.71 GlobalRate=918.89 Time=17:20:14
| Training Device=xla:0/1 Epoch=1 Step=2100 Loss=0.00135 Rate=1843.82 GlobalRate=986.79 Time=17:20:35
| Training Device=xla:0/3 Epoch=1 Step=2100 Loss=0.00135 Rate=1843.82 GlobalRate=990.06 Time=17:20:35
| Training Device=xla:0/0 Epoch=1 Step=2100 Loss=0.00135 Rate=1843.81 GlobalRate=982.20 Time=17:20:35
| Training Device=xla:0/2 Epoch=1 Step=2100 Loss=0.00135 Rate=1843.79 GlobalRate=989.61 Time=17:20:35

===
New:
| Training Device=xla:0/3 Epoch=1 Step=1500 Loss=0.00138 Rate=1803.73 GlobalRate=822.80 Time=18:09:52
| Training Device=xla:0/2 Epoch=1 Step=1500 Loss=0.00138 Rate=1803.72 GlobalRate=821.27 Time=18:09:52
| Training Device=xla:0/1 Epoch=1 Step=1800 Loss=0.00135 Rate=1828.62 GlobalRate=911.50 Time=18:10:12
| Training Device=xla:0/3 Epoch=1 Step=1800 Loss=0.00135 Rate=1828.62 GlobalRate=906.47 Time=18:10:12
| Training Device=xla:0/0 Epoch=1 Step=1800 Loss=0.00135 Rate=1828.62 GlobalRate=910.19 Time=18:10:12
| Training Device=xla:0/2 Epoch=1 Step=1800 Loss=0.00135 Rate=1828.63 GlobalRate=904.92 Time=18:10:12
| Training Device=xla:0/3 Epoch=1 Step=2100 Loss=0.00135 Rate=1837.96 GlobalRate=977.43 Time=18:10:33
| Training Device=xla:0/0 Epoch=1 Step=2100 Loss=0.00135 Rate=1837.97 GlobalRate=981.14 Time=18:10:33
| Training Device=xla:0/2 Epoch=1 Step=2100 Loss=0.00135 Rate=1837.97 GlobalRate=975.89 Time=18:10:33
| Training Device=xla:0/1 Epoch=1 Step=2100 Loss=0.00135 Rate=1837.96 GlobalRate=982.45 Time=18:10:33
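A quick way to read the two logs side by side is to compare the mean per-device GlobalRate at Step=2100 (the values below are copied directly from the logs above; the script itself is just an illustration):

```python
# GlobalRate (examples/s) at Epoch=1 Step=2100, one value per device,
# taken from the "Old" and "New" log excerpts above.
old_rates = [986.79, 990.06, 982.20, 989.61]  # before the pin update
new_rates = [977.43, 981.14, 975.89, 982.45]  # after the pin update

old_mean = sum(old_rates) / len(old_rates)
new_mean = sum(new_rates) / len(new_rates)
pct_change = 100 * (new_mean - old_mean) / old_mean

print(f"old={old_mean:.2f} new={new_mean:.2f} change={pct_change:.2f}%")
```

The difference works out to roughly -0.8%, which is within typical run-to-run noise for this benchmark.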

Comment thread: openxla_patches/gpu_build_file.diff (Outdated)
"@tsl//tsl/platform:casts",
"@tsl//tsl/platform:errors",
- ] + if_cuda([
+ ] + if_cuda_or_rocm([
@ManfeiBai (Collaborator) commented Oct 5, 2023

Thanks!

This patch looks like it corresponds to openxla/xla@9938bdb, so I'm curious why the change to load("//xla/stream_executor:build_defs.bzl", "if_cuda_or_rocm", "if_gpu_is_configured") was skipped?

Since the GPU CI failed with the same issue (RuntimeError: torch_xla/csrc/device.cpp:72 : Invalid device specification: CUDA:0), are they related too?

Collaborator Author:

No particular reason.

I started importing on Oct 3, and this change landed on Oct 4.
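For readers unfamiliar with these Bazel macros, here is a toy Python model of what the if_cuda to if_cuda_or_rocm change in the patch does to the dependency list. This is an illustration only, not Bazel's actual Starlark implementation, and the dependency label is a made-up placeholder:

```python
def if_cuda(deps, cuda_configured=False):
    """Toy stand-in for Bazel's if_cuda: keep deps only when CUDA is configured."""
    return deps if cuda_configured else []

def if_cuda_or_rocm(deps, cuda_configured=False, rocm_configured=False):
    """Toy stand-in for if_cuda_or_rocm: keep deps for either GPU backend."""
    return deps if (cuda_configured or rocm_configured) else []

gpu_deps = ["example_gpu_dep"]  # placeholder, not a real Bazel label

# On a ROCm-only build, if_cuda drops the deps but if_cuda_or_rocm keeps them,
# which is the point of the one-line change in the patch.
print(if_cuda(gpu_deps, cuda_configured=False))
print(if_cuda_or_rocm(gpu_deps, rocm_configured=True))
```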

@qihqi qihqi force-pushed the hanq/pin_update branch 5 times, most recently from 6c59c2c to 3f57cd1 Compare October 6, 2023 02:46
@alanwaketan alanwaketan self-requested a review October 9, 2023 17:52
@qihqi qihqi force-pushed the hanq/pin_update branch 2 times, most recently from b97aa10 to 2dc72ab Compare October 10, 2023 20:24
Comment thread: WORKSPACE

        "//openxla_patches:gpu_topk_rewriter.diff",
    ],
-   strip_prefix = "xla-97a5f819faf9ff793b7ba68ff1f31f74f9459c18",
+   strip_prefix = "xla-51b59cfb1999c6f1b3ec59851675044b2c502aae",
Collaborator:

Thanks for moving the head to this commit!
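As a sketch of how such a pin hangs together: the strip_prefix and the archive URL are both derived from the pinned commit hash. The URL pattern below assumes the standard GitHub source-archive convention and is not quoted from the PR itself:

```python
# The OpenXLA pin in WORKSPACE references one specific commit; the archive's
# top-level directory is "xla-<hash>", which is what strip_prefix removes.
pinned_commit = "51b59cfb1999c6f1b3ec59851675044b2c502aae"  # from this PR

strip_prefix = f"xla-{pinned_commit}"
# Assumed GitHub archive URL convention:
archive_url = f"https://github.com/openxla/xla/archive/{pinned_commit}.tar.gz"

print(strip_prefix)
```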

Comment thread: setup.py (Outdated)

    base_dir = os.path.dirname(os.path.abspath(__file__))

-   _libtpu_version = '0.1.dev20230825'
+   _libtpu_version = '0.1.dev20231009'
Collaborator:

I suspect this should be 0.1.dev20231010 in order to include the OpenXLA commit you specified.

Collaborator Author:

done.
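The libtpu nightly version string embeds its build date, which is why the reviewer asked for the date to match the pin commit. A small sketch of that relationship, assuming the 0.1.devYYYYMMDD version format used in the diff above:

```python
from datetime import datetime

def libtpu_build_date(version: str) -> datetime:
    # Nightly versions follow the 0.1.devYYYYMMDD pattern (assumed here
    # from the values shown in the setup.py diff).
    return datetime.strptime(version.split("dev", 1)[1], "%Y%m%d")

# The final pin commit landed Oct 10, 2023, so the matching nightly is:
assert libtpu_build_date("0.1.dev20231010") == datetime(2023, 10, 10)
# The value first proposed in the diff lagged the pin by a day:
assert libtpu_build_date("0.1.dev20231009") < datetime(2023, 10, 10)
```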

@alanwaketan (Collaborator) left a comment:

LGTM. Let me enable TPU CI and wait until it finishes.

@qihqi qihqi merged commit 418c751 into master Oct 11, 2023
zpcore pushed a commit that referenced this pull request Oct 19, 2023
Open XLA pin update - updated to 20231010
ghpvnist pushed a commit to ghpvnist/pytorch-xla that referenced this pull request Oct 31, 2023
Open XLA pin update - updated to 20231010
mbzomowski pushed a commit to mbzomowski-test-org/xla that referenced this pull request Nov 16, 2023
Open XLA pin update - updated to 20231010
chunnienc pushed a commit to chunnienc/xla that referenced this pull request Dec 14, 2023
Open XLA pin update - updated to 20231010
golechwierowicz pushed a commit that referenced this pull request Jan 12, 2024
Open XLA pin update - updated to 20231010
bhavya01 pushed a commit that referenced this pull request Apr 22, 2024
Open XLA pin update - updated to 20231010
@qihqi qihqi deleted the hanq/pin_update branch April 29, 2024 21:18

Labels: none yet
Projects: none yet
5 participants