
Open XLA pin update #5675

Merged

qihqi merged 7 commits into master from hanq/pin_update on Oct 11, 2023
Conversation

@qihqi (Collaborator) commented Oct 4, 2023

No description provided.

@@ -1,19 +0,0 @@
upstream CI will fail without this
Collaborator:

Do you know why we were able to remove this patch? Is it because we updated the compiler in the CI?

Collaborator:

I think we need to kick off an upstream CI build targeting this branch and see whether CI passes.

Collaborator Author:

Yeah, it turns out I do still need those patches; otherwise the training job hangs.

@@ -1,14 +0,0 @@
diff --git a/xla/service/gpu/gpu_executable.cc b/xla/service/gpu/gpu_executable.cc
Collaborator:

Same question as above

Comment thread: WORKSPACE (Outdated)

        "//openxla_patches:gpu_build_file.diff",
    ],
-   strip_prefix = "xla-97a5f819faf9ff793b7ba68ff1f31f74f9459c18",
+   strip_prefix = "xla-7a19856d74569fd1f765cd03bdee84e3b1fdc579",
Collaborator:

Can you also update the libtpu dependency in setup.py to the same date as this commit?

Collaborator Author:

done

@qihqi (Collaborator, Author) commented Oct 5, 2023

Tested on a v4-8 with the command:

LD_LIBRARY_PATH=/home/hanq/miniconda3/envs/py310/lib python3 test/test_train_mp_imagenet.py \
    --model=resnet50 --fake_data --num_epochs=10 --log_steps=300 \
    --profile --use_optimized_kwargs=tpuv4 --drop_last

Result:

Old:
| Training Device=xla:0/3 Epoch=1 Step=1800 Loss=0.00135 Rate=1833.71 GlobalRate=918.89 Time=17:20:14
| Training Device=xla:0/1 Epoch=1 Step=2100 Loss=0.00135 Rate=1843.82 GlobalRate=986.79 Time=17:20:35
| Training Device=xla:0/3 Epoch=1 Step=2100 Loss=0.00135 Rate=1843.82 GlobalRate=990.06 Time=17:20:35
| Training Device=xla:0/0 Epoch=1 Step=2100 Loss=0.00135 Rate=1843.81 GlobalRate=982.20 Time=17:20:35
| Training Device=xla:0/2 Epoch=1 Step=2100 Loss=0.00135 Rate=1843.79 GlobalRate=989.61 Time=17:20:35

===
New:
| Training Device=xla:0/3 Epoch=1 Step=1500 Loss=0.00138 Rate=1803.73 GlobalRate=822.80 Time=18:09:52
| Training Device=xla:0/2 Epoch=1 Step=1500 Loss=0.00138 Rate=1803.72 GlobalRate=821.27 Time=18:09:52
| Training Device=xla:0/1 Epoch=1 Step=1800 Loss=0.00135 Rate=1828.62 GlobalRate=911.50 Time=18:10:12
| Training Device=xla:0/3 Epoch=1 Step=1800 Loss=0.00135 Rate=1828.62 GlobalRate=906.47 Time=18:10:12
| Training Device=xla:0/0 Epoch=1 Step=1800 Loss=0.00135 Rate=1828.62 GlobalRate=910.19 Time=18:10:12
| Training Device=xla:0/2 Epoch=1 Step=1800 Loss=0.00135 Rate=1828.63 GlobalRate=904.92 Time=18:10:12
| Training Device=xla:0/3 Epoch=1 Step=2100 Loss=0.00135 Rate=1837.96 GlobalRate=977.43 Time=18:10:33
| Training Device=xla:0/0 Epoch=1 Step=2100 Loss=0.00135 Rate=1837.97 GlobalRate=981.14 Time=18:10:33
| Training Device=xla:0/2 Epoch=1 Step=2100 Loss=0.00135 Rate=1837.97 GlobalRate=975.89 Time=18:10:33
| Training Device=xla:0/1 Epoch=1 Step=2100 Loss=0.00135 Rate=1837.96 GlobalRate=982.45 Time=18:10:33
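A quick way to read the two logs side by side is to compare the mean per-device GlobalRate at Step=2100 (the values below are copied directly from the logs above; the script itself is just an illustration):

```python
# GlobalRate (examples/s) at Epoch=1 Step=2100, one value per device,
# taken from the "Old" and "New" log excerpts above.
old_rates = [986.79, 990.06, 982.20, 989.61]  # before the pin update
new_rates = [977.43, 981.14, 975.89, 982.45]  # after the pin update

old_mean = sum(old_rates) / len(old_rates)
new_mean = sum(new_rates) / len(new_rates)
pct_change = 100 * (new_mean - old_mean) / old_mean

print(f"old={old_mean:.2f} new={new_mean:.2f} change={pct_change:.2f}%")
```

The difference works out to roughly -0.8%, which is within typical run-to-run noise for this benchmark.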

Comment thread: openxla_patches/gpu_build_file.diff (Outdated)
"@tsl//tsl/platform:casts",
"@tsl//tsl/platform:errors",
- ] + if_cuda([
+ ] + if_cuda_or_rocm([
@ManfeiBai (Collaborator) commented Oct 5, 2023

Thanks!

This patch looks like it corresponds to openxla/xla@9938bdb, so I'm curious why the change to load("//xla/stream_executor:build_defs.bzl", "if_cuda_or_rocm", "if_gpu_is_configured") was skipped?

Since the GPU CI failed with the same issue (RuntimeError: torch_xla/csrc/device.cpp:72 : Invalid device specification: CUDA:0), are they related too?

Collaborator Author:

No particular reason.

I started importing on Oct 3, and this change landed on Oct 4.
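For readers unfamiliar with these Bazel macros, here is a toy Python model of what the if_cuda to if_cuda_or_rocm change in the patch does to the dependency list. This is an illustration only, not Bazel's actual Starlark implementation, and the dependency label is a made-up placeholder:

```python
def if_cuda(deps, cuda_configured=False):
    """Toy stand-in for Bazel's if_cuda: keep deps only when CUDA is configured."""
    return deps if cuda_configured else []

def if_cuda_or_rocm(deps, cuda_configured=False, rocm_configured=False):
    """Toy stand-in for if_cuda_or_rocm: keep deps for either GPU backend."""
    return deps if (cuda_configured or rocm_configured) else []

gpu_deps = ["example_gpu_dep"]  # placeholder, not a real Bazel label

# On a ROCm-only build, if_cuda drops the deps but if_cuda_or_rocm keeps them,
# which is the point of the one-line change in the patch.
print(if_cuda(gpu_deps, cuda_configured=False))
print(if_cuda_or_rocm(gpu_deps, rocm_configured=True))
```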

@qihqi qihqi force-pushed the hanq/pin_update branch 5 times, most recently from 6c59c2c to 3f57cd1 Compare October 6, 2023 02:46
@alanwaketan alanwaketan self-requested a review October 9, 2023 17:52
@qihqi qihqi force-pushed the hanq/pin_update branch 2 times, most recently from b97aa10 to 2dc72ab Compare October 10, 2023 20:24
Comment thread: WORKSPACE

        "//openxla_patches:gpu_topk_rewriter.diff",
    ],
-   strip_prefix = "xla-97a5f819faf9ff793b7ba68ff1f31f74f9459c18",
+   strip_prefix = "xla-51b59cfb1999c6f1b3ec59851675044b2c502aae",
Collaborator:

Thanks for moving the head to this commit!
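As a sketch of how such a pin hangs together: the strip_prefix and the archive URL are both derived from the pinned commit hash. The URL pattern below assumes the standard GitHub source-archive convention and is not quoted from the PR itself:

```python
# The OpenXLA pin in WORKSPACE references one specific commit; the archive's
# top-level directory is "xla-<hash>", which is what strip_prefix removes.
pinned_commit = "51b59cfb1999c6f1b3ec59851675044b2c502aae"  # from this PR

strip_prefix = f"xla-{pinned_commit}"
# Assumed GitHub archive URL convention:
archive_url = f"https://github.com/openxla/xla/archive/{pinned_commit}.tar.gz"

print(strip_prefix)
```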

Comment thread: setup.py (Outdated)

    base_dir = os.path.dirname(os.path.abspath(__file__))

-   _libtpu_version = '0.1.dev20230825'
+   _libtpu_version = '0.1.dev20231009'
Collaborator:

I suspect this should be 0.1.dev20231010 in order to include the OpenXLA commit you specified.

Collaborator Author:

done.
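The libtpu nightly version string embeds its build date, which is why the reviewer asked for the date to match the pin commit. A small sketch of that relationship, assuming the 0.1.devYYYYMMDD version format used in the diff above:

```python
from datetime import datetime

def libtpu_build_date(version: str) -> datetime:
    # Nightly versions follow the 0.1.devYYYYMMDD pattern (assumed here
    # from the values shown in the setup.py diff).
    return datetime.strptime(version.split("dev", 1)[1], "%Y%m%d")

# The final pin commit landed Oct 10, 2023, so the matching nightly is:
assert libtpu_build_date("0.1.dev20231010") == datetime(2023, 10, 10)
# The value first proposed in the diff lagged the pin by a day:
assert libtpu_build_date("0.1.dev20231009") < datetime(2023, 10, 10)
```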

@alanwaketan (Collaborator) left a comment:

LGTM. Let me enable TPU CI and wait until it finishes.

@qihqi qihqi merged commit 418c751 into master Oct 11, 2023
zpcore pushed a commit that referenced this pull request Oct 19, 2023
Open XLA pin update - updated to 20231010
ghpvnist pushed a commit to ghpvnist/pytorch-xla that referenced this pull request Oct 31, 2023
Open XLA pin update - updated to 20231010
mbzomowski pushed a commit to mbzomowski-test-org/xla that referenced this pull request Nov 16, 2023
Open XLA pin update - updated to 20231010
chunnienc pushed a commit to chunnienc/xla that referenced this pull request Dec 14, 2023
Open XLA pin update - updated to 20231010
golechwierowicz pushed a commit that referenced this pull request Jan 12, 2024
Open XLA pin update - updated to 20231010
bhavya01 pushed a commit that referenced this pull request Apr 22, 2024
Open XLA pin update - updated to 20231010
@qihqi qihqi deleted the hanq/pin_update branch April 29, 2024 21:18

Labels: none yet
Projects: none yet
5 participants