
[author: jluntamazon] Adding more explicit HLO lowering control by exposing LoweringContext… #5431

Merged
JackCaoG merged 12 commits into pytorch:master from aws-kingrj:neuron_inference
Sep 14, 2023

Conversation

@aws-kingrj
Collaborator

… (and utilities) to python for Neuron

  • Currently needed for PyTorch Inference with Neuron
  • Has been tested in Neuron version of torch-xla for many releases
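The new bindings can be exercised roughly along these lines (a minimal sketch based on the test added in this PR; the module path `torch_xla._XLAC.lowering` and the `LoweringContext` method names are assumptions taken from the PR's test and may differ across releases):

```python
import torch
import torch_xla
import torch_xla.core.xla_model as xm

device = xm.xla_device()
a = torch.rand(10, device=device)
b = torch.rand(10, device=device)
xm.mark_step()

result = a + b

# Build a LoweringContext over the pending IR for `result`
# and inspect the HLO it lowers to.
ctx = torch_xla._XLAC.lowering.LoweringContext()
ctx.build([result])
print(ctx.hlo_text())  # human-readable HLO for the captured graph
```

This gives callers (such as Neuron's inference path) direct access to the lowered HLO without going through a full `mark_step` execution.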

@JackCaoG JackCaoG self-requested a review August 9, 2023 23:18
@aws-kingrj aws-kingrj changed the title Adding more explicit HLO lowering control by exposing LoweringContext… [author: jluntamazon] Adding more explicit HLO lowering control by exposing LoweringContext… Aug 10, 2023
Comment thread torch_xla/csrc/init_python_bindings.cpp Outdated
Comment thread torch_xla/csrc/init_python_bindings.cpp
Comment thread torch_xla/csrc/init_python_bindings.cpp Outdated
@JackCaoG
Collaborator

@jluntamazon

Comment thread torch_xla/csrc/init_python_bindings.cpp
@JackCaoG
Collaborator

We can merge it after CI is green. @wonjoolee95, can you cherry-pick this into 2.2 once merged?

Comment thread test/test_operations.py
@seanlatias
Collaborator

Hmmm... The test results in CI are different from what I observe locally. Need to figure out why.

@seanlatias
Collaborator

@JackCaoG @wonjoolee95 is there a way I can directly test in the CI environment?

@JackCaoG
Collaborator

Can you try running the whole test file (test_operations.py) instead of a single test? If I had to guess, another test that runs before this one (in the same process) is affecting the result.

@seanlatias
Collaborator

Yeah, I tried that locally but it still passed.

@JackCaoG
Collaborator

There are logs in https://github.com/pytorch/xla/actions/runs/6174499874/job/16761837847?pr=5431 under "Download and Run docker xxx" that show how to download the CI docker image, and the logs also show how the test is being run. You can pretty much replicate that process on your local machine.

@seanlatias
Collaborator

Cool, thanks.

std::vector<torch::lazy::Value> ir_values;
for (auto& xtensor : xtensors) {
-  torch::lazy::Value value = xtensor->CurrentIrValue();
+  torch::lazy::Value value = xtensor->GetIrValue();
Collaborator


Oh haha, I should have caught this when reviewing.

@seanlatias
Collaborator

I don't have access to the docker image. Now the error messages are all because of Check failed: HasValue(). Do you have any idea what is usually the cause of this?

@JackCaoG
Collaborator

The failure is

buffer with shape f32[10] on device CPU:0 is deleted

and log is

	tsl::CurrentStackTrace[abi:cxx11]()
	torch_xla::runtime::PjRtComputationClient::PjRtData::GetOpaqueHandle()
	torch_xla::LoweringContext::GetParameter(std::shared_ptr<torch::lazy::BackendData> const&)
	torch_xla::DeviceData::Lower(torch_xla::LoweringContext*) const

It suggests that it is trying to get the parameter, but the parameter buffer has already been deleted. This is a bit weird, because the test code is

    a = torch.rand(10, device=device)
    b = torch.rand(10, device=device)
    xm.mark_step()

    result = a + b

The only thing I can guess is that aliasing somehow got messed up. Can you try running with XLA_ENABLE_PARAM_ALIASING=0?

@seanlatias
Collaborator

Just to double check, you mean setting the env var when running this unit test?

@JackCaoG
Collaborator

Yeah, were you able to repro locally?

@seanlatias
Collaborator

No, no luck. I can still run the test successfully locally, with and without the env var.

@JackCaoG
Collaborator

+ echo 'Running in DynamicShape mode: /tmp/pytorch/xla/test/test_operations.py' --verbosity=2
Running in DynamicShape mode: /tmp/pytorch/xla/test/test_operations.py --verbosity=2
+ XLA_EXPERIMENTAL=nonzero:masked_select:masked_scatter
+ run_test /tmp/pytorch/xla/test/test_operations.py --verbosity=2
+ echo 'Running in PjRt runtime: /tmp/pytorch/xla/test/test_operations.py' --verbosity=2
Running in PjRt runtime: /tmp/pytorch/xla/test/test_operations.py --verbosity=2
++ command -v nvidia-smi
+ '[' -x '' ']'
+ PJRT_DEVICE=CPU
+ CPU_NUM_DEVICES=1
+ run_coverage /tmp/pytorch/xla/test/test_operations.py --verbosity=2
+ '[' 0 '!=' 0 ']'
+ python3 /tmp/pytorch/xla/test/test_operations.py --verbosity=2

Seems like it is the dynamic-shape run that is failing. Can you run with

XLA_EXPERIMENTAL=nonzero:masked_select:masked_scatter
PJRT_DEVICE=CPU
CPU_NUM_DEVICES=1
python3 /tmp/pytorch/xla/test/test_operations.py --verbosity=2

@seanlatias
Collaborator


It still works locally.

@JackCaoG
Collaborator

@wonjoolee95 Do you have time to take a look at this one?

@wonjoo-wj
Collaborator

Let me try to build this and PyTorch master locally and see if I can reproduce it, it should be quick (like 15 minutes or so). If not, we can just disable the test in the CI for now.

@seanlatias
Collaborator

Thanks @wonjoolee95

@wonjoo-wj
Collaborator

I can also see that running both the entire test_operations.py and this specific test succeed in my CPU VM:

# Running this one test only
(base) wonjoo@wonjoo-cpu:~/pytorch/xla$ python test/test_operations.py TestLoweringContext.test_api
----------------------------------------------------------------------
Ran 1 test in 0.127s

OK

# Running entire `test_operations.py`
(base) wonjoo@wonjoo-cpu:~/pytorch/xla$ python test/test_operations.py 
/home/wonjoo/pytorch/xla/test/test_operations.py:906: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at /home/wonjoo/pytorch/build/aten/src/ATen/core/TensorBody.h:489.)
  self.assertIsNone(a.grad)
----------------------------------------------------------------------
Ran 158 tests in 37.020s

OK (skipped=2)

Since we can both verify that this passes successfully in our dev envs and cannot reproduce the failure, let's just skip the test to keep the CI happy. @seanlatias, we can skip this with @unittest.skip, and let's leave a comment saying that this failure only happens in CI, with a link to this PR for context.
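The suggested skip can be sketched like this (a hypothetical stand-in for the real test in test/test_operations.py, whose body is elided here):

```python
import unittest

class TestLoweringContext(unittest.TestCase):

    # This test passes locally for multiple developers but fails in CI
    # for reasons we could not reproduce; skipped per the PR discussion.
    @unittest.skip("Fails only in CI; see pytorch/xla#5431 for context.")
    def test_api(self):
        pass  # real body lowers `a + b` through the new LoweringContext bindings

# Running the suite reports the test as skipped rather than failed.
suite = unittest.TestLoader().loadTestsFromTestCase(TestLoweringContext)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(f"skipped={len(result.skipped)}")
```

Using @unittest.skip (rather than deleting the test) keeps the coverage visible and easy to re-enable once the CI-only failure is understood.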

@seanlatias
Collaborator

@wonjoolee95 not sure what is happening with the GPU test.

@wonjoo-wj
Collaborator

Can you rebase with master one more time so it re-triggers the CI?

@seanlatias
Collaborator

It's still the same. It seems the CI cancels itself automatically when it reaches 4 hours. I also see similar results in other PRs.

@JackCaoG
Collaborator

It is a known issue: forked PRs can't use the remote cache. We can just merge.

@JackCaoG JackCaoG merged commit 41b38f5 into pytorch:master Sep 14, 2023
@seanlatias
Collaborator

Thanks @JackCaoG @wonjoolee95

wonjoo-wj pushed a commit that referenced this pull request Sep 14, 2023
…posing LoweringContext… (#5431)

* Adding more explicit HLO lowering control by exposing LoweringContext (and utilities) to python for Neuron

* fixing linter issues

* fixing spacing

* apply comments and fix compilation errors

* add test for new apis

* fix linter

* update test

* update test

* modify test

* reverse back to GetIrValue()

* update test inputs with random numbers

* skip unittest because it only fails in CI

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-3-186.us-west-2.compute.internal>
Co-authored-by: seanlatias <seanlatias@gmail.com>
wonjoo-wj added a commit that referenced this pull request Sep 15, 2023
…posing LoweringContext… (#5431) (#5580)

(same commit message as #5431 above)
Co-authored-by: aws-kingrj <78175353+aws-kingrj@users.noreply.github.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-3-186.us-west-2.compute.internal>
Co-authored-by: seanlatias <seanlatias@gmail.com>
will-cromar pushed a commit that referenced this pull request Sep 18, 2023
…posing LoweringContext… (#5431) (#5580)

(same commit message as #5431 above)
Co-authored-by: aws-kingrj <78175353+aws-kingrj@users.noreply.github.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-3-186.us-west-2.compute.internal>
Co-authored-by: seanlatias <seanlatias@gmail.com>
will-cromar added a commit that referenced this pull request Sep 19, 2023
* Handle dynamo function without input (#5565) (#5577)

* Make cpu tensor on XLA dynamo backend a warning instead of error (#5549) (#5576)

* [author: jluntamazon] Adding more explicit HLO lowering control by exposing LoweringContext… (#5431) (#5580)

(same commit message as #5431 above)

* fixing num_local_processes typo (#5573) (#5579)

Co-authored-by: aws-kingrj <78175353+aws-kingrj@users.noreply.github.com>

* Move where clear pending IR is called to avoid crash (#5552) (#5582)

* Move where clear pending IR is called to avoid crash

* fix CI

* fix CI and add some debugging messages

* Fix release branch and tag patterns for GitHub Actions (#5587) (#5590)

* Improve bernoulli rng-bit-generation memory footprint (#5581) (#5589)

* Allow downcasting RngUniform genenration for Bernoulli

Co-authored-by: Yeounoh Chung <yeounoh@google.com>

* Enable xla:gpu autocast for bfloat16 if not restricted (#5570) (#5591)

* Enable autocast for XLA:GPU

* linter fix

* XLA autocast test for GPU and TPU

* linter fix

* Ensure that xla autocast is properly enabled for GPU and does not crash when torch cuda is not available.

* linter fix

* Add tests

* Support bf16

* linter fix

* exclude unsupported test cases

* increase GPU test timeout to 300

Co-authored-by: Yeounoh Chung <yeounoh@google.com>

* Cherry-pick: Don't trigger CI build on release tag push (#5595)

Copy of #5594 on release branch

* formatting

---------

Co-authored-by: JackCaoG <59073027+JackCaoG@users.noreply.github.com>
Co-authored-by: Wonjoo Lee <wonjoo@google.com>
Co-authored-by: aws-kingrj <78175353+aws-kingrj@users.noreply.github.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-3-186.us-west-2.compute.internal>
Co-authored-by: seanlatias <seanlatias@gmail.com>
Co-authored-by: Manfei <41607353+ManfeiBai@users.noreply.github.com>
Co-authored-by: Yeounoh Chung <yeounoh@google.com>