
[author: jluntamazon] Adding more explicit HLO lowering control by exposing LoweringContext… #5431

Merged
JackCaoG merged 12 commits into pytorch:master from aws-kingrj:neuron_inference
Sep 14, 2023

Conversation

@aws-kingrj
Collaborator

… (and utilities) to python for Neuron

  • Currently needed for PyTorch Inference with Neuron
  • Has been tested in Neuron version of torch-xla for many releases
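The new bindings can be exercised roughly along these lines (a minimal sketch based on the test added in this PR; the module path `torch_xla._XLAC.lowering` and the `LoweringContext` method names are assumptions taken from the PR's test and may differ across releases):

```python
import torch
import torch_xla
import torch_xla.core.xla_model as xm

device = xm.xla_device()
a = torch.rand(10, device=device)
b = torch.rand(10, device=device)
xm.mark_step()

result = a + b

# Build a LoweringContext over the pending IR for `result`
# and inspect the HLO it lowers to.
ctx = torch_xla._XLAC.lowering.LoweringContext()
ctx.build([result])
print(ctx.hlo_text())  # human-readable HLO for the captured graph
```

This gives callers (such as Neuron's inference path) direct access to the lowered HLO without going through a full `mark_step` execution.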

@JackCaoG JackCaoG self-requested a review August 9, 2023 23:18
@aws-kingrj aws-kingrj changed the title Adding more explicit HLO lowering control by exposing LoweringContext… [author: jluntamazon] Adding more explicit HLO lowering control by exposing LoweringContext… Aug 10, 2023
Comment thread torch_xla/csrc/init_python_bindings.cpp Outdated
Comment thread torch_xla/csrc/init_python_bindings.cpp
Comment thread torch_xla/csrc/init_python_bindings.cpp Outdated
@JackCaoG
Collaborator

@jluntamazon

Comment thread torch_xla/csrc/init_python_bindings.cpp
@JackCaoG
Collaborator

We can merge it after CI is green. @wonjoolee95, can you cherry-pick this into 2.2 once merged?

Comment thread test/test_operations.py
@seanlatias
Collaborator

Hmmm... The test results in CI are different from what I observe locally. Need to figure out why.

@seanlatias
Collaborator

@JackCaoG @wonjoolee95 is there a way I can directly test in the CI environment?

@JackCaoG
Collaborator

Can you try running the whole test file (test_operations.py) instead of a single test? If I had to guess, another test that runs before this one (in the same process) is affecting the result.

@seanlatias
Collaborator

Yeah, I tried that locally but it still passed.

@JackCaoG
Collaborator

There are logs in https://github.com/pytorch/xla/actions/runs/6174499874/job/16761837847?pr=5431 under "Download and Run docker xxx" that show how to download the CI docker image, and the logs also show how the test is being run. You can pretty much replicate that process on your local machine.

@seanlatias
Collaborator

Cool, thanks.

std::vector<torch::lazy::Value> ir_values;
for (auto& xtensor : xtensors) {
-  torch::lazy::Value value = xtensor->CurrentIrValue();
+  torch::lazy::Value value = xtensor->GetIrValue();
Collaborator


Oh haha, I should have caught this when reviewing.

@seanlatias
Collaborator

I don't have access to the docker image. Now the error messages are all because of Check failed: HasValue(). Do you have any idea what is usually the cause of this?

@JackCaoG
Collaborator

The failure is

buffer with shape f32[10] on device CPU:0 is deleted

and log is

	tsl::CurrentStackTrace[abi:cxx11]()
	torch_xla::runtime::PjRtComputationClient::PjRtData::GetOpaqueHandle()
	torch_xla::LoweringContext::GetParameter(std::shared_ptr<torch::lazy::BackendData> const&)
	torch_xla::DeviceData::Lower(torch_xla::LoweringContext*) const

It suggests that it is trying to get the parameter, but the parameter buffer has already been deleted. This is a bit weird, because the test code is

    a = torch.rand(10, device=device)
    b = torch.rand(10, device=device)
    xm.mark_step()

    result = a + b

The only thing I can guess is that aliasing somehow got messed up. Can you try running with XLA_ENABLE_PARAM_ALIASING=0?

@seanlatias
Collaborator

Just to double check, you mean setting the env var when running this unit test?

@JackCaoG
Collaborator

Yeah, were you able to repro locally?

@seanlatias
Collaborator

No, no luck. I can still run the test successfully locally, with and without the env var.

@JackCaoG
Collaborator

+ echo 'Running in DynamicShape mode: /tmp/pytorch/xla/test/test_operations.py' --verbosity=2
Running in DynamicShape mode: /tmp/pytorch/xla/test/test_operations.py --verbosity=2
+ XLA_EXPERIMENTAL=nonzero:masked_select:masked_scatter
+ run_test /tmp/pytorch/xla/test/test_operations.py --verbosity=2
+ echo 'Running in PjRt runtime: /tmp/pytorch/xla/test/test_operations.py' --verbosity=2
Running in PjRt runtime: /tmp/pytorch/xla/test/test_operations.py --verbosity=2
++ command -v nvidia-smi
+ '[' -x '' ']'
+ PJRT_DEVICE=CPU
+ CPU_NUM_DEVICES=1
+ run_coverage /tmp/pytorch/xla/test/test_operations.py --verbosity=2
+ '[' 0 '!=' 0 ']'
+ python3 /tmp/pytorch/xla/test/test_operations.py --verbosity=2

Seems like it is the dynamic-shape run that is failing. Can you run with

XLA_EXPERIMENTAL=nonzero:masked_select:masked_scatter
PJRT_DEVICE=CPU
CPU_NUM_DEVICES=1
python3 /tmp/pytorch/xla/test/test_operations.py --verbosity=2

@seanlatias
Collaborator


It still works locally.

@JackCaoG
Collaborator

@wonjoolee95 Do you have time to take a look at this one?

@wonjoo-wj
Collaborator

Let me try to build this and PyTorch master locally and see if I can reproduce it, it should be quick (like 15 minutes or so). If not, we can just disable the test in the CI for now.

@seanlatias
Collaborator

Thanks @wonjoolee95

@wonjoo-wj
Collaborator

I can also see that running both the entire test_operations.py and this specific test succeed in my CPU VM:

# Running this one test only
(base) wonjoo@wonjoo-cpu:~/pytorch/xla$ python test/test_operations.py TestLoweringContext.test_api
----------------------------------------------------------------------
Ran 1 test in 0.127s

OK

# Running entire `test_operations.py`
(base) wonjoo@wonjoo-cpu:~/pytorch/xla$ python test/test_operations.py 
/home/wonjoo/pytorch/xla/test/test_operations.py:906: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at /home/wonjoo/pytorch/build/aten/src/ATen/core/TensorBody.h:489.)
  self.assertIsNone(a.grad)
----------------------------------------------------------------------
Ran 158 tests in 37.020s

OK (skipped=2)

Since we can both verify that this passes successfully in our dev envs and cannot reproduce the failure, let's just skip the test to keep the CI happy. @seanlatias, we can skip this with @unittest.skip, and let's leave a comment saying that this failure only happens in CI, with a link to this PR for context.
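The suggested skip can be sketched like this (a hypothetical stand-in for the real test in test/test_operations.py, whose body is elided here):

```python
import unittest

class TestLoweringContext(unittest.TestCase):

    # This test passes locally for multiple developers but fails in CI
    # for reasons we could not reproduce; skipped per the PR discussion.
    @unittest.skip("Fails only in CI; see pytorch/xla#5431 for context.")
    def test_api(self):
        pass  # real body lowers `a + b` through the new LoweringContext bindings

# Running the suite reports the test as skipped rather than failed.
suite = unittest.TestLoader().loadTestsFromTestCase(TestLoweringContext)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(f"skipped={len(result.skipped)}")
```

Using @unittest.skip (rather than deleting the test) keeps the coverage visible and easy to re-enable once the CI-only failure is understood.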

@seanlatias
Collaborator

@wonjoolee95 not sure what is happening with the GPU test.

@wonjoo-wj
Collaborator

Can you rebase with master one more time so it re-triggers the CI?

@seanlatias
Collaborator

It's still the same. It seems the CI cancels itself automatically when it reaches 4 hours. I also see similar results in other PRs.

@JackCaoG
Collaborator

It is a known issue: forked PRs can't use the remote cache. We can just merge.

@JackCaoG JackCaoG merged commit 41b38f5 into pytorch:master Sep 14, 2023
@seanlatias
Collaborator

Thanks @JackCaoG @wonjoolee95

wonjoo-wj pushed a commit that referenced this pull request Sep 14, 2023
…posing LoweringContext… (#5431)

* Adding more explicit HLO lowering control by exposing LoweringContext (and utilities) to python for Neuron

* fixing linter issues

* fixing spacing

* apply comments and fix compilation errors

* add test for new apis

* fix linter

* update test

* update test

* modify test

* reverse back to GetIrValue()

* update test inputs with random numbers

* skip unittest because it only fails in CI

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-3-186.us-west-2.compute.internal>
Co-authored-by: seanlatias <seanlatias@gmail.com>
wonjoo-wj added a commit that referenced this pull request Sep 15, 2023
…posing LoweringContext… (#5431) (#5580)

(same commit message as #5431 above)
Co-authored-by: aws-kingrj <78175353+aws-kingrj@users.noreply.github.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-3-186.us-west-2.compute.internal>
Co-authored-by: seanlatias <seanlatias@gmail.com>
will-cromar pushed a commit that referenced this pull request Sep 18, 2023
…posing LoweringContext… (#5431) (#5580)

(same commit message as #5431 above)
Co-authored-by: aws-kingrj <78175353+aws-kingrj@users.noreply.github.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-3-186.us-west-2.compute.internal>
Co-authored-by: seanlatias <seanlatias@gmail.com>
will-cromar added a commit that referenced this pull request Sep 19, 2023
* Handle dynamo function without input (#5565) (#5577)

* Make cpu tensor on XLA dynamo backend a warning instead of error (#5549) (#5576)

* [author: jluntamazon] Adding more explicit HLO lowering control by exposing LoweringContext… (#5431) (#5580)

(same commit message as #5431 above)

* fixing num_local_processes typo (#5573) (#5579)

Co-authored-by: aws-kingrj <78175353+aws-kingrj@users.noreply.github.com>

* Move where clear pending IR is called to avoid crash (#5552) (#5582)

* Move where clear pending IR is called to avoid crash

* fix CI

* fix CI and add some debugging messages

* Fix release branch and tag patterns for GitHub Actions (#5587) (#5590)

* Improve bernoulli rng-bit-generation memory footprint (#5581) (#5589)

* Allow downcasting RngUniform genenration for Bernoulli

Co-authored-by: Yeounoh Chung <yeounoh@google.com>

* Enable xla:gpu autocast for bfloat16 if not restricted (#5570) (#5591)

* Enable autocast for XLA:GPU

* linter fix

* XLA autocast test for GPU and TPU

* linter fix

* Ensure that xla autocast is properly enabled for GPU and does not crash when torch cuda is not available.

* linter fix

* Add tests

* Support bf16

* linter fix

* exclude unsupported test cases

* increase GPU test timeout to 300

Co-authored-by: Yeounoh Chung <yeounoh@google.com>

* Cherry-pick: Don't trigger CI build on release tag push (#5595)

Copy of #5594 on release branch

* formatting

---------

Co-authored-by: JackCaoG <59073027+JackCaoG@users.noreply.github.com>
Co-authored-by: Wonjoo Lee <wonjoo@google.com>
Co-authored-by: aws-kingrj <78175353+aws-kingrj@users.noreply.github.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-3-186.us-west-2.compute.internal>
Co-authored-by: seanlatias <seanlatias@gmail.com>
Co-authored-by: Manfei <41607353+ManfeiBai@users.noreply.github.com>
Co-authored-by: Yeounoh Chung <yeounoh@google.com>