Make it an error if calling sizes() on a dynamic tensor. by vanbasten23 · Pull Request #4998 · pytorch/xla

vanbasten23 · 2023-05-11T00:16:27Z

No description provided.

vanbasten23 · 2023-05-11T00:21:16Z

The current fix

at::IntArrayRef XLATensorImpl::sizes_custom() const {
  // XLA_CHECK(!has_symbolic_sizes_strides_)
  xla::Shape xla_shape = tensor_->shape().get();
  XLA_CHECK(!xla_shape.is_dynamic())
      << "Cannot call sizes_custom() on an XLA tensor with symbolic "
         "sizes/strides";

is wrong. Doing so would fail at nonzero:

*** Begin stack trace ***
        tsl::CurrentStackTrace[abi:cxx11]()
        torch_xla::XLATensorImpl::sizes_custom() const
        at::functionalization::FunctionalStorageImpl::FunctionalStorageImpl(at::Tensor const&)
        at::FunctionalTensorWrapper::FunctionalTensorWrapper(at::Tensor const&)
        at::functionalization::impl::to_functional_tensor(at::Tensor const&)
        at::_ops::nonzero::redispatch(c10::DispatchKeySet, at::Tensor const&)
        at::_ops::nonzero::call(at::Tensor const&)

So I think the correct fix is to use has_symbolic_sizes_strides_ in TensorImpl instead of xla::Shape. But right now has_symbolic_sizes_strides_ is not set, so I need to figure out where it should be set.
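
To illustrate the intended behaviour, here is a minimal Python sketch of such a guard. It is not the torch_xla implementation: the class FakeTensorImpl, the SymDim placeholder, and the method names are hypothetical stand-ins for XLATensorImpl, a symbolic dimension, and the sizes()/sym_sizes() accessors. The only idea taken from the discussion is that a flag playing the role of has_symbolic_sizes_strides_ decides whether the concrete-size accessor may be called.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple, Union


@dataclass(frozen=True)
class SymDim:
    """Hypothetical stand-in for a symbolic dimension with a known upper bound."""
    upper_bound: int


Dim = Union[int, SymDim]


class FakeTensorImpl:
    """Hypothetical stand-in for a tensor impl that tracks symbolic sizes."""

    def __init__(self, dims: Sequence[Dim]):
        self._dims: Tuple[Dim, ...] = tuple(dims)
        # Plays the role of has_symbolic_sizes_strides_: computed once here.
        self.has_symbolic_sizes_strides = any(
            isinstance(d, SymDim) for d in self._dims)

    def sizes(self) -> Tuple[int, ...]:
        # Concrete accessor: refuse to answer when any dimension is symbolic.
        if self.has_symbolic_sizes_strides:
            raise RuntimeError(
                "Cannot call sizes() on a tensor with symbolic sizes/strides")
        return self._dims

    def sym_sizes(self) -> Tuple[Dim, ...]:
        # Symbolic accessor: always safe to call.
        return self._dims


static = FakeTensorImpl([2, 3])
dynamic = FakeTensorImpl([SymDim(upper_bound=6), 2])

print(static.sizes())
try:
    dynamic.sizes()
except RuntimeError as err:
    print("error:", err)
```

Keeping the flag on the tensor impl, rather than asking the backend shape each time, matches the direction proposed above, but where the real flag gets set is exactly the open question in this comment.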

vanbasten23 · 2023-05-12T19:59:17Z

Hi @ezyang, in this PR I'm trying to make it an error to call tensor.sizes() on a tensor with dynamic dimensions. Specifically, torch_xla has not been setting has_symbolic_sizes_strides_ on XLATensorImpl, so this PR tries to do so. Can I check with you about these questions?

With my change, the nonzero op fails with an error in the functionalization layer, even before we call dynamic_tensor.sizes(). I can reproduce the error by running our test: XLA_EXPERIMENTAL="nonzero:masked_select" python3 pytorch/xla/test/ds/test_dynamic_shapes.py TestDynamicShapes.test_simple_expand. Taking a closer look, I found that the condition at https://github.com/pytorch/pytorch/blob/e248016472fd3b5c8aed270704b0fd21fac20842/aten/src/ATen/FunctionalStorageImpl.cpp#L83 evaluates to false.

Do you know how to fix it? I tried to add the c10::DispatchKey::Python to

xla/torch_xla/csrc/tensor_impl.cpp

Line 60 in 2855373

: c10::TensorImpl(c10::DispatchKeySet{c10::DispatchKey::XLA,

but it didn't help.
As you know, torch_xla's tensors don't have storage. What should we expect the call value.storage().sym_nbytes() at https://github.com/pytorch/pytorch/blob/e248016472fd3b5c8aed270704b0fd21fac20842/aten/src/ATen/FunctionalStorageImpl.cpp#L84 to return?

Thanks.

ezyang · 2023-05-15T19:53:44Z

I wouldn't add Python key to XLA tensor. You may need to modify PyTorch core to properly fix this. I think what you want to do, is make the "recompute from scratch code" (that XLA was previously exercising) work with SymInts. That's mostly calling sym_storage_offset instead of storage_offset, and likewise for anything else that needs it.

vanbasten23 · 2023-05-16T18:48:02Z

is make the "recompute from scratch code" (that XLA was previously exercising) work with SymInts.

Thanks for the response @ezyang . Can you be more specific about where the recompute from scratch code (that XLA was previously exercising) is in pytorch's codebase?

ezyang · 2023-05-16T19:23:00Z

this https://github.com/pytorch/pytorch/blob/e248016472fd3b5c8aed270704b0fd21fac20842/aten/src/ATen/FunctionalStorageImpl.cpp#L87-L88
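
For readers unfamiliar with the "recompute from scratch" idea, the following Python sketch shows the usual way a byte count can be derived from sizes, strides, and a storage offset. It assumes the linked code does something along these lines, which the title of pytorch/pytorch#101634 ("Use the symint version of computeStorageNbytes within get_nbytes") suggests. The function name and signature here are illustrative, not the ATen API.

```python
from typing import Sequence


def compute_storage_nbytes(sizes: Sequence[int],
                           strides: Sequence[int],
                           itemsize: int,
                           storage_offset: int = 0) -> int:
    """Bytes needed to hold a strided tensor (illustrative helper)."""
    if len(sizes) != len(strides):
        raise ValueError("sizes and strides must have the same length")
    # One past the largest reachable element offset, counted from the
    # start of the storage.
    num_elements = storage_offset + 1
    for size, stride in zip(sizes, strides):
        if size == 0:
            return 0  # an empty tensor needs no storage
        num_elements += stride * (size - 1)
    return itemsize * num_elements


# Contiguous 2x3 tensor of 4-byte elements.
print(compute_storage_nbytes((2, 3), (3, 1), itemsize=4))
```

Because the function only uses comparison and arithmetic, the same shape of code could in principle accept symbolic integers, which is the spirit of switching from storage_offset to sym_storage_offset. Whether the zero-size check is acceptable on a symbolic value is left open here.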

…01634) Fixes [#ISSUE_NUMBER](pytorch/xla#4998) according to [comment](pytorch/xla#4998 (comment)). This change is needed to make sure calling tensor.sizes() will error if the tensor has dynamic dimension in pytorch/xla. Pull Request resolved: #101634 Approved by: https://github.com/ezyang

vanbasten23 · 2023-05-23T16:54:58Z

The unit test test_index_types failed because XLANativeFunctions::index uses nonzero, which introduces dynamism. It failed because the following function calls sizes() on tensors with dynamic dimensions:

xla/torch_xla/csrc/tensor_util.h

Lines 139 to 165 in 7a1e9c7

inline std::vector<at::Tensor> xla_expand_outplace(at::TensorList to_expand) {
  // expands a list of Tensors; ignores undefined (null) tensors
  bool first = true;
  at::DimVector sizes;
  for (const auto i : c10::irange(to_expand.size())) {
    if (!to_expand[i].defined()) {
      continue;
    } else if (first) {
      sizes = to_expand[i].sizes();
      first = false;
    } else {
      sizes = at::infer_size_dimvector(sizes, to_expand[i].sizes());
    }
  }
  std::vector<at::Tensor> result(to_expand.size());
  for (const auto i : c10::irange(to_expand.size())) {
    if (!to_expand[i].defined()) {
      continue;
    } else if (to_expand[i].sizes().equals(sizes)) {
      result[i] = to_expand[i];
    } else {
      result[i] = at::expand_copy(to_expand[i], sizes);
    }
  }
  return result;
}

I guess this PR catches this bug.

Uploaded a commit to fix xla_expand_outplace (xla/torch_xla/csrc/tensor_util.h, lines 139 to 165 in 7a1e9c7, quoted above).

To repro: XLA_EXPERIMENTAL="nonzero:masked_select" python3 pytorch/xla/test/test_operations.py TestAtenXlaTensor.test_index_types 2>&1 | tee out.txt

vanbasten23 · 2023-05-23T16:55:41Z

The last unit test is fixed, but I found more unit test failures:

ERROR: test_fill_ (__main__.TestDynamicShapes)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/pytorch/xla/test/ds/test_dynamic_shapes.py", line 399, in test_fill_
    t2.fill_(1)
RuntimeError: torch_xla/csrc/aten_xla_type.cpp:1244: SymIntArrayRef expected to contain only concrete integers

======================================================================
FAIL: test_SizeEq_should_not_compile_for_identical_symints (__main__.TestDynamicShapes)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/pytorch/xla/test/ds/test_dynamic_shapes.py", line 496, in test_SizeEq_should_not_compile_for_identical_symints
    self.assertIsNone(met.metric_data('CompileTime'))
AssertionError: (1, 15493685.0, ((1684807076.5097666, 15493685.0),)) is not None

Will fix them today.

vanbasten23 · 2023-05-23T23:45:33Z

For the failing test:

ERROR: test_fill_ (__main__.TestDynamicShapes)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/pytorch/xla/test/ds/test_dynamic_shapes.py", line 399, in test_fill_
    t2.fill_(1)
RuntimeError: torch_xla/csrc/aten_xla_type.cpp:1244: SymIntArrayRef expected to contain only concrete integers

I would need to symintify XLANativeFunctions::empty_strided_symint, XLANativeFunctions::as_strided_copy, and tensor_methods::as_strided.

Since torch_xla has enabled functionalization, do we still need the XLANativeFunctions::empty_strided_symint? @JackCaoG do you know?

vanbasten23 · 2023-06-09T22:13:53Z

There are 2 remaining test failures:

1. pytorch/xla/test/test_operations.py TestAtenXlaTensor.test_index_types: with PJRT_DEVICE=TPU the test crashes with: Non-OK-status: pjrt_data.buffer->ToLiteralSync(&literal) status: INTERNAL: Error converting to literal: Can't slice buffer of length 2048 with start_offset=1024 length=0. With PJRT_DEVICE=TPU_LEGACY, the test passes.
2. pytorch/xla/test/ds/test_dynamic_shapes.py TestDynamicShapes.test_fill_: to fix it, it seems I need to fix XLANativeFunctions::empty_strided_symint by creating a new XLANativeFunctions::as_strided_copy_symint, a new tensor_methods::as_strided_symint, and a new IR node AsStrided_Symint, assuming we can merge "[Functionalization] Remove CreateAsStridedViewInfo" (#5084).

Hi @will-cromar, regarding test failure (1) above, do you have any input on how to debug it?

vanbasten23 · 2023-06-21T21:04:44Z

It seems XLANativeFunctions::empty_strided_symint today outputs unexpected stride info.

I added a test:

  def test_empty_strided(self):
    a = torch.empty_strided((2, 3), (1, 2), device=dev)
    print('a.stride()=', a.stride())
    print('a.size()=', a.size())
    xm.mark_step()

When running PJRT_DEVICE=TPU_LEGACY XLA_EXPERIMENTAL="nonzero:masked_select" python3 pytorch/xla/test/ds/test_dynamic_shapes.py TestDynamicShapes.test_empty_strided, I got this output:

a.stride()= (3, 1)
a.size()= torch.Size([2, 3])

However, https://pytorch.org/docs/stable/generated/torch.empty_strided.html shows the output stride should be (1, 2).

As I think about it, the test result might be expected because in pytorch/xla, the memory is always non-overlapping and contiguous.
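
The reported stride (3, 1) is what a row-major contiguous layout gives for size (2, 3). A small pure-Python sketch of that calculation follows. The helper name contiguous_strides is made up for illustration, and the sketch says nothing about how pytorch/xla computes strides internally.

```python
from typing import Sequence, Tuple


def contiguous_strides(shape: Sequence[int]) -> Tuple[int, ...]:
    """Row-major strides, in elements, for a dense tensor of the given shape."""
    strides = []
    running = 1
    # Walk dimensions from innermost to outermost; each stride is the number
    # of elements spanned by all dimensions to its right.
    for dim in reversed(shape):
        strides.append(running)
        running *= max(dim, 1)  # keep strides non-zero for empty dimensions
    return tuple(reversed(strides))


print(contiguous_strides((2, 3)))  # (3, 1)
```

So if the backend always treats memory as dense and contiguous, the requested stride (1, 2) is simply not preserved, which is consistent with the output above.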

vanbasten23 · 2023-06-29T13:29:34Z

CPU CI is failing. To repro, run: PJRT_DEVICE=CPU CPU_NUM_DEVICES=1 python3 pytorch/xla/test/test_ops.py -v -k TestOpInfoXLA.test_reference_eager_frexp_xla_float32

vanbasten23 · 2023-06-30T20:45:47Z

It turns out that I need to return empty_symint directly in XLANativeFunctions::empty_strided_symint; otherwise TestOpInfoXLA.test_reference_eager_frexp_xla_float32 fails with an error. The root cause is that the stride passed to XLANativeFunctions::empty_strided_symint is [0, 3, 1], while strides should be positive numbers. Since pt/xla shouldn't care about strides, I feel we could ignore them even in the static size/stride case.

JackCaoG · 2023-07-05T21:05:21Z

    self.assertEqual(dyn_size, dyn_size)
    # Without the code change, met.metric_data('CompileTime')[0] returns 1.
-    self.assertIsNone(met.metric_data('CompileTime'))
+    # self.assertIsNone(met.metric_data('CompileTime'))


remove this line?

This is to show what to test after the below todo is done. So do you still prefer me to remove this?

JackCaoG · 2023-07-05T21:10:29Z

+  return false;
+}
+
+inline std::vector<at::Tensor> xla_expand_outplace_symint(


I think it is better to call this function xla_expand_outplace_symint_helper and to create another function, xla_expand_outplace_helper, which handles the static case. xla_expand_outplace would then have an if check that dispatches to one of the helpers.
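
The structure suggested here can be sketched in Python as follows. This is only an illustration of the dispatch pattern, not the C++ code in the PR: shapes stand in for tensors, a string stands in for a symbolic dimension, None stands in for an undefined tensor, and every function name is hypothetical. As a simplification, a symbolic dimension is never treated as broadcastable.

```python
from typing import List, Optional, Tuple, Union

Dim = Union[int, str]  # a str is a hypothetical symbolic dimension
Shape = Tuple[Dim, ...]


def _broadcast(a: Shape, b: Shape) -> Shape:
    # Right-align the two shapes, then merge them dimension by dimension.
    ndim = max(len(a), len(b))
    a = (1,) * (ndim - len(a)) + tuple(a)
    b = (1,) * (ndim - len(b)) + tuple(b)
    out = []
    for x, y in zip(a, b):
        if x == 1:
            out.append(y)
        elif y == 1 or x == y:
            out.append(x)
        else:
            raise ValueError(f"cannot broadcast {x} with {y}")
    return tuple(out)


def _target_shape(shapes: List[Optional[Shape]]) -> Optional[Shape]:
    target = None
    for shape in shapes:
        if shape is None:  # skip undefined entries
            continue
        target = shape if target is None else _broadcast(target, shape)
    return target


def expand_outplace_helper(shapes):
    # Static path: every dimension is expected to be a concrete int.
    return "static", _target_shape(shapes)


def expand_outplace_symint_helper(shapes):
    # Dynamic path: dimensions may be symbolic.
    return "symint", _target_shape(shapes)


def expand_outplace(shapes: List[Optional[Shape]]):
    # Single entry point with an if check, as suggested in the review.
    dynamic = any(
        isinstance(d, str) for s in shapes if s is not None for d in s)
    if dynamic:
        return expand_outplace_symint_helper(shapes)
    return expand_outplace_helper(shapes)


print(expand_outplace([(2, 1), None, (3,)]))
print(expand_outplace([("s0", 1), (1, 4)]))
```

Callers keep using one entry point, and only the dynamic path has to avoid the concrete sizes() accessor.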

vanbasten23 · 2023-07-06T00:23:06Z

Thanks for the review!

* Err if calling sizes() on dynamic tensor
* try to set has_symbolic_sizes_strides_
* resolve merge conflict
* enable CONTINUE_ON_ERROR
* fixed the python test test_SizeEq_should_not_compile_for_identical_symints
* fix test_index_types
* set CONTINUE_ON_ERROR to true
* remove some unwanted code.
* add a print
* directly set has_symbolic_sizes_strides_ = true
* make some fixes.
* fix empty_strided_symint
* ran linter
* change error type in the test.
* fix comments
* ran linter

* Skip calling as_strided in empty_strided_symint if the input has dynamic dimensions. (#5239) * Skip calling as_strided in empty_strided_symint. * only return empty_symint conditionally. * add a comment * Add XRT nightly builds (#5261) * Add XRT nightly builds * remove space * Add ToString method for both PjrtData and PjrtShardedData (#5265) * Add ToString method for both PjrtData and PjrtShardedData * on cpu same config will become replicated, dont't check actual op sharding type * fix xrt tostring * Update Sharded graph HLO dumping (#5266) * Disable Bazel remote cache for forked PR (#5259) * disable bazel remote cache if gcloud key is empty * remove remote cache from setup.py * experiment with debug msg * fix flag * add more logs * skip remote chache if credential file is empty * add comment * add logs * add check in test and coverage script * fix condition in coverage test * advance branch pr * allow remote cache if gloud file isn't specified explicitly * remove dummy comment * Suppress debug symbols in OpenXLA code (#5269) * [SPMD] Sharding n-d tensor on (n+1)-d Mesh (#5268) * Make TPU detection more robust (#5271) * Clean bazel stuff on distutils clean. (#5274) * Clean bazel stuff on distutils clean * Fix python formatting * fix conflict * Fix the error when export_torch_model is given a non-tensor (#5277) However the generated StableHLO graph still hardcodes the non-tensor value. this is not correct, will fix later. * Dsiable test_simple_model_with_different_input_shape since it is curretnly broken by pytorch (#5282) * Always do build_ext in python setup.py develop (#5273) Bazel should figure out that _XLAC.so is current or not, and trigger rebuild if any cpp files changed. * Remove or improve several hardcoded TPU test conditions (#5272) * Remove or improve several hardcoded TPU test conditions * Fix test condition * Add `runtime.host_index` (#5283) * Make it an error if calling sizes() on a dynamic tensor. 
(#4998) * Err if calling sizes() on dynamic tensor * try to set has_symbolic_sizes_strides_ * resolve merge conflict * enable CONTINUE_ON_ERROR * fixed the python test test_SizeEq_should_not_compile_for_identical_symints * fix test_index_types * set CONTINUE_ON_ERROR to true * remove some unwanted code. * add a print * directly set has_symbolic_sizes_strides_ = true * make some fixes. * fix empty_strided_symint * ran linter * change error type in the test. * fix comments * ran linter * Fix the error where mark_step does not materalize tensors on SPMD:0 (#5281) * Fix the error where mark_step does not materalize tensors on SPMD:0 * typo * fix test_non_tensor_scalar * Disable torch._dynamo.config.automatic_dynamic_shapes (#5285) * Set torch._dynamo.config.automatic_dynamic_shapes to False * Enable DynamoInferenceBasicTest.test_simple_model_with_different_input_shape * [Traceable Collecive] Hide token for all_gather (#5232) Summary: This pull request does the following: 1. It hides token for all_gather. 2. It folds the out-of-place all_gather into the regular all_gather. 3. It fixes an issue with the last all_reduce_in_place PR where it forgot to set the token. Test Plan: PJRT_DEVICE=TPU python test/test_mp_all_gather.py * Lower squeeze.dims (#5286) * avoid copy proto in PrepareOutputShardingPropagation (#5287) * Revert "Suppress debug symbols in OpenXLA code (#5269)" This reverts commit 3967d7b. * Revert "fix conflict" This reverts commit e91ad3a. --------- Co-authored-by: iefgnoix <isaacwxf23@gmail.com> Co-authored-by: Will Cromar <wcromar@google.com> Co-authored-by: Siyuan Liu <lsiyuan@google.com> Co-authored-by: stgpetrovic <stgpetrovic@gmail.com> Co-authored-by: Mohit Khatwani <118776932+khatwanimohit@users.noreply.github.com> Co-authored-by: qihqi <hanq@google.com> Co-authored-by: Wonjoo Lee <wonjoo@google.com> Co-authored-by: Jiewen Tan <jwtan@google.com> Co-authored-by: Baole Ai <baoleai01@gmail.com>

* initiak commit * Add test workflow for `xrt` branch (#5241) * Add test workflow for `xrt` branch * Only run for PRs targeting XRT branch * Add function to generate stablehlo based callable from pytorch model (#5216) * Add function to generate stablehlo based callable from pytorch model Added function `torch_xla.experimental.stablehlo_saved_model.export_pytorch_model`. This function will take a pytorch Module and convert it into stablehlo bytecode. * Only run the main CI workflow on PRs targeting master and release branches (#5244) * Only run main CI for master and release branches. * Disabling XRT tests on main CI * AMP for TPUs v3 (#5161) * remove duplicate autocast_test (#5246) * Remove `test_experimental_pjrt_tpu.py` from TPU CI (#5247) * Install `expecttest` in xla_test_job.yaml (#5252) * Add IAM roles for cloudbuild_editors (#5251) * [Functionalization] Remove view in view_symint (#5231) * [Functionalization] Remove view in view_symint Summary: This pull request removes views in tensor_method::view_symint. Test Plan: XLA_DISABLE_FUNCTIONALIZATION=1 PJRT_DEVICE=TPU python ../test/test_view_ops.py -v -k TestViewOpsXLA.test_view_view PJRT_DEVICE=TPU python ../test/test_view_ops.py -v -k TestViewOpsXLA.test_view_view * Fix linters * fixed the test * ran the linter --------- Co-authored-by: Xiongfei Wei <isaacwxf23@gmail.com> * Delete XRT from the main branch (#5240) * Delete XRT from the main branch * Remove dead import * formatting * Remove disable_xrt build option * Fix runtime init * Revert "Remove disable_xrt build option" This reverts commit ba312e7. 
* Add disable XRT option back * formatting * Prune mesh service * Remove obsolete test * Remove other run server script * Remove XRT config * Update PJRT default device test * Add a file I forgot to save * if using_pjrt -> @requires_pjrt * Remove irrelevant test case * Remove XRT env vars * fix md link * formatting * Remove extra `requires_pjrt` * merge conflicts * Add other autocast back * Add nightly build for cuda 12 (#5253) * Fix the linter command in the CI (#5254) * fix linter command * ran linter * Jack cao g/fix spmd buff is null (#5256) * Fix that non-tensor scalar can't be handled by virtual device * add test * comment * Skip calling as_strided in empty_strided_symint if the input has dynamic dimensions. (#5239) * Skip calling as_strided in empty_strided_symint. * only return empty_symint conditionally. * add a comment * Add XRT nightly builds (#5261) * Add XRT nightly builds * remove space * [OpenXLA] Migrate to pull XLA from OpenXLA (#5202) PyTorch/XLA migrate to pull XLA from OpenXLA by replacing TensorFlow with OpenXLA after deprecating XRT usage, and replace TensorFlow-pin with OpenXLA-pin to May09 * Add ToString method for both PjrtData and PjrtShardedData (#5265) * Add ToString method for both PjrtData and PjrtShardedData * on cpu same config will become replicated, dont't check actual op sharding type * Update Sharded graph HLO dumping (#5266) * Enable PjRt Client Compilation with StableHLO (#5233) * Enable xla PjRt client compilation with StableHLO * add XLA_STABLEHLO_COMPILE to configuration.yaml * fix merge conflict * dummy commit to trigger ci * Revert "dummy commit to trigger ci" This reverts commit f7aec23. 
* Disable Bazel remote cache for forked PR (#5259) * disable bazel remote cache if gcloud key is empty * remove remote cache from setup.py * experiment with debug msg * fix flag * add more logs * skip remote chache if credential file is empty * add comment * add logs * add check in test and coverage script * fix condition in coverage test * advance branch pr * allow remote cache if gloud file isn't specified explicitly * remove dummy comment * Suppress debug symbols in OpenXLA code (#5269) * [SPMD] Sharding n-d tensor on (n+1)-d Mesh (#5268) * Make TPU detection more robust (#5271) * Clean bazel stuff on distutils clean. (#5274) * Clean bazel stuff on distutils clean * Fix python formatting * Delete unused .so file, and .lds files (#5275) * [OpenXLA] Delete unused .so file and .lds files * Fix the error when export_torch_model is given a non-tensor (#5277) However the generated StableHLO graph still hardcodes the non-tensor value. this is not correct, will fix later. * Dsiable test_simple_model_with_different_input_shape since it is curretnly broken by pytorch (#5282) * Always do build_ext in python setup.py develop (#5273) Bazel should figure out that _XLAC.so is current or not, and trigger rebuild if any cpp files changed. * Remove or improve several hardcoded TPU test conditions (#5272) * Remove or improve several hardcoded TPU test conditions * Fix test condition * Add `runtime.host_index` (#5283) * Make it an error if calling sizes() on a dynamic tensor. (#4998) * Err if calling sizes() on dynamic tensor * try to set has_symbolic_sizes_strides_ * resolve merge conflict * enable CONTINUE_ON_ERROR * fixed the python test test_SizeEq_should_not_compile_for_identical_symints * fix test_index_types * set CONTINUE_ON_ERROR to true * remove some unwanted code. * add a print * directly set has_symbolic_sizes_strides_ = true * make some fixes. * fix empty_strided_symint * ran linter * change error type in the test. 
* fix comments * ran linter * Fix the error where mark_step does not materalize tensors on SPMD:0 (#5281) * Fix the error where mark_step does not materalize tensors on SPMD:0 * typo * fix test_non_tensor_scalar * Disable torch._dynamo.config.automatic_dynamic_shapes (#5285) * Set torch._dynamo.config.automatic_dynamic_shapes to False * Enable DynamoInferenceBasicTest.test_simple_model_with_different_input_shape * run linter * wrap only if sharding type is non-replicated * Handle non-tensors * run linter * Call wrap_if_sharded first * Add exception in test for unsharded tensor * fix test * Use torch.Tensor instead of torch.tensor * use .cpu() only for tensors --------- Co-authored-by: Will Cromar <wcromar@google.com> Co-authored-by: qihqi <hanq@google.com> Co-authored-by: Meghan Cowan <cowanmeg@google.com> Co-authored-by: Mateusz Lewko <mateusz.lewko@gmail.com> Co-authored-by: Jiewen Tan <jwtan@google.com> Co-authored-by: Xiongfei Wei <isaacwxf23@gmail.com> Co-authored-by: Wonjoo Lee <wonjoo@google.com> Co-authored-by: JackCaoG <59073027+JackCaoG@users.noreply.github.com> Co-authored-by: Manfei <41607353+ManfeiBai@users.noreply.github.com> Co-authored-by: Siyuan Liu <lsiyuan@google.com> Co-authored-by: stgpetrovic <stgpetrovic@gmail.com> Co-authored-by: Mohit Khatwani <118776932+khatwanimohit@users.noreply.github.com>

* Update inline style code to multiline (#5291) * Fix typo in _test.yml (#5172) s/metadtaa/metadata/ * [SPMD][Virtual Device]All tensors should be in SPMD:0 C++ device (#5284) * Move all tensors to SPMD:0 C++ device under spmd context * fix load shards * fix test_mark_sharding_2d by not creating placeholder for virtual device * fix the waitdeviceop for spmd case * Fix test_shard_hashing * fix spmd device casting issue * remove hacks in test_xla_virtual_device.py * add test for new virtual device usage * fix review comments * fix IsTpuDevice * linter * Revert pr #2682 (#5215) * Make README more actionable (#5262) * Make README more actionable * move profiling guide link * text wrapping * [SPMD] Use xs.Mesh in test_2d_tensor_3d_mesh (#5295) * use mesh in test_2d_tensor_3d_mesh * remove attributes patch * [SPMD] Add FSDP sharding for test_train_spmd_linear_model.py (#5299) Summary: This diff adds FSDP sharding for test_train_spmd_linear_model.py. Test Plan: PJRT_DEVICE=TPU XLA_USE_SPMD=1 python test/spmd/test_train_spmd_linear_model.py --sharding fsdp * [SPMD] Avoid recompilations in xs.mark_sharding() (#5300) Summary: This pull requests fixes the recompilation issue in xs.mark_sharding(). xtensor->GetXlaData() will compile the program if xtensor is an IR in order to get the BackendData. I believe this is not intended given the error message below suggests only data type xtensors are supported. Test Plan: PJRT_DEVICE=TPU XLA_USE_SPMD=1 python test/spmd/test_xla_sharding.py * [SPMD] Support mark_sharding on IRs (#5301) Summary: This pull requests fixes the recompilation issue in xs.mark_sharding(). xtensor->GetXlaData() will compile the program if xtensor is an IR in order to get the BackendData. I believe this is not intended given the error message below suggests only data type xtensors are supported. 
Test Plan: PJRT_DEVICE=TPU XLA_USE_SPMD=1 python test/spmd/test_xla_sharding.py * [SPMD] Allow dumping post optimizations hlo (#5302) Summary: This pull request partial reverts the change in #5266 to re-enble dumping post optimizations hlo. Test Plan: XLA_USE_SPMD=1 PJRT_DEVICE=TPU python test/spmd/test_xla_sharding.py -v -k test_xla_sharded_hlo_dump_post_optimizations * Add `_sharded_cpu_state_dict` for distributed checkpointing (#5288) * initiak commit * Add test workflow for `xrt` branch (#5241) * Add test workflow for `xrt` branch * Only run for PRs targeting XRT branch * Add function to generate stablehlo based callable from pytorch model (#5216) * Add function to generate stablehlo based callable from pytorch model Added function `torch_xla.experimental.stablehlo_saved_model.export_pytorch_model`. This function will take a pytorch Module and convert it into stablehlo bytecode. * Only run the main CI workflow on PRs targeting master and release branches (#5244) * Only run main CI for master and release branches. * Disabling XRT tests on main CI * AMP for TPUs v3 (#5161) * remove duplicate autocast_test (#5246) * Remove `test_experimental_pjrt_tpu.py` from TPU CI (#5247) * Install `expecttest` in xla_test_job.yaml (#5252) * Add IAM roles for cloudbuild_editors (#5251) * [Functionalization] Remove view in view_symint (#5231) * [Functionalization] Remove view in view_symint Summary: This pull request removes views in tensor_method::view_symint. 
Test Plan: XLA_DISABLE_FUNCTIONALIZATION=1 PJRT_DEVICE=TPU python ../test/test_view_ops.py -v -k TestViewOpsXLA.test_view_view PJRT_DEVICE=TPU python ../test/test_view_ops.py -v -k TestViewOpsXLA.test_view_view * Fix linters * fixed the test * ran the linter --------- Co-authored-by: Xiongfei Wei <isaacwxf23@gmail.com> * Delete XRT from the main branch (#5240) * Delete XRT from the main branch * Remove dead import * formatting * Remove disable_xrt build option * Fix runtime init * Revert "Remove disable_xrt build option" This reverts commit ba312e7. * Add disable XRT option back * formatting * Prune mesh service * Remove obsolete test * Remove other run server script * Remove XRT config * Update PJRT default device test * Add a file I forgot to save * if using_pjrt -> @requires_pjrt * Remove irrelevant test case * Remove XRT env vars * fix md link * formatting * Remove extra `requires_pjrt` * merge conflicts * Add other autocast back * Add nightly build for cuda 12 (#5253) * Fix the linter command in the CI (#5254) * fix linter command * ran linter * Jack cao g/fix spmd buff is null (#5256) * Fix that non-tensor scalar can't be handled by virtual device * add test * comment * Skip calling as_strided in empty_strided_symint if the input has dynamic dimensions. (#5239) * Skip calling as_strided in empty_strided_symint. * only return empty_symint conditionally. 
* add a comment * Add XRT nightly builds (#5261) * Add XRT nightly builds * remove space * [OpenXLA] Migrate to pull XLA from OpenXLA (#5202) PyTorch/XLA migrate to pull XLA from OpenXLA by replacing TensorFlow with OpenXLA after deprecating XRT usage, and replace TensorFlow-pin with OpenXLA-pin to May09 * Add ToString method for both PjrtData and PjrtShardedData (#5265) * Add ToString method for both PjrtData and PjrtShardedData * on cpu same config will become replicated, dont't check actual op sharding type * Update Sharded graph HLO dumping (#5266) * Enable PjRt Client Compilation with StableHLO (#5233) * Enable xla PjRt client compilation with StableHLO * add XLA_STABLEHLO_COMPILE to configuration.yaml * fix merge conflict * dummy commit to trigger ci * Revert "dummy commit to trigger ci" This reverts commit f7aec23. * Disable Bazel remote cache for forked PR (#5259) * disable bazel remote cache if gcloud key is empty * remove remote cache from setup.py * experiment with debug msg * fix flag * add more logs * skip remote chache if credential file is empty * add comment * add logs * add check in test and coverage script * fix condition in coverage test * advance branch pr * allow remote cache if gloud file isn't specified explicitly * remove dummy comment * Suppress debug symbols in OpenXLA code (#5269) * [SPMD] Sharding n-d tensor on (n+1)-d Mesh (#5268) * Make TPU detection more robust (#5271) * Clean bazel stuff on distutils clean. (#5274) * Clean bazel stuff on distutils clean * Fix python formatting * Delete unused .so file, and .lds files (#5275) * [OpenXLA] Delete unused .so file and .lds files * Fix the error when export_torch_model is given a non-tensor (#5277) However the generated StableHLO graph still hardcodes the non-tensor value. this is not correct, will fix later. 
* Disable test_simple_model_with_different_input_shape since it is currently broken by pytorch (#5282)
* Always do build_ext in python setup.py develop (#5273)
  Bazel should figure out whether _XLAC.so is current, and trigger a rebuild if any cpp files changed.
* Remove or improve several hardcoded TPU test conditions (#5272)
* Remove or improve several hardcoded TPU test conditions
* Fix test condition
* Add `runtime.host_index` (#5283)
* Make it an error if calling sizes() on a dynamic tensor. (#4998)
* Err if calling sizes() on dynamic tensor
* try to set has_symbolic_sizes_strides_
* resolve merge conflict
* enable CONTINUE_ON_ERROR
* fixed the python test test_SizeEq_should_not_compile_for_identical_symints
* fix test_index_types
* set CONTINUE_ON_ERROR to true
* remove some unwanted code.
* add a print
* directly set has_symbolic_sizes_strides_ = true
* make some fixes.
* fix empty_strided_symint
* ran linter
* change error type in the test.
* fix comments
* ran linter
* Fix the error where mark_step does not materialize tensors on SPMD:0 (#5281)
* Fix the error where mark_step does not materialize tensors on SPMD:0
* typo
* fix test_non_tensor_scalar
* Disable torch._dynamo.config.automatic_dynamic_shapes (#5285)
* Set torch._dynamo.config.automatic_dynamic_shapes to False
* Enable DynamoInferenceBasicTest.test_simple_model_with_different_input_shape
* run linter
* wrap only if sharding type is non-replicated
* Handle non-tensors
* run linter
* Call wrap_if_sharded first
* Add exception in test for unsharded tensor
* fix test
* Use torch.Tensor instead of torch.tensor
* use .cpu() only for tensors

---------

Co-authored-by: Will Cromar <wcromar@google.com>
Co-authored-by: qihqi <hanq@google.com>
Co-authored-by: Meghan Cowan <cowanmeg@google.com>
Co-authored-by: Mateusz Lewko <mateusz.lewko@gmail.com>
Co-authored-by: Jiewen Tan <jwtan@google.com>
Co-authored-by: Xiongfei Wei <isaacwxf23@gmail.com>
Co-authored-by: Wonjoo Lee <wonjoo@google.com>
Co-authored-by: JackCaoG <59073027+JackCaoG@users.noreply.github.com>
Co-authored-by: Manfei <41607353+ManfeiBai@users.noreply.github.com>
Co-authored-by: Siyuan Liu <lsiyuan@google.com>
Co-authored-by: stgpetrovic <stgpetrovic@gmail.com>
Co-authored-by: Mohit Khatwani <118776932+khatwanimohit@users.noreply.github.com>

* Support unordered sharding spec correctly (#5305)
* Support non-ordered sharding spec correctly
* use permute instead of transpose
* use dim > 2 to suit TPU v3 (otherwise it can't be divided evenly)
* Support unordered sharding spec for partial replication (#5316)
* Support unordered sharding spec for partial replication
* add 4d test
* handle 2d tensor with 2d mesh case
* refactoring
* Fix mismatched GPU docker image in the doc. (#5319)
* quick refactor on _get_group_assignment (#5318)
* Add tf independent serialization (#5308)
  Create a serialization format for StableHLO graphs and weights without tf.saved_model. TensorFlow must be avoided because it is no longer a dependency of pytorch/xla. The information saved is enough to reconstruct the tf.saved_model for serving. Information stored:
* metadata on which tensor maps which input position
* StableHLO version number
* metadata on which tensor corresponds to user input or parameter
* metadata on shape and dtype of each tensor.
* Tensors themselves are saved as numpy arrays using np.save.
* Disable coverage for now (#5321)
* Enable Some input output aliasing under SPMD (#5320)
* Use `_sharded_cpu_state_dict` functionality to Write Items for SPMD Save Planner (#5315)
* initial commit
* add suggested changes
* add unit test
* fix test
* fix test
* add suggested changes
* remove is_sharded_tensor check
* check if device type is xla in `wrap_if_sharded`
* change order
* update resolve_data and add more tests
* run linter
* use subtest
* formatting fixes
* run linter
* handle single tensor for method send_to_device_single (#5317)
* handle single tensor for method send_to_device_single
* fix broadcast parameter

---------

Co-authored-by: Wonjoo Lee <wonjoo@google.com>
Co-authored-by: Nikita Shulga <nshulga@meta.com>
Co-authored-by: iefgnoix <isaacwxf23@gmail.com>
Co-authored-by: Will Cromar <wcromar@google.com>
Co-authored-by: Mohit Khatwani <118776932+khatwanimohit@users.noreply.github.com>
Co-authored-by: Jiewen Tan <jwtan@google.com>
Co-authored-by: Yash Shah <55116947+yashs97@users.noreply.github.com>
Co-authored-by: qihqi <hanq@google.com>
Co-authored-by: Meghan Cowan <cowanmeg@google.com>
Co-authored-by: Mateusz Lewko <mateusz.lewko@gmail.com>
Co-authored-by: Manfei <41607353+ManfeiBai@users.noreply.github.com>
Co-authored-by: Siyuan Liu <lsiyuan@google.com>
Co-authored-by: stgpetrovic <stgpetrovic@gmail.com>

…torch#101634) Fixes [#ISSUE_NUMBER](pytorch/xla#4998) according to [comment](pytorch/xla#4998 (comment)). This change is needed to make sure calling tensor.sizes() will error if the tensor has dynamic dimension in pytorch/xla. Pull Request resolved: pytorch#101634 Approved by: https://github.com/ezyang
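The behavior this PR is after (a flag on the tensor implementation that turns a concrete `sizes()` query into an error once any dimension is dynamic) can be illustrated with a small standalone sketch. Everything here is hypothetical: `ToyTensorImpl`, `size_upper_bounds()`, and `SizesThrows` are illustrative names, not the real `XLATensorImpl` or `c10::TensorImpl` API, and the real code reports the failure through `XLA_CHECK` rather than `std::runtime_error`.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for a tensor implementation. It is NOT the real
// XLATensorImpl; it only mimics the "flag set at construction" idea.
class ToyTensorImpl {
 public:
  ToyTensorImpl(std::vector<int64_t> upper_bounds, std::vector<bool> is_dynamic)
      : upper_bounds_(std::move(upper_bounds)),
        is_dynamic_(std::move(is_dynamic)) {
    assert(upper_bounds_.size() == is_dynamic_.size());
    // Record once whether any dimension is dynamic, so later queries can
    // consult the flag instead of re-inspecting the shape.
    for (bool dynamic : is_dynamic_) {
      if (dynamic) has_symbolic_sizes_strides_ = true;
    }
  }

  bool has_symbolic_sizes_strides() const {
    return has_symbolic_sizes_strides_;
  }

  // Concrete sizes are only meaningful when every dimension is static.
  const std::vector<int64_t>& sizes() const {
    if (has_symbolic_sizes_strides_) {
      throw std::runtime_error(
          "Cannot call sizes() on a tensor with symbolic sizes/strides");
    }
    return upper_bounds_;
  }

  // Always-available query (assumed name): the per-dimension upper bounds.
  const std::vector<int64_t>& size_upper_bounds() const {
    return upper_bounds_;
  }

 private:
  std::vector<int64_t> upper_bounds_;
  std::vector<bool> is_dynamic_;
  bool has_symbolic_sizes_strides_ = false;
};

// Helper for callers that want to probe the behavior without crashing.
inline bool SizesThrows(const ToyTensorImpl& tensor) {
  try {
    (void)tensor.sizes();
  } catch (const std::runtime_error&) {
    return true;
  }
  return false;
}
```

In this sketch a tensor built with `{5, 3}` and `{true, false}` rejects `sizes()` but still answers `size_upper_bounds()`, which loosely mirrors why callers such as `get_nbytes` need a symbolic-size code path instead of the concrete one.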

vanbasten23 mentioned this pull request May 17, 2023

Use the symint version of computeStorageNbytes within get_nbytes. pytorch/pytorch#101634

Closed

vanbasten23 force-pushed the errorWhenTensorIsDynamic branch from 5bdbe7c to 067f37a Compare May 22, 2023 17:21

vanbasten23 force-pushed the errorWhenTensorIsDynamic branch from 1e6d1d2 to cbbce86 Compare May 25, 2023 17:36

vanbasten23 force-pushed the errorWhenTensorIsDynamic branch from 134526c to a9a9351 Compare June 8, 2023 20:13

vanbasten23 mentioned this pull request Jun 12, 2023

Set has_symbolic_sizes_strides on tensorImpl. #5164

Draft

vanbasten23 force-pushed the errorWhenTensorIsDynamic branch 3 times, most recently from cd5e578 to e7f8aa1 Compare June 27, 2023 23:59

vanbasten23 added 10 commits June 29, 2023 21:39

Err if calling sizes() on dynamic tensor

9b5b362

try to set has_symbolic_sizes_strides_

d7864a7

resolve merge conflict

fa7540f

enable CONTINUE_ON_ERROR

42cc054

fixed the python test test_SizeEq_should_not_compile_for_identical_symints

81e4db9

fix test_index_types

6d83acb

set CONTINUE_ON_ERROR to true

f718397

remove some unwanted code.

4eb37fe

add a print

4b3b1e5

directly set has_symbolic_sizes_strides_ = true

2dceafc

make some fixes.

cbf6ee0

vanbasten23 force-pushed the errorWhenTensorIsDynamic branch from 31dabc4 to cbf6ee0 Compare June 29, 2023 22:24

vanbasten23 added 3 commits June 30, 2023 17:44

fix empty_strided_symint

3ff6cd4

ran linter

803346c

change error type in the test.

07b7e1c

JackCaoG reviewed Jul 5, 2023


Comment thread torch_xla/csrc/aten_xla_type.cpp

JackCaoG reviewed Jul 5, 2023


Comment thread torch_xla/csrc/ops/index_ops.cpp

JackCaoG reviewed Jul 5, 2023


Comment thread torch_xla/csrc/tensor_util.h Outdated

JackCaoG reviewed Jul 5, 2023


vanbasten23 added 2 commits July 5, 2023 21:26

fix comments

2a5ffdb

ran linter

c275647

vanbasten23 requested a review from JackCaoG July 5, 2023 21:34

JackCaoG approved these changes Jul 5, 2023


vanbasten23 merged commit d3779cb into master Jul 6, 2023

Conversation

vanbasten23 commented May 11, 2023

vanbasten23 commented May 11, 2023

vanbasten23 commented May 12, 2023

ezyang commented May 15, 2023

vanbasten23 commented May 16, 2023

ezyang commented May 16, 2023

vanbasten23 commented May 23, 2023 • edited

vanbasten23 commented May 23, 2023

vanbasten23 commented May 23, 2023 • edited

vanbasten23 commented Jun 9, 2023

vanbasten23 commented Jun 21, 2023 • edited

vanbasten23 commented Jun 29, 2023

vanbasten23 commented Jun 30, 2023

JackCaoG Jul 5, 2023

vanbasten23 Jul 5, 2023

JackCaoG Jul 5, 2023

vanbasten23 Jul 5, 2023

vanbasten23 commented Jul 6, 2023
