Cherry-pick 2.1 release branch into XRT branch through 9/14 #5574
Merged
will-cromar merged 117 commits into xrt on Sep 15, 2023
Conversation
* sharding should be per output of IR Node, instead of per IR Node * Update sharding_hash method * Add test for sharding on IR with multiple output * fix cpu test * Fix a bug in getSharding
* Make python Api to respect the virtual device when SPMD is enabled * fix typo
* Also dump output sharding on HLO file * only dump output sharding if dump format is HLO * add test * fix typo
* Make all-reduce a no-op when world size is 1 * Fix torch.distributed test
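The commit above makes all-reduce a no-op when there is only one participant. A minimal pure-Python sketch of that guard, assuming illustrative names (`all_reduce`, `local_value`, `peer_values`) rather than the actual torch_xla internals:

```python
def all_reduce(local_value, peer_values, world_size):
    # No-op fast path: with a single participant there is nothing to
    # combine across devices, so return the input unchanged.
    if world_size == 1:
        return local_value
    # Stand-in for the real cross-replica sum over all participants.
    return local_value + sum(peer_values)
```

With `world_size == 1` the function never touches the communication path, which is the point of the fix.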
* fix amp dtype setting for GPU. * fix ut * fix lint. * minor.
* Add python test for SPMD+Runtime Python API * replace test name * Update test_xla_spmd_python_api_interaction.py
…5352) * Check the actual device instead of query env var for virtual device * revert unneeded change * minor changes
* tweak `atol` and `rtol`
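For context on what `atol` and `rtol` control: closeness checks such as `torch.allclose` accept two values when their absolute difference stays within a combined absolute-plus-relative bound. A small sketch of that criterion (the function name is illustrative):

```python
def is_close(actual, expected, rtol=1e-5, atol=1e-8):
    # Criterion used by torch.allclose / numpy.isclose:
    #   |actual - expected| <= atol + rtol * |expected|
    # Raising atol or rtol widens the acceptance band for a flaky test.
    return abs(actual - expected) <= atol + rtol * abs(expected)
```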
* Skip `DynamoTrainingBasicTest.test_resnet18` on TPU
* Add kokoro presubmit for stablehlo tests
…5367) * [BE] use self.assertEquals instead of str equality in test_zero1.py * Use our own assertEqual * Remove print statements
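The motivation for the commit above: `unittest`'s `assertEqual` compares objects structurally and prints a readable diff on failure, whereas comparing `str(...)` representations (or using the deprecated `assertEquals` alias) hides the actual mismatch. A minimal illustration, with a hypothetical test class and values:

```python
import unittest

class Zero1StateTest(unittest.TestCase):
    def test_state_dict(self):
        expected = {"lr": 0.01, "step": 1}
        actual = {"lr": 0.01, "step": 1}
        # Structural comparison with a useful diff on failure,
        # instead of assertEqual(str(expected), str(actual)).
        self.assertEqual(expected, actual)

# Run the case programmatically so the result can be inspected.
suite = unittest.TestLoader().loadTestsFromTestCase(Zero1StateTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```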
* Fix ReplicateShardedData for int type * add test
Update dynamo.md to remove note about fallback ops since they're supported now
…erent_input_shape` on TPU (#5373) * tweak `atol` and `rtol` for `test_simple_model_with_different_input_shape` on TPU
…mutability (#5382) * [TEST ONLY] print statements for test_zero1.py to debug * Try fix * Rectify test_zero1.py to account for state_dict modification * Fix lint
#5384) * Add gpu doc for how to build PyTorch/XLA from source with GPU support. * fix typo * fix comments * fix comments
* Add more support for in-place ops in dynamo bridge; run linter * Add check to explicitly sync self tensors; remove debugging lines; update unit tests to a model * Clean up some code; surround in an if-statement; update metrics for fallback related dynamo tests; update cloned args logic * Revert "Update metrics for fallback related dynamo tests" (reverts commit 3855f43) * Update single_node flag back to False
Add dynamo test in TPU CI
Summary: During the LLaMA2 experiments, I discovered that manually marking 1D tensors as replicated can save a lot of memory. Then I discovered that an explicitly replicated spec gets dropped after mark_step. That is caused by PrepareOutputShardingPropagation, which explicitly clears the sharding spec for replicated output. So, I went ahead and fixed that. Further, I did some experiments with propagating replicated output, and that drops the requirement of manually replicating 1D tensors. Hence, I made this change. I'm still not quite sure why; will follow up later. Test Plan: PJRT_DEVICE=TPU python test/spmd/test_xla_sharding.py
Update more places Add torch_pin
* Update project metadata and remove useless files * Update README * Add manylinux platform tag * formatting
Collaborator (Author): Something actually unconditionally calls
Collaborator (Author): I'm able to build the current commit against PyTorch on the 2.1 branch and run MNIST on TPU with XRT 🎊
Collaborator (Author): I disabled this test that is missing from this branch. I don't know where it got lost, but there's no reason to use stablehlo with XRT anyway.
Collaborator (Author): Also, I was hitting a weird build failure on this branch until I updated the CI cache silo name. I wonder if that's why the wheel build is failing. I'll try updating the silo name in this branch.
JackCaoG approved these changes on Sep 15, 2023
Skipped commits that update bazel workspace or are incompatible with XRT:
I had to make substantial edits (i.e., not just renaming imports) to the following commits to make them build against our pins:
Last commit picked: ee72332