Conversation
will-cromar
left a comment
Can you use xr.world_size instead? I don't think we need to rename the function in xla_model
Lines 148 to 152 in 5b8e8e0
xla/torch_xla/core/xla_model.py, lines 129 to 132 in 5b8e8e0:
_WORLD_SIZE will cause recompilation. @alanwaketan for insights.
You will know if there is a recompilation from the test.
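For instance, one way a test could check for recompilation (a hedged sketch; `run_workload` is a hypothetical stand-in, and this assumes the `torch_xla.debug.metrics` API):

```python
import torch_xla.debug.metrics as met

met.clear_all()
run_workload()  # hypothetical first run: expect at least one XLA compilation
# metric_data returns (num_samples, accumulator, samples); num_samples
# counts how many times the graph was compiled.
baseline = met.metric_data('CompileTime')[0]

run_workload()  # second run: the compile count should not grow
assert met.metric_data('CompileTime')[0] == baseline, 'recompilation detected'
```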
Hi @will-cromar, I tried functools.lru_cache and it crashes in multiprocess. I noticed that if we use functools.wraps and assign attributes to the wrapped function, it crashes; probably lru_cache relies on the function attributes in this case. I created the run_once decorator instead:
```python
_ORDINAL = runtime.global_ordinal()


def run_once(func):
```
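A minimal sketch of such a run_once decorator, assuming the result is cached in closure variables rather than as function attributes (not necessarily the PR's exact implementation):

```python
def run_once(func):
  """Run `func` at most once per process and cache its result.

  The cache lives in closure variables instead of attributes on the
  wrapped function, sidestepping the multiprocess crash described above.
  """
  result = None
  called = False

  def wrapper(*args, **kwargs):
    nonlocal result, called
    if not called:
      result = func(*args, **kwargs)
      called = True
    return result

  return wrapper
```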
What is this run_once for? It's a neat idea, but I see you opted for global variables for world size and ordinal.
Without @run_once on using_pjrt, the test test_mp_replication fails because it builds a dynamic graph during dynamo compilation.
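For reference, a rough sketch of the module-level caching pattern used for world size and ordinal (the guard logic here is an assumption, not the PR's exact code):

```python
import torch_xla.runtime as runtime

_WORLD_SIZE = None
_ORDINAL = None


def _init_world_size_ordinal():
  # Query the runtime once per process; afterwards dynamo traces see
  # plain Python ints instead of runtime calls.
  global _WORLD_SIZE, _ORDINAL
  if _WORLD_SIZE is None:
    _WORLD_SIZE = runtime.world_size()
    _ORDINAL = runtime.global_ordinal()
```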
Do you know why that is? Is it because now all calls to get e.g. world_size go through functions wrapped in requires_pjrt, which in turn is actually checking an env var (device_type and _maybe_select_default_device)? Whereas before, the call would have been stopped by xm.world_size.
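A self-contained sketch of the call chain being described (the helper bodies are illustrative assumptions; the real ones live in torch_xla.runtime):

```python
import functools
import os


def _maybe_select_default_device():
  # Illustrative stand-in: pick a default if PJRT_DEVICE is unset.
  os.environ.setdefault('PJRT_DEVICE', 'CPU')


def device_type():
  # Re-reads the environment on every call unless cached upstream.
  _maybe_select_default_device()
  pjrt_device = os.environ.get('PJRT_DEVICE')
  return pjrt_device.split('_')[0] if pjrt_device else None


def using_pjrt() -> bool:
  return device_type() is not None


def requires_pjrt(fn):
  # Every wrapped call re-runs the env check unless using_pjrt is cached.
  @functools.wraps(fn)
  def wrapper(*args, **kwargs):
    if not using_pjrt():
      raise NotImplementedError(f'`{fn.__name__}` not implemented for XRT')
    return fn(*args, **kwargs)

  return wrapper
```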
```python
@run_once
def using_pjrt() -> bool:
```
Hah, this function also needs to be deprecated, since I assume this is always True.
We still need to call _maybe_select_default_device(); that is why I call using_pjrt() only once.
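Putting this together with the run_once and helper sketches above (again an illustration, not the exact source):

```python
@run_once
def using_pjrt() -> bool:
  # _maybe_select_default_device() still runs once per process to
  # populate PJRT_DEVICE; subsequent calls return the cached bool
  # without touching the environment again.
  _maybe_select_default_device()
  return device_type() is not None
```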
will-cromar
left a comment
General question: if run_once makes dynamo happy for using_pjrt, do you know why it doesn't work for world_size and global_ordinal?
Take unittest
The C API binding
will-cromar
left a comment
Thanks for the explanation! Filed a follow-up bug to clean up using_pjrt and requires_pjrt and fix a concrete usability issue at #7730.
The TPU CI failure seems relevant; can we fix forward or revert this PR?
I forgot to update the TPU CI test. Let me make a follow-up PR now.
```diff
 new_rank = xm.get_ordinal()
-world_size = xm.xrt_world_size()
+world_size = xr.world_size()
```
@zpcore please add `import torch_xla.runtime as xr` to section 1. It feels like this line comes out of left field without the import in the documentation.
xm.world_size() is also deprecated. They all point to the same thing. We should only use xr.world_size().
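With the import in place, the fully migrated snippet from the docs would then read along these lines (xr.global_ordinal is the runtime counterpart of the deprecated xm.get_ordinal):

```python
import torch_xla.runtime as xr

new_rank = xr.global_ordinal()
world_size = xr.world_size()
```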
Deprecate `torch_xla.xla_model.xrt_world_size` and use `torch_xla.runtime.world_size` instead.

Add the `run_once` decorator to the function `runtime.using_pjrt`, since we only need to run it once per process. This helps get rid of the dynamo compilation issue with `xm.all_reduce`.
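A minimal sketch of what the deprecation alias in xla_model might look like (an assumption; the actual PR may route through a shared deprecation helper instead):

```python
import warnings

import torch_xla.runtime as xr


def xrt_world_size() -> int:
  # Deprecated alias kept for backward compatibility; forwards to the
  # runtime module and warns callers to migrate.
  warnings.warn(
      'xrt_world_size is deprecated. Use torch_xla.runtime.world_size instead.',
      DeprecationWarning,
      stacklevel=2)
  return xr.world_size()
```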