Fix preserve_rng_state for activation checkpointing #4690
JackCaoG merged 6 commits into pytorch:master from
Conversation
Thanks! Mostly LGTM. Can you add a test case, maybe to https://github.com/pytorch/xla/blob/master/test/test_operations.py? You can compare the results on the xla device and the cpu device; this way we won't regress this.
    torch.cuda.amp.autocast(**ctx.gpu_autocast_kwargs), \
    torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):
      outputs = ctx.run_function(*detached_inputs)
    with xm.fork_rng():
Any reason not to pass the rng_devices and ctx.preserve_rng_state?
It looks like the upstream code doesn't reset the state. @YangFei1990 Do you know why?
I guess the upstream seed is handled by torch.random.fork_rng? Though I am not sure why it doesn't work with pytorch/xla...
Yeah, the upstream seed is handled by torch.random.fork_rng. It will fork the torch seed, but somehow it won't set XLA's RNG. This seed, torch_xla._XLAC._xla_get_rng_seed(str(device)), is it independent of the torch seed? How does torch XLA handle RNGs in general?
I did not change the previous behavior, i.e. the upstream seed is still maintained as it was (check the code below). I simply added another RNG state to preserve.
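For illustration only, here is a minimal sketch (not the PR's actual xm.fork_rng implementation) of how an XLA-aware RNG fork could save and restore the device seed around the recomputation, assuming xm.get_rng_state()/xm.set_rng_state() wrap the _xla_get_rng_seed/_xla_set_rng_seed bindings referenced above:

    import contextlib

    import torch
    import torch_xla.core.xla_model as xm


    @contextlib.contextmanager
    def fork_rng(device=None, enabled=True):
        # Sketch only: preserve both the CPU and the XLA RNG state so that the
        # checkpointed recomputation sees the same random numbers as the
        # original forward pass.
        if not enabled:
            yield
            return
        device = device if device is not None else xm.xla_device()
        cpu_state = torch.get_rng_state()            # CPU generator state
        xla_state = xm.get_rng_state(device=device)  # XLA device seed (assumed API)
        try:
            yield
        finally:
            # Restore both states after the forked region.
            torch.set_rng_state(cpu_state)
            xm.set_rng_state(xla_state, device=device)
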
    output = torch.sum(output)
    output.backward()
    xm.mark_step()
    same_output = torch.allclose(model.to_save[0], model.to_save[1])
to_save is the container that holds the output tensor. With activation checkpointing the forward will run twice, so this container captures both tensors. Check line 2352.
    same_output = torch.allclose(model.to_save[0], model.to_save[1])
    if not same_output:
      print(f"in fwd {model.to_save[0]}, in bwd {model.to_save[1]}")
    self.assertTrue(same_output)
I think you can do something similar to
self.assertTrue(same_output, f"in fwd {model.to_save[0]}, in bwd {model.to_save[1]}")
Awesome, didn't know I could do that. Updating.
alanwaketan left a comment
Mostly, LGTM. Please address the comments.
I will take care of the backport
In the activation checkpointing implementation we have the preserve_rng_state option; if it is set to True, activation checkpointing should use the same RNG state for the two forward runs in a single step. Consider a test script with activation checkpointing and a dropout op in the model (a sketch of such a script appears below). If everything works right, same_output should be True. However, we observed that without XLA it works correctly, but with XLA it is wrong.
This PR fixes the issue by also saving/loading XLA's RNG state in the activation checkpointing implementation. After the fix, the output matches between the two forward runs.
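For reference, here is a minimal, hedged sketch of the kind of repro script the description refers to (the exact script is not reproduced here; the module and variable names are illustrative):

    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint
    import torch_xla.core.xla_model as xm


    class Net(nn.Module):

        def __init__(self):
            super().__init__()
            self.linear = nn.Linear(16, 16)
            self.dropout = nn.Dropout(0.5)
            self.to_save = []  # records the dropout output of each forward run

        def _block(self, x):
            out = self.dropout(self.linear(x))
            # With checkpointing this runs twice: once in the forward pass and
            # once during recomputation in backward, so both outputs get recorded.
            self.to_save.append(out)
            return out

        def forward(self, x):
            return checkpoint(self._block, x, preserve_rng_state=True)


    device = xm.xla_device()
    model = Net().to(device)
    # requires_grad on the input so the (reentrant) checkpoint participates in autograd.
    x = torch.randn(4, 16, device=device, requires_grad=True)

    output = torch.sum(model(x))
    output.backward()
    xm.mark_step()

    # If the RNG state is preserved across the two forward runs, the dropout
    # masks match; before this fix same_output came out False on the XLA device.
    same_output = torch.allclose(model.to_save[0], model.to_save[1])
    print('same_output =', same_output)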