
Conversation

@zitongzhan
Contributor

Initial solution for making LieTensor work in jacrev

@zitongzhan zitongzhan requested a review from hxu296 July 16, 2023 23:27
@zitongzhan
Contributor Author

zitongzhan commented Jul 16, 2023

@hxu296 Could you please take a look at the PR and maybe help reduce the code redundancy in the context manager? So far this is only a workable solution, but it still has duplicate code segments. Perhaps we could write a general wrapper/decorator that works on any torch function of this type.
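
For illustration, a minimal sketch of what such a generic wrapper could look like; the names and the torch.stack usage below are hypothetical, not the PR's actual code:

import torch
from contextlib import contextmanager

@contextmanager
def swap_torch_fn(module, name, replacement):
    # Temporarily replace module.<name> with `replacement`, restoring the
    # original on exit even if the wrapped code raises.
    original = getattr(module, name)
    setattr(module, name, replacement)
    try:
        yield original
    finally:
        setattr(module, name, original)

# Hypothetical usage: install a LieTensor-aware torch.stack for one block of code.
# with swap_torch_fn(torch, 'stack', lietensor_aware_stack):
#     ...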

Member

@hxu296 hxu296 left a comment


Swapping the torch ops worked, nice! I added some comments to point out a few places that look odd to me, but overall this is a great step forward!

@hxu296
Member

hxu296 commented Jul 17, 2023

I will go ahead and address some of my own comments. After that, we can meet to discuss how to make it more general.

@hxu296 hxu296 requested a review from wang-chen July 17, 2023 03:56
@hxu296
Member

hxu296 commented Jul 17, 2023

wrappable_lt is potentially useful to solve problems like #258

@zitongzhan zitongzhan requested a review from hxu296 August 5, 2023 03:54
Member

@hxu296 hxu296 left a comment


Looks good! I modified the retain_ltype context manager to restore functions-to-wrap and added tests to assert functions are indeed restored. Next step is to make retain_ltype thread safe. The save, swap, and restore critical regions need to be protected with a semaphore, as we discussed.

Update: according to the PyTorch docs:

DistributedDataParallel uses multiprocessing where a process is created for each GPU, while DataParallel uses multithreading. There are significant caveats to using CUDA models with multiprocessing; unless care is taken to meet the data handling requirements exactly, it is likely that your program will have incorrect or undefined behavior. It is recommended to use DistributedDataParallel, instead of DataParallel to do multi-GPU training, even if there is only a single node.

Using the recommended DistributedDataParallel won't cause race conditions, as it leverages multiprocessing where each process has its own virtual memory. Thread safety is not an issue for DistributedDataParallel.
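
For reference, a minimal sketch (with illustrative names, not the PR's actual code) of how the save/swap/restore critical regions could be guarded with a lock; per the update above, this may be unnecessary when multi-GPU training goes through DistributedDataParallel:

import torch
import threading
from contextlib import contextmanager

_SWAP_LOCK = threading.Lock()                 # hypothetical module-level lock

def _wrapped_stack(tensors, dim=0, _orig=torch.stack):
    # Placeholder for a LieTensor-aware version; _orig is bound before any swap.
    return _orig(list(tensors), dim=dim)

@contextmanager
def retain_ltype_sketch():
    with _SWAP_LOCK:                          # protect the save + swap region
        saved = {'stack': torch.stack}
        torch.stack = _wrapped_stack
    try:
        yield
    finally:
        with _SWAP_LOCK:                      # protect the restore region
            for name, fn in saved.items():
                setattr(torch, name, fn)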

@@ -0,0 +1,58 @@
import torch
from typing import Callable, Union, Tuple, Optional
from pypose.lietensor.lietensor import retain_ltype
Member


Don't import pypose in pypose modules, only import pypose in examples and tests.

Member


suggest:

from .. import retain_ltype

from .lietensor import Sim3_type, sim3_type, RxSO3_type, rxso3_type
from .lietensor import tensor, translation, rotation, scale, matrix, euler
from .lietensor import mat2SO3, mat2SE3, mat2Sim3, mat2RxSO3, from_matrix, matrix, euler2SO3, vec2skew
from .func import *
Member


suggest

from .lietensor.lietensor import retain_ltype
from . import func

@@ -0,0 +1,58 @@
import torch
from typing import Callable, Union, Tuple, Optional
from pypose.lietensor.lietensor import retain_ltype
Member


suggest:

from .. import retain_ltype

Comment on lines 3 to 6
from contextlib import contextmanager
from .basics import vec2skew
import collections, numbers, warnings
import collections, numbers, warnings, importlib
from .operation import broadcast_inputs
Member


use pyramid style; refer to point 12 in the guideline
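
Assuming "pyramid style" here means ordering the import lines by increasing length (as the referenced guideline point appears to describe), the block above might be rearranged as:

from .basics import vec2skew
from contextlib import contextmanager
from .operation import broadcast_inputs
import collections, numbers, warnings, importlib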

}

with check_fn_equal(TO_BE_CHECKED):
with retain_ltype():
Member


suggest using pp.retain_ltype()
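
A hypothetical usage of the suggested public-API form (the exact test body in the PR may differ):

import torch
import pypose as pp

def residual(pose):
    # Some differentiable function of an SE3 LieTensor.
    return pose.Log().tensor()

pose = pp.randn_SE3(2)
with pp.retain_ltype():              # public-API form suggested above
    jac = torch.func.jacrev(residual)(pose)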
