[Distributed] add nccl primitives #280
Conversation
yaoyaoding left a comment
Hi @soodoshll, thanks!
I left some comments on how to organize the NCCL-related source code. I prefer putting it under the `hidet.cuda.nccl` submodule. This would keep our code structure cleaner when we add other vendor libraries like cuBLAS and cuDNN.
python/hidet/ffi/ffi.py
Outdated
```python
def load_nccl_library():
    global _LIB_NCCL
    library_dirs = get_nccl_library_search_dirs()
    for library_dir in library_dirs:
        lib_nccl_paths = glob.glob(os.path.join(library_dir, 'libnccl.so*'))
        if len(lib_nccl_paths) == 0:
            continue
        _LIB_NCCL = ctypes.cdll.LoadLibrary(lib_nccl_paths[0])
        library_paths['nccl'] = lib_nccl_paths[0]
        break
    if _LIB_NCCL is None:
        raise OSError('Cannot find the NCCL library in the following directories:\n' + '\n'.join(library_dirs))
```
Let's put this part in `hidet/cuda/nccl/ffi.py`, and leave `hidet/ffi/ffi.py` to contain only the hidet runtime library.
In the future, when we want to add another library (e.g., cuDNN), we can put it under `hidet/cuda/cudnn` with its own `ffi.py`.
python/hidet/ffi/runtime_api.py
Outdated
```python
class NcclUniqueId(Structure):
    """
    Defined in nccl.h
    """
    _fields_ = [("internal", c_byte * 128)]


class NcclCommunicator:
    def __init__(self, handle: int):
        """
        Users should not call this constructor directly, because there are two ways of creating
        a new communicator: 1) using unique_id and rank; 2) using split.
        """
        if not nccl_available():
            raise RuntimeError("NCCL Library not found.")
        self._handle = handle

        # TODO: how to ensure the following two are identical?
        _comms.append(self)
        runtime_api.add_nccl_comm(self)

    def __del__(self):
        # Should we manage the lifetime of communicator objects in Python or C++?
        nccl_runtime_api.comm_destroy(self)

    def split(self):
        raise NotImplementedError()
```
Consider moving this to "hidet/cuda/nccl.py".
@yaoyaoding I have some questions related to code organization:
- How should we manage the lifetime of `NcclCommunicator` objects in Python and `ncclComm_t` objects in C++? There are two choices:
  1. In Python, as it is now: the resource is released when the Python object is released. This means that if we need a Python API `get_nccl_comm(idx)`, we have to maintain a global list `_comms` in Python holding all communicators, and make sure it stays aligned with the `ncclComm_t` list in C++. However, it's likely that we don't need this API in Python; `NcclCommunicator.get_id()` is more important.
  2. In C++: we don't manage the lifetime of `ncclComm_t` at all, and only release them on exit. `NcclCommunicator` would then only be a wrapper around `ncclComm_t`.

  IMO, we should follow (1) and abandon `get_nccl_comm(idx)` in Python. Then no global list `_comms` is needed; if users think it's necessary, they can create one themselves.
- Should we import `nccl.py` in `nccl/ffi.py` or the opposite? `NcclCommunicator` might depend on some runtime API, so we would need `nccl/ffi.py` in `nccl.py`, which might cause a circular import.
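For reference, option (1) — Python-managed lifetime — can be sketched as a self-contained toy. The FFI destroy call is replaced by a recording stub so the behavior is observable; all names here are illustrative, not the real hidet/NCCL API:

```python
import gc

destroyed = []

def comm_destroy(handle: int) -> None:
    # Stand-in for nccl_runtime_api.comm_destroy: just records the call.
    destroyed.append(handle)

class NcclCommunicator:
    def __init__(self, handle: int):
        self._handle = handle

    def __del__(self):
        # Python-managed lifetime: the underlying ncclComm_t is released
        # when the Python wrapper is garbage-collected.
        comm_destroy(self._handle)

comm = NcclCommunicator(42)
del comm      # drop the last reference
gc.collect()  # make collection deterministic on non-refcounting runtimes
print(destroyed)  # → [42]
```

With this scheme no global `_comms` list is needed; whoever holds the wrapper decides when the communicator is destroyed.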
Good questions!
> How should we manage the lifetime of `NcclCommunicator` objects in Python and `ncclComm_t` objects in C++?

Because different flow graphs may have different sets of communicators, we should create the communicators per flow graph. Let's keep the communicators in `CompiledGraph`. Something like:
```python
class CompiledGraph:
    def __init__(...):
        ...
        self.nccl_comms = ...  # create the communicators;
        # let's store it as hidet.ffi.utils.Array(void*, num of comm)
        # and create a runtime api called
        # "void set_nccl_comms(int num_comm, void** comm_array)"

    def run_async(...):
        ...
        runtime_api.set_nccl_comms(len(self.nccl_comms), self.nccl_comms)
        ...

    def __del__(...):
        # destroy communicators
        ...
```

For `FlowGraph.forward(...)`, we can raise an error for now if we find it is a distributed flow graph.
> Should we import `nccl.py` in `nccl/ffi.py` or the opposite? `NcclCommunicator` might depend on some runtime API, so we would need `nccl/ffi.py` in `nccl.py`, which might cause a circular import.
Consider: let the NCCL runtime API return an integer (the pointer to the communicator), and have `hidet/cuda/nccl/nccl.py` import `hidet/cuda/nccl/ffi.py`. The NCCL FFI should only be used by `nccl.py`, and we expose the API to users in `nccl.py`.
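That one-way dependency could look like the following stand-alone sketch. The ctypes calls are stubbed out and the handle value is fabricated; the point is only the layering — `ffi.py` traffics in plain integer handles, while `nccl.py` imports it and owns the wrapper class:

```python
# --- hidet/cuda/nccl/ffi.py (sketch) ---
# The FFI layer only deals in plain integers (raw ncclComm_t pointers),
# so it never needs to import the NcclCommunicator class.
class NcclRuntimeAPI:
    @staticmethod
    def comm_init_rank(ndev: int, unique_id: bytes, rank: int) -> int:
        # The real code calls ncclCommInitRank via ctypes and returns the
        # ncclComm_t pointer as an int; stubbed with a fake handle here.
        return 0xDEAD0000 + rank

    @staticmethod
    def comm_destroy(handle: int) -> None:
        pass  # real code: ncclCommDestroy on the handle

# --- hidet/cuda/nccl/nccl.py (sketch) ---
# Imports ffi.py (one-way) and wraps the integer handle for users.
class NcclCommunicator:
    def __init__(self, handle: int):
        self._handle = handle

    @classmethod
    def init_rank(cls, ndev: int, unique_id: bytes, rank: int) -> 'NcclCommunicator':
        return cls(NcclRuntimeAPI.comm_init_rank(ndev, unique_id, rank))

    def __del__(self):
        NcclRuntimeAPI.comm_destroy(self._handle)

comm = NcclCommunicator.init_rank(ndev=2, unique_id=b'\x00' * 128, rank=1)
print(hex(comm._handle))  # → 0xdead0001
```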
Thanks for your reply! I agree with the idea of attaching communicators to compiled graphs.
Maybe we can leave the modification of `CompiledGraph` for another PR (which might include upper-level structures like ops and graphs)? Let's focus this PR on primitives, otherwise it will be huge and hard to test.
python/hidet/ffi/runtime_api.py
Outdated
```python
if nccl_available():
    class NCCLRuntimeAPI:
        """
        Runtime APIs regarding NCCL
        TODO: Exception handling
        """
        _get_version = get_func('ncclGetVersion', [c_void_p], c_int)
        _get_unique_id = get_func('ncclGetUniqueId', [c_void_p], c_int)
        _comm_init_rank = get_func('ncclCommInitRank', [c_void_p, c_int, NcclUniqueId, c_int], c_int)
        _comm_destroy = get_func('ncclCommDestroy', [c_void_p], c_int)

        _comm_user_rank = get_func('ncclCommUserRank', [c_void_p, POINTER(c_int)], c_int)
        _comm_count = get_func('ncclCommCount', [c_void_p, POINTER(c_int)], c_int)

        @staticmethod
        def get_version() -> int:
            version = c_int(0)
            NCCLRuntimeAPI._get_version(pointer(version))
            return version.value

        @staticmethod
        def get_unique_id(comm_id: NcclUniqueId) -> None:
            """
            In-place initialization of the NcclUniqueId object
            """
            ret = NCCLRuntimeAPI._get_unique_id(pointer(comm_id))
            assert ret == 0, ret

        @staticmethod
        def comm_init_rank(ndev: int, comm_id: NcclUniqueId, rank: int) -> int:
            comm = c_void_p()
            ret = NCCLRuntimeAPI._comm_init_rank(pointer(comm), ndev, comm_id, rank)
            assert ret == 0, ret
            return comm.value

        @staticmethod
        def comm_destroy(comm: NcclCommunicator) -> None:
            ret = NCCLRuntimeAPI._comm_destroy(comm._handle)
            assert ret == 0

    nccl_runtime_api = NCCLRuntimeAPI()
    _comms: List[NcclCommunicator] = []

    def get_nccl_comm(comm_id: int):
        return _comms[comm_id]
```
Also move this to `hidet.cuda.nccl`.
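One detail worth asserting in this binding: `NcclUniqueId` must stay exactly 128 bytes, matching `NCCL_UNIQUE_ID_BYTES` in nccl.h, otherwise `ncclGetUniqueId` would write past the buffer. A self-contained check:

```python
from ctypes import Structure, c_byte, sizeof

class NcclUniqueId(Structure):
    # Mirrors ncclUniqueId from nccl.h: an opaque 128-byte blob
    # (NCCL_UNIQUE_ID_BYTES == 128).
    _fields_ = [("internal", c_byte * 128)]

# ncclGetUniqueId fills this struct in place, so the size must match the
# header exactly; checking it at import time catches drift early.
print(sizeof(NcclUniqueId))  # → 128
```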
python/hidet/libinfo.py
Outdated
```python
def _get_nccl_dirs():
    import site
    return [os.path.join(root, 'nvidia', 'nccl') for root in site.getsitepackages()]


def get_nccl_include_dirs():
    return [os.path.join(root, 'include') for root in _get_nccl_dirs()]


def get_nccl_library_search_dirs():
    return [os.path.join(root, 'lib') for root in _get_nccl_dirs()]
```
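For context, the `nvidia-nccl-cu11`/`nvidia-nccl-cu12` wheels unpack their headers and `libnccl.so` under `nvidia/nccl/` inside site-packages, which is what these helpers search. A runnable sketch of the same path construction, with `site.getsitepackages()` replaced by a fixed root so it runs anywhere:

```python
import os

def nccl_search_dirs(site_packages_roots):
    # Same logic as the helpers above: each site-packages root may contain
    # nvidia/nccl/{include,lib} installed by the nvidia-nccl-cu* wheels.
    nccl_roots = [os.path.join(root, 'nvidia', 'nccl') for root in site_packages_roots]
    return {
        'include': [os.path.join(r, 'include') for r in nccl_roots],
        'lib': [os.path.join(r, 'lib') for r in nccl_roots],
    }

dirs = nccl_search_dirs(['/usr/lib/python3/site-packages'])
print(dirs['lib'][0])  # → /usr/lib/python3/site-packages/nvidia/nccl/lib
```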
yaoyaoding left a comment
Thanks @soodoshll !
Overall LGTM. Left some minor suggestions. We can merge this first and then work on other primitives and graph-level operators.
include/hidet/runtime/cuda/context.h
Outdated
```cpp
/**
 * Add a NCCL communicator to the context.
 */
DLL void set_nccl_comms(void** comm, int num_comms);
```
```diff
-DLL void set_nccl_comms(void** comm, int num_comms);
+DLL void set_nccl_comms(int num_comms, void** comm);
```
Let's keep the argument order consistent with other APIs in hidet.
```python
from hidet.cuda.nccl import NcclDataType, NcclRedOp


def all_reduce(comm_id: int, sendbuff: Expr, recvbuff: Expr, count: Expr, dtype: NcclDataType, op: NcclRedOp):
```
```diff
-def all_reduce(comm_id: int, sendbuff: Expr, recvbuff: Expr, count: Expr, dtype: NcclDataType, op: NcclRedOp):
+def all_reduce(comm_id: int, sendbuff: Expr, recvbuff: Expr, count: Expr, dtype: DataType, op: NcclRedOp):
```
Let's pass `hidet.ir.type.DataType` to this primitive and convert it into `NcclDataType` inside the primitive function.
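A sketch of what that conversion could look like. The enum values follow `ncclDataType_t` in nccl.h; the helper name and the string-keyed lookup are illustrative, not the actual hidet API:

```python
from enum import IntEnum

class NcclDataType(IntEnum):
    # Values follow ncclDataType_t in nccl.h
    int8 = 0
    uint8 = 1
    int32 = 2
    uint32 = 3
    int64 = 4
    uint64 = 5
    float16 = 6
    float32 = 7
    float64 = 8
    bfloat16 = 9

# Hypothetical helper: map a hidet DataType name to the NCCL enum, so the
# all_reduce primitive can accept hidet.ir.type.DataType directly.
_NAME_TO_NCCL = {
    'int8': NcclDataType.int8,
    'uint8': NcclDataType.uint8,
    'int32': NcclDataType.int32,
    'int64': NcclDataType.int64,
    'float16': NcclDataType.float16,
    'bfloat16': NcclDataType.bfloat16,
    'float32': NcclDataType.float32,
    'float64': NcclDataType.float64,
}

def dtype_to_nccl(dtype_name: str) -> NcclDataType:
    # Not every hidet dtype has an NCCL counterpart, so fail loudly.
    if dtype_name not in _NAME_TO_NCCL:
        raise ValueError(f'dtype {dtype_name} is not supported by NCCL')
    return _NAME_TO_NCCL[dtype_name]

print(int(dtype_to_nccl('float32')))  # → 7
```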
```cpp
DLL void* get_nccl_comm(int idx) {
    return CudaContext::global()->nccl_comms[idx];
}
```
Better to add some check logic here to make sure idx < num_comms.
@yaoyaoding fixed. Please take a look :)
What we have now:
- NCCL provided through the pip packages `nvidia-nccl-cu11` or `nvidia-nccl-cu12`
- a distributed example (`./examples/distributed/test.py`)