refactor(extensions/nanoarrow_device): Migrate CUDA device implementation to use the driver API #488

Merged

paleolimbot merged 13 commits into apache:main from paleolimbot:device-cuda-driver-api

Jun 4, 2024

Member

paleolimbot commented May 24, 2024

Closes #246.

This PR doesn't change much about the existing implementation (and I think there are some things that need to be changed!); it just eliminates the dependency on the CUDA runtime library. The driver API is a better fit here anyway since we're doing very low-level things!

This isn't tested in CI yet (working on that here: #490 ).

paleolimbot added 8 commits

May 24, 2024 14:13


          start on convert to driver api

28993d0


          event things


          oof memcpy

07bdd37


          link + remove from tests

34476c6


          passing buffer tests

62781e3

dev

a243368


          switch order

b10b244


          reuse the buffer copier

0863b30

paleolimbot marked this pull request as ready for review

May 24, 2024 19:32


          remove header

d17aac3

zeroshade reviewed

Member

zeroshade left a comment

Overall this looks fine to me, just a bunch of nitpicks and questions.

extensions/nanoarrow_device/CMakeLists.txt (resolved)

extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device_cuda.c (resolved)

extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device_cuda.c

                     break;
                 }
-                cudaSetDevice(prev_device);

Member

zeroshade May 28, 2024

Are we always guaranteed that this has already been called or that we know we're using the correct device?

Member Author

paleolimbot May 29, 2024

Yes, device setting is now managed with the push/pop context (the notion of a "current device" is only available in the runtime API and I don't think we have another option).

extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device_cuda.c

Comment on lines -58 to -62

-                int prev_device = 0;
-                cudaError_t result = cudaGetDevice(&prev_device);
-                if (result != cudaSuccess) {
-                  return EINVAL;
-                }

Member

zeroshade May 28, 2024

Does the context itself manage the current device id by pushing and popping it?

Member Author

paleolimbot May 29, 2024

I believe so...it's created using the CUdevice and I stole this from Arrow C++ (the push/pop context always surrounds the cuMemAlloc() there). I also don't have a multi GPU system to test on 🙂
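The push/pop pairing discussed in this thread can be sketched roughly as below. So that it compiles without the CUDA toolkit, the driver calls sit behind a hypothetical `SketchDriver` table of function pointers; in a real build the three slots would forward to `cuCtxPushCurrent()`, `cuMemAlloc()`, and `cuCtxPopCurrent()`. Every `Sketch*` name and the fake driver are assumptions made for illustration, not code from this PR.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for CUcontext / CUdeviceptr so this builds
 * without cuda.h. */
typedef void* SketchContext;
typedef uintptr_t SketchDevicePtr;

/* Hypothetical table of driver entry points. Each returns 0 on success,
 * mirroring CUDA_SUCCESS. */
struct SketchDriver {
  int (*ctx_push)(SketchContext ctx);
  int (*ctx_pop)(SketchContext* popped);
  int (*mem_alloc)(SketchDevicePtr* out, size_t size_bytes);
};

/* Allocate with `ctx` current. Once the push has succeeded, the context is
 * popped on every path, including the failure path. */
static int SketchAllocate(const struct SketchDriver* drv, SketchContext ctx,
                          size_t size_bytes, SketchDevicePtr* out) {
  assert(drv != NULL && out != NULL);
  if (drv->ctx_push(ctx) != 0) {
    return EINVAL;
  }

  int err = drv->mem_alloc(out, size_bytes);

  SketchContext unused;
  drv->ctx_pop(&unused);
  return err == 0 ? 0 : EIO;
}

/* Fake driver for exercising the sketch without a GPU: it only tracks how
 * deep the context stack is. */
static int sketch_fake_depth = 0;

static int SketchFakePush(SketchContext ctx) {
  (void)ctx;
  sketch_fake_depth++;
  return 0;
}

static int SketchFakePop(SketchContext* popped) {
  *popped = NULL;
  sketch_fake_depth--;
  return 0;
}

static int SketchFakeAllocOk(SketchDevicePtr* out, size_t size_bytes) {
  (void)size_bytes;
  *out = 0x1000;
  return 0;
}

static int SketchFakeAllocFail(SketchDevicePtr* out, size_t size_bytes) {
  (void)out;
  (void)size_bytes;
  return 2; /* numeric value of CUDA_ERROR_OUT_OF_MEMORY */
}
```

With the fake driver, the stack depth returns to zero after both a successful and a failed allocation, which is the property the push/pop pairing is meant to guarantee. Whether this behaves as intended on a multi-GPU system is untested here, as noted above.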

extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device_cuda.c

Comment on lines +127 to +130

+                if (err != CUDA_SUCCESS) {
+                  cuCtxPopCurrent(&unused);
                   ArrowFree(allocator_private);
-                  cudaSetDevice(prev_device);
-                  return ENOMEM;
+                  return EIO;

Member

zeroshade May 28, 2024

Since this function only returns an error code and has no way to populate an error, is it okay that we're swallowing this error here? Should you leave a TODO comment so that we remember to improve this?
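One possible shape for the improvement suggested here, as a minimal sketch only: a helper that keeps the errno-style return value but also records which call failed. `SketchError` is a hypothetical stand-in for an error struct such as nanoarrow's `ArrowError`, and `SketchCheckDriverCall()` is an invented name. In real code the `status_name` argument could come from the driver API's `cuGetErrorName()`.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for an error struct such as nanoarrow's ArrowError. */
struct SketchError {
  char message[256];
};

/* Hypothetical helper: translate a driver status into an errno-style code
 * and record which call failed. `status` is the integer value of the driver
 * result (0 means success); `status_name` is the name cuGetErrorName() would
 * report, or NULL if it is unavailable. `error` may be NULL. */
static int SketchCheckDriverCall(const char* call, int status,
                                 const char* status_name,
                                 struct SketchError* error) {
  assert(call != NULL);

  if (status == 0) {
    /* Clear any stale message so callers never read an old failure. */
    if (error != NULL) {
      memset(error->message, 0, sizeof(error->message));
    }
    return 0;
  }

  if (error != NULL) {
    snprintf(error->message, sizeof(error->message), "%s failed: %s (%d)",
             call, status_name != NULL ? status_name : "unknown error",
             status);
  }
  return EIO;
}
```

The allocator callback in this PR cannot use such a helper directly because its signature has no error out-parameter, which is the limitation the comment above points at.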

extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device_cuda.c

Comment on lines 147 to 150

               struct ArrowDeviceCudaArrayPrivate {
                 struct ArrowArray parent;
-                cudaEvent_t sync_event;
+                CUevent cu_event;
               };

Member

zeroshade May 28, 2024

This will need to be exposed somehow so that a producer can get access to this in order to record it on a stream or otherwise manage and use the event so that a consumer can benefit.

If we're not exposing this event anywhere yet (since you're creating it privately and not accepting a user-provided event), then we should probably just leave it null for now and not bother creating and destroying an event until we are also exposing it.

Member Author

paleolimbot May 30, 2024

This can now be specified from ArrowDeviceArrayInit()! I have a feeling I will be needing it shortly in the async buffer copying.

extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device_cuda.c Outdated

Comment on lines +262 to +263

		// TODO: Synchronize device_src?
		memcpy((void*)dst.data.data, src.data.data, (size_t)src.size_bytes);

Member

zeroshade May 28, 2024

Synchronizing wouldn't be limited to the CPU/CUDA_HOST cases. If we need to synchronize, we'd need to synchronize for all cases.

But as I mentioned in the comment above, since we don't expose the event currently, you'd create a deadlock if you try to synchronize since nothing can mark the event as recorded and completed.

Also, since the event is at the top level of the ArrowDeviceArray, I'd say that if we are going to synchronize, we shouldn't synchronize at this level but rather above this on the call stack. And until we start using cuMemcpyAsync or other async calls, we don't need to bother attempting to manage synchronization yet. We can punt on that for now.

Member

zeroshade May 28, 2024

You could potentially use cuCtxSynchronize though...?

Member Author

paleolimbot May 29, 2024

That's a great point (synchronization of the source must have happened before this function is called). I was/am worried that cudaMemcpy() might have been flushing something from the device to the page-locked memory that a straight memcpy() wouldn't be doing. I'll look into cuCtxSynchronize() to see if that's doing what I think it is (or whether it should be called before any of this happens anyway).

Member Author

paleolimbot May 30, 2024

I looked into this and I am almost positive that calling cuCtxSynchronize() before memcpy is the right thing to do; however, it should also be done once before copying a bunch of buffers (as you suggested). We really just need to test this properly (which will happen naturally when we implement async buffer copying, since that should result in a not-fully-synchronized array with a non-null sync event to properly test on).
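The "synchronize once, then copy every buffer" ordering described above might look like the sketch below. It is untested against a real device, in line with the caveat in the comment. `SketchSyncFn` is a hypothetical callback standing in for `cuCtxSynchronize()`, and `SketchBufferView` and `SketchCopyBuffers()` are invented names; the fake sync function exists only so the ordering can be exercised without a GPU.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical view of one host-accessible (e.g. page-locked) buffer. */
struct SketchBufferView {
  const uint8_t* data;
  size_t size_bytes;
};

/* Hypothetical callback standing in for cuCtxSynchronize(); returns 0 on
 * success. */
typedef int (*SketchSyncFn)(void);

/* Synchronize a single time up front, then copy each buffer with a plain
 * memcpy(). Nothing is copied if the synchronization fails. */
static int SketchCopyBuffers(SketchSyncFn sync,
                             const struct SketchBufferView* src,
                             uint8_t* const* dst, size_t n_buffers) {
  assert(sync != NULL);
  if (sync() != 0) {
    return EIO;
  }

  for (size_t i = 0; i < n_buffers; i++) {
    if (src[i].size_bytes > 0) {
      memcpy(dst[i], src[i].data, src[i].size_bytes);
    }
  }
  return 0;
}

/* Fake synchronization that counts how often it is called. */
static int sketch_sync_calls = 0;

static int SketchFakeSync(void) {
  sketch_sync_calls++;
  return 0;
}
```

Hoisting the synchronization out of the per-buffer loop matches the earlier review point that it belongs above the single-buffer copy on the call stack.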

paleolimbot mentioned this pull request

docs(extensions/nanoarrow_device): Document device extension #497

Open

paleolimbot added 4 commits

May 30, 2024 13:54


          actually build device ext with debug

e86708e


          user-supplied sync event

8e50a1d


          fix clangd include

b1c92df


          consolidate comment about sync

a9b3435

Member Author

paleolimbot commented Jun 4, 2024

I'm going to merge this to get started on follow-up work, but feel free to give any comments on this diff and I'll incorporate them into the next PR!

paleolimbot merged commit 9410bd3 into apache:main

paleolimbot mentioned this pull request

chore(python): Add sync_event argument to Python .pxd header #506

Merged

paleolimbot added a commit that referenced this pull request


          chore(python): Add sync_event argument to Python .pxd header (#506)

8c5b3d9

#488 broke the Python package build (which I'd forgotten uses the device
extension).

paleolimbot deleted the device-cuda-driver-api branch

June 6, 2024 16:19
