refactor(extensions/nanoarrow_device): Migrate CUDA device implementation to use the driver API#488
Conversation
zeroshade
left a comment
There was a problem hiding this comment.
overall this looks fine to me, just a bunch of nitpicks and questions.
| break; | ||
| } | ||
|
|
||
| cudaSetDevice(prev_device); |
There was a problem hiding this comment.
Are we always guaranteed that this has already been called or that we know we're using the correct device?
There was a problem hiding this comment.
Yes, device setting is now managed with the push/pop context (the notion of a "current device" is only available in the runtime API and I don't think we have another option).
| int prev_device = 0; | ||
| cudaError_t result = cudaGetDevice(&prev_device); | ||
| if (result != cudaSuccess) { | ||
| return EINVAL; | ||
| } |
There was a problem hiding this comment.
Does the context itself manage the current device id by pushing and popping it?
There was a problem hiding this comment.
I believe so...it's created using the CUdevice and I stole this from Arrow C++ (the push/pop context always surrounds the cuMemAlloc() there). I also don't have a multi GPU system to test on 🙂
| if (err != CUDA_SUCCESS) { | ||
| cuCtxPopCurrent(&unused); | ||
| ArrowFree(allocator_private); | ||
| cudaSetDevice(prev_device); | ||
| return ENOMEM; | ||
| return EIO; |
There was a problem hiding this comment.
since this function only returns an error code, but doesn't allow populating an error, is it okay that we're swallowing this error here? Should you leave a TODO comment so that we can remember to improve this?
| struct ArrowDeviceCudaArrayPrivate { | ||
| struct ArrowArray parent; | ||
| cudaEvent_t sync_event; | ||
| CUevent cu_event; | ||
| }; |
There was a problem hiding this comment.
This will need to be exposed somehow so that a producer can get access to this in order to record it on a stream or otherwise manage and use the event so that a consumer can benefit.
If we're not exposing this event anywhere yet (since you're creating it privately and not accepting a user provided event) then we should probably just leave it null for now and not bother trying to create and destroy an event until we are also exposing it
There was a problem hiding this comment.
This can now be specified from ArrowDeviceArrayInit()! I have a feeling I will be needing it shortly in the async buffer copying.
| // TODO: Synchronize device_src? | ||
| memcpy((void*)dst.data.data, src.data.data, (size_t)src.size_bytes); |
There was a problem hiding this comment.
synchronizing wouldn't be limited to the CPU/CUDA_HOST cases. if we need to synchronize, we'd need to synchronize for all cases.
But as I mentioned in the comment above, since we don't expose the event currently, you'd create a deadlock if you try to synchronize since nothing can mark the event as recorded and completed.
Also since the event is at the top level of the ArrowDeviceArray i'd say that if we are going to synchronize, we shouldn't synchronize at this level but rather above this on the call stack. And until we start using the cuMemcpyAsync or other Async calls, we don't need to bother attempting to manage synchronization yet. we can punt on that for now
There was a problem hiding this comment.
You could potentially use cuCtxSynchronize though...?
There was a problem hiding this comment.
That's a great point (synchronization of the source must have happened before this function is called). I was/am worried that cudaMemcpy() might have been flushing something from the device to the page-locked memory that a straight memcpy() wouldn't be doing. I'll look into cuCtxSynchronize() to see if that's doing what I think it is (or whether it should be called before any of this happens anyway).
There was a problem hiding this comment.
I looked into this and I am almost positive that calling cuCtxSynchronize() before memcpy is the right thing to do; however, it should also be done once before copying a bunch of buffers (as you suggested). We really just need to test this properly (which will happen naturally when we implement async buffer copying, since that should result in an not-fully-synchronized array with a non-null sync event to properly test on).
|
I'm going to merge this to get started on follow-up work, but feel free to give any comments on this diff and I'll incorporate them into the next PR! |
#488 broke the Python package build (which I'd forgotten uses the device extension).
Closes #246.
This PR doesn't change much about the existing implementation (and I think there are some things that need to be changed!), it just eliminates the dependency on the runtime library. The driver API is a better fit here anyway since we're doing very low-level things!
This isn't tested in CI yet (working on that here: #490 ).