
refactor(extensions/nanoarrow_device): Migrate CUDA device implementation to use the driver API #488

Merged
paleolimbot merged 13 commits into apache:main from paleolimbot:device-cuda-driver-api
Jun 4, 2024

Conversation

@paleolimbot (Member) commented May 24, 2024

Closes #246.

This PR doesn't change much about the existing implementation (and I think there are some things that need to be changed!); it just eliminates the dependency on the CUDA runtime library. The driver API is a better fit here anyway, since we're doing very low-level things!

This isn't tested in CI yet (working on that here: #490 ).

@paleolimbot paleolimbot marked this pull request as ready for review May 24, 2024 19:32
@zeroshade (Member) left a comment

overall this looks fine to me, just a bunch of nitpicks and questions.

break;
}

cudaSetDevice(prev_device);
@zeroshade (Member):

Are we always guaranteed that this has already been called or that we know we're using the correct device?

@paleolimbot (Member Author):

Yes, device setting is now managed with the push/pop context (the notion of a "current device" is only available in the runtime API and I don't think we have another option).
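The push/pop context idiom being described can be sketched roughly as follows. This is a minimal illustration of the driver-API pattern under the assumption that `ctx` is a `CUcontext` already created for the target `CUdevice`, not the PR's exact code:

```c
// Scope work to a device via the driver API's context stack, instead of the
// runtime API's cudaGetDevice()/cudaSetDevice() pair.
CUdeviceptr ptr = 0;
CUresult err = cuCtxPushCurrent(ctx);  // make ctx current on this thread
if (err != CUDA_SUCCESS) {
  return EINVAL;
}

err = cuMemAlloc(&ptr, size_bytes);  // runs against the pushed context

CUcontext unused;
cuCtxPopCurrent(&unused);  // restore whatever context was current before
if (err != CUDA_SUCCESS) {
  return ENOMEM;
}
```

Because the previous context is restored on pop, there is no need to query and re-set a "current device" around each allocation the way the runtime API required.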

Comment on lines -58 to -62
int prev_device = 0;
cudaError_t result = cudaGetDevice(&prev_device);
if (result != cudaSuccess) {
return EINVAL;
}
@zeroshade (Member):

Does the context itself manage the current device id by pushing and popping it?

@paleolimbot (Member Author):

I believe so... it's created using the CUdevice, and I stole this from Arrow C++ (the push/pop context always surrounds the cuMemAlloc() there). I also don't have a multi-GPU system to test on 🙂

Comment on lines +127 to +130
if (err != CUDA_SUCCESS) {
  cuCtxPopCurrent(&unused);
  ArrowFree(allocator_private);
-  cudaSetDevice(prev_device);
-  return ENOMEM;
+  return EIO;
@zeroshade (Member):

since this function only returns an error code, but doesn't allow populating an error, is it okay that we're swallowing this error here? Should you leave a TODO comment so that we can remember to improve this?

Comment on lines 147 to 150
struct ArrowDeviceCudaArrayPrivate {
  struct ArrowArray parent;
-  cudaEvent_t sync_event;
+  CUevent cu_event;
};
@zeroshade (Member):

This will need to be exposed somehow so that a producer can get access to this in order to record it on a stream or otherwise manage and use the event so that a consumer can benefit.

If we're not exposing this event anywhere yet (since you're creating it privately and not accepting a user provided event) then we should probably just leave it null for now and not bother trying to create and destroy an event until we are also exposing it

@paleolimbot (Member Author):

This can now be specified from ArrowDeviceArrayInit()! I have a feeling I will be needing it shortly in the async buffer copying.
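A producer handing its own driver event to the device array might look something like the sketch below. The exact parameter list of `ArrowDeviceArrayInit()` shown here is an assumption for illustration; consult the nanoarrow_device header for the actual signature:

```c
// Hypothetical sketch: create a driver-API event and pass it to the device
// array so that a consumer can later synchronize on it.
CUevent event;
CUresult err = cuEventCreate(&event, CU_EVENT_DEFAULT);
if (err != CUDA_SUCCESS) {
  return EINVAL;
}

// `device` and `array` are assumed to be an initialized ArrowDevice and
// ArrowArray; the event is recorded on a stream by the producer elsewhere.
struct ArrowDeviceArray device_array;
NANOARROW_RETURN_NOT_OK(
    ArrowDeviceArrayInit(device, &device_array, &array, (void*)event));
```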

Comment on lines +262 to +263
// TODO: Synchronize device_src?
memcpy((void*)dst.data.data, src.data.data, (size_t)src.size_bytes);
@zeroshade (Member):

synchronizing wouldn't be limited to the CPU/CUDA_HOST cases. if we need to synchronize, we'd need to synchronize for all cases.

But as I mentioned in the comment above, since we don't expose the event currently, you'd create a deadlock if you try to synchronize since nothing can mark the event as recorded and completed.

Also, since the event is at the top level of the ArrowDeviceArray, I'd say that if we are going to synchronize, we shouldn't do it at this level but rather above this on the call stack. And until we start using cuMemcpyAsync or other async calls, we don't need to bother managing synchronization yet; we can punt on that for now.

@zeroshade (Member):

You could potentially use cuCtxSynchronize though...?

@paleolimbot (Member Author):

That's a great point (synchronization of the source must have happened before this function is called). I was/am worried that cudaMemcpy() might have been flushing something from the device to the page-locked memory that a straight memcpy() wouldn't be doing. I'll look into cuCtxSynchronize() to see if that's doing what I think it is (or whether it should be called before any of this happens anyway).

@paleolimbot (Member Author):

I looked into this and I am almost positive that calling cuCtxSynchronize() before memcpy is the right thing to do; however, it should also be done once before copying a bunch of buffers (as you suggested). We really just need to test this properly (which will happen naturally when we implement async buffer copying, since that should result in a not-fully-synchronized array with a non-null sync event to properly test on).
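The "synchronize once, then copy many buffers" idea can be sketched like this. It is an illustration only; the buffer array names are hypothetical, assuming page-locked (CUDA_HOST) memory that a plain memcpy() can read once device work has drained:

```c
// Wait once for all preceding work in the current context, then copy a
// batch of page-locked buffers with plain memcpy() instead of calling
// cuCtxSynchronize() per buffer.
CUresult err = cuCtxSynchronize();
if (err != CUDA_SUCCESS) {
  return EIO;
}

for (int64_t i = 0; i < n_buffers; i++) {
  memcpy((void*)dst[i].data, src[i].data, (size_t)src[i].size_bytes);
}
```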

@paleolimbot (Member Author):

I'm going to merge this to get started on follow-up work, but feel free to give any comments on this diff and I'll incorporate them into the next PR!

@paleolimbot paleolimbot merged commit 9410bd3 into apache:main Jun 4, 2024
paleolimbot added a commit that referenced this pull request Jun 4, 2024
#488 broke the Python package build (which I'd forgotten uses the device
extension).
@paleolimbot paleolimbot deleted the device-cuda-driver-api branch June 6, 2024 16:19


Successfully merging this pull request may close these issues.

Device extension should use the CUDA driver library (not the CUDA runtime library)

2 participants