Skip to content

[cuBLAS] Add cublas_gemm_batched and use cublasSetStream to set stream to the current stream in all cublas API calls#423

Merged
yaoyaoding merged 4 commits intomainfrom
cublas_gemm_batched
Feb 12, 2024
Merged

[cuBLAS] Add cublas_gemm_batched and use cublasSetStream to set stream to the current stream in all cublas API calls#423
yaoyaoding merged 4 commits intomainfrom
cublas_gemm_batched

Conversation

@yudi0201
Copy link
Copy Markdown
Contributor

No description provided.

@yudi0201 yudi0201 force-pushed the cublas_gemm_batched branch from e9d69bb to 4a0007f Compare January 31, 2024 18:54
@yudi0201 yudi0201 force-pushed the cublas_gemm_batched branch from 4a0007f to 3d58fbd Compare January 31, 2024 19:00
Copy link
Copy Markdown
Member

@yaoyaoding yaoyaoding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @yudi0201 !

It looks good to me.

CHECK_CUDA(cudaSetDevice(device));
}

DLL void hidet_cuda_malloc(void **devPtr, size_t size) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider directly return the allocated memory address like

DLL void* hidet_cuda_malloc(size_t size) {
    ...
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like hidet_cuda_get_device(...).

@yudi0201 yudi0201 force-pushed the cublas_gemm_batched branch from 69abe16 to ecd1461 Compare February 6, 2024 23:31
@yudi0201 yudi0201 force-pushed the cublas_gemm_batched branch from ecd1461 to a446270 Compare February 6, 2024 23:32
@yudi0201 yudi0201 changed the title [cublas] Add cublas_gemm_batched [cublas] Add cublas_gemm_batched and use cublasSetStream to set stream to the current stream in all cublas API calls Feb 7, 2024
@yaoyaoding yaoyaoding changed the title [cublas] Add cublas_gemm_batched and use cublasSetStream to set stream to the current stream in all cublas API calls [cuBLAS] Add cublas_gemm_batched and use cublasSetStream to set stream to the current stream in all cublas API calls Feb 8, 2024
Copy link
Copy Markdown
Member

@yaoyaoding yaoyaoding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @yudi0201!

// Allocate device memory
// first use synchronous versions of malloc and memcpy, later switch to async versions
if (cur_device_ptr_size != 0 && b > cur_device_ptr_size) {
hidet_cuda_free((void *)ptr_a_device);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not hidet_cuda_free_async?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following logic is more readable to me, just as a reference.

if(b > cur_device_ptr_size) {
  if(cur_device_ptr_size > 0) {
    free the three ptrs
  }
  alloc three ptrs
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestions! I'll modify these in the next revision.


DLL void* hidet_cuda_malloc(size_t size) {
lazy_load_cuda_runtime();
void* devPtr = malloc(sizeof(void*));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void* devPtr = malloc(sizeof(void*));
void* devPtr;

We do not need to allocate a memory region.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm missing something, but wouldn't doing this result in devPtr being created on the stack, and then we'd be returning a local stack variable?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we call

cudaMallocAsync(&devPtr, ...)

we are passing the address of the pointer to the cuda api function, which will update the pointer value. The pointer is a stack variable and will be valid during we calling the cuda api function. We returns the "value" of the pointer instead of the "address" of the pointer to the callee of hidet_cuda_malloc, this is fine.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Thanks!


DLL void* hidet_cuda_malloc_async(size_t size, cudaStream_t stream) {
lazy_load_cuda_runtime();
void* devPtr = malloc(sizeof(void*));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the previous one.

@yaoyaoding
Copy link
Copy Markdown
Member

Thanks @yudi0201 !

@yaoyaoding yaoyaoding merged commit 5f76caf into main Feb 12, 2024
@yaoyaoding yaoyaoding deleted the cublas_gemm_batched branch May 26, 2025 20:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants