Conversation

@bashbaug (Contributor) commented Nov 8, 2024

This PR contains the draft specification for cl_khr_unified_svm. It is intended to provide early access for review and community feedback, and it is not a ratified specification.

cl_khr_unified_svm adds additional types of Shared Virtual Memory (SVM) to OpenCL. Compared to Coarse-Grained and Fine-Grained SVM in OpenCL 2.0 and newer, the additional types of SVM added by this extension provide:

  • Sufficient functionality to implement "Unified Shared Memory" (USM) in other APIs, such as SYCL.

  • Additional control over the ownership and accessibility of SVM allocations, to more precisely choose between application performance and programmer convenience.

  • A simpler programming model, by automatically migrating more SVM allocations between devices and the host, or by accessing more SVM allocations on the host without needing to map or unmap the allocation.

Specifically, this extension provides:

  • Extensible interfaces to support many types of SVM, including the SVM types defined in core OpenCL, in this extension, and additional SVM types defined by other combinations of SVM capabilities.

  • Explicit control over memory placement and migration by supporting device-owned SVM allocations for best performance, host-owned SVM allocations for wide visibility, and shared SVM allocations that may migrate between devices and the host.

  • The ability to query detailed SVM capabilities for each SVM allocation type supported by a platform and device.

  • Additional properties to control how memory is allocated and freed, including properties to associate an SVM allocation with both a device and a context.

  • A mechanism to indicate that a kernel may access SVM allocations indirectly, without passing a set of indirectly accessed SVM allocations to the kernel, improving usability and reducing driver overhead for kernels that access many SVM allocations (see the sketch after this list).

  • A new query function to query properties of an SVM allocation.

  • A new function to suggest an SVM allocation type for a set of SVM capabilities.
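
For illustration, here is a minimal sketch of how a kernel might opt in to indirect SVM access using the CL_KERNEL_EXEC_INFO_SVM_INDIRECT_ACCESS_KHR enum defined by this extension. This is not a ratified interface: the cl_bool payload for clSetKernelExecInfo is an assumption modeled on existing exec-info parameters, and `kernel` is an existing cl_kernel.

    // Sketch only: opt the kernel in to indirect SVM access, so that SVM
    // allocations it dereferences need not be listed individually via
    // CL_KERNEL_EXEC_INFO_SVM_PTRS. The cl_bool payload is assumed.
    cl_bool indirectAccess = CL_TRUE;
    clSetKernelExecInfo(
        kernel,
        CL_KERNEL_EXEC_INFO_SVM_INDIRECT_ACCESS_KHR,
        sizeof(indirectAccess),
        &indirectAccess);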

Because the interfaces defined by this specification are not final and are subject to change, they are not intended to be used by shipping software products. If you are interested in using this feature in your software product, please let us know!

@illuhad left a comment

Great to see at last what you've been working on :)

You asked for feedback, so here are some things I noticed:

  • I can see that the ability to request individual SVM features can be very powerful, if implementations are happy to carry the burden of dealing with all of the possible combinations. However, I am a bit concerned with adoption if the "common cases" of device/host/shared SVM that you describe are hidden underneath some not necessarily obvious combinations of bitfield values. Can we perhaps have convenience shortcuts for these common cases? This could make user code simpler and make it more attractive to use this memory management model, and it could provide guidance for users and vendors if everybody knew clearly which combinations of capabilities to focus their testing on. This could help portability and adoption.
  • Can we have the ability to register an existing memory allocation to make it device accessible, similarly to cudaHostRegister?
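
For comparison, this is what the CUDA mechanism looks like; an OpenCL analogue would similarly take a pre-existing pointer rather than allocating new memory. The snippet below only illustrates the request (cudaHostRegister is the real CUDA runtime API; `size` is assumed defined):

    // CUDA's existing registration API, shown only to illustrate the request;
    // no OpenCL equivalent exists in this draft.
    void* ptr = malloc(size);
    cudaHostRegister(ptr, size, cudaHostRegisterDefault);
    // ... ptr is now accessible to the device ...
    cudaHostUnregister(ptr);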

@bashbaug (Contributor, Author) commented

Thank you for the feedback!

  • I can see that the ability to request individual SVM features can be very powerful, if implementations are happy to carry the burden of dealing with all of the possible combinations. However, I am a bit concerned with adoption if the "common cases" of device/host/shared SVM that you describe are hidden underneath some not necessarily obvious combinations of bitfield values. Can we perhaps have convenience shortcuts for these common cases? This could make user code simpler and make it more attractive to use this memory management model, and it could provide guidance for users and vendors if everybody knew clearly which combinations of capabilities to focus their testing on. This could help portability and adoption.

This is the intent behind the SVM type "capability macros", which we've defined in the spec and in the headers. We've currently defined "macros" for the common cases we've been thinking about, and if other combinations become common they can be added with a simple header update.

Specifically, we've defined capability macros for:

  • CL_SVM_TYPE_MACRO_COARSE_GRAIN_BUFFER_KHR
  • CL_SVM_TYPE_MACRO_FINE_GRAIN_BUFFER_KHR
  • CL_SVM_TYPE_MACRO_DEVICE_KHR
  • CL_SVM_TYPE_MACRO_HOST_KHR
  • CL_SVM_TYPE_MACRO_SINGLE_DEVICE_SHARED_KHR
  • CL_SVM_TYPE_MACRO_SYSTEM_KHR

The capability macros provide a very concise and easy mechanism to request these types of SVM. For example, here is a way to allocate memory equivalent to SYCL "device USM":

    // Query the suggested SVM type index for the "device USM" capability
    // macro. The 'props' allocation properties list is assumed to be defined
    // earlier, and error checking is omitted for brevity.
    cl_uint index = CL_UINT_MAX;
    clGetSVMSuggestedTypeIndexKHR(
        context(),
        CL_SVM_TYPE_MACRO_DEVICE_KHR,
        0,
        props,
        0,
        &index);

    // Allocate device-owned memory using the suggested SVM type index.
    cl_uint* d_src = (cl_uint*)clSVMAllocWithPropertiesKHR(
        context(),
        props,
        index,
        gwx * sizeof(cl_uint),
        nullptr );

  • Can we have the ability to register an existing memory allocation to make it device accessible, similarly to cudaHostRegister?

We have this request recorded (in document issue (5)), but we are not currently considering it in scope for the initial extension revision, primarily because this functionality isn't supported by the Intel USM extension and hence isn't required to implement SYCL. Does that sound reasonable?

If we don't include it in the initial extension, we could support it via a layered extension, or a subsequent extension version.

@illuhad commented Nov 18, 2024

Hi Ben!

The capability macros provide a very concise and easy mechanism to request these types of SVM. For example, here is a way to allocate memory equivalent to SYCL "device USM":

Ah, I see, thanks for pointing this out. This was not very obvious to me. Perhaps other users might feel the same way?
For me this workflow would certainly work, but then I'm more of a framework vendor than an application developer/end user, so sensitivities might be different.

We have this request recorded (in document issue (5)), but we are not currently considering it in scope for the initial extension revision, primarily because this functionality isn't supported by the Intel USM extension and hence isn't required to implement SYCL. Does that sound reasonable?

I agree that it is not needed for SYCL (at least not today), but I would need it for C++ standard parallelism offloading sooner rather than later in order to make the stack GPU accessible. But I appreciate that the scope of individual extensions must be limited in some way to keep the work manageable :)

Comment on lines 311 to 314
| `CL_SVM_CAPABILITY_DEVICE_READ_KHR`
| This type of SVM is accessible on the device for reading.
| `CL_SVM_CAPABILITY_DEVICE_WRITE_KHR`
| This type of SVM is writeable on the device for writing.
@bashbaug (Contributor, Author) commented

We should clarify exactly what is meant by "accessible on the device for reading" and "accessible on the device for writing", given issues like #770.

  • Do these capabilities only apply to kernels, or do they apply to other APIs as well?

Specifically:

  • What capabilities are needed for clEnqueueSVMMemcpy? Does it matter if the SVM allocation is the memcpy source or destination?
  • What capabilities are needed for clEnqueueSVMMemFill?
  • What capabilities are needed for clEnqueueSVMMigrateMem?

Minor: there is a small typo here too. "writeable on the device for writing" should be "accessible on the device for writing".

@bashbaug (Contributor, Author) commented

Discussed in the November 29th memory subgroup. We don't want to repeat the mistakes we made leading to #770, so our intent is that these capabilities apply to all forms of device access, not just kernels. Therefore:

  • What capabilities are needed for clEnqueueSVMMemcpy?
    • DEVICE_READ is needed for a memcpy source, and DEVICE_WRITE is needed for a memcpy destination.
  • What capabilities are needed for clEnqueueSVMMemFill?
    • DEVICE_WRITE is needed for a memfill destination.
  • What capabilities are needed for clEnqueueSVMMigrateMem?
    • TBD - do we want DEVICE_READ to migrate to the host (via CL_MIGRATE_MEM_OBJECT_HOST), and DEVICE_WRITE to migrate to the device?
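
For concreteness, here is a sketch of the capability checks implied by the first two answers, assuming the cl_svm_capabilities_khr bitfield type from the draft headers (the helper names are hypothetical):

    // Hypothetical helpers: a memcpy source requires DEVICE_READ; a memcpy
    // or memfill destination requires DEVICE_WRITE.
    cl_bool canBeMemcpySource(cl_svm_capabilities_khr caps) {
        return (caps & CL_SVM_CAPABILITY_DEVICE_READ_KHR) != 0;
    }
    cl_bool canBeMemcpyOrMemfillDestination(cl_svm_capabilities_khr caps) {
        return (caps & CL_SVM_CAPABILITY_DEVICE_WRITE_KHR) != 0;
    }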

#define CL_KERNEL_EXEC_INFO_SVM_INDIRECT_ACCESS_KHR 0x11BB
----

== Modifications to the OpenCL API Specification
@b-sumner commented Nov 19, 2024

Aren't there any updates to section 3.3 Memory Model?

@bashbaug (Contributor, Author) commented

Mentioned in the January 7th teleconference: I have OpenCL Intercept Layer support for unified SVM done and in a branch, which I will try to keep up-to-date with the latest specs. I'm open to suggestions, but I probably won't merge the branch until the extension is a bit further along.

The branch can be found here:
https://github.com/bashbaug/opencl-intercept-layer/tree/cl_khr_unified_svm

Trending "yes" for host USM allocations, both when the host USM allocation is from this context and from another context.
--

. Can a pointer to a device, host, or shared SVM allocation be passed to API functions to read from or write to `cl_mem` objects, such as *clEnqueueReadBuffer* or *clEnqueueWriteImage*?
@bashbaug (Contributor, Author) commented

Discussed briefly in the January 7th memory subgroup. We may want to consider buffers and images separately.
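
To make the buffer half of the question concrete, a sketch (`queue`, `buffer`, `size`, and `svm_ptr` are assumed, with `svm_ptr` pointing into a device, host, or shared SVM allocation):

    // Is this valid when 'svm_ptr' is an SVM allocation rather than ordinary
    // host memory? The answer may differ for buffers and images.
    clEnqueueReadBuffer(
        queue, buffer, CL_TRUE, 0, size, svm_ptr, 0, NULL, NULL);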

Comment on lines 541 to 542
Whether *clSVMFree* or *clSVMFreeWithPropertiesKHR* is blocking or non-blocking is unspecified.
Applications should not rely on *clSVMFree* or *clSVMFreeWithPropertiesKHR* for synchronization, nor assume that *clSVMFree* or *clSVMFreeWithPropertiesKHR* cannot cause deadlocks.
A contributor commented

As discussed in the 2025/03/18 memory subgroup TSG, I think this note should be reworded to make it clear that

  • Deadlocks can only happen when clSVMFree (or clSVMFreeWithPropertiesKHR) is called on an allocation that may be used by commands pending execution.
  • "blocking" means "waiting for pending commands that may use the allocation to complete execution"

With these changes, such deadlocks would be covered by the "behavior is undefined" statement above; strictly speaking, the current wording of the note suggests that deadlocks could happen in any scenario.
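
A sketch of the scenario the reworded note would cover (variable names assumed):

    // A kernel that may use 'ptr' is still pending execution...
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
    // ...so whether this call waits for that kernel, returns immediately, or
    // can deadlock is exactly what the note needs to pin down.
    clSVMFree(context, ptr);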

@bashbaug added the cl_khr_unified_svm label Jun 10, 2025
.. [.line-through]#Allow zero-sized allocations and require returning a non-null pointer that must be freed.#
.. Allow zero-sized allocations and require returning a `NULL` pointer.
No error will be generated.
Note, it is not currently an error to free a `NULL` pointer.
@bashbaug (Contributor, Author) commented

Discussed in the July 1st memory subgroup; see tentative resolution above.

As a follow-up, while the current OpenCL spec says that it is not an error to free a NULL pointer using clSVMFree:

If a NULL pointer is passed in svm_pointer, no action occurs.

Should this apply to the asynchronous clEnqueueSVMFree also? This case is not explicitly described.
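
In code form, the case that is not explicitly described (using the standard clEnqueueSVMFree signature from OpenCL 2.0; `queue` is assumed):

    // Is a NULL entry in 'svm_pointers' a no-op, matching clSVMFree, or is
    // the behavior undefined? The spec does not currently say.
    void* pointers[] = { NULL };
    clEnqueueSVMFree(queue, 1, pointers, NULL, NULL, 0, NULL, NULL);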

EwanC added a commit to EwanC/AdaptiveCpp that referenced this pull request Dec 23, 2025
The OpenCL WG is developing the [cl_khr_unified_svm](KhronosGroup/OpenCL-Docs#1282)
extension with an explicit goal of being sufficient to implement SYCL
USM on top of. A snapshot of the APIs is currently available in the
upstream OpenCL-Headers repo, hence the commit bump to fetch these.

This PR adds support for using cl_khr_unified_svm as an AdaptiveCpp
OpenCL USM backend. It is tested by using [an emulation
layer](https://github.com/bashbaug/SimpleOpenCLSamples/blob/cl_khr_unified_svm/layers/99_svmplusplus/emulate.cpp)
on top of an OpenCL CPU implementation with Intel USM extension support.

Testing was done by running the AdaptiveCpp `usm_tests*` and `queue_tests*` unit tests.

As the OpenCL extension APIs are subject to change and there are no
non-emulated implementations, this PR could stay as a draft prototype
until the OpenCL extension is finalized.