Offload: support OpenCL #3315

Merged
hfp merged 1 commit into cp2k:master from hfp:offload
Mar 13, 2024

@hfp
Member

hfp commented Mar 12, 2024

  • This work "short-cuts" OpenCL support by leveraging DBCSR's OpenCL backend to implement CP2K's Offload interface.
    • Future work (possibly soon) can settle where the actual glue code (the "OpenCL backend") is hosted.
    • OpenCL is very close to CUDA (streams/queues, events, etc.); however, an OpenCL backend must account for the differences.
    • Pointer arithmetic on the host using device pointers is not supported with OpenCL (without additional effort); e.g.,
      DBM is implemented by relying on device pointer arithmetic.
    • OpenCL abstracts device memory with an opaque "cl_mem" structure (it does not expose actual device pointers).
  • Some functions of CP2K's Offload interface (once derived from DBCSR's ACC interface) remain "TODO", i.e.,
    for the time being offload_runtime.h "disables" support for Grid and PW/FFTs.
    • Defines for __NO_OFFLOAD_GRID and __NO_OFFLOAD_PW must still be given for the Fortran code.
  • Some changes remove CUDA- and HIP-specific code paths by relying on the Offload runtime (e.g., offload_create_buffer).
  • Some changes also address problematic code formatting (multi-line control flow without curly braces/block, etc.).
  • Further updates to INSTALL.md and white-listing/removing certain items are part of follow-up changes.
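
To illustrate the point about device pointer arithmetic, here is a minimal sketch of carrying a "handle plus byte offset" pair where CUDA/HIP code would compute `ptr + n` on the host. All `demo_*` names are hypothetical and not part of CP2K or DBCSR; `demo_mem_t` merely mocks an opaque handle such as `cl_mem` with host memory so the sketch runs without an OpenCL installation. A real backend would forward the offset to the copy call (e.g., `clEnqueueWriteBuffer` takes an offset argument).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an opaque handle such as cl_mem. It wraps host
 * memory so that the sketch runs without an OpenCL installation. */
typedef struct demo_mem {
  unsigned char *storage; /* hidden from callers in a real backend */
  size_t size;            /* allocation size in bytes */
} demo_mem_t;

/* Device location as handle plus byte offset (replaces "ptr + n"). */
typedef struct demo_region {
  demo_mem_t *mem;
  size_t offset;
} demo_region_t;

demo_mem_t *demo_mem_create(size_t size) {
  demo_mem_t *mem = malloc(sizeof(*mem));
  if (mem == NULL) {
    return NULL;
  }
  mem->storage = calloc(size > 0 ? size : 1, 1);
  if (mem->storage == NULL) {
    free(mem);
    return NULL;
  }
  mem->size = size;
  return mem;
}

void demo_mem_release(demo_mem_t *mem) {
  if (mem != NULL) {
    free(mem->storage);
    free(mem);
  }
}

/* Equivalent of pointer arithmetic: only the offset changes. */
demo_region_t demo_region_advance(demo_region_t region, size_t nbytes) {
  assert(region.mem != NULL);
  assert(region.offset <= region.mem->size);
  assert(nbytes <= region.mem->size - region.offset);
  region.offset += nbytes;
  return region;
}

/* Mock host-to-device copy; returns 0 on success, -1 if out of bounds. */
int demo_memcpy_h2d(demo_region_t dst, const void *src, size_t nbytes) {
  if (dst.offset > dst.mem->size || nbytes > dst.mem->size - dst.offset) {
    return -1;
  }
  memcpy(dst.mem->storage + dst.offset, src, nbytes);
  return 0;
}

/* Mock device-to-host copy; returns 0 on success, -1 if out of bounds. */
int demo_memcpy_d2h(void *dst, demo_region_t src, size_t nbytes) {
  if (src.offset > src.mem->size || nbytes > src.mem->size - src.offset) {
    return -1;
  }
  memcpy(dst, src.mem->storage + src.offset, nbytes);
  return 0;
}
```

With such a pair, code that previously passed `base + offset` to a copy or kernel launch would pass `demo_region_advance(base, offset)` instead, and the backend alone decides how the offset is applied.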

hfp force-pushed the offload branch 2 times, most recently from 0c7207d to 79b4f0d (March 12, 2024 19:25)
hfp marked this pull request as ready for review March 12, 2024 20:12
@hfp
Member Author

hfp commented Mar 13, 2024

The test QS/regtest-tddfpt-force-2/h2o_f13.inp is flagged as wrong in the CUDA Pascal regtest. However, this is apparently unrelated to this PR; the CUDA Pascal regtest was mainly triggered to check that compilation succeeds.

* This work "short-cuts" OpenCL support by leveraging DBCSR's OpenCL backend to implement CP2K's Offload interface.
  - Future work (possibly soon) can settle where the actual glue code (the "OpenCL backend") is hosted.
  - OpenCL is very close to CUDA (streams/queues, events, etc.); however, an OpenCL backend must account for the differences.
  - Pointer arithmetic on the host using device pointers is not supported with OpenCL without additional effort.
  - OpenCL abstracts device memory as a "cl_mem" structure and does not expose actual device pointers.
* Some functions of CP2K's Offload interface (once derived from DBCSR's ACC interface) remain "TODO".
* For the time being, offload_runtime.h "disables" support for Grid and PW/FFTs.
  - Defines for __NO_OFFLOAD_GRID and __NO_OFFLOAD_PW must still be given for the Fortran code.
* Some changes remove CUDA- and HIP-specific code paths by relying on the Offload runtime (e.g., offload_create_buffer).
* Some changes also address problematic code formatting (multi-line control flow without curly braces/block, etc.).
* Further updates to INSTALL.md and white-listing/removing certain items are part of follow-up changes.
* Cleaned FLAG_EXCEPTIONS (tools/precommit/check_file_properties.py).
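
As a rough sketch of how a C header can "disable" Grid and PW offload for one backend, the following defines `__NO_OFFLOAD_GRID` and `__NO_OFFLOAD_PW` whenever a backend-selection macro is set. The macro `DEMO_OFFLOAD_OPENCL` and the two query functions are hypothetical and only serve the illustration; this is not the content of offload_runtime.h. Fortran sources presumably do not see definitions made in a C header, which would explain why the two defines must still be passed to the Fortran compiler explicitly.

```c
/* Hypothetical backend selection; a real build would set this via -D. */
#define DEMO_OFFLOAD_OPENCL

/* Header-style guard: disable Grid and PW offload for this backend
 * unless the build system already disabled them. */
#if defined(DEMO_OFFLOAD_OPENCL)
#if !defined(__NO_OFFLOAD_GRID)
#define __NO_OFFLOAD_GRID
#endif
#if !defined(__NO_OFFLOAD_PW)
#define __NO_OFFLOAD_PW
#endif
#endif

/* Returns 1 if Grid offload code paths are compiled in, 0 otherwise. */
int demo_grid_offload_enabled(void) {
#if defined(__NO_OFFLOAD_GRID)
  return 0;
#else
  return 1;
#endif
}

/* Returns 1 if PW/FFT offload code paths are compiled in, 0 otherwise. */
int demo_pw_offload_enabled(void) {
#if defined(__NO_OFFLOAD_PW)
  return 0;
#else
  return 1;
#endif
}
```

With the hypothetical `DEMO_OFFLOAD_OPENCL` defined as above, both query functions return 0; removing that define makes them return 1 unless the two macros are given on the command line.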
hfp merged commit 4b070db into cp2k:master Mar 13, 2024
@oschuett
Member

This is very exciting!

Should we set up a dashboard test for it?

> DBM is implemented by relying on device pointer arithmetic.

This should be straightforward to change.

@hfp
Member Author

hfp commented Mar 13, 2024

Thanks! There are more PRs to come, at least one covering the actual DBM changes/kernel. This works perfectly well on NVIDIA too, so it should be simple to finally set up a Dashboard entry.

@hfp
Member Author

hfp commented Mar 13, 2024

Also, I have to work on providing a clean/minimal ARCH file and support in our CMake infra ;-)

@hfp
Member Author

hfp commented Mar 13, 2024

To share a little sneak peek (@JWilhelm), I ran benchmarks/QS_low_scaling_GW/GW.inp on systems with four Intel GPUs; the results look like this:

| job | nn | nr | nt | tts (s) |
|---|---|---|---|---|
| cp2k-gwls-192481 | 1 | 16 | 7 | 340.494 |
| cp2k-gwls-192547 | 2 | 32 | 3 | 162.532 |
| cp2k-gwls-192549 | 4 | 16 | 7 | 105.373 |
| cp2k-gwls-192483 | 16 | 16 | 7 | 56.934 |

The columns denote the job title, number of nodes (nn), number of ranks per node (nr), number of threads (nt, no SMT), and total time to solution (tts, in seconds). My DBM kernel is not optimized at this point; e.g., it does not even use shared/local memory.
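
For reading the table, a small sketch of the usual speedup and parallel-efficiency arithmetic relative to the single-node run may help. The type and function names are hypothetical; the numbers are copied from the table above. Note that the runs differ in ranks per node and threads, so this is not a clean strong-scaling comparison.

```c
#include <stddef.h>

/* One benchmark run: number of nodes and time to solution in seconds. */
typedef struct demo_run {
  int nodes;
  double tts;
} demo_run_t;

/* Values taken from the table above (job order preserved). */
const demo_run_t demo_runs[] = {
    {1, 340.494}, {2, 162.532}, {4, 105.373}, {16, 56.934}};
const size_t demo_nruns = sizeof(demo_runs) / sizeof(demo_runs[0]);

/* Speedup of a run relative to a baseline run. */
double demo_speedup(const demo_run_t *base, const demo_run_t *run) {
  return base->tts / run->tts;
}

/* Parallel efficiency: speedup divided by the increase in node count. */
double demo_efficiency(const demo_run_t *base, const demo_run_t *run) {
  return demo_speedup(base, run) * base->nodes / run->nodes;
}
```

Relative to the one-node job, this gives speedups of roughly 2.1x, 3.2x and 6.0x on 2, 4 and 16 nodes, i.e., the efficiency drops noticeably at 16 nodes.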

hfp deleted the offload branch March 15, 2024 09:40