Hi,
I have observed that CP2K deadlocks with certain rank counts when running with the ELPA backend. I don't know ELPA very well, but with the debugger I think I could gather enough info to pinpoint the problem.
This is what happens in a run with 8 MPI ranks:
- CP2K calls ELPA solver with a comm of size 4.
- ELPA initializes the GPUs for the 4 ranks in check_for_gpu() #442.
- ELPA solver succeeds, execution continues normally.
- Later, CP2K calls ELPA solver again, but with a comm of size 8.
- Now there is a deadlock at check_for_gpu() #507. Because the first 4 ranks were already initialized, they exit the function early (#453) and never reach the Allreduce, so the last 4 ranks hang there forever.
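To make the sequence above concrete, here is a toy model in plain Python. It is not ELPA code and uses no MPI: the function names and the dictionary of per-rank flags are hypothetical stand-ins, and the only rule it encodes is that a collective completes only if every rank in the communicator calls it.

```python
# Toy model of the behaviour described above; NOT ELPA code, no real MPI.
# Assumption: a rank whose flag is already set exits early and skips the
# collective, as observed in the debugger.

def ranks_reaching_allreduce(comm_ranks, gpu_is_initialized):
    """Ranks that get past the early exit and call the Allreduce."""
    return [r for r in comm_ranks if not gpu_is_initialized[r]]

def check_for_gpu_model(comm_ranks, gpu_is_initialized):
    """Return 'ok' or 'deadlock' for one call; set the flags on success."""
    comm_ranks = list(comm_ranks)
    callers = ranks_reaching_allreduce(comm_ranks, gpu_is_initialized)
    # A collective hangs if only part of the communicator takes part in it.
    if callers and len(callers) != len(comm_ranks):
        return "deadlock"
    for r in callers:
        gpu_is_initialized[r] = True
    return "ok"

flags = {r: False for r in range(8)}            # 8 MPI ranks, none initialized
first = check_for_gpu_model(range(4), flags)    # comm of size 4
second = check_for_gpu_model(range(8), flags)   # comm of size 8
print(first, second)
```

In this model the first call completes and the second one is flagged as a deadlock, because only ranks 4 to 7 would enter the collective.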
I don't know whether calling ELPA with different comm sizes is allowed. My first thought is that check_for_gpu() should first query gpuIsInitialized on all ranks and restart the initialization for everyone if any rank was not initialized. This seems to fix the deadlock in my case.
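The idea can be sketched with a small Python model. This is only an illustration of the control flow, not a patch for ELPA: `check_for_gpu_fixed` and the flag dictionary are hypothetical names, and the `all(...)` call stands in for an Allreduce with a logical AND that every rank calls before any early exit.

```python
# Toy model of the proposed fix; NOT ELPA code, no real MPI.

def check_for_gpu_fixed(comm_ranks, gpu_is_initialized):
    """Return the ranks that take part in the initialization collective."""
    comm_ranks = list(comm_ranks)
    # Step 1: every rank contributes its flag unconditionally
    # (stand-in for an Allreduce with logical AND over the communicator).
    all_initialized = all(gpu_is_initialized[r] for r in comm_ranks)
    if all_initialized:
        return []  # all ranks agree, so all of them exit early together
    # Step 2: at least one rank is uninitialized, so every rank redoes the
    # initialization and the later collective is reached by the whole comm.
    for r in comm_ranks:
        gpu_is_initialized[r] = True
    return comm_ranks

flags = {r: False for r in range(8)}
a = check_for_gpu_fixed(range(4), flags)  # first solver call, comm of size 4
b = check_for_gpu_fixed(range(8), flags)  # second solver call, comm of size 8
c = check_for_gpu_fixed(range(8), flags)  # a repeated call with the same comm
print(a, b, c)
```

The point of the design is that the decision to exit early is made collectively, so either all ranks of the communicator skip the initialization or all of them run it. The cost is one extra collective per call.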