
DLA-Future complex eigensolvers#3813

Merged
oschuett merged 4 commits into cp2k:master from RMeli:dlaf-cfm
Dec 17, 2024

Conversation

@RMeli
Member

@RMeli RMeli commented Dec 16, 2024

Introduce DLA-Future complex standard and generalized eigensolvers.

Other changes:

  • Moved complex Cholesky decompositions to new module
  • Moved cp_cfm_upper_to_full from rpa_gw_kpoints_util to cp_cfm_basic_linalg

As with #3799, the tests in regtest-dlaf-2 require libxc and libint and therefore will not run in the current CMake CI action.

@RMeli
Member Author

RMeli commented Dec 16, 2024

I think the CI timeout is a thread-binding issue. Locally, if I run with --bind-to core:2 I get:

>>> /tmp/rmeli/spack-stage/spack-stage-cp2k-master-lei4ue472mgchmksqn2uypijyqkh5sz5/spack-build-lei4ue4/bin/TEST-2024-12-16_11-40-33/QS/regtest-dlaf
    H2O-6.inp                                                                             -17.14603642           OK (   2.57 sec)
    c_2.inp                                                                               -45.68042106           OK (   5.89 sec)

If I run with the Docker container instead, which uses --bind-to none, I do not hit the timeout, but the test is indeed very slow:

#23 371.7 >>> /opt/cp2k/build/bin/TEST-2024-12-16_10-23-56/QS/regtest-dlaf
#23 371.7     H2O-6.inp                                                                             -17.14603642           OK (   2.48 sec)
#23 371.7     c_2.inp                                                                               -45.68042106           OK ( 187.95 sec)
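
For comparison, the two binding modes can be toggled directly on Open MPI's command line; a hedged sketch (rank count, binary name, and input file are placeholders, not the exact CI invocation):

```shell
# Pin ranks to cores (the fast case above):
mpirun -np 2 --bind-to core cp2k.psmp c_2.inp

# Disable binding entirely (the slow case above):
mpirun -np 2 --bind-to none cp2k.psmp c_2.inp
```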

@oschuett
Member

If I run with the Docker container instead, which uses --bind-to none, I do not hit the timeout, but the test is indeed very slow:

So far we have had no issues with --bind-to none, quite the contrary. Could you please verify that it is really the culprit by running the Docker container without it?

@oschuett
Member

tests in regtest-dlaf-2 require libxc and libint and therefore will not run in the current CMake CI action.

This is really not good. I believe we have the following options:

  1. Add DLAF to the toolchain.
  2. Get Spack build cache to work in the CI.
  3. Switch to libcint.
  4. Stomach the cost, including the prolonged wait time for developers.

@RMeli
Member Author

RMeli commented Dec 16, 2024

@oschuett I agree. A multi-stage build, which is what we used for DLA-Future, would prevent re-building the dependencies every time.

We have recently discussed internally at CSCS possible avenues for (part of) the CP2K CI (especially GPU tests), and we have some propositions. I'll write you and @mkrack an e-mail in the coming days. (Let me know if I should include anyone else in the discussion.)

For the timeout, I think this might be the issue now that we have multiple tests running concurrently:

# Bind pika threads to first two cores. This is a hack. Do not use for production!
export PIKA_PROCESS_MASK="0x3"
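
For reference, the mask is just a bitmap of allowed cores; a small Python sketch (a hypothetical helper, not part of CP2K or pika) decodes it:

```python
# Decode a hex CPU-affinity mask such as PIKA_PROCESS_MASK="0x3" into the
# list of core indices it enables. Purely illustrative.
def cores_from_mask(mask: str) -> list[int]:
    value = int(mask, 16)
    return [i for i in range(value.bit_length()) if value & (1 << i)]

print(cores_from_mask("0x3"))  # the first two cores: [0, 1]
```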

I'll run more tests locally and try to figure out exactly what is happening.

@oschuett
Member

CSCS about possible avenues for (part of) the CP2K CI (especially GPU tests),

Sounds great! I'm looking forward to the email :-)

For the timeout, I think this might be the issue now that we have multiple tests running concurrently:

Oh, I had totally forgotten about the PIKA_PROCESS_MASK. Ideally, pika could figure this out by itself. Otherwise, we could add some logic to the do_regtest.py script. We already do something similar for GPUs.

@RMeli
Member Author

RMeli commented Dec 17, 2024

I disabled binding for pika and set the number of pika threads to match the number of OpenMP threads. Tests now run in a reasonable time:

>>> /opt/cp2k/build/bin/TEST-2024-12-16_17-20-05/QS/regtest-dlaf
    H2O-6.inp                                                                             -17.14603642           OK (   0.57 sec)
    c_2.inp                                                                               -45.68042106           OK (   5.29 sec)
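
A rough sketch of the idea, assuming pika's standard `--pika:bind` and `--pika:threads` runtime options; how CP2K forwards these to pika depends on the DLA-Future integration, so treat the invocation as illustrative:

```shell
# Illustrative only: disable pika's thread binding and size its
# thread pool to match OpenMP (option names assume pika's usual CLI).
export OMP_NUM_THREADS=2
cp2k.psmp --pika:bind=none --pika:threads=${OMP_NUM_THREADS} H2O-6.inp
```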

To get this PR going, may I suggest enabling libint temporarily and doing a single run just before merging? Then we can work on improving the CI for Spack + CMake in the new year.

@oschuett
Member

I disabled binding for pika and set the number of pika threads to match the number of OpenMP threads.

Nice! That looks like a much better solution.

To get this PR going, may I suggest enabling libint temporarily and doing a single run just before merging?

I think we should just enable libint and leave it on. Since this will prolong the test by ~30 minutes, I'd move it to the weekly cadence.

@RMeli
Member Author

RMeli commented Dec 17, 2024

I think we should just enable libint and leave it on. Since this will prolong the test by ~30 minutes, I'd move it to the weekly cadence.

Ok, I'll add it here then. While I'm at it, I'll see if I can actually force Spack to use the build cache by temporarily reviving #3331.

RUN spack external find --all --not-buildable

# Enable Spack build cache
ARG SPACK_BUILD_CACHE=develop-2024-12-15
Member


Oh, I didn't realize that there is now a build cache for develop tags. That's awesome!
Should we maybe link it to the Spack version that we install above:

ARG SPACK_VERSION=develop-2024-12-15
...
git fetch --quiet --depth 1 origin ${SPACK_VERSION} --no-tags
...
spack mirror add ${SPACK_VERSION} https://binaries.spack.io/${SPACK_VERSION}

Member Author


Spack does a weekly build. I think they should be kept separate; otherwise, we would have to wait up to a week to be able to update the Spack version.

Unfortunately, as I just mentioned below, it is not possible to use libint from the build cache. I tried multiple ways to force Spack to use it, but it can't satisfy all the constraints with the version provided by the build cache.

This still saves a bit of time, since several dependencies are not built but taken directly from the build cache.

Member


Unfortunately, as I just mentioned below, it is not possible to use libint from the build cache. I tried multiple ways to force Spack to use it, but it can't satisfy all the constraints with the version provided by the build cache.

Presumably, we're relying too much on Ubuntu packages. With the build cache in place I'd be ready to lean in and install most packages from Spack.

Spack does a weekly build. I think they should be separate, otherwise we will have to wait up to a week to be able to update the Spack version.

Ok, that makes sense. My main worry is that some change will lead to a lot of cache misses. Maybe as a safeguard we could time-box it:

timeout 10m spack -e myenv concretize -f

In any case, all of this could also happen in a separate PR.

Member Author


Presumably, we're relying too much on Ubuntu packages.

That is a good point. I can try to experiment locally and see if I can come up with a combination where libint is indeed pulled from the cache. Another option to look into would be to set up our own OCI registry for the build cache.
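
A self-hosted cache could build on Spack's OCI build-cache support; a minimal sketch with a hypothetical registry URL and environment name:

```shell
# Register an OCI registry as a build-cache mirror
# (registry URL and mirror name are hypothetical):
spack mirror add cp2k-cache oci://ghcr.io/cp2k/spack-buildcache

# Push the packages of a concretized environment to that cache:
spack -e myenv buildcache push cp2k-cache
```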

Happy to have a look at this in a separate PR of course.

Member Author


FYI, I tried to build without tapping into Ubuntu packages, but the problem persists. Spack is not very explicit about why it fails, but there must be some incompatibility between the cached libint dependencies and other packages. One workaround could be to use unify: when_possible instead of unify: true, but I fear this would create more problems than it solves (there would be multiple versions of some libraries, etc.).

@RMeli
Member Author

RMeli commented Dec 17, 2024

@oschuett, as suggested above, I enabled libint in the CMake CI action, and now all DLA-Future tests run in CI:

>>> /opt/cp2k/build/bin/TEST-2024-12-17_12-35-15/QS/regtest-dlaf-2
    Ne-pbc-shortrange.inp                                                                   890.608044           OK (   1.89 sec)
    Ar-HF-2p-SOC-os.inp                                                                     261.347679           OK (   2.26 sec)
<<< /opt/cp2k/build/bin/TEST-2024-12-17_12-35-15/QS/regtest-dlaf-2 (8 of 352) done in 4.15 sec
>>> /opt/cp2k/build/bin/TEST-2024-12-17_12-35-15/QS/regtest-dlaf
    H2O-6.inp                                                                             -17.14603642           OK (   0.58 sec)
    c_2.inp                                                                               -45.68042106           OK (   6.10 sec)
<<< /opt/cp2k/build/bin/TEST-2024-12-17_12-35-15/QS/regtest-dlaf (9 of 352) done in 6.68 sec

I re-enabled the Spack build cache, but I was unable to use libint from there because of some constraints in the dependency tree. We could look into enabling our own build cache, and later on moving the Spack + CMake CI action to a multi-stage build (possibly on top of CSCS CI).

@oschuett
Member

Happy to have a look at this in a separate PR of course.

Ok, then I'll merge this PR now.

Furthermore, I'll move the "cmake" test to a weekly cadence and also rename it to "spack", because that has become its main role. For testing the CMake build system we can now use the "ssmp" test.

@oschuett oschuett merged commit 8512fd5 into cp2k:master Dec 17, 2024
@RMeli RMeli deleted the dlaf-cfm branch December 17, 2024 14:12