Make SmallRingDescriptor, WienerNumbersDescriptor, and ALogPDescriptor parallelisable by JonasSchaub · Pull Request #1234 · cdk/cdk

JonasSchaub · 2025-09-29T14:28:52Z

In the three descriptor classes, instance variables relating to the current calculation, i.e. the given molecule, were turned into local method variables and are now passed between internal private methods as parameters, so that the classes are parallelisable.

We are doing descriptor calculations for batches of molecules and tried different ways of doing that in parallel for speed-up. This way, we realised that only these three descriptors were not parallelisable.

@egonw, since it was only three in the end, I didn't report them to you directly but instead went on to simply fix them immediately.

I also "linted" the classes a bit here and there and fixed some smaller Sonar complaints.

What do you think?

johnmay · 2025-09-30T08:36:24Z

Looks good, not a blocker but sometimes when you need to pass many variables around it can be useful to encapsulate them in a private inner class (e.g. ALogPDescriptor.State).

I still maintain the correct way to parallelise descriptors is you just have copies of the descriptors for each thread :-). What does your parallelisation look like - it might be good idea to integrate a parallel mode into the descriptor engine class.

JonasSchaub · 2025-09-30T09:34:41Z

Looks good, not a blocker but sometimes when you need to pass many variables around it can be useful to encapsulate them in a private inner class (e.g. ALogPDescriptor.State).

Thanks for the tip, should have thought about it, sorry.

I still maintain the correct way to parallelise descriptors is you just have copies of the descriptors for each thread :-).

We did that and it is the fastest way, so far, as it seems (if you are not under any memory constraints). Now, we would like to check whether we can still gain some speed when the descriptor instances are shared between the threads.

What does your parallelisation look like - it might be good idea to integrate a parallel mode into the descriptor engine class.

Our two parallelisation strategies are 1) Distribute the molecules onto threads (one per mol) and have them access shared descriptor classes and 2) Distribute the descriptors onto threads (one per desc) and have them access the shared molecules (hence my recent "fix" of the LargestPiSystemDescriptor that changed flags in the mol but did not make a copy first like others).

We are basically working on a "parallelised descriptor engine" that we would also present to you for CDK integration when it's complete. If you want to already have a look (we are still actively working on it):
https://github.com/JonasSchaub/CDK-Descriptor-Calculation/blob/dev-manuel/src/main/java/de/unijena/cheminf/clustering/desccalc/Descriptor.java

Parallelisation method 1) is not really there yet because these three descriptors are not ready for it yet. Instead, we experimented with synchronizing the shared descriptor instance and creating a new instance each time (like you suggested).

The main point of our library is to generate a large float matrix of descriptor values for a given set of molecules that can, e.g., be used for clustering.

johnmay · 2025-09-30T11:01:10Z

Our two parallelisation strategies are 1) Distribute the molecules onto threads (one per mol) and have them access shared descriptor classes and 2) Distribute the descriptors onto threads (one per desc) and have them access the shared molecules (hence my recent "fix" of the LargestPiSystemDescriptor that changed flags in the mol but did not make a copy first like others).

OK the best strategy as actually to chunk your input and give those to each thread. Depends on timings but something like 4096 molecules per chunk - keep them in the input format (e.g. SMILES) as long as possible. You want to minimise the synchronisation points as much as possible - this is particularly important on higher core counts.

JonasSchaub added 6 commits September 23, 2025 15:43

makes Wiener numbers descriptor threadsafe

225361f

makes ALogPDescriptor threadsafe

e8ef0fe

makes SmallRingDescriptor threadsafe

5f0516b

a bit of code formatting and refactoring

ebeec09

a bit of refactoring in the ALogPDescriptor class

07a89d7

a bit of refactoring and linting in the ALogPDescriptor class

4907fdc

johnmay merged commit 50ecd9e into cdk:main Sep 30, 2025
6 of 7 checks passed

JonasSchaub deleted the descriptor-parallelisation branch September 30, 2025 12:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make SmallRingDescriptor, WienerNumbersDescriptor, and ALogPDescriptor parallelisable#1234

Make SmallRingDescriptor, WienerNumbersDescriptor, and ALogPDescriptor parallelisable#1234
johnmay merged 6 commits intocdk:mainfrom
JonasSchaub:descriptor-parallelisation

JonasSchaub commented Sep 29, 2025

Uh oh!

johnmay commented Sep 30, 2025

Uh oh!

JonasSchaub commented Sep 30, 2025

Uh oh!

johnmay commented Sep 30, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

JonasSchaub commented Sep 29, 2025

Uh oh!

johnmay commented Sep 30, 2025

Uh oh!

JonasSchaub commented Sep 30, 2025

Uh oh!

johnmay commented Sep 30, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants