Skip to content

Make SmallRingDescriptor, WienerNumbersDescriptor, and ALogPDescriptor parallelisable#1234

Merged
johnmay merged 6 commits intocdk:mainfrom
JonasSchaub:descriptor-parallelisation
Sep 30, 2025
Merged

Make SmallRingDescriptor, WienerNumbersDescriptor, and ALogPDescriptor parallelisable#1234
johnmay merged 6 commits intocdk:mainfrom
JonasSchaub:descriptor-parallelisation

Conversation

@JonasSchaub
Copy link
Copy Markdown
Contributor

In the three descriptor classes, instance variables relating to the current calculation, i.e. the given molecule, were turned into local method variables and are now passed between internal private methods as parameters, so that the classes are parallelisable.

We are doing descriptor calculations for batches of molecules and tried different ways of doing that in parallel for speed-up. This way, we realised that only these three descriptors were not parallelisable.

@egonw, since it was only three in the end, I didn't report them to you directly but instead went on to simply fix them immediately.

I also "linted" the classes a bit here and there and fixed some smaller Sonar complaints.

What do you think?

@johnmay
Copy link
Copy Markdown
Member

johnmay commented Sep 30, 2025

Looks good, not a blocker but sometimes when you need to pass many variables around it can be useful to encapsulate them in a private inner class (e.g. ALogPDescriptor.State).

I still maintain the correct way to parallelise descriptors is you just have copies of the descriptors for each thread :-). What does your parallelisation look like - it might be good idea to integrate a parallel mode into the descriptor engine class.

@JonasSchaub
Copy link
Copy Markdown
Contributor Author

Looks good, not a blocker but sometimes when you need to pass many variables around it can be useful to encapsulate them in a private inner class (e.g. ALogPDescriptor.State).

Thanks for the tip, should have thought about it, sorry.

I still maintain the correct way to parallelise descriptors is you just have copies of the descriptors for each thread :-).

We did that and it is the fastest way, so far, as it seems (if you are not under any memory constraints). Now, we would like to check whether we can still gain some speed when the descriptor instances are shared between the threads.

What does your parallelisation look like - it might be good idea to integrate a parallel mode into the descriptor engine class.

Our two parallelisation strategies are 1) Distribute the molecules onto threads (one per mol) and have them access shared descriptor classes and 2) Distribute the descriptors onto threads (one per desc) and have them access the shared molecules (hence my recent "fix" of the LargestPiSystemDescriptor that changed flags in the mol but did not make a copy first like others).

We are basically working on a "parallelised descriptor engine" that we would also present to you for CDK integration when it's complete. If you want to already have a look (we are still actively working on it):
https://github.com/JonasSchaub/CDK-Descriptor-Calculation/blob/dev-manuel/src/main/java/de/unijena/cheminf/clustering/desccalc/Descriptor.java

Parallelisation method 1) is not really there yet because these three descriptors are not ready for it yet. Instead, we experimented with synchronizing the shared descriptor instance and creating a new instance each time (like you suggested).

The main point of our library is to generate a large float matrix of descriptor values for a given set of molecules that can, e.g., be used for clustering.

@johnmay
Copy link
Copy Markdown
Member

johnmay commented Sep 30, 2025

Our two parallelisation strategies are 1) Distribute the molecules onto threads (one per mol) and have them access shared descriptor classes and 2) Distribute the descriptors onto threads (one per desc) and have them access the shared molecules (hence my recent "fix" of the LargestPiSystemDescriptor that changed flags in the mol but did not make a copy first like others).

OK the best strategy as actually to chunk your input and give those to each thread. Depends on timings but something like 4096 molecules per chunk - keep them in the input format (e.g. SMILES) as long as possible. You want to minimise the synchronisation points as much as possible - this is particularly important on higher core counts.

@johnmay johnmay merged commit 50ecd9e into cdk:main Sep 30, 2025
6 of 7 checks passed
@JonasSchaub JonasSchaub deleted the descriptor-parallelisation branch September 30, 2025 12:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants