Conversation
|
Looks good, not a blocker but sometimes when you need to pass many variables around it can be useful to encapsulate them in a private inner class (e.g. ALogPDescriptor.State). I still maintain the correct way to parallelise descriptors is you just have copies of the descriptors for each thread :-). What does your parallelisation look like - it might be good idea to integrate a parallel mode into the descriptor engine class. |
Thanks for the tip, should have thought about it, sorry.
We did that and it is the fastest way, so far, as it seems (if you are not under any memory constraints). Now, we would like to check whether we can still gain some speed when the descriptor instances are shared between the threads.
Our two parallelisation strategies are 1) Distribute the molecules onto threads (one per mol) and have them access shared descriptor classes and 2) Distribute the descriptors onto threads (one per desc) and have them access the shared molecules (hence my recent "fix" of the LargestPiSystemDescriptor that changed flags in the mol but did not make a copy first like others). We are basically working on a "parallelised descriptor engine" that we would also present to you for CDK integration when it's complete. If you want to already have a look (we are still actively working on it): Parallelisation method 1) is not really there yet because these three descriptors are not ready for it yet. Instead, we experimented with synchronizing the shared descriptor instance and creating a new instance each time (like you suggested). The main point of our library is to generate a large float matrix of descriptor values for a given set of molecules that can, e.g., be used for clustering. |
OK the best strategy as actually to chunk your input and give those to each thread. Depends on timings but something like 4096 molecules per chunk - keep them in the input format (e.g. SMILES) as long as possible. You want to minimise the synchronisation points as much as possible - this is particularly important on higher core counts. |
In the three descriptor classes, instance variables relating to the current calculation, i.e. the given molecule, were turned into local method variables and are now passed between internal private methods as parameters, so that the classes are parallelisable.
We are doing descriptor calculations for batches of molecules and tried different ways of doing that in parallel for speed-up. This way, we realised that only these three descriptors were not parallelisable.
@egonw, since it was only three in the end, I didn't report them to you directly but instead went on to simply fix them immediately.
I also "linted" the classes a bit here and there and fixed some smaller Sonar complaints.
What do you think?