speed-up Structure instantiation#4415
Conversation
Signed-off-by: Daniel Zügner <daniel.zuegner@gmail.com>
|
Thanks. There is no deep reason why the hashes need to be the atomic number. But I am confused why hash collisions would occur if it is the atomic number. The atomic number is pretty much unique (except for isotopes, which is an esoteric use case)? Why would a hash collision occur for different elements with the atomic number as the hash? Is the main performance benefit just the increase in lru_cache size? |
|
The problem is that the Only increasing |
|
I see. Thanks. I am merging this now. The test errors are the result of poorly written tests comparing strings. I will fix those on my end. |
Summary
This PR speeds up the instantiation of
Structureobjects by preventing hash collisions in thelru_cacheofget_el_spand increasing itsmaxsize. The issue is that currentlyElementobjects are hashed to the same value as the integer atomic numbers (e.g.,Element[H]maps to the same hash asint(1)). This forces thelru_hashto perform an expensive__eq__comparison between the two, which reduces the performance of instantiating manyStructureobjects. Also here we increase themaxsizeofget_el_sp'slru_cacheto 1024 for further performance improvements.This reduces time taken to instantiate 100,000
Structureobjects from 31 seconds to 8.7s (avoid hash collisions) to 6.1s (also increasemaxsizeto 1024). Test snippet:Unless I'm missing some deep reason for why the hash of
Elementobjects should be the same as their atomic number hashes, this should not lead to any different behavior of the code.Major changes:
ElementBase__hash__function to map to a different value than the atomic numbermaxsizeofget_el_sp'slru_cacheto 1024Checklist
ruff.mypy.duecredit@due.dcitedecorators to reference relevant papers by DOI (example)Tip: Install
pre-commithooks to auto-check types and linting before every commit: