[WIP] Free speed/mem optimizations with ahash, dary_heap, and compact_str #1618
Conversation
…bindings broken without library refactor)
🥲 Super cool work! Would be cool to see which of […] In the meantime, feel free to open a smaller PR with only the […]
@McPatate, sure, I can repackage this - although @MeetThePatel is also welcome to cherry-pick or take inspiration from this too. PS: As far as FxHash vs AHash goes, I think we might get even better performance thanks to a recent optimality breakthrough:
Ideally one of the existing, mature hash libraries would pick up the key ideas, but that doesn't seem to have happened yet.
From my understanding, the hashmaps presented in the paper are optimized for reducing worst-case performance. I quickly glanced at the algorithm a couple of weeks ago, and I suspect the average-case wall-clock performance would be worse than SwissTable (what `std::collections::HashMap` uses), as SwissTable has been SIMD optimized - although I haven't tested this at all, just a hunch.
I'd give it some time, the paper is quite recent! I'd rather go for a more mature and maintained library for the moment, and in any case changing the hashmap data structure seems quite straightforward.
➕ to @McPatate's answer!
Summary
Given that this library is largely an interface to hash maps of strings in Rust, we can get "free" 5-25% speedups by using stable, well-tested drop-in replacements like `ahash::HashMap`, `dary_heap::DaryHeap`, and `CompactString`. The improvements span both training and subsequent encode/decode.
Notes
- Alternatives considered for the string type: `smol`, or a custom Huffman encoding for shorter lengths (using a `BiHashMap` like the one from `bimap`), etc. This was the best performing.
- `benches` look good (shown below).
- Clean under `valgrind` (although there was already ~420K leaked with `cargo bench` on HEAD).
Issue
Because of the way that the interface is organized across the core rust library and py/node bindings, there isn't an easy way to merge this with support for encode/decode.
For example, because `Model` is defined on the Rust side and `Vocab` traits are used differently between different models, we'd have to use `pyo3` within the Rust library for `FromPyObject`. In theory, we could implement these changes only within the trainer, but the real user-facing/environmental impact would come from implementing them in the encode/decode bindings, where most usage probably occurs.
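A pseudocode sketch of the coupling problem (not compiled; apart from pyo3's `FromPyObject` trait, the type and method names here are illustrative assumptions, not this repo's real API):

```rust
// In the core (pure-Rust) crate, each model fixes its own vocab shape:
trait Model {
    type Vocab; // BPE, WordPiece, Unigram each use this differently
    fn tokenize(&self, sequence: &str) -> Vec<Token>;
}

// Switching Vocab to ahash/CompactString-backed types means the Python
// bindings must convert Python dicts into *those* types, which drags a
// pyo3 impl like this into (or alongside) the core crate:
impl<'py> FromPyObject<'py> for AHashVocab {
    fn extract_bound(obj: &Bound<'py, PyAny>) -> PyResult<Self> {
        // convert a Python dict[str, int] into the ahash-backed map ...
    }
}
```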
Choices
Assuming you want to merge something like this, I think we have a few choices:
Trainers.

Example Benchmark (i7-12700K)
NB: We replaced `data/big.txt` with a much larger text corpus (271 MB vs 6.2 MB), but results were comparable for the original `data/big.txt`.
Results:
Before
After
`time -v` comparisons (new vs old):