[WIP] Free speed/mem optimizations with ahash, dary_heap, and compact_str #1618
Conversation
…bindings broken without library refactor)
🥲 Super cool work! Would be cool to see which of […] In the meantime, feel free to open a smaller PR with only the […]
@McPatate, sure, I can repackage this - although @MeetThePatel is also welcome to cherry-pick or take inspiration from this too. PS: As far as FxHash vs AHash goes, I think we might get even better performance thanks to a recent optimality breakthrough:
Ideally one of the existing, mature hash libraries would pick up the key ideas, but that doesn't seem to have happened yet.
From my understanding, the hashmaps presented in the paper are optimized for reducing worst-case performance. I quickly glanced at the algorithm a couple of weeks ago, and I suspect the average-case wall-clock performance would be worse than SwissTable (what `std::collections::HashMap` uses), as SwissTable has been SIMD optimized - although I haven't tested this at all, just a hunch.
I'd give it some time, the paper is quite recent! I'd rather go for a more mature and maintained library for the moment, and in any case changing the hashmap data structure seems quite straightforward.
➕ to @McPatate's answer!
Summary
Given that this library is largely an interface to hash maps of strings in Rust, we can get "free" 5-25% speedups by using stable, well-tested drop-in replacements like `ahash::HashMap`, `dary_heap::DaryHeap`, and `CompactString`. The improvements span both training and subsequent encode/decode.
Notes
- Alternatives considered for the string type: `smol`, or a custom Huffman encoding for shorter lengths (using a `BiHashMap` like the one from `bimap`), etc. This was the best performing.
- `benches` look good (shown below).
- Clean under `valgrind` (although there was already ~420K leaked with `cargo bench` on HEAD).
Issue
Because of the way that the interface is organized across the core rust library and py/node bindings, there isn't an easy way to merge this with support for encode/decode.
For example, because `Model` is defined on the Rust side and `Vocab` traits are used differently between different models, we'd have to use `pyo3` within the Rust library for `FromPyObject`. In theory, we could implement these changes only within the trainer, but the real user-facing/environmental impact would come from implementing them in the encode/decode bindings, where most usage probably occurs.
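A pseudocode sketch of the coupling problem (not compiled; apart from pyo3's `FromPyObject` trait, the type and method names here are illustrative assumptions, not this repo's real API):

```rust
// In the core (pure-Rust) crate, each model fixes its own vocab shape:
trait Model {
    type Vocab; // BPE, WordPiece, Unigram each use this differently
    fn tokenize(&self, sequence: &str) -> Vec<Token>;
}

// Switching Vocab to ahash/CompactString-backed types means the Python
// bindings must convert Python dicts into *those* types, which drags a
// pyo3 impl like this into (or alongside) the core crate:
impl<'py> FromPyObject<'py> for AHashVocab {
    fn extract_bound(obj: &Bound<'py, PyAny>) -> PyResult<Self> {
        // convert a Python dict[str, int] into the ahash-backed map ...
    }
}
```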
Choices
Assuming you want to merge something like this, I think we have a few choices:
Trainers.

Example Benchmark (i7-12700K)
NB: We replaced `data/big.txt` with a much larger text corpus (271 MB vs 6.2 MB), but results were comparable for the original `data/big.txt`.
Results:
Before
After
`time -v` comparisons (new vs old):