
add a clear_cache function! #1662

Closed
ArthurZucker wants to merge 6 commits into main from fix-cache-issues

Conversation

@ArthurZucker (Collaborator) commented Oct 21, 2024

Fixes #1539

Tried the script provided in the issue with this:

In [4]: from transformers import AutoTokenizer
   ...: import gc
   ...: 
   ...: tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", use_fast=True)
   ...: refresh_every = 100
   ...: 
   ...: for i in range(100000):
   ...:   s = f'{i} {i} ' * 10000
   ...:   tokenizer.encode(s)
   ...:   gc.collect()
   ...:   if i % 100 == 0:
   ...:     print(i)
   ...:   if i % refresh_every == 0:
   ...:     tokenizer._tokenizer.model.clear_cache()

and observed no memory leaks.
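The periodic-refresh pattern in the loop above can be factored into a small helper. A minimal sketch in plain Python, where `clear_fn` stands in for a bound method like `tokenizer._tokenizer.model.clear_cache` (the `PeriodicRefresher` name is hypothetical, not part of tokenizers):

```python
class PeriodicRefresher:
    """Call a cleanup function once every `every` invocations."""

    def __init__(self, clear_fn, every=100):
        self.clear_fn = clear_fn
        self.every = every
        self.count = 0

    def tick(self):
        self.count += 1
        if self.count % self.every == 0:
            self.clear_fn()


calls = []
refresher = PeriodicRefresher(lambda: calls.append("cleared"), every=3)
for _ in range(7):
    refresher.tick()
print(len(calls))  # cleared after the 3rd and 6th calls -> 2
```

In the encode loop from the comment above, `refresher.tick()` would replace the explicit `if i % refresh_every == 0` check.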

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@tomaarsen (Member)

For reviewers, you should be able to test this with:

pip install "git+https://github.com/huggingface/tokenizers.git@fix-cache-issues#egg=tokenizers&subdirectory=bindings/python"

If you get an error like

  Preparing editable metadata (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Preparing editable metadata (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [16 lines of output]
          Updating crates.io index
      error: failed to select a version for `env_logger`.
          ... required by package `tokenizers-python v0.20.1-dev.0 (C:\code\tokenizers\bindings\python)`
      versions that meet the requirements `^0.11` are: 0.11.5, 0.11.4, 0.11.3, 0.11.2, 0.11.1, 0.11.0

      the package `tokenizers-python` depends on `env_logger`, with features: `anstream` but `env_logger` does not have these features.
       It has an optional dependency with that name, but that dependency uses the "dep:" syntax in the features table, so it does not have an implicit feature with that name.


      failed to select a version for `env_logger` which could resolve this conflict
      💥 maturin failed
        Caused by: Cargo metadata failed. Does your crate compile with `cargo build`?
        Caused by: `cargo metadata` exited with an error:
      Error running maturin: Command '['maturin', 'pep517', 'write-dist-info', '--metadata-directory', 'C:\\Users\\tom\\AppData\\Local\\Temp\\pip-modern-metadata-z0rrcx8n', '--interpreter', 'C:\\Users\\tom\\.conda\\envs\\sentence-transformers\\python.exe']' returned non-zero exit status 1.
      Checking for Rust toolchain....
      Running `maturin pep517 write-dist-info --metadata-directory C:\Users\tom\AppData\Local\Temp\pip-modern-metadata-z0rrcx8n --interpreter C:\Users\tom\.conda\envs\sentence-transformers\python.exe`
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

then consider updating your Rust toolchain:

rustup update

Example script:
import random
import string
import time
import psutil
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('xlm-roberta-base')

def random_string(length: int) -> str:
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))

for iteration in range(99999999):
    start_t = time.time()
    tokenizer.encode_batch([random_string(12345) for _ in range(200)])
    memory_usage_in_MiB = psutil.Process().memory_info().rss / (1024 * 1024)
    delta_t = time.time() - start_t
    print(f"{iteration:02d}: {memory_usage_in_MiB:.2f}MB, {delta_t:.2f}s")

Without clearing cache

00: 357.11MB, 0.37s
01: 425.61MB, 0.51s
02: 494.96MB, 0.67s
03: 571.87MB, 0.88s
04: 654.53MB, 1.02s
05: 711.05MB, 1.22s
06: 764.70MB, 1.41s
07: 858.09MB, 1.56s
08: 920.98MB, 1.77s
09: 1026.02MB, 1.95s
10: 1080.01MB, 2.13s
11: 1145.63MB, 2.31s
12: 1217.38MB, 2.50s
13: 1278.04MB, 2.71s
14: 1350.16MB, 2.94s
15: 1555.26MB, 3.15s
16: 1642.76MB, 3.36s
17: 1713.18MB, 3.64s
18: 1786.88MB, 3.89s
19: 1860.20MB, 4.04s
20: 1933.23MB, 4.37s

With clearing cache

00: 355.80MB, 0.35s
01: 362.49MB, 0.39s
02: 359.70MB, 0.41s
03: 361.36MB, 0.39s
04: 362.23MB, 0.40s
05: 361.70MB, 0.39s
06: 362.02MB, 0.41s
07: 362.80MB, 0.42s
08: 365.39MB, 0.43s
09: 544.84MB, 0.41s
10: 382.10MB, 0.41s
11: 547.98MB, 0.40s
12: 561.95MB, 0.40s
13: 569.80MB, 0.42s
14: 544.12MB, 0.41s
15: 365.12MB, 0.41s
16: 544.34MB, 0.41s
17: 371.02MB, 0.43s
18: 540.93MB, 0.44s
19: 567.88MB, 0.44s
20: 572.43MB, 0.43s
21: 546.32MB, 0.40s
22: 370.14MB, 0.42s
23: 547.97MB, 0.40s
24: 572.14MB, 0.42s
25: 570.01MB, 0.42s
26: 575.11MB, 0.42s
27: 576.92MB, 0.43s
28: 551.04MB, 0.43s
29: 371.13MB, 0.42s
30: 545.81MB, 0.45s

It seems to work now: memory oscillates between roughly 350 MB and 580 MB instead of growing unboundedly, and latency stays constant, matching the lowest latency of the baseline. Well done finding the issue, @ArthurZucker!

- Tom Aarsen

@ArthurZucker (Collaborator, Author)

Yep, I can also:

  • add the ability to set the cache_size
  • automatically set cache_size based on available RAM, which IMO would be better
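One way a configurable cache_size could behave is as a bounded LRU cache: once the cap is reached, the least recently used entry is evicted instead of the cache growing without limit. A minimal sketch in Python (the actual tokenizers cache is implemented in Rust; this `BoundedCache` class is illustrative only):

```python
from collections import OrderedDict


class BoundedCache:
    """An LRU cache that never holds more than `capacity` entries."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used


cache = BoundedCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # "a" is now most recently used
cache.put("c", 3)   # evicts "b"
print(sorted(cache.entries))  # ['a', 'c']
```

Sizing the cap from available RAM would then just mean computing `capacity` from something like psutil's reported free memory at startup.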

@nixonjin commented Feb 21, 2025

I hit an error: AttributeError: 'tokenizers.models.WordPiece' object has no attribute 'clear_cache'. Could anyone tell me how to fix it?
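Since this PR adds clear_cache only to models that actually keep a cache, not every model class has the method (as the WordPiece error above shows). A version- and model-tolerant way to call it is to guard with getattr; sketched here with a dummy object standing in for `tokenizer._tokenizer.model` (the `maybe_clear_cache` helper is hypothetical, not part of tokenizers):

```python
class DummyModel:
    """Stands in for a model without clear_cache, e.g. WordPiece."""
    pass


def maybe_clear_cache(model):
    """Call model.clear_cache() only if the method exists; report whether it ran."""
    clear = getattr(model, "clear_cache", None)
    if callable(clear):
        clear()
        return True
    return False


print(maybe_clear_cache(DummyModel()))  # False: no clear_cache attribute
```

With this guard, the same refresh loop works whether the underlying model exposes clear_cache or not.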


Development

Successfully merging this pull request may close these issues.

Memory leak for large strings

4 participants