
add a clear_cache function! #1662

Closed
ArthurZucker wants to merge 6 commits into main from fix-cache-issues

Conversation

@ArthurZucker (Collaborator) commented Oct 21, 2024

Fixes #1539

Tried the script provided in the issue with this:

In [4]: from transformers import AutoTokenizer
   ...: import gc
   ...: 
   ...: tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", use_fast=True)
   ...: refresh_every = 100
   ...: 
   ...: for i in range(100000):
   ...:   s = f'{i} {i} ' * 10000
   ...:   tokenizer.encode(s)
   ...:   gc.collect()
   ...:   if i % 100 == 0:
   ...:     print(i)
   ...:   if i % refresh_every == 0:
   ...:     tokenizer._tokenizer.model.clear_cache()

and observed no memory leaks.
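The periodic-refresh pattern in the loop above can be factored into a small helper. A minimal sketch in plain Python, where `clear_fn` stands in for a bound method like `tokenizer._tokenizer.model.clear_cache` (the `PeriodicRefresher` name is hypothetical, not part of tokenizers):

```python
class PeriodicRefresher:
    """Call a cleanup function once every `every` invocations."""

    def __init__(self, clear_fn, every=100):
        self.clear_fn = clear_fn
        self.every = every
        self.count = 0

    def tick(self):
        self.count += 1
        if self.count % self.every == 0:
            self.clear_fn()


calls = []
refresher = PeriodicRefresher(lambda: calls.append("cleared"), every=3)
for _ in range(7):
    refresher.tick()
print(len(calls))  # cleared after the 3rd and 6th calls -> 2
```

In the encode loop from the comment above, `refresher.tick()` would replace the explicit `if i % refresh_every == 0` check.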

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@tomaarsen (Member)

For reviewers, you should be able to test this with:

pip install "git+https://github.com/huggingface/tokenizers.git@fix-cache-issues#egg=tokenizers&subdirectory=bindings/python"

If you get an error like

  Preparing editable metadata (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Preparing editable metadata (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [16 lines of output]
          Updating crates.io index
      error: failed to select a version for `env_logger`.
          ... required by package `tokenizers-python v0.20.1-dev.0 (C:\code\tokenizers\bindings\python)`
      versions that meet the requirements `^0.11` are: 0.11.5, 0.11.4, 0.11.3, 0.11.2, 0.11.1, 0.11.0

      the package `tokenizers-python` depends on `env_logger`, with features: `anstream` but `env_logger` does not have these features.
       It has an optional dependency with that name, but that dependency uses the "dep:" syntax in the features table, so it does not have an implicit feature with that name.


      failed to select a version for `env_logger` which could resolve this conflict
      💥 maturin failed
        Caused by: Cargo metadata failed. Does your crate compile with `cargo build`?
        Caused by: `cargo metadata` exited with an error:
      Error running maturin: Command '['maturin', 'pep517', 'write-dist-info', '--metadata-directory', 'C:\\Users\\tom\\AppData\\Local\\Temp\\pip-modern-metadata-z0rrcx8n', '--interpreter', 'C:\\Users\\tom\\.conda\\envs\\sentence-transformers\\python.exe']' returned non-zero exit status 1.
      Checking for Rust toolchain....
      Running `maturin pep517 write-dist-info --metadata-directory C:\Users\tom\AppData\Local\Temp\pip-modern-metadata-z0rrcx8n --interpreter C:\Users\tom\.conda\envs\sentence-transformers\python.exe`
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

then consider updating your Rust toolchain:

rustup update

Example script:
import random
import string
import time
import psutil
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('xlm-roberta-base')

def random_string(length: int) -> str:
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))

for iteration in range(99999999):
    start_t = time.time()
    tokenizer.encode_batch([random_string(12345) for _ in range(200)])
    memory_usage_in_MiB = psutil.Process().memory_info().rss / (1024 * 1024)
    delta_t = time.time() - start_t
    print(f"{iteration:02d}: {memory_usage_in_MiB:.2f}MB, {delta_t:.2f}s")

Without clearing cache

00: 357.11MB, 0.37s
01: 425.61MB, 0.51s
02: 494.96MB, 0.67s
03: 571.87MB, 0.88s
04: 654.53MB, 1.02s
05: 711.05MB, 1.22s
06: 764.70MB, 1.41s
07: 858.09MB, 1.56s
08: 920.98MB, 1.77s
09: 1026.02MB, 1.95s
10: 1080.01MB, 2.13s
11: 1145.63MB, 2.31s
12: 1217.38MB, 2.50s
13: 1278.04MB, 2.71s
14: 1350.16MB, 2.94s
15: 1555.26MB, 3.15s
16: 1642.76MB, 3.36s
17: 1713.18MB, 3.64s
18: 1786.88MB, 3.89s
19: 1860.20MB, 4.04s
20: 1933.23MB, 4.37s

With clearing cache

00: 355.80MB, 0.35s
01: 362.49MB, 0.39s
02: 359.70MB, 0.41s
03: 361.36MB, 0.39s
04: 362.23MB, 0.40s
05: 361.70MB, 0.39s
06: 362.02MB, 0.41s
07: 362.80MB, 0.42s
08: 365.39MB, 0.43s
09: 544.84MB, 0.41s
10: 382.10MB, 0.41s
11: 547.98MB, 0.40s
12: 561.95MB, 0.40s
13: 569.80MB, 0.42s
14: 544.12MB, 0.41s
15: 365.12MB, 0.41s
16: 544.34MB, 0.41s
17: 371.02MB, 0.43s
18: 540.93MB, 0.44s
19: 567.88MB, 0.44s
20: 572.43MB, 0.43s
21: 546.32MB, 0.40s
22: 370.14MB, 0.42s
23: 547.97MB, 0.40s
24: 572.14MB, 0.42s
25: 570.01MB, 0.42s
26: 575.11MB, 0.42s
27: 576.92MB, 0.43s
28: 551.04MB, 0.43s
29: 371.13MB, 0.42s
30: 545.81MB, 0.45s

It seems to work now: memory oscillates between roughly 350 MB and 580 MB instead of growing unboundedly, and latency stays constant, matching the lowest latency of the baseline. Well done finding the issue, @ArthurZucker!

- Tom Aarsen

@ArthurZucker (Collaborator, Author)

Yep, I can also:

  • add the ability to set the cache_size
  • automatically set cache_size based on available RAM, which IMO would be better
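One way a configurable cache_size could behave is as a bounded LRU cache: once the cap is reached, the least recently used entry is evicted instead of the cache growing without limit. A minimal sketch in Python (the actual tokenizers cache is implemented in Rust; this `BoundedCache` class is illustrative only):

```python
from collections import OrderedDict


class BoundedCache:
    """An LRU cache that never holds more than `capacity` entries."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used


cache = BoundedCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # "a" is now most recently used
cache.put("c", 3)   # evicts "b"
print(sorted(cache.entries))  # ['a', 'c']
```

Sizing the cap from available RAM would then just mean computing `capacity` from something like psutil's reported free memory at startup.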

@nixonjin commented Feb 21, 2025

I hit an error: AttributeError: 'tokenizers.models.WordPiece' object has no attribute 'clear_cache'. Could anyone tell me how to fix it?
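Since this PR adds clear_cache only to models that actually keep a cache, not every model class has the method (as the WordPiece error above shows). A version- and model-tolerant way to call it is to guard with getattr; sketched here with a dummy object standing in for `tokenizer._tokenizer.model` (the `maybe_clear_cache` helper is hypothetical, not part of tokenizers):

```python
class DummyModel:
    """Stands in for a model without clear_cache, e.g. WordPiece."""
    pass


def maybe_clear_cache(model):
    """Call model.clear_cache() only if the method exists; report whether it ran."""
    clear = getattr(model, "clear_cache", None)
    if callable(clear):
        clear()
        return True
    return False


print(maybe_clear_cache(DummyModel()))  # False: no clear_cache attribute
```

With this guard, the same refresh loop works whether the underlying model exposes clear_cache or not.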


Development

Successfully merging this pull request may close these issues.

Memory leak for large strings

4 participants