
Make Cache a subclass of torch.Tensor #35792

Closed
IlyasMoutawwakil wants to merge 19 commits into main from tensor-cache

Conversation

@IlyasMoutawwakil
Member

What does this PR do?

Both torch script tracing and torch dynamo/fx have restrictions on input types (torch script has more) which makes the export fail as one torch module (the model) is passing another (the cache) around as its input. Having Cache be a subclass of torch.Tensor bypasses these issues and imo makes more sense as the Cache class has no forward and is just a container of torch tensors.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Contributor

@gante left a comment


In principle LGTM. I'm calling in the torch.export <> transformers expert to double-check that these changes are also okay for that goal 🤗

Question: a Cache object holds a list of tensors, usually a pair of tensors per layer. In some cases, different tensors of a cache can live on different devices. Would this conflict with the new inheritance?

Double-checks:

  1. Have you confirmed that slow llama tests and slow cache tests have no regressions with respect to main? (RUN_SLOW=1 py.test tests/models/llama/test_modeling_llama.py -vv and RUN_SLOW=1 py.test tests/utils/test_cache_utils.py -vv)
  2. Have you confirmed that llama + static cache + compilation preserves throughput? (can share a script if needed :) )

{},
proxy_factory_fn=create_cache_proxy_factory_fn(StaticCache),
)
# def create_cache_proxy_factory_fn(orig_cache_cls: Type[Cache]) -> Callable[[Node], HFCacheProxy]:
Contributor


This is for optimum and you're part of optimum, so I'm assuming it's okay :D

Member Author


Yeah, I'm not sure why this was needed either; tagging @echarlaix @mht-sharma for more info

Contributor


Not sure either

Collaborator


adding @michaelbenayoun who worked on this

Member


It was to be able to record Cache related operations in the fx graph. If another easier solution has been found, I'm all for it.

@gante
Contributor

gante commented Jan 20, 2025

@guangy10 as requested on Slack, have a look if you're available 🙏

@guangy10
Contributor

For correctness testing: we don't have extensive tests, but we do have some correctness guarantees for supported models via test_export_static_cache (pointer). Can you run the slow tests on this PR?

Also, I'm not exactly sure the StaticCache will function as expected. With nn.Module, the cache is registered as a mutable buffer and lifted to a graph input during export. I'm curious how this works with a tensor subclass: it seems tensor subclasses do not directly support buffer registration the way nn.Module does. Can we compare the graphs produced by the nn.Module solution vs. the tensor subclass solution?
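To make the buffer-registration point concrete, here is a toy sketch (hypothetical class and shapes, not the actual transformers StaticCache) of how an nn.Module-based cache exposes its preallocated tensors as buffers, which the exporter can then lift into the graph as mutable state:

```python
import torch
import torch.nn as nn

# Toy sketch, not the real transformers StaticCache: an nn.Module-based
# cache registers its preallocated key/value tensors as buffers.
class ToyStaticCache(nn.Module):
    def __init__(self, num_layers: int, max_seq_len: int, head_dim: int):
        super().__init__()
        for i in range(num_layers):
            self.register_buffer(f"key_cache_{i}", torch.zeros(max_seq_len, head_dim))
            self.register_buffer(f"value_cache_{i}", torch.zeros(max_seq_len, head_dim))

cache = ToyStaticCache(num_layers=2, max_seq_len=8, head_dim=4)
# Buffers show up in named_buffers()/state_dict(), which is what lets
# torch.export treat them as lifted, mutable graph inputs.
print(len(dict(cache.named_buffers())))  # 2 layers x (key + value) = 4
```

A plain torch.Tensor subclass has no named_buffers()/state_dict() machinery of its own, which is the crux of the question above.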

Alternatively, since the motivation is to handle legacy torch script tracing (I assume traffic to this path will decrease over time), would it be a cleaner separation to create a dedicated Cache subclass for it while keeping the one for PyTorch 2.0+ as nn.Module? No need to maintain compatibility with the torch script solution.

@IlyasMoutawwakil
Member Author

IlyasMoutawwakil commented Jan 22, 2025

Question: a Cache object holds a list of tensors, usually a pair of tensors per layer. In some cases, different tensors of a cache can live on different devices. Would this conflict with the new inheritance?

Shouldn't be an issue: we're not using _make_subclass() but rather _make_wrapper_subclass(). The difference is explained by @albanD in https://dev-discuss.pytorch.org/t/whats-the-difference-between-torch-tensor-make-subclass-and-torch-tensor-make-wrapper-subclass/1839:

These two functions do quite different things. The main difference is that when you do _make_subclass(), the current object is an honest-to-goodness Tensor with data in its storage and everything. When you do _make_wrapper_subclass(), the current object has no data and it is expected that some field on the Tensor will be another Tensor (hence the outer one being called wrapper) that contains real data.

One example is the QuantizedTensor subclass, which has two dtypes (a public one, qt.dtype, and an internal one, qt._data.dtype).

Have you confirmed that slow llama tests and slow cache tests have no regressions with respect to main? (RUN_SLOW=1 py.test tests/models/llama/test_modeling_llama.py -vv and RUN_SLOW=1 py.test tests/utils/test_cache_utils.py -vv)
Have you confirmed that llama + static cache + compilation preserves throughput? (can share a script if needed :) )

Running them right now (btw, is there a way to trigger them on the CI?). I was only running the llama fast tests and the llama + executorch integration tests.

@IlyasMoutawwakil
Member Author

IlyasMoutawwakil commented Jan 22, 2025

Edit: confirmed these two tests fail on main as well.

Running RUN_SLOW=1 pytest tests/models/llama/test_modeling_llama.py -vv gives two errors, which I guess are related to the machine I'm testing on (A100 vs. the A10 used in the CI):

FAILED tests/models/llama/test_modeling_llama.py::LlamaIntegrationTest::test_llama_3_1_hard - AssertionError: 'Tell[74 chars]ical social and political upheaval in France t[557 chars]s.\n' != 'Tell[74 chars]ical political...
FAILED tests/models/llama/test_modeling_llama.py::LlamaIntegrationTest::test_model_7b_logits_bf16 - AssertionError: False is not true

In the first, "social and political" is reversed to "political and social":

E       AssertionError: 'Tell[74 chars]ical social and political upheaval in France t[557 chars]s.\n' != 'Tell[74 chars]ical political and social upheaval in France t[557 chars]s.\n'
E       Diff is 1259 characters long. Set self.maxDiff to None to see it.

In the second, the assertion is not verbose enough:

>       self.assertTrue(
            torch.allclose(
                EXPECTED_MEAN[self.cuda_compute_capability_major_version].to(torch_device),
                out.logits.float().mean(-1),
                atol=1e-2,
                rtol=1e-2
            )
        )
E       AssertionError: False is not true

Adding some verbosity:

E       AssertionError: False is not true : Expected: tensor([[-6.5208, -4.1218, -4.9377, -3.2536,  0.8127, -2.9811,  1.2918, -3.3848]],
E              device='cuda:0')
E       Got: tensor([[-6.5081, -4.1175, -4.9761, -3.1678,  0.8199, -3.0029,  1.2809, -3.3309]],
E              device='cuda:0')
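As an aside, torch.testing.assert_close produces this kind of verbose report out of the box; a sketch using the logit means above:

```python
import torch

# The expected/actual mean logits from the failing test above.
expected = torch.tensor([[-6.5208, -4.1218, -4.9377, -3.2536, 0.8127, -2.9811, 1.2918, -3.3848]])
got = torch.tensor([[-6.5081, -4.1175, -4.9761, -3.1678, 0.8199, -3.0029, 1.2809, -3.3309]])

try:
    # Unlike assertTrue(torch.allclose(...)), assert_close raises an
    # AssertionError whose message reports the number of mismatched
    # elements and the greatest absolute/relative differences.
    torch.testing.assert_close(got, expected, atol=1e-2, rtol=1e-2)
except AssertionError as e:
    print(e)
```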

@IlyasMoutawwakil
Member Author

Thanks everyone, I'm replacing this PR with #35873 that's less restrictive.

@guangy10
Contributor

Thanks everyone, I'm replacing this PR with #35873 that's less restrictive.

Yeah, this one looks much cleaner. Do you mind rerunning the slow export/executorch tests on this PR?

After that, can you run the cross-repo integration tests in Optimum (running them locally is fine)? @echarlaix recently moved ExecuTorch to a new repo, huggingface/optimum-executorch; however, all documentation/tutorials were deleted unintentionally after the move.

It used to be as simple as:

pip install optimum[exporters-executorch]

Override the installed transformers version to your dev version including this PR, then simply run

pytest executorch/*/test_*.py -s -vvvv --durations=0

@echarlaix @michaelbenayoun can you guide @IlyasMoutawwakil on how to run the executorch e2e tests?

@IlyasMoutawwakil
Member Author

Do you mind rerunning the slow export/executorch tests on this PR?

They pass. I made sure both the llama and cache slow tests pass on this PR before dropping it, so that it can be used for future reference when subclassing the tensor class.

After that, can you run cross-repo integration tests in Optimum (running it locally is fine) ?

Can do that later.
