Support sharded safetensors in TF #29350
This should be ready for review! cc @ArthurZucker @a8nova
LysandreJik left a comment:
Looks good! When were you thinking of eventually switching to safetensors serialization by default?
```python
if tf_model._keys_to_ignore_on_load_unexpected is not None:
    for pat in tf_model._keys_to_ignore_on_load_unexpected:
        unexpected_keys = [k for k in unexpected_keys if re.search(pat, k) is None]
if not skip_logger_warnings:
```
It is, unfortunately! The reason is that this function is used to load both sharded and non-sharded checkpoints. When it's loading a non-sharded checkpoint, we want to log missing keys immediately. When it's loading a shard, there will always be lots of "missing" keys, but we don't want to log those - instead, we only want to log keys that are missing from every shard, which we can only know after all shards have been loaded. This is handled in the sharded loading function.
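(To illustrate the pattern described here - a hypothetical sketch, not the actual transformers code; `load_one_file` stands in for the single-checkpoint loader:)

```python
def load_sharded_checkpoint(model, shard_files, load_one_file):
    """Hypothetical sketch: aggregate missing-key warnings across shards."""
    missing_per_shard = []
    for shard_file in shard_files:
        # Each shard holds only a subset of the weights, so most keys look
        # "missing" to any single shard -- suppress per-shard warnings.
        missing = load_one_file(model, shard_file, skip_logger_warnings=True)
        missing_per_shard.append(set(missing))
    # Only warn about keys that no shard provided.
    truly_missing = set.intersection(*missing_per_shard) if missing_per_shard else set()
    if truly_missing:
        print(f"Some weights were not found in any shard: {sorted(truly_missing)}")
```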
| ): | ||
| all_loading_infos = [] | ||
| for shard in safetensors_shards: | ||
| with safe_open(shard, framework="tf") as safetensors_archive: |
There was a problem hiding this comment.
Shouldn't this load from the PT framework if we're "loading pytorch shards in tensorflow models"?
`safe_open(framework="tf")` just loads the tensors as `tf.Tensor` instead of `torch.Tensor` - the actual value of the tensor is unchanged. However, we still need to handle weight renaming + transposes, so we still need a pt-to-tf function.
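(A minimal illustration of this behavior; the shard filename below is a placeholder:)

```python
import tensorflow as tf
from safetensors import safe_open

# framework="tf" controls only the returned tensor type, not the stored values.
with safe_open("model-00001-of-00002.safetensors", framework="tf") as f:
    for key in f.keys():
        tensor = f.get_tensor(key)
        assert isinstance(tensor, tf.Tensor)
        # The keys are still the PyTorch weight names, and e.g. Linear kernels
        # keep their PyTorch layout, so renaming/transposing still happens in a
        # separate pt-to-tf step.
```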
```python
for p1, p2 in zip(model.weights, ref_model.weights):
    assert np.allclose(p1.numpy(), p2.numpy())
```

```python
@require_safetensors
```
`safetensors` is now a base dependency, so maybe we should eventually just remove all of these `@require_safetensors` decorators.
Makes sense - do you want me to just do it in this PR?
@LysandreJik I think now that we have proper support, we can switch to safetensors by default immediately, either in this PR or in a follow-up.

I'd wait a few weeks just to ensure we don't get failure reports, and switch for the next version. WDYT?

Sounds good to me!
Force-pushed from 46e5c56 to efdb604.
@LysandreJik is there anything else to be resolved before I merge this? (Except for the failing test, but that's not specific to this PR.)
Force-pushed from a1001c8 to a9f240e.
amyeroberts left a comment:
Thanks for working on this and enabling sharded support!
Just a few small comments / questions
```python
# This should not raise even if there are two types of sharded weights
# This should discard the safetensors weights in favor of the .h5 sharded weights
TFBertModel.from_pretrained("hf-internal-testing/tiny-bert-tf-safetensors-h5-sharded")
```
Don't we still want this test, to make sure things are backwards compatible for now - i.e. that I can load sharded h5 files even if safetensors weights are available?
tests/test_modeling_tf_utils.py (Outdated)
```python
# Note: pickle adds some junk so the weight of the file can end up being slightly bigger than
# the size asked for (since we count parameters)
if size >= max_size_int + 50000:
    with h5py.File(shard_file, "r") as state_file:
```
What does this represent here?
Good catch - that was copied from the h5 test, and wouldn't work for safetensors - we just got lucky that it wasn't called in the tests anyway. I removed it!
```python
mismatched_keys = sum([info["mismatched_keys"] for info in all_loading_infos], [])

if not skip_logger_warnings:
    if len(unexpected_keys) > 0:
```
AFAICT, all these checks are the same as the ones above. Can we abstract these out to e.g. `validate_keys(unexpected_keys, missing_keys, mismatched_keys)` and call that in both of the functions?
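(For reference, a sketch of what such a helper could look like - hypothetical code; in the later diff below this suggestion appears to have become `_log_key_warnings`:)

```python
import logging

logger = logging.getLogger(__name__)

def validate_keys(unexpected_keys, missing_keys, mismatched_keys, class_name):
    # Emit the warnings in one place instead of duplicating them in both loaders.
    if unexpected_keys:
        logger.warning(f"Some weights in the checkpoint were not used by {class_name}: {unexpected_keys}")
    if missing_keys:
        logger.warning(f"Some weights of {class_name} were not found in the checkpoint: {missing_keys}")
    if mismatched_keys:
        logger.warning(f"Some weights of {class_name} had shapes that did not match the checkpoint: {mismatched_keys}")
```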
```python
unexpected_keys = sum([info["unexpected_keys"] for info in all_loading_infos], [])
mismatched_keys = sum([info["mismatched_keys"] for info in all_loading_infos], [])

if not skip_logger_warnings:
```
My understanding is that we want `skip_logger_warnings=True` when calling `load_pytorch_state_dict_in_tf2_model` here, but I don't see why we're enabling skipping here? Silencing the logging warnings should really be hidden functionality (maybe with a param `_skip_logger_warnings`), and not something people calling either function use.
You're right, actually - this function is only called once and warnings are always emitted, so the argument isn't needed at all. I removed it!
Force-pushed from 55519b6 to e7a2c24.
All comments addressed @amyeroberts! I think we should be ready now.
amyeroberts left a comment:
Thanks for adding this support!
```python
    return tf_model


def _log_key_warnings(missing_keys, unexpected_keys, mismatched_keys, class_name):
```
nit - the definition should go above the code where it's used, i.e. before `load_pytorch_state_dict_in_tf2_model`
```diff
-def tf_shard_checkpoint(weights, max_shard_size="10GB"):
+def tf_shard_checkpoint(weights, max_shard_size="10GB", weights_name: str = TF2_WEIGHTS_NAME):
```
Should we have the default shard size match the one in `load_tf_weights`?
I got here by following the torch code, which also has the same issue! Specifically, `save_pretrained()` has a default size of 5GB, but the actual checkpoint sharding methods have a default size of 10GB. In general, though, the value passed from `save_pretrained()` will override those values.
It's a very minor detail either way, since I think both 5GB and 10GB shards work fine! We could consider standardizing everything at some point, but I don't think it's a high priority.
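(For example - a hedged sketch, with the repo name and path as placeholders - the explicit argument takes precedence over both defaults:)

```python
from transformers import TFBertModel

model = TFBertModel.from_pretrained("hf-internal-testing/tiny-random-bert")
# The max_shard_size passed here overrides both the save_pretrained() default
# (5GB) and the sharding helper's default (10GB).
model.save_pretrained("./checkpoint", max_shard_size="2GB")
```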
```diff
 with safe_open(resolved_archive_file, framework="tf") as f:
     safetensors_metadata = f.metadata()
-    if safetensors_metadata is None or safetensors_metadata.get("format") not in ["pt", "tf", "flax"]:
+    if safetensors_metadata is None or safetensors_metadata.get("format") not in ["pt", "tf", "flax", "mlx"]:
```
Looks like we might want a constant, e.g. `SUPPORTED_SAFE_FORMATS = ["pt", "tf", "flax", "mlx"]`, so we don't have to update this in several locations here and for PT (for another PR).
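(Something like the following - a hypothetical sketch of the suggested constant, with the helper name made up for illustration:)

```python
from safetensors import safe_open

# Hypothetical constant, per the suggestion above; shared by every call site
# that validates checkpoint metadata.
SUPPORTED_SAFE_FORMATS = ["pt", "tf", "flax", "mlx"]

def check_safetensors_format(resolved_archive_file):
    """Raise if the checkpoint does not declare a supported safetensors format."""
    with safe_open(resolved_archive_file, framework="tf") as f:
        metadata = f.metadata()
    if metadata is None or metadata.get("format") not in SUPPORTED_SAFE_FORMATS:
        raise OSError(
            f"{resolved_archive_file} does not declare a supported safetensors format "
            f"(expected one of {SUPPORTED_SAFE_FORMATS})."
        )
```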
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
This looks ready to go, cc @a8nova! We have a branch cut + release planned on Monday, though, and since this touches a lot of core code I don't want to merge it right before a release. Instead, I suggest merging it right after the branch cut, and then we can finalize and merge the PRs that are blocked by it: TF-IDEFICS, TF-Gemma, and possibly the Mistral/Mixtral PRs if @ariG23498 can have one of them ready by then (no stress, obviously!). Then we could launch all the new Keras models together in the following release and do a section in the release notes about them, crediting @a8nova and @ariG23498?
cc @a8nova and @ariG23498, this has now been merged. If you rebase your PRs, that should resolve any issues with sharded safetensors loading! |
Right now our TF safetensors loading doesn't support sharded checkpoints, which is a problem as more and more big models move to safetensors-only weights! This is currently blocking @a8nova's PR at #26870.
As sharded safetensors saving for TF was also missing, this PR adds that as well, and expands the tests to cover both.
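(An end-to-end sketch of what this enables - the repo name and shard size below are illustrative placeholders, not taken from this PR's tests:)

```python
import numpy as np
from transformers import TFBertModel

model = TFBertModel.from_pretrained("hf-internal-testing/tiny-random-bert")

# A small max_shard_size forces several .safetensors shards plus an index file.
model.save_pretrained("./tiny-bert-sharded", max_shard_size="50KB", safe_serialization=True)

# Loading sharded safetensors checkpoints now works too.
reloaded = TFBertModel.from_pretrained("./tiny-bert-sharded")
for p1, p2 in zip(model.weights, reloaded.weights):
    assert np.allclose(p1.numpy(), p2.numpy())
```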
TODO: