Make Whisper Encoder's sinusoidal PE non-trainable by default #26032
sanchit-gandhi merged 27 commits into huggingface:main from gau-nernst:whisper_encoder_pe
Conversation
Hey @gau-nernst - thanks very much for opening this PR! Looks like a great start already. I pushed the Flax changes in the latest commit. In short, the simplest way of setting the parameters to un-trainable in Flax is by stopping the back-prop through the layers. Otherwise, we need to explicitly pass a dict to the optimiser that defines which parameters are trainable/non-trainable (see https://colab.research.google.com/drive/1K-5bz6R6kt9GAvaUHvzYvvA-IOAO2PhL#scrollTo=BrF6Dtb8GlkJ). There isn't currently a test that checks the embed params are non-trainable, but you could certainly add one. It could follow the style of the test we use to check that the encoder is correctly frozen when we do decoder-only fine-tuning. Regarding initialising the weights with sinusoidal embeddings - I agree that this should be the default case! In 99% of cases users will just use the model from pre-trained, in which case the embeddings will be initialised with the sinusoids, but if a user were to randomly initialise the model, the embeddings would be initialised incorrectly.
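A test along those lines could look roughly like the following. This is a minimal PyTorch sketch with a toy module standing in for the real `WhisperEncoder`; the class and parameter names here are made up for illustration:

```python
import torch.nn as nn

class ToyEncoder(nn.Module):
    # stand-in for WhisperEncoder: a single positional-embedding table that we freeze
    def __init__(self, max_positions=8, d_model=4):
        super().__init__()
        self.embed_positions = nn.Embedding(max_positions, d_model)
        self.embed_positions.requires_grad_(False)  # freeze the PE

enc = ToyEncoder()
frozen = [name for name, p in enc.named_parameters() if not p.requires_grad]
print(frozen)  # ['embed_positions.weight']
```

A real test would build the full model and assert that exactly the encoder's `embed_positions.weight` appears in this frozen list.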
That's a great solution! However, from what I understand, it means that in the Flax implementation it is not possible (or at least not easy) to re-enable training for the positional encodings (something we discussed previously)?
It's possible (but a bit involved) to add functionality to toggle whether we train the PEs in Flax. However, to me this PR is a bug fix, rather than a feature addition. I agree with what you said in the issue that we should not train the embeddings, since the original implementation used fixed sinusoidal embeddings, so I think it's fine if we do a straight fix and always freeze the embeddings here, since this is the correct behaviour.
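A tiny illustration of the stop-gradient approach: wrapping a parameter in `jax.lax.stop_gradient` makes back-prop treat it as a constant, so its gradients come back as zeros. This is a sketch, not the actual modelling code; the parameter names are made up:

```python
import jax
import jax.numpy as jnp

def loss_fn(params, x):
    # stop_gradient freezes the positional embeddings: back-prop
    # treats them as constants, so their gradients come back as zeros
    pe = jax.lax.stop_gradient(params["embed_positions"])
    return jnp.sum((x + pe) * params["proj"])

params = {"embed_positions": jnp.ones(4), "proj": jnp.full(4, 2.0)}
grads = jax.grad(loss_fn)(params, jnp.arange(4.0))
print(grads["embed_positions"])  # [0. 0. 0. 0.]
print(grads["proj"])             # [1. 2. 3. 4.]
```

Any optimizer applied to these gradients will simply never update the frozen leaf, which is why no trainable/non-trainable dict is needed.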
I added sinusoids weight init for the PyTorch implementation. Looking at TF and Flax, I'm not sure where to put the weight init. It seems like there is no weight-init code in TF? For Flax, I see this but don't really understand what's going on:

transformers/src/transformers/models/whisper/modeling_flax_whisper.py, lines 865 to 895 in 0a55d9f

From other TF Keras and Flax code I have seen, I think the typical pattern is to pass a weight-init function to a module when it is created? I'm not sure what pattern HF is using here.
The init function should be an instance of a JAX initialiser. That is, it should take the PRNG key as the first argument, as well as the shape and target dtype of the module: https://jax.readthedocs.io/en/latest/jax.nn.initializers.html
sanchit-gandhi
left a comment
Thanks for your follow-up work on this issue @gau-nernst! Nice job on getting the Flax and TF parts working as well 👏 I've left some suggestions below on how we could potentially re-factor the code a bit to make it as clear as possible for the final PR, let me know if you have any questions!
    # Copied from transformers.models.whisper.modeling_whisper.sinusoids
    def sinusoids(length: int, channels: int, max_timescale: float = 10000) -> np.ndarray:
Rather than defining this first function and then wrapping it with a very shallow second function `embedding_init`, I think it would be cleaner to define just one new function (`sinusoidal_embedding_init`) that takes three arguments:

- `key`: JAX PRNGKey (unused, but required to match the signature of the init function)
- `shape`: tuple of `(length, channels, max_timescale)`
- `dtype`: dtype of the computation

And returns the sinusoidal weights. To me, this would make the code a bit cleaner and easier to follow. How does this sound to you?
Sure, that works as well. I used numpy initially to generate the sinusoids so that we can copy it across the 3 files and avoid errors. But having separate functions is fine by me too.
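For concreteness, a single init function with the `(key, shape, dtype)` signature might look like the sketch below. It is written in plain numpy here for portability (in the Flax file it would return a `jnp` array), and follows the standard Whisper sinusoid construction; the function name matches the one proposed above, but the body is only an illustration:

```python
import numpy as np

def sinusoidal_embedding_init(key, shape, dtype=np.float32):
    """Sinusoidal init with a JAX-initializer-style signature (sketch).

    `key` is unused; it exists only to match the (key, shape, dtype) signature.
    """
    length, channels = shape
    if channels % 2 != 0:
        raise ValueError(f"Number of channels has to be divisible by 2, got {channels}.")
    log_timescale_increment = np.log(10000) / (channels // 2 - 1)
    inv_timescales = np.exp(-log_timescale_increment * np.arange(channels // 2))
    scaled_time = np.arange(length)[:, np.newaxis] * inv_timescales[np.newaxis, :]
    return np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1).astype(dtype)

# Whisper-base-like shape: 1500 positions, 384 channels
emb = sinusoidal_embedding_init(None, (1500, 384))
print(emb.shape)  # (1500, 384)
```

Position 0 comes out as `sin(0)=0` in the first half of each row and `cos(0)=1` in the second half, which is an easy sanity check for the tests discussed later.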
    self.conv1 = tf.keras.layers.Conv1D(self.embed_dim, kernel_size=3, strides=1, padding="valid", name="conv1")
    self.conv2 = tf.keras.layers.Conv1D(self.embed_dim, kernel_size=3, strides=2, padding="valid", name="conv2")

    def embedding_init(shape, dtype=None):
Probably same here for TF too?
    def sinusoids(length: int, channels: int, max_timescale: float = 10000) -> np.ndarray:
        """Returns sinusoids for positional embedding"""
        assert channels % 2 == 0
Let's try to avoid assert statements in favour of ValueErrors:
    - assert channels % 2 == 0
    + if channels % 2 != 0:
    +     raise ValueError(f"Number of channels has to be divisible by 2 for sinusoidal positional embeddings, got {channels} channels.")
    module.weight.data.normal_(mean=0.0, std=std)
    if not module.weight.requires_grad:
        # sinusoidal positional encodings used in WhisperEncoder
        with torch.no_grad():
I'm not sure this is safe - if we freeze the decoder embeddings then they'll get incorrectly initialised (since they'll be detected as requires_grad=False). Ideally, we need a way of just isolating the Encoder embeddings for this weight init. Do you think you could have a go at this?
I'm not 100% sure when `_init_weights()` is called. If it is always called at the end of `__init__()` (in `post_init()`?), and we don't freeze anything else in `__init__()`, it should still work as intended. However, I agree that relying on this behaviour is error-prone and not exactly clean.
I don't think there is a clean way for `_init_weights()` to know that an `nn.Embedding` layer is from the encoder. If that is the case, I think the best solution is to initialize the sinusoids after `_init_weights()` is called, within `__init__()`?
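One way to isolate the encoder embeddings without relying on `requires_grad` is to dispatch on the *parent* module's type inside the init function. A framework-free sketch of the idea (all class names here are toy stand-ins, not the real modelling classes):

```python
import numpy as np

def sinusoids(length, channels, max_timescale=10000.0):
    # same construction as Whisper's sinusoidal positional embeddings
    log_timescale_increment = np.log(max_timescale) / (channels // 2 - 1)
    inv_timescales = np.exp(-log_timescale_increment * np.arange(channels // 2))
    scaled_time = np.arange(length)[:, None] * inv_timescales[None, :]
    return np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1)

class Embedding:
    def __init__(self, num, dim):
        self.weight = np.zeros((num, dim))

class ToyEncoder:
    def __init__(self):
        self.embed_positions = Embedding(8, 4)

class ToyDecoder:
    def __init__(self):
        self.embed_positions = Embedding(8, 4)

def _init_weights(module):
    # dispatch on the module *type*, not on requires_grad, so a frozen
    # decoder embedding can never be mistaken for the encoder PE
    if isinstance(module, ToyEncoder):
        w = module.embed_positions.weight
        w[...] = sinusoids(*w.shape)

enc, dec = ToyEncoder(), ToyDecoder()
for m in (enc, dec):
    _init_weights(m)
print(enc.embed_positions.weight[0])     # [0. 0. 1. 1.]
print(dec.embed_positions.weight.sum())  # 0.0
```

Since PyTorch's `post_init()` machinery visits every submodule, an `isinstance` check like this sees the encoder itself, not just an anonymous `nn.Embedding`.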
    ]

    def sinusoids(length: int, channels: int, max_timescale: float = 10000) -> np.ndarray:
IMO cleaner to do this entire function in PyTorch for the PyTorch modelling file
sanchit-gandhi
left a comment
Thanks for updating the sinusoids functions to their respective libraries! They look a lot better. Just a few small comments regarding the initialisation :)
    remat = nn_partitioning.remat

    def sinusoidal_embedding_init(max_timescale: float = 10000):
Sorry @gau-nernst, can we not just define one function here? We should move away from defining two functions, where the outer one just calls the inner one directly
The reason I do it like this is to allow changing the default max_timescale value (a "parameterized" initializer), so that it has feature parity with the PyTorch version. If we define just one function, there is no way to change the max_timescale value, since Jax/Keras will only call the function with (shape, dtype) (plus an RNG key for Jax); technically we could still bypass this by using functools.partial(). I followed Jax for this design (https://github.com/google/jax/blob/3247db774ea387098bd9d9049886030dc666cb39/jax/_src/nn/initializers.py#L133-L157). Another way is to make it a class (like Keras does).
Realistically the users won't be able to specify max_timescale to the model anyway since we don't expose it. So it would also be fine for me to make max_timescale a hard-coded constant.
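The "parameterized initializer" pattern under discussion, sketched in plain numpy: the outer call pins `max_timescale`, and the returned closure has the `(key, shape, dtype)` shape the framework expects. `functools.partial` on a flat function achieves the same thing; both forms below are illustrations, not the merged code:

```python
import functools
import numpy as np

def sinusoidal_embedding_init(max_timescale=10000.0):
    # outer call fixes max_timescale; the returned closure matches (key, shape, dtype)
    def init(key, shape, dtype=np.float32):
        length, channels = shape
        log_timescale_increment = np.log(max_timescale) / (channels // 2 - 1)
        inv_timescales = np.exp(-log_timescale_increment * np.arange(channels // 2))
        scaled_time = np.arange(length)[:, None] * inv_timescales[None, :]
        return np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1).astype(dtype)
    return init

# closure style
emb_a = sinusoidal_embedding_init(max_timescale=5000.0)(None, (10, 6))

# equivalent functools.partial style on a flat function
def flat_init(key, shape, dtype=np.float32, max_timescale=10000.0):
    return sinusoidal_embedding_init(max_timescale)(key, shape, dtype)

emb_b = functools.partial(flat_init, max_timescale=5000.0)(None, (10, 6))
print(np.allclose(emb_a, emb_b))  # True
```

If `max_timescale` is hard-coded instead, the outer function disappears and only the inner `(key, shape, dtype)` function remains, which is what was merged in the end.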
Yeah, if max_timescale is not a reachable argument for the user, let's just hardcode it. We tend to do this anyway for sinusoidal embeddings:

transformers/src/transformers/models/gptj/modeling_flax_gptj.py, lines 109 to 110 in 3911774
    # Initialize weights and apply final processing
    self.post_init()
    with torch.no_grad():
        self.embed_positions.weight.copy_(sinusoids(self.max_source_positions, embed_dim))
This was more appropriate in _init_weights! Let's keep it there
This still holds - can this go in _init_weights if possible?
@sanchit-gandhi I fixed the embedding init for TF and Flax as you requested. I also added tests for TF and Flax. For Flax, I didn't add a test for the non-trainable sinusoidal embedding, because I don't know how to do it cleanly. For checking the weight init in Flax, I don't know Flax semantics so well, so I added a rather "crude" solution to get the encoder position embeddings.
sanchit-gandhi
left a comment
Very nice @gau-nernst - especially the Flax init which is really clean now 👌 Could the PT init go in _init_weights? Otherwise it all looks good to me!
    remat = nn_partitioning.remat

    def sinusoidal_embedding_init(key, shape, dtype=jnp.float_) -> jax.Array:
    hidden_states = jax.nn.gelu(self.conv2(hidden_states), approximate=False)

    embed_positions = self.embed_positions(jnp.arange(self.config.max_source_positions))
    # freeze the sinusoidal embeddings by stopping the back-prop
Note to reviewer: by default we freeze the embeddings in Flax, and don't provide an override. See this explanation for detail: #26032 (comment)
    max_diff = (base_params[key] - base_params_from_head[key]).sum().item()
    self.assertLessEqual(max_diff, 1e-3, msg=f"{key} not identical")

    def test_encoder_sinusoidal_embed_positions(self):
To test that the Flax embeddings are non-trainable (frozen), you can follow this Flax Wav2Vec2 test:
To me this is optional: we know that the embeddings are initialised correctly through your test, and that grads are set to zero by action of jax.lax.stop_gradient, so up to you if you want to add this!
    # Initialize weights and apply final processing
    self.post_init()
    with torch.no_grad():
        self.embed_positions.weight.copy_(sinusoids(self.max_source_positions, embed_dim))
This still holds - can this go in _init_weights if possible?
Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
Could we change the
I put sinusoidal init in

    def apply(self: T, fn: Callable[['Module'], None]) -> T:
        ...
        for module in self.children():
            module.apply(fn)
        fn(self)
        return self
sanchit-gandhi
left a comment
Thanks for iterating here @gau-nernst and for the fruitful comment discussions! Requesting a TF review from @Rocketknight1 and maintainer review from @ArthurZucker. Thanks both!
    LARGE_NEGATIVE = -1e8

    def sinusoidal_embedding_init(shape, dtype=tf.float32) -> tf.Tensor:
Would appreciate a TF review here!
Rocketknight1
left a comment
TF code looks good to me! Either doing it this way or creating a `tf.constant` should work.
ArthurZucker
left a comment
Thanks for your contribution! 😉
What does this PR do?
Fixes #25989
I'm not too familiar with Jax/Flax and can't find a simple way to set a variable as non-trainable in Flax. Please advise on how I should approach this.
Should we have a test for this behaviour also? i.e. a test that the Whisper Encoder PE is non-trainable by default.
Another note: should the Encoder's positional encodings be initialized with sinusoids, just like in the official repo?
https://github.com/openai/whisper/blob/main/whisper/model.py#L150
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@sanchit-gandhi