Cohere: Use diff tool instead of Copied from mechanism #31211

younesbelkada wants to merge 5 commits into main

Conversation
ALL_LAYERNORM_LAYERS.append(CohereRMSNorm)


class CohereLayerNorm(CohereRMSNorm):
Kept in case users rely on the `CohereLayerNorm` class
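For readers following along, a minimal sketch of what such a backward-compatibility alias could look like, assuming one also wants a warning on instantiation; the stub base class, logger setup, and message wording are illustrative, not the PR's actual code:

```python
import logging

import torch
import torch.nn as nn

logger = logging.getLogger(__name__)


class CohereRMSNorm(nn.Module):
    """Stub standing in for the real CohereRMSNorm, only to keep the sketch runnable."""

    def __init__(self, hidden_size: int, eps: float = 1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps


class CohereLayerNorm(CohereRMSNorm):
    """Deprecated alias kept so existing `CohereLayerNorm` imports keep working."""

    def __init__(self, *args, **kwargs):
        # Nudge users toward the new name without breaking their code.
        logger.warning("`CohereLayerNorm` is deprecated, use `CohereRMSNorm` instead.")
        super().__init__(*args, **kwargs)
```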
    def logit_scale(self):
        logger.warning(
            "`logit_scale` attribute is going to be deprecated in future versions, please use `model.config.logit_scale` instead."
        )
        return self.config.logit_scale

    @property
    def tie_word_embeddings(self):
        logger.warning(
            "`tie_word_embeddings` attribute is going to be deprecated in future versions, please use `model.config.tie_word_embeddings` instead."
        )
        return self.config.tie_word_embeddings
These attributes are public, but I suggest using the config variable directly, with a deprecation cycle, to make it cleaner
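As a rough illustration of the deprecation cycle being suggested (the shim class, the `FutureWarning` category, and the example value are assumptions, not the PR's code):

```python
import warnings
from types import SimpleNamespace


class CohereModelSketch:
    """Hypothetical stand-in for the real model class, just to show the shim."""

    def __init__(self, config):
        self.config = config

    @property
    def logit_scale(self):
        # Old public attribute kept for one release; the config stays the single
        # source of truth internally.
        warnings.warn(
            "`logit_scale` is deprecated, use `model.config.logit_scale` instead.",
            FutureWarning,
        )
        return self.config.logit_scale


model = CohereModelSketch(SimpleNamespace(logit_scale=0.0625))
old_way = model.logit_scale         # emits the FutureWarning
new_way = model.config.logit_scale  # preferred access going forward
```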
class CohereLinearScalingRotaryEmbedding(CohereRotaryEmbedding):
    """CohereRotaryEmbedding extended with linear scaling. Credits to the Reddit user /u/kaiokendev"""

    def forward(self, x, position_ids):
        # difference to the original RoPE: a scaling factor is aplied to the position ids
        position_ids = position_ids.float() / self.scaling_factor
        cos, sin = super().forward(x, position_ids)
        return cos, sin


class CohereDynamicNTKScalingRotaryEmbedding(CohereRotaryEmbedding):
    """CohereRotaryEmbedding extended with Dynamic NTK scaling. Credits to the Reddit users /u/bloc97 and /u/emozilla"""

    def forward(self, x, position_ids):
        # difference to the original RoPE: inv_freq is recomputed when the sequence length > original length
        seq_len = torch.max(position_ids) + 1
        if seq_len > self.max_position_embeddings:
            base = self.base * (
                (self.scaling_factor * seq_len / self.max_position_embeddings) - (self.scaling_factor - 1)
            ) ** (self.dim / (self.dim - 2))
            inv_freq = 1.0 / (
                base ** (torch.arange(0, self.dim, 2, dtype=torch.int64).float().to(x.device) / self.dim)
            )
            self.register_buffer("inv_freq", inv_freq, persistent=False)  # TODO joao: this may break with compilation

        cos, sin = super().forward(x, position_ids)
        return cos, sin
These classes are never used, but I couldn't find a way to remove them

there is no way to do so 😓 Maybe a skip layer?

hmmm yeah, or maybe it is ok to manually remove them for now
    @property
    def logit_scale(self):
        logger.warning(
            "`logit_scale` attribute is going to be deprecated in future versions, please use `model.config.logit_scale` instead."
        )
        return self.config.logit_scale

    @property
    def tie_word_embeddings(self):
        logger.warning(
            "`tie_word_embeddings` attribute is going to be deprecated in future versions, please use `model.config.tie_word_embeddings` instead."
        )
        return self.config.tie_word_embeddings
Any idea why these are not propagated in the generated modeling code?
I'll have to dive a bit into this!
Ok that's on me to do now!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
    model_type = "cohere"
    keys_to_ignore_at_inference = ["past_key_values"]

    def __init__(
For the init we can use `super().__init__()` and `**super_kwargs` to only change the arguments that are actually different from the defaults we have in Gemma 😉
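A minimal sketch of that pattern, assuming the diff file inherits from `GemmaConfig`; the parameter names and default values below are illustrative guesses, not the PR's exact code:

```python
from transformers import GemmaConfig


class CohereConfig(GemmaConfig):
    model_type = "cohere"
    keys_to_ignore_at_inference = ["past_key_values"]

    def __init__(
        self,
        vocab_size=256000,   # assumed Cohere-specific default, overriding Gemma's
        logit_scale=0.0625,  # assumed Cohere-specific attribute
        **super_kwargs,
    ):
        self.logit_scale = logit_scale
        # Everything not overridden above keeps the default defined in GemmaConfig.
        super().__init__(vocab_size=vocab_size, **super_kwargs)
```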
_CONFIG_FOR_DOC = "CohereConfig"


# Copied from transformers.models.llama.modeling_llama._get_unpad_data
That's a problem, no? The `_get_unpad_data` should still be present!

For some reason it has been pasted below: https://github.com/huggingface/transformers/pull/31211/files#r1647528352
        return attn_output, None, past_key_value


def _get_unpad_data(attention_mask):
the `_get_unpad_data` is pasted here @ArthurZucker
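For context, this is roughly what the Llama helper being copied computes; the body below is reconstructed from memory of the flash-attention unpadding utilities and should be treated as an approximation rather than the PR's generated code:

```python
import torch
import torch.nn.functional as F


def _get_unpad_data(attention_mask: torch.Tensor):
    # Number of real (non-padded) tokens per sequence in the batch.
    seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
    # Flat indices of the non-padded positions, used to pack/unpack hidden states.
    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
    max_seqlen_in_batch = seqlens_in_batch.max().item()
    # Cumulative sequence lengths in the packed layout flash attention expects.
    cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
    return indices, cu_seqlens, max_seqlen_in_batch
```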
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
What does this PR do?
As per title
cc @ArthurZucker
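For readers unfamiliar with the two mechanisms named in the title, a rough sketch of the difference; the class choices, inherited parents, and file layout here are assumptions for illustration, not the PR's exact contents:

```python
# "Copied from" mechanism: the full implementation is duplicated in
# modeling_cohere.py, and a marker comment lets CI keep the copy in sync, e.g.
#
#   # Copied from transformers.models.llama.modeling_llama._get_unpad_data
#   def _get_unpad_data(attention_mask):
#       ...
#
# Diff tool: a small diff file declares only what differs from an existing model,
# and the full modeling file is generated from it. Illustrative sketch:

from transformers.models.llama.modeling_llama import LlamaMLP, LlamaModel


class CohereMLP(LlamaMLP):
    # Nothing overridden: the generated modeling file inherits the Llama body verbatim.
    pass


class CohereModel(LlamaModel):
    # Only behaviour that genuinely differs from Llama would be written out here.
    pass
```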