Add correct batched handling for apply_chat_template #29222
Rocketknight1 merged 28 commits into main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Force-pushed from 684008a to ed7136f
Should be ready for review now! cc @ArthurZucker
| "In version 4.40, `return_dict` will be set to `True` by default. " | ||
| "Please explicitly set `return_dict` to `False` to maintain the current behaviour, " | ||
| "or set it to `True` to get the new behaviour immediately." | ||
| ) |
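For context, here's a minimal sketch of how a deprecation warning like this could be emitted. The helper name is hypothetical and the message text comes from the diff above; as noted below, the warning was ultimately removed from this PR.

```python
import warnings

def resolve_return_dict(return_dict):
    # Hypothetical helper: interpret an unset `return_dict` as False for
    # backward compatibility, while warning that the default will change.
    if return_dict is None:
        warnings.warn(
            "In version 4.40, `return_dict` will be set to `True` by default. "
            "Please explicitly set `return_dict` to `False` to maintain the current behaviour, "
            "or set it to `True` to get the new behaviour immediately.",
            FutureWarning,
        )
        return_dict = False
    return return_dict
```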
It would be nice to explain why this should be set to `True`, for example? I have no idea.
I changed my mind about this and removed the warning to make this a simpler PR!
```
if not batched:
    rendered = rendered[0]
```
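As a rough sketch of the single-vs-batch handling this diff touches (illustrative names, not the exact transformers internals):

```python
def apply_chat_template_sketch(conversations, render_one):
    # A single conversation is a list of message dicts; a batch is a list
    # of such lists. Detect which shape we were given.
    batched = isinstance(conversations[0], (list, tuple))
    if not batched:
        conversations = [conversations]

    rendered = [render_one(chat) for chat in conversations]

    # Mirror other tokenizer methods: a single input yields a single output.
    if not batched:
        rendered = rendered[0]
    return rendered
```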
should we not always return a batched output? (breaking but we can warn)
I'm not sure - other tokenizer methods don't auto-batch a single input, right? (And sorry for taking so long to reply here!)
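For reference, a quick usage sketch of the resulting non-auto-batching behaviour, assuming a checkpoint that ships a chat template (the model name is just an example):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

# A single conversation in -> a single rendered string out.
chat = [{"role": "user", "content": "Hello!"}]
single = tok.apply_chat_template(chat, tokenize=False)

# A list of conversations in -> a list of rendered strings out.
batch = [chat, [{"role": "user", "content": "Hi again!"}]]
rendered = tok.apply_chat_template(batch, tokenize=False)
```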
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Force-pushed from da681cb to cb0bbb6
This should be ready for re-review now, cc @amyeroberts @ArthurZucker! I simplified the PR by removing the deprecation warning - I'm not sure if we want to move to `return_dict=True` as the default later anyway.
amyeroberts left a comment
Thanks for adding this!
Two things before merge:
- A question about the default value of `return_dict`
- Let's wait for @ArthurZucker to get back to confirm the desired batching behaviour
```diff
     - `'np'`: Return NumPy `np.ndarray` objects.
     - `'jax'`: Return JAX `jnp.ndarray` objects.
-    return_dict (`bool`, *optional*, defaults to `False`):
+    return_dict (`bool`, *optional*):
```
Why change the default to `None` here? AFAICT this doesn't change anything: it gets set to `False` if `tokenize` is `True`, and it's only used in truth checks on L1763 and L1773 (which shouldn't really do this if the value can be `None` anyway), so `False` and `None` will have the same result.
Fixed! You're right - this is a leftover from when I was planning to slowly make `return_dict=True` the default.
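A tiny sketch of the truth-check point being made here (illustrative, not the actual transformers source):

```python
def apply_template_sketch(tokenize=True, return_dict=None):
    if tokenize and return_dict is None:
        return_dict = False  # the explicit normalization discussed above

    # Even without that normalization, a plain truthiness check treats
    # None and False identically, so the observable behaviour is the same.
    if return_dict:
        return {"input_ids": [], "attention_mask": []}
    return []
```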
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
# Conflicts:
#   src/transformers/tokenization_utils_base.py
Merging this now that the branch cut has passed!
`apply_chat_template` has had a few issues since it was written. Firstly, by default it returns the naked `input_ids` rather than a dict, and secondly it didn't support rendering a batch of chats simultaneously. This PR makes a few changes:

- `return_dict` now defaults to `None`. For now, we interpret this as `False` to maintain backward compatibility, but this PR adds a warning that the default behaviour will be changing to `True` to match other tokenizer methods.

cc @siddk @lewtun who have both requested this or something like it!
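As a usage sketch of the `return_dict` behaviour described above (assuming a checkpoint with a chat template; the model name is illustrative):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
chat = [{"role": "user", "content": "Hello!"}]

# Default behaviour: a bare list of input_ids.
ids = tok.apply_chat_template(chat)

# With return_dict=True: a dict like other tokenizer methods,
# including input_ids and attention_mask.
enc = tok.apply_chat_template(chat, return_dict=True)
print(enc["input_ids"], enc["attention_mask"])
```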