Add correct batched handling for apply_chat_template #29222
Rocketknight1 merged 28 commits into main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Force-pushed from 684008a to ed7136f
Should be ready for review now! cc @ArthurZucker
| "In version 4.40, `return_dict` will be set to `True` by default. " | ||
| "Please explicitly set `return_dict` to `False` to maintain the current behaviour, " | ||
| "or set it to `True` to get the new behaviour immediately." | ||
| ) |
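For context, here's a minimal sketch of how a deprecation warning like this could be emitted. The helper name is hypothetical and the message text comes from the diff above; as noted below, the warning was ultimately removed from this PR.

```python
import warnings

def resolve_return_dict(return_dict):
    # Hypothetical helper: interpret an unset `return_dict` as False for
    # backward compatibility, while warning that the default will change.
    if return_dict is None:
        warnings.warn(
            "In version 4.40, `return_dict` will be set to `True` by default. "
            "Please explicitly set `return_dict` to `False` to maintain the current behaviour, "
            "or set it to `True` to get the new behaviour immediately.",
            FutureWarning,
        )
        return_dict = False
    return return_dict
```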
It would be nice to explain why this should be set to `True`, for example? I have no idea.
I changed my mind about this and removed the warning to make this a simpler PR!
```
if not batched:
    rendered = rendered[0]
```
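As a rough sketch of the single-vs-batch handling this diff touches (illustrative names, not the exact transformers internals):

```python
def apply_chat_template_sketch(conversations, render_one):
    # A single conversation is a list of message dicts; a batch is a list
    # of such lists. Detect which shape we were given.
    batched = isinstance(conversations[0], (list, tuple))
    if not batched:
        conversations = [conversations]

    rendered = [render_one(chat) for chat in conversations]

    # Mirror other tokenizer methods: a single input yields a single output.
    if not batched:
        rendered = rendered[0]
    return rendered
```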
should we not always return a batched output? (breaking but we can warn)
I'm not sure - other tokenizer methods don't auto-batch a single input, right? (And sorry for taking so long to reply here!)
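For reference, a quick usage sketch of the resulting non-auto-batching behaviour, assuming a checkpoint that ships a chat template (the model name is just an example):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

# A single conversation in -> a single rendered string out.
chat = [{"role": "user", "content": "Hello!"}]
single = tok.apply_chat_template(chat, tokenize=False)

# A list of conversations in -> a list of rendered strings out.
batch = [chat, [{"role": "user", "content": "Hi again!"}]]
rendered = tok.apply_chat_template(batch, tokenize=False)
```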
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Force-pushed from da681cb to cb0bbb6
This should be ready for re-review now, cc @amyeroberts @ArthurZucker! I simplified the PR by removing the deprecation warning - I'm not sure if we want to move to `return_dict=True` as the default later anyway.
amyeroberts left a comment
Thanks for adding this!
Two things before merge:
- A question about the default value of `return_dict`
- Let's wait for @ArthurZucker to get back to confirm the desired batching behaviour
```diff
     - `'np'`: Return NumPy `np.ndarray` objects.
     - `'jax'`: Return JAX `jnp.ndarray` objects.
-    return_dict (`bool`, *optional*, defaults to `False`):
+    return_dict (`bool`, *optional*):
```
Why change the default to `None` here? AFAICT this doesn't change anything: it gets set to `False` if `tokenize` is `True`, and it's only used in truth checks on L1763 and L1773 (which shouldn't really do this if the value can be `None` anyway), so `False` and `None` will have the same result.
Fixed! You're right - this is a leftover from when I was planning to slowly make `return_dict=True` the default.
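A tiny sketch of the truth-check point being made here (illustrative, not the actual transformers source):

```python
def apply_template_sketch(tokenize=True, return_dict=None):
    if tokenize and return_dict is None:
        return_dict = False  # the explicit normalization discussed above

    # Even without that normalization, a plain truthiness check treats
    # None and False identically, so the observable behaviour is the same.
    if return_dict:
        return {"input_ids": [], "attention_mask": []}
    return []
```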
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
# Conflicts:
#   src/transformers/tokenization_utils_base.py
Merging this now that the branch cut has passed!
`apply_chat_template` has had a few issues since it was written. Firstly, by default it returns the naked `input_ids` rather than a dict, and secondly it didn't support rendering a batch of chats simultaneously. This PR makes a few changes:

- `return_dict` now defaults to `None`. For now, we interpret this as `False` to maintain backward compatibility, but this PR adds a warning that the default behaviour will be changing to `True` to match other tokenizer methods.

cc @siddk @lewtun who have both requested this or something like it!
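As a usage sketch of the `return_dict` behaviour described above (assuming a checkpoint with a chat template; the model name is illustrative):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
chat = [{"role": "user", "content": "Hello!"}]

# Default behaviour: a bare list of input_ids.
ids = tok.apply_chat_template(chat)

# With return_dict=True: a dict like other tokenizer methods,
# including input_ids and attention_mask.
enc = tok.apply_chat_template(chat, return_dict=True)
print(enc["input_ids"], enc["attention_mask"])
```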