Idefics: fix position ids #33907
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
ArthurZucker left a comment
Thanks! There might be a test worth adding for this, no? (We break that often for PEFT 😄)
```python
# create position_ids on the fly for batch generation
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
position_ids = position_ids[:, -seq_length:]
```
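To make the snippet concrete, here is a self-contained sketch of the same logic; the tensor values and `seq_length` are illustrative, not taken from the PR:

```python
import torch

# Illustrative left-padded batch: row 0 has two padding tokens.
attention_mask = torch.tensor([[0, 0, 1, 1, 1],
                               [1, 1, 1, 1, 1]])
seq_length = 3  # hypothetical: only the last 3 positions are fed to the model

# create position_ids on the fly, as in the diff above
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
position_ids = position_ids[:, -seq_length:]

print(position_ids.tolist())  # [[0, 1, 2], [2, 3, 4]]
```

Padding positions are filled with 1 (an arbitrary but valid index), and real tokens count up from 0 regardless of how much padding precedes them.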
The attention mask is of shape `seq_length_with_past`, so that makes sense!
When can this branch be merged?
Will take a look at your comment and merge after seeing what is wrong with the labels. TBH, labels should not be affected by position ids, but I'll dig into it.
@ArthurZucker can you take one more look pls? I added a fix for the labels.
Maybe PEFT would be a better place to add this test. If so, LMK and I can work on something (probably based on this code). Since that would only test a single architecture, it wouldn't be super useful IMO, but still better than nothing.
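For reference, a minimal sketch of what such a regression test could look like; the helper below just mirrors the snippet from the diff, and the function names are made up, not actual transformers or PEFT test code:

```python
import torch

def make_position_ids(attention_mask, seq_length):
    # mirrors the position-id logic from the diff above
    position_ids = attention_mask.long().cumsum(-1) - 1
    position_ids.masked_fill_(attention_mask == 0, 1)
    return position_ids[:, -seq_length:]

def test_position_ids_cropped_to_input_length():
    # mask covers past + current tokens (seq_length_with_past = 10),
    # but only 1 new token is processed, as in cached generation
    attention_mask = torch.ones(1, 10, dtype=torch.long)
    position_ids = make_position_ids(attention_mask, seq_length=1)
    assert position_ids.shape == (1, 1)
    assert position_ids[0, 0].item() == 9  # newest token's position

test_position_ids_cropped_to_input_length()
```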
ArthurZucker left a comment
Just missing info on the shift. We broke this recently (I think we never had this). I would be in favor of doing less work for the users, but documenting it properly!
```python
# we use the input attention mask to shift the logits and labels, because it is 2D.
# we also crop the attn mask in case it is longer, which happens in PrefixTuning with peft
shift_attention_mask = attention_mask[:, -(logits.shape[1] - 1):].to(logits.device)
logger.warning_once(
    "The final logits were masked using attention mask before calculating the loss. "
    "This behavior will be removed in v4.48, you should be masking the `labels` with `-100` "
    "in data collator before training starts."
)
```
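As a rough illustration of the cropping under discussion, here is a self-contained sketch; all shapes and the virtual-token count are invented for the example, and the real Idefics loss code may differ:

```python
import torch
import torch.nn.functional as F

batch, seq_len, vocab = 2, 5, 11
logits = torch.randn(batch, seq_len, vocab)
labels = torch.randint(0, vocab, (batch, seq_len))

# With PrefixTuning, the attention mask carries extra virtual tokens,
# so it is longer than the logits along the sequence dimension.
num_virtual_tokens = 4  # hypothetical PEFT prefix length
attention_mask = torch.ones(batch, seq_len + num_virtual_tokens)

# crop so the mask lines up with the shifted logits/labels
shift_attention_mask = attention_mask[:, -(logits.shape[1] - 1):]
shift_logits = logits[:, :-1][shift_attention_mask.bool()]
shift_labels = labels[:, 1:][shift_attention_mask.bool()]
loss = F.cross_entropy(shift_logits, shift_labels)
```

The alternative that the warning recommends is to set ignored label positions to `-100` in the data collator, since `F.cross_entropy` skips those via its default `ignore_index=-100`.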
That's the only thing I am curious about: we are introducing something new that we know we are going to deprecate. Not sure this makes a lot of sense to me! I would either not add this, or just not propagate it to models other than Idefics.
Yes, I don't know how this got introduced, as none of our LLMs mask out labels. We can also not deprecate it and just stop propagating, yes. I'll take care of it when new models are added.
So, in that case we remove deprecation and leave the fix for PEFT tuning
> So, in that case we remove deprecation and leave the fix for PEFT tuning
Exactly! We already get bashed enough for the plethora of warnings we produce, let's not add one more! 😉
Thanks for addressing the issue in this PR.
> So, in that case we remove deprecation and leave the fix for PEFT tuning
What fix are we talking about here?
Oh, that's more of a transformers-side fix. We are talking about the attention mask, which gets extra virtual tokens with PEFT, so we have to crop them off to match the shape of the labels for the CE loss.
Got it, thanks, I thought I might have to fix something in PEFT.
* fix position ids
* fix labels also
* fix copies
* oops, not that one
* dont deprecate
What does this PR do?
Fixes #33852 by cropping position ids to the length of the inputs, and cleans up a bit how inputs are prepared for generation.
TF equivalence tests are failing for me locally and are also failing on main, so those failures are not related to this PR.