I can see that the collator handles unpacking, but I'm not sure how padding tokens are handled in the current code. When I apply the collator to the input IDs, it leaves the padding tokens in place.
Please let me know what I should change so that it properly removes the padding tokens. Thank you!
```python
collator = DataCollatorWithPacking(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
    mask_replace_prob=0.8,
    random_replace_prob=0.1,
    pack_sequences=True,
)

# Prepare data in the format expected by the collator
hf_input = [
    {
        "input_ids": input_ids[0].tolist(),
        "attention_mask": attention_mask[0].tolist(),
    },
    {
        "input_ids": input_ids[1].tolist(),
        "attention_mask": attention_mask[1].tolist(),
    },
]

hf_batch = collator(hf_input)
```

Output:

```
Input IDs shape: torch.Size([1, 20])
Labels shape: torch.Size([1, 20])
Position IDs shape: torch.Size([1, 20])
CU Seqlens shape: torch.Size([3])
Max Seqlen: tensor([10])
Sequence 1: ['[CLS]', 'hello', 'world', 'how', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[CLS]', 'are', '[MASK]', 'doing', '?', '?', '[SEP]', '[PAD]', '[PAD]', '[PAD]']
```
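One workaround, assuming the collator simply concatenates whatever tokens it receives, is to drop the padding positions yourself before passing the examples in, using the attention mask as the filter. The `strip_padding` helper below is hypothetical (not part of the library), shown only to illustrate the idea:

```python
import torch

# Hypothetical helper (not provided by the library): keep only positions where
# attention_mask == 1, so the packing collator never sees padding tokens.
def strip_padding(input_ids, attention_mask):
    features = []
    for ids, mask in zip(input_ids, attention_mask):
        keep = mask.bool()
        features.append({
            "input_ids": ids[keep].tolist(),
            "attention_mask": mask[keep].tolist(),
        })
    return features

# Toy batch: two right-padded sequences (0 is the pad token id here)
input_ids = torch.tensor([
    [101, 7592, 2088, 2129, 102, 0, 0],
    [101, 2024, 103, 102, 0, 0, 0],
])
attention_mask = (input_ids != 0).long()

features = strip_padding(input_ids, attention_mask)
print([len(f["input_ids"]) for f in features])  # → [5, 4]
```

The resulting `features` list is in the same `{"input_ids": ..., "attention_mask": ...}` format as `hf_input` above, just with the padding removed, so the packed output contains only real tokens.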