How is unpadding handled when unpacking? #7

@karinazad

Description

I can see that the collator handles packing, but I'm not sure how padding tokens are handled in the current code. When I apply the collator to my input ids, the padding tokens are left in the packed sequence.

Please let me know what I should change so the collator properly removes the padding tokens. Thank you!

collator = DataCollatorWithPacking(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
    mask_replace_prob=0.8,
    random_replace_prob=0.1,
    pack_sequences=True,
)

# Prepare data in the format expected by the collator
hf_input = [
    {
        "input_ids": input_ids[0].tolist(),
        "attention_mask": attention_mask[0].tolist(),
    },
    {
        "input_ids": input_ids[1].tolist(),
        "attention_mask": attention_mask[1].tolist(),
    },
]
hf_batch = collator(hf_input)
Output (note the [PAD] tokens still present in the packed sequence):

Input IDs shape: torch.Size([1, 20])
Labels shape: torch.Size([1, 20])
Position IDs shape: torch.Size([1, 20])
CU Seqlens shape: torch.Size([3])
Max Seqlen: tensor([10])
  Sequence 1: ['[CLS]', 'hello', 'world', 'how', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[CLS]', 'are', '[MASK]', 'doing', '?', '?', '[SEP]', '[PAD]', '[PAD]', '[PAD]']
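For reference, this is the behavior I would expect instead: each sequence's padding tokens are dropped (using its attention mask) before the sequences are concatenated. A minimal sketch of that unpadding step — `unpad_and_pack` is a hypothetical helper I wrote to illustrate the expectation, not part of `DataCollatorWithPacking`:

```python
import torch

def unpad_and_pack(input_ids, attention_mask):
    """Hypothetical sketch: drop every token whose attention_mask is 0,
    then concatenate the surviving tokens into a single packed row.
    Returns the packed row, cu_seqlens boundaries, and the max real length."""
    pieces = []
    for ids, mask in zip(input_ids, attention_mask):
        ids = torch.as_tensor(ids)
        mask = torch.as_tensor(mask).bool()
        pieces.append(ids[mask])  # keep only real (non-pad) tokens

    lengths = torch.tensor([p.numel() for p in pieces])
    packed = torch.cat(pieces).unsqueeze(0)  # shape [1, total_real_tokens]
    # cu_seqlens: cumulative sequence boundaries, starting at 0
    cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.long), lengths.cumsum(0)])
    return packed, cu_seqlens, int(lengths.max())
```

With the two sequences above (real lengths 5 and 7, each padded to 10), this would produce a packed row of 12 tokens instead of 20, with cu_seqlens [0, 5, 12].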
