How is unpadding handled when unpacking? #7

@karinazad

Description

I can see that the collator handles packing, but I'm not sure how padding tokens are handled in the current code. When I apply the collator to my input ids, the padding tokens are left in the packed sequence.

Please let me know what I should change so the collator properly removes the padding tokens. Thank you!

collator = DataCollatorWithPacking(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
    mask_replace_prob=0.8,
    random_replace_prob=0.1,
    pack_sequences=True,
)

# Prepare data in the format expected by the collator
hf_input = [
    {
        "input_ids": input_ids[0].tolist(),
        "attention_mask": attention_mask[0].tolist(),
    },
    {
        "input_ids": input_ids[1].tolist(),
        "attention_mask": attention_mask[1].tolist(),
    },
]
hf_batch = collator(hf_input)
Output (note the [PAD] tokens still present in the packed sequence):

Input IDs shape: torch.Size([1, 20])
Labels shape: torch.Size([1, 20])
Position IDs shape: torch.Size([1, 20])
CU Seqlens shape: torch.Size([3])
Max Seqlen: tensor([10])
  Sequence 1: ['[CLS]', 'hello', 'world', 'how', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[CLS]', 'are', '[MASK]', 'doing', '?', '?', '[SEP]', '[PAD]', '[PAD]', '[PAD]']
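For reference, this is the behavior I would expect instead: each sequence's padding tokens are dropped (using its attention mask) before the sequences are concatenated. A minimal sketch of that unpadding step — `unpad_and_pack` is a hypothetical helper I wrote to illustrate the expectation, not part of `DataCollatorWithPacking`:

```python
import torch

def unpad_and_pack(input_ids, attention_mask):
    """Hypothetical sketch: drop every token whose attention_mask is 0,
    then concatenate the surviving tokens into a single packed row.
    Returns the packed row, cu_seqlens boundaries, and the max real length."""
    pieces = []
    for ids, mask in zip(input_ids, attention_mask):
        ids = torch.as_tensor(ids)
        mask = torch.as_tensor(mask).bool()
        pieces.append(ids[mask])  # keep only real (non-pad) tokens

    lengths = torch.tensor([p.numel() for p in pieces])
    packed = torch.cat(pieces).unsqueeze(0)  # shape [1, total_real_tokens]
    # cu_seqlens: cumulative sequence boundaries, starting at 0
    cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.long), lengths.cumsum(0)])
    return packed, cu_seqlens, int(lengths.max())
```

With the two sequences above (real lengths 5 and 7, each padded to 10), this would produce a packed row of 12 tokens instead of 20, with cu_seqlens [0, 5, 12].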
