Fix: Using IterableDataset crashes training by oKatanaaa · Pull Request #8 · unslothai/unsloth-zoo

oKatanaaa · 2024-10-28T18:34:19Z

Using a dataset that inherits from IterableDataset causes training runs to crash with the following error:

│ /opt/conda/lib/python3.10/site-packages/unsloth_zoo/tokenizer_utils.py:238 in                    │
│ fix_untrained_tokens                                                                             │
│                                                                                                  │
│   235 │   pass                                                                                   │
│   236 │                                                                                          │
│   237 │   # Check the first 250, last 250 input_ids                                              │
│ ❱ 238 │   size_dataset = len(train_dataset)                                                      │
│   239 │   size = min(size_dataset, 250)                                                          │
│   240 │   for j in range(size):                                                                  │
│   241 │   │   input_ids = train_dataset[j]
TypeError: object of type 'IterableDataset' has no len()

The code in tokenizer_utils assumes that users will always pass an indexable dataset, which is what causes the error. I've added simple checks for instances of IterableDataset. This is a half measure, though, since those checks effectively disable parts of the tokenizer_utils functionality when an IterableDataset is encountered.

Ideally, functions in tokenizer_utils should not assume anything about the datasets and do everything in the most generic way possible offering maximum compatibility with existing transformers/datasets tools.
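As a rough illustration of what a more generic approach could look like, here is a minimal sketch that samples rows without assuming `len()` or indexing. It is not the code merged in this PR. The helper name `sample_rows` and the demo class `Stream` are hypothetical, and the capability check uses plain duck typing rather than any transformers/datasets API.

```python
from itertools import islice
from typing import Any, Iterable, List


def sample_rows(dataset: Iterable[Any], n: int = 250) -> List[Any]:
    """Return up to n leading rows, plus up to n trailing rows when indexing is available.

    Hypothetical helper for illustration; not part of unsloth_zoo.
    """
    # Map-style datasets expose both __len__ and __getitem__.
    if hasattr(dataset, "__len__") and hasattr(dataset, "__getitem__"):
        size = len(dataset)
        head = [dataset[j] for j in range(min(size, n))]
        # Trailing rows, skipping any index the head already covers.
        tail = [dataset[j] for j in range(max(size - n, len(head)), size)]
        return head + tail
    # Streaming datasets: only the leading rows can be read, one at a time.
    return list(islice(iter(dataset), n))


class Stream:
    """Stand-in for an iterable-only dataset (no __len__, no __getitem__)."""

    def __iter__(self):
        return iter(range(1000))


indexable_sample = sample_rows(list(range(10)), n=3)
stream_sample = sample_rows(Stream(), n=3)
```

With this shape, an iterable-only dataset still gets its leading rows checked instead of skipping the check entirely. The trade-off is that trailing rows are never inspected for streams, and a one-shot iterator (for example, a bare generator) would have its first `n` items consumed by the sampling.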

Multi-reviewer pass on the autocast wrapper / norm-upcast path:

- Instance-level forward (#2): an instance attribute `model.forward` (Unsloth runtime forward patching) shadows class-method overrides, so mutating __class__ silently bypassed the wrapper -> fp32 norm met a bf16 linear with no autocast and crashed. Now wrap the instance attribute when present; otherwise subclass as before.
- Wrapper gating (unslothai#5, unslothai#7): install the wrapper iff fp32 norm params actually exist (from our upcast, the legacy env upcast, or an external _pre_set_compute_dtype policy) -- not on the upcast DECISION. Fixes the rollback path leaving external fp32 norms exposed, and stops wrapping models with no fp32 norm. Add _unwrap_forward_in_bf16_autocast for re-prepare (unslothai#10).
- config.architectures leak (unslothai#8/unslothai#9): keep the original __name__ on the generated subclass (unique __qualname__ for registration) so save_pretrained records the base architecture.
- Device detection (unslothai#11): recurse into mapping/list/tuple batches and fall back to the model's parameter device instead of defaulting to "cuda".
- Legacy UNSLOTH_UPCAST_LAYERNORM (#1/#3/unslothai#4): route through the shared _cast_named_module + union matcher and honour the external-policy deferral.
- Recursive external-ownership guard (unslothai#6): record descendants of tagged modules (the external policy casts recursively).
- Fresh-interpreter pickle test (unslothai#12): real subprocess load.

Shared helpers: _find_tensor_device_type, _call_forward_with_bf16_autocast, _canonical_module_name, _cast_named_module.

Unit suite: 25 passed.

fix: check iterable datasets

0580b69

danielhanchen merged commit d2f2903 into unslothai:main Oct 29, 2024

danielhanchen merged 1 commit into unslothai:main from oKatanaaa:fix
