Fix: Using IterableDataset crashes training #8

Merged
danielhanchen merged 1 commit into
unslothai:mainfrom
oKatanaaa:fix
Oct 29, 2024

@oKatanaaa

Training crashes with the following error when the dataset inherits from IterableDataset:

/opt/conda/lib/python3.10/site-packages/unsloth_zoo/tokenizer_utils.py:238 in fix_untrained_tokens

  235     pass
  236
  237     # Check the first 250, last 250 input_ids
❱ 238     size_dataset = len(train_dataset)
  239     size = min(size_dataset, 250)
  240     for j in range(size):
  241         input_ids = train_dataset[j]
TypeError: object of type 'IterableDataset' has no len()

The code in tokenizer_utils assumes that the dataset is always indexable, which causes the error. I've added simple checks for instances of IterableDataset. This is a half measure, though, since those checks disable parts of the tokenizer_utils functionality when an IterableDataset is encountered.

Ideally, functions in tokenizer_utils should not assume anything about the datasets and do everything in the most generic way possible offering maximum compatibility with existing transformers/datasets tools.
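
As a rough illustration of what a more generic approach could look like, here is a minimal sketch, not the code in this PR. The helper name `sample_rows` and its `n` parameter are hypothetical. It assumes a dataset without `len()` can still be iterated, and it falls back to reading only the first rows of a stream, since the last rows cannot be reached without consuming the whole stream.

```python
from itertools import islice


def sample_rows(train_dataset, n=250):
    """Hypothetical helper: collect the rows to inspect from either kind of dataset.

    Sized, indexable datasets yield up to the first n and last n rows.
    Datasets without len() (streams) yield only the first n rows.
    """
    try:
        size = len(train_dataset)
    except TypeError:
        # No len(): treat the dataset as a stream and read only the first n rows.
        return list(islice(iter(train_dataset), n))

    head = [train_dataset[j] for j in range(min(size, n))]
    # Start the tail after the head so small datasets are not sampled twice.
    tail = [train_dataset[j] for j in range(max(size - n, len(head)), size)]
    return head + tail
```

With something like this, the check in fix_untrained_tokens could loop over `sample_rows(train_dataset)` instead of calling `len(train_dataset)` directly, so streaming datasets would still get a partial check rather than skipping it entirely.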

@danielhanchen danielhanchen merged commit d2f2903 into unslothai:main Oct 29, 2024
mmathew23 added a commit to mmathew23/unsloth-zoo that referenced this pull request May 22, 2026
Multi-reviewer pass on the autocast wrapper / norm-upcast path:

- Instance-level forward (#2): an instance attribute `model.forward`
  (Unsloth runtime forward patching) shadows class-method overrides, so
  mutating __class__ silently bypassed the wrapper -> fp32 norm met a bf16
  linear with no autocast and crashed. Now wrap the instance attribute when
  present; otherwise subclass as before.
- Wrapper gating (unslothai#5, unslothai#7): install the wrapper iff fp32 norm params actually
  exist (from our upcast, the legacy env upcast, or an external
  _pre_set_compute_dtype policy) -- not on the upcast DECISION. Fixes the
  rollback path leaving external fp32 norms exposed, and stops wrapping models
  with no fp32 norm. Add _unwrap_forward_in_bf16_autocast for re-prepare (unslothai#10).
- config.architectures leak (unslothai#8/unslothai#9): keep the original __name__ on the
  generated subclass (unique __qualname__ for registration) so save_pretrained
  records the base architecture.
- Device detection (unslothai#11): recurse into mapping/list/tuple batches and fall
  back to the model's parameter device instead of defaulting to "cuda".
- Legacy UNSLOTH_UPCAST_LAYERNORM (#1/#3/unslothai#4): route through the shared
  _cast_named_module + union matcher and honour the external-policy deferral.
- Recursive external-ownership guard (unslothai#6): record descendants of tagged
  modules (the external policy casts recursively).
- Fresh-interpreter pickle test (unslothai#12): real subprocess load.

Shared helpers: _find_tensor_device_type, _call_forward_with_bf16_autocast,
_canonical_module_name, _cast_named_module. Unit suite: 25 passed.