
Checkpoints created prior to refactoring of bf16_optimizer not working #2382

@mayank31398


@stas00 @tjruwase fyi
Minimal script that reproduces the failure with the BLOOM-176B optimizer states:

>>> import torch
>>> torch.load("/net/llm-shared-nfs/data/BLOOM/models--bigscience--bloom-optimizer-states/snapshots/fffeb1434b96997490396f46df742fb0be8f7774/global_step95000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/serialization.py", line 712, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/serialization.py", line 1049, in _load
    result = unpickler.load()
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/serialization.py", line 1042, in find_class
    return super().find_class(mod_name, name)
AttributeError: Can't get attribute 'fragment_address' on <module 'deepspeed.runtime.bf16_optimizer' from '/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/runtime/bf16_optimizer.py'>
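
The traceback fails in `find_class`, meaning the pickle stream refers to `fragment_address` by a module path where that name no longer exists. The sketch below reproduces this kind of failure with the standard `pickle` module only and shows one possible workaround: an `Unpickler` subclass that redirects the old location to a new one. The module names `old_home` and `new_home`, the field names, and the `RedirectingUnpickler` class are all hypothetical stand-ins. Nothing here is taken from DeepSpeed, and where `fragment_address` actually lives after the refactoring is not stated in this report.

```python
import io
import pickle
import sys
import types
from collections import namedtuple

# Hypothetical stand-ins for the module a checkpoint was pickled against
# ("old_home") and the module the class lives in after a refactor ("new_home").
old_home = types.ModuleType("old_home")
new_home = types.ModuleType("new_home")
sys.modules["old_home"] = old_home
sys.modules["new_home"] = new_home

# Before the refactor: the class lives in old_home and an instance is pickled.
# The field names are illustrative assumptions, not taken from DeepSpeed.
fragment_address = namedtuple("fragment_address", ["numel", "start"], module="old_home")
old_home.fragment_address = fragment_address
payload = pickle.dumps(fragment_address(numel=4, start=16))

# After the refactor: the class is only reachable from new_home.
del old_home.fragment_address
fragment_address.__module__ = "new_home"
new_home.fragment_address = fragment_address

# A plain load now raises AttributeError because old_home lacks the name.
try:
    pickle.loads(payload)
    plain_load_error = None
except AttributeError as exc:
    plain_load_error = exc


class RedirectingUnpickler(pickle.Unpickler):
    """Unpickler that maps moved (module, name) pairs to their new location."""

    MOVED = {("old_home", "fragment_address"): ("new_home", "fragment_address")}

    def find_class(self, module, name):
        # Rewrite the lookup only for names known to have moved.
        module, name = self.MOVED.get((module, name), (module, name))
        return super().find_class(module, name)


restored = RedirectingUnpickler(io.BytesIO(payload)).load()
print(restored)
```

`torch.load` takes a `pickle_module` argument, so a similar redirect could in principle be supplied there, but that is untested against these checkpoints. A simpler alternative under the same assumptions is to assign the relocated class back onto the old module as an attribute before loading.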


Labels: bug, training
