System Info

Python 3.9, PyTorch (cu117 build), DeepSpeed 0.6.5, world size 2 (two GPUs); see the log below.

Describe the bug

When fine-tuning the diffusers/examples/text_to_image/train_text_to_image.py script with DeepSpeed ZeRO stage 3, training fails with RuntimeError: 'weight' must be 2-D. The full error log is:

04/11/2023 16:59:12 0:INFO: Prepare everything with our accelerator.
[2023-04-11 16:59:12,036] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed info: version=0.6.5, git-hash=unknown, git-branch=unknown
04112023 16:59:13|INFO|torch.distributed.distributed_c10d| Added key: store_based_barrier_key:2 to store for rank: 0
04112023 16:59:13|INFO|torch.distributed.distributed_c10d| Added key: store_based_barrier_key:2 to store for rank: 1
04112023 16:59:13|INFO|torch.distributed.distributed_c10d| Rank 1: Completed store-based barrier for key:store_based_barrier_key:2 with 2 nodes.
/usr/local/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
warnings.warn(
04112023 16:59:13|INFO|torch.distributed.distributed_c10d| Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 2 nodes.
/usr/local/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
warnings.warn(
[2023-04-11 16:59:13,796] [INFO] [engine.py:278:__init__] DeepSpeed Flops Profiler Enabled: False
[2023-04-11 16:59:13,796] [INFO] [engine.py:1086:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer
[2023-04-11 16:59:13,796] [INFO] [engine.py:1092:_configure_optimizer] Using client Optimizer as basic optimizer
[2023-04-11 16:59:13,878] [INFO] [engine.py:1108:_configure_optimizer] DeepSpeed Basic Optimizer = AdamW
[2023-04-11 16:59:13,878] [INFO] [utils.py:52:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2023-04-11 16:59:13,878] [INFO] [logging.py:69:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer
[2023-04-11 16:59:13,878] [INFO] [engine.py:1410:_configure_zero_optimizer] Initializing ZeRO Stage 3
[2023-04-11 16:59:13,887] [INFO] [stage3.py:275:__init__] Reduce bucket size 500000000
[2023-04-11 16:59:13,887] [INFO] [stage3.py:276:__init__] Prefetch bucket size 50000000
Using /home/hadoop-hmart-waimai-rank/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /home/hadoop-hmart-waimai-rank/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Emitting ninja build file /home/hadoop-hmart-waimai-rank/.cache/torch_extensions/py39_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.5212891101837158 seconds
Loading extension module utils...
Time to load utils op: 0.5023727416992188 seconds
[2023-04-11 16:59:16,286] [INFO] [stage3.py:567:_setup_for_real_optimizer] optimizer state initialized
Using /home/hadoop-hmart-waimai-rank/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0005068778991699219 seconds
[2023-04-11 16:59:16,615] [INFO] [utils.py:828:see_memory_usage] After initializing ZeRO optimizer
[2023-04-11 16:59:16,616] [INFO] [utils.py:829:see_memory_usage] MA 7.45 GB Max_MA 10.52 GB CA 11.47 GB Max_CA 11 GB
[2023-04-11 16:59:16,616] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 5.49 GB, percent = 2.4%
[2023-04-11 16:59:16,616] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed Final Optimizer = AdamW
[2023-04-11 16:59:16,616] [INFO] [engine.py:795:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2023-04-11 16:59:16,616] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2023-04-11 16:59:16,617] [INFO] [logging.py:69:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0001], mom=[(0.9, 0.999)]
[2023-04-11 16:59:16,618] [INFO] [config.py:1059:print] DeepSpeedEngine configuration:
[2023-04-11 16:59:16,619] [INFO] [config.py:1063:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2023-04-11 16:59:16,619] [INFO] [config.py:1063:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-04-11 16:59:16,619] [INFO] [config.py:1063:print] amp_enabled .................. False
[2023-04-11 16:59:16,619] [INFO] [config.py:1063:print] amp_params ................... False
[2023-04-11 16:59:16,619] [INFO] [config.py:1063:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": null,
"exps_dir": null,
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2023-04-11 16:59:16,619] [INFO] [config.py:1063:print] bfloat16_enabled ............. False
[2023-04-11 16:59:16,619] [INFO] [config.py:1063:print] checkpoint_tag_validation_enabled True
[2023-04-11 16:59:16,619] [INFO] [config.py:1063:print] checkpoint_tag_validation_fail False
[2023-04-11 16:59:16,619] [INFO] [config.py:1063:print] communication_data_type ...... None
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print] curriculum_enabled ........... False
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print] curriculum_params ............ False
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print] dataloader_drop_last ......... False
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print] disable_allgather ............ False
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print] dump_state ................... False
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print] dynamic_loss_scale_args ...... None
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print] eigenvalue_enabled ........... False
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print] eigenvalue_gas_boundary_resolution 1
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print] eigenvalue_layer_num ......... 0
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print] eigenvalue_max_iter .......... 100
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print] eigenvalue_stability ......... 1e-06
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print] eigenvalue_tol ............... 0.01
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print] eigenvalue_verbose ........... False
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print] elasticity_enabled ........... False
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print] flops_profiler_config ........ {
"enabled": false,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print] fp16_enabled ................. True
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print] fp16_master_weights_and_gradients False
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] fp16_mixed_quantize .......... False
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] global_rank .................. 0
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] gradient_accumulation_steps .. 1
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] gradient_clipping ............ 0.0
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] gradient_predivide_factor .... 1.0
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] initial_dynamic_scale ........ 4294967296
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] loss_scale ................... 0
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] memory_breakdown ............. False
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] optimizer_legacy_fusion ...... False
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] optimizer_name ............... None
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] optimizer_params ............. None
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] pld_enabled .................. False
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] pld_params ................... False
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] prescale_gradients ........... False
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] quantize_change_rate ......... 0.001
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] quantize_groups .............. 1
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] quantize_offset .............. 1000
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] quantize_period .............. 1000
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] quantize_rounding ............ 0
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] quantize_start_bits .......... 16
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] quantize_target_bits ......... 8
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] quantize_training_enabled .... False
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] quantize_type ................ 0
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] quantize_verbose ............. False
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] scheduler_name ............... None
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] scheduler_params ............. None
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] sparse_attention ............. None
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] sparse_gradients_enabled ..... False
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] steps_per_print .............. inf
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] tensorboard_enabled .......... False
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] tensorboard_job_name ......... DeepSpeedJobName
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] tensorboard_output_path ......
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] train_batch_size ............. 16
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] train_micro_batch_size_per_gpu 8
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] use_quantizer_kernel ......... False
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] wall_clock_breakdown ......... False
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] world_size ................... 2
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] zero_allow_untested_optimizer True
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] zero_config .................. {
"stage": 3,
"contiguous_gradients": true,
"reduce_scatter": true,
"reduce_bucket_size": 5.000000e+08,
"allgather_partitions": true,
"allgather_bucket_size": 5.000000e+08,
"overlap_comm": true,
"load_from_fp32_weights": true,
"elastic_checkpoint": false,
"offload_param": null,
"offload_optimizer": null,
"sub_group_size": 1.000000e+09,
"prefetch_bucket_size": 5.000000e+07,
"param_persistence_threshold": 1.000000e+05,
"max_live_parameters": 1.000000e+09,
"max_reuse_distance": 1.000000e+09,
"gather_16bit_weights_on_model_save": false,
"ignore_unused_parameters": true,
"round_robin_gradients": false,
"legacy_stage1": false
}
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] zero_enabled ................. True
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print] zero_optimization_stage ...... 3
[2023-04-11 16:59:16,622] [INFO] [config.py:1065:print] json = {
"train_batch_size": 16,
"train_micro_batch_size_per_gpu": 8,
"gradient_accumulation_steps": 1,
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "none"
},
"offload_param": {
"device": "none"
},
"stage3_gather_16bit_weights_on_model_save": false
},
"steps_per_print": inf,
"fp16": {
"enabled": true,
"auto_cast": true
},
"bf16": {
"enabled": false
},
"zero_allow_untested_optimizer": true
}
Using /home/hadoop-hmart-waimai-rank/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0004420280456542969 seconds
04/11/2023 16:59:16 0:INFO: set weight type
04/11/2023 16:59:16 0:INFO: Move text_encode and vae to gpu and cast to weight_dtype
04/11/2023 16:59:16 0:INFO: [starship] accelerate not support all python data type
04/11/2023 16:59:16 0:INFO: ***** Running training *****
04/11/2023 16:59:16 0:INFO: Num examples = 400
04/11/2023 16:59:16 0:INFO: Num Epochs = 100
04/11/2023 16:59:16 0:INFO: Instantaneous batch size per device = 8
04/11/2023 16:59:16 0:INFO: Total train batch size (w. parallel, distributed & accumulation) = 16
04/11/2023 16:59:16 0:INFO: Gradient Accumulation steps = 1
04/11/2023 16:59:16 0:INFO: Total optimization steps = 2500
Steps: 0%| | 0/2500 [00:00<?, ?it/s]Parameter containing:
tensor([], device='cuda:0', dtype=torch.float16)
Traceback (most recent call last):
File "/workdir/fengyu05/501587/2924467c592a472aa750166c252e166d/src/app/main.py", line 29, in <module>
main()
File "/workdir/fengyu05/501587/2924467c592a472aa750166c252e166d/src/app/main.py", line 21, in main
run_aigc(args)
File "/workdir/fengyu05/501587/2924467c592a472aa750166c252e166d/src/app/task.py", line 61, in run_aigc
train(args)
File "/workdir/fengyu05/501587/2924467c592a472aa750166c252e166d/src/diffuser/train_txt2img.py", line 526, in train
encoder_hidden_states = text_encoder(batch["input_ids"].to(accelerator.device))[0]
File "/usr/local/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/conda/lib/python3.9/site-packages/transformers/models/clip/modeling_clip.py", line 823, in forward
return self.text_model(
File "/usr/local/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/conda/lib/python3.9/site-packages/transformers/models/clip/modeling_clip.py", line 719, in forward
hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids)
File "/usr/local/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/conda/lib/python3.9/site-packages/transformers/models/clip/modeling_clip.py", line 234, in forward
inputs_embeds = self.token_embedding(input_ids)
File "/usr/local/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/conda/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 160, in forward
return F.embedding(
File "/usr/local/conda/lib/python3.9/site-packages/torch/nn/functional.py", line 2210, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: 'weight' must be 2-D
Parameter containing:
tensor([], device='cuda:1', dtype=torch.float16)
Steps: 0%| | 0/2500 [00:05<?, ?it/s]
Traceback (most recent call last):
File "/workdir/fengyu05/501587/2924467c592a472aa750166c252e166d/src/app/main.py", line 29, in <module>
main()
File "/workdir/fengyu05/501587/2924467c592a472aa750166c252e166d/src/app/main.py", line 21, in main
run_aigc(args)
File "/workdir/fengyu05/501587/2924467c592a472aa750166c252e166d/src/app/task.py", line 61, in run_aigc
train(args)
File "/workdir/fengyu05/501587/2924467c592a472aa750166c252e166d/src/diffuser/train_txt2img.py", line 526, in train
encoder_hidden_states = text_encoder(batch["input_ids"].to(accelerator.device))[0]
File "/usr/local/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/conda/lib/python3.9/site-packages/transformers/models/clip/modeling_clip.py", line 823, in forward
return self.text_model(
File "/usr/local/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/conda/lib/python3.9/site-packages/transformers/models/clip/modeling_clip.py", line 719, in forward
hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids)
File "/usr/local/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/conda/lib/python3.9/site-packages/transformers/models/clip/modeling_clip.py", line 234, in forward
inputs_embeds = self.token_embedding(input_ids)
File "/usr/local/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/conda/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 160, in forward
return F.embedding(
File "/usr/local/conda/lib/python3.9/site-packages/torch/nn/functional.py", line 2210, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: 'weight' must be 2-D
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 32336) of binary: /usr/local/conda/bin/python
Traceback (most recent call last):
File "/usr/local/conda/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/conda/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/conda/lib/python3.9/site-packages/torch/distributed/launch.py", line 195, in <module>
main()
File "/usr/local/conda/lib/python3.9/site-packages/torch/distributed/launch.py", line 191, in main
launch(args)
File "/usr/local/conda/lib/python3.9/site-packages/torch/distributed/launch.py", line 176, in launch
run(args)
File "/usr/local/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/usr/local/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
src/app/main.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2023-04-11_16:59:28
host : workbenchxwmx64350ee0-f9ggd
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 32337)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-04-11_16:59:28
host : workbenchxwmx64350ee0-f9ggd
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 32336)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
real 0m26.485s
user 0m23.241s
sys 0m22.802s
Reproduction

I am also experiencing the same issue as mentioned in huggingface/diffusers#1865, therefore I have copied the reproduction steps from the original post: "An error is reported when using deepspeed's zero stage3 to finetune the diffusers/examples/text_to_image/train_text_to_image.py script. My machine's GPU is 2*A100, running on deepspeed zero stage3." The error log from that post is:
File "train_text_to_image.py ", line 718, in <module>
main()
File "train_text_to_image.py ", line 648, in main
encoder_hidden_states = text_encoder(batch["input_ids"])[0]
File "/opt/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/miniconda3/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 739, in forward
return_dict=return_dict,
File "/opt/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/miniconda3/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 636, in forward
hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids)
File "/opt/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/miniconda3/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 165, in forward
inputs_embeds = self.token_embedding(input_ids)
File "/opt/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/miniconda3/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 160, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "/opt/miniconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 2183, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: 'weight' must be 2-D
/home/kas/zero_stage3_offload_config.json
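The contents of that config file were not carried over into this copy. Judging by the file name, it is a standard ZeRO stage 3 config with CPU offload, roughly along these lines (values are illustrative assumptions, not the original file; the json echoed in the log above shows my own run used offload device "none"):

```json
{
  "train_batch_size": 16,
  "train_micro_batch_size_per_gpu": 8,
  "gradient_accumulation_steps": 1,
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" },
    "offload_param": { "device": "cpu" },
    "stage3_gather_16bit_weights_on_model_save": false
  }
}
```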
Since ZeRO stage 3 partitions every parameter across ranks, the CLIP text encoder's embedding weight shows up as an empty placeholder (the "Parameter containing: tensor([], device='cuda:0', dtype=torch.float16)" printed in the log above), which is why F.embedding rejects it with 'weight' must be 2-D. I read huggingface/diffusers#1865, https://www.deepspeed.ai/tutorials/zero/#allocating-massive-megatron-lm-models and https://deepspeed.readthedocs.io/en/latest/zero3.html#deepspeed.zero.GatheredParameters, and modified /usr/local/conda/lib/python3.9/site-packages/transformers/models/clip/modeling_clip.py as follows:
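(The exact snippet was not preserved in this copy of the issue. The sketch below reconstructs the kind of change the GatheredParameters docs describe: gather the partitioned embedding weights for the duration of the lookup in CLIPTextEmbeddings.forward. The method name matches transformers' modeling_clip.py; the body is an assumption, not the exact edit.)

```python
# Sketch of a GatheredParameters-style patch to CLIPTextEmbeddings.forward in
# transformers/models/clip/modeling_clip.py. Reconstructed from the DeepSpeed
# docs linked above; not the exact modification that was attempted.
import deepspeed

def forward(self, input_ids=None, position_ids=None, inputs_embeds=None):
    seq_length = input_ids.shape[-1] if input_ids is not None else inputs_embeds.shape[-2]
    if position_ids is None:
        position_ids = self.position_ids[:, :seq_length]
    # Under ZeRO stage 3 these weights are partitioned (empty placeholders);
    # gather them on every rank while the lookups run. modifier_rank=None
    # means no rank modifies the gathered values.
    with deepspeed.zero.GatheredParameters(
            [self.token_embedding.weight, self.position_embedding.weight],
            modifier_rank=None):
        if inputs_embeds is None:
            inputs_embeds = self.token_embedding(input_ids)
        position_embeddings = self.position_embedding(position_ids)
        return inputs_embeds + position_embeddings
```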
but it does not work.
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Expected behavior
The goal is to be able to use Zero3 normally.
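For reference, the failure mode can be illustrated stand-alone: under ZeRO stage 3, a module's parameters are empty placeholders until gathered, exactly as in the "allocating massive Megatron-LM models" tutorial linked above. A hypothetical minimal sketch (assumes the deepspeed launcher and a CUDA device; not code from this report):

```python
# Hypothetical reproduction of "'weight' must be 2-D" under ZeRO stage 3.
# Run under the deepspeed launcher, e.g. `deepspeed --num_gpus=2 repro.py`.
import deepspeed
import torch

# Allocate the module with its parameters partitioned at construction time.
with deepspeed.zero.Init():
    emb = torch.nn.Embedding(1000, 64)

print(emb.weight)  # Parameter containing: tensor([], ...) -- an empty placeholder

ids = torch.tensor([[1, 2, 3]], device="cuda")
try:
    emb(ids)  # fails: the placeholder weight is empty, not 2-D
except RuntimeError as err:
    print(err)  # 'weight' must be 2-D

# Gathering the parameter temporarily materializes the full 2-D weight,
# so the same lookup succeeds inside the gathered context.
with deepspeed.zero.GatheredParameters(emb.weight, modifier_rank=None):
    out = emb(ids)
```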