Hi.
I'm trying to use DeepSpeed with multiple models in my code, but I get the error below. Do you have any idea how to solve this issue? Thanks in advance.
File "train_ds.py", line 98, in <module>
solver = Solver(opt)
File "/data2/1konny/svg/solver_ds.py", line 40, in __init__
self.init_models_and_optimizers()
File "/data2/1konny/svg/solver_ds.py", line 117, in init_models_and_optimizers
self.decoder, self.decoder_optimizer, _, _ = ds.initialize(opt, model=decoder, model_parameters=decoder_params)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/__init__.py", line 87, in initialize
collate_fn=collate_fn)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/pt/deepspeed_light.py", line 123, in __init__
dist.init_process_group(backend="nccl")
File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 372, in init_process_group
raise RuntimeError("trying to initialize the default process group "
RuntimeError: trying to initialize the default process group twice!
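The traceback shows that `torch.distributed` allows only one default process group per process, and each `deepspeed.initialize` call tries to create it. A minimal repro of just the PyTorch side (single process, CPU `gloo` backend for illustration; the rank/world-size values and port are assumptions, not from the issue):

```python
import os
import torch.distributed as dist

# First initialization of the default process group succeeds.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

try:
    # A second call is rejected, exactly as in the traceback above.
    dist.init_process_group(backend="gloo", rank=0, world_size=1)
    raised = False
except RuntimeError as err:
    raised = True
    print(err)  # "trying to initialize the default process group twice!"

dist.destroy_process_group()
```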
ds_config.json
{
"train_batch_size": 4,
"gradient_accumulation_steps": 1,
"steps_per_print": 1,
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.0001,
"max_grad_norm": 1.0,
"betas": [
0.9,
0.999
]
}
}
}
command-line
deepspeed train_ds.py --deepspeed --deepspeed_config deepspeed_util/ds_config.json ...
code
training_data = load_dataset()
encoder_params = filter(lambda p: p.requires_grad, encoder.parameters())
decoder_params = filter(lambda p: p.requires_grad, decoder.parameters())
# First call succeeds and sets up the default process group.
self.encoder, self.encoder_optim, train_loader, _ = deepspeed.initialize(opt, model=encoder, model_parameters=encoder_params, training_data=training_data)
# Second call fails: deepspeed.initialize calls dist.init_process_group again.
self.decoder, self.decoder_optim, _, _ = deepspeed.initialize(opt, model=decoder, model_parameters=decoder_params)
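One workaround sketch (not the DeepSpeed API itself): guard process-group creation with `dist.is_initialized()` so that only the first call sets it up. `ensure_process_group` below is a hypothetical helper, shown with the CPU `gloo` backend and a single-process rank/world size purely for illustration; in a real run DeepSpeed's NCCL setup would supply these.

```python
import os
import torch.distributed as dist

def ensure_process_group(backend="gloo"):
    """Hypothetical guard: initialize the default process group only once."""
    if not dist.is_initialized():
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
        os.environ.setdefault("MASTER_PORT", "29501")
        dist.init_process_group(backend=backend, rank=0, world_size=1)

ensure_process_group()
ensure_process_group()  # no-op the second time; no RuntimeError
```

The same `if not dist.is_initialized()` check, applied before the `dist.init_process_group(backend="nccl")` line in deepspeed_light.py, would let two `deepspeed.initialize` calls share one process group.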