Hi,
I am getting the following error when running pretrain_gpt.sh:
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch']
torch version .................... 1.8.2+cu111
torch cuda version ............... 11.1
nvcc version ..................... 11.1
deepspeed install path ........... ['/qfs/people/shar703/scripts/mega_ai/deepspeed_megatron/DeepSpeed/deepspeed']
deepspeed info ................... 0.5.9+1d295ff, 1d295ff, master
deepspeed wheel compiled w. ...... torch 1.8, cuda 11.1
**** Git info for Megatron: git_hash=1ac4a44 git_branch=main ****
using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1
using torch.float16 for parameters ...
------------------------ arguments ------------------------
accumulate_allreduce_grads_in_fp32 .............. False
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.999
adam_eps ........................................ 1e-08
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
apply_query_key_layer_scaling ................... True
apply_residual_connection_post_layernorm ........ False
attention_dropout ............................... 0.1
attention_softmax_in_fp32 ....................... False
bert_binary_head ................................ True
bert_load ....................................... None
bf16 ............................................ False
bias_dropout_fusion ............................. True
bias_gelu_fusion ................................ True
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
checkpoint_activations .......................... True
checkpoint_in_cpu ............................... False
checkpoint_num_layers ........................... 1
clip_grad ....................................... 1.0
consumed_train_samples .......................... 0
consumed_train_tokens ........................... 0
consumed_valid_samples .......................... 0
contigious_checkpointing ........................ False
cpu_optimizer ................................... False
cpu_torch_adam .................................. False
curriculum_learning ............................. False
data_impl ....................................... infer
data_parallel_size .............................. 1
data_path ....................................... ['cord19/chemistry_cord19_abstract_document']
dataloader_type ................................. single
DDP_impl ........................................ local
decoder_seq_length .............................. None
deepscale ....................................... False
deepscale_config ................................ None
deepspeed ....................................... False
deepspeed_activation_checkpointing .............. False
deepspeed_config ................................ None
deepspeed_mpi ................................... False
distribute_checkpointed_activations ............. False
distributed_backend ............................. nccl
embedding_path .................................. None
encoder_seq_length .............................. 1024
eod_mask_loss ................................... False
eval_interval ................................... 100
eval_iters ...................................... 10
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
ffn_hidden_size ................................. 4096
finetune ........................................ False
fp16 ............................................ True
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
global_batch_size ............................... 8
hidden_dropout .................................. 0.1
hidden_size ..................................... 1024
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_dim ......................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
init_method_std ................................. 0.02
init_method_xavier_uniform ...................... False
initial_loss_scale .............................. 4294967296
kv_channels ..................................... 64
layernorm_epsilon ............................... 1e-05
lazy_mpu_init ................................... None
load ............................................ checkpoints/gpt2_345m
local_rank ...................................... None
log_batch_size_to_tensorboard ................... False
log_interval .................................... 10
log_learning_rate_to_tensorboard ................ True
log_loss_scale_to_tensorboard ................... True
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... False
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. 0.00015
lr_decay_iters .................................. 320000
lr_decay_samples ................................ None
lr_decay_style .................................. cosine
lr_decay_tokens ................................. None
lr_warmup_fraction .............................. 0.01
lr_warmup_iters ................................. 0
lr_warmup_samples ............................... 0
make_vocab_size_divisible_by .................... 128
mask_prob ....................................... 0.15
masked_softmax_fusion ........................... True
max_position_embeddings ......................... 1024
memory_centric_tiled_linear ..................... False
merge_file ...................................... ../deepspeed_megatron/gpt_files/gpt2-merges.txt
micro_batch_size ................................ 4
min_loss_scale .................................. 1.0
min_lr .......................................... 0.0
mmap_warmup ..................................... False
no_load_optim ................................... None
no_load_rng ..................................... None
no_save_optim ................................... None
no_save_rng ..................................... None
num_attention_heads ............................. 16
num_channels .................................... 3
num_classes ..................................... 1000
num_layers ...................................... 24
num_layers_per_virtual_pipeline_stage ........... None
num_workers ..................................... 2
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
override_lr_scheduler ........................... False
params_dtype .................................... torch.float16
partition_activations ........................... False
patch_dim ....................................... 16
pipeline_model_parallel_size .................... 1
profile_backward ................................ False
query_in_block_prob ............................. 0.1
rampup_batch_size ............................... None
rank ............................................ 0
remote_device ................................... none
reset_attention_mask ............................ False
reset_position_ids .............................. False
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
sample_rate ..................................... 1.0
save ............................................ checkpoints/gpt2_345m
save_interval ................................... 500
scatter_gather_tensors_in_pipeline .............. True
scattered_embeddings ............................ False
seed ............................................ 1234
seq_length ...................................... 1024
sgd_momentum .................................... 0.9
short_seq_prob .................................. 0.1
split ........................................... 969, 30, 1
split_transformers .............................. False
synchronize_each_layer .......................... False
tensor_model_parallel_size ...................... 1
tensorboard_dir ................................. None
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
tile_factor ..................................... 1
titles_data_path ................................ None
tokenizer_type .................................. GPT2BPETokenizer
train_iters ..................................... 500000
train_samples ................................... None
train_tokens .................................... None
use_checkpoint_lr_scheduler ..................... False
use_contiguous_buffers_in_ddp ................... False
use_cpu_initialization .......................... None
use_one_sent_docs ............................... False
use_pin_memory .................................. False
virtual_pipeline_model_parallel_size ............ None
vocab_extra_ids ................................. 0
vocab_file ...................................... ../deepspeed_megatron/gpt_files/gpt2-vocab.json
weight_decay .................................... 0.01
world_size ...................................... 1
zero_allgather_bucket_size ...................... 0.0
zero_contigious_gradients ....................... False
zero_reduce_bucket_size ......................... 0.0
zero_reduce_scatter ............................. False
zero_stage ...................................... 1.0
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 2
building GPT2BPETokenizer tokenizer ...
padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
initializing torch distributed ...
initializing tensor model parallel with size 1
initializing pipeline model parallel with size 1
setting random seeds to 1234 ...
initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
compiling dataset index builder ...
make: Entering directory `/qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/data'
make: Nothing to be done for `default'.
make: Leaving directory `/qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/data'
done with dataset index builder. Compilation time: 0.051 seconds
compiling and loading fused kernels ...
Traceback (most recent call last):
File "/people/shar703/anaconda3/envs/deepspeed/bin/ninja", line 33, in
sys.exit(load_entry_point('ninja', 'console_scripts', 'ninja')())
File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/site-packages/ninja-1.10.2.3-py3.8-linux-x86_64.egg/ninja/init.py", line 51, in ninja
raise SystemExit(_program('ninja', sys.argv[1:]))
File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/site-packages/ninja-1.10.2.3-py3.8-linux-x86_64.egg/ninja/init.py", line 47, in _program
return subprocess.call([os.path.join(BIN_DIR, name)] + args)
File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/subprocess.py", line 340, in call
with Popen(*popenargs, **kwargs) as p:
File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/subprocess.py", line 858, in init
self._execute_child(args, executable, preexec_fn, close_fds,
File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/subprocess.py", line 1704, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
PermissionError: [Errno 13] Permission denied: '/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/site-packages/ninja-1.10.2.3-py3.8-linux-x86_64.egg/ninja/data/bin/ninja'
Traceback (most recent call last):
File "pretrain_gpt.py", line 231, in
pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
File "/qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/training.py", line 96, in pretrain
initialize_megatron(extra_args_provider=extra_args_provider,
File "/qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/initialize.py", line 89, in initialize_megatron
_compile_dependencies()
File "/qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/initialize.py", line 137, in _compile_dependencies
fused_kernels.load(args)
File "/qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/fused_kernels/init.py", line 71, in load
scaled_upper_triang_masked_softmax_cuda = _cpp_extention_load_helper(
File "/qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/fused_kernels/init.py", line 47, in _cpp_extention_load_helper
return cpp_extension.load(
File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1079, in load
return _jit_compile(
File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1292, in _jit_compile
_write_ninja_file_and_build_library(
File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1373, in _write_ninja_file_and_build_library
verify_ninja_availability()
File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1429, in verify_ninja_availability
raise RuntimeError("Ninja is required to load C++ extensions")
RuntimeError: Ninja is required to load C++ extensions
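
For what it's worth, the root failure appears to be the PermissionError above rather than the final RuntimeError: torch.utils.cpp_extension shells out to the ninja binary bundled inside the ninja egg, and that file is not executable in this environment, so the fused-kernel JIT build aborts. Below is a minimal diagnostic/workaround sketch; the NINJA_BIN path is copied verbatim from the traceback, and the chmod/reinstall steps are common workarounds, not a confirmed fix:

```bash
# Path copied from the PermissionError in the traceback above.
NINJA_BIN=/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/site-packages/ninja-1.10.2.3-py3.8-linux-x86_64.egg/ninja/data/bin/ninja

# 1. Check whether the execute bit is missing.
ls -l "$NINJA_BIN"

# 2a. If it is, restore it in place.
chmod +x "$NINJA_BIN"

# 2b. Or replace the egg install with a wheel, which ships a working
#     ninja entry point on PATH.
pip install --force-reinstall ninja

# 3. Sanity check: cpp_extension only needs this command to succeed.
ninja --version
```

If the filesystem hosting the conda env forbids executing files there (e.g. a noexec mount, which would produce [Errno 13] even with the execute bit set), installing ninja via the system package manager or `conda install ninja` may be needed; that is speculation from the log, not something it confirms. The libaio warnings earlier in the report are unrelated to this failure and only mean the async_io op stays uninstalled.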