[Bug] granite-vision-3.2-2b failing on sglang with "LlavaNextForConditionalGeneration not supported" #4062

@didier-durand

Description

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

Hi,

I have successfully run the 3.1 versions of the Granite models on the SGLang project (https://github.com/sgl-project/sglang).

I am now trying to run granite-vision-3.2-2b.

But it fails with the messages below, in particular: "Model architectures ['LlavaNextForConditionalGeneration'] are not supported for now".

Will IBM work with the SGLang project to allow this model to run on SGLang as well, so it can leverage SGLang's inference acceleration? The collaboration seems to have worked for v3.1; see https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/granite.py

Note: the issue seems specific to granite-vision-3.2-2b, because granite-3.2-2b-instruct works fine with SGLang. The architecture the checkpoint declares can be confirmed with the check below.
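
For reference, a minimal sketch to confirm the architecture the checkpoint declares (SGLang resolves the model class from this architectures field), assuming the transformers library is installed:

from transformers import AutoConfig

# Read the "architectures" field from the model's config.json on the Hugging Face Hub
cfg = AutoConfig.from_pretrained("ibm-granite/granite-vision-3.2-2b")
print(cfg.architectures)  # expected: ['LlavaNextForConditionalGeneration']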

Thanks,
Didier

bash-5.2# python3.12 -m sglang.launch_server --model ibm-granite/granite-vision-3.2-2b --model-path ibm-granite/granite-vision-3.2-2b --port 30000 --host 0.0.0.0 --log-level debug --trust-remote-code --tensor-parallel-size 4 --enable-p2p-check --disable-cuda-graph
INFO 03-04 09:02:43 __init__.py:190] Automatically detected platform cuda.
[2025-03-04 09:02:46] Setting Triton cache manager to: sglang.srt.utils:CustomCacheManager
[2025-03-04 09:02:46] server_args=ServerArgs(model_path='ibm-granite/granite-vision-3.2-2b', tokenizer_path='ibm-granite/granite-vision-3.2-2b', tokenizer_mode='auto', load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, quantization=None, context_length=None, device='cuda', served_model_name='ibm-granite/granite-vision-3.2-2b', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, host='0.0.0.0', port=30000, mem_fraction_static=0.85, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=2048, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=4, stream_interval=1, stream_output=False, random_seed=108653913, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='debug', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='sglang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', speculative_draft_model_path=None, speculative_algorithm=None, speculative_num_steps=5, speculative_num_draft_tokens=64, speculative_eagle_topk=8, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=80, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=True, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, return_hidden_states=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, enable_flashinfer_mla=False)
INFO 03-04 09:02:50 __init__.py:190] Automatically detected platform cuda.
INFO 03-04 09:02:50 __init__.py:190] Automatically detected platform cuda.
INFO 03-04 09:02:50 __init__.py:190] Automatically detected platform cuda.
INFO 03-04 09:02:50 __init__.py:190] Automatically detected platform cuda.
INFO 03-04 09:02:50 __init__.py:190] Automatically detected platform cuda.
[2025-03-04 09:02:53 TP0] Init torch distributed begin.
[2025-03-04 09:02:53 TP0] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:30779 backend=nccl
[2025-03-04 09:02:53 TP1] Init torch distributed begin.
[2025-03-04 09:02:53 TP3] Init torch distributed begin.
[2025-03-04 09:02:53 TP2] Init torch distributed begin.
[2025-03-04 09:02:53 TP1] world_size=4 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:30779 backend=nccl
[2025-03-04 09:02:53 TP2] world_size=4 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:30779 backend=nccl
[2025-03-04 09:02:53 TP3] world_size=4 rank=3 local_rank=3 distributed_init_method=tcp://127.0.0.1:30779 backend=nccl
[2025-03-04 09:02:53 TP0] Found nccl from library libnccl.so.2
[2025-03-04 09:02:53 TP2] Found nccl from library libnccl.so.2
[2025-03-04 09:02:53 TP0] sglang is using nccl==2.21.5
[2025-03-04 09:02:53 TP2] sglang is using nccl==2.21.5
[2025-03-04 09:02:53 TP1] Found nccl from library libnccl.so.2
[2025-03-04 09:02:53 TP3] Found nccl from library libnccl.so.2
[2025-03-04 09:02:53 TP3] sglang is using nccl==2.21.5
[2025-03-04 09:02:53 TP1] sglang is using nccl==2.21.5
[2025-03-04 09:02:53 TP3] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[2025-03-04 09:02:53 TP2] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[2025-03-04 09:02:53 TP0] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[2025-03-04 09:02:53 TP1] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[2025-03-04 09:02:53 TP0] Binding to tcp://127.0.0.1:51275
[2025-03-04 09:02:53 TP0] Message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer=<sglang.srt.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7f72a6c752e0>, local_subscribe_port=51275, remote_subscribe_port=None)
[2025-03-04 09:02:53 TP3] Connecting to tcp://127.0.0.1:51275
[2025-03-04 09:02:53 TP2] Connecting to tcp://127.0.0.1:51275
[2025-03-04 09:02:53 TP1] Connecting to tcp://127.0.0.1:51275
[2025-03-04 09:02:54 TP2] Load weight begin. avail mem=21.63 GB
[2025-03-04 09:02:54 TP0] Load weight begin. avail mem=21.63 GB
[2025-03-04 09:02:54 TP1] Load weight begin. avail mem=21.63 GB
[2025-03-04 09:02:54 TP3] Load weight begin. avail mem=21.63 GB
[2025-03-04 09:02:54 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 1816, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 240, in __init__
    self.tp_worker = TpWorkerClass(
                     ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
    self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/managers/tp_worker.py", line 68, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 194, in __init__
    self.load_model()
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 317, in load_model
    self.model = get_model(
                 ^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_loader/__init__.py", line 22, in get_model
    return loader.load_model(
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 357, in load_model
    model = _initialize_model(
            ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 136, in _initialize_model
    model_class, _ = get_model_architecture(model_config)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_loader/utils.py", line 37, in get_model_architecture
    return ModelRegistry.resolve_model_cls(architectures)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/models/registry.py", line 65, in resolve_model_cls
    return self._raise_for_unsupported(architectures)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/models/registry.py", line 32, in _raise_for_unsupported
    raise ValueError(
ValueError: Model architectures ['LlavaNextForConditionalGeneration'] are not supported for now. Supported architectures: dict_keys(['BaichuanForCausalLM', 'ChatGLMModel', 'CohereForCausalLM', 'Cohere2ForCausalLM', 'DbrxForCausalLM', 'DeepseekForCausalLM', 'DeepseekV2ForCausalLM', 'DeepseekV3ForCausalLM', 'ExaoneForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'Gemma2ForSequenceClassification', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GraniteForCausalLM', 'Grok1ForCausalLM', 'Grok1ModelForCausalLM', 'InternLM2ForCausalLM', 'InternLM2ForRewardModel', 'LlamaForCausalLM', 'Phi3ForCausalLM', 'InternLM3ForCausalLM', 'LlamaForClassification', 'LlamaForCausalLMEagle', 'LlamaEmbeddingModel', 'MistralModel', 'LlamaForSequenceClassification', 'LlamaForSequenceClassificationWithNormal_Weights', 'LlavaLlamaForCausalLM', 'LlavaQwenForCausalLM', 'LlavaMistralForCausalLM', 'LlavaVidForCausalLM', 'MiniCPMForCausalLM', 'MiniCPM3ForCausalLM', 'MiniCPMV', 'MistralForCausalLM', 'MixtralForCausalLM', 'QuantMixtralForCausalLM', 'MllamaForConditionalGeneration', 'OlmoForCausalLM', 'Olmo2ForCausalLM', 'OlmoeForCausalLM', 'Phi3SmallForCausalLM', 'QWenLMHeadModel', 'Qwen2ForCausalLM', 'Qwen2ForCausalLMEagle', 'Qwen2MoeForCausalLM', 'Qwen2VLForConditionalGeneration', 'StableLmForCausalLM', 'TorchNativeLlamaForCausalLM', 'TorchNativePhi3ForCausalLM', 'XverseForCausalLM', 'XverseMoeForCausalLM', 'YiVLForCausalLM'])
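
To check which architectures a given SGLang build registers, one can call the same resolve_model_cls that appears in the traceback above (a minimal sketch; the import path and method are taken from the traceback, so adjust for your installed version):

from sglang.srt.models.registry import ModelRegistry

# resolve_model_cls raises a ValueError listing all supported
# architectures when the requested one is not registered
try:
    ModelRegistry.resolve_model_cls(["LlavaNextForConditionalGeneration"])
    print("LlavaNextForConditionalGeneration is supported")
except ValueError as err:
    print(err)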

Reproduction

Start SGLang with the Granite vision model using the following command line:

python3.12 -m sglang.launch_server --model ibm-granite/granite-vision-3.2-2b --model-path ibm-granite/granite-vision-3.2-2b --port 30000 --host 0.0.0.0 --log-level debug --trust-remote-code --tensor-parallel-size 4 --enable-p2p-check --disable-cuda-graph

A similar command for the instruct model works fine:

python3.12 -m sglang.launch_server --model ibm-granite/granite-3.2-2b-instruct --model-path ibm-granite/granite-3.2-2b-instruct --port 30000 --host 0.0.0.0 --log-level debug --trust-remote-code --tensor-parallel-size 4 --enable-p2p-check --disable-cuda-graph

Environment

SGLang 0.4.3, containerized on Amazon Linux 2023 and running in an AWS ECS cluster.
