
fix AttributeError: 'LazyValue' object has no attribute 'keys' in eplb_manager.py for qwen3 moe #21822

Merged
Fridge003 merged 8 commits into sgl-project:main from Evgueni-Petrov-aka-espetrov:patch-2 on Apr 9, 2026


Evgueni-Petrov-aka-espetrov (Contributor) commented Apr 1, 2026

Motivation

Found this bug while tuning SGLang for Qwen3 Coder 480B in a disaggregated prefill-decode setup.

The exception is raised by eplb_manager.py during the first attempt to rebalance the experts:

  File "/sgl-workspace/sglang/python/sglang/srt/eplb/eplb_manager.py", line 110, in _compute_update_layer_ids_chunks
    list(self._model_runner.model.routed_experts_weights_of_layer.keys())
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'LazyValue' object has no attribute 'keys'
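A minimal reproduction of this failure mode, assuming (hypothetically) a wrapper that exposes the lazily built dict only through its `value` property and does not forward any other attribute lookups. `LazyValueNoDelegation` is an illustrative name, not SGLang's actual class:

```python
from typing import Callable


class LazyValueNoDelegation:
    """Hypothetical stand-in: the wrapped object is reachable only via .value."""

    def __init__(self, creator: Callable):
        self._creator = creator
        self._value = None

    @property
    def value(self):
        # Build the wrapped object on first access, then drop the creator.
        if self._creator is not None:
            self._value = self._creator()
            self._creator = None
        return self._value


weights = LazyValueNoDelegation(lambda: {0: "layer0", 1: "layer1"})

try:
    list(weights.keys())  # same call pattern as in eplb_manager.py
except AttributeError as e:
    print(e)  # 'LazyValueNoDelegation' object has no attribute 'keys'

print(list(weights.value.keys()))  # [0, 1]
```

The dict itself is fine; only the wrapper is missing `keys`, which is why either removing the wrapper or forwarding attribute access resolves the crash.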

Modifications

Get rid of the unnecessary LazyValue wrapper around routed_experts_weights_of_layer.

Accuracy Tests

n/a

Speed Tests and Profiling

n/a

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

gemini-code-assist bot (Contributor) left a comment:

Code Review

This pull request removes the LazyValue utility and its usage within the Qwen3 MoE model implementation. Specifically, the routed_experts_weights_of_layer attribute is now initialized directly via a dictionary comprehension in the load_weights method rather than being wrapped in a lazy-loading function. I have no feedback to provide as there were no review comments to evaluate.
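The eager variant described above can be sketched as a plain dictionary comprehension. `FakeMoELayer` and the `layers` mapping are hypothetical placeholders for the model's MoE layers, not SGLang types; the point is only that the result is an ordinary dict, so `.keys()` works:

```python
from typing import Dict, List


class FakeMoELayer:
    """Hypothetical placeholder for a MoE layer holding routed-expert weights."""

    def __init__(self, num_experts: int):
        self.expert_weights: List[str] = [f"expert{i}" for i in range(num_experts)]


# Hypothetical model: only some layer ids carry sparse (MoE) blocks.
layers: Dict[int, FakeMoELayer] = {1: FakeMoELayer(2), 3: FakeMoELayer(2)}

# Built eagerly (e.g. at the end of weight loading), so callers get a real dict.
routed_experts_weights_of_layer = {
    layer_id: layer.expert_weights for layer_id, layer in layers.items()
}

print(list(routed_experts_weights_of_layer.keys()))  # [1, 3]
```

The trade-off is that the dict is always built, even when EPLB is disabled, which is the cost the original lazy wrapper was meant to avoid.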

Evgueni-Petrov-aka-espetrov (Contributor, Author) commented:

Created a matching issue #21833

Evgueni-Petrov-aka-espetrov (Contributor, Author) commented:

I've checked that this bug was introduced in v0.5.9, in a change intended to speed up weight loading.

To keep creating this attribute only when necessary, we could instead patch LazyValue in utils as follows:

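from typing import Callable


class LazyValue:
    def __init__(self, creator: Callable):
        self._creator = creator
        self._value = None

    def __getattr__(self, name):  # fix qwen3 coder 480b eplb
        # Called only when normal attribute lookup fails (e.g. .keys()),
        # so the access is forwarded to the lazily created value.
        return getattr(self.value, name)

    def __getitem__(self, key):
        return self.value[key]

    def __setitem__(self, key, value):
        self.value[key] = value

    @property
    def value(self):
        # Create the wrapped value on first access, then drop the creator.
        if self._creator is not None:
            self._value = self._creator()
            self._creator = None
        return self._value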
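A quick check of the patched class (copied as proposed, plus a hypothetical call counter `calls` and creator `build_weights`) confirms the two properties the patch aims to preserve: the dict is built only on first use, and `keys()` is forwarded to it:

```python
from typing import Callable


class LazyValue:
    # Same shape as the patch proposed above.
    def __init__(self, creator: Callable):
        self._creator = creator
        self._value = None

    def __getattr__(self, name):
        # Only reached when normal lookup fails, e.g. for .keys().
        return getattr(self.value, name)

    def __getitem__(self, key):
        return self.value[key]

    def __setitem__(self, key, value):
        self.value[key] = value

    @property
    def value(self):
        if self._creator is not None:
            self._value = self._creator()
            self._creator = None
        return self._value


calls = []  # records each time the creator runs


def build_weights():  # hypothetical creator
    calls.append(1)
    return {0: "w0", 1: "w1"}


lazy = LazyValue(build_weights)
assert calls == []  # nothing has been built yet
print(list(lazy.keys()))  # [0, 1]
print(lazy[1])  # w1
print(len(calls))  # 1: the creator ran exactly once
```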

Comment thread on python/sglang/srt/models/qwen3_moe.py (outdated):
)
}
)
self.routed_experts_weights_of_layer = {
Collaborator comment:

Shouldn't the correct fix be to implement a `__getitem__` method on the LazyValue class?
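For context on `__getitem__` versus `__getattr__`: in Python, `obj[key]` looks up `__getitem__` on the type, so `__getattr__` alone does not make a wrapper subscriptable, while `obj.keys()` is ordinary attribute access that `__getitem__` does not help with. The two wrapper classes below are hypothetical, used only to show both halves:

```python
class OnlyGetattr:
    """Hypothetical wrapper that forwards missing attributes to a dict."""

    def __init__(self, value):
        self._value = value

    def __getattr__(self, name):
        return getattr(self._value, name)


class OnlyGetitem:
    """Hypothetical wrapper that forwards subscription to a dict."""

    def __init__(self, value):
        self._value = value

    def __getitem__(self, key):
        return self._value[key]


a = OnlyGetattr({0: "w0"})
print(list(a.keys()))  # [0]: attribute access falls through to __getattr__
try:
    a[0]  # implicit special-method lookup bypasses __getattr__
except TypeError as e:
    print(e)  # 'OnlyGetattr' object is not subscriptable

b = OnlyGetitem({0: "w0"})
print(b[0])  # w0
print(hasattr(b, "keys"))  # False: .keys() would still raise AttributeError
```

This is why the proposed patch keeps both methods: `__getattr__` for the `.keys()` call in eplb_manager.py and `__getitem__`/`__setitem__` for indexing.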

    def __getattr__(self, name):
        return getattr(self.value, name)

    def __getitem__(self, key):
Collaborator comment:

Need to check whether `self.value` itself supports `__getitem__`: `self.value` might be something other than a dict.

Evgueni-Petrov-aka-espetrov (Contributor, Author) replied Apr 4, 2026:

If the value is something other than a dict or a list, `LazyValue.__getitem__` raises exactly the same exception as the value itself would raise.

I think this is OK, or am I missing something?

from typing import Callable

class LazyValue:
    def __init__(self, creator: Callable):
        self._creator = creator
        self._value = None

    def __getattr__(self, name):
        return getattr(self.value, name)

    def __getitem__(self, key):
        return self.value[key]

    def __setitem__(self, key, value):
        self.value[key] = value

    @property
    def value(self):
        if self._creator is not None:
            self._value = self._creator()
            self._creator = None
        return self._value


d = {10: 'ten'}
lvd = LazyValue(lambda: d)
print(lvd.keys())  # dict_keys([10])


d = 1
lvd = LazyValue(lambda: d)

try:
    print(lvd.keys())
except Exception as e:
    print(f'{e}')  # 'int' object has no attribute 'keys'

try:
    print(lvd[0])
except Exception as e:
    print(f'{e}')  # 'int' object is not subscriptable

try:
    print(d[0])
except Exception as e:
    print(f'{e}')  # 'int' object is not subscriptable

Fridge003 (Collaborator) commented:

/tag-and-rerun-ci

github-actions bot added the run-ci label Apr 6, 2026
vroomfondel added a commit to vroomfondel/dgxarley that referenced this pull request Apr 7, 2026
**Upstream status** as of 2026-04-06:
- Qwen3.5: fixed via [PR #19767](sgl-project/sglang#19767) (merged 2026-03-09, included in v0.5.10)
- Qwen3: [PR #21461](sgl-project/sglang#21461) — closed without merge 2026-03-30 (CI failure), superseded by #21822
- Qwen3: [PR #21822](sgl-project/sglang#21822) — new fix opened 2026-03-26, addresses `AttributeError: 'LazyValue' object has no attribute 'keys'` in `eplb_manager.py` for Qwen3 MoE. Code review 2026-04-04 by `Fridge003` and `Evgueni-Petrov-aka-espetrov`. Alternative `LazyValue.__getattr__` approach proposed (avoids modifying the model class). **Approved** by `Fridge003` on 2026-04-06, CI rerun triggered — awaiting merge. (Duplicate [PR #21820](sgl-project/sglang#21820) was closed same day in favour of #21822.) Not in v0.5.10

When `--enable-eplb` is active with EP, the `EPLBManager` crashes after its first rebalance interval (default: 1000 forward passes):
- SGLang PR #17137 — non-Marlin WNA16MoE port (does not fix EP bug)
- SGLang #14158 — update_weights_from_tensor for WNA16MoE (unrelated)
- SGLang [PR #13715](sgl-project/sglang#13715) — fix EPLB + FP4 weight tensor filtering (merged, different issue)
- SGLang [PR #20963](sgl-project/sglang#20963) — Nvidia modelopt refactoring (1/N). Under active review: reviewer `Edwardf0t1` asked for end-to-end verification 2026-03-31, author `wenscarl` responded 2026-04-01 and posted 3 further inline review responses 2026-04-06. Not stalled but awaiting approval. Migrates the NVFP4 code as-is — expected vehicle for EP-awareness fixes (#20869, #21630). Watch this PR for resolution of the NVFP4 input_scale and CutlassMoEParams bugs
- SGLang [PR #21822](sgl-project/sglang#21822) — new EPLB/Qwen3 fix (opened 2026-03-26). Addresses `LazyValue.keys()` AttributeError. Code review 2026-04-04 by `Fridge003` and `Evgueni-Petrov-aka-espetrov`. Alternative `LazyValue.__getattr__` approach proposed. **Approved** by `Fridge003` on 2026-04-06, CI rerun triggered — awaiting merge
Evgueni-Petrov-aka-espetrov (Contributor, Author) commented:

Looks like all the failures are unrelated to this PR?

@Fridge003

stage-b-test-1-gpu-large-amd (linux-mi325-1gpu-sglang, 1)

  • timeout 30m

stage-c-test-large-8-gpu-amd-mi35x (linux-mi35x-gpu-8, 0)

  File "/sgl-workspace/aiter/aiter/jit/core.py", line 971, in wrapper
    return op(*args, **kwargs)
RuntimeError: invalid argument for batch_prefill
Exception raised from mha_batch_prefill at /sgl-workspace/aiter/csrc/py_itfs_ck/mha_batch_prefill_kernels.cu:820 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9c (0x7fd6d3b9233c in /opt/venv/lib/python3.10/site-packages/torch/lib/libc10.so)

stage-c-test-large-8-gpu-amd-mi35x (linux-mi35x-gpu-8, 1)

  File "/sglang-checkout/python/sglang/srt/models/deepseek_common/deepseek_weight_loader.py", line 361, in do_load_weights
    future.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/sglang-checkout/python/sglang/srt/layers/linear.py", line 269, in weight_loader
    assert param.size() == loaded_weight.size()
AssertionError

stage-b-test-1-gpu-large (11)

Model google/gemma-3-4b-it achieved accuracy: 0.3756
[METRIC] mmmu_score=0.37556 labels={"model": "google/gemma-3-4b-it", "eval": "mmmu", "api": "lmms-eval"}
Error testing google/gemma-3-4b-it: 0.37556 not greater than or equal to 0.38 : Model google/gemma-3-4b-it accuracy (0.3756) below expected threshold (0.3800)

stage-b-test-4-gpu-b200

Error: Fast-fail: skipping — root cause job(s): wait-for-stage-b, stage-b-test-1-gpu-large (11)

Fridge003 merged commit b9c3169 into sgl-project:main Apr 9, 2026
197 of 219 checks passed
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026