### Checklist

- [x] I searched related issues but found no solution.
- [x] The bug persists in the latest version.
- [x] Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
- [x] If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
- [x] Please use English. Otherwise, it will be closed.

### Describe the bug

We are seeing inconsistent CI job failures:

- https://github.com/sgl-project/sglang/actions/runs/23887510568/job/69803384226?pr=21917
- https://github.com/sgl-project/sglang/actions/runs/23611723130/job/68774232428?pr=20739
- https://github.com/sgl-project/sglang/actions/runs/23686664687/job/69007370788?pr=18582
- https://github.com/sgl-project/sglang/actions/runs/24013952407/job/70080527996?pr=21700
- https://github.com/sgl-project/sglang/actions/runs/24056522219/job/70170772487?pr=22214
- https://github.com/sgl-project/sglang/actions/runs/24417866008/job/71333251796?pr=20989
- https://github.com/sgl-project/sglang/actions/runs/24533868084/job/71726183565?pr=22994
- https://github.com/sgl-project/sglang/actions/runs/24776816081/job/72497880441?pr=22998
- https://github.com/sgl-project/sglang/actions/runs/25006445016/job/73409545672?pr=23850

All of them fail with the following error:

```
/pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:112: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion `NaN detected! verify: target model logits` failed.
```

which corresponds to https://github.com/sgl-project/sglang/blob/5cc246e095abc9bf35a316f5a955fc07663cc077/python/sglang/srt/speculative/eagle_worker.py#L759

Note: this issue was first discovered here: https://github.com/sgl-project/sglang/pull/19664/changes#diff-842a36702d0063dc2d3f6f14ef88dae596fdc2216ba6b4f1ef6c00e2513c82e8

TODO:

- [ ] Consistently reproduce this issue
- [ ] Perform root cause analysis and fix it

### Reproduction

Very hard to reproduce locally.

### Environment

CI environment
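
For the "consistently reproduce" TODO item, one possible starting point is to rerun a suspect test command many times and count how often the NaN assertion shows up in its output. The following is only a sketch under stated assumptions: `count_nan_failures` is a hypothetical helper, not part of sglang, and the command you pass in is whatever test invocation you suspect. It uses only the Python standard library and matches on the `NaN detected!` substring from the error message quoted above.

```python
import subprocess

# Substring of the assertion message quoted in this issue.
NAN_SIGNATURE = "NaN detected!"


def count_nan_failures(cmd, runs=5, timeout=None):
    """Run `cmd` (a list of argv strings) `runs` times.

    Returns the number of runs whose combined stdout/stderr contains
    the NaN assertion message. Runs that time out are not counted.
    """
    hits = 0
    for _ in range(runs):
        try:
            proc = subprocess.run(
                cmd,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,  # the assertion may be printed to stderr
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # A hung run is a different failure mode; skip it here.
            continue
        if NAN_SIGNATURE in proc.stdout:
            hits += 1
    return hits
```

Usage would look like `count_nan_failures(["python", "path/to/suspect_test.py"], runs=50)`, where the test path is a placeholder. A nonzero count on a fixed commit would at least give a failure rate to compare against when bisecting.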