
[Bug] test_eagle_infer_b fails inconsistently in CI #22096

@kpham-sgl

Description

Checklist

  • I searched related issues but found no solution.
  • The bug persists in the latest version.
  • Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
  • If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
  • Please use English. Otherwise, it will be closed.

Describe the bug

We are seeing intermittent CI job failures with the following error:

/pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:112: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion `NaN detected! verify: target model logits` failed.

which corresponds to

maybe_detect_nan(logits_output.next_token_logits, "verify: target model logits")
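
The assertion text pairs a fixed "NaN detected!" prefix with the label passed by the caller, so the label tells us which call site tripped. A minimal plain-Python sketch of that pattern is below. `detect_nan` is a hypothetical stand-in, not SGLang's `maybe_detect_nan`, which operates on GPU tensors and, judging by the `_assert_async_cuda_kernel` frame in the error, asserts on the device rather than on the host.

```python
import math


def detect_nan(values, label):
    """Raise if any value is NaN, tagging the error with the caller's label.

    Hypothetical helper for illustration only; it works on a plain list of
    floats, whereas the real check runs on a logits tensor.
    """
    if any(math.isnan(v) for v in values):
        raise AssertionError(f"NaN detected! {label}")


# Finite values pass silently.
detect_nan([0.1, -2.3, 4.0], "verify: target model logits")
```

Because the label is the only site-specific part of the message, a distinct label per call site is what makes a failure like the one above traceable to the verify step.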

Note: this issue was first discovered in https://github.com/sgl-project/sglang/pull/19664/changes#diff-842a36702d0063dc2d3f6f14ef88dae596fdc2216ba6b4f1ef6c00e2513c82e8

TODO:

  • Reproduce this issue consistently
  • Find the root cause and fix it

Reproduction

Very hard to reproduce locally.
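
One possible way to chase an intermittent failure like this is to rerun the test command in a loop and stop at the first non-zero exit. The script below is a rough sketch under that assumption. `run_until_failure` is a hypothetical helper, and the test command is deliberately left to the caller because the exact test path is not given here.

```python
import subprocess
import sys


def run_until_failure(cmd, max_runs=50):
    """Run `cmd` up to `max_runs` times.

    Returns (run_number, combined_output) for the first run that exits
    with a non-zero status, or None if every run succeeded.
    """
    for run_number in range(1, max_runs + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return run_number, result.stdout + result.stderr
    return None


if __name__ == "__main__" and len(sys.argv) > 1:
    # Everything after the script name is treated as the test command.
    hit = run_until_failure(sys.argv[1:])
    if hit is None:
        print("no failure observed")
    else:
        run_number, output = hit
        print(f"failed on run {run_number}")
        print(output)
```

Searching the captured output for "NaN detected!" would confirm that a failing run hit the same assertion. Setting `CUDA_LAUNCH_BLOCKING=1` in the environment makes CUDA kernel launches synchronous, which may help localize the failing operation, though it also changes timing and could hide the problem.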

Environment

CI environment

Labels

  • Good Pro Issue: Issues for experienced contributors; requires a solid understanding of SGLang internals.
  • speculative-decoding
