Enhance sglang engine dumping tests in dump comparator#19681
Enhance sglang engine dumping tests in dump comparator#19681fzyzcjy merged 608 commits intosgl-project:mainfrom
Conversation
Remove manual JSONL parsing, SummaryRecord assertions, and _find_summary helper. The comparator's own exit code (with --forbid-skip) already enforces zero failures and zero skips.
… e2e test Comma separates independent parallel axes, colon attaches a qualifier (like partial reduction) to an axis. The parser correctly rejects 'partial' as an unknown axis name.
Auto-dumped core fields (positions, etc.) lack dims metadata, so the comparator cannot unshard them across TP ranks and skips them. Instead of --forbid-skip (which treats any skip as failure), parse the JSON summary to assert failed==0 and passed>0.
Collect moe_dp_rank/size and attn_cp_rank/size in _SGLangPlugin.collect_parallel_info() using existing SGLang APIs that were previously not being dumped.
In dp_attn mode, dp_size > 1 but MLP tensors have data on all ranks. The existing dp filter would incorrectly drop valid ranks. This adds a `// dp:=<group_name>` syntax in dims strings that redirects dp filtering to use `<group>_rank`/`<group>_size` fields from metadata instead of the default `dp_rank`/`dp_size`. - dims.py: parse_dims() strips `//` section; new extract_dp_group_alias() - dp_utils.py: filter/extract accept dp_group_alias parameter - bundle_comparator.py: swap dims override before dp filter, pass alias
- test_dims.py: extract_dp_group_alias, parse_dims with //, resolve_dim_names with // - test_dp_utils.py: _extract_dp_info and filter with dp_group_alias param - test_entrypoint.py: E2E tests for dp alias noop, override-dims alias, real alias filtering
Assert that sglang_parallel_info in dump metadata contains the newly added moe_dp_rank/size and attn_cp_rank/size fields alongside the existing tp_rank/size.
Check every group (tp, pp, moe_ep, moe_tp, moe_dp, attn_tp, attn_dp, local_attn_dp, attn_cp, enable_dp_attention) rather than only the newly added ones.
The first two tests now model the realistic scenario where dp_size=2 with one empty rank (attention-style DP), while // dp:=moe_dp redirects dp filtering to use the alias group. This avoids the aligner failure that occurs when 2 replicated non-empty items reach it without sharding.
- test_dp_alias_absent_group_noop: single rank with dp_size=1, verifies // syntax doesn't break comparison - test_dp_alias_via_override_dims: uses moe_dp_rank/moe_dp_size fields so override-dims with // dp:=moe_dp triggers real alias-based filtering
parse_dims() now returns DimsSpec (dims + dp_group_alias) instead of list[DimSpec]. This removes the separate extract_dp_group_alias() public API and keeps dp group alias extraction as an internal detail of the parsing step.
# is more natural as an annotation/pragma marker and avoids ambiguity with URL fragments or division operators.
… control Instead of all-or-nothing --forbid-skip, --allow-skip-pattern accepts a regex: tensor names matching the pattern are allowed to skip (e.g. core auto-dump fields like positions/seq_lens that lack dims metadata at TP>1). Default '.*' allows all skips; '^$' forbids all (equivalent to old --forbid-skip).
Replace scattered print_record() calls with a centralized report_sink singleton that tees output to both stdout and an auto-generated JSONL report file. This eliminates output_format parameter threading through the call chain. - Add ReportSink class with configure/add/close lifecycle - Add --report-path and --no-report CLI arguments - Default report path: <target-path>/comparator_report.jsonl - Remove output_format from emit_display_records, _consume_comparison_records, WarningSink - Add TestReportOutput test class and conftest autouse fixture for isolation
- Report path now printed via report_sink.add(ReportPathRecord(...)) so it respects json/text output format - --no-report removed; pass --report-path '' to disable instead
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly refactors and enhances the SGLang engine's debug comparator. The core changes revolve around a new, more robust dimension specification system that improves how tensor dimensions are parsed and aligned, particularly for complex sharding and fusion scenarios. This also includes a migration to a more comprehensive logging system and more flexible data parallel filtering, ensuring the comparator can handle a wider range of debugging needs with greater precision and clarity. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
|
Warning Gemini encountered an error creating the review. You can try again by commenting |
Motivation
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label,/rerun-failed-ci,/tag-and-rerun-ci