Fix batch invariant ops #11368

Merged
Fridge003 merged 6 commits into main from
bhe/fix_batch_invariant_ops
Oct 11, 2025

Collaborator

@hebiao064 hebiao064 commented Oct 9, 2025

Motivation

Without this change, matmuls of certain shapes produce many NaN values, for example (32, 128) x (128, 1024).

Fix Persistent Matmul Kernel: Remove Incorrect tile_id_c Logic

Problem

The persistent matmul kernel had a critical bug: it computed one tile but derived the store location from a different tile ID, so results were written to the wrong memory locations. This produced gibberish output for our Grok 10B and v6 Mini models.

Root Cause

# Buggy: compute tile N, store to tile (N + NUM_SMS)
for tile_id in tl.range(start_pid, num_tiles, NUM_SMS):
    pid_m, pid_n = _compute_pid(tile_id, ...)      # Compute tile N
    # ... accumulate ...
    
    tile_id_c += NUM_SMS
    pid_m, pid_n = _compute_pid(tile_id_c, ...)    # ❌ Store to different tile!
    tl.store(c_ptrs, c, mask=c_mask)
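
The effect of the mismatch can be sketched with a pure-Python model of the loop. This is only an illustration under two assumptions that the PR does not confirm: `compute_pid` here is a hypothetical row-major stand-in for `_compute_pid`, and `tile_id_c` is assumed to start at `start_pid`, which is what the "N + NUM_SMS" comment above suggests.

```python
def compute_pid(tile_id, num_pid_n):
    """Hypothetical stand-in for _compute_pid: plain row-major tile order."""
    return tile_id // num_pid_n, tile_id % num_pid_n


def buggy_store_targets(num_pid_m, num_pid_n, num_sms):
    """Map each computed tile to the tile the buggy loop would store it to."""
    num_tiles = num_pid_m * num_pid_n
    targets = {}
    # One outer iteration per persistent program (one per SM).
    for start_pid in range(min(num_sms, num_tiles)):
        tile_id_c = start_pid  # assumption: the store counter starts at start_pid
        for tile_id in range(start_pid, num_tiles, num_sms):
            computed = compute_pid(tile_id, num_pid_n)
            tile_id_c += num_sms  # store index runs one stride ahead
            targets[computed] = compute_pid(tile_id_c, num_pid_n)
    return targets


targets = buggy_store_targets(num_pid_m=2, num_pid_n=4, num_sms=4)
all_tiles = {(m, n) for m in range(2) for n in range(4)}
never_written = sorted(all_tiles - set(targets.values()))
print(never_written)  # [(0, 0), (0, 1), (0, 2), (0, 3)]
```

In this model the first row of output tiles is never written, and every computed tile lands one stride away from where it belongs. The last stride targets tiles past the end of the grid, which a store mask would presumably discard. If the output buffer is allocated without initialization, unwritten tiles could hold arbitrary values, which would be consistent with the NaNs described in Motivation, though the PR does not state this.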

Solution

Use the same pid_m, pid_n for both computation and storage:

# Fixed: compute and store to same tile
for tile_id in tl.range(start_pid, num_tiles, NUM_SMS):
    pid_m, pid_n = _compute_pid(tile_id, ...)      # Compute tile N
    # ... accumulate ...
    # Reuse pid_m, pid_n for storage
    tl.store(c_ptrs, c, mask=c_mask)               # ✅ Store to same tile
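
As a sanity check on the fixed loop structure, the following pure-Python sketch counts how often each output tile is stored when the store reuses the `pid_m, pid_n` from the compute step. The row-major `compute_pid` is a hypothetical stand-in for `_compute_pid`, and the grid sizes are arbitrary values chosen for illustration.

```python
from collections import Counter


def compute_pid(tile_id, num_pid_n):
    """Hypothetical stand-in for _compute_pid: plain row-major tile order."""
    return tile_id // num_pid_n, tile_id % num_pid_n


def fixed_store_counts(num_pid_m, num_pid_n, num_sms):
    """Count stores per output tile when compute and store share pid_m, pid_n."""
    num_tiles = num_pid_m * num_pid_n
    counts = Counter()
    # One outer iteration per persistent program (one per SM).
    for start_pid in range(min(num_sms, num_tiles)):
        for tile_id in range(start_pid, num_tiles, num_sms):
            pid_m, pid_n = compute_pid(tile_id, num_pid_n)
            # ... accumulate the tile at (pid_m, pid_n) ...
            counts[(pid_m, pid_n)] += 1  # store goes to the tile just computed
    return counts


counts = fixed_store_counts(num_pid_m=3, num_pid_n=5, num_sms=4)
assert len(counts) == 3 * 5           # every tile is covered
assert set(counts.values()) == {1}    # and each is stored exactly once
```

Because the strided loop visits each `tile_id` in `[0, num_tiles)` exactly once across all programs, reusing the same indices for the store gives full coverage with no duplicates, even when the tile count is not a multiple of `num_sms`.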

Additional Changes

  • Added explicit bfloat16 and float32 dtype handling

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Comment thread test/srt/batch_invariant/test_batch_invariant_ops.py
Collaborator

@Fridge003 Fridge003 left a comment

LGTM

@Fridge003 Fridge003 merged commit eae9a9f into main Oct 11, 2025
175 of 200 checks passed
@Fridge003 Fridge003 deleted the bhe/fix_batch_invariant_ops branch October 11, 2025 03:49
lpc0220 pushed a commit to lpc0220/sglang that referenced this pull request Oct 29, 2025
3 participants